diff --git a/.travis.yml b/.travis.yml index cb7c2ea..82fc349 100644 --- a/.travis.yml +++ b/.travis.yml @@ -130,17 +130,17 @@ script: cd .. fi -deploy: - provider: pypi - user: searchivarius - # https://docs.travis-ci.com/user/environment-variables/#defining-variables-in-repository-settings - password: ${PYPI_PASSWORD} - distributions: sdist bdist_wheel - skip_existing: true - skip_cleanup: true - on: - tags: true - branch: master +#deploy: +# provider: pypi +# user: searchivarius +# # https://docs.travis-ci.com/user/environment-variables/#defining-variables-in-repository-settings +# password: ${PYPI_PASSWORD} +# distributions: sdist bdist_wheel +# skip_existing: true +# skip_cleanup: true +# on: +# tags: true +# branch: master cache: - apt diff --git a/manual/latex/benchmarking.tex b/manual/latex/benchmarking.tex deleted file mode 100644 index 7283d98..0000000 --- a/manual/latex/benchmarking.tex +++ /dev/null @@ -1,599 +0,0 @@ -\section{Benchmarking on Linux/Mac}\label{SectionLinuxBenchmark} - -\subsection{Quick Start on Linux/Mac}\label{QuickStartLinux} -To build the project, go to the directory \href{\replocdir similarity_search}{similarity\_search} and type: -\begin{verbatim} -cmake . -make -\end{verbatim} -This creates several binaries in the directory \ttt{similarity\_search/release}, -most importantly, -a benchmarking utility \ttt{experiment}, which carries out experiments, -and testing utilities \ttt{bunit}, \ttt{test\_integr}, and \ttt{bench\_distfunc}. -A more detailed description of the build process on Linux is given in \S~\ref{SectionBuildLinux}. - -\subsection{Building on Linux/Mac (more details)}\label{SectionBuildLinux} - -The build process creates several important binaries, which include: - -\begin{itemize} -\item The NMSLIB library (on Linux \ttt{libNonMetricSpaceLib.a}), -which can be used in external applications; -\item The main benchmarking utility \ttt{experiment} (\ttt{experiment.exe} on Windows) -that carries out experiments and saves evaluation results; -\item A semi-unit test utility \ttt{bunit} (\ttt{bunit.exe} on Windows); -\item A utility \ttt{test\_integr} that carries out integration tests (\ttt{test\_integr.exe} on Windows); -\item A utility \ttt{bench\_distfunc} that benchmarks distance functions (\ttt{bench\_distfunc.exe} on Windows); -\end{itemize} - -The implementation of similarity search methods is in the directory \ttt{similarity\_search}. -The code is built using \ttt{cmake}, which works on top of GNU \ttt{make}. -Before creating the makefiles, we need to ensure that the right compiler is used. -This is done by setting two environment variables: \ttt{CXX} and \ttt{CC}. -In the case of GNU C++ (version 4.7), you may need to type: -\begin{verbatim} -export CXX=g++-4.7 CC=gcc-4.7 -\end{verbatim} -If you do not set the variables \ttt{CXX} and \ttt{CC}, -the default C++ compiler is used (which can be fine if it is already the right compiler). - - -To create makefiles for a release version of the code, type: -\begin{verbatim} -cmake -DCMAKE_BUILD_TYPE=Release . -\end{verbatim} -If you did not create any makefiles before, you can use a shortcut by typing: -\begin{verbatim} -cmake . -\end{verbatim} -To create makefiles for a debug version of the code, type: -\begin{verbatim} -cmake -DCMAKE_BUILD_TYPE=Debug . -\end{verbatim} -When the makefiles are created, just type: -\begin{verbatim} -make -\end{verbatim} -If \ttt{cmake} complains about a wrong version of GCC, -it is most likely that you forgot to set the environment variables \ttt{CXX} and \ttt{CC} (as described above). -If this is the case, make these variables point to the correct version of the compiler.
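As an illustration, a clean Release build with an explicitly chosen GCC~4.7 toolchain might look as follows (the compiler version is only an example; adjust it to whatever is installed on your system):
\begin{verbatim}
cd similarity_search
export CXX=g++-4.7 CC=gcc-4.7
cmake -DCMAKE_BUILD_TYPE=Release .
make
# binaries such as ./release/experiment are now available
\end{verbatim}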
-\textbf{Important note:} -do not forget to delete the \ttt{cmake} cache and make file, before recreating the makefiles. -For example, you can do the following (assuming the current directory is \ttt{similarity\_search}): -\begin{verbatim} -rm -rf `find . -name CMakeFiles` CMakeCache.txt -\end{verbatim} - -Also note that, for some reason, \ttt{cmake} might sometimes ignore environmental variables \ttt{CXX} and \ttt{CC}. -In this unlikely case, you can specify the compiler directly through \ttt{cmake} arguments. -For example, in the case of the GNU C++ and the \ttt{Release} build, -this can be done as follows: -\begin{verbatim} -cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=g++-4.7 \ --DCMAKE_GCC_COMPILER=gcc-4.7 CMAKE_CC_COMPILER=gcc-4.7 . -\end{verbatim} - -The build process creates several binaries. Most importantly, -the main benchmarking utility \ttt{experiment}. -The directory \ttt{similarity\_search/release} contains release versions of -these binaries. Debug versions are placed into the folder \ttt{similarity\_search/debug}. - -\textbf{Important note:} a shortcut command: -\begin{verbatim} -cmake . -\end{verbatim} -(re)-creates makefiles for the previously -created build. When you type \ttt{cmake .} for the first time, -it creates release makefiles. However, if you create debug -makefiles and then type \ttt{cmake .}, -this will not lead to creation of release makefiles! - -If the user cannot install necessary libraries to a standard location, -it is still possible to build a project. -First, download Boost to some local directory. -Assume it is \ttt{\$HOME/boost\_download\_dir}. -Then, set the corresponding environment variable, which will inform \ttt{cmake} -about the location of the Boost files: -\begin{verbatim} - export BOOST_ROOT=$HOME/boost_download_dir -\end{verbatim} - -Second, the user needs to install the additional libraries. Assume that the -lib-files are installed to \ttt{\$HOME/local\_lib}, while corresponding include -files are installed to \ttt{\$HOME/local\_include}. Then, the user -needs to invoke \ttt{cmake} with the following arguments (after possibly deleting previously -created cache and makefiles): -\begin{verbatim} -cmake . -DCMAKE_LIBRARY_PATH=$HOME/local_lib \ - -DCMAKE_INCLUDE_PATH=$HOME/local_include \ - -DBoost_NO_SYSTEM_PATHS=true -\end{verbatim} -Note the last option. Sometimes, an old version of Boost is installed. -Setting the variable \ttt{Boost\_NO\_SYSTEM\_PATHS} to true, -tells \ttt{cmake} to ignore such an installation. - -To use the library in external applications, which do not belong to the library repository, -one needs to install the library first. -Assume that an installation location is the folder \ttt{NonMetrLibRelease} in the -home directory. Then, the following commands do the trick: -\begin{verbatim} -cmake \ - -DCMAKE_INSTALL_PREFIX=$HOME/NonMetrLibRelease \ - -DCMAKE_BUILD_TYPE=Release . -make install -\end{verbatim} - -A directory \href{\replocfile sample_standalone_app}{sample\_standalone\_app} -contains two sample programs (see files -\href{\replocfile sample_standalone_app/sample_standalone_app1.cc}{sample\_standalone\_app1.cc} -and -\href{\replocfile sample_standalone_app/sample_standalone_app2.cc}{sample\_standalone\_app2.cc}) -that use NMSLIB binaries installed in the folder \ttt{\$HOME/NonMetrLibRelease}. - -\subsection{Query Server (Linux/Mac)}\label{SectionQueryServer} -\todonoteinline{What about saving/storing data here?} - -The query server requires Apache Thrift. 
We used Apache Thrift 0.9.2, but newer versions will likely work as well. -To install Apache Thrift, you need to \href{https://thrift.apache.org/docs/BuildingFromSource}{build it from source}. -This usually requires additional libraries, but we do not cover their installation. - -After Apache Thrift is installed, you need to build the library itself. Then, change the directory -to \href{\replocfile query_server/cpp_client_server}{query\_server/cpp\_client\_server} and type \ttt{make} (the makefile -may need to be modified if Apache Thrift is installed to a non-standard location). -The query server has a similar set of parameters to the benchmarking utility \ttt{experiment}. For example, -you can start the server as follows: -\begin{verbatim} -./query_server -i ../../sample_data/final8_10K.txt -s l2 -p 10000 \ - -m sw-graph -c NN=10,efConstruction=200,initIndexAttempts=1 -\end{verbatim} - -There are also three sample clients implemented in \href{\replocfile query_server/cpp_client_server}{C++}, \href{\replocfile query_server/python_client/}{Python}, -and \href{\replocfile query_server/java_client/}{Java}. -A client reads a string representation of the query object from the standard stream. -The format is the same as the format of objects in a data file. -Here is an example of searching for the ten vectors closest to the first data set vector (stored in row one) of a provided sample data file: -\begin{verbatim} -export DATA_FILE=../../sample_data/final8_10K.txt -head -1 $DATA_FILE | ./query_client -p 10000 -a localhost -k 10 -\end{verbatim} - -It is also possible to generate client classes for other languages supported by Thrift from -\href{\replocfile query_server/protocol.thrift}{the interface definition file}, e.g., for C\#. To this end, one should invoke the Thrift compiler as follows: -\begin{verbatim} -thrift --gen csharp protocol.thrift -\end{verbatim} -For instructions on using the generated code, please consult the \href{https://thrift.apache.org/tutorial/}{Apache Thrift tutorial}. - -\subsection{Testing the Correctness of Implementations} -We have two main testing utilities, \ttt{bunit} and \ttt{test\_integr} (\ttt{bunit.exe} and -\ttt{test\_integr.exe} on Windows). -Both utilities accept a single optional argument: the name of the log file. -If the log file is not specified, a lot of informational messages are printed to the screen. - -The utility \ttt{bunit} verifies some basic functionality akin to unit testing. -In particular, it checks that an optimized version of, e.g., the Euclidean distance -returns results that are very similar to those returned by an unoptimized, simpler version. -The utility \ttt{bunit} is expected to always run without errors. - -The utility \ttt{test\_integr} runs complete implementations of many methods -and checks if several effectiveness and efficiency characteristics -meet the expectations. -The expectations are encoded as an array of instances of the class \ttt{MethodTestCase} -(see \href{\replocdir similarity_search/test/test_integr.cc#L65}{the code here}). -For example, we expect that the recall (see \S~\ref{SectionEffect}) -falls within a certain pre-recorded range. -Because almost all our methods are randomized, there is a great deal of variance -in the observed performance characteristics. Thus, some tests -may fail infrequently, if, e.g., the actual recall value is slightly lower or higher -than an expected minimum or maximum.
-From an error message, it should be clear if the discrepancy is substantial, i.e., -something went wrong, or not, i.e., we observe an unlikely outcome due to randomization. -The exact search method, however, should always have an almost perfect recall. - -Variance is partly due to using low-dimensional test sets. In the future, we plan to change this. -For high-dimensional data sets, the outcomes are much more stable despite the randomized nature -of most implementations. - - -\subsection{Running Benchmarks}\label{SectionRunBenchmark} -There are no principal differences in benchmarking on Linux and Windows. -There is a \emph{single} benchmarking utility -\ttt{experiment} (\ttt{experiment.exe} on Windows) that includes implementation of all methods. -It has multiple options, which specify, among others, -a space, a data set, a type of search, a method to test, and method parameters. -These options and their use cases are described in the following subsections. -Note that unlike older versions it is not possible to test multiple methods in a single run. -However, it is possible to test the single method with different values of query-time parameters. - -\subsubsection{Space and distance value type} - -A distance function can return an integer (\ttt{int}), a single-precision (\ttt{float}), -or a double-precision (\ttt{double}) real value. -A type of the distance and its value is specified as follows: - -\begin{verbatim} - -s [ --spaceType ] arg space type, e.g., l1, l2, lp:p=0.25 - --distType arg (=float) distance value type: - int, float, double -\end{verbatim} - -A description of a space may contain parameters (parameters may not contain whitespaces). -In this case, there is colon after the space mnemonic name followed by a -comma-separated (not spaces) list of parameters in the format: -\ttt{=}. -Currently, this is used only for $L_p$ spaces. For instance, - \ttt{lp:0.5} denotes the space $L_{0.5}$. -A detailed list of possible spaces and respective -distance functions is given in Table~\ref{TableSpaces} in \S~\ref{SectionSpaces}. - -For real-valued distance functions, one can use either single- or double-precision -type. Single-precision is a recommended default. -One example of integer-valued distance function the Levenshtein distance. - -\subsubsection{Input Data/Test Set} -There are two options that define the data to be indexed: -\begin{verbatim} - -i [ --dataFile ] arg input data file - -D [ --maxNumData ] arg (=0) if non-zero, only the first - maxNumData elements are used -\end{verbatim} -The input file can be indexed either completely, or partially. -In the latter case, the user can create the index using only -the first \ttt{--maxNumData} elements. - -For testing, the user can use a separate query set. -It is, again, possible to limit the number of queries: -\begin{verbatim} - -q [ --queryFile ] arg query file - -Q [ --maxNumQuery ] arg (=0) if non-zero, use maxNumQuery query - elements(required in the case - of bootstrapping) -\end{verbatim} -If a separate query set is not available, it can be simulated by bootstrapping. -To this, end the \ttt{--maxNumData} elements of the original data set -are randomly divided into testing and indexable sets. -The number of queries in this case is defined by the option \ttt{--maxNumQuery}. -A number of bootstrap iterations is specified through an option: -\begin{verbatim} - -b [ --testSetQty ] arg (=0) # of sets created by bootstrapping; -\end{verbatim} -Benchmarking can be carried out in either a single- or a multi-threaded -mode. 
The number of test threads are specified as follows: -\begin{verbatim} - --threadTestQty arg (=1) # of threads -\end{verbatim} - -\subsubsection{Query Type} -Our framework supports the \knn and the range search. -The user can request to run both types of queries: -\begin{verbatim} - -k [ --knn ] arg comma-separated values of k - for the k-NN search - -r [ --range ] arg comma-separated radii for range search -\end{verbatim} -For example, by specifying the options -\begin{verbatim} ---knn 1,10 --range 0.01,0.1,1 -\end{verbatim} -the user requests to run queries of five different types: $1$-NN, $10$-NN, -as well three range queries with radii 0.01, 0.1, and 1. - -\subsubsection{Method Specification} -Unlike older versions it is possible to test only a single method at a time. -To specify a method's name, use the following option: -\begin{verbatim} - -m [ --method ] arg method/index name -\end{verbatim} -A method can have a single set of index-time parameters, which is specified -via: -\begin{verbatim} --c [ --createIndex ] arg index-time method(s) parameters -\end{verbatim} -In addition to the set of index-time parameters, -the method can have multiple sets of query-time parameters, which are specified -using the following (possibly repeating) option: -\begin{verbatim} --t [ --queryTimeParams ] arg query-time method(s) parameters -\end{verbatim} -For each set of query-time parameters, i.e., for each occurrence of the option \ttt{--queryTimeParams}, -the benchmarking utility \ttt{experiment}, -carries out an evaluation using the specified set of queries and a query type (e.g., a 10-NN search with -queries from a specified file). -If the user does not specify any query-time parameters, there is only one evaluation to be carried out. -This evaluation uses default query-time parameters. -In general, we ensure that \emph{whenever a query-time parameter is missed, the default value is used}. - -Similar to parameters of the spaces, -a set of method's parameters is a comma-separated list (no-spaces) -of parameter-value pairs in the format: -\ttt{=}. -For a detailed list of methods and their parameters, please, refer to \S~\ref{SectionMethods}. - -Note that a few methods can save/restore (meta) indices. To save and load indices -one should use the following options: -\begin{verbatim} - -L [ --loadIndex ] arg a location to load the index from - -S [ --saveIndex ] arg a location to save the index to -\end{verbatim} -When the user defines the location of the index using the option \ttt{--loadIndex}, -the index-time parameters may be ignored. -Specifically, if the specified index does not exist, the index is created from scratch. -Otherwise, the index is loaded from disk. -Also note that the benchmarking utility \emph{does not override an already existing index} (when the option \ttt{--saveIndex} is present). - -If the tests are run the bootstrapping mode, i.e., when queries are randomly sampled (without replacement) from the -data set, several indices may need to be created. Specifically, for each split we create a separate index file. -The identifier of the split is indicated using a special suffix. -Also note that we need to memorize which data points in the split were used as queries. -This information is saved in a gold standard cache file (see \S~\ref{SectionEfficiency}). -Thus, saving and loading of indices in the bootstrapping mode is possible only if -gold standard caching is used. 
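Putting these options together, a hypothetical run that bootstraps three test sets of 100 queries each, caches the gold standard data, builds and saves an SW-graph index, and evaluates two sets of query-time parameters for 1-NN and 10-NN searches could look as follows (all file locations, as well as the query-time parameter \ttt{efSearch}, are purely illustrative; see \S~\ref{SectionMethods} for the actual parameters of each method):
\begin{verbatim}
./experiment -s l2 -i sample_data/final8_10K.txt \
    -b 3 -Q 100 -g gs_cache/final8 \
    -m sw-graph -c NN=10,efConstruction=200,initIndexAttempts=1 \
    -t efSearch=10 -t efSearch=100 \
    -k 1,10 \
    -S index_cache/final8_sw_graph \
    -o results/final8_sw_graph -a
\end{verbatim}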
- -\subsubsection{Saving and Processing Benchmark Results} -The benchmarking utility may produce output of three types: -\begin{itemize} -\item Benchmarking results (a human readable report and a tab-separated data file); -\item Log output (which can be redirected to a file); -\item Progress bars to indicate the progress in index creation for some methods (cannot be currently suppressed); -\end{itemize} - -To save benchmarking results to a file, on needs to specify a parameter: -\begin{verbatim} - -o [ --outFilePrefix ] arg output file prefix -\end{verbatim} -As noted above, we create two files: a human-readable report (suffix \ttt{.rep}) and -a tab-separated data file (suffix \ttt{.data}). -By default, the benchmarking utility creates files from scratch: -If a previously created report exists, it is erased. -The following option can be used to append results to the previously created report: -\begin{verbatim} - -a [ --appendToResFile ] do not override information in results -\end{verbatim} -For information on processing and interpreting results see \S~\ref{SectionMeasurePerf}. -A description of the plotting utility is given in \S~\ref{SectionGenPlot}. - -By default, all log messages are printed to the standard error stream. -However, they can also be redirected to a log-file: -\begin{verbatim} - -l [ --logFile ] arg log file -\end{verbatim} - -\subsubsection{Efficiency of Testing}\label{SectionBenchEfficiency} -Except for measuring methods' performance, -the following are the most expensive operations: -\begin{itemize} -\item computing ground truth answers (also -known as \emph{gold standard} data); -\item loading the data set; -\item indexing. -\end{itemize} -To make testing faster, the following methods can be used: -\begin{itemize} -\item Caching of gold standard data; -\item Creating gold standard data using multiple threads; -\item Reusing previously created indices (when loading and saving is supported by a method); -\item Carrying out multiple tests using different sets of query-time parameters. -\end{itemize} - -By default, we recompute gold standard data every time we run benchmarks, which may take long time. -However, it is possible to save gold standard data and re-use it later -by specifying an additional argument: -\begin{verbatim} --g [ --cachePrefixGS ] arg a prefix of gold standard cache files -\end{verbatim} -The benchmarks can be run in a multi-threaded mode by specifying a parameter: -\begin{verbatim} ---threadTestQty arg (=1) # of threads during querying -\end{verbatim} -In this case, the gold standard data is also created in a multi-threaded mode (which can also be much faster). -Note that NMSLIB directly supports only an inter-query parallelism, i.e., multiple queries are executed in parallel, -rather than the intra-query parallelism, where a single query can be processed by multiple CPU cores. - -Gold standard data is stored in two files. One is a textual meta file -that memorizes important input parameters such as the name of the data and/or query file, -the number of test queries, etc. -For each query, the binary cache files contains ids of answers (as well as distances and -class labels). -When queries are created by random sampling from the main data set, -we memorize which objects belong to each query set. - -When the gold standard data is reused later, -the benchmarking code verifies if input parameters match the content of the cache file. 
-Thus, we can prevent an accidental use of gold standard data created for one data set -while testing with a different data set. - -Another sanity check involves verifying that data points obtained via an approximate -search are not closer to the query than data points obtained by an exact search. -This check has turned out to be quite useful. -It has helped detecting a problem in at least the following two cases: -\begin{itemize} -\item The user creates a gold standard cache. Then, the user modifies a distance function and runs the tests again; -\item Due to a bug, the search method reports distances to the query that are smaller than -the actual distances (computed using a distance function). This may occur, e.g., due to -a memory corruption. -\end{itemize} - -When the benchmarking utility detects a situation when an approximate method -returns points closer than points returned by an exact method, the testing procedure is terminated -and the user sees a diagnostic message (see Table~\ref{TableFailGSCheck} for an example). - -\begin{table} -\scriptsize -\begin{verbatim} -... [INFO] >>>> Computing effectiveness metrics for sw-graph -... [INFO] Ex: -2.097 id = 140154 -> Apr: -2.111 id = 140154 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.974 id = 113850 -> Apr: -2.005 id = 113850 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.883 id = 102001 -> Apr: -1.898 id = 102001 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.6667 id = 58445 -> Apr: -1.6782 id = 58445 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.6547 id = 76888 -> Apr: -1.6688 id = 76888 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.5805 id = 47669 -> Apr: -1.5947 id = 47669 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.5201 id = 65783 -> Apr: -1.4998 id = 14954 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.4688 id = 14954 -> Apr: -1.3946 id = 25564 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.454 id = 90204 -> Apr: -1.3785 id = 120613 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.3804 id = 25564 -> Apr: -1.3190 id = 22051 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.367 id = 120613 -> Apr: -1.205 id = 101722 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.318 id = 71704 -> Apr: -1.1661 id = 136738 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.3103 id = 22051 -> Apr: -1.1039 id = 52950 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.191 id = 101722 -> Apr: -1.0926 id = 16190 1 - ratio: -0.0202 diff: -0.0221 -... [INFO] Ex: -1.157 id = 136738 -> Apr: -1.0348 id = 13878 1 - ratio: -0.0202 diff: -0.0221 -... [FATAL] bug: the approximate query should not return objects that are closer to the query - than object returned by (exact) sequential searching! - Approx: -1.09269 id = 16190 Exact: -1.09247 id = 52950 -\end{verbatim} -\caption{\label{TableFailGSCheck} A diagnostic message indicating that -there is some mismatch in the current experimental setup and the setup used to create -the gold standard cache file.} -\end{table} - -To compute recall, it is enough to memorize only query answers. -For example, in the case of a 10-NN search, it is enough to memorize only 10 data points closest -to the query as well as respective distances (to the query). -However, this is not sufficient for computation of a rank approximation metric (see \S~\ref{SectionEffect}). -To control the accuracy of computation, we permit the user to change the number of entries memorized. 
-This number is defined by the parameter: -\begin{verbatim} ---maxCacheGSRelativeQty arg (=10) a maximum number of gold - standard entries -\end{verbatim} -Note that this parameter is a coefficient: the actual number of entries is defined relative to -the result set size. For example, if a range search returns 30 entries and the value of -\ttt{--maxCacheGSRelativeQty} is 10, then $30 \times 10=300$ entries are saved in the gold standard -cache file. - -\subsection{Measuring Performance and Interpreting Results}\label{SectionMeasurePerf} -\subsubsection{Efficiency.} We measure several efficiency metrics: query runtime, the number of distance computations, -the amount of memory used by the index \emph{and} the data, and the time to create the index. -We also measure the improvement in runtime (improvement in efficiency) -with respect to a sequential search (i.e., brute-force) approach -as well as an improvement in the number of distance computations. -If the user runs benchmarks in a multi-threaded mode (by specifying the option \ttt{--threadTestQty}), -we compare against the multi-threaded version of the brute-force search as well. - -A good method should -carry out fewer distance computations and be faster than the brute-force search, -which compares \emph{all the objects} directly with the query. -However, great reduction in the number of distance computations -does not always entail good improvement in efficiency: -while we do not spend CPU time on computing directly, -we may be spending CPU time on, e.g., computing the value of the distance -in the projected space (as -in the case of projection-based methods, see \S~\ref{SectionProjMethod}). - -Note that the improvement in efficiency is adjusted for the number of threads. -Therefore, it does not increase as more threads are added: in contrast, -it typically decreases as there is more competition for computing resources (e.g., memory) -in a multi-threaded mode. - -The amount of memory consumed by a search method is measured indirectly: -We record the overall memory usage of a benchmarking process before and after creation of the index. Then, we add the amount of memory used by the data. -On Linux, we query a special file \ttt{/dev//status}, -which might not work for all Linux distributives. -Under Windows, we retrieve the working set size using the function -\ttt{GetProcessMemoryInfo}. -Note that we do not have a truly portable code to measure memory consumption of a process. - -\subsubsection{Effectiveness} -\label{SectionEffect} - -In the following description, we assume that a method returns -a set of points/objects $\{o_i\}$. -The value of $\pos(o_i)$ represents a positional distance from $o_i$ to the query, -i.e., the number of database objects closer to the query than $o_i$ plus one. -Among objects with identical distances to the query, -the object with the smallest index is considered to be the closest. -Note that $\pos(o_i) \ge i$. - -Several effectiveness metrics are computed by the benchmarking utility: -\begin{itemize} -\item A \emph{number of points closer} to the query than the nearest returned point. -This metric is equal $\pos(o_1)$ minus one. If $o_1$ is always the true nearest object, -its positional distance is one and, thus, the \emph{number of points closer} is always equal to zero. 
-\item A \emph{relative position} error for point $o_i$ is equal to $\pos(o_i)/i$, -an aggregate value is obtained by computing the geometric mean over all returned $o_i$; -\item \emph{Recall}, which is is equal to the fraction of all correct answers retrieved. -\item \emph{Classification accuracy}, which is equal to the fraction of labels correctly -predicted by a \knn based classification procedure. -\end{itemize} - -The first two metrics represent a so-called rank (approximation) error. -The closer the returned objects are to the query object, -the better is the quality of the search response and the lower is the rank -approximation error. - -\begin{wraptable}{R}{0.52\textwidth} -\caption{An example of a human-readable report -\label{TableHRep}} -\begin{verbatim} -=================================== -vptree: triangle inequality -alphaLeft=2.0,alphaRight=2.0 -=================================== -# of points: 9900 -# of queries: 100 ------------------------------------- -Recall: 0.954 -> [0.95 0.96] -ClassAccuracy: 0 -> [0 0] -RelPosError: 1.05 -> [1.05 1.06] -NumCloser: 0.11 -> [0.09 0.12] ------------------------------------- -QueryTime: 0.2 -> [0.19 0.21] -DistComp: 2991 -> [2827 3155] ------------------------------------- -ImprEfficiency: 2.37 -> [2.32 2.42] -ImprDistComp: 3.32 -> [3.32 3.39] ------------------------------------- -Memory Usage: 5.8 MB ------------------------------------- -\end{verbatim} -\textbf{Note:} \emph{confidence intervals} are in brackets -%\vspace{-4em} -\end{wraptable} - - -Recall is a classic metric. -It was argued, however, -that recall does not account for positional information of returned objects -and is, therefore, inferior to rank approximation error metrics \cite{Amato_et_al:2003,Cayton2008}. -Consider the case of 10-NN search and imagine that there are two methods -that, on average, find only half of true 10-NN objects. -Also, assume that the first method always finds neighbors from one to five, -but misses neighbors from six to ten. -The second method always finds neighbors from six to ten, but misses the first five ones. -Clearly, the second method produces substantially inferior results, but it has the same recall as the first one. - -If we had ground-truth queries and relevance judgments from human assessors, -we could in principle compute other realistic effectiveness metrics -such as the mean average precision, -or the normalized discounted cumulative gain. -This remains for the future work. - -Note that it is pointless to compute the mean average precision -when human judgments are not available, \href{http://searchivarius.org/blog/when-average-precision-equal-recall}{as the mean average precision -is identical to the recall in this case}. - -\subsection{Interpreting and Processing Benchmark Results} -If the user specifies the option \ttt{--outFilePrefix}, -the benchmarking results are stored to the file system. -A prefix of result files is defined by the parameter \ttt{--outFilePrefix} -while the suffix is defined by a type of the search procedure (the \knn or the range search) -as well as by search parameters (e.g., the range search radius). -For each type of search, two files are generated: -a report in a human-readable format, -and a tab-separated data file intended for automatic processing. -The data file contains only the average values, -which can be used to, e.g., produce efficiency-effectiveness plots -as described in \S~\ref{SectionGenPlot}. 
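For concreteness, the aggregate value reported for the relative position error (e.g., the \ttt{RelPosError} row in Table~\ref{TableHRep}) is the geometric mean of the per-object ratios defined in \S~\ref{SectionEffect}: for a query returning $k$ objects,
\begin{displaymath}
\mbox{RelPosError} = \left( \prod_{i=1}^{k} \frac{\pos(o_i)}{i} \right)^{1/k} .
\end{displaymath}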
- -An example of a human-readable report (\emph{confidence intervals} are in square brackets) -is given in Table~\ref{TableHRep}. -In addition to averages, the human-readable report provides 95\% confidence intervals. -In the case of bootstrapping, statistics collected for several splits of the data set are aggregated. -For the retrieval time and the number of distance computations, -this is done via a classic fixed-effect model adopted in meta-analysis \cite{Hedges_and_Vevea:1998}. -When dealing with other performance metrics, -we employ a simplistic approach of averaging split-specific values -and computing the sample variance over split-specific averages.\footnote{The distribution -of many metric values is not normal. There are approaches to resolve this issue (e.g., apply a transformation), -but an additional investigation is needed to understand which approaches work best.} -Note that for all metrics, except the relative position error, an average is computed using the arithmetic mean. -For the relative error, however, we use the geometric mean \cite{king:1986}. - diff --git a/manual/latex/extend.tex b/manual/latex/extend.tex deleted file mode 100644 index 494ed46..0000000 --- a/manual/latex/extend.tex +++ /dev/null @@ -1,612 +0,0 @@ - -\section{Extending the code}\label{SectionExtend} -It is possible to add new spaces and search methods. -This is done in three steps, which we only outline here. -A more detailed description can be found in \S~\ref{SectionCreateSpace} -and \S~\ref{SectionCreateMethod}. - -In the first step, the user writes the code that implements the -functionality of a method or a space. -In the second step, the user writes a special helper file -containing a function that creates the space or the method. -In this helper file, it is necessary to include -the method/space header. - -Because we tend to give the helper file the same name -as the header of the corresponding method/space, -we should not include method/space headers using quotes (in other words, -use only \emph{angle} brackets). -Including such headers using quotes makes the code fail to compile under Visual Studio. -Here is an example of a proper include-directive: -\begin{verbatim} -#include <...> -\end{verbatim} - - -In the third step, the user adds -the registration code to either the file -\href{\replocfile similarity_search/include/factory/init_spaces.h}{init\_spaces.h} (for spaces) -or to the file -\href{\replocfile similarity_search/include/factory/init_methods.h}{init\_methods.h} (for methods). -This step has two sub-steps. -First, the user includes the previously created helper file into either -\ttt{init\_spaces.h} or \ttt{init\_methods.h}. -Second, the function \ttt{initMethods} or \ttt{initSpaces} is extended -by adding a macro call that actually registers the space or method in a factory class. - -Note that no explicit/manual modification of makefiles (or other configuration files) is required. -However, you have to re-run \ttt{cmake} each time a new source file is created (addition -of header files does not require a \ttt{cmake} run). This is necessary to automatically update the makefiles so that they include new source files. - -It is noteworthy that all implementations of methods and spaces -are mostly template classes parameterized by the distance value type. -Recall that the distance function can return an integer (\ttt{int}), -a single-precision (\ttt{float}), or a double-precision (\ttt{double}) real value. -The user may choose to provide specializations for all possible -distance values or decide to focus, e.g., only on integer-valued distances.
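For example, for the zero-functionality sample method described in \S~\ref{SectionCreateMethod}, supporting only integer-valued distances would amount to the following sketch (the header is included with angle brackets, and a single registration macro is added to \ttt{initMethods}):
\begin{verbatim}
// In the helper file: include the method header with angle brackets.
#include <method/dummy.h>
...
// In factory/init_methods.h, inside initMethods():
REGISTER_METHOD_CREATOR(int, METH_DUMMY, CreateDummy)
\end{verbatim}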
- -The user can also add new applications, which are meant to be -a part of the testing framework/library. -However, adding new applications does require minor editing of the meta-makefile \ttt{CMakeLists.txt} -(and re-running \ttt{cmake} \S~\ref{SectionBuildLinux}) on Linux -It is also possible to create standalone applications that use the library. - -In the following subsections, -we consider extension tasks in more detail. -For illustrative purposes, -we created a zero-functionality space (\ttt{DummySpace}), -method (\ttt{DummyMethod}), and application (\ttt{dummy\_app}). -These zero-functionality examples can also be used as starting points to develop fully functional code. - -\subsection{Test Workflow}\label{SectionWorkflow} -The main benchmarking utility \ttt{experiment} parses command line parameters. -Then, it creates a space and a search method using the space and the method factories. -Both search method and spaces can have parameters, -which are passed to the method/space in an instance -of the class \ttt{AnyParams}. We consider this in detail in \S~\ref{SectionCreateSpace} and \S~\ref{SectionCreateMethod}. - -When we create a class representing a search method, -the constructor of the class does not create an index in the memory. -The index is created using either the function \ttt{CreateIndex} (from scratch) -or the function \ttt{LoadIndex} (from a previously created index image). -The index can be saved to disk using the function \ttt{SaveIndex}. -Note, however, that most methods do not support index (de)-serialization. - -Depending on parameters passed to the benchmarking utility, two test scenarios are possible. -In the first scenario, the user specifies separate data and test files. -In the second scenario, a test file is created by bootstrapping: -The data set is randomly divided into training and a test set. -Then, -we call the function \href{\replocfile similarity_search/include/experiments.h#L70}{RunAll} -and subsequently \href{\replocfile similarity_search/include/experiments.h#L213}{Execute} for all possible test sets. - -The function \ttt{Execute} is a main workhorse, which creates queries, runs searches, -produces gold standard data, and collects execution statistics. -There are two types of queries: nearest-neighbor and range queries, -which are represented by (template) classes \ttt{RangeQuery} and \ttt{KNNQuery}. -Both classes inherit from the class \ttt{Query}. -Similar to spaces, these template classes are parameterized by the type of the distance value. - -Both types of queries are similar in that they implement the \ttt{Radius} function -and the functions \ttt{CheckAndAddToResult}. -In the case of the range query, the radius of a query is constant. -However, in the case of the nearest-neighbor query, -the radius typically decreases as we compare the query -with previously unseen data objects (by calling the function \ttt{CheckAndAddToResult}). -In both cases, the value of the function \ttt{Radius} can be used to prune unpromising -partitions and data points. - -This commonality between the \ttt{RangeQuery} and \ttt{KNNQuery} -allows us in many cases to carry out a nearest-neighbor query -using an algorithm designed to answer range queries. -Thus, only a single implementation of a search method---that answers queries of both types---can be used in many cases. - -A query object proxies distance computations during the testing phase. 
-Namely, the distance function is accessible through the function -\ttt{IndexTimeDistance}, which is defined in the class \ttt{Space}. -During the testing phase, a search method can compute a distance -only by accessing functions \ttt{Distance}, -\ttt{DistanceObjLeft} (for left queries) and -\ttt{DistanceObjRight} for right queries, -which are member functions of the class \ttt{Query}. -The function \ttt{Distance} accepts two parameters (i.e., object pointers) and -can be used to compare two arbitrary objects. -The functions \ttt{DistanceObjLeft} and \ttt{DistanceObjRight} are used -to compare data objects with the query. -Note that it is a query object memorizes the number of distance computations. -This allows us to compute the variance in the number of distance evaluations -and, consequently, a respective confidence interval. - -\todonoteinline{Add stuff about saving the space} - -\subsection{Creating a space}\label{SectionCreateSpace} -A space is a collection of data objects. -In our library, objects are represented by instances -of the class \ttt{Object}. -The functionality of this class is limited to -creating new objects and/or their copies as well providing -access to the raw (i.e., unstructured) representation of the data -(through functions \ttt{data} and \ttt{datalength}). -We would re-iterate that currently (though this may change in the future releases), -\ttt{Object} is a very basic class that only keeps a blob of data and blob's size. -For example, the \ttt{Object} can store an array of single-precision floating point -numbers, but it has no function to obtain the number of elements. -These are the spaces that are responsible for reading objects from files, -interpreting the structure of the data blobs (stored in the \ttt{Object}), -and computing a distance between two objects. - - -For dense vector spaces the easiest way to create a new space, -is to create a functor (function object class) that computes a distance. -Then, this function should be used to instantiate a template -\href{\replocfile similarity_search/include/space/space_vector_gen.h}{VectorSpaceGen}. -A sample implementation of this approach can be found -in \href{\replocfile sample_standalone_app/sample_standalone_app1.cc#L114}{sample\_standalone\_app1.cc}. -However, as we explain below, \textbf{additional work} is needed if the space should work correctly with all projection methods -(see \S~\ref{SectionProjMethod}) or any other methods that rely on projections (e.g., OMEDRANK \S~\ref{SectionOmedrank}). - -To further illustrate the process of developing a new space, -we created a sample zero-functionality space \ttt{DummySpace}. -It is represented by -the header file -\href{\replocfile similarity_search/include/space/space_dummy.h}{space\_dummy.h} -and the source file -\href{\replocfile similarity_search/src/space/space_dummy.cc}{space\_dummy.cc}. -The user is encouraged to study these files and read the comments. -Here we focus only on the main aspects of creating a new space. - -The sample files include a template class \ttt{DummySpace} (see Table~\ref{FigDummySpace}), -which is declared and defined in the namespace \ttt{similarity}. -It is a direct ancestor of the class \ttt{Space}. - -It is possible to provide the complete implementation of the \ttt{DummySpace} -in the header file. However, this would make compilation slower. -Instead, we recommend to use the mechanism of explicit template instantiation. 
-To this end, the user should instantiate the template in the source file -for all possible combination of parameters. -In our case, the \emph{source} file -\href{\replocfile similarity_search/src/space/space_dummy.cc}{space\_dummy.cc} -contains the following lines: -\begin{verbatim} -template class SpaceDummy; -template class SpaceDummy; -template class SpaceDummy; -\end{verbatim} - -\begin{table}[!htbp] -\caption{\label{FigDummySpace}A sample space class} -\begin{verbatim} -template -class SpaceDummy : public Space { - public: - ... - /** Standard functions to read/write/create objects */ - // Create an object from a (possibly binary) string. - virtual unique_ptr - CreateObjFromStr(IdType id, LabelType label, const string& s, - DataFileInputState* pInpState) const; - - // Create a string representation of an object. - // The string representation may include external ID. - virtual string - CreateStrFromObj(const Object* pObj, const string& externId) const; - - // Open a file for reading, fetch a header - // (if there is any) and memorize an input state - virtual unique_ptr - OpenReadFileHeader(const string& inputFile) const; - - // Open a file for writing, write a header (if there is any) - // and memorize an output state - virtual unique_ptr - OpenWriteFileHeader(const ObjectVector& dataset, - const string& outputFile) const; - /* - * Read a string representation of the next object in a file as well - * as its label. Return false, on EOF. - */ - virtual bool - ReadNextObjStr(DataFileInputState &, string& strObj, LabelType& label, - string& externId) const; - - /* - * Write a string representation of the next object to a file. We totally delegate - * this to a Space object, because it may package the string representation, by - * e.g., in the form of an XML fragment. - */ - virtual void WriteNextObj(const Object& obj, const string& externId, - DataFileOutputState &) const; - /** End of standard functions to read/write/create objects */ - - ... - - /* - * CreateDenseVectFromObj and GetElemQty() are only needed, if - * one wants to use methods with random projections. - */ - virtual void CreateDenseVectFromObj(const Object* obj, dist_t* pVect, - size_t nElem) const { - throw runtime_error("Cannot create vector for the space: " + StrDesc()); - } - virtual size_t GetElemQty(const Object* object) const {return 0;} - protected: - virtual dist_t HiddenDistance(const Object* obj1, - const Object* obj2) const; - // Don't permit copying and/or assigning - DISABLE_COPY_AND_ASSIGN(SpaceDummy); -}; -\end{verbatim} -\end{table} - -Most importantly, the user needs to implement the function \ttt{HiddenDistance}, -which computes the distance between objects, -and the function \ttt{CreateObjFromStr} that creates a data point object from an instance -of a C++ class \ttt{string}. -For simplicity---even though this is not the most efficient approach---all our spaces create -objects from textual representations. However, this is not a principal limitation, -because a C++ string can hold binary data as well. -Perhaps, the next most important function is \ttt{ReadNextObjStr}, -which reads a string representation of the next object from a file. -A file is represented by a reference to a subclass of the class \ttt{DataFileInputState}. - -Compared to previous releases, the new \ttt{Space} API is substantially more complex. -This is necessary to standardize reading/writing of generic objects. -In turn, this has been essential to implementing a generic query server. 
-The query server accepts data points in the same format as they are stored in a data file. -The above mentioned function \ttt{CreateObjFromStr} is used for de-serialization -of both the data points stored in a file and query data points passed to the query server. - -Additional complexity arises from the need to update space parameters after a space object is created. -This permits a more complex storage model where, e.g., parameters are stored -in a special dedicated header file, while data points are stored elsewhere, -e.g., split among several data files. -To support such functionality, we have a function that opens a data file (\ttt{OpenReadFileHeader}) -and creates a state object (sub-classed from \ttt{DataFileInputState}), -which keeps the current file(s) state as well as all space-related parameters. -When we read data points using the function \ttt{ReadNextObjStr}, -the state object is updated. -The function \ttt{ReadNextObjStr} may also read an optional external identifier for an object. -When it produces a non-empty identifier it is memorized by the query server and is further -used for query processing (see \S~\ref{SectionQueryServer}). -After all data points are read, this state object is supposed to be passed to the \ttt{Space} object -in the following fashion: -\begin{verbatim} -unique_ptr -inpState(space->ReadDataset(dataSet, externIds, fileName, maxNumRec)); -space->UpdateParamsFromFile(*inpState); -\end{verbatim} -For a more advanced implementation of the space-related functions, -please, see the file -\href{\replocfile similarity_search/src/space/space_vector.cc}{space\_vector.cc}. - -Remember that the function \ttt{HiddenDistance} should not be directly accessible -by classes that are not friends of the \ttt{Space}. -As explained in \S~\ref{SectionWorkflow}, -during the indexing phase, -\ttt{HiddenDistance} is accessible through the function -\ttt{Space::IndexTimeDistance}. -During the testing phase, a search method can compute a distance -only by accessing functions \ttt{Distance}, \ttt{DistanceObjLeft}, or -\ttt{DistanceObjRight}, which are member functions of the \ttt{Query}. -This is by far not a perfect solution and we are contemplating about better ways to proxy distance computations. - -Should we implement a vector space that works properly with projection methods -and classic random projections, we need to define functions \ttt{GetElemQty} and \ttt{CreateDenseVectFromObj}. -In the case of a \emph{dense} vector space, \ttt{GetElemQty} -should return the number of vector elements stored in the object. -For \emph{sparse} vector spaces, it should return zero. The function \ttt{CreateDenseVectFromObj} -extracts elements stored in a vector. For \emph{dense} vector spaces, -it merely copies vector elements to a buffer. -For \emph{sparse} space vector spaces, -it should do some kind of basic dimensionality reduction. -Currently, we do it via the hashing trick (see \S~\ref{SectionProjDetails}). - - -Importantly, we need to ``tell'' the library about the space, -by registering the space in the space factory. -At runtime, the space is created through a helper function. -In our case, it is called \ttt{CreateDummy}. 
-The function, accepts only one parameter, -which is a reference to an object of the type \ttt{AllParams}: -%\newpage - -\begin{verbatim} -template -Space* CreateDummy(const AnyParams& AllParams) { - AnyParamManager pmgr(AllParams); - - int param1, param2; - - pmgr.GetParamRequired("param1", param1); - pmgr.GetParamRequired("param2", param2); - - pmgr.CheckUnused(); - - return new SpaceDummy(param1, param2); -} -\end{verbatim} -To extract parameters, the user needs an instance of the class \ttt{AnyParamManager} (see the above example). -In most cases, it is sufficient to call two functions: \ttt{GetParamOptional} and -\ttt{GetParamRequired}. -To verify that no extra parameters are added, it is recommended to call the function \ttt{CheckUnused} -(it fires an exception if some parameters are unclaimed). -This may also help to identify situations where the user misspells -a parameter's name. - - -Parameter values specified in the commands line are interpreted as strings. -The \ttt{GetParam*} functions can convert these string values -to integer or floating-point numbers if necessary. -A conversion occurs, if the type of a receiving variable (passed as a second parameter -to the functions \ttt{GetParam*}) is different from a string. -It is possible to use boolean variables as parameters. -In that, in the command line, one has to specify 1 (for \ttt{true}) or 0 (for \ttt{false}). -Note that the function \ttt{GetParamRequired} raises an exception, -if the request parameter was not supplied in the command line. - -The function \ttt{CreateDummy} is registered in the space factory using a special macro. -This macro should be used for all possible values of the distance function, -for which our space is defined. For example, if the space is defined -only for integer-valued distance function, this macro should be used only once. -However, in our case the space \ttt{CreateDummy} is defined for integers, -single- and double-precision floating pointer numbers. Thus, we use this macro -three times as follows: -\begin{verbatim} -REGISTER_SPACE_CREATOR(int, SPACE_DUMMY, CreateDummy) -REGISTER_SPACE_CREATOR(float, SPACE_DUMMY, CreateDummy) -REGISTER_SPACE_CREATOR(double, SPACE_DUMMY, CreateDummy) -\end{verbatim} - -This macro should be placed into the function \ttt{initSpaces} in the -file -\href{\replocfile similarity_search/include/factory/init\_spaces.h}{init\_spaces.h}. -Last, but not least we need to add the include-directive -for the helper function, which creates -the class, to the file \ttt{init\_spaces.h} as follows: -\begin{verbatim} -#include "factory/space/space_dummy.h" -\end{verbatim} - -To conlcude, we recommend to make a \ttt{Space} object is non-copyable. -This can be done by using our macro \ttt{DISABLE\_COPY\_AND\_ASSIGN}. - - -\subsection{Creating a search method}\label{SectionCreateMethod} - -%\newpage -\begin{table}[!htbp] -\caption{\label{FigDummyMethod}A sample search method class} -\begin{verbatim} -template -class DummyMethod : public Index { - public: - DummyMethod(Space& space, - const ObjectVector& data) : data_(data), space_(space) {} - - void CreateIndex(const AnyParams& IndexParams) override { - AnyParamManager pmgr(IndexParams); - pmgr.GetParamOptional("doSeqSearch", - bDoSeqSearch_, - // One should always specify the default value of an optional parameter! 
- false - ); - // Check if a user specified extra parameters, - // which can be also misspelled variants of existing ones - pmgr.CheckUnused(); - // It is recommended to call ResetQueryTimeParams() - // to set query-time parameters to their default values - this->ResetQueryTimeParams(); - } - - // SaveIndex is not necessarily implemented - virtual void SaveIndex(const string& location) override { - throw runtime_error( - "SaveIndex is not implemented for method: " + StrDesc()); - } - // LoadIndex is not necessarily implemented - virtual void LoadIndex(const string& location) override { - throw runtime_error( - "LoadIndex is not implemented for method: " + StrDesc()); - } - void SetQueryTimeParams(const AnyParams& QueryTimeParams) override; - - // Description of the method, consider printing crucial parameter values - const std::string StrDesc() const override { - stringstream str; - str << "Dummy method: " - << (bDoSeqSearch_ ? " does seq. search " : - " does nothing (really dummy)"); - return str.str(); - } - - // One needs to implement two search functions. - void Search(RangeQuery* query, IdType) const override; - void Search(KNNQuery* query, IdType) const override; - - // If we duplicate data, let the framework know it - virtual bool DuplicateData() const override { return false; } - private: - ... - // Don't permit copying and/or assigning - DISABLE_COPY_AND_ASSIGN(DummyMethod); -}; -\end{verbatim} -\end{table} - -To illustrate the basics of developing a new search method, -we created a sample zero-functionality method \ttt{DummyMethod}. -It is represented by -the header file -\href{\replocfile similarity_search/include/method/dummy.h}{dummy.h} -and the source file -\href{\replocfile similarity_search/src/method/dummy.cc}{dummy.cc}. -The user is encouraged to study these files and read the comments. -Here we would omit certain details. - -Similar to the space and query classes, a search method is implemented using -a template class, which is parameterized by the distance function value (see Table~\ref{FigDummyMethod}). -Note again that the constructor of the class does not create an index in the memory. -The index is created using either the function \ttt{CreateIndex} (from scratch) -or the function \ttt{LoadIndex} (from a previously created index image). -The index can be saved to disk using the function \ttt{SaveIndex}. -It does not have to be a comprehensive index that contains a copy of the data set. -Instead, it is sufficient to memorize only the index structure itself (because -the data set is always loaded separately). -Also note that most methods do not support index (de)-serialization. - -The constructor receives a reference to a space object as well as a reference to an array of data objects. -In some cases, e.g., when we wrap existing methods such as the multiprobe LSH (see \S~\ref{SectionLSH}), -we create a copy of the data set (simply because is was easier to write the wrapper this way). -The framework can be informed about such a situation via the virtual function \ttt{DuplicateData}. -If this function returns true, the framework ``knows'' that the data was duplicated. -Thus, it can correct an estimate for the memory required by the method. - -The function \ttt{CreateIndex} receives a parameter object. -In our example, the parameter object is used to retrieve the single index-time parameter: \ttt{doSeqSearch}. -When this parameter value is true, our dummy method carries out a sequential search. -Otherwise, it does nothing useful. 
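For illustration, the \ttt{Search} function for \knn{} queries can implement this sequential search along the following lines (a sketch: we assume the \ttt{KNNQuery} overload of \ttt{CheckAndAddToResult} that takes just an object pointer and computes the distance internally; the actual \href{\replocfile similarity_search/src/method/dummy.cc}{dummy.cc} may differ in details):
\begin{verbatim}
template <typename dist_t>
void DummyMethod<dist_t>::Search(KNNQuery<dist_t>* query, IdType) const {
  if (bDoSeqSearch_) {
    // Let the query object compare itself with every data point:
    // it computes the distance and keeps the k closest entries.
    for (const Object* obj : data_) {
      query->CheckAndAddToResult(obj);
    }
  } // otherwise the method does nothing
}
\end{verbatim}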
-Again, it is recommended to call the function \ttt{CheckUnused} to ensure -that the user did not enter parameters with incorrect names. -It is also recommended to call the function \ttt{ResetQueryTimeParams} (\ttt{this} pointer needs to -be specified explicitly here) to reset query-time parameters after the index is created (or loaded from disk). - -Unlike index-time parameters, query-time parameters can be changed without rebuilding the index -by invoking the function \ttt{SetQueryTimeParams}. -The function \ttt{SetQueryTimeParams} accepts a constant reference to a parameter object. -The programmer, in turn, creates a parameter manager object to extract actual parameter values. -To this end, two functions are used: \ttt{GetParamRequired} and \ttt{GetParamOptional}. -Note that the latter function must be supplied with a mandatory \emph{default} value for the parameter. -Thus, the parameter value is properly reset to its default value when the user does not specify the parameter value -explicitly (e.g., the parameter specification is omitted when the user invokes the benchmarking utility \ttt{experiment})! - -There are two search functions each of which receives two parameters. -The first parameter is a pointer to a query (either a range or a \knn query). -The second parameter is currently unused. -Note again that during the search phase, a search method can -compute a distance only by accessing functions \ttt{Distance}, \ttt{DistanceObjLeft}, or -\ttt{DistanceObjRight}, which are member functions of a query object. -The function \ttt{IndexTimeDistance} \textbf{should not be used} in a function \ttt{Search}, -but it can be used in the function \ttt{CreateIndex}. -If the user attempts to invoke \ttt{IndexTimeDistance} during the test phase, -\textbf{the program will terminate}. -\footnote{As noted previously, we want to compute the number of times -the distance was computed for each query. This allows us to estimate the variance. -Hence, during the testing phase, the distance function should be invoked only through -a query object.} - - -Finally, we need to ``tell'' the library about the method, -by registering the method in the method factory, -similarly to registering a space. -At runtime, the method is created through a helper function, -which accepts several parameters. -One parameter is a reference to an object of the type \ttt{AllParams}. -In our case, the function name is \ttt{CreateDummy}: - -\begin{verbatim} -#include - -namespace similarity { -template -Index* CreateDummy(bool PrintProgress, - const string& SpaceType, - Space& space, - const ObjectVector& DataObjects) { - return new DummyMethod(space, DataObjects); -} -\end{verbatim} -There is an include-directive preceding -the creation function, which uses angle brackets. -As explained previously, if you opt to using quotes (in the include-directive), -the code may not compile under the Visual Studio. - -Again, similarly to the case of the space, -the method-creating function \ttt{CreateDummy} needs -to be registered in the method factory in two steps. 
-First, we need to include \ttt{dummy.h} into the file -\href{\replocfile similarity_search/include/factory/init_methods.h}{init\_methods.h} as follows: -\begin{verbatim} -#include "factory/method/dummy.h" -\end{verbatim} -Then, this file is further modified by adding the following lines to the function \ttt{initMethods}: -\begin{verbatim} -REGISTER_METHOD_CREATOR(float, METH_DUMMY, CreateDummy) -REGISTER_METHOD_CREATOR(double, METH_DUMMY, CreateDummy) -REGISTER_METHOD_CREATOR(int, METH_DUMMY, CreateDummy) -\end{verbatim} - -If we want our method to work only with integer-valued distances, -we only need the following line: -\begin{verbatim} -REGISTER_METHOD_CREATOR(int, METH_DUMMY, CreateDummy) -\end{verbatim} - -When adding the method, please, consider expanding -the test utility \ttt{test\_integr}. -This is especially important if for some combination of parameters the method is expected -to return all answers (and will have a perfect recall). Then, if we break the code in the future, -this will be detected by \ttt{test\_integr}. - -To create a test case, the user needs to add one or more test cases -to the file -\href{\replocfile similarity_search/test/test_integr.cc#L65}{test\_integr.cc}. -A test case is an instance of the class \ttt{MethodTestCase}. -It encodes the range of plausible values -for the following performance parameters: the recall, -the number of points closer to the query than the nearest returned point, -and the improvement in the number of distance computations. - -\subsection{Creating an application on Linux (inside the framework)}\label{SectionCreateAppLinux} -First, we create a hello-world source file -\href{\replocfile similarity_search/src/dummy_app.cc}{dummy\_app.cc}: -\begin{verbatim} -#include - -using namespace std; -int main(void) { - cout << "Hello world!" << endl; -} -\end{verbatim} -Now we need to modify the meta-makefile -\href{\replocfile similarity_search/src/CMakeLists.txt}{similarity\_search/src/CMakeLists.txt} and -re-run \ttt{cmake} as described in \S~\ref{SectionBuildLinux}. - -More specifically, we do the following: -\begin{itemize} -\item by default, all source files in the -\href{\replocfile similarity_search/src/}{similarity\_search/src/} directory are included into the library. -To prevent \ttt{dummy\_app.cc} from being included into the library, we use the following command: -\begin{verbatim} -list(REMOVE_ITEM SRC_FILES ${PROJECT_SOURCE_DIR}/src/dummy_app.cc) -\end{verbatim} - -\item tell \ttt{cmake} to build an additional executable: -\begin{verbatim} -add_executable (dummy_app dummy_app.cc ${SRC_FACTORY_FILES}) -\end{verbatim} - -\item specify the necessary libraries: -\begin{verbatim} -target_link_libraries (dummy_app NonMetricSpaceLib lshkit - ${Boost_LIBRARIES} ${GSL_LIBRARIES} - ${CMAKE_THREAD_LIBS_INIT}) -\end{verbatim} -\end{itemize} - -\subsection{Creating an application on Windows (inside the framework)}\label{SectionCreateAppWindows} -The following description was created for Visual Studio Express 2015. -It may be a bit different for newer releases of the Visual Studio. -Creating a new sub-project in the Visual Studio is rather straightforward. - -In addition, one can use a provided sample project file \href{\replocfile similarity_search/src/dummy_app.vcxproj}{dummy\_app.vcxproj} as a template. -To this end, one needs to to create a copy of this sample project file and subsequently edit it. 
-One needs to do the following: -\begin{itemize} -\item Obtain a new value of the project GUI and put it between the tags \newline \ttt{...}; -\item Add/delete new files; -\item Add/delete/change references to the boost directories (both header files and libraries); -\item If the CPU has AVX extension, it may be necessary to enable them -as explained in \S~\ref{SectionBuildWindows}. -\item Finally, one may manually add an entry to the main project -file \href{\replocfile similarity_search/NonMetricSpaceLib.sln}{NonMetricSpaceLib.sln}. -\end{itemize}