From 261347286be77d784ce1069ab65bfe839901a7bb Mon Sep 17 00:00:00 2001 From: searchivarius Date: Mon, 3 Jun 2019 02:15:18 -0400 Subject: [PATCH] Generic doc improvements (interm. commit) --- README.md | 18 +++++++++--------- manual/README.md | 3 ++- manual/benchmarking.md | 3 +++ manual/build.md | 41 +++++++++++++++++++++++++++++++++++------ manual/methods.md | 2 +- manual/spaces.md | 20 ++++++++++---------- 6 files changed, 60 insertions(+), 27 deletions(-) create mode 100644 manual/benchmarking.md diff --git a/README.md b/README.md index 9a94911..37e9e4c 100644 --- a/README.md +++ b/README.md @@ -3,22 +3,22 @@ [![Windows Build Status](https://ci.appveyor.com/api/projects/status/wd63b9doe7xco81t/branch/master?svg=true)](https://ci.appveyor.com/project/searchivarius/nmslib) [![Join the chat at https://gitter.im/nmslib/Lobby](https://badges.gitter.im/nmslib/Lobby.svg)](https://gitter.im/nmslib/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) -#Non-Metric Space Library (NMSLIB) +# Non-Metric Space Library (NMSLIB) -##Important Notes +## Important Notes * NMSLIB is generic, but fast, see the results of [ANN benchmarks](https://github.com/erikbern/ann-benchmarks). * A stand-alone implementation of our fastest method HNSW [also exists as a header-only library](https://github.com/nmslib/hnswlib). * All the documentation (including using Python bindings and the query server, description of methods and spaces, building the library) can be found [on this page](/manual/README.md). * For **generic questions/inquiries**, please, use [**the Gitter chat**](https://gitter.im/nmslib/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge): GitHub issues page is for bugs and feature requests. 
-##Some Limitations +## Some Limitations * Only static data sets are supported (with an exception of SW-graph) * HNSW currently duplicates memory to create optimized indices * Range/threshold search is not supported by many methods including SW-graph/HNSW -##Objectives +## Objectives Non-Metric Space Library (NMSLIB) is an **efficient** cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core-library does **not** have any third-party dependencies. @@ -32,7 +32,7 @@ NMSLIB is an **extendible library**, which means that is possible to add new sea **Authors**: Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov. **With contributions from** David Novak, Lawrence Cayton, Wei Dong, Avrelin Nikita, Ben Frederickson, Dmitry Yashunin, Bob Poekert, @orgoro, Maxim Andreev, Daniel Lemire, Nathan Kurz, Alexander Ponomarenko. -##Brief History +## Brief History NMSLIB started as a personal project of Bilegsaikhan Naidan, who created the initial code base, the Python bindings, and participated in earlier evaluations. @@ -44,11 +44,11 @@ a Neighborhood APProximation index (NAPP) proposed by Tellez et al. (2013) and i as well as a vanilla uncompressed inverted file. -##Credits and Citing +## Credits and Citing If you find this library useful, feel free to cite our SISAP paper [**[BibTex]**](http://dblp.uni-trier.de/rec/bibtex/conf/sisap/BoytsovN13) as well as other papers listed in the end. One **crucial contribution** to cite is the fast Hierarchical Navigable World graph (HNSW) method [**[BibTex]**](https://dblp.uni-trier.de/rec/bibtex/journals/corr/MalkovY16). Please, [also check out the stand-alone HNSW implementation by Yury Malkov](https://github.com/nmslib/hnswlib), which is released as a header-only HNSWLib library. -##License +## License Most of this code is released under the Apache License Version 2.0 http://www.apache.org/licenses/. @@ -57,11 +57,11 @@ Apache License Version 2.0 http://www.apache.org/licenses/. 
* The k-NN graph construction algorithm *NN-Descent* due to Dong et al. 2011 (see the links below), which is also embedded in our library, seems to be covered by a free-to-use license, similar to Apache 2. * FALCONN library's licence is MIT. -##Funding +## Funding Leonid Boytsov was supported by the [Open Advancement of Question Answering Systems (OAQA) group](https://github.com/oaqa) and the following NSF grant #1618159: "[Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1618159&HistoricalAwards=false)". Bileg was supported by the [iAd Center](https://web.archive.org/web/20160306011711/http://www.iad-center.com/). -##Related Publications +## Related Publications Most important related papers are listed below in the chronological order: * L. Boytsov, D. Novak, Y. Malkov, E. Nyberg (2016). [Off the Beaten Path: Let’s Replace Term-Based Retrieval diff --git a/manual/README.md b/manual/README.md index 6841d3d..57d579d 100644 --- a/manual/README.md +++ b/manual/README.md @@ -46,10 +46,11 @@ an average fraction of true neighbors returned by the method (with ties broken a # Documentation Links * [Python bindings overview](/python_bindings) and [Python bindings API](https://nmslib.github.io/nmslib/index.html) -* [A Brief List of Methods and Parameters](/manual/parameters.md) +* [A Brief List of Methods and Parameters](/manual/methods.md) * [A brief list of supported spaces/distance](/manual/spaces.md) * [Building the main library](/manual/build.md) * [Building and using the query server](/manual/query_server.md) +* [Benchmarking using NMSLIB utility ``experiment``](/manual/benchmarking.md) * [Extending the library](/manual/extensions.md) * [A more detailed and formal description of methods and spaces (PDF)](/manual/latex/manual.pdf) diff --git a/manual/benchmarking.md b/manual/benchmarking.md new file mode 100644 index 0000000..9b75c94 --- /dev/null +++ b/manual/benchmarking.md @@ 
-0,0 +1,3 @@ +# Benchmarking + +## \ No newline at end of file diff --git a/manual/build.md b/manual/build.md index 73c34ea..b8e65fc 100644 --- a/manual/build.md +++ b/manual/build.md @@ -1,6 +1,6 @@ -#Building the main library on Linux/Mac +# Building the Main Library on Linux/Mac -##Prerequisites +## Prerequisites 1. A modern compiler that supports C++11: G++ 4.7, Intel compiler 14, Clang 3.4, or Visual Studio 14 (version 12 can probably be used as well, but the project files need to be downgraded). 2. **64-bit** Linux is recommended, but most of our code builds on **64-bit** Windows and MACOS as well. @@ -14,7 +14,7 @@ To install additional prerequisite packages on Ubuntu, type the following sudo apt-get install libboost-all-dev libgsl0-dev libeigen3-dev ``` -##Quick Start on Linux/Mac +## Quick Start on Linux/Mac To compile, go to the directory **similarity_search** and type: ``` @@ -27,11 +27,11 @@ cmake . -DWITH_EXTRAS=1 make ``` -##Quick Start on Windows +## Quick Start on Windows Building on Windows requires [Visual Studio 2015 Express for Desktop](https://www.visualstudio.com/en-us/downloads/download-visual-studio-vs.aspx) and [CMake for Windows](https://cmake.org/download/). First, generate Visual Studio solution file for 64 bit architecture using CMake **GUI**. You have to specify both the platform and the version of Visual Studio. Then, the generated solution can be built using Visual Studio. **Attention**: this way of building on Windows is not well tested yet. We suspect that there might be some issues related to building truly 64-bit binaries. 
-##Additional Building Details +## Additional Building Details Here we cover a few details on choosing the compiler, a type of the release, and manually pointing to the location @@ -90,4 +90,33 @@ manually as follows: ``` export BOOST_ROOT=$HOME/boost_download_dir -``` \ No newline at end of file +``` + +## Testing the Correctness of Implementations + +We have two main testing utilities ``bunit`` and ``test_integr`` (``bunit.exe`` and +``test_integr.exe`` on Windows). +Both utilities accept a single optional argument: the name of the log file. +If the log file is not specified, a lot of informational messages are printed to the screen. + +The utility ``bunit`` verifies some basic functionality, akin to unit testing. +In particular, it checks that an optimized version of, e.g., the Euclidean distance +returns results that are very similar to the results returned by a simpler, unoptimized version. +The utility ``bunit`` is expected to always run without errors. + +The utility ``test_integr`` runs complete implementations of many methods +and checks if several effectiveness and efficiency characteristics +meet the expectations. +The expectations are encoded as an array of instances of the class ``MethodTestCase`` +(see [the code here](similarity_search/test/test_integr.cc#L65)). +For example, we expect that the recall falls in a certain pre-recorded range. + +Because almost all our methods are randomized, there is a great deal of variance +in the observed performance characteristics. Thus, some tests +may fail infrequently, e.g., if the actual recall value is slightly lower than the expected minimum +or slightly higher than the expected maximum. +From an error message, it should be clear if the discrepancy is substantial, i.e., +something went wrong, or not, i.e., we observe an unlikely outcome due to randomization. 
+ + + diff --git a/manual/methods.md b/manual/methods.md index ce18f6d..5d97dad 100644 --- a/manual/methods.md +++ b/manual/methods.md @@ -1,4 +1,4 @@ -#A Brief List of Methods and Parameters +# A Brief List of Methods and Parameters ## Overview diff --git a/manual/spaces.md b/manual/spaces.md index 9d9684e..be5cd33 100644 --- a/manual/spaces.md +++ b/manual/spaces.md @@ -1,4 +1,4 @@ -#Spaces and Distances +# Spaces and Distances Below, there is a list of nearly all spaces (a space is a combination of data and the distance). The mnemonic name of a space is passed to python bindings function as well as to the benchmarking utility ``experiment``. When initializing the space in Python embeddings, please use the type @@ -6,7 +6,7 @@ When initializing the space in Python embeddings, please use the type A more detailed description is given in the [manual](manual/latex/manual.pdf). -##Specifying parameters of the space +## Specifying parameters of the space In some rare cases, spaces have parameters, which are specified after the colon. @@ -17,7 +17,7 @@ For example, ``lp:p=3`` denotes the L3 space and ``lp:p=2`` is a synonym for the Euclidean, i.e., L2 space. -##Fast, Slow, and Approximate variants +## Fast, Slow, and Approximate variants There can be more than one version of a distance function, which have different space-performance trade-off. @@ -42,7 +42,7 @@ and another is for right queries (the data object is the second argument and the In the latter case the name of the space ends on ``rq``. -##Input Format +## Input Format For Python bindings, all dense-vector spaces require float32 numpy-array input (two-dimensional). See an example [here](python_bindings/notebooks/search_vector_dense_optim.ipynb). One exception is the squared Euclidean space for SIFT vectors, which requires input as uint8 integer numpy arrays. An example can be found [here](python_bindings/notebooks/search_sift_uint8.ipynb). @@ -62,7 +62,7 @@ currently a limitation). 
You can pass a UTF8-encoded string, but the distance will be sometimes larger than the actual distance. -##Storage Format +## Storage Format For dense vector spaces, the data can be either single-precision or double-precision floating-point numbers. However, double-precision has not been useful so far and we do not recommend use it. @@ -134,14 +134,14 @@ and for the [Itakura-Saito distance](https://en.wikipedia.org/wiki/Itakura%E2%80 We also explicitly implement the squared JS-divergence, which is a true metric distance. -For the meaning of infixes `fast`, `slow`, `approx`, and `rq` see the information above. +For the meaning of infixes ``fast``, ``slow``, ``approx``, and ``rq`` see the information above. | Space code(s) | Description and Notes | |--------------------------------------------|-------------------------------------------------| -| ``kldivfast``, ``kldivfastrq`` | Regular KL-divergence | -| ``kldivgenslow``, ``kldivgenfast``, ``kldivgenfastrq`` | Generalized KL-divergence | -| ``itakurasaitoslow``, ``itakurasaitofast``, ``itakurasaitofastrq`` | Itakura-Saito distance | -| ``jsdivslow``, ``jsdivfast``, `jsdivfastapprox` | JS-divergence | +| `kldivfast`, `kldivfastrq` | Regular KL-divergence | +| `kldivgenslow`, `kldivgenfast`, `kldivgenfastrq` | Generalized KL-divergence | +| `itakurasaitoslow`, `itakurasaitofast`, `itakurasaitofastrq` | Itakura-Saito distance | +| `jsdivslow`, `jsdivfast`, `jsdivfastapprox` | JS-divergence | | `jsmetrslow`, `jsmetrfast`, `jsmetrfastapprox` | JS-metric | | `renyidiv_slow`, `renyidiv_fast` | Renyi divergence: parameter name `alpha` |
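
Note on the `manual/spaces.md` hunks above: the text says that space parameters follow a colon, that ``lp:p=3`` denotes the L3 space, and that ``lp:p=2`` is a synonym for the Euclidean (L2) space. The sketch below only illustrates that convention in plain Python. It is not NMSLIB code: `parse_space_spec` and `lp_distance` are hypothetical helper names, and the comma separator for multiple parameters is an assumption that the patch does not confirm.

```python
def parse_space_spec(spec):
    """Split a space mnemonic such as 'lp:p=3' into (name, params).

    Hypothetical helper, for illustration only; not part of NMSLIB.
    """
    name, _, raw_params = spec.partition(":")
    params = {}
    if raw_params:
        # Assumption: several parameters, if present, are comma-separated.
        for item in raw_params.split(","):
            key, _, value = item.partition("=")
            params[key.strip()] = float(value)
    return name.strip(), params


def lp_distance(x, y, p):
    """Minkowski (L_p) distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


if __name__ == "__main__":
    x, y = [0.0, 0.0], [3.0, 4.0]
    for spec in ("lp:p=2", "lp:p=3"):
        name, params = parse_space_spec(spec)
        # With p=2 this is the ordinary Euclidean distance.
        print(spec, name, params, lp_distance(x, y, params["p"]))
```

With `p=2` the helper reduces to the ordinary Euclidean distance, which is the sense in which the documentation calls ``lp:p=2`` a synonym for L2.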