Generic doc improvements (interm. commit)

mldrugdiscovery · Jun 3, 2019 · 2613472 · 2613472
1 parent 7bb63ef
commit 2613472

Showing 6 changed files with 60 additions and 27 deletions.
diff --git a/README.md b/README.md
@@ -3,22 +3,22 @@
 [![Windows Build Status](https://ci.appveyor.com/api/projects/status/wd63b9doe7xco81t/branch/master?svg=true)](https://ci.appveyor.com/project/searchivarius/nmslib)
 [![Join the chat at https://gitter.im/nmslib/Lobby](https://badges.gitter.im/nmslib/Lobby.svg)](https://gitter.im/nmslib/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
 
-#Non-Metric Space Library (NMSLIB) 
+# Non-Metric Space Library (NMSLIB) 
 
-##Important Notes
+## Important Notes
 
 * NMSLIB is generic, but fast, see the results of [ANN benchmarks](https://github.com/erikbern/ann-benchmarks).
 * A stand-alone implementation of our fastest method HNSW [also exists as a header-only library](https://github.com/nmslib/hnswlib).
 * All the documentation (including using Python bindings and the query server, description of methods and spaces, building the library) can be found [on this page](/manual/README.md).
 * For **generic questions/inquiries**, please, use [**the Gitter chat**](https://gitter.im/nmslib/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge): GitHub issues page is for bugs and feature requests.
 
-##Some Limitations
+## Some Limitations
 
 * Only static data sets are supported (with an exception of SW-graph)
 * HNSW currently duplicates memory to create optimized indices
 * Range/threshold search is not supported by many methods including SW-graph/HNSW
 
-##Objectives
+## Objectives
 
 Non-Metric Space Library (NMSLIB) is an **efficient** cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core-library does **not** have any third-party dependencies.
 
@@ -32,7 +32,7 @@ NMSLIB is an **extendible library**, which means that is possible to add new sea
 
 **Authors**: Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov. **With contributions from** David Novak, Lawrence Cayton, Wei Dong, Avrelin Nikita, Ben Frederickson, Dmitry Yashunin, Bob Poekert, @orgoro, Maxim Andreev, Daniel Lemire, Nathan Kurz, Alexander Ponomarenko.
 
-##Brief History
+## Brief History
 
 NMSLIB started as a personal project of Bilegsaikhan Naidan, who created the initial code base, the Python bindings,
 and participated in earlier evaluations. 
@@ -44,11 +44,11 @@ a Neighborhood APProximation index (NAPP) proposed by Tellez et al. (2013) and i
 as well as a vanilla uncompressed inverted file.
 
 
-##Credits and Citing
+## Credits and Citing
 
 If you find this library useful, feel free to cite our SISAP paper [**[BibTex]**](http://dblp.uni-trier.de/rec/bibtex/conf/sisap/BoytsovN13) as well as other papers listed at the end. One **crucial contribution** to cite is the fast Hierarchical Navigable Small World graph (HNSW) method [**[BibTex]**](https://dblp.uni-trier.de/rec/bibtex/journals/corr/MalkovY16). Please, [also check out the stand-alone HNSW implementation by Yury Malkov](https://github.com/nmslib/hnswlib), which is released as a header-only HNSWLib library.
 
-##License
+## License
 
 Most of this code is released under the
 Apache License Version 2.0 http://www.apache.org/licenses/.
@@ -57,11 +57,11 @@ Apache License Version 2.0 http://www.apache.org/licenses/.
 * The k-NN graph construction algorithm *NN-Descent* due to Dong et al. 2011 (see the links below), which is also embedded in our library, seems to be covered by a free-to-use license, similar to Apache 2.
 * FALCONN library's licence is MIT.
 
-##Funding
+## Funding
 
 Leonid Boytsov was supported by the [Open Advancement of Question Answering Systems (OAQA) group](https://github.com/oaqa) and the following NSF grant #1618159: "[Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1618159&HistoricalAwards=false)". Bileg was supported by the [iAd Center](https://web.archive.org/web/20160306011711/http://www.iad-center.com/).
 
-##Related Publications
+## Related Publications
 
 The most important related papers are listed below in chronological order: 
 * L. Boytsov, D. Novak, Y. Malkov, E. Nyberg  (2016). [Off the Beaten Path: Let’s Replace Term-Based Retrieval

diff --git a/manual/README.md b/manual/README.md
@@ -46,10 +46,11 @@ an average fraction of true neighbors returned by the method (with ties broken a
 # Documentation Links
 
 * [Python bindings overview](/python_bindings) and [Python bindings API](https://nmslib.github.io/nmslib/index.html)
-* [A Brief List of Methods and Parameters](/manual/parameters.md)
+* [A Brief List of Methods and Parameters](/manual/methods.md)
 * [A brief list of supported spaces/distance](/manual/spaces.md)
 * [Building the main library](/manual/build.md)
 * [Building and using the query server](/manual/query_server.md)
+* [Benchmarking using NMSLIB utility ``experiment``](/manual/benchmarking.md)
 * [Extending the library](/manual/extensions.md)
 * [A more detailed and formal description of methods and spaces (PDF)](/manual/latex/manual.pdf)
 
diff --git a/manual/benchmarking.md b/manual/benchmarking.md
@@ -0,0 +1,3 @@
+# Benchmarking
+
+## 
diff --git a/manual/build.md b/manual/build.md
@@ -1,6 +1,6 @@
-#Building the main library on Linux/Mac
+# Building the Main Library on Linux/Mac
 
-##Prerequisites
+## Prerequisites
 
 1. A modern compiler that supports C++11: G++ 4.7, Intel compiler 14, Clang 3.4, or Visual Studio 14 (version 12 can probably be used as well, but the project files need to be downgraded).
 2. **64-bit** Linux is recommended, but most of our code builds on **64-bit** Windows and macOS as well. 
@@ -14,7 +14,7 @@ To install additional prerequisite packages on Ubuntu, type the following
 sudo apt-get install libboost-all-dev libgsl0-dev libeigen3-dev
 ```
 
-##Quick Start on Linux/Mac
+## Quick Start on Linux/Mac
 
 To compile, go to the directory **similarity_search** and type:  
 ```
@@ -27,11 +27,11 @@ cmake . -DWITH_EXTRAS=1
 make  
 ```
 
-##Quick Start on Windows
+## Quick Start on Windows
 
 Building on Windows requires [Visual Studio 2015 Express for Desktop](https://www.visualstudio.com/en-us/downloads/download-visual-studio-vs.aspx) and [CMake for Windows](https://cmake.org/download/). First, generate a Visual Studio solution file for the 64-bit architecture using the CMake **GUI**. You have to specify both the platform and the version of Visual Studio. Then, the generated solution can be built using Visual Studio. **Attention**: this way of building on Windows is not well tested yet. We suspect that there might be some issues related to building truly 64-bit binaries.
 
-##Additional Building Details
+## Additional Building Details
 
 Here we cover a few details on choosing the compiler,
 a type of the release, and manually pointing to the location
@@ -90,4 +90,33 @@ manually as follows:
 
 ```
 export BOOST_ROOT=$HOME/boost_download_dir
-```
+```
+
+## Testing the Correctness of Implementations
+
+We have two main testing utilities ``bunit`` and ``test_integr`` (``experiment.exe`` and
+``test_integr.exe`` on Windows).
+Both utilities accept a single optional argument: the name of the log file.
+If the log file is not specified, a lot of informational messages are printed to the screen.
+
+The utility ``bunit`` verifies some basic functionality akin to unit testing.
+In particular, it checks that an optimized version of a distance, e.g., the Euclidean distance,
+returns results that are very similar to the results returned by an unoptimized and simpler version.
+The utility ``bunit`` is expected to always run without errors.
+
+The utility ``test_integr`` runs complete implementations of many methods
+and checks if several effectiveness and efficiency characteristics
+meet the expectations.
+The expectations are encoded as an array of instances of the class ``MethodTestCase``
+(see [the code here](similarity_search/test/test_integr.cc#L65)).
+For example, we expect that the recall falls in a certain pre-recorded range.
+
+Because almost all our methods are randomized, there is a great deal of variance
+in the observed performance characteristics. Thus, some tests
+may fail infrequently, e.g., if the actual recall value is slightly lower or higher
+than the expected minimum or maximum.
+From an error message, it should be clear if the discrepancy is substantial, i.e.,
+something went wrong, or not, i.e., we observe an unlikely outcome due to randomization.
+
+
+
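
The two kinds of checks described in the testing section can be illustrated with a small, self-contained Python sketch. It is not NMSLIB code: the function names `l2_simple`, `l2_unrolled`, `distances_agree`, and `recall_in_range` are hypothetical, and the loop-unrolled distance only stands in for a genuinely optimized implementation. The tolerances and the recall range are likewise assumptions chosen for illustration.

```python
import math
import random


def l2_simple(x, y):
    """Reference (unoptimized) Euclidean distance: a plain loop."""
    total = 0.0
    for a, b in zip(x, y):
        total += (a - b) ** 2
    return math.sqrt(total)


def l2_unrolled(x, y):
    """Stand-in for an optimized version: handles four components per step."""
    n = len(x)
    total = 0.0
    i = 0
    while i + 4 <= n:
        d0 = x[i] - y[i]
        d1 = x[i + 1] - y[i + 1]
        d2 = x[i + 2] - y[i + 2]
        d3 = x[i + 3] - y[i + 3]
        total += d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3
        i += 4
    while i < n:  # leftover components when the dimension is not a multiple of 4
        d = x[i] - y[i]
        total += d * d
        i += 1
    return math.sqrt(total)


def distances_agree(dim=33, trials=100, rel_tol=1e-9, seed=0):
    """Unit-test-style check: both versions return very similar results."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.random() for _ in range(dim)]
        y = [rng.random() for _ in range(dim)]
        if not math.isclose(l2_simple(x, y), l2_unrolled(x, y),
                            rel_tol=rel_tol, abs_tol=1e-12):
            return False
    return True


def recall_in_range(recall, expected_min, expected_max):
    """Integration-test-style check: recall falls in a pre-recorded range."""
    return expected_min <= recall <= expected_max
```

With randomized methods, a recall that misses the range by a small margin is more likely an unlucky run than a bug, which is why comparing the size of the discrepancy matters more than the pass/fail result alone.
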
diff --git a/manual/methods.md b/manual/methods.md
@@ -1,4 +1,4 @@
-#A Brief List of Methods and Parameters
+# A Brief List of Methods and Parameters
 
 ## Overview
 

diff --git a/manual/spaces.md b/manual/spaces.md
@@ -1,12 +1,12 @@
-#Spaces and Distances
+# Spaces and Distances
 
 Below, there is a list of nearly all spaces (a space is a combination of data and the distance). The mnemonic name of a space is passed to the Python bindings function as well as to the benchmarking utility ``experiment``. 
 When initializing the space in Python bindings, please use the type 
 `FLOAT` for all spaces, except `leven`: [see the description here.](https://nmslib.github.io/nmslib/api.html#nmslib-init)
 A more detailed description is given
 in the [manual](manual/latex/manual.pdf).
 
-##Specifying parameters of the space
+## Specifying parameters of the space
 
 In some rare cases, spaces have parameters, which are specified after the
 colon. 
@@ -17,7 +17,7 @@ For example, ``lp:p=3`` denotes the L<sub>3</sub> space and
 ``lp:p=2`` is a synonym for the Euclidean, i.e., L<sub>2</sub> space.
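
To make the colon syntax concrete, here is a minimal Python sketch that splits a space string such as ``lp:p=3`` into the space name and its parameters. The helper `parse_space_spec` is hypothetical and is not how NMSLIB parses these strings internally; in particular, separating several parameters with commas is an assumption made for this example.

```python
def parse_space_spec(spec):
    """Split 'name:key=value,...' into (name, {key: value}).

    Hypothetical helper for illustration only; values are kept as strings.
    """
    name, sep, rest = spec.partition(":")
    params = {}
    if sep and rest:
        # Assumption: multiple parameters are comma-separated.
        for item in rest.split(","):
            key, eq, value = item.partition("=")
            if not eq or not key.strip():
                raise ValueError("malformed parameter: %r" % item)
            params[key.strip()] = value.strip()
    return name.strip(), params


print(parse_space_spec("lp:p=3"))
print(parse_space_spec("l2"))
```

A space without parameters, such as ``l2``, simply yields an empty parameter dictionary.
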
 
 
-##Fast, Slow, and Approximate variants
+## Fast, Slow, and Approximate variants
 
 There can be more than one version of a distance function,
 which have different space-performance trade-offs.
@@ -42,7 +42,7 @@ and another is for right queries (the data object is the second argument and the
 In the latter case the name of the space ends in ``rq``.
 
 
-##Input Format
+## Input Format
 
 For Python bindings, all dense-vector spaces require float32 numpy-array input (two-dimensional). See an example [here](python_bindings/notebooks/search_vector_dense_optim.ipynb). 
 One exception is the squared Euclidean space for SIFT vectors, which requires input as uint8 integer numpy arrays. An example can be found [here](python_bindings/notebooks/search_sift_uint8.ipynb).
@@ -62,7 +62,7 @@ currently a limitation).
 You can pass a UTF8-encoded string, but the distance will be sometimes
 larger than the actual distance. 
 
-##Storage Format
+## Storage Format
 
 For dense vector spaces, the data can be either single-precision or double-precision floating-point numbers. 
 However, double-precision has not been useful so far and we do not recommend using it.
@@ -134,14 +134,14 @@ and for the [Itakura-Saito distance](https://en.wikipedia.org/wiki/Itakura%E2%80
 We also explicitly implement the squared JS-divergence,
 which is a true metric distance.
 
-For the meaning of infixes `fast`, `slow`, `approx`, and `rq` see the information above.
+For the meaning of infixes ``fast``, ``slow``, ``approx``, and ``rq`` see the information above.
 
 | Space code(s)                              | Description and Notes                           |
 |--------------------------------------------|-------------------------------------------------|
-| ``kldivfast``, ``kldivfastrq``             | Regular KL-divergence                           |
-| ``kldivgenslow``, ``kldivgenfast``, ``kldivgenfastrq`` | Generalized KL-divergence           | 
-| ``itakurasaitoslow``, ``itakurasaitofast``, ``itakurasaitofastrq`` |  Itakura-Saito distance |
-| ``jsdivslow``, ``jsdivfast``, `jsdivfastapprox` | JS-divergence                              |
+| `kldivfast`, `kldivfastrq`             | Regular KL-divergence                           |
+| `kldivgenslow`, `kldivgenfast`, `kldivgenfastrq` | Generalized KL-divergence           | 
+| `itakurasaitoslow`, `itakurasaitofast`, `itakurasaitofastrq` |  Itakura-Saito distance |
+| `jsdivslow`, `jsdivfast`, `jsdivfastapprox` | JS-divergence                              |
 | `jsmetrslow`, `jsmetrfast`, `jsmetrfastapprox`  | JS-metric                                  |
 | `renyidiv_slow`, `renyidiv_fast`                | Renyi divergence: parameter name `alpha`   |