NMSLIB Manual

Overview

Documentation is split into several parts. Links to these parts are given below. This is followed by the list of some limitations and terminological introduction. Terminology might be crucial to understand certain parts (but mostly these are well-known terms). Also note that the PDF part of the manual contains a detailed description of implemented search spaces and methods.

Documentation Links

Python bindings
A brief list of methods and parameters (including basic tuning guidelines)
A brief list of supported spaces/distance
Some data used in the past
Building the main library (Linux/Mac)
Building and using the query server (Linux/Mac)
Benchmarking using NMSLIB utility experiment
Extending NMSLIB (adding spaces, methods, and apps)
A detailed and formal description of spaces and methods with examples of benchmarking the methods (PDF)

Some Limitations

Only static data sets are supported (with an exception of SW-graph)
HNSW currently duplicates memory to create optimized indices
Range/threshold search is not supported by many methods including SW-graph/HNSW

Terminology and Problem Formulation

NMSLIB provides a fast similarity search. The search is carried out in a finite database of objects {o_i} using a search query q and a dissimilarity measure. An object is a synonym for a data point or simply a point. The dissimilarity measure is typically represented by a distance function d(o_i, q). The ultimate goal is to answer a query by retrieving a subset of database objects sufficiently similar to the query q. A combination of data points and the distance function is called a search space, or simply a space.

Note that we use the terms distance and the distance function in a broader sense than most other folks: We do not assume that the distance is a true metric distance. The distance function can disobey the triangle inequality and/or be even non-symmetric.

Two retrieval tasks are typically considered: a nearest-neighbor and a range search. Currently, we mostly support only the nearest-neighbor search, which aims to find the object at the smallest distance from the query. Its direct generalization is the k nearest-neighbor search (the k-NN search), which looks for the k closest objects, which have k smallest distance values to the query q.

In generic spaces, the distance is not necessarily symmetric. Thus, two types of queries can be considered. In a left query, the object is the left argument of the distance function, while the query is the right argument. In a right query, the query q is the first argument and the object is the second, i.e., the right, argument.

The queries can be answered either exactly, i.e., by returning a complete result set that does not contain erroneous elements, or, approximately, e.g., by finding only some neighbors. Thus, the methods are evaluated in terms of efficiency-effectiveness trade-offs rather than merely in terms of their efficiency. One common effectiveness metric is recall, which is computed as an average fraction of true neighbors returned by the method (with ties broken arbitrarily).