Skip to content
master
Go to file
Code

README.md

Welcome to Molecular Similarity Search Benchmark (MssBenchmark)!

A molecular similarity search benchmark.

Algorithms currently supported:

- [Balltree](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.8209)
- Bruteforce/Exhausive search
- [Chemfp 1.6.1](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8)
- [the standard modulo-OR-compression algorithm, or folding](https://pubs.acs.org/doi/10.1021/ci100132g)
- [Min-Hash](https://ekzhu.github.io/datasketch)
- [DivideSkip](https://pubs.acs.org/doi/10.1021/ci200552r)
- [Hnsw](https://arxiv.org/abs/1603.09320)
- [Onng](https://arxiv.org/abs/1810.07355)
- [Panng](https://link.springer.com/chapter/10.1007/978-3-319-46759-7_2)
- [Pynndescent](http://wwwconference.org/proceedings/www2011/proceedings/p577.pdf)
- [Risc](https://pubs.acs.org/doi/10.1021/acs.jcim.9b00069)
- [SW-graph](https://www.sciencedirect.com/science/article/pii/S0306437913001300)
- [VPtree](https://www.sciencedirect.com/science/article/abs/pii/002001909190074R)

Datasets prepared (To include your own data, please refer to instructions later):

[Chembl](https://www.ebi.ac.uk/chembl/), 1024-bits ECFP
[Molport](https://www.molport.com/shop/database-download), 1024-bits ECFP

Computational environments supported:

- A local PC
- Docker-based container
- Singularity-based container / High Performance Computing (HPC)

In principle, running this benchmark on the HPC only requires RDKit and the Python packages listed in "singularity-install/requirements.txt". However, different HPC might have slightly different environments. We will show two examples of executing our benchmark in the University of Connecticut (UConn) HPC cluster and Extreme Science and Engineering Discovery Environment (XSEDE).

Useful links

Using prepared datasets and pre-compiled Singularity images

  1. Download and put a dataset, e.g. chembl-1024-jaccard.hdf5, under "data" folder;
  2. Download and put a Singularity image file, e.g. "ann-bench-nmslib3.sif" under "singularity" folder.

Executions under a PC through Singularity

  1. pip install -r requirements.txt

  2. Run your algorithm

Run.py Parameters:

dataset: dataset name (Required)
    Examples:
    - chembl-1024-jaccard
    - molport-1024-jaccard
algorithm: algorithm name (Required)
    Choices:
    - Balltree(Sklearn)
    - Bruteforce
    - Chemfp
    - Datasketch
    - DivideSkip
	- Folding
    - Hnsw(Nmslib)
    - Onng(Ngt)
    - Panng(Ngt)
    - Pynndescent
    - Risc
    - SW-graph(Nmslib)
    - VPtree(Nmslib)
count: the value of K for top-K nearest neighbor search
    Default: 10
runs: the number of times the query set will be executed 
    Default: 2
sif-dir: Singularity image files directory
	Default: "./singularity"
batch: batch query mode
	Default: False
rq: range query / threshold-based query mode
	Default: False
radius: in the range query mode, the used cut-off value. Here the distance is used, so if all near neighbors with a similarity coefficient larger than 0.8, please set it 0.2.
	Default: 0.3
force: re-run algorithms even if their results already exist
	Default: False
time-out: Timeout (in seconds) for each individual algorithm run, or -1 if no timeout should be set
	Default: -1
run-disabled: run algorithms that are disabled in algos.yml
	Default: False

Command Examples (for Singularity only):

  • Run algorithm Hnsw on chembl-1024-jaccard dataset for top-K (K=100) nearest neighbor query

    python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"

  • Run algorithm Onng on molport-1024-jaccard dataset for top-K (K=10) nearest neighbor query

    python run.py --dataset=molport-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=10 --sif-dir="./singularity"

  • Run algorithm SW-graph on chembl-1024-jaccard dataset for range query retrieving all neighbors with similarity coefficient larger than 0.8

    python run.py --dataset=chembl-1024-jaccard --algorithm='SW-graph(Nmslib)' --sif-dir="./singularity" --rq --radius=0.2

  • Run algorithm Hnsw on chembl-1024-jaccard dataset for batch top-K (K=100) nearest neighbor search

    python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity" --batch

Visualization of Execution Results under a PC

Run plotting python: plot.py

Plot.py Parameters:

dataset: dataset name (Required)
    Examples:
    - chembl-1024-jaccard
    - molport-1024-jaccard
count: the value of K for top-K nearest neighbor search
    Default: 10
output/-o: the output file
x-axis/-x: which metric to use on the X-axis
    Choices:
    - k-nn: Recall for top-K nearest neighbor search (Default)
    - range: Recall for range query
    - qps: Queries per second (1/s)
    - build: Indexing time (s)
    - indexsize: Index size (kB)
y-axis/-y: which metric to use on the Y-axis
    Choices:
    - k-nn: Recall for top-K nearest neighbor search
    - range: Recall for range query
    - qps: Queries per second (1/s) (Default)
    - build: Indexing time (s)
    - indexsize: Index size (kB)
x-log/-X: Draw the X-axis using a logarithmic scale
	Default: False
y-log/-Y: draw the Y-axis using a logarithmic scale
	Default: False
raw: show raw results (not just Pareto frontier) in faded colours
	Default: False
batch: batch query mode
	Default: False
rq: range query / threshold-based query mode
	Default: False
radius: in the range query mode, the used cut-off value. Here the distance is used, so if all near neighbors with a similarity coefficient larger than 0.8, please set it 0.2.
	Default: 0.3

Command Examples:

  • Plot results on chembl-1024-jaccard dataset for top-K (K=100) nearest neighbor query to "results/chembl-1024-jaccard-100.png". X-axis: recall. Y-axis: qps, log-scale.

    python plot.py --dataset=chembl-1024-jaccard -Y --count=100 -o=results/chembl-1024-jaccard-100

  • Plot results on molport-1024-jaccard dataset for top-K (K=10) nearest neighbor query to "results/molport-1024-jaccard-indexsize-10.png". X-axis: recall. Y-axis: index size, log-scale.

    ython plot.py --dataset=molport-1024-jaccard -Y -y=indexsize --count=10 -o=results/molport-1024-jaccard-indexsize-10

  • Plot results on molport-1024-jaccard dataset for top-K (K=10) nearest neighbor query to "results/molport-1024-jaccard-buildtime-10.png". X-axis: recall. Y-axis: indexing time, log-scale.

    python plot.py --dataset=molport-1024-jaccard -Y -y=build --count=10 -o=results/molport-1024-jaccard-buildtime-10

  • Plot batch mode results on molport-1024-jaccard dataset for top-K (K=100) nearest neighbor query to "results/molport-1024-jaccard-batch-100.png". X-axis: recall. Y-axis: qps, log-scale.

    python plot.py --dataset=molport-1024-jaccard -Y --batch --count=100 -o=results/molport-1024-jaccard-batch-100

  • Plot results on chembl-1024-jaccard dataset for range query with similarity cutoff 0.6 to "results/chembl-1024-jaccard-0_4.png". X-axis: recall (range query). Y-axis: qps, log-scale.

    python plot.py --dataset=chembl-1024-jaccard -Y -x=range --rq --radius=0.4 -o=results/chembl-1024-jaccard-0_4

Executions under an HPC environment (Example: UConn HPC)

  1. Load anaconda module

    module load anaconda/5.1.0

  2. Create anaconda environment, and then install dependent libraries

    conda create -c rdkit -n ann_env rdkit python=3.5.2

    source activate ann_env

    pip install -r singularity-install/requirements.txt

    source deactivate ann_env

  3. Run your algorithm scripts by SLURM shell

    sbatch run.sh

An example "run.sh":

#!/bin/bash

#SBATCH --partition=HaswellPriority
#SBATCH --ntasks=1
#SBATCH --exclude=cn[65-69,71-136,325-343,345-353,355-358,360-364,369-398,400-401],gpu[07-10]
#SBATCH --exclusive
  

module load anaconda/5.1.0

source activate ann_env

module purge

module load gcc/5.4.0

module load singularity/3.1

  

python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"  

Executions under an HPC environment (Example: XSEDE Comet HPC)

  1. Please download and install miniconda [1] [2] in your HOME directory on Comet

    [1] https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

    [2] https://docs.conda.io/en/latest/miniconda.html

  2. Create anaconda environment, and then install dependent libraries

    conda create -c rdkit -n ann_env rdkit python=3.7.7

    source activate ann_env

    pip install -r singularity-install/requirements.txt

    source deactivate ann_env

  3. Run your algorithm scripts by SLURM shell

    sbatch run.sh

An example "run.sh":

#!/bin/bash

#SBATCH --partition=compute
#SBATCH --no-requeue
#SBATCH --ntasks=1
#SBATCH --exclusive
#SBATCH -t 48:00:00
  

source activate ann_env

module purge

module load singularity/3.5

  

python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"  

Visualization of Execution Results under an HPC environment

Run your algorithm scripts by SLURM shell

sbatch plot.sh

An example "plot.sh":

#!/bin/bash

#SBATCH --partition=HaswellPriority
#SBATCH --ntasks=1
#SBATCH --exclude=cn[65-69,71-136,325-343,345-353,355-358,360-364,369-398,400-401],gpu[07-10]
  

module load anaconda/5.1.0

source activate ann_env

module purge

module load gcc/5.4.0

module load singularity/3.1

  

python plot.py --dataset=chembl-1024-jaccard -Y --count=100 -o=results/chembl-1024-jaccard-100

Parameter tuning

All algorithmic parameter settings are included in the "./algos.yaml" file.

An example Hnsw parameters: disable: false (Not disable this algorithm) singularity-tag: ann-bench-nmslib3 (the name of Singularity image) run-groups: M-20: ... M-12: ... (will run two groups of parameters: the first group M-20 has construction parameters "M": 20, "post": 0, "efConstruction": 800, and query parameters [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]. These correspond to the algorithmic parameters of Hnsw in "./ann_benchmark/algorithms/nmslib.py". Similarly for M-12.)

Hnsw(Nmslib):

disabled:  false

docker-tag: ann-benchmarks-nmslib

singularity-tag: ann-bench-nmslib3

module: ann_benchmarks.algorithms.nmslib

constructor: NmslibReuseIndex

base-args:  ["@metric", "hnsw"]

run-groups:

M-20:

arg-groups:

- {"M":  20, "post":  0, "efConstruction":  800}

- False

query-args:  [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]

M-12:

arg-groups:

- {"M":  12, "post":  0, "efConstruction":  800}

- False

query-args:  [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]

After adding a new set of parameters, M-32:

Hnsw(Nmslib):

disabled:  false

docker-tag: ann-benchmarks-nmslib

singularity-tag: ann-bench-nmslib3

module: ann_benchmarks.algorithms.nmslib

constructor: NmslibReuseIndex

base-args:  ["@metric", "hnsw"]

run-groups:

M-32:

arg-groups:

- {"M":  32, "post":  2, "efConstruction":  800}

- False

query-args:  [[100, 300, 500, 700, 1000, 1500, 2000]]

M-20:

arg-groups:

- {"M":  20, "post":  0, "efConstruction":  800}

- False

query-args:  [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]

M-12:

arg-groups:

- {"M":  12, "post":  0, "efConstruction":  800}

- False

query-args:  [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]

An example Onng parameters: disable: false (Not disable this algorithm) singularity-tag: ann-bench-ngt (the name of Singularity image) run-groups: onng: args: ... query_args: ... (the "args" includes three sets of construction parameters. The first set [100, 300, 500] is for edge, the second set [10,30,50] is for outdegree, and the third set [10,30,50] is for indegree. These correspond to algorithmic parameters of Onng defined in "./ann_benchmark/algorithms/onng_ngt.py". A grid search is performed with 3^3=27 combinations. The "query_args" includes query parameters, epsilon.)

Onng(Ngt):

disabled:  false

docker-tag: ann-benchmarks-ngt

singularity-tag: ann-bench-ngt

module: ann_benchmarks.algorithms.onng_ngt

constructor: ONNG

base-args:  ["@metric", "Byte", 1.0]

run-groups:

onng:

args:  [[100, 300, 500], [10, 30, 50], [10, 30, 50]]

query-args:  [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]

After adding a new value for each of the three construction parameters:

Onng(Ngt):

disabled:  false

docker-tag: ann-benchmarks-ngt

singularity-tag: ann-bench-ngt

module: ann_benchmarks.algorithms.onng_ngt

constructor: ONNG

base-args:  ["@metric", "Byte", 1.0]

run-groups:

onng:

args:  [[100, 300, 500, 1000], [10, 30, 50, 100], [10, 30, 50, 120]]

query-args:  [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]

At the beginning of the file, there is "bit:\n jaccard:\n". It means that we use "bit" as the data type and "jaccard" as the distance metric.

Prepare a custom dataset including fingerprint computation

Here is the process to add a custom dataset. We will use Chembl dataset and 2048-bits ECFP as example.

  1. Put raw sdf file, e.g. chembl_24_1.sdf.gz, under "data" folder. Note only ".sdf.gz" files are accepted. Multiple sdf files are allowed.

  2. Include the key-value pair below to the data strucutre DATASETS, defined at the bottom of "./ann_benchmark/datasets.py". If a new fingerprint rather than ECFP is used, please define a fingerprint calculation function similar to ecfp() in the same Python file.

    'chembl-2048-jaccard': lambda out_fn: ecfp(out_fn, 'Chembl', 2048, 'jaccard', 'bit'),

  3. Run a command with dataset being "chembl-2048-jaccard", and the dataset "chembl-2048-jaccard.hdf5" will be constructed under "/data" folder.

    python run.py --dataset=chembl-2048-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"

Note: to use an existing dataset, e.g. X, one needs to make sure the data structure DATASETS, defined at the bottom of "./ann_benchmark/datasets.py" contains a key-value pair with key X. Otherwise, one needs to include a key-value pair with key X and an arbitrary value, e.g., "'X': gist", to the DATASETS.

References

  • Omohundro, S. M. Five Balltree Construction Algorithms. Tech. report, UC Berkeley1989.

  • Uhlmann, J. Satisfying General Proximity/Similarity Queries with Metric Trees. Inf. Process. Lett.1991, 40, 175–179.

  • Dong, W.; Charikar, M.; Li, K. Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures. In Proceedings of WWW Conference; 2011; pp 577–586.

  • Malkov, Y.; Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. CoRR, abs/1603.093202016.

  • Malkov, Y.; Ponomarenko, A.; Logvinov, A.; Krylov, V. Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs. Inf. Syst.2014, 45, 61–68.

  • Nasr, R.; Vernica, R.; Li, C.; Baldi, P. Speeding up Chemical Searches Using the Inverted Index: The Convergence of Chemoinformatics and Text Search Methods. J. Chem. Inf. Model.2012, 52(4), 891–900.

  • Vachery, J.; Ranu, S. RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints. J. Chem. Inf. Model.2019.

  • Iwasaki, M. Pruned Bi-Directed k-Nearest Neighbor Graph for Proximity Search. In Proceedings of International Conference on Similarity Search and Applications; 2016; pp 20–33.

  • Iwasaki, M.; Miyazaki, D. Optimization of Indexing Based on K-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data. CoRR, abs/1810.073552018.

  • Datasketch: Big data looks small https://ekzhu.github.io/datasketch (accessed May 31, 2019).

  • Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res.2012,40, 1100–1107.

You can’t perform that action at this time.