A molecular similarity search benchmark.
Algorithms currently supported:
- [Balltree](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.8209)
- Bruteforce/Exhausive search
- [Chemfp 1.6.1](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8)
- [the standard modulo-OR-compression algorithm, or folding](https://pubs.acs.org/doi/10.1021/ci100132g)
- [Min-Hash](https://ekzhu.github.io/datasketch)
- [DivideSkip](https://pubs.acs.org/doi/10.1021/ci200552r)
- [Hnsw](https://arxiv.org/abs/1603.09320)
- [Onng](https://arxiv.org/abs/1810.07355)
- [Panng](https://link.springer.com/chapter/10.1007/978-3-319-46759-7_2)
- [Pynndescent](http://wwwconference.org/proceedings/www2011/proceedings/p577.pdf)
- [Risc](https://pubs.acs.org/doi/10.1021/acs.jcim.9b00069)
- [SW-graph](https://www.sciencedirect.com/science/article/pii/S0306437913001300)
- [VPtree](https://www.sciencedirect.com/science/article/abs/pii/002001909190074R)
Datasets prepared (To include your own data, please refer to instructions later):
[Chembl](https://www.ebi.ac.uk/chembl/), 1024-bits ECFP
[Molport](https://www.molport.com/shop/database-download), 1024-bits ECFP
Computational environments supported:
- A local PC
- Docker-based container
- Singularity-based container / High Performance Computing (HPC)
In principle, running this benchmark on the HPC only requires RDKit and the Python packages listed in "singularity-install/requirements.txt". However, different HPC might have slightly different environments. We will show two examples of executing our benchmark in the University of Connecticut (UConn) HPC cluster and Extreme Science and Engineering Discovery Environment (XSEDE).
- Github Repo: https://github.uconn.edu/mldrugdiscovery/MssBenchmark
- Example Datasets: https://drive.google.com/open?id=19mfbPoL1Ajvs0Ol2w50ILQQTZyXKcahC
- Pre-compiled Singularity Images: https://drive.google.com/open?id=1L9Bj5TxAfxf27J1PdCnQ2EPgbM59BDwb
- Download and put a dataset, e.g. chembl-1024-jaccard.hdf5, under "data" folder;
- Download and put a Singularity image file, e.g. "ann-bench-nmslib3.sif" under "singularity" folder.
-
pip install -r requirements.txt
-
Run your algorithm
Run.py Parameters:
dataset: dataset name (Required)
Examples:
- chembl-1024-jaccard
- molport-1024-jaccard
algorithm: algorithm name (Required)
Choices:
- Balltree(Sklearn)
- Bruteforce
- Chemfp
- Datasketch
- DivideSkip
- Folding
- Hnsw(Nmslib)
- Onng(Ngt)
- Panng(Ngt)
- Pynndescent
- Risc
- SW-graph(Nmslib)
- VPtree(Nmslib)
count: the value of K for top-K nearest neighbor search
Default: 10
runs: the number of times the query set will be executed
Default: 2
sif-dir: Singularity image files directory
Default: "./singularity"
batch: batch query mode
Default: False
rq: range query / threshold-based query mode
Default: False
radius: in the range query mode, the used cut-off value. Here the distance is used, so if all near neighbors with a similarity coefficient larger than 0.8, please set it 0.2.
Default: 0.3
force: re-run algorithms even if their results already exist
Default: False
time-out: Timeout (in seconds) for each individual algorithm run, or -1 if no timeout should be set
Default: -1
run-disabled: run algorithms that are disabled in algos.yml
Default: False
Command Examples (for Singularity only):
-
Run algorithm Hnsw on chembl-1024-jaccard dataset for top-K (K=100) nearest neighbor query
python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
-
Run algorithm Onng on molport-1024-jaccard dataset for top-K (K=10) nearest neighbor query
python run.py --dataset=molport-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=10 --sif-dir="./singularity"
-
Run algorithm SW-graph on chembl-1024-jaccard dataset for range query retrieving all neighbors with similarity coefficient larger than 0.8
python run.py --dataset=chembl-1024-jaccard --algorithm='SW-graph(Nmslib)' --sif-dir="./singularity" --rq --radius=0.2
-
Run algorithm Hnsw on chembl-1024-jaccard dataset for batch top-K (K=100) nearest neighbor search
python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity" --batch
Run plotting python: plot.py
Plot.py Parameters:
dataset: dataset name (Required)
Examples:
- chembl-1024-jaccard
- molport-1024-jaccard
count: the value of K for top-K nearest neighbor search
Default: 10
output/-o: the output file
x-axis/-x: which metric to use on the X-axis
Choices:
- k-nn: Recall for top-K nearest neighbor search (Default)
- range: Recall for range query
- qps: Queries per second (1/s)
- build: Indexing time (s)
- indexsize: Index size (kB)
y-axis/-y: which metric to use on the Y-axis
Choices:
- k-nn: Recall for top-K nearest neighbor search
- range: Recall for range query
- qps: Queries per second (1/s) (Default)
- build: Indexing time (s)
- indexsize: Index size (kB)
x-log/-X: Draw the X-axis using a logarithmic scale
Default: False
y-log/-Y: draw the Y-axis using a logarithmic scale
Default: False
raw: show raw results (not just Pareto frontier) in faded colours
Default: False
batch: batch query mode
Default: False
rq: range query / threshold-based query mode
Default: False
radius: in the range query mode, the used cut-off value. Here the distance is used, so if all near neighbors with a similarity coefficient larger than 0.8, please set it 0.2.
Default: 0.3
Command Examples:
-
Plot results on chembl-1024-jaccard dataset for top-K (K=100) nearest neighbor query to "results/chembl-1024-jaccard-100.png". X-axis: recall. Y-axis: qps, log-scale.
python plot.py --dataset=chembl-1024-jaccard -Y --count=100 -o=results/chembl-1024-jaccard-100
-
Plot results on molport-1024-jaccard dataset for top-K (K=10) nearest neighbor query to "results/molport-1024-jaccard-indexsize-10.png". X-axis: recall. Y-axis: index size, log-scale.
ython plot.py --dataset=molport-1024-jaccard -Y -y=indexsize --count=10 -o=results/molport-1024-jaccard-indexsize-10
-
Plot results on molport-1024-jaccard dataset for top-K (K=10) nearest neighbor query to "results/molport-1024-jaccard-buildtime-10.png". X-axis: recall. Y-axis: indexing time, log-scale.
python plot.py --dataset=molport-1024-jaccard -Y -y=build --count=10 -o=results/molport-1024-jaccard-buildtime-10
-
Plot batch mode results on molport-1024-jaccard dataset for top-K (K=100) nearest neighbor query to "results/molport-1024-jaccard-batch-100.png". X-axis: recall. Y-axis: qps, log-scale.
python plot.py --dataset=molport-1024-jaccard -Y --batch --count=100 -o=results/molport-1024-jaccard-batch-100
-
Plot results on chembl-1024-jaccard dataset for range query with similarity cutoff 0.6 to "results/chembl-1024-jaccard-0_4.png". X-axis: recall (range query). Y-axis: qps, log-scale.
python plot.py --dataset=chembl-1024-jaccard -Y -x=range --rq --radius=0.4 -o=results/chembl-1024-jaccard-0_4
-
Load anaconda module
module load anaconda/5.1.0
-
Create anaconda environment, and then install dependent libraries
conda create -c rdkit -n ann_env rdkit python=3.5.2
source activate ann_env
pip install -r singularity-install/requirements.txt
source deactivate ann_env
-
Run your algorithm scripts by SLURM shell
sbatch run.sh
An example "run.sh":
#!/bin/bash
#SBATCH --partition=HaswellPriority
#SBATCH --ntasks=1
#SBATCH --exclude=cn[65-69,71-136,325-343,345-353,355-358,360-364,369-398,400-401],gpu[07-10]
#SBATCH --exclusive
module load anaconda/5.1.0
source activate ann_env
module purge
module load gcc/5.4.0
module load singularity/3.1
python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
-
Please download and install miniconda [1] [2] in your HOME directory on Comet
[1] https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
-
Create anaconda environment, and then install dependent libraries
conda create -c rdkit -n ann_env rdkit python=3.7.7
source activate ann_env
pip install -r singularity-install/requirements.txt
source deactivate ann_env
-
Run your algorithm scripts by SLURM shell
sbatch run.sh
An example "run.sh":
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --no-requeue
#SBATCH --ntasks=1
#SBATCH --exclusive
#SBATCH -t 48:00:00
source activate ann_env
module purge
module load singularity/3.5
python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
Run your algorithm scripts by SLURM shell
sbatch plot.sh
An example "plot.sh":
#!/bin/bash
#SBATCH --partition=HaswellPriority
#SBATCH --ntasks=1
#SBATCH --exclude=cn[65-69,71-136,325-343,345-353,355-358,360-364,369-398,400-401],gpu[07-10]
module load anaconda/5.1.0
source activate ann_env
module purge
module load gcc/5.4.0
module load singularity/3.1
python plot.py --dataset=chembl-1024-jaccard -Y --count=100 -o=results/chembl-1024-jaccard-100
All algorithmic parameter settings are included in the "./algos.yaml" file.
An example Hnsw parameters: disable: false (Not disable this algorithm) singularity-tag: ann-bench-nmslib3 (the name of Singularity image) run-groups: M-20: ... M-12: ... (will run two groups of parameters: the first group M-20 has construction parameters "M": 20, "post": 0, "efConstruction": 800, and query parameters [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]. These correspond to the algorithmic parameters of Hnsw in "./ann_benchmark/algorithms/nmslib.py". Similarly for M-12.)
Hnsw(Nmslib):
disabled: false
docker-tag: ann-benchmarks-nmslib
singularity-tag: ann-bench-nmslib3
module: ann_benchmarks.algorithms.nmslib
constructor: NmslibReuseIndex
base-args: ["@metric", "hnsw"]
run-groups:
M-20:
arg-groups:
- {"M": 20, "post": 0, "efConstruction": 800}
- False
query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
M-12:
arg-groups:
- {"M": 12, "post": 0, "efConstruction": 800}
- False
query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
After adding a new set of parameters, M-32:
Hnsw(Nmslib):
disabled: false
docker-tag: ann-benchmarks-nmslib
singularity-tag: ann-bench-nmslib3
module: ann_benchmarks.algorithms.nmslib
constructor: NmslibReuseIndex
base-args: ["@metric", "hnsw"]
run-groups:
M-32:
arg-groups:
- {"M": 32, "post": 2, "efConstruction": 800}
- False
query-args: [[100, 300, 500, 700, 1000, 1500, 2000]]
M-20:
arg-groups:
- {"M": 20, "post": 0, "efConstruction": 800}
- False
query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
M-12:
arg-groups:
- {"M": 12, "post": 0, "efConstruction": 800}
- False
query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
An example Onng parameters: disable: false (Not disable this algorithm) singularity-tag: ann-bench-ngt (the name of Singularity image) run-groups: onng: args: ... query_args: ... (the "args" includes three sets of construction parameters. The first set [100, 300, 500] is for edge, the second set [10,30,50] is for outdegree, and the third set [10,30,50] is for indegree. These correspond to algorithmic parameters of Onng defined in "./ann_benchmark/algorithms/onng_ngt.py". A grid search is performed with 3^3=27 combinations. The "query_args" includes query parameters, epsilon.)
Onng(Ngt):
disabled: false
docker-tag: ann-benchmarks-ngt
singularity-tag: ann-bench-ngt
module: ann_benchmarks.algorithms.onng_ngt
constructor: ONNG
base-args: ["@metric", "Byte", 1.0]
run-groups:
onng:
args: [[100, 300, 500], [10, 30, 50], [10, 30, 50]]
query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]
After adding a new value for each of the three construction parameters:
Onng(Ngt):
disabled: false
docker-tag: ann-benchmarks-ngt
singularity-tag: ann-bench-ngt
module: ann_benchmarks.algorithms.onng_ngt
constructor: ONNG
base-args: ["@metric", "Byte", 1.0]
run-groups:
onng:
args: [[100, 300, 500, 1000], [10, 30, 50, 100], [10, 30, 50, 120]]
query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]
At the beginning of the file, there is "bit:\n jaccard:\n". It means that we use "bit" as the data type and "jaccard" as the distance metric.
Here is the process to add a custom dataset. We will use Chembl dataset and 2048-bits ECFP as example.
-
Put raw sdf file, e.g. chembl_24_1.sdf.gz, under "data" folder. Note only ".sdf.gz" files are accepted. Multiple sdf files are allowed.
-
Include the key-value pair below to the data strucutre DATASETS, defined at the bottom of "./ann_benchmark/datasets.py". If a new fingerprint rather than ECFP is used, please define a fingerprint calculation function similar to ecfp() in the same Python file.
'chembl-2048-jaccard': lambda out_fn: ecfp(out_fn, 'Chembl', 2048, 'jaccard', 'bit'),
-
Run a command with dataset being "chembl-2048-jaccard", and the dataset "chembl-2048-jaccard.hdf5" will be constructed under "/data" folder.
python run.py --dataset=chembl-2048-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
Note: to use an existing dataset, e.g. X, one needs to make sure the data structure DATASETS, defined at the bottom of "./ann_benchmark/datasets.py" contains a key-value pair with key X. Otherwise, one needs to include a key-value pair with key X and an arbitrary value, e.g., "'X': gist", to the DATASETS.
-
Omohundro, S. M. Five Balltree Construction Algorithms. Tech. report, UC Berkeley1989.
-
Uhlmann, J. Satisfying General Proximity/Similarity Queries with Metric Trees. Inf. Process. Lett.1991, 40, 175–179.
-
Dong, W.; Charikar, M.; Li, K. Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures. In Proceedings of WWW Conference; 2011; pp 577–586.
-
Malkov, Y.; Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. CoRR, abs/1603.093202016.
-
Malkov, Y.; Ponomarenko, A.; Logvinov, A.; Krylov, V. Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs. Inf. Syst.2014, 45, 61–68.
-
Nasr, R.; Vernica, R.; Li, C.; Baldi, P. Speeding up Chemical Searches Using the Inverted Index: The Convergence of Chemoinformatics and Text Search Methods. J. Chem. Inf. Model.2012, 52(4), 891–900.
-
Vachery, J.; Ranu, S. RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints. J. Chem. Inf. Model.2019.
-
Iwasaki, M. Pruned Bi-Directed k-Nearest Neighbor Graph for Proximity Search. In Proceedings of International Conference on Similarity Search and Applications; 2016; pp 20–33.
-
Iwasaki, M.; Miyazaki, D. Optimization of Indexing Based on K-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data. CoRR, abs/1810.073552018.
-
Datasketch: Big data looks small https://ekzhu.github.io/datasketch (accessed May 31, 2019).
-
Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res.2012,40, 1100–1107.