# Welcome to Molecular Similarity Search Benchmark (MssBenchmark)!

A benchmark of similarity search algorithms on molecular fingerprint datasets.

Algorithms currently supported:

- [Balltree](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.8209)
- Bruteforce/Exhaustive search
- [Min-Hash](https://ekzhu.github.io/datasketch)
- [DivideSkip](https://pubs.acs.org/doi/10.1021/ci200552r)
- [Hnsw](https://arxiv.org/abs/1603.09320)
- [Onng](https://arxiv.org/abs/1810.07355)
- [Panng](https://link.springer.com/chapter/10.1007/978-3-319-46759-7_2)
- [Pynndescent](http://wwwconference.org/proceedings/www2011/proceedings/p577.pdf)
- [Risc](https://pubs.acs.org/doi/10.1021/acs.jcim.9b00069)
- [SW-graph](https://www.sciencedirect.com/science/article/pii/S0306437913001300)
- [VPtree](https://www.sciencedirect.com/science/article/abs/pii/002001909190074R)

Datasets prepared (to include your own data, see the instructions later in this document):

- [Chembl](https://www.ebi.ac.uk/chembl/), 1024-bit ECFP
- [Molport](https://www.molport.com/shop/database-download), 1024-bit ECFP

Computational environments supported:

- A local PC
- Docker-based container
- Singularity-based container / High Performance Computing (HPC)

# Useful links

- Github Repo: https://github.uconn.edu/mldrugdiscovery/MssBenchmark
- Example Datasets: https://drive.google.com/open?id=19mfbPoL1Ajvs0Ol2w50ILQQTZyXKcahC
- Pre-compiled Singularity Images: https://drive.google.com/open?id=1L9Bj5TxAfxf27J1PdCnQ2EPgbM59BDwb

# Using prepared datasets and pre-compiled Singularity images

1. Download a dataset, e.g. "Chembl-1024-jaccard.hdf5", and put it under the "data" folder.
2. Download a Singularity image file, e.g. "ann-bench-nmslib3.sif", and put it under the "singularity" folder.

# Executions under a PC with Singularity

run.py parameters:

- dataset: dataset name (required). Examples:
  - chembl-1024-jaccard
  - molport-1024-jaccard
- algorithm: algorithm name (required). Options:
  - Balltree(Sklearn)
  - Bruteforce
  - Datasketch
  - DivideSkip
  - Hnsw(Nmslib)
  - Onng(Ngt)
  - Panng(Ngt)
  - Pynndescent
  - Risc
  - SW-graph(Nmslib)
  - VPtree(Nmslib)
- count: the value of K for top-K nearest neighbor search. Default: 10
- runs: the number of times the query set will be executed. Default: 2
- sif-dir: directory containing the Singularity image files. Default: "./singularity"
- batch: batch query mode. Default: False
- rq: range query / threshold-based query mode. Default: False
- radius: the cut-off value used in range query mode. The cut-off is a distance, not a similarity: to retrieve all neighbors with a similarity coefficient larger than 0.8, set it to 0.2. Default: 0.3
- force: re-run algorithms even if their results already exist. Default: False
- time-out: timeout (in seconds) for each individual algorithm run, or -1 if no timeout should be set. Default: -1
- run-disabled: run algorithms that are disabled in algos.yml. Default: False

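The relation between a similarity cut-off and the radius parameter can be sketched in a few lines. This is only an illustration, assuming the distance is the Jaccard distance (1 minus the Jaccard/Tanimoto similarity), which matches the 0.8 to 0.2 example above. The function names and the set-based fingerprint representation are hypothetical and not part of the benchmark code.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch: two empty fingerprints are identical
    return 1.0 - len(a & b) / len(union)


def radius_for_similarity(min_similarity: float) -> float:
    """Convert a similarity cut-off into the distance cut-off passed as --radius."""
    return 1.0 - min_similarity


# Hypothetical fingerprints: 4 shared on-bits out of 5 in the union.
query = {1, 5, 9, 42}
candidate = {1, 5, 9, 42, 77}

radius = radius_for_similarity(0.8)
distance = jaccard_distance(query, candidate)
print(round(radius, 3), round(distance, 3))
```

Here both values come out at roughly 0.2, so this candidate sits right at the boundary of a range query with a similarity cut-off of 0.8.
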
Command examples (for Singularity only):

- Run algorithm Hnsw on the chembl-1024-jaccard dataset for a top-K (K=100) nearest neighbor query

      python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"

- Run algorithm Onng on the molport-1024-jaccard dataset for a top-K (K=10) nearest neighbor query

      python run.py --dataset=molport-1024-jaccard --algorithm='Onng(Ngt)' --count=10 --sif-dir="./singularity"

- Run algorithm SW-graph on the chembl-1024-jaccard dataset for a **range query** retrieving all neighbors with a similarity coefficient larger than 0.8

      python run.py --dataset=chembl-1024-jaccard --algorithm='SW-graph(Nmslib)' --sif-dir="./singularity" --rq --radius=0.2

- Run algorithm Hnsw on the chembl-1024-jaccard dataset for a **batch** top-K (K=100) nearest neighbor search

      python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity" --batch

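When several algorithms are to be run on the same dataset, the command lines above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper build_command is hypothetical, it uses only the flags documented above, and it omits --count in range query mode simply because the range query example above does so.

```python
import shlex


def build_command(dataset, algorithm, count=10, sif_dir="./singularity",
                  batch=False, rq=False, radius=None):
    """Assemble a run.py argument list from the documented flags (hypothetical helper)."""
    cmd = ["python", "run.py", f"--dataset={dataset}",
           f"--algorithm={algorithm}", f"--sif-dir={sif_dir}"]
    if rq:
        # Range query mode: a distance cut-off replaces the top-K count.
        cmd.append("--rq")
        if radius is not None:
            cmd.append(f"--radius={radius}")
    else:
        cmd.append(f"--count={count}")
    if batch:
        cmd.append("--batch")
    return cmd


commands = [
    build_command("chembl-1024-jaccard", algo, count=100)
    for algo in ("Hnsw(Nmslib)", "Onng(Ngt)", "SW-graph(Nmslib)")
]
for cmd in commands:
    # Print a shell-quoted command line; to execute it instead, pass the
    # list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems with the parentheses in algorithm names such as Hnsw(Nmslib).
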
# Executions under an HPC environment

1. Load the anaconda module

       module load anaconda/5.1.0

2. Create an anaconda environment, activate it, and then install the dependent libraries

       conda create -c rdkit -n ann_env rdkit python=3.5.2
       source activate ann_env
       pip install -r singularity-install/requirements.txt

3. Submit your run script to SLURM

       sbatch run.sh

An example "run.sh":

    #!/bin/bash
    #SBATCH --ntasks=1

    module load anaconda/5.1.0
    source activate ann_env
    module purge
    module load gcc/5.4.0
    module load singularity/3.1

    python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"

# Parameter tuning

All algorithmic parameter settings are included in the "./algos.yaml" file.

Example Hnsw parameters:

- disabled: false (the algorithm is enabled)
- singularity-tag: ann-bench-nmslib3 (the name of the Singularity image)
- run-groups: M-20: ... M-12: ... (two groups of parameters will be run. The first group, M-20, has construction parameters "M": 20, "post": 0, "efConstruction": 800, and query parameters [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]. These correspond to the algorithmic parameters of Hnsw in "./ann_benchmarks/algorithms/nmslib.py". M-12 is defined in the same way.)

The corresponding entry in the file:

    Hnsw(Nmslib):
      disabled: false
      docker-tag: ann-benchmarks-nmslib
      singularity-tag: ann-bench-nmslib3
      module: ann_benchmarks.algorithms.nmslib
      constructor: NmslibReuseIndex
      base-args: ["@metric", "hnsw"]
      run-groups:
        M-20:
          arg-groups:
            - {"M": 20, "post": 0, "efConstruction": 800}
            - False
          query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
        M-12:
          arg-groups:
            - {"M": 12, "post": 0, "efConstruction": 800}
            - False
          query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]

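To get a feel for how many configurations the two run-groups above produce, they can be mirrored as a Python dictionary and enumerated. This is a rough sketch under one assumption: each value in the query-args list is evaluated in turn against an index built from the group's construction parameters. The function planned_runs is hypothetical and does not reproduce the benchmark's own expansion logic.

```python
# The two Hnsw run-groups from the configuration above, as a plain dict.
run_groups = {
    "M-20": {
        "arg-groups": [{"M": 20, "post": 0, "efConstruction": 800}, False],
        "query-args": [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]],
    },
    "M-12": {
        "arg-groups": [{"M": 12, "post": 0, "efConstruction": 800}, False],
        "query-args": [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]],
    },
}


def planned_runs(groups):
    """Yield (group name, construction parameters, query parameter) triples."""
    for name, group in groups.items():
        construction = group["arg-groups"][0]
        # Assumption: every query value is tried against the same built index.
        for value in group["query-args"][0]:
            yield name, construction, value


runs = list(planned_runs(run_groups))
print(len(runs))  # 10 query values for M-20 plus 11 for M-12 = 21
```
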
After adding a new set of parameters, M-32:

    Hnsw(Nmslib):
      disabled: false
      docker-tag: ann-benchmarks-nmslib
      singularity-tag: ann-bench-nmslib3
      module: ann_benchmarks.algorithms.nmslib
      constructor: NmslibReuseIndex
      base-args: ["@metric", "hnsw"]
      run-groups:
        M-32:
          arg-groups:
            - {"M": 32, "post": 2, "efConstruction": 800}
            - False
          query-args: [[100, 300, 500, 700, 1000, 1500, 2000]]
        M-20:
          arg-groups:
            - {"M": 20, "post": 0, "efConstruction": 800}
            - False
          query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
        M-12:
          arg-groups:
            - {"M": 12, "post": 0, "efConstruction": 800}
            - False
          query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]

Example Onng parameters:

- disabled: false (the algorithm is enabled)
- singularity-tag: ann-bench-ngt (the name of the Singularity image)
- run-groups: onng: args: ... query-args: ... (the "args" entry includes three sets of construction parameters. The first set [100, 300, 500] is for edge, the second set [10, 30, 50] is for outdegree, and the third set [10, 30, 50] is for indegree. These correspond to the algorithmic parameters of Onng defined in "./ann_benchmarks/algorithms/onng_ngt.py". A grid search is performed over all 3^3 = 27 combinations. The "query-args" entry lists the values of the query parameter, epsilon.)

The corresponding entry in the file:

    Onng(Ngt):
      disabled: false
      docker-tag: ann-benchmarks-ngt
      singularity-tag: ann-bench-ngt
      module: ann_benchmarks.algorithms.onng_ngt
      constructor: ONNG
      base-args: ["@metric", "Byte", 1.0]
      run-groups:
        onng:
          args: [[100, 300, 500], [10, 30, 50], [10, 30, 50]]
          query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]

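The grid search over the Onng construction parameters amounts to a Cartesian product of the three value lists. The short sketch below only illustrates that count with itertools.product; it is not the benchmark's own code.

```python
from itertools import product

# Construction parameter sets from the Onng entry: edge, outdegree, indegree.
args = [[100, 300, 500], [10, 30, 50], [10, 30, 50]]

# Every (edge, outdegree, indegree) combination of the grid.
grid = list(product(*args))

print(len(grid))  # 27
print(grid[0])    # (100, 10, 10)
print(grid[-1])   # (500, 50, 50)
```

The grid grows quickly: adding a fourth value to each of the three lists would give 4^3 = 64 combinations.
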
After adding a new value for each of the three construction parameters:

    Onng(Ngt):
      disabled: false
      docker-tag: ann-benchmarks-ngt
      singularity-tag: ann-bench-ngt
      module: ann_benchmarks.algorithms.onng_ngt
      constructor: ONNG
      base-args: ["@metric", "Byte", 1.0]
      run-groups:
        onng:
          args: [[100, 300, 500, 1000], [10, 30, 50, 100], [10, 30, 50, 120]]
          query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]

The file begins with the key "bit:" and, nested under it, the key "jaccard:". These specify that "bit" is the data type and "jaccard" is the distance metric.

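The nesting can be pictured as a dictionary keyed first by data type and then by distance metric. The sketch below assumes that the algorithm entries sit under those two keys; the dictionary is a hand-written stand-in with only two fields per algorithm, and enabled_algorithms is a hypothetical helper, not a function from the repository.

```python
# Hand-written stand-in for the parsed configuration file (assumed layout).
config = {
    "bit": {
        "jaccard": {
            "Hnsw(Nmslib)": {"disabled": False, "singularity-tag": "ann-bench-nmslib3"},
            "Onng(Ngt)": {"disabled": False, "singularity-tag": "ann-bench-ngt"},
        }
    }
}


def enabled_algorithms(cfg, point_type, metric):
    """Return the names of algorithms not marked as disabled for a type/metric pair."""
    definitions = cfg.get(point_type, {}).get(metric, {})
    return sorted(
        name for name, entry in definitions.items()
        if not entry.get("disabled", False)
    )


print(enabled_algorithms(config, "bit", "jaccard"))  # ['Hnsw(Nmslib)', 'Onng(Ngt)']
```
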
# Prepare a custom dataset including fingerprint computation

Here is the process to add a custom dataset. We will use the Chembl dataset and 2048-bit ECFP as an example.

1. Put the raw sdf file, e.g. "chembl_24_1.sdf.gz", under the "data" folder. Only ".sdf.gz" files are accepted. Multiple sdf files are allowed.
2. Add the key-value pair below to DATASETS, defined at the bottom of "./ann_benchmarks/datasets.py". If a fingerprint other than ECFP is used, define a fingerprint calculation function similar to ecfp() in the same Python file.

       'chembl-2048-jaccard': lambda out_fn: ecfp(out_fn, 'Chembl', 2048, 'jaccard', 'bit'),

3. Run a command with the dataset set to "chembl-2048-jaccard"; the dataset file "chembl-2048-jaccard.hdf5" will be constructed under the "data" folder.

       python run.py --dataset=chembl-2048-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"

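The DATASETS entry is a dataset name mapped to a function that takes the output file name. The sketch below shows how such a registry could be looked up and called. The ecfp() here is a stand-in that only records its arguments, its parameter names are guesses, and the real function in the repository computes fingerprints and writes the HDF5 file.

```python
calls = []


def ecfp(out_fn, dataset_name, n_bits, distance, point_type):
    """Stand-in for the repository's ecfp(); it only records its arguments."""
    calls.append((out_fn, dataset_name, n_bits, distance, point_type))


# Registry of dataset builders, keyed by the name passed as --dataset.
DATASETS = {
    "chembl-2048-jaccard": lambda out_fn: ecfp(out_fn, "Chembl", 2048, "jaccard", "bit"),
}

name = "chembl-2048-jaccard"
# Look up the builder by name and hand it the target file path.
DATASETS[name](f"data/{name}.hdf5")
print(calls[-1])
```

A new fingerprint type would follow the same pattern: define a function with a compatible signature and register another lambda under a new dataset name.
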
# References

- Omohundro, S. M. Five Balltree Construction Algorithms. _Tech. report, UC Berkeley_ **1989**.
- Uhlmann, J. Satisfying General Proximity/Similarity Queries with Metric Trees. _Inf. Process. Lett._ **1991**, _40_, 175–179.
- Dong, W.; Charikar, M.; Li, K. Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures. In _Proceedings of WWW Conference_; 2011; pp 577–586.
- Malkov, Y.; Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. _CoRR, abs/1603.09320_ **2016**.
- Malkov, Y.; Ponomarenko, A.; Logvinov, A.; Krylov, V. Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs. _Inf. Syst._ **2014**, _45_, 61–68.
- Nasr, R.; Vernica, R.; Li, C.; Baldi, P. Speeding up Chemical Searches Using the Inverted Index: The Convergence of Chemoinformatics and Text Search Methods. _J. Chem. Inf. Model._ **2012**, _52_ (4), 891–900.
- Vachery, J.; Ranu, S. RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints. _J. Chem. Inf. Model._ **2019**.
- Iwasaki, M. Pruned Bi-Directed k-Nearest Neighbor Graph for Proximity Search. In _Proceedings of International Conference on Similarity Search and Applications_; 2016; pp 20–33.
- Iwasaki, M.; Miyazaki, D. Optimization of Indexing Based on K-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data. _CoRR, abs/1810.07355_ **2018**.
- Datasketch: Big data looks small. https://ekzhu.github.io/datasketch (accessed May 31, 2019).
- Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. _Nucleic Acids Res._ **2012**, _40_, 1100–1107.