diff --git a/README.md b/README.md
new file mode 100644
index 0000000..dc59844
--- /dev/null
+++ b/README.md
@@ -0,0 +1,318 @@
+# Welcome to Molecular Similarity Search Benchmark (MssBenchmark)!
+
+A molecular similarity search benchmark.
+
+Algorithms currently supported:
+
+ - [Balltree](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.8209)
+ - Bruteforce/Exhaustive search
+ - [Min-Hash](https://ekzhu.github.io/datasketch)
+ - [DivideSkip](https://pubs.acs.org/doi/10.1021/ci200552r)
+ - [Hnsw](https://arxiv.org/abs/1603.09320)
+ - [Onng](https://arxiv.org/abs/1810.07355)
+ - [Panng](https://link.springer.com/chapter/10.1007/978-3-319-46759-7_2)
+ - [Pynndescent](http://wwwconference.org/proceedings/www2011/proceedings/p577.pdf)
+ - [Risc](https://pubs.acs.org/doi/10.1021/acs.jcim.9b00069)
+ - [SW-graph](https://www.sciencedirect.com/science/article/pii/S0306437913001300)
+ - [VPtree](https://www.sciencedirect.com/science/article/abs/pii/002001909190074R)
+
+
+Datasets prepared (to include your own data, see the instructions later in this document):
+
+ - [Chembl](https://www.ebi.ac.uk/chembl/), 1024-bit ECFP
+ - [Molport](https://www.molport.com/shop/database-download), 1024-bit ECFP
+
+
+Computational environments supported:
+
+ - A local PC
+ - Docker-based container
+ - Singularity-based container / High Performance Computing (HPC)
+
+# Useful links
+
+- Github Repo: https://github.uconn.edu/mldrugdiscovery/MssBenchmark
+- Example Datasets: https://drive.google.com/open?id=19mfbPoL1Ajvs0Ol2w50ILQQTZyXKcahC
+- Pre-compiled Singularity Images: https://drive.google.com/open?id=1L9Bj5TxAfxf27J1PdCnQ2EPgbM59BDwb
+
+# Using prepared datasets and pre-compiled Singularity images
+
+1. Download a dataset, e.g. Chembl-1024-jaccard.hdf5, and put it under the "data" folder;
+2. Download a Singularity image file, e.g. "ann-bench-nmslib3.sif", and put it under the "singularity" folder.
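As a quick sanity check for the two download steps, a minimal sketch along the following lines could confirm that both files are where the benchmark expects them. The helper name `missing_inputs` is hypothetical and not part of MssBenchmark; the folder names "data" and "singularity" and the example file names are taken from the steps above.

```python
from pathlib import Path


def missing_inputs(root, dataset_file, image_file):
    """Return the expected input files that are not present under root.

    Hypothetical helper: it only checks the folder layout described in
    the README ("data" for datasets, "singularity" for image files).
    """
    root = Path(root)
    expected = [
        root / "data" / dataset_file,
        root / "singularity" / image_file,
    ]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # File names follow the examples given in the README steps.
    for path in missing_inputs(".", "Chembl-1024-jaccard.hdf5", "ann-bench-nmslib3.sif"):
        print(f"missing: {path}")
```

Running it from the repository root prints one line per missing file and nothing when both files are in place.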
+
+# Executions under a PC with Singularity
+Run.py Parameters:
+
+    dataset: dataset name (Required)
+        Examples:
+            - chembl-1024-jaccard
+            - molport-1024-jaccard
+    algorithm: algorithm name (Required)
+        Options:
+            - Balltree(Sklearn)
+            - Bruteforce
+            - Datasketch
+            - DivideSkip
+            - Hnsw(Nmslib)
+            - Onng(Ngt)
+            - Panng(Ngt)
+            - Pynndescent
+            - Risc
+            - SW-graph(Nmslib)
+            - VPtree(Nmslib)
+    count: the value of K for top-K nearest neighbor search
+        Default: 10
+    runs: the number of times the query set will be executed
+        Default: 2
+    sif-dir: Singularity image files directory
+        Default: "./singularity"
+    batch: batch query mode
+        Default: False
+    rq: range query / threshold-based query mode
+        Default: False
+    radius: the cut-off value used in range query mode. The cut-off is a distance (1 - similarity), so to retrieve all neighbors with a similarity coefficient larger than 0.8, set it to 0.2.
+        Default: 0.3
+    force: re-run algorithms even if their results already exist
+        Default: False
+    time-out: timeout (in seconds) for each individual algorithm run, or -1 if no timeout should be set
+        Default: -1
+    run-disabled: run algorithms that are disabled in algos.yaml
+        Default: False
+
+Command Examples (for Singularity only):
+- Run algorithm Hnsw on chembl-1024-jaccard dataset for top-K (K=100) nearest neighbor query
+
+        python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
+
+- Run algorithm Onng on molport-1024-jaccard dataset for top-K (K=10) nearest neighbor query
+
+        python run.py --dataset=molport-1024-jaccard --algorithm='Onng(Ngt)' --count=10 --sif-dir="./singularity"
+
+- Run algorithm SW-graph on chembl-1024-jaccard dataset for **range query** retrieving all neighbors with similarity coefficient larger than 0.8
+
+        python run.py --dataset=chembl-1024-jaccard --algorithm='SW-graph(Nmslib)' --sif-dir="./singularity" --rq --radius=0.2
+
+- Run algorithm Hnsw on chembl-1024-jaccard dataset for **batch** top-K (K=100) nearest neighbor search
+
+        python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity" --batch
+
+# Executions under an HPC environment
+
+1. Load the anaconda module
+
+        module load anaconda/5.1.0
+
+2. Create an anaconda environment, and then install the dependent libraries
+
+        conda create -c rdkit -n ann_env rdkit python=3.5.2
+        pip install -r singularity-install/requirements.txt
+
+3. Submit your run script through SLURM
+
+        sbatch run.sh
+
+An example "run.sh":
+
+    #!/bin/bash
+    #SBATCH --ntasks=1
+
+    module load anaconda/5.1.0
+    source activate ann_env
+    module purge
+    module load gcc/5.4.0
+    module load singularity/3.1
+
+    python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
+
+
+# Parameter tuning
+All algorithmic parameter settings are included in the "./algos.yaml" file.
+
+An example of Hnsw parameters:
+
+- disabled: false (the algorithm is enabled)
+- singularity-tag: ann-bench-nmslib3 (the name of the Singularity image)
+- run-groups: M-20: ... M-12: ... (two groups of parameters are run. The first group, M-20, has construction parameters "M": 20, "post": 0, "efConstruction": 800, and query parameters [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]. These correspond to the algorithmic parameters of Hnsw in "./ann_benchmarks/algorithms/nmslib.py". The group M-12 is defined in the same way.)
+
+    Hnsw(Nmslib):
+      disabled: false
+      docker-tag: ann-benchmarks-nmslib
+      singularity-tag: ann-bench-nmslib3
+      module: ann_benchmarks.algorithms.nmslib
+      constructor: NmslibReuseIndex
+      base-args: ["@metric", "hnsw"]
+      run-groups:
+        M-20:
+          arg-groups:
+            - {"M": 20, "post": 0, "efConstruction": 800}
+            - False
+          query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
+        M-12:
+          arg-groups:
+            - {"M": 12, "post": 0, "efConstruction": 800}
+            - False
+          query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
+
+After adding a new set of parameters, M-32:
+
+    Hnsw(Nmslib):
+      disabled: false
+      docker-tag: ann-benchmarks-nmslib
+      singularity-tag: ann-bench-nmslib3
+      module: ann_benchmarks.algorithms.nmslib
+      constructor: NmslibReuseIndex
+      base-args: ["@metric", "hnsw"]
+      run-groups:
+        M-32:
+          arg-groups:
+            - {"M": 32, "post": 2, "efConstruction": 800}
+            - False
+          query-args: [[100, 300, 500, 700, 1000, 1500, 2000]]
+        M-20:
+          arg-groups:
+            - {"M": 20, "post": 0, "efConstruction": 800}
+            - False
+          query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
+        M-12:
+          arg-groups:
+            - {"M": 12, "post": 0, "efConstruction": 800}
+            - False
+          query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
+
+An example of Onng parameters:
+
+- disabled: false (the algorithm is enabled)
+- singularity-tag: ann-bench-ngt (the name of the Singularity image)
+- run-groups: onng: args: ... query-args: ... ("args" holds three sets of construction parameters. The first set, [100, 300, 500], is for edge; the second set, [10, 30, 50], is for outdegree; and the third set, [10, 30, 50], is for indegree. These correspond to the algorithmic parameters of Onng defined in "./ann_benchmarks/algorithms/onng_ngt.py". A grid search is performed over all 3^3=27 combinations. "query-args" holds the query parameter, epsilon.)
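The grid search described above can be sketched as a Cartesian product over the three construction parameter lists. This is only an illustration of where the count of 27 comes from, not the code the benchmark uses to expand algos.yaml; the function name `construction_grid` and the variable names are hypothetical.

```python
from itertools import product

# Construction parameter lists copied from the Onng example above.
edge_values = [100, 300, 500]
outdegree_values = [10, 30, 50]
indegree_values = [10, 30, 50]


def construction_grid(*value_lists):
    """Return every combination that takes one value from each list."""
    return list(product(*value_lists))


grid = construction_grid(edge_values, outdegree_values, indegree_values)
print(len(grid))  # 27
print(grid[0])    # (100, 10, 10)
```

Adding one value to each list grows the grid to 4^3=64 combinations, so the number of index builds rises quickly as lists are extended.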
+
+    Onng(Ngt):
+      disabled: false
+      docker-tag: ann-benchmarks-ngt
+      singularity-tag: ann-bench-ngt
+      module: ann_benchmarks.algorithms.onng_ngt
+      constructor: ONNG
+      base-args: ["@metric", "Byte", 1.0]
+      run-groups:
+        onng:
+          args: [[100, 300, 500], [10, 30, 50], [10, 30, 50]]
+          query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]
+
+After adding a new value for each of the three construction parameters:
+
+    Onng(Ngt):
+      disabled: false
+      docker-tag: ann-benchmarks-ngt
+      singularity-tag: ann-bench-ngt
+      module: ann_benchmarks.algorithms.onng_ngt
+      constructor: ONNG
+      base-args: ["@metric", "Byte", 1.0]
+      run-groups:
+        onng:
+          args: [[100, 300, 500, 1000], [10, 30, 50, 100], [10, 30, 50, 120]]
+          query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]
+
+
+The file begins with "bit:\n jaccard:\n", which means that "bit" is used as the data type and "jaccard" as the distance metric.
+
+# Prepare a custom dataset including fingerprint computation
+
+Here is the process to add a custom dataset. We use the Chembl dataset with 2048-bit ECFP as an example.
+1. Put the raw sdf file, e.g. chembl_24_1.sdf.gz, under the "data" folder. Only ".sdf.gz" files are accepted; multiple sdf files are allowed.
+2. Add the key-value pair below to DATASETS, defined at the bottom of "./ann_benchmarks/datasets.py".
+To use a fingerprint other than ECFP, define a fingerprint calculation function similar to ecfp() in the same Python file.
+
+        'chembl-2048-jaccard': lambda out_fn: ecfp(out_fn, 'Chembl', 2048, 'jaccard', 'bit'),
+
+3. Run a command with the dataset set to "chembl-2048-jaccard"; the dataset file "chembl-2048-jaccard.hdf5" will be constructed under the "data" folder.
+
+        python run.py --dataset=chembl-2048-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
+
+# References
+- Omohundro, S. M. Five Balltree Construction Algorithms. _Tech. report, UC Berkeley_ **1989**.
+
+- Uhlmann, J. Satisfying General Proximity/Similarity Queries with Metric Trees. _Inf. Process. Lett._ **1991**, _40_, 175–179.
+
+- Dong, W.; Charikar, M.; Li, K. Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures. In _Proceedings of WWW Conference_; 2011; pp 577–586.
+
+- Malkov, Y.; Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. _CoRR, abs/1603.09320_ **2016**.
+
+- Malkov, Y.; Ponomarenko, A.; Logvinov, A.; Krylov, V. Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs. _Inf. Syst._ **2014**, _45_, 61–68.
+
+- Nasr, R.; Vernica, R.; Li, C.; Baldi, P. Speeding up Chemical Searches Using the Inverted Index: The Convergence of Chemoinformatics and Text Search Methods. _J. Chem. Inf. Model._ **2012**, _52_ (4), 891–900.
+
+- Vachery, J.; Ranu, S. RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints. _J. Chem. Inf. Model._ **2019**.
+
+- Iwasaki, M. Pruned Bi-Directed k-Nearest Neighbor Graph for Proximity Search. In _Proceedings of International Conference on Similarity Search and Applications_; 2016; pp 20–33.
+
+- Iwasaki, M.; Miyazaki, D. Optimization of Indexing Based on K-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data. _CoRR, abs/1810.07355_ **2018**.
+
+- Datasketch: Big data looks small. https://ekzhu.github.io/datasketch (accessed May 31, 2019).
+
+- Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. _Nucleic Acids Res._ **2012**, _40_, 1100–1107.
\ No newline at end of file