add REDME.md

mldrugdiscovery · May 13, 2020 · 6a42628
1 parent 0f0236f
commit 6a42628
Showing 1 changed file with 318 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,318 @@
+# Welcome to Molecular Similarity Search Benchmark (MssBenchmark)!
+
+A molecular similarity search benchmark.
+
+Algorithms currently supported:
+
+- [Balltree](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.8209)
+- Bruteforce/Exhaustive search
+- [Min-Hash](https://ekzhu.github.io/datasketch)
+- [DivideSkip](https://pubs.acs.org/doi/10.1021/ci200552r)
+- [Hnsw](https://arxiv.org/abs/1603.09320)
+- [Onng](https://arxiv.org/abs/1810.07355)
+- [Panng](https://link.springer.com/chapter/10.1007/978-3-319-46759-7_2)
+- [Pynndescent](http://wwwconference.org/proceedings/www2011/proceedings/p577.pdf)
+- [Risc](https://pubs.acs.org/doi/10.1021/acs.jcim.9b00069)
+- [SW-graph](https://www.sciencedirect.com/science/article/pii/S0306437913001300)
+- [VPtree](https://www.sciencedirect.com/science/article/abs/pii/002001909190074R)
+
+
+Prepared datasets (to include your own data, see the custom dataset instructions below):
+
+- [Chembl](https://www.ebi.ac.uk/chembl/), 1024-bit ECFP
+- [Molport](https://www.molport.com/shop/database-download), 1024-bit ECFP
+
+
+Computational environments supported:
+
+- A local PC
+- Docker-based container
+- Singularity-based container / High Performance Computing (HPC)
+
+# Useful links
+
+- Github Repo: https://github.uconn.edu/mldrugdiscovery/MssBenchmark
+- Example Datasets: https://drive.google.com/open?id=19mfbPoL1Ajvs0Ol2w50ILQQTZyXKcahC
+- Pre-compiled Singularity Images: https://drive.google.com/open?id=1L9Bj5TxAfxf27J1PdCnQ2EPgbM59BDwb
+
+# Using prepared datasets and pre-compiled Singularity images
+
+1. Download a dataset, e.g. "Chembl-1024-jaccard.hdf5", and put it in the "data" folder.
+2. Download a Singularity image file, e.g. "ann-bench-nmslib3.sif", and put it in the "singularity" folder.
+
+# Running on a PC with Singularity
+Parameters of run.py:
+
+    dataset: dataset name (Required)
+        Examples:
+        - chembl-1024-jaccard
+        - molport-1024-jaccard
+    algorithm: algorithm name (Required)
+        Options:
+        - Balltree(Sklearn)
+        - Bruteforce
+        - Datasketch
+        - DivideSkip
+        - Hnsw(Nmslib)
+        - Onng(Ngt)
+        - Panng(Ngt)
+        - Pynndescent
+        - Risc
+        - SW-graph(Nmslib)
+        - VPtree(Nmslib)
+    count: the value of K for top-K nearest neighbor search
+        Default: 10
+    runs: the number of times the query set will be executed
+        Default: 2
+    sif-dir: Singularity image files directory
+        Default: "./singularity"
+    batch: batch query mode
+        Default: False
+    rq: range query / threshold-based query mode
+        Default: False
+    radius: the cut-off value used in range query mode. The value is a distance, not a similarity, so to retrieve all neighbors with a similarity coefficient larger than 0.8, set it to 0.2.
+        Default: 0.3
+    force: re-run algorithms even if their results already exist
+        Default: False
+    time-out: timeout (in seconds) for each individual algorithm run, or -1 if no timeout should be set
+        Default: -1
+    run-disabled: run algorithms that are disabled in algos.yaml
+        Default: False
+
+Command Examples (for Singularity only):
+- Run algorithm Hnsw on chembl-1024-jaccard dataset for top-K (K=100) nearest neighbor query
+
+    python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"  
+
+- Run algorithm Onng on molport-1024-jaccard dataset for top-K (K=10) nearest neighbor query
+
+    python run.py --dataset=molport-1024-jaccard --algorithm='Onng(Ngt)' --count=10 --sif-dir="./singularity"
+
+- Run algorithm SW-graph on chembl-1024-jaccard dataset for **range query** retrieving all neighbors with similarity coefficient larger than 0.8
+
+    python run.py --dataset=chembl-1024-jaccard --algorithm='SW-graph(Nmslib)' --sif-dir="./singularity" --rq --radius=0.2
+
+- Run algorithm Hnsw on chembl-1024-jaccard dataset for **batch** top-K (K=100) nearest neighbor search
+
+    python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"  --batch
+
+# Running in an HPC environment
+
+1. Load the Anaconda module
+
+    module load anaconda/5.1.0
+
+2. Create an Anaconda environment, activate it (`source activate ann_env`), and then install the dependencies
+
+    conda create -c rdkit -n ann_env rdkit python=3.5.2
+    pip install -r singularity-install/requirements.txt
+
+3. Submit the run script to SLURM
+
+    sbatch run.sh
+
+An example "run.sh":
+
+    #!/bin/bash
+
+    #SBATCH --ntasks=1
+
+      
+
+    module load anaconda/5.1.0
+
+    source activate ann_env
+
+    module purge
+
+    module load gcc/5.4.0
+
+    module load singularity/3.1
+
+      
+
+    python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"  
+
+
+# Parameter tuning
+All algorithmic parameter settings are included in the "./algos.yaml" file.
+
+Example Hnsw parameters:
+- disabled: false (the algorithm is enabled)
+- singularity-tag: ann-bench-nmslib3 (the name of the Singularity image)
+- run-groups: M-20: ...  M-12: ... (two groups of parameters are run: the first group, M-20, has construction parameters "M": 20, "post": 0, "efConstruction": 800, and query parameters [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]. These correspond to the algorithmic parameters of Hnsw in "./ann_benchmarks/algorithms/nmslib.py". The same applies to M-12.)
+
+    Hnsw(Nmslib):
+
+      disabled: false
+
+      docker-tag: ann-benchmarks-nmslib
+
+      singularity-tag: ann-bench-nmslib3
+
+      module: ann_benchmarks.algorithms.nmslib
+
+      constructor: NmslibReuseIndex
+
+      base-args: ["@metric", "hnsw"]
+
+      run-groups:
+
+        M-20:
+
+          arg-groups:
+
+            - {"M": 20, "post": 0, "efConstruction": 800}
+
+            - False
+
+          query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
+
+        M-12:
+
+          arg-groups:
+
+            - {"M": 12, "post": 0, "efConstruction": 800}
+
+            - False
+
+          query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
+
+After adding a new set of parameters, M-32:
+
+    Hnsw(Nmslib):
+
+      disabled: false
+
+      docker-tag: ann-benchmarks-nmslib
+
+      singularity-tag: ann-bench-nmslib3
+
+      module: ann_benchmarks.algorithms.nmslib
+
+      constructor: NmslibReuseIndex
+
+      base-args: ["@metric", "hnsw"]
+
+      run-groups:
+
+        M-32:
+
+          arg-groups:
+
+            - {"M": 32, "post": 2, "efConstruction": 800}
+
+            - False
+
+          query-args: [[100, 300, 500, 700, 1000, 1500, 2000]]
+
+        M-20:
+
+          arg-groups:
+
+            - {"M": 20, "post": 0, "efConstruction": 800}
+
+            - False
+
+          query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
+
+        M-12:
+
+          arg-groups:
+
+            - {"M": 12, "post": 0, "efConstruction": 800}
+
+            - False
+
+          query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
+
+Example Onng parameters:
+- disabled: false (the algorithm is enabled)
+- singularity-tag: ann-bench-ngt (the name of the Singularity image)
+- run-groups: onng: args: ... query-args: ...
+  ("args" contains three sets of construction parameters: the first set [100, 300, 500] is for edge, the second set [10, 30, 50] is for outdegree, and the third set [10, 30, 50] is for indegree. These correspond to the algorithmic parameters of Onng defined in "./ann_benchmarks/algorithms/onng_ngt.py". A grid search is performed over all 3^3 = 27 combinations. "query-args" contains the query parameter, epsilon.)
+
+    Onng(Ngt):
+
+      disabled: false
+
+      docker-tag: ann-benchmarks-ngt
+
+      singularity-tag: ann-bench-ngt
+
+      module: ann_benchmarks.algorithms.onng_ngt
+
+      constructor: ONNG
+
+      base-args: ["@metric", "Byte", 1.0]
+
+      run-groups:
+
+        onng:
+
+          args: [[100, 300, 500], [10, 30, 50], [10, 30, 50]]
+
+          query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]
+
+After adding a new value for each of the three construction parameters:
+
+    Onng(Ngt):
+
+      disabled: false
+
+      docker-tag: ann-benchmarks-ngt
+
+      singularity-tag: ann-bench-ngt
+
+      module: ann_benchmarks.algorithms.onng_ngt
+
+      constructor: ONNG
+
+      base-args: ["@metric", "Byte", 1.0]
+
+      run-groups:
+
+        onng:
+
+          args: [[100, 300, 500, 1000], [10, 30, 50, 100], [10, 30, 50, 120]]
+
+          query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]
+
+
+The file begins with "bit:\n jaccard:\n", which means that "bit" is used as the data type and "jaccard" as the distance metric.
+
+# Prepare a custom dataset including fingerprint computation
+
+The process for adding a custom dataset is described below, using the Chembl dataset with 2048-bit ECFP as an example.
+1. Put the raw SDF file, e.g. chembl_24_1.sdf.gz, in the "data" folder. Only ".sdf.gz" files are accepted; multiple SDF files are allowed.
+2. Add the key-value pair below to DATASETS, defined at the bottom of "./ann_benchmarks/datasets.py".
+If a fingerprint other than ECFP is used, define a fingerprint calculation function similar to ecfp() in the same Python file.
+
+    'chembl-2048-jaccard': lambda out_fn: ecfp(out_fn, 'Chembl', 2048, 'jaccard', 'bit'),
+
+3. Run a command with the dataset set to "chembl-2048-jaccard"; the dataset file "chembl-2048-jaccard.hdf5" will be created in the "data" folder.
+
+    python run.py --dataset=chembl-2048-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
+
+# References
+- Omohundro, S. M. Five Balltree Construction Algorithms. _Tech. report, UC Berkeley_ **1989**.
+
+- Uhlmann, J. Satisfying General Proximity/Similarity Queries with Metric Trees. _Inf. Process. Lett._ **1991**, _40_, 175–179.
+
+- Dong, W.; Charikar, M.; Li, K. Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures. In _Proceedings of WWW Conference_; 2011; pp 577–586.
+
+- Malkov, Y.; Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. _CoRR, abs/1603.09320_ **2016**.
+
+- Malkov, Y.; Ponomarenko, A.; Logvinov, A.; Krylov, V. Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs. _Inf. Syst._ **2014**, _45_, 61–68.
+
+- Nasr, R.; Vernica, R.; Li, C.; Baldi, P. Speeding up Chemical Searches Using the Inverted Index: The Convergence of Chemoinformatics and Text Search Methods. _J. Chem. Inf. Model._ **2012**, _52_(4), 891–900.
+
+- Vachery, J.; Ranu, S. RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints. _J. Chem. Inf. Model._ **2019**.
+
+- Iwasaki, M. Pruned Bi-Directed k-Nearest Neighbor Graph for Proximity Search. In _Proceedings of International Conference on Similarity Search and Applications_; 2016; pp 20–33.
+
+- Iwasaki, M.; Miyazaki, D. Optimization of Indexing Based on K-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data. _CoRR, abs/1810.07355_ **2018**.
+
+- Datasketch: Big data looks small https://ekzhu.github.io/datasketch (accessed May 31, 2019).
+
+- Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. _Nucleic Acids Res._ **2012**, _40_, 1100–1107.
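
A note on the `radius` option documented in the README above: it is expressed as a Jaccard distance, which is one minus the Tanimoto (Jaccard) similarity of two bit fingerprints. The sketch below is illustrative only and is not code from the repository. The helper names (`tanimoto_similarity`, `radius_for_similarity`, `range_query`) and the toy fingerprints are assumptions made for this example, and each fingerprint is represented as a Python set of on-bit positions rather than the HDF5 format the benchmark uses. It shows how a similarity threshold of 0.8 maps to `--radius=0.2` and what an exhaustive range query returns for that radius.

```python
def tanimoto_similarity(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(a & b) / union


def radius_for_similarity(threshold):
    """Convert a similarity threshold into the equivalent Jaccard distance radius."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("similarity threshold must be in [0, 1]")
    return 1.0 - threshold


def range_query(query, database, radius):
    """Exhaustive range query: indices of entries whose Jaccard distance is below the radius."""
    return [
        i for i, fp in enumerate(database)
        if 1.0 - tanimoto_similarity(query, fp) < radius
    ]


# Toy fingerprints (hypothetical data, not taken from Chembl or Molport).
query = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
database = [
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},      # identical: similarity 1.0
    {1, 2, 3, 4, 5, 6, 7, 8, 9},          # 9 shared bits, 10 in the union: similarity 0.9
    {1, 2, 3, 4, 5, 11, 12, 13, 14, 15},  # 5 shared bits, 15 in the union: similarity 1/3
]

radius = radius_for_similarity(0.8)  # the value one would pass as --radius
hits = range_query(query, database, radius)
```

With these toy inputs, `hits` contains indices 0 and 1: the third fingerprint has a distance of about 0.67 and falls outside the radius. An approximate index such as SW-graph may miss some of the neighbors that this exhaustive scan finds, which is what the benchmark's recall measurements are meant to capture.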