Commit

add REDME.md
cjz18001 authored May 13, 2020
1 parent 0f0236f commit 6a42628
Showing 1 changed file with 318 additions and 0 deletions.
318 changes: 318 additions & 0 deletions README.md
@@ -0,0 +1,318 @@
# Welcome to Molecular Similarity Search Benchmark (MssBenchmark)!

A molecular similarity search benchmark.

Algorithms currently supported:

- [Balltree](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.8209)
- Bruteforce/Exhaustive search
- [Min-Hash](https://ekzhu.github.io/datasketch)
- [DivideSkip](https://pubs.acs.org/doi/10.1021/ci200552r)
- [Hnsw](https://arxiv.org/abs/1603.09320)
- [Onng](https://arxiv.org/abs/1810.07355)
- [Panng](https://link.springer.com/chapter/10.1007/978-3-319-46759-7_2)
- [Pynndescent](http://wwwconference.org/proceedings/www2011/proceedings/p577.pdf)
- [Risc](https://pubs.acs.org/doi/10.1021/acs.jcim.9b00069)
- [SW-graph](https://www.sciencedirect.com/science/article/pii/S0306437913001300)
- [VPtree](https://www.sciencedirect.com/science/article/abs/pii/002001909190074R)


Datasets prepared (to include your own data, see the custom dataset instructions later in this document):

- [Chembl](https://www.ebi.ac.uk/chembl/), 1024-bit ECFP
- [Molport](https://www.molport.com/shop/database-download), 1024-bit ECFP


Computational environments supported:

- A local PC
- Docker-based container
- Singularity-based container / High Performance Computing (HPC)

# Useful links

- Github Repo: https://github.uconn.edu/mldrugdiscovery/MssBenchmark
- Example Datasets: https://drive.google.com/open?id=19mfbPoL1Ajvs0Ol2w50ILQQTZyXKcahC
- Pre-compiled Singularity Images: https://drive.google.com/open?id=1L9Bj5TxAfxf27J1PdCnQ2EPgbM59BDwb

# Using prepared datasets and pre-compiled Singularity images

1. Download a dataset, e.g. "Chembl-1024-jaccard.hdf5", and put it under the "data" folder.
2. Download a Singularity image file, e.g. "ann-bench-nmslib3.sif", and put it under the "singularity" folder.

# Executions under a PC with Singularity
run.py parameters:

- dataset: dataset name (required). Examples:
  - chembl-1024-jaccard
  - molport-1024-jaccard
- algorithm: algorithm name (required). Options:
  - Balltree(Sklearn)
  - Bruteforce
  - Datasketch
  - DivideSkip
  - Hnsw(Nmslib)
  - Onng(Ngt)
  - Panng(Ngt)
  - Pynndescent
  - Risc
  - SW-graph(Nmslib)
  - VPtree(Nmslib)
- count: the value of K for top-K nearest neighbor search. Default: 10
- runs: the number of times the query set will be executed. Default: 2
- sif-dir: the directory containing the Singularity image files. Default: "./singularity"
- batch: batch query mode. Default: False
- rq: range query (threshold-based query) mode. Default: False
- radius: the cut-off value used in range query mode. The cut-off is a distance, not a similarity (distance = 1 - similarity), so to retrieve all neighbors with a similarity coefficient larger than 0.8, set it to 0.2. Default: 0.3
- force: re-run algorithms even if their results already exist. Default: False
- time-out: timeout (in seconds) for each individual algorithm run, or -1 if no timeout should be set. Default: -1
- run-disabled: run algorithms that are disabled in algos.yaml. Default: False
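
The relationship between a similarity threshold and the radius option can be sketched in a few lines of Python. This is only an illustration, not code from the repository: the helper names `jaccard_similarity` and `radius_for_similarity` are made up here, and fingerprints are represented as plain sets of on-bit indices for simplicity.

```python
def jaccard_similarity(a, b):
    """Jaccard (Tanimoto) similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


def radius_for_similarity(min_similarity):
    """Convert a similarity threshold into the distance cut-off passed as --radius."""
    if not 0.0 <= min_similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    # Jaccard distance = 1 - Jaccard similarity; rounding hides float noise.
    return round(1.0 - min_similarity, 6)


fp_a = {1, 5, 9, 12, 40}
fp_b = {1, 5, 9, 12, 77}
similarity = jaccard_similarity(fp_a, fp_b)  # 4 shared bits out of 6 distinct bits
print("similarity:", similarity, "distance:", 1.0 - similarity)
print("radius for similarity > 0.8:", radius_for_similarity(0.8))
```

With a threshold of 0.8 the helper returns 0.2, the value used with the range query option in this README.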

Command examples (for Singularity only):

- Run algorithm Hnsw on the chembl-1024-jaccard dataset for a top-K (K=100) nearest neighbor query

      python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"

- Run algorithm Onng on the molport-1024-jaccard dataset for a top-K (K=10) nearest neighbor query

      python run.py --dataset=molport-1024-jaccard --algorithm='Onng(Ngt)' --count=10 --sif-dir="./singularity"

- Run algorithm SW-graph on the chembl-1024-jaccard dataset for a **range query** retrieving all neighbors with a similarity coefficient larger than 0.8

      python run.py --dataset=chembl-1024-jaccard --algorithm='SW-graph(Nmslib)' --sif-dir="./singularity" --rq --radius=0.2

- Run algorithm Hnsw on the chembl-1024-jaccard dataset for a **batch** top-K (K=100) nearest neighbor search

      python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity" --batch
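
When several of these commands need to be launched from a script, the argument list can be assembled programmatically. The sketch below is an assumption about how one might do this, not part of the repository: `build_run_command` is a hypothetical helper, and it only uses flags that appear in the examples above. Following those examples, it omits `--count` when a range query is requested.

```python
import shlex


def build_run_command(dataset, algorithm, count=None, sif_dir="./singularity",
                      batch=False, radius=None):
    """Assemble a run.py argument list from the options documented above."""
    cmd = ["python", "run.py",
           "--dataset={}".format(dataset),
           "--algorithm={}".format(algorithm)]
    if radius is not None:
        # Range query mode: pass the distance cut-off instead of K.
        cmd += ["--rq", "--radius={}".format(radius)]
    elif count is not None:
        cmd.append("--count={}".format(count))
    cmd.append("--sif-dir={}".format(sif_dir))
    if batch:
        cmd.append("--batch")
    return cmd


top_k = build_run_command("chembl-1024-jaccard", "Hnsw(Nmslib)", count=100)
range_query = build_run_command("chembl-1024-jaccard", "SW-graph(Nmslib)", radius=0.2)

# Print shell-safe versions; the parentheses in algorithm names need quoting.
for cmd in (top_k, range_query):
    print(" ".join(shlex.quote(part) for part in cmd))

# To actually launch a run, the list could be handed to subprocess.run(cmd, check=True).
```

Passing the command as a list avoids shell quoting problems with algorithm names such as Hnsw(Nmslib).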

# Executions under an HPC environment

1. Load the Anaconda module:

       module load anaconda/5.1.0

2. Create an Anaconda environment, activate it, and install the dependent libraries:

       conda create -c rdkit -n ann_env rdkit python=3.5.2
       source activate ann_env
       pip install -r singularity-install/requirements.txt

3. Submit your run script to SLURM:

       sbatch run.sh

An example "run.sh":

    #!/bin/bash
    #SBATCH --ntasks=1

    module load anaconda/5.1.0
    source activate ann_env
    module purge
    module load gcc/5.4.0
    module load singularity/3.1

    python run.py --dataset=chembl-1024-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
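
If several algorithms should be benchmarked in one job, a script like the example "run.sh" can be generated rather than edited by hand. The following is a minimal sketch under stated assumptions: `render_run_sh` is a hypothetical helper, the module names are copied from the example above and will differ on other clusters, and running the algorithms one after another inside a single job is a choice made here for illustration.

```python
RUN_SH_TEMPLATE = """#!/bin/bash
#SBATCH --ntasks=1

module load anaconda/5.1.0
source activate ann_env
module purge
module load gcc/5.4.0
module load singularity/3.1

{commands}
"""


def render_run_sh(dataset, algorithms, count=100, sif_dir="./singularity"):
    """Return the text of a SLURM script that runs each algorithm in sequence."""
    lines = [
        "python run.py --dataset={} --algorithm='{}' --count={} --sif-dir=\"{}\"".format(
            dataset, algorithm, count, sif_dir)
        for algorithm in algorithms
    ]
    return RUN_SH_TEMPLATE.format(commands="\n".join(lines))


script = render_run_sh("chembl-1024-jaccard", ["Hnsw(Nmslib)", "Onng(Ngt)"])
print(script)

# The text could then be written to run.sh and submitted with "sbatch run.sh".
```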


# Parameter tuning
All algorithmic parameter settings are included in the "./algos.yaml" file.

Example Hnsw parameters:

- disabled: false (the algorithm is enabled)
- singularity-tag: ann-bench-nmslib3 (the name of the Singularity image)
- run-groups: M-20: ... M-12: ... (two groups of parameters are run. The first group, M-20, has construction parameters "M": 20, "post": 0, "efConstruction": 800 and query parameters [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]. These correspond to the algorithmic parameters of Hnsw in "./ann_benchmarks/algorithms/nmslib.py". M-12 is defined similarly.)

The full entry:

    Hnsw(Nmslib):
      disabled: false
      docker-tag: ann-benchmarks-nmslib
      singularity-tag: ann-bench-nmslib3
      module: ann_benchmarks.algorithms.nmslib
      constructor: NmslibReuseIndex
      base-args: ["@metric", "hnsw"]
      run-groups:
        M-20:
          arg-groups:
            - {"M": 20, "post": 0, "efConstruction": 800}
            - False
          query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
        M-12:
          arg-groups:
            - {"M": 12, "post": 0, "efConstruction": 800}
            - False
          query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]

After adding a new set of parameters, M-32:

    Hnsw(Nmslib):
      disabled: false
      docker-tag: ann-benchmarks-nmslib
      singularity-tag: ann-bench-nmslib3
      module: ann_benchmarks.algorithms.nmslib
      constructor: NmslibReuseIndex
      base-args: ["@metric", "hnsw"]
      run-groups:
        M-32:
          arg-groups:
            - {"M": 32, "post": 2, "efConstruction": 800}
            - False
          query-args: [[100, 300, 500, 700, 1000, 1500, 2000]]
        M-20:
          arg-groups:
            - {"M": 20, "post": 0, "efConstruction": 800}
            - False
          query-args: [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]
        M-12:
          arg-groups:
            - {"M": 12, "post": 0, "efConstruction": 800}
            - False
          query-args: [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]]

Example Onng parameters:

- disabled: false (the algorithm is enabled)
- singularity-tag: ann-bench-ngt (the name of the Singularity image)
- run-groups: onng: args: ... query-args: ... ("args" holds three sets of construction parameters: the first set [100, 300, 500] is for edge, the second set [10, 30, 50] is for outdegree, and the third set [10, 30, 50] is for indegree. These correspond to the algorithmic parameters of Onng defined in "./ann_benchmarks/algorithms/onng_ngt.py". A grid search is performed over all 3^3 = 27 combinations. "query-args" holds the query parameter, epsilon.)

The full entry:

    Onng(Ngt):
      disabled: false
      docker-tag: ann-benchmarks-ngt
      singularity-tag: ann-bench-ngt
      module: ann_benchmarks.algorithms.onng_ngt
      constructor: ONNG
      base-args: ["@metric", "Byte", 1.0]
      run-groups:
        onng:
          args: [[100, 300, 500], [10, 30, 50], [10, 30, 50]]
          query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]

After adding a new value for each of the three construction parameters:

    Onng(Ngt):
      disabled: false
      docker-tag: ann-benchmarks-ngt
      singularity-tag: ann-bench-ngt
      module: ann_benchmarks.algorithms.onng_ngt
      constructor: ONNG
      base-args: ["@metric", "Byte", 1.0]
      run-groups:
        onng:
          args: [[100, 300, 500, 1000], [10, 30, 50, 100], [10, 30, 50, 120]]
          query-args: [[0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]]
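
The size of the Onng grid search can be checked with a short Python sketch. It assumes, as the description of "args" states, that one index is built per combination of the three construction parameters; the `grid` helper is illustrative and is not the benchmark's own code.

```python
from itertools import product


def grid(args):
    """Enumerate every (edge, outdegree, indegree) combination of the three lists."""
    return list(product(*args))


original = [[100, 300, 500], [10, 30, 50], [10, 30, 50]]
extended = [[100, 300, 500, 1000], [10, 30, 50, 100], [10, 30, 50, 120]]

print(len(grid(original)))   # 3 * 3 * 3 combinations
print(len(grid(extended)))   # 4 * 4 * 4 combinations
print(grid(original)[0])     # first combination: (100, 10, 10)
```

Adding one value to each list raises the number of index builds from 27 to 64, so the running time of a benchmark grows quickly with each added value.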


The file begins with the nested keys "bit:" and "jaccard:". They mean that "bit" is the data type and "jaccard" is the distance metric; all algorithm entries are listed beneath them.
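
To make the nesting concrete, the sketch below mirrors part of the Hnsw entry as a plain Python dictionary and lists its run groups. It is an illustration only: the dictionary is typed in by hand rather than loaded from the file, `summarize` is a hypothetical helper, and treating each value of the inner "query-args" list as one query setting is an assumption about how the benchmark interprets it.

```python
config = {
    "bit": {
        "jaccard": {
            "Hnsw(Nmslib)": {
                "disabled": False,
                "singularity-tag": "ann-bench-nmslib3",
                "run-groups": {
                    "M-32": {
                        "arg-groups": [{"M": 32, "post": 2, "efConstruction": 800}, False],
                        "query-args": [[100, 300, 500, 700, 1000, 1500, 2000]],
                    },
                    "M-20": {
                        "arg-groups": [{"M": 20, "post": 0, "efConstruction": 800}, False],
                        "query-args": [[2, 5, 10, 15, 20, 30, 40, 50, 70, 80]],
                    },
                    "M-12": {
                        "arg-groups": [{"M": 12, "post": 0, "efConstruction": 800}, False],
                        "query-args": [[1, 2, 5, 10, 15, 20, 30, 40, 50, 70, 80]],
                    },
                },
            },
        },
    },
}


def summarize(config, data_type, metric):
    """List (algorithm, run group, M, number of query settings) for enabled algorithms."""
    rows = []
    for algorithm, definition in config[data_type][metric].items():
        if definition.get("disabled", False):
            continue
        for group, settings in definition["run-groups"].items():
            build_args = settings["arg-groups"][0]
            rows.append((algorithm, group, build_args["M"], len(settings["query-args"][0])))
    return sorted(rows)


rows = summarize(config, "bit", "jaccard")
for row in rows:
    print(row)
```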

# Prepare a custom dataset including fingerprint computation

Here is the process for adding a custom dataset. We will use the Chembl dataset and 2048-bit ECFP as an example.

1. Put the raw SDF file, e.g. chembl_24_1.sdf.gz, under the "data" folder. Only ".sdf.gz" files are accepted. Multiple SDF files are allowed.
2. Add the key-value pair below to DATASETS, defined at the bottom of "./ann_benchmarks/datasets.py". If a fingerprint other than ECFP is used, define a fingerprint calculation function similar to ecfp() in the same Python file.

       'chembl-2048-jaccard': lambda out_fn: ecfp(out_fn, 'Chembl', 2048, 'jaccard', 'bit'),

3. Run a command with the dataset set to "chembl-2048-jaccard"; the dataset file "chembl-2048-jaccard.hdf5" will be constructed under the "data" folder.

       python run.py --dataset=chembl-2048-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
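
How a DATASETS entry is used can be sketched as follows. This is a simplified stand-in, not the repository's code: the `ecfp` function here only records its arguments instead of computing fingerprints, its parameter names are guesses based on the key-value pair shown above, and `prepare` is a hypothetical helper showing how the dataset name might be turned into an output file name.

```python
calls = []


def ecfp(out_fn, dataset_name, dimension, distance, data_type):
    """Stand-in for the real fingerprint routine: records its arguments only."""
    calls.append({
        "out_fn": out_fn,
        "dataset_name": dataset_name,
        "dimension": dimension,
        "distance": distance,
        "data_type": data_type,
    })


# Registry mapping a dataset name to a function that builds its HDF5 file.
DATASETS = {
    'chembl-2048-jaccard': lambda out_fn: ecfp(out_fn, 'Chembl', 2048, 'jaccard', 'bit'),
}


def prepare(dataset, data_dir="data"):
    """Look up the dataset by name and build its file under data_dir (hypothetical)."""
    out_fn = "{}/{}.hdf5".format(data_dir, dataset)
    DATASETS[dataset](out_fn)
    return out_fn


print(prepare('chembl-2048-jaccard'))
print(calls[0])
```

The dictionary key is the name passed to the dataset option of run.py, and the lambda fixes the fingerprint length, distance metric, and data type for that name.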

# References
- Omohundro, S. M. Five Balltree Construction Algorithms. _Tech. report, UC Berkeley_ **1989**.

- Uhlmann, J. Satisfying General Proximity/Similarity Queries with Metric Trees. _Inf. Process. Lett._ **1991**, _40_, 175–179.

- Dong, W.; Charikar, M.; Li, K. Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures. In _Proceedings of WWW Conference_; 2011; pp 577–586.

- Malkov, Y.; Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. _CoRR, abs/1603.09320_ **2016**.

- Malkov, Y.; Ponomarenko, A.; Logvinov, A.; Krylov, V. Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs. _Inf. Syst._ **2014**, _45_, 61–68.

- Nasr, R.; Vernica, R.; Li, C.; Baldi, P. Speeding up Chemical Searches Using the Inverted Index: The Convergence of Chemoinformatics and Text Search Methods. _J. Chem. Inf. Model._ **2012**, _52_ (4), 891–900.

- Vachery, J.; Ranu, S. RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints. _J. Chem. Inf. Model._ **2019**.

- Iwasaki, M. Pruned Bi-Directed k-Nearest Neighbor Graph for Proximity Search. In _Proceedings of International Conference on Similarity Search and Applications_; 2016; pp 20–33.

- Iwasaki, M.; Miyazaki, D. Optimization of Indexing Based on K-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data. _CoRR, abs/1810.07355_ **2018**.

- Datasketch: Big data looks small. https://ekzhu.github.io/datasketch (accessed May 31, 2019).

- Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. _Nucleic Acids Res._ **2012**, _40_, 1100–1107.
