diff --git a/README.md b/README.md
index f45c047..3fdf9a0 100644
--- a/README.md
+++ b/README.md
@@ -383,7 +383,7 @@ At the beginning of the file, there is "bit:\n jaccard:\n". It means that we use
 Here is the process to add a custom dataset. We will use Chembl dataset and 2048-bits ECFP as example.
 1. Put raw sdf file, e.g. chembl_24_1.sdf.gz, under "data" folder. Note only ".sdf.gz" files are accepted. Multiple sdf files are allowed.
-2. Include the key-value pair below to DATASETS, defined at the bottom of "./ann_benchmark/datasets.py".
+2. Add the key-value pair below to the data structure DATASETS, defined at the bottom of "./ann_benchmark/datasets.py".
 If a new fingerprint rather than ECFP is used, please define a fingerprint calculation function similar to ecfp() in the same Python file.
 'chembl-2048-jaccard': lambda out_fn: ecfp(out_fn, 'Chembl', 2048, 'jaccard', 'bit'),
@@ -392,6 +392,7 @@ If a new fingerprint rather than ECFP is used, please define a fingerprint calcu
 python run.py --dataset=chembl-2048-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"
+Note: to use an existing dataset, e.g. X, make sure that the data structure DATASETS, defined at the bottom of "./ann_benchmark/datasets.py", contains a key-value pair with key X. Otherwise, add a key-value pair with key X and an arbitrary value, e.g., "'X': gist", to DATASETS.
 # References
 - Omohundro, S. M. Five Balltree Construction Algorithms. _Tech. report, UC Berkeley_**1989**.
@@ -413,4 +414,4 @@ If a new fingerprint rather than ECFP is used, please define a fingerprint calcu
 - Datasketch: Big data looks small https://ekzhu.github.io/datasketch (accessed May 31, 2019).
-- Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. _Nucleic Acids Res._**2012**,_40_, 1100–1107.
\ No newline at end of file
+- Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. _Nucleic Acids Res._**2012**,_40_, 1100–1107.
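
The README change above describes DATASETS as a mapping from a dataset name to a function that takes an output file name. The sketch below illustrates that registration pattern and the key lookup that the added note relies on. It is a minimal illustration and not the project's code: the `ecfp()` body here is a hypothetical stand-in that only echoes its arguments, and `is_registered()` and `build()` are helper names invented for this example.

```python
# Hypothetical stand-in for ecfp() in ann_benchmark/datasets.py.
# The real function computes fingerprints and writes them to out_fn;
# this stub only returns its arguments so the pattern can be run in isolation.
def ecfp(out_fn, dataset_name, dimension, distance, point_type):
    return (out_fn, dataset_name, dimension, distance, point_type)


# Stand-in for the DATASETS mapping: dataset name -> function of out_fn.
DATASETS = {
    "chembl-2048-jaccard": lambda out_fn: ecfp(out_fn, "Chembl", 2048, "jaccard", "bit"),
}


def is_registered(name):
    """Return True if a dataset name has an entry in DATASETS."""
    return name in DATASETS


def build(name, out_fn):
    """Look up the dataset by key and call its function with the output file name."""
    if not is_registered(name):
        raise KeyError(f"dataset {name!r} has no entry in DATASETS")
    return DATASETS[name](out_fn)


if __name__ == "__main__":
    print(is_registered("chembl-2048-jaccard"))
    print(build("chembl-2048-jaccard", "chembl-2048-jaccard.hdf5"))
```

A name without an entry, such as the `X` in the note, fails the lookup until a key-value pair for it is added to the mapping.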