add note for using an existing dataset
cjz18001 authored Jun 6, 2020
1 parent 513b065 commit b4f71b8
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -383,7 +383,7 @@ At the beginning of the file, there is "bit:\n jaccard:\n". It means that we use

Here is the process to add a custom dataset. We will use the Chembl dataset and 2048-bit ECFP as an example.
1. Put the raw sdf file, e.g. chembl_24_1.sdf.gz, under the "data" folder. Note that only ".sdf.gz" files are accepted. Multiple sdf files are allowed.
-2. Include the key-value pair below to DATASETS, defined at the bottom of "./ann_benchmark/datasets.py".
+2. Include the key-value pair below to the data structure DATASETS, defined at the bottom of "./ann_benchmark/datasets.py".
If a new fingerprint rather than ECFP is used, please define a fingerprint calculation function similar to ecfp() in the same Python file.

'chembl-2048-jaccard': lambda out_fn: ecfp(out_fn, 'Chembl', 2048, 'jaccard', 'bit'),
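
As a rough illustration of how a registry such as DATASETS maps a dataset key to a builder function, here is a minimal, self-contained sketch. The `ecfp` below is only a stand-in stub that records its arguments (its parameter names are assumptions; the real function lives in "./ann_benchmark/datasets.py"), and `build_dataset` and the output path are hypothetical names introduced for this example.

```python
# Stand-in stub for the real ecfp() in ann_benchmark/datasets.py.
# It only records its arguments, so the lookup pattern can be run
# without any sdf files. Parameter names here are assumptions.
def ecfp(out_fn, dataset_name, dimension, distance, point_type):
    return {
        "out_fn": out_fn,
        "dataset_name": dataset_name,
        "dimension": dimension,
        "distance": distance,
        "point_type": point_type,
    }


# Registry mapping a dataset key to a builder that takes the output file name.
DATASETS = {
    "chembl-2048-jaccard": lambda out_fn: ecfp(out_fn, "Chembl", 2048, "jaccard", "bit"),
}


def build_dataset(name, out_fn):
    """Hypothetical helper: look up the builder by key and call it."""
    if name not in DATASETS:
        raise KeyError(f"unknown dataset {name!r}; add it to DATASETS first")
    return DATASETS[name](out_fn)


# Hypothetical output path, used only for illustration.
result = build_dataset("chembl-2048-jaccard", "data/chembl-2048-jaccard.hdf5")
print(result["dimension"], result["distance"])
```

The key passed to `--dataset` on the command line would need to match the dictionary key exactly, which is why the entry name and the command-line argument are spelled the same way.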
@@ -392,6 +392,7 @@ If a new fingerprint rather than ECFP is used, please define a fingerprint calcu

python run.py --dataset=chembl-2048-jaccard --algorithm='Hnsw(Nmslib)' --count=100 --sif-dir="./singularity"

+Note: to use an existing dataset, e.g. X, make sure that the data structure DATASETS, defined at the bottom of "./ann_benchmark/datasets.py", contains a key-value pair with key X. If it does not, add a key-value pair with key X and an arbitrary value, e.g. "'X': gist", to DATASETS.
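
The check described in this note can be sketched as follows. This is an illustration only: in the repository the entry is added by editing the DATASETS dictionary by hand, and `ensure_dataset_key`, the stub `gist`, and the small example dictionary are hypothetical stand-ins.

```python
def gist(out_fn):
    """Stub standing in for an existing builder function; never called here."""
    raise NotImplementedError


def ensure_dataset_key(datasets, name, placeholder):
    """Add `name` with a placeholder value if it is missing.

    Returns True when the key had to be added, False when it already existed.
    """
    if name in datasets:
        return False
    datasets[name] = placeholder
    return True


# Small example registry; the real DATASETS has many more entries.
DATASETS = {"chembl-2048-jaccard": lambda out_fn: None}

added_x = ensure_dataset_key(DATASETS, "X", gist)
added_chembl = ensure_dataset_key(DATASETS, "chembl-2048-jaccard", gist)
print(added_x, added_chembl, sorted(DATASETS))
```

An existing entry is left untouched, so running the check twice does not overwrite a real builder with the placeholder.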
# References
- Omohundro, S. M. Five Balltree Construction Algorithms. _Tech. report, UC Berkeley_**1989**.

@@ -413,4 +414,4 @@ If a new fingerprint rather than ECFP is used, please define a fingerprint calcu

- Datasketch: Big data looks small https://ekzhu.github.io/datasketch (accessed May 31, 2019).

-- Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. _Nucleic Acids Res._**2012**,_40_, 1100–1107.
+- Gaulton, A.; Bellis, L. J.; Bento, P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. _Nucleic Acids Res._**2012**,_40_, 1100–1107.
