# mol_fps
These programs read molecules from a .sdf.gz file and split them into multiple parts so that they can be processed
on multiple nodes in an HPC environment.
To run these scripts, make sure you have an Anaconda environment with RDKit installed.
- module load anaconda/5.1.0

If you are not already using such an environment, here is how to set one up:
- conda create -c rdkit -n my_rdkit_env rdkit python=3.5.2
- source activate my_rdkit_env
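
Before submitting any jobs, it can help to confirm that the activated environment actually exposes RDKit. The following is a minimal sketch using only the standard library; the helper name `has_module` is hypothetical and not part of this repo.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "rdkit" is the import name of the package installed by the conda command above.
    print("rdkit available:", has_module("rdkit"))
```

If this prints `rdkit available: False`, the environment is most likely not activated in the current shell.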
## Partitioning
partition.py parameters:
- dir: the directory where the .sdf.gz file is located
  - (it must be the only .sdf.gz file in that directory)
- T: the threshold, i.e. the maximum number of molecules in each subset
- name: the name you want for each partition
  - example: name="chembl27" will create files chembl27-part0.csv, chembl27-part1.csv, ...
- dest: the directory you want to store the partitioned files in
  - examples: ./ , ./partitioned_chembl27
Example call: python partition.py --dir=./chembl27 --T=10000 --name='chembl27' --dest=./partitioned_chembl27

The above call takes the .sdf.gz file located in the chembl27 directory and partitions its N molecules into ceil(N/10000) .csv files named
chembl27-part#.csv, all of which are written to the directory partitioned_chembl27.
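
The ceil(N/T) arithmetic and the file naming described above can be sketched as follows. This only illustrates the bookkeeping, not the actual code in partition.py; the function name `plan_partitions` and the 25,000-molecule count are assumptions made for the example.

```python
import math


def plan_partitions(n_molecules, threshold, name):
    """Return (filename, start, stop) for each partition; stop is exclusive."""
    n_parts = math.ceil(n_molecules / threshold)
    plan = []
    for i in range(n_parts):
        start = i * threshold
        # The last partition may hold fewer than `threshold` molecules.
        stop = min(start + threshold, n_molecules)
        plan.append(("{}-part{}.csv".format(name, i), start, stop))
    return plan


if __name__ == "__main__":
    # Hypothetical input size, chosen only to show a short final partition.
    for row in plan_partitions(25000, 10000, "chembl27"):
        print(row)
```

With these illustrative numbers the plan has three entries, chembl27-part0.csv through chembl27-part2.csv, and the last one covers only 5,000 molecules.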
## Fingerprints
After partitioning the files, you can compute the fingerprints of the molecules in each partition by running the following script:

run_get_hdf5.py parameters:
- csv: a single partition .csv file
- ofn: the desired name of the output .hdf5 files
  - example: 'chembl27_fps'
- dimension
  - default: 1024
- dtype
  - default: numpy.bool
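
As an illustration of how these parameters and their defaults fit together, here is a minimal command-line sketch. It assumes an argparse-style interface; `--csv` and `--ofn` appear in this README's example call, while the flag names `--dimension` and `--dtype` and the string default `"bool"` are assumptions, so check run_get_hdf5.py for the real spelling.

```python
import argparse


def build_parser():
    """Build a parser mirroring the parameters listed above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Compute fingerprints for one partition (sketch)."
    )
    parser.add_argument("--csv", required=True, help="a single partition .csv file")
    parser.add_argument("--ofn", required=True, help="name for the output .hdf5 files")
    # Assumed flag names: the README lists only the parameters "dimension" and "dtype".
    parser.add_argument("--dimension", type=int, default=1024, help="fingerprint dimension")
    parser.add_argument("--dtype", default="bool", help="element type of the stored array")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--csv", "chembl27-part0.csv", "--ofn", "chembl27_fps"]
    )
    print(args.csv, args.ofn, args.dimension, args.dtype)
```

Leaving out the two optional flags falls back to the documented defaults of 1024 and a boolean dtype.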
To run the scripts through SLURM, submit a run.sh file:
- sbatch run.sh

Here is a template to follow:

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=1
    #SBATCH --array=0-<M-1>
    # Replace <M-1> with your number of .csv files minus one (parts are numbered from 0).
    # SLURM does not expand shell variables in #SBATCH lines, so write the number literally.
    jobid=$SLURM_ARRAY_TASK_ID
    python run_get_hdf5.py --csv=./directory1/filename-part$jobid.csv --ofn=directory2/name$jobid

An example .sh file is included in this repo.
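
Because the array range depends on how many partitions were produced, one option is to generate run.sh from the partition count instead of editing it by hand. The sketch below assumes parts are numbered from 0, as in the partitioning example; the function name `render_run_sh` and the output directory `./fps` are hypothetical.

```python
def render_run_sh(n_csv, csv_dir, csv_name, out_dir, out_name):
    """Return the text of a SLURM array script covering parts 0..n_csv-1."""
    if n_csv < 1:
        raise ValueError("need at least one .csv partition")
    lines = [
        "#!/bin/bash",
        "#SBATCH --ntasks=1",
        "#SBATCH --nodes=1",
        "#SBATCH --cpus-per-task=1",
        # The range is written as a literal because #SBATCH lines are not shell-expanded.
        "#SBATCH --array=0-{}".format(n_csv - 1),
        # No spaces around "=" in a bash assignment.
        "jobid=$SLURM_ARRAY_TASK_ID",
        "python run_get_hdf5.py --csv={}/{}-part$jobid.csv --ofn={}/{}$jobid".format(
            csv_dir, csv_name, out_dir, out_name
        ),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Three partitions and "./fps" are illustrative values only.
    print(render_run_sh(3, "./partitioned_chembl27", "chembl27", "./fps", "chembl27_fps"))
```

Writing the returned text to run.sh and submitting it with sbatch run.sh would launch one task per partition. Alternatively, the range can be given at submission time, for example sbatch --array=0-2 run.sh.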