# mol_fps
These programs read molecules from a .sdf.gz file and split them into multiple parts so they can be processed
on multiple nodes in an HPC environment.
To run these scripts, make sure you have an Anaconda environment with RDKit installed.
- module load anaconda/5.1.0
Here is how to set up such an environment, if you do not already have one:
- conda create -c rdkit -n my_rdkit_env rdkit python=3.5.2
- source activate my_rdkit_env
## Partitioning
partition.py parameters:
- dir: the directory where the .sdf.gz file is located (it must be the only .sdf.gz file in that directory)
- T: the threshold, i.e. the maximum number of molecules in each subset
- name: the base name for each partition
  - example: name="chembl27" will create files chembl27-part0.csv, chembl27-part1.csv, ...
- dest: the directory to store the partitioned files in
  - examples: ./ , ./partitioned_chembl27
Example call: python partition.py --dir=./chembl27 --T=10000 --name='chembl27' --dest=./partitioned_chembl27

The above call takes the .sdf.gz file located in the chembl27 directory and partitions its molecules into ceil(N/10000) .csv files named
chembl27-part#.csv, all of which are written to the directory partitioned_chembl27.
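For reference, here is a minimal sketch of what a partitioning step like this could look like; it is not the repo's exact implementation, and it assumes each .csv partition stores one SMILES string per row, which may differ from what partition.py actually writes:

```python
# Rough sketch of the partitioning step (not the repo's exact implementation).
# Assumption: each .csv partition holds one SMILES string per row.
import glob
import gzip
import os

from rdkit import Chem


def partition_sdf(directory, threshold, name, dest):
    """Split the single .sdf.gz file in `directory` into CSVs of at most
    `threshold` molecules each, written to `dest` as <name>-part<i>.csv."""
    sdf_path = glob.glob(os.path.join(directory, "*.sdf.gz"))[0]
    os.makedirs(dest, exist_ok=True)

    part, count = 0, 0
    out = open(os.path.join(dest, f"{name}-part{part}.csv"), "w")
    with gzip.open(sdf_path, "rb") as fh:
        for mol in Chem.ForwardSDMolSupplier(fh):
            if mol is None:          # skip molecules RDKit cannot parse
                continue
            if count == threshold:   # current partition is full; start a new one
                out.close()
                part += 1
                count = 0
                out = open(os.path.join(dest, f"{name}-part{part}.csv"), "w")
            out.write(Chem.MolToSmiles(mol) + "\n")
            count += 1
    out.close()
```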
## Fingerprints
After partitioning the files, you can compute the fingerprints for each molecule by running the following script (a sketch of the fingerprinting step follows the parameter list).
run_get_hdf5.py parameters:
- csv: the single .csv file to read
- ofn: the desired name of the output .hdf5 file
  - example: 'chembl27_fps'
- dimension
  - default: 1024
- dtype
  - default: numpy.bool
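For reference, here is a minimal sketch of a fingerprinting step like this; it is not the repo's exact implementation. It assumes the partitions hold one SMILES per row, uses Morgan bit vectors as the fingerprint type, and writes a single dataset per .hdf5 file, any of which may differ from what run_get_hdf5.py actually does:

```python
# Rough sketch of the fingerprinting step (not the repo's exact implementation).
# Note: recent NumPy versions removed the numpy.bool alias, so numpy.bool_ is
# used here instead.
import h5py
import numpy
from rdkit import Chem
from rdkit.Chem import AllChem


def csv_to_hdf5(csv, ofn, dimension=1024, dtype=numpy.bool_):
    """Read SMILES from `csv`, compute `dimension`-bit Morgan fingerprints,
    and store them as one dataset in `<ofn>.hdf5`."""
    fps = []
    with open(csv) as fh:
        for line in fh:
            mol = Chem.MolFromSmiles(line.strip())
            if mol is None:          # skip unparsable rows
                continue
            bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=dimension)
            fps.append(numpy.array(list(bv), dtype=dtype))

    with h5py.File(ofn + ".hdf5", "w") as out:
        out.create_dataset("fps", data=numpy.vstack(fps))
```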
To run the scripts through SLURM, submit a run.sh file:
- sbatch run.sh
Here is a template to follow:

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --array=0-M
# Replace M above with the index of your last .csv partition
# (partitions are numbered filename-part0.csv ... filename-partM.csv).

jobid=$SLURM_ARRAY_TASK_ID
python run_get_hdf5.py --csv=./directory1/filename-part$jobid.csv --ofn=directory2/name$jobid
```
An example .sh file is included in this repo.
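Once the array job finishes, each partition's fingerprints can be loaded back for downstream work, for example with h5py. The file path and dataset name below are assumptions based on the template and the sketch above, not the repo's exact conventions:

```python
import h5py

# "fps" is the dataset name used in the sketch above; adjust it to whatever
# run_get_hdf5.py actually writes, and likewise for the file path.
with h5py.File("directory2/name0.hdf5", "r") as fh:
    fps = fh["fps"][:]   # shape: (n_molecules, dimension)
print(fps.shape)
```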