
mol_fps

These programs read molecules from a .sdf.gz file and split them into multiple parts so they can run on multiple nodes in an HPC environment.

To run these scripts, make sure you have an Anaconda environment with RDKit installed.

  • module load anaconda/5.1.0

Here is how to set up an environment, if you are not already using one:

  • conda create -c rdkit -n my_rdkit_env rdkit python=3.5.2
  • source activate my_rdkit_env

Partitioning

partition.py parameters:

- dir: directory where the sdf.gz file is located
    - (must be the only sdf.gz file in that directory)
- T: the threshold, the maximum number of molecules in each subset
- name: the name you want for each partition
    - example: name="chembl27" will create files chembl27-part0.csv, chembl27-part1.csv, ...
- dest: the directory you want to store the partitioned files in
    - examples: ./ , ./partitioned_chembl27

Example call: python partition.py --dir=./chembl27 --T=10000 --name='chembl27' --dest=./partitioned_chembl27

The above call takes the .sdf.gz file located in the chembl27 directory and partitions its N molecules into ceil(N/10000) .csv files named chembl27-part#.csv, all stored in the directory partitioned_chembl27.
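The repository's partition.py is the reference implementation; as a rough orientation only, a minimal sketch of the partitioning step might look like the following, assuming each output .csv holds one SMILES string per line (the real column layout may differ):

import argparse
import glob
import gzip
import os
from rdkit import Chem

parser = argparse.ArgumentParser()
parser.add_argument('--dir', required=True)          # directory holding the single .sdf.gz file
parser.add_argument('--T', type=int, required=True)  # max molecules per partition
parser.add_argument('--name', required=True)         # prefix for the partition files
parser.add_argument('--dest', default='./')          # output directory
args = parser.parse_args()

# Locate the (single) .sdf.gz file in the input directory.
sdf_path = glob.glob(os.path.join(args.dir, '*.sdf.gz'))[0]
os.makedirs(args.dest, exist_ok=True)

supplier = Chem.ForwardSDMolSupplier(gzip.open(sdf_path, 'rb'))
part, count, out = 0, 0, None
for mol in supplier:
    if mol is None:          # skip entries RDKit cannot parse
        continue
    if count % args.T == 0:  # start a new partition file every T molecules
        if out is not None:
            out.close()
        out = open(os.path.join(args.dest, '{}-part{}.csv'.format(args.name, part)), 'w')
        part += 1
    out.write(Chem.MolToSmiles(mol) + '\n')
    count += 1
if out is not None:
    out.close()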

Fingerprints

After partitioning the files, you can compute fingerprints for each molecule by running the following script:

run_get_hdf5.py parameters:

- csv: the path to a single partitioned .csv file
- ofn: the desired name of the output .hdf5 file.
    - example: 'chembl27_fps'
- dimension
    - default: 1024
- dtype
    - default: numpy.bool
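run_get_hdf5.py in the repo is the authoritative implementation. For orientation, here is a minimal sketch of the fingerprinting step, assuming Morgan (ECFP-like) bit-vector fingerprints (the fingerprint type is not stated above), one SMILES per line in the input .csv, and that h5py is available in the environment; numpy.bool is deprecated in recent NumPy, so numpy.bool_ is used below:

import argparse
import h5py
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

parser = argparse.ArgumentParser()
parser.add_argument('--csv', required=True)
parser.add_argument('--ofn', required=True)
parser.add_argument('--dimension', type=int, default=1024)
args = parser.parse_args()

# Read one SMILES per line (assumed layout; adjust to the real .csv format).
with open(args.csv) as f:
    smiles = [line.strip() for line in f if line.strip()]

fps = np.zeros((len(smiles), args.dimension), dtype=np.bool_)
for i, smi in enumerate(smiles):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:  # skip SMILES RDKit cannot parse
        continue
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=args.dimension)
    for bit in bv.GetOnBits():  # set only the on bits in the boolean row
        fps[i, bit] = True

with h5py.File(args.ofn + '.hdf5', 'w') as f:
    f.create_dataset('fps', data=fps)

The stored matrix can then be read back with h5py.File(...)['fps'][:].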

To run these scripts under the SLURM scheduler, submit a run.sh job file: sbatch run.sh

Here is a template to follow:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --array=0-M
# Replace M with the number of .csv files minus one (parts are numbered from 0)

jobid=$SLURM_ARRAY_TASK_ID

python run_get_hdf5.py --csv=./directory1/filename-part$jobid.csv --ofn=directory2/name$jobid

An example .sh file is included in this repo.
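Note that #SBATCH directives are comments parsed by sbatch before the shell runs, so a shell variable cannot be expanded inside the --array line itself. One option (an assumption, not something this repo documents) is to count the partitions and pass the range on the command line instead of hard-coding M, e.g.: N=$(ls ./partitioned_chembl27/*.csv | wc -l); sbatch --array=0-$((N-1)) run.sh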
