# mol_fps

These programs read molecules from a .sdf.gz file and split them into multiple parts so that they can be
processed on multiple nodes in an HPC environment.

To run these programs, make sure you have an Anaconda environment with RDKit installed.

- module load anaconda/5.1.0

If you are not already using an environment, you can set one up as follows:
- conda create -c rdkit -n my_rdkit_env rdkit python=3.5.2
- source activate my_rdkit_env

## Partitioning

partition.py parameters:
  
    - dir: directory where the sdf.gz file is located
        - (must be the only sdf.gz file in that directory)
    - T: the threshold, the maximum number of molecules in each subset
    - name: the name you want for each partition
        - example: name="chembl27" will create files chembl27-part0.csv, chembl27-part1.csv, ...
    - dest: the directory in which to store the partitioned files
        - examples: ./ , ./partitioned_chembl27
      
 Example call: python partition.py --dir=./chembl27 --T=10000 --name='chembl27' --dest=./partitioned_chembl27
 
 The above call takes the .sdf.gz file located in the chembl27 directory and partitions its N molecules into ceil(N/10000) .csv files named
 chembl27-part#.csv, all of which are written to the directory partitioned_chembl27.
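
 The arithmetic behind the partition count and file naming can be sketched as follows. This is a minimal illustration, not the code in partition.py: the helper names `partition_names` and `chunk` are hypothetical, and the records are plain Python values rather than RDKit molecules.

```python
import math


def partition_names(n_molecules, threshold, name):
    """Return the file names for ceil(n_molecules / threshold) partitions."""
    n_parts = math.ceil(n_molecules / threshold)
    return ["{}-part{}.csv".format(name, i) for i in range(n_parts)]


def chunk(records, threshold):
    """Yield consecutive slices holding at most `threshold` records each."""
    for start in range(0, len(records), threshold):
        yield records[start:start + threshold]


if __name__ == "__main__":
    # 25,000 molecules with T=10000 give 3 files; the last one holds 5,000.
    print(partition_names(25000, 10000, "chembl27"))
```

 Only the last partition can hold fewer than T molecules, so the number of files is always ceil(N/T).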
 
 ## Fingerprints
 
 After partitioning the file, you can compute the fingerprints of the molecules in each partition by running the following script:
 
 run_get_hdf5.py parameters:
 
    - csv: a single partitioned .csv file
    - ofn: the desired name for the output .hdf5 files
        - example: 'chembl27_fps'
    - dimension
        - default: 1024
    - dtype
        - default: numpy.bool
        
 To run the scripts through SLURM, submit a run.sh file:
    - sbatch run.sh
    
 Here is a template to follow:
 
    #!/bin/bash
    
    #SBATCH --ntasks=1
    
    #SBATCH --nodes=1
    
    #SBATCH --cpus-per-task=1
    
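    #SBATCH --array=0-<M-1>
    # where M is your number of .csv files (parts are numbered 0 to M-1);
    # replace <M-1> with a literal number, since SLURM does not expand shell variables in #SBATCH lines
    
    jobid=$SLURM_ARRAY_TASK_ID
    
    python run_get_hdf5.py --csv=./directory1/filename-part$jobid.csv --ofn=directory2/name$jobid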
 
 An example .sh file is included in this repo.
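
 Because the partitions are numbered from 0, the last array index is one less than the number of .csv files. A small helper along the following lines could compute the range for you. It is a sketch only: the function name `array_range` is hypothetical, and it assumes the partition files follow the `name-part#.csv` pattern described above.

```python
import glob
import os


def array_range(directory, name):
    """Build the --array range for the partition files found in `directory`."""
    pattern = os.path.join(directory, "{}-part*.csv".format(name))
    n_files = len(glob.glob(pattern))
    if n_files == 0:
        raise ValueError("no files match " + pattern)
    # Partitions are numbered from 0, so the last index is n_files - 1.
    return "0-{}".format(n_files - 1)
```

 For example, `array_range("./partitioned_chembl27", "chembl27")` returns `"0-2"` when the directory holds three partitions. The result can be passed on the command line as `sbatch --array=0-2 run.sh`, which takes precedence over the `#SBATCH --array` line in the script.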