Xiw14035/StructCleansing

StructCleansing

This is a flexible pipeline for filtering unwanted compounds out of a chemical structure database. The current version is designed specifically for the ChEMBL dataset, but the code base is flexible enough to be adapted quickly to other types of database. The filtering rules and procedures can also be adapted to suit individual needs.

Basic usage

python ChemFilter.py [input file path] -o [output file path]

Filtering procedure

Filtering out unwanted compounds is achieved by the following steps:

  1. Remove salt and solvent:

    Salt ions and solvent molecules are removed from the compound, and for salts the remaining fragment is neutralized. This procedure may create duplicate compounds, since some of the remaining fragments are also recorded separately in the ChEMBL dataset.

  2. Remove mixtures and isotope-containing compounds

  3. Rule-based filtering: The default rule set contains rules based on molecular descriptors (such as molecular weight, number of rotatable bonds, etc.) and a SMARTS-pattern-based rule set that was carefully engineered to filter out compounds that are likely too difficult to synthesize under practical conditions or unlikely to have drug activity.
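
The first two steps can be illustrated with a small, dependency-free sketch. This is not the code in ChemFilter.py: it works on raw SMILES strings, the `KNOWN_SALTS_AND_SOLVENTS` set and all function names are hypothetical, and it skips neutralization entirely. A real pipeline would normally parse the molecules with a cheminformatics toolkit such as RDKit, because plain string matching does not recognize equivalent fragments that are written differently.

```python
import re

# Hypothetical, deliberately tiny list of counter-ions and solvents,
# written as SMILES fragments. A real list would be much longer.
KNOWN_SALTS_AND_SOLVENTS = {"[Na+]", "[K+]", "[Cl-]", "Cl", "O", "CO"}

# In SMILES an isotope label is a number right after the opening
# bracket of an atom, e.g. [13C] or [2H].
ISOTOPE = re.compile(r"\[\d+[A-Za-z]")


def strip_salts_and_solvents(smiles):
    """Drop fragments (separated by '.') that match the known list."""
    fragments = smiles.split(".")
    kept = [f for f in fragments if f not in KNOWN_SALTS_AND_SOLVENTS]
    return ".".join(kept)


def is_mixture(smiles):
    """More than one fragment left after stripping means a mixture."""
    return "." in smiles


def has_isotope(smiles):
    """True if any bracket atom carries an isotope label."""
    return ISOTOPE.search(smiles) is not None


def prefilter(smiles):
    """Return the cleaned SMILES, or None if the compound is rejected."""
    cleaned = strip_salts_and_solvents(smiles)
    if not cleaned or is_mixture(cleaned) or has_isotope(cleaned):
        return None
    return cleaned
```

With these assumptions, `prefilter("CCN.Cl")` keeps only the amine fragment, while `prefilter("CCO.CCN")` and `prefilter("[13CH4]")` both return `None`. Note that `prefilter("CC(=O)[O-].[Na+]")` returns the still-charged acetate fragment, since the neutralization described in step 1 is left out of this sketch.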

Options

  1. [mandatory] input file path

    The input file path has to be passed in. It should be a ChEMBL database file containing SMILES strings.

  2. -o

    --output_file: output file location. This should include the file extension such as .csv.

  3. -c

    --combined_result: the output file will contain the combined results, including both the filtered-out compounds (marked as False in the "result" column) and the kept compounds.

  4. -cl

    --smarts_collection: file location of the collection of SMARTS patterns. This is a CSV file containing the collection of SMARTS patterns that the filtering code chooses from. The default collection file has been carefully engineered, but it can be modified, or a new file can be specified using this option.

  5. -f

    --filter_rule: file location of the filter rules. This is a JSON file containing all the rules for filtering out unwanted compounds. A rule can be descriptor-based (the rule name corresponds to the RDKit function name), a raw SMARTS pattern string (the rule name has to begin with "SMARTS_"), or an entire rule set (the rule name has to begin with "Rule_set_", and the rest of the rule name corresponds to a rule set name in the collection of SMARTS patterns specified by option -cl).

  6. -ncpu

    --num_cpu: number of CPU cores to use; -1 means use all available cores.
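
The naming scheme for the -f rule file can be sketched as a small dispatcher. This is only an illustration of the prefixes described above, not the repository's code: the function names, the example rule names ("SMARTS_nitro", "Rule_set_example") and the `null` values are all hypothetical, since the value format of the JSON file is not documented here. The sketch also does not assume whether a SMARTS pattern lives in the rule name or in its value; it only strips the prefix to get a label.

```python
import json

SMARTS_PREFIX = "SMARTS_"
RULE_SET_PREFIX = "Rule_set_"


def classify_rule(name):
    """Return (kind, label) for a rule name based on its prefix."""
    if name.startswith(SMARTS_PREFIX):
        return "smarts", name[len(SMARTS_PREFIX):]
    if name.startswith(RULE_SET_PREFIX):
        # The remainder is the rule set name to look up in the
        # SMARTS collection file given by -cl.
        return "rule_set", name[len(RULE_SET_PREFIX):]
    # Anything else is treated as an RDKit descriptor function name.
    return "descriptor", name


def group_rules(rules):
    """Group the rule names of a parsed rule file by kind."""
    grouped = {"descriptor": [], "smarts": [], "rule_set": []}
    for name in rules:
        kind, label = classify_rule(name)
        grouped[kind].append(label)
    return grouped


# Hypothetical rule file content; the values are placeholders only.
example = json.loads(
    '{"MolWt": null, "SMARTS_nitro": null, "Rule_set_example": null}'
)
grouped = group_rules(example)
```

Here `grouped` sorts "MolWt" under the descriptor rules, "nitro" under the raw SMARTS rules, and "example" under the rule sets.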
