StructCleansing
StructCleansing is a flexible pipeline for filtering unwanted compounds out of a chemical structure database. The current version is designed specifically for the ChEMBL dataset, but the code base can be quickly adapted to other databases, and the filtering rules and procedures can be adjusted to suit individual needs.
Basic usage
python ChemFilter.py [input file path] -o [output file path]
Filtering procedure
Filtering out unwanted compounds is achieved by the following steps:
- Remove salt and solvent: Salt ions and solvent molecules are removed from the compound, and for salts the remaining fragment is neutralized. This step may create duplicate compounds, because some of the remaining fragments are also recorded separately in the ChEMBL dataset.
- Remove mixtures and isotope-containing compounds.
- Rule-based filtering: The default rule set contains rules based on molecular descriptors (such as molecular weight and number of rotatable bonds) and a SMARTS-pattern-based rule set engineered to filter out compounds that are likely too difficult to synthesize under practical conditions or unlikely to have drug activity.
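The first two steps can be illustrated with a string-level sketch. This is not the pipeline's own code: it works on raw SMILES text with only the standard library, the salt and solvent lookup `SALTS_AND_SOLVENTS` is a tiny hypothetical list, and all function names are made up for illustration. Neutralization of charged fragments is not covered, since that needs a cheminformatics toolkit.

```python
import re

# Hypothetical, deliberately tiny lookup of counterions and solvents written as
# SMILES components. Matching is by exact string, so other spellings are missed.
SALTS_AND_SOLVENTS = {"[Na+]", "[K+]", "[Cl-]", "[Br-]", "Cl", "Br", "O", "CO", "CCO"}

# In SMILES, an isotope is a mass number written right after "[" (e.g. [13C], [2H]).
ISOTOPE_PATTERN = re.compile(r"\[\d+")


def strip_salts_and_solvents(smiles):
    """Drop '.'-separated components that are in the salt/solvent lookup."""
    components = smiles.split(".")
    kept = [c for c in components if c not in SALTS_AND_SOLVENTS]
    # If every component was a salt or solvent, keep the input unchanged.
    return ".".join(kept) if kept else smiles


def is_mixture(smiles):
    """More than one component left after stripping means a mixture."""
    return "." in smiles


def has_isotope(smiles):
    """True if any bracket atom carries an explicit mass number."""
    return ISOTOPE_PATTERN.search(smiles) is not None


def clean(smiles_list):
    """Apply salt/solvent stripping, then drop mixtures and isotope-labelled entries."""
    kept = []
    for smiles in smiles_list:
        parent = strip_salts_and_solvents(smiles)
        if is_mixture(parent) or has_isotope(parent):
            continue
        kept.append(parent)
    return kept


records = ["CCN.Cl", "CCN", "CCO.c1ccccc1N", "[2H]C([2H])([2H])N", "CC.CCCN"]
print(clean(records))
```

In this example the hydrochloride `CCN.Cl` and the free base `CCN` both end up as `CCN`, which is the kind of duplicate the salt-removal step can produce.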
Options
- [mandatory] input file path: This argument is required. It should be a ChEMBL database file containing SMILES strings.
- -o, --output_file: output file location, including the file extension (such as .csv).
- -c, --combined_result: the output file will contain the combined results, including both the filtered-out compounds (marked as False in the "result" column) and the kept compounds.
- -cl, -smarts_collection: file location of the SMARTS pattern collection. This is a CSV file containing the SMARTS patterns that the filtering code chooses from. The default collection file has been carefully engineered, but it can be modified, or a new file can be specified with this option.
- -f, --filter_rule: file location of the filter rules. This is a JSON file containing all the rules for filtering out unwanted compounds. A rule can be descriptor-based (the rule name is the RDKit function name), a raw SMARTS pattern string (the rule name must begin with "SMARTS_"), or an entire rule set (the rule name must begin with "Rule_set_", and the rest of the name is the rule set name in the SMARTS pattern collection specified by the -cl option).
- -ncpu, --num_cpu: number of CPU cores to use; -1 means use all available cores.
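The rule naming scheme used by the filter rule file can be sketched as a small dispatcher. Only the three name conventions come from the description above; the JSON value layout (`min`/`max` bounds, a pattern string, a boolean), the rule set name `my_rules`, and the helper names are assumptions for illustration, so check the default rule file for the real schema.

```python
import json

SMARTS_PREFIX = "SMARTS_"
RULE_SET_PREFIX = "Rule_set_"


def classify_rule(name):
    """Return (kind, key) for a rule name based on its prefix."""
    if name.startswith(SMARTS_PREFIX):
        # Raw SMARTS rule: keep the full rule name as the key.
        return ("smarts", name)
    if name.startswith(RULE_SET_PREFIX):
        # The remainder is the rule set name looked up in the SMARTS collection.
        return ("rule_set", name[len(RULE_SET_PREFIX):])
    # Anything else is treated as an RDKit descriptor function name.
    return ("descriptor", name)


def group_rules(rules):
    """Group the rule names of a parsed rule file by kind."""
    grouped = {"descriptor": [], "smarts": [], "rule_set": []}
    for name in rules:
        kind, key = classify_rule(name)
        grouped[kind].append(key)
    return grouped


# Hypothetical rule file content; the values are illustrative only.
EXAMPLE_RULES = json.loads("""
{
  "MolWt": {"min": 150, "max": 600},
  "NumRotatableBonds": {"max": 10},
  "SMARTS_nitro": "[N+](=O)[O-]",
  "Rule_set_my_rules": true
}
""")

grouped = group_rules(EXAMPLE_RULES)
print(grouped)
```

Dispatching on the prefix keeps the rule file flat: adding a new descriptor rule only requires a new key whose name matches an RDKit function.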