Welcome to the chemical structure cleaning pipeline (ChemStructClean)!

This is a flexible pipeline for filtering unwanted compounds out of a chemical structure database. The current version is designed specifically for the ChEMBL dataset, but the code base is flexible enough to be adapted quickly to other types of databases, and the filtering rules and procedures can be adjusted to suit individual needs. It is built upon two existing code bases: ChEMBL_Structure_Pipeline [1], from which the lists of salt ions and solvents were taken, and rd_filters [2], from which the SMARTS- and molecular-descriptor-based filtering processes were adapted.

Requirements

The required Python libraries are NumPy, Pandas, and RDKit. The first two can be installed by running the command: pip install -r requirements.txt. RDKit can be installed by following the instructions on the official website: https://www.rdkit.org/docs/Install.html

Options

[mandatory] input file path

The input file path must be supplied. The file should be a text file containing SMILES representations of chemical structures, formatted like the ChEMBL database file.

-o, --output_file: output file location, including the file extension (such as .csv). If unspecified, the default is "output.csv" in the directory from which the script is executed.

-c, --combined_result: when enabled, a single output file contains the combined results: compounds that failed the filtering process are marked as False in the "result" column, and compounds that passed are marked as True. When not enabled, the output is written as two separate files: one with the suffix "_keep" containing the structures that passed the filtering, and one with the suffix "_discard" containing those that failed.

-cl, -smarts_collection: file location of the SMARTS pattern collection. This is a CSV file containing the SMARTS patterns from which the filtering code selects. The default collection file has been carefully engineered, but it can be modified, or a different file can be specified with this option.

-f, --filter_rule: file location of the filter rules. This is a JSON file containing all the rules for filtering out unwanted compounds. A rule can be descriptor-based (the rule name corresponds to the RDKit function name), a raw SMARTS pattern string (the rule name must begin with "SMARTS_"), or an entire rule set (the rule name must begin with "Rule_set_", and the rest of the name corresponds to a rule set name in the SMARTS pattern collection specified by the option -cl).

-ncpu, --num_cpu: number of CPU cores to use; -1 means use all available cores.
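
The behaviour of the output and CPU options can be sketched as follows. This is a minimal illustration rather than the code in ChemFilter.py: the helper names resolve_output_paths and resolve_num_cpu are hypothetical, and placing the "_keep" and "_discard" suffixes between the file stem and the extension is an assumption.

```python
import os
from pathlib import Path


def resolve_output_paths(output_file="output.csv", combined_result=False):
    """Return the output paths implied by -o and -c (hypothetical helper)."""
    path = Path(output_file)
    if combined_result:
        # One file; passing and failing compounds are told apart
        # by the True/False value in the "result" column.
        return [path]
    # Two files. Assumption: the suffix goes between stem and extension.
    return [
        path.with_name(f"{path.stem}{suffix}{path.suffix}")
        for suffix in ("_keep", "_discard")
    ]


def resolve_num_cpu(num_cpu):
    """Map the -ncpu value to a worker count; -1 means all available cores."""
    if num_cpu == -1:
        # os.cpu_count() can return None, so fall back to a single worker.
        return os.cpu_count() or 1
    return num_cpu
```

Under these assumptions, resolve_output_paths("filtered_chembl.csv") yields the paths filtered_chembl_keep.csv and filtered_chembl_discard.csv.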

Basic usage

python ChemFilter.py [input file path] -o [output file path]

An example of such usage: python ChemFilter.py chembl_28_chemreps.txt -o filtered_chembl.csv
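
For orientation, the sketch below reads identifiers and SMILES strings from a ChEMBL-style input file using only the standard library. It assumes the tab-separated layout of the ChEMBL chemreps download, with a header row that includes "chembl_id" and "canonical_smiles" columns. The function name read_smiles and the sample identifiers are hypothetical, and ChemFilter.py may load its input differently.

```python
import csv
import io


def read_smiles(handle, id_column="chembl_id", smiles_column="canonical_smiles"):
    """Return (identifier, SMILES) pairs from a tab-separated file with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row[id_column], row[smiles_column]) for row in reader]


# In-memory stand-in for the input file; identifiers are placeholders.
sample = io.StringIO(
    "chembl_id\tcanonical_smiles\tstandard_inchi\tstandard_inchi_key\n"
    "ID_1\tCCO\t(omitted)\t(omitted)\n"
    "ID_2\tc1ccccc1\t(omitted)\t(omitted)\n"
)
records = read_smiles(sample)
```

With a real file, the same function can be called on an open file handle, for example inside "with open(path, newline='') as handle:".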

Advanced usage

python ChemFilter.py [input file path] -o [output file path] -cl [SMARTS rules collection file] -f [filter rule JSON file] -ncpu [number of CPUs to use]

An example of such usage that specifies each of these options: python ChemFilter.py chembl_28_chemreps.txt -o filtered_chembl.csv -cl alert_collection.csv -f rules_one_step.json -ncpu 8

Customize filtering rules and criteria

Descriptor-based rules:

All descriptor-based rules are specified in the filter rule JSON file, such as "rules_one_step.json". Each rule is a key-value pair in which the key is the name of a descriptor function implemented in RDKit and the value is a two-element list giving the minimum and maximum allowed values of that descriptor. (Descriptor functions from RDKit are imported from the following modules: rdkit.Chem.Descriptors, rdkit.Chem.rdMolDescriptors, rdkit.Chem.Lipinski)
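
The range check described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: the function names in_range and failed_rules are hypothetical, the bounds are treated as inclusive, and the string "None" is taken to switch off a bound (the sketch also accepts it for the lower bound, which is an assumption). To keep the example runnable without RDKit, the descriptor values are hard-coded; in practice they would be computed from a molecule by the corresponding RDKit functions.

```python
import json

# A small rule file in the format described above (contents are illustrative).
RULES_JSON = '{"RingCount": [0, 9], "HeavyAtomCount": [4, "None"]}'


def in_range(value, bounds):
    """Check value against [minimum, maximum]; "None" means no limit on that side."""
    lower, upper = bounds
    if lower != "None" and value < lower:
        return False
    if upper != "None" and value > upper:
        return False
    return True


def failed_rules(descriptor_values, rules):
    """Return the names of the descriptor rules that a compound violates."""
    return [
        name
        for name, bounds in rules.items()
        if not in_range(descriptor_values[name], bounds)
    ]


rules = json.loads(RULES_JSON)

# Hard-coded descriptor values standing in for RDKit results.
benzene = {"RingCount": 1, "HeavyAtomCount": 6}
methane = {"RingCount": 0, "HeavyAtomCount": 1}

benzene_failures = failed_rules(benzene, rules)
methane_failures = failed_rules(methane, rules)
```

Here benzene violates no rule, while methane fails "HeavyAtomCount" because it has fewer than 4 heavy atoms.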

Examples:

"RingCount": [0, 9]: The number of rings, calculated by the function RingCount (rdkit.Chem.Lipinski.RingCount(mol)), must be between 0 and 9, inclusive, to pass this rule.
"HeavyAtomCount": [4, "None"]: The number of heavy atoms, calculated by the function HeavyAtomCount (rdkit.Chem.Lipinski.HeavyAtomCount(mol)), must be greater than or equal to 4 to pass this rule. "None" is given as the upper bound to remove the upper limit.
"LargestRingSize": [0, 10]: The function LargestRingSize is custom-made and is defined in the Python file "custom_functions.py". Users can define further descriptors in this Python file in the same way and invoke them in the rule JSON file.

SMARTS-based rules:

All SMARTS rules are listed in a CSV file such as "alert_collection.csv". Users can either modify this file or create a separate file with the same column structure and point to the new file with the option -cl. The format of these files is inherited from the rd_filters code base. The relevant columns are: "smarts" (the actual SMARTS string), "rule_id" (numeric ID of the individual rule), and "rule_set_name" (the name of the rule set). The other columns are there for bookkeeping purposes.
A SMARTS rule in the collection file can be invoked individually or as part of a set:
- Individually: specify in the filter rule JSON file a key-value pair whose key begins with "Rule_ID_". An example is "Rule_ID_1002": true. Alternatively, a SMARTS pattern can be specified directly as a key-value pair whose key begins with "SMARTS_", such as "SMARTS_1": "[n+]".
- As a set: specify in the filter rule JSON file a key-value pair whose key begins with "Rule_set_". An example is "Rule_set_Combined": true.
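
The three ways of invoking SMARTS rules can be sketched as a lookup against the collection file. This is a minimal illustration, not the pipeline's own code: the function name resolve_smarts is hypothetical, the collection rows and rule set names (Demo_A, Demo_B) are made up for the example, and two behaviours are assumptions: that a raw SMARTS rule stores the pattern as a JSON string, and that a value of false leaves a "Rule_ID_" or "Rule_set_" rule switched off.

```python
import csv
import io

# Made-up collection rows using the three relevant columns.
COLLECTION_CSV = """rule_id,rule_set_name,smarts
1,Demo_A,[N+](=O)[O-]
2,Demo_A,C(=O)Cl
3,Demo_B,[SH]
"""


def resolve_smarts(rules, collection_rows):
    """Collect the SMARTS strings selected by the keys of a filter-rule mapping."""
    selected = []
    for key, value in rules.items():
        if key.startswith("SMARTS_"):
            # Raw pattern given directly in the rule file.
            selected.append(value)
        elif key.startswith("Rule_ID_") and value:
            wanted = key[len("Rule_ID_"):]
            selected += [r["smarts"] for r in collection_rows if r["rule_id"] == wanted]
        elif key.startswith("Rule_set_") and value:
            wanted = key[len("Rule_set_"):]
            selected += [
                r["smarts"] for r in collection_rows if r["rule_set_name"] == wanted
            ]
        # Any other key (for example a descriptor rule) is ignored here.
    return selected


rows = list(csv.DictReader(io.StringIO(COLLECTION_CSV)))
rules = {
    "Rule_ID_3": True,
    "SMARTS_1": "[n+]",
    "Rule_set_Demo_A": True,
    "RingCount": [0, 9],
}
patterns = resolve_smarts(rules, rows)
```

The resulting list holds one pattern for the single rule ID, the raw pattern, and both patterns of the Demo_A set. A compound would then be tested against each pattern with RDKit substructure matching.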

Filtering algorithm

Filtering out unwanted compounds is achieved by the following steps:

Remove salt and solvent: salt ions and solvent molecules are removed from the compound and, in the case of a salt, the remaining fragment is neutralized. This procedure may create duplicate compounds, since some of the remaining fragments are also recorded separately in the ChEMBL dataset.
Remove mixtures and isotope-containing compounds.
Rule-based filtering: the default rule set contains a set of molecular-descriptor-based rules (such as molecular weight and number of rotatable bonds) and a SMARTS-pattern-based rule set that was carefully engineered to filter out compounds that are likely to be too difficult to synthesize under practical conditions or unlikely to have drug activity.
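
The first two steps can be sketched at the level of SMILES strings. This is a deliberately simplified illustration, not the pipeline's implementation, which works on RDKit molecules and uses the salt and solvent lists taken from ChEMBL_Structure_Pipeline. The fragment list below is a tiny made-up sample, fragments are compared as exact strings, neutralization and the rule-based step are omitted, and the function names are hypothetical.

```python
import re

# Illustrative fragments only (sodium ion, chloride ion, HCl, water).
SALTS_AND_SOLVENTS = {"[Na+]", "[Cl-]", "Cl", "O"}

# A bracket atom that starts with a mass number, such as [13C] or [2H].
ISOTOPE = re.compile(r"\[\d+")


def strip_salts_and_solvents(smiles):
    """Drop dot-separated fragments that appear in the salt/solvent list."""
    return [frag for frag in smiles.split(".") if frag not in SALTS_AND_SOLVENTS]


def keep_compound(smiles):
    """Return True if the compound survives salt stripping and both removals."""
    fragments = strip_salts_and_solvents(smiles)
    if len(fragments) != 1:
        # More than one fragment left: a mixture. None left: nothing to keep.
        return False
    if ISOTOPE.search(fragments[0]):
        # Isotope-labelled compound.
        return False
    return True
```

For example, "CCN.Cl" is kept after the HCl fragment is stripped, while "CCO.CCN" is rejected as a mixture and "[13CH4]" is rejected as isotope-containing.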

References

[1] Bento, A.P., Hersey, A., Félix, E. et al. An open source chemical structure curation pipeline using RDKit. J Cheminform 12, 51 (2020). https://doi.org/10.1186/s13321-020-00456-1
[2] Pat Walters, rd_filters, GitHub repository (2018), https://github.com/PatWalters/rd_filters
