Train

ehreact.train contains functions and classes for calculating a Hasse diagram.

Calculate Diagram

Classes and functions from ehreact.train.calculate_diagram.py.

calculate_diagram.py: Training Hasse diagrams.

ehreact.train.calculate_diagram.calculate_diagram(smiles, verbose=False, quiet=True, compute_aam=False, save_path=None, save_plot=None, train_mode='transition_state', seed=[], no_props=False, plot_only_branches=False, temp_dir_img=None)

Computes a Hasse diagram of a list of reaction or molecule SMILES.

Parameters:
  • smiles (List[str]) – List of SMILES or reaction SMILES.

  • verbose (bool, default False) – Whether to print additional information.

  • quiet (bool, default True) – Whether to silence all output.

  • compute_aam (bool, default False) – Whether to compute atom-mappings for reactions.

  • save_path (str, default None) – File to which diagram is saved.

  • save_plot (str, default None) – File to which the image of the diagram is saved.

  • train_mode (Literal["single_reactant", "transition_state"], default "transition_state") – Train mode: either transition states extracted from reaction SMILES or single reactants extracted from SMILES.

  • seed (List[str], default []) – List of SMILES seeds for the reactant algorithm; usually a single seed is given.

  • no_props (bool, default False) – Do not compute any properties, just output the diagram.

  • plot_only_branches (bool, default False) – Plot only substructures that branch off.

  • temp_dir_img (str, default None) – Directory in which temporary image files are saved.

Returns:

d – The Hasse diagram of the input list of molecules/reactions.

Return type:

ehreact.diagram.diagram.Diagram

ehreact.train.calculate_diagram.calculate_diagram_single_reactant(smiles, seed_list, verbose, quiet)

Computes a Hasse diagram of a list of molecule SMILES.

Parameters:
  • smiles (List[str]) – List of SMILES.

  • seed_list (List[str]) – List of SMILES seeds.

  • verbose (bool) – Whether to print additional information.

  • quiet (bool) – Whether to silence all output.

Returns:

  • d (ehreact.diagram.diagram.Diagram) – The Hasse diagram of the input list of molecules.

  • smiles_dict (dict) – A dictionary of the canonicalized input SMILES.

ehreact.train.calculate_diagram.calculate_diagram_transition_state(smiles, verbose, quiet, compute_aam)

Computes a Hasse diagram of a list of reaction SMILES.

Parameters:
  • smiles (List[str]) – List of reaction SMILES.

  • verbose (bool) – Whether to print additional information.

  • quiet (bool) – Whether to silence all output.

  • compute_aam (bool) – Whether to compute atom-mappings for reactions.

Returns:

  • d (ehreact.diagram.diagram.Diagram) – The Hasse diagram of the input list of reactions.

  • smiles_dict (dict) – A dictionary of the canonicalized input SMILES.

ehreact.train.calculate_diagram.calculate_diversity(d, node, stereo, train_mode)

Calculates diversity within a branch or tree.

Parameters:
  • d (ehreact.diagram.diagram.Diagram) – Hasse diagram.

  • node (ehreact.diagram.diagram.Node) – Node for which to calculate diversity, all leaf nodes attached to this node by an arbitrary number of edges toward children are taken into account.

  • stereo (bool) – Whether to include stereochemistry in fingerprints.

  • train_mode (Literal["single_reactant", "transition_state"]) – Train mode: either transition states extracted from reaction SMILES or single reactants extracted from SMILES.

Returns:

  • mean_div_reac (float) – Mean pair similarity of reactants.

  • mean_div_prod (float) – Mean pair similarity of products.
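To make "mean pair similarity" concrete, the following is a minimal sketch and not ehreact's implementation. This section does not state which fingerprint or similarity metric is used, so the Tanimoto metric, the representation of fingerprints as sets of on-bit indices, and both helper names are assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def mean_pair_similarity(fingerprints):
    """Average similarity over all unordered pairs (hypothetical helper)."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        # Assumption: with fewer than two items there is nothing to compare.
        return 1.0
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Three toy fingerprints; the first and third are identical.
fps = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
print(mean_pair_similarity(fps))
```

A value close to 1 indicates that the leaves below the node are very similar to each other; lower values indicate a more diverse branch.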

ehreact.train.calculate_diagram.fill_information(d, train_mode, verbose, smiles_dict)

Function to fill topology information and fingerprints into a Hasse diagram (alters diagram in-place).

Parameters:
  • d (ehreact.diagram.diagram.Diagram) – Hasse diagram.

  • train_mode (Literal["single_reactant", "transition_state"]) – Train mode: either transition states extracted from reaction SMILES or single reactants extracted from SMILES.

  • verbose (bool) – Whether to print additional information.

  • smiles_dict (dict) – A dictionary of the canonicalized input SMILES.

ehreact.train.calculate_diagram.find_lowest_template(curr_node, d)

Function to find the lowest (most general) substructure/reaction rule in the tree.

Parameters:
  • curr_node (ehreact.diagram.diagram.Node) – Node for which to find the lowest template.

  • d (ehreact.diagram.diagram.Diagram) – Hasse diagram.

Returns:

lowest_template – Name of the lowest template.

Return type:

str

ehreact.train.calculate_diagram.write_fragment_list_to_root(d, train_mode, verbose, smiles_dict)

Function to calculate a list of reactant rule fragments (only atoms in the reaction center). This is needed to transform input molecules to their corresponding transition state. The list is saved to the root node (in-place).

Parameters:
  • d (ehreact.diagram.diagram.Diagram) – Hasse diagram.

  • train_mode (Literal["single_reactant", "transition_state"]) – Train mode: either transition states extracted from reaction SMILES or single reactants extracted from SMILES.

  • verbose (bool) – Whether to print additional information.

  • smiles_dict (dict) – A dictionary of the canonicalized input SMILES.

Train

Classes and functions from ehreact.train.train.py.

train.py: Entry point for training Hasse diagrams.

ehreact.train.train.train(args)

Computes a Hasse diagram based on the input arguments.

Parameters:

args (Namespace) – Namespace of arguments.

Hasse

Classes and functions from ehreact.train.hasse.py.

ehreact.train.hasse.check_one_extension(pivot_rule, max_possible_rules)

Function to iterate over all current molecules/pseudomolecules and check whether each of them allows for an extension resulting in the current pivot_rule.

Parameters:
  • pivot_rule (rdkit.Chem.Mol) – A possible new template.

  • max_possible_rules (dict) – A dictionary of the matching atoms and possible new templates for all molecules/pseudomolecules.

Returns:

Whether or not the pivot rule has a substructure match with any of the possible extensions.

Return type:

bool

ehreact.train.hasse.enlarge_rule(rule, mols, smiles, change_dict, verbose)

Function to look for possible atoms to add to current template, find best combination, and create a new template.

Parameters:
  • rule (rdkit.Chem.Mol) – RDKit molecule object of the current template (molecule or pseudomolecule).

  • mols (List[rdkit.Chem.Mol]) – List of RDKit molecule objects of the molecules or pseudomolecules which are considered in the current branch.

  • smiles (List[str]) – List of SMILES strings of the molecules or pseudomolecules which are considered in the current branch.

  • change_dict (dict) – A dictionary of all changes upon going from reactants to products.

  • verbose (bool) – Whether to print additional information.

Returns:

new_rules – A dictionary of new templates (children of the current node); it may contain one or multiple templates.

Return type:

dict

ehreact.train.hasse.extend_by_atom(patt, m, match, idx_list, ring_extension=False)

Extends the current rule (‘patt’) by adding neighbours of the atoms specified in the list of atom indices ‘idx_list’ according to the molecule ‘m’.

Parameters:
  • patt (rdkit.Chem.Mol) – RDKit molecule object of the current template (molecule or pseudomolecule).

  • m (rdkit.Chem.Mol) – RDKit molecule object of the current leaf node.

  • match (tuple) – Tuple of matching atom indices of the current rule.

  • idx_list (list) – List of atom indices at which to extend the pattern.

  • ring_extension (bool) – Boolean whether current iteration has broken a ring and must thus be iterated until full ring is found.

Returns:

new_patt – RDKit molecule object of the new, extended template (molecule or pseudomolecule).

Return type:

rdkit.Chem.Mol

Examples

For the molecule CCOCO, the rule CCO matches the first three atoms and can be extended at atom 2 (the oxygen), yielding the new, extended molecule CCOC:

>>> from rdkit import Chem
>>> from ehreact.train.hasse import extend_by_atom
>>> rule = Chem.MolFromSmiles("CCO", sanitize=False)
>>> new_rule = extend_by_atom(rule, Chem.MolFromSmiles("CCOCO"), (0, 1, 2), [2])
>>> Chem.MolToSmiles(new_rule)
'COCC'
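
The core step, collecting neighbours of the chosen atoms that are not yet part of the match, can be sketched on a plain adjacency list. This is an illustration only and not ehreact's implementation: `extend_match` is a hypothetical helper, the adjacency dictionary stands in for an RDKit molecule, the indices in `idx_list` are assumed to refer to atoms of the molecule, and ring handling (`ring_extension`) is left out.

```python
def extend_match(adjacency, match, idx_list):
    """Return the match enlarged by unmatched neighbours of the atoms in idx_list.

    Hypothetical helper: adjacency maps an atom index to its neighbour indices
    and stands in for an RDKit molecule.
    """
    matched = set(match)
    added = []
    for idx in idx_list:
        for neighbour in adjacency[idx]:
            if neighbour not in matched:
                matched.add(neighbour)
                added.append(neighbour)
    return tuple(match) + tuple(added)


# Linear chain C0-C1-O2-C3-O4, i.e. the heavy-atom graph of CCOCO.
ccoco = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(extend_match(ccoco, (0, 1, 2), [2]))  # (0, 1, 2, 3)
```

Atoms 0 to 3 of CCOCO correspond to the extended template CCOC from the example above.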

ehreact.train.hasse.extend_by_single_match(possible_extensions, match, idx, mols, smiles, rule, change_dict, verbose, max_possible_rules)

Enlarges the current template by selecting the best extensions.

Parameters:
  • possible_extensions (dict) – A dictionary of possible extensions for each input molecule, containing the matching indices, atom indices of atoms to extend, and their possible extension as string.

  • match (tuple) – Tuple of the matching atom indices for the pivot molecule.

  • idx (str) – Index of pivot molecule.

  • mols (List[rdkit.Chem.Mol]) – List of RDKit molecule objects of the molecules or pseudomolecules which are considered in the current branch.

  • smiles (List[str]) – List of SMILES strings of the molecules or pseudomolecules which are considered in the current branch.

  • rule (rdkit.Chem.Mol) – RDKit molecule object of the current template (molecule or pseudomolecule).

  • change_dict (dict) – A dictionary of all changes upon going from reactants to products.

  • verbose (bool) – Whether to print additional information.

  • max_possible_rules (dict) – Dictionary of possible extensions, containing the atom indices of the matching atoms, as well as all the possible enlarged templates as string and RDKit molecule.

Returns:

new_patterns – Dictionary of new templates, to be attached to the current node as children, including the list of molecules and SMILES belonging to each new template.

Return type:

dict

Examples

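>>> from rdkit import Chem
>>> from ehreact.train.hasse import extend_by_single_match, get_max_possible_rules
>>> smis = ["CCOCO", "CCOOC"]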
>>> mols = [Chem.MolFromSmiles(smi) for smi in smis]
>>> rule = Chem.MolFromSmiles("CCO", sanitize=False)
>>> possible_extensions = {'CCOCO': {(0, 1, 2): {2: 'OC'}}, 'CCOOC': {(0, 1, 2): {2: 'OO'}}}
>>> change_dict={"reac": {"atom": {}, "bond": {}}, "prod": {"atom": {}, "bond": {}}}
>>> max_possible_rules=get_max_possible_rules(possible_extensions,rule,mols,change_dict)
>>> extend_by_single_match(possible_extensions,(0,1,2),0,mols,smis,rule,change_dict,False,max_possible_rules)
{'COCC': {'rule': <rdkit.Chem.rdchem.Mol at 0x7ffc4d7592b0>,
          'mols': [<rdkit.Chem.rdchem.Mol at 0x7ffc4d75eda0>], 'smiles': ['CCOCO']},
 'CCOO': {'rule': <rdkit.Chem.rdchem.Mol at 0x7ffc4d759f30>,
          'mols': [<rdkit.Chem.rdchem.Mol at 0x7ffc4d75e1c0>], 'smiles': ['CCOOC']}}
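
The shape of the returned dictionary, new templates as keys with the molecules belonging to each, can be illustrated by a simple grouping. This sketch does not reproduce how ehreact selects the best extensions; `group_by_template` is a hypothetical helper, and the RDKit molecule values are replaced by `None` because only the keys matter here.

```python
from collections import defaultdict

# Structure modelled on the documented get_max_possible_rules output;
# None stands in for the RDKit molecule objects.
max_possible_rules = {
    "CCOCO": {(0, 1, 2): {"COCC": None}},
    "CCOOC": {(0, 1, 2): {"CCOO": None}},
}


def group_by_template(rules):
    """Map each candidate template to the input SMILES that can reach it."""
    grouped = defaultdict(list)
    for smiles, matches in rules.items():
        for templates in matches.values():
            for template in templates:
                if smiles not in grouped[template]:
                    grouped[template].append(smiles)
    return dict(grouped)


print(group_by_template(max_possible_rules))
# {'COCC': ['CCOCO'], 'CCOO': ['CCOOC']}
```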

ehreact.train.hasse.extended_hasse(smiles_dict, seeds, rule_dict, tags_core, verbose, quiet)

Create an extended Hasse diagram.

Parameters:
  • smiles_dict (dict) – A dictionary of the canonicalized input SMILES.

  • seeds (List[str]) – List of SMILES seeds.

  • rule_dict (dict) – A dictionary of all minimal templates of all seeds.

  • tags_core (dict) – A dictionary of the atom map numbers for each minimal template. Empty dictionary for single reactant mode.

  • verbose (bool) – Whether to print additional information.

  • quiet (bool) – Whether to silence all output.

Returns:

d – The Hasse diagram of the input list of molecules.

Return type:

ehreact.diagram.diagram.Diagram

ehreact.train.hasse.get_max_possible_rules(possible_extensions, rule, mols, change_dict)

Takes a dictionary of possible extensions and computes the corresponding enlarged templates as RDKit molecule objects.

Parameters:
  • possible_extensions (dict) – A dictionary of possible extensions for each input molecule, containing the matching indices, atom indices of atoms to extend, and their possible extension as string.

  • rule (rdkit.Chem.Mol) – RDKit molecule object of the current template (molecule or pseudomolecule).

  • mols (List[rdkit.Chem.Mol]) – List of RDKit molecule objects of the molecules or pseudomolecules which are considered in the current branch.

  • change_dict (dict) – A dictionary of all changes upon going from reactants to products.

Returns:

extension_dict – Dictionary of possible extensions, containing the atom indices of the matching atoms, as well as all the possible enlarged templates as string and RDKit molecule.

Return type:

dict

Examples

>>> from rdkit import Chem
>>> from ehreact.train.hasse import get_max_possible_rules
>>> smis = ["CCOCO", "CCOOC"]
>>> mols = [Chem.MolFromSmiles(smi) for smi in smis]
>>> rule = Chem.MolFromSmiles("CCO", sanitize=False)
>>> possible_extensions = {'CCOCO': {(0, 1, 2): {2: 'OC'}}, 'CCOOC': {(0, 1, 2): {2: 'OO'}}}
>>> change_dict={"reac": {"atom": {}, "bond": {}}, "prod": {"atom": {}, "bond": {}}}
>>> get_max_possible_rules(possible_extensions, rule, mols, change_dict)
{'CCOCO': {(0, 1, 2): {'COCC': <rdkit.Chem.rdchem.Mol at 0x7ffc4d75bee0>}},
 'CCOOC': {(0, 1, 2): {'CCOO': <rdkit.Chem.rdchem.Mol at 0x7ffc4d75b4e0>}}}

ehreact.train.hasse.get_possible_extensions(rule, mols, smiles)

Searches for all possible extensions of a template taking into account a list of molecules.

Parameters:
  • rule (rdkit.Chem.Mol) – RDKit molecule object of the current template (molecule or pseudomolecule).

  • mols (List[rdkit.Chem.Mol]) – List of RDKit molecule objects of the molecules or pseudomolecules which are considered in the current branch.

  • smiles (List[str]) – List of SMILES strings of the molecules or pseudomolecules which are considered in the current branch.

Returns:

extension_dict – A dictionary of possible extensions for each input molecule, containing the matching indices, atom indices of atoms to extend, and their possible extension as string.

Return type:

dict

Examples

For the molecules CCOCO and CCOOC, the rule CCO matches both molecules, and allows for one possible extension at the oxygen, adding a carbon for the first molecule, or an oxygen for the second molecule:

>>> from rdkit import Chem
>>> from ehreact.train.hasse import get_possible_extensions
>>> smis = ["CCOCO", "CCOOC"]
>>> mols = [Chem.MolFromSmiles(smi) for smi in smis]
>>> rule = Chem.MolFromSmiles("CCO",sanitize=False)
>>> get_possible_extensions(rule,mols,smis)
{'CCOCO': {(0, 1, 2): {2: 'OC'}}, 'CCOOC': {(0, 1, 2): {2: 'OO'}}}
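
The nested result (input SMILES, then match tuple, then atom index, then extension string) is easier to inspect once flattened into one row per extension. A minimal sketch follows; `flatten_extensions` is a hypothetical helper and not part of ehreact, and the dictionary literal is copied from the example output above.

```python
# Literal copied from the documented example output.
possible_extensions = {
    "CCOCO": {(0, 1, 2): {2: "OC"}},
    "CCOOC": {(0, 1, 2): {2: "OO"}},
}


def flatten_extensions(extension_dict):
    """Yield (smiles, match, atom_idx, extension) tuples, one per extension."""
    for smiles, matches in extension_dict.items():
        for match, extensions in matches.items():
            for atom_idx, extension in extensions.items():
                yield smiles, match, atom_idx, extension


for row in flatten_extensions(possible_extensions):
    print(row)
```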

ehreact.train.hasse.has_one_extension(pivot_rule, extensions_all_matches)

Function to check whether any current template match and corresponding extension results in the current pivot_rule.

Parameters:
  • pivot_rule (rdkit.Chem.Mol) – A possible new template.

  • extensions_all_matches (dict) – A dictionary of matching atoms and possible new templates.

Returns:

Whether or not the pivot rule has a substructure match with any of the possible extensions.

Return type:

bool
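
The "any match, any extension" check can be sketched as a nested `any` over the dictionary. This is an illustration under assumptions, not ehreact's implementation: `has_matching_extension` and `same_name` are hypothetical, templates are compared by name instead of by an RDKit substructure search, and the input mirrors the inner dictionary of the get_max_possible_rules example with `None` in place of molecule objects.

```python
def has_matching_extension(pivot, extensions_all_matches, is_match):
    """Return True if any template of any match satisfies is_match(pivot, template).

    Hypothetical helper: extensions_all_matches maps a match tuple to a dict
    of candidate templates keyed by their names.
    """
    return any(
        is_match(pivot, template)
        for templates in extensions_all_matches.values()
        for template in templates
    )


def same_name(pivot, template):
    """Stand-in for a substructure test: plain string equality."""
    return pivot == template


# Inner dictionary for a single molecule; None replaces the RDKit molecule.
extensions = {(0, 1, 2): {"COCC": None}}

print(has_matching_extension("COCC", extensions, same_name))  # True
print(has_matching_extension("CCOO", extensions, same_name))  # False
```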

ehreact.train.hasse.iterate_algorithm(d, mols, smiles, rule, rule_smiles, verbose, quiet, change_dict, tags_core)

Iterates the atom extension algorithm: looks for possible atoms to add to the current template, finds the best combination, creates a new template, and attaches it to the current Hasse diagram (edits to the diagram are in-place).

Parameters:
  • d (ehreact.diagram.diagram.Diagram) – The current Hasse diagram.

  • mols (List[rdkit.Chem.Mol]) – List of RDKit molecule objects of the molecules or pseudomolecules which are considered in the current branch.

  • smiles (List[str]) – List of SMILES strings of the molecules or pseudomolecules which are considered in the current branch.

  • rule (rdkit.Chem.Mol) – RDKit molecule object of the current template (molecule or pseudomolecule).

  • rule_smiles (str) – Name of the current template.

  • verbose (bool) – Whether to print additional information.

  • quiet (bool) – Whether to silence all output.

  • change_dict (dict) – A dictionary of all changes upon going from reactants to products.

  • tags_core (dict) – A dictionary of the atom map numbers for each minimal template. Empty dictionary for single reactant mode.
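
The recursive, in-place growth of the diagram can be pictured with a toy analogy in which string prefixes stand in for templates and one added character stands in for one added atom. This is purely string-based and does not reflect how ehreact matches substructures or chooses the best combination of extensions; `grow` and the dictionary-based diagram are hypothetical.

```python
def grow(prefix, words, diagram):
    """Recursively attach longer prefixes shared by subsets of words.

    Toy analogy only: diagram maps each prefix (template) to the list of its
    children and is modified in place.
    """
    diagram.setdefault(prefix, [])
    branches = {}
    for word in words:
        if len(word) > len(prefix):
            # Group the words by their next, one-character-longer prefix.
            branches.setdefault(word[: len(prefix) + 1], []).append(word)
    for child, members in branches.items():
        diagram[prefix].append(child)
        grow(child, members, diagram)
    return diagram


diagram = grow("CCO", ["CCOCO", "CCOOC"], {})
print(diagram["CCO"])  # ['CCOC', 'CCOO']
```

Each recursive call only sees the members of its own branch, which mirrors the `mols` and `smiles` parameters being restricted to the molecules considered in the current branch.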

ehreact.train.hasse.quick_match(mol_small, mol_large)

Computes whether mol_small could possibly be a subgraph of mol_large based on atom type counts.

Parameters:
  • mol_small (rdkit.Chem.Mol) – A molecule.

  • mol_large (rdkit.Chem.Mol) – A molecule.

Returns:

Whether a subgraph match is possible based on atom type counts.

Return type:

bool
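
The counting idea behind such a prefilter can be sketched without RDKit. The following is a minimal sketch, not ehreact's implementation: `could_be_subgraph` is a hypothetical helper, and molecules are represented simply as lists of element symbols.

```python
from collections import Counter


def could_be_subgraph(atoms_small, atoms_large):
    """Return False if atoms_small needs more atoms of any element than atoms_large has.

    Hypothetical helper: molecules are plain lists of element symbols here,
    not RDKit molecule objects.
    """
    small = Counter(atoms_small)
    large = Counter(atoms_large)
    # A subgraph match is impossible once any element is over-represented.
    return all(large[element] >= count for element, count in small.items())


print(could_be_subgraph(["C", "C", "O"], ["C", "C", "O", "C", "O"]))  # True
print(could_be_subgraph(["C", "N"], ["C", "C", "O"]))  # False
```

A True result only means that a full substructure search is still required; a False result allows the more expensive search to be skipped.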