How to train your force field 2: Bespoke fit

Posted on 20 Oct 2021 by Joshua Horton

General transferable force fields, such as Parsley or Sage, allow us to rapidly parameterize small molecules covering vast amounts of chemical space for use in molecular dynamics simulations. However, with every force field comes the inherent assumption that parameters fit to a set of representative molecules in a training set are transferable to any new molecules with similar chemistry. Despite potential accuracy problems caused by poor transferability or discrete atom types the use of force fields in fields like drug discovery has been a massive success. Within the Open Force Field, we use a unique method based on SMARTS patterns to link the force field parameters to chemical substructures, avoiding the use of atom types altogether, resulting in a more compact and manageable force field. While we have seen the number of unique torsion parameters grow between versions of the force field to improve parameter performance the number of parameters has remained very low when compared to other state-of-the-art force fields like OPLS3.

Force Field	Number of unique torsion parameters
OpenFF-1.0.0	157
OpenFF-1.2.0	163
OpenFF-1.3.0	167
OpenFF-2.0.0	167
OPLS3	48,142
OPLSe	146,669

This explosion in torsion parameters allows for high accuracy over a large area of chemical space with the types often becoming very specific to certain chemistry. However, due to improved computer power and the reduced computational cost of generating high accuracy reference data via the use of machine learned potentials such as ANI, a better solution may be to generate bespoke parameters on the fly. Enter 🎉BespokeFit 🎉 , BespokeFit is a python package which allows users to easily generate bespoke force field parameters for their molecules under study which compliment the base OpenFF force field. Here we can think of BespokeFit as our general fitting scheme which can be applied to new molecules to generate bespoke parameters on the fly for new unseen chemistry.

You may have seen previously how BespokeFit, QCSubmit and QCFractal can be combined to fit general force fields and this time we are going to show its other side and walk through an application to a simple molecule.

This will have the following structure:

Building the general BespokeFit optimisation workflow
Building and inspecting the molecule specific optimisation schema made from the general workflow
Setup and run the BespokeExecutor
Inspecting the fitted parameters

(Optional) Installation instructions

If you’d like to follow along with this example, download this conda environment yaml, then run:

conda env create -y -f bespokefit-blog-post.yml
conda activate 2021-bespokefit-blog-post

Note that the bespoke fitting package is pre-production, so this environment is heavily pinned and major changes may occur before a stable release.

Building the general workflow

BespokeFit aims to provide a reproducible parameter optimization workflow for SMIRNOFF based force fields. As such normal BespokeFit execution starts with a general fitting workflow. This captures every process in the workflow along with any adjustable settings such as how the reference data should be generated. The default workflow is designed to optimize bespoke torsion parameters at the same level of theory as that used in the mainline openff force fields. Here however, we will be starting with a blank workflow and building it up as we go.

First, we start with our optimization engine, which is an easy choice as we currently only support the fantastic ForceBalance package (watch this space - optimizers are coming!). Let’s start by creating the ForceBalance optimization schema.

from openff.bespokefit.schema.optimizers import ForceBalanceSchema

fb = ForceBalanceSchema()
fb.dict()

{'type': 'ForceBalance',
 'max_iterations': 10,
 'job_type': 'optimize',
 'penalty_type': 'L2',
 'step_convergence_threshold': 0.01,
 'objective_convergence_threshold': 0.01,
 'gradient_convergence_threshold': 0.01,
 'n_criteria': 2,
 'eigenvalue_lower_bound': 0.01,
 'finite_difference_h': 0.01,
 'penalty_additive': 1.0,
 'initial_trust_radius': -0.25,
 'minimum_trust_radius': 0.05,
 'error_tolerance': 1.0,
 'adaptive_factor': 0.2,
 'adaptive_damping': 1.0,
 'normalize_weights': False,
 'extras': {}}

As you can tell there are many options here and in some cases, it might not be clear what a valid input is. For example, what other penalty types could we use? ✨Pydantic✨ to the rescue, pydantic allows us to “define how data should be in pure, canonical python” and has run time validation to ensure our data is always correct. What’s more we even get a schema method for each model for free which fully describes the model with a short description for each field and information on the acceptable inputs, it basically comes with its own documentation📖. As such pydantic is used extensively throughout BespokeFit to try and catch possible input errors ahead of run time, there is nothing worse than queuing up a calculation for it to be instantly returned with an error on line one🤦‍♂️. Now the schema is validated during assignment as we see below. First lets start by inspecting the optimizer schema using:

fb.schema()

{'description': 'A class containing the main ForceBalance optimizer settings '
            'to use during an\n'
            'optimization.\n'
            '\n'
            'Priors and target definitions are stored separately as part '
            'of an\n'
            '``OptimizationSchema``.',
'properties': {'adaptive_damping':
...
'penalty_type': {'default': 'L2',
                 'description': 'The penalty type.',
                 'enum': ['L1', 'L2'],
                 'title': 'Penalty Type',
                 'type': 'string'},

We can now see that penalty_type only accepts the values L2 and L1 so lets try something else and see what happens.

fb.penalty_type = "L3"

ValidationError: 1 validation error for ForceBalanceSchema
penalty_type
unexpected value; permitted: 'L1', 'L2' (type=value_error.const; given=L3; permitted=('L1', 'L2'))

Nice, we get an informative error message before even running the workflow!

fb.penalty_type = "L1"

Now we can set it to L1 and start to build up our workflow and see what other pieces we might need.

Lets start by creating our BespokeWorkflowFactory which acts as a fitting template for future optimizations and add our ForceBalanceSchema optimizer settings. We can also remove all other defaults as we will go through them in the next sections.

from openff.bespokefit.workflows import BespokeWorkflowFactory

workflow = BespokeWorkflowFactory(
    fragmentation_engine=None,
    optimizer=fb,
    parameter_hyperparameters=[],
    target_templates=[],
    default_qc_specs=[]
)
workflow.dict()

Stage 1 Fragmentation

BespokeFit makes extensive use of the fantastic openff-fragmenter package where possible to reduce the cost of QM torsion drives as they can quickly become very expensive for large drug-like molecules. Here we will add the WBO based fragmentation scheme to the pipeline with default settings. You can find out more about how fragmenter works in a pre-print here.

from openff.fragmenter.fragment import WBOFragmenter

fragmenter = WBOFragmenter()
workflow.fragmentation_engine = fragmenter

By default, fragmenter will not fragment terminal rotatable bonds (such as methyl groups) as these are assumed to be trivial and well described by our chosen force field. However, to keep things simple in our example we will be working with BrCO and so we need to change this behaviour. Luckily BespokeFit/fragmenter allow us to overwrite this behaviour by defining a SMARTS pattern (chemical substructure) which will be used to select the rotatable bonds.

workflow.target_torsion_smirks = ["[*]~[!$(*#*)&!D1:1]-,=;!@[!$(*#*)&!D1:2]~[*]"]

Here we have taken the default SMARTS pattern [!#1]~[!$(*#*)&!D1:1]-,=;!@[!$(*#*)&!D1:2]~[!#1] used to find torsions and allowed the connecting atoms to the central bond to be hydrogens. This will ensure our simple molecule is processed by the workflow, as without this change the workflow would reject the molecule as with no rotatable bonds there would be no work to do 🤷‍♂️.

Stage 2 Reference Generation

As you may have guessed we are going to need some target reference data to optimize our force field parameters against and that’s where the target templates come in. These target schemas have two main functions:

Describe the contribution to the total error function in the parameter optimization
Generate specific reference data tasks that need to be computed in order to train with the target

In this case we will be using the TorsionProfileTargetSchema which requires a torsiondrive as input data. That is a series of constrained geometry optimizations around the targeted rotatable bond, usually ranging from -180 -> 180 degrees in 15 degree increments. There is slightly more to a torsiondrive than that and you can read more about how it works in the paper, but for now the main point is that we should get back a series of geometries and energies to fit to. The actual torsion profile target then contributes the average RMSE between the QM and MM energies at each geometry to the objective function and has a small number of settings exposed which we can see here.

from openff.bespokefit.schema.targets import TorsionProfileTargetSchema

target = TorsionProfileTargetSchema()
workflow.target_templates = [target, ]
target.dict()

{'attenuate_weights': True,
 'energy_cutoff': 10.0,
 'energy_denominator': 1.0,
 'extras': {},
 'reference_data': None,
 'type': 'TorsionProfile',
 'weight': 1.0}

You will also note that we passed the target in a list as the optimizer can use multiple targets to construct the objective function, you can check out Simons blog to see this in action.

We now need to tell BespokeFit what program, method and basis should be used to actually compute the reference tasks. As we use the game changing QCEngine to power these calculations we have a wide range of possibilities with one standard interface. BespokeFit defines the target specification using the QCSpec class from QCSubmit which has built in validation to again catch any errors early in the process. Choosing a new QC task specification for all tasks in the BespokeFit workflow is then as simple as:

from openff.qcsubmit.common_structures import QCSpec

xtb_spec = QCSpec(
    method="gfn2xtb",
    basis=None,
    program="xtb",
    spec_name="xtb",
    spec_description="gfn2xtb"
)

workflow.default_qc_specs = [xtb_spec]

To keep things moving quickly we have chosen to use a semi-empirical QC method, but by default BespokeFit will use the same level of theory as that used to fit the main line force fields.

Stage 3 Parameter Optimization

Finally, we have the parameter optimization stage and as we have already set up our optimizer using ForceBalance the last step is to define some parameter hyperparameters. These tell BespokeFit which parameters we would like to fit and allows us to define a prior to the parameter.

from openff.bespokefit.schema.smirnoff import  ProperTorsionHyperparameters

prior = ProperTorsionHyperparameters()
workflow.parameter_hyperparameters = [prior]

BespokeFit now knows we intend to optimize the torsion parameters to the target and with these final two settings we can generate SMARTS patterns specific to the molecules that pass through the workflow and to fully expand the torsion parameters to use all available k values.

workflow.generate_bespoke_terms = True
workflow.expand_torsion_terms = True

All that is left to do now is to save the workflow to file for later use, this is now our general workflow template which we can apply to all molecules that pass into the BespokeFit pipeline. The serialized workflow is also a fantastic provenance store, making it easy to tell exactly what settings were used to compute the parameters in a project.

# write to json
workflow.export_factory("workflow.json")
# or yaml
workflow.export_factory("workflow.yaml")

Building the Molecule Specific Schema

Now we can use our general fitting schema to build a molecule specific optimization schema. This schema fully defines the optimization protocol that should be applied to the molecule including information about the reference data generation tasks which BespokeFit will automatically perform locally for us. In this example we will be using BrCO, so lets create the molecule using the openff-toolkit, save it to file for later and create a BespokeOptimizationSchema for it using our workflow:

from openff.toolkit.topology import Molecule

target_molecule = Molecule.from_smiles("BrCO")
target_molecule.generate_conformers(n_conformers=1)
# write to file for later
target_molecule.to_file(file_path="BrCO.sdf", file_format="sdf")
# make the specific schema
schema = workflow.optimization_schema_from_molecule(molecule=target_molecule)

If we now inspect the schema we find two main changes, i) the input molecule has been inserted, ii) some bespoke SMARTS patterns have been created for the rotatable torsion identified in the molecule. These patterns and the initial force field values can be easily viewed via:

schema.initial_parameter_values

{ProperTorsionSMIRKS(type='ProperTorsions',
smirks='[#35H0X1x0!r+0A:1]-;!@[#6H2X4x0!r+0A:2](-;!@[#1H0X1x0!r+0A])(-;!@[#1H0X1x0!r+0A])-;!@[#8H1X2x0!r+0A:3]-;!@[#1H0X1x0!r+0A:4]',
attributes={'k1', 'k2', 'k3', 'k4'}): {
'k1': Quantity(value=0, unit=kilocalorie/mole),
'k2': Quantity(value=0, unit=kilocalorie/mole),
'k3': Quantity(value=0.6985935181583, unit=kilocalorie/mole),
'k4': Quantity(value=0, unit=kilocalorie/mole)},
ProperTorsionSMIRKS(type='ProperTorsions',
smirks='[#1H0X1x0!r+0A:1]-;!@[#6H2X4x0!r+0A:2](-;!@[#1H0X1x0!r+0A])(-;!@[#35H0X1x0!r+0A])-;!@[#8H1X2x0!r+0A:3]-;!@[#1H0X1x0!r+0A:4]',
attributes={'k1', 'k2', 'k3', 'k4'}): {
'k1': Quantity(value=0, unit=kilocalorie/mole),
'k2': Quantity(value=0, unit=kilocalorie/mole),
'k3': Quantity(value=0.6985935181583, unit=kilocalorie/mole),
'k4': Quantity(value=0, unit=kilocalorie/mole)}}

You may also notice that as requested BespokeFit has expanded the k terms for each torsion parameter with each new term set to zero by default. The non-zero values are then taken from the base force field to form our initial parameter values. We can also use the openff-toolkit to query the molecule and identify the atoms matching these new patterns and understand how the parameters will be applied to the molecule by visualizing the substructures.

from rdkit.Chem import Draw, rdDepictor

parameters = schema.parameters
rd_mol = target_molecule.to_rdkit()
rdDepictor.Compute2DCoords(rd_mol)
rd_mols = []
atoms = []
for parameter in parameters:
    matches = target_molecule.chemical_environment_matches(query=parameter.smirks)
    flat_matches = [atom for match in matches for atom in match]
    rd_mols.append(rd_mol)
    atoms.append(flat_matches)

Draw.MolsToGridImage(rd_mols, highlightAtomLists=atoms)

bespoke-smirks

We can see that BespokeFit has generated two bespoke SMARTS patterns, as determined by the symmetry of the atoms around the central bond, covering all three torsion terms.

Setting up the BespokeExecutor

In the last blog post we made use of the invaluable QCFractal infrastructure to compute our target data. Since then however, we have been hard at work to offer a local execution pathway as an alternative reference generation method. Here we still use QCEngine to execute the tasks, but spin up simple Celery workers to perform the jobs. In fact the BespokeFit executor has been totally reworked to allow for a more flexible and scalable fitting solution. Each of the major stages: fragmentation, reference generation and optimization are carried out by dedicated workers which can be on the same local machine or distributed over multiple. Here we will be spinning up the BespokeExecutor and three workers, one for each of the stages, in a single line! We will then submit our optimization task via the RESTful API and let BespokeFit take care of the rest.

from openff.bespokefit.executor import BespokeExecutor, wait_until_complete
import os

# set keep files to true so we can view the results
os.environ["BEFLOW_KEEP_FILES"] = "True"

# launch the executor
with BespokeExecutor(
        n_fragmenter_workers=1,
        n_qc_compute_workers=1,
        n_optimizer_workers=1) as executor:
    # grab the task id and wait for the task to finish
    task = executor.submit(input_schema=schema)
    result = wait_until_complete(optimization_id=task.id)

[✓] fragmentation successful
[✓] qc-generation successful
[✓] optimization successful

As our optimization task works through each stage you should see them complete with a tick and once all tasks are complete the executor will shut down for us and clean up the workers as well! We can then check the result object to make sure each stage did finish with no errors.

for stage in result.stages:
    print(stage.type, stage.status)

fragmentation success
qc-generation success
optimization success

Inspecting the Optimized Parameters

Once the torsion optimization is complete a result schema is returned. The schema contains the final optimized parameters that can be used with the openff-toolkit in workflows to parameterize molecules and set up systems in OpenMM to run dynamics. The result schema also contains all provenance information which can help with reproducibility.

Now to convince ourselves that the optimization was successful and has lead to an improvement in the parameters - lets load the final output from ForceBalance.

from IPython.display import IFrame

opt_id = f"bespoke-executor/{result.results.input_schema.id}/optimize.tmp/torsion-0/iter_0002/plot_torsion.pdf"
IFrame(opt_id, width=900, height=600)

brco-result

BespokeFit has been successful as there is a clear improvement in the PES around this torsion with respect to the reference data. We can also plot how the parameters have changed during optimization as the result schema stores the initial and final torsion parameters.

import pandas as pd
from simtk import unit

parameter_data = []
for i, (parameter, initial_values) in enumerate(result.results.input_schema.initial_parameter_values.items()):
    for final_parameter, final_values in result.results.refit_parameter_values.items():
        if parameter.smirks == final_parameter.smirks:
            for term in range(1, 5):
                k_before = initial_values[f"k{term}"].value_in_unit(unit.kilocalorie_per_mole)
                k_after = final_values[f"k{term}"].value_in_unit(unit.kilocalorie_per_mole)
                parameter_data.append([f"smirks_{i}_k{term}", k_before, k_after, k_after - k_before])

# make a pandas dataframe
df = pd.DataFrame(parameter_data, columns=["parameter", "before", "after", "change"])

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rc('font', size=16)


ax = plt.figure(figsize=(5,8))
#add start points
ax = sns.stripplot(data=df,
                   x='before',
                   y='parameter',
                   orient='h',
                   order=df['parameter'],
                   size=10,
                   color='black')
ax.grid(axis='y', color='0.9')
#add arrows to plot only if the parameter changed by more than 1e-3 kcal/mol
for i in range(len(df.index)):
    term = df.iloc[i]
    if abs(term["change"]) > 1e-3:
        if term["after"] > term["before"]:
            arrow_color = '#347768'
        elif term["after"] < term["before"]:
            arrow_color = 'red'
        else:
            arrow_color = 'black'
        ax.arrow(term["before"],        
                 i,                      
                 term["change"],    
                 0,                     
                 head_width=0.3,         
                 head_length=0.2,       
                 width=0.1,              
                 fc=arrow_color,            
                 ec=arrow_color)     
ax.axvline(x=0, color='0.9', ls='--', lw=2, zorder=0)
ax.set_xlabel('Force Constant (kcal/mol)')                               
ax.set_ylabel('Torsion Parameter')                         
sns.despine(left=True, bottom=True)

parameter-change

Conclusion

To recap in this blog post we have:

Created and configured a general BespokeFit workflow
Built a molecule specific optimization schema
Spun up the BespokeFitExecutor and workers
Fragmented the molecule
Automatically generated torsiondrive reference data using QCEngine
Automatically generated bespoke SMARTS patterns for the molecule
Optimized the torsion parameters using ForceBalance

Hopefully this demonstration has shown just how easy it is to set up a BespokeFit general fitting pipeline with a custom configuration and train bespoke parameters to reference data generated on the fly from just a few imports. Better yet this whole process can now be routinely applied to new molecules from the CLI using our serialized workflow.

# export BEFLOW_KEEP_FILES=True # Uncomment this to skip cleanup of intermediate files
openff-bespoke executor run --input BrCO.sdf --spec-file workflow.json

Tags: force field, infrastructure, bespoke fit

open force field

An open and collaborative approach to better force fields

How to train your force field 2: Bespoke fit

Posted on 20 Oct 2021 by Joshua Horton

(Optional) Installation instructions

Building the general workflow

Stage 1 Fragmentation

Stage 2 Reference Generation

Stage 3 Parameter Optimization

Building the Molecule Specific Schema

Setting up the BespokeExecutor

Inspecting the Optimized Parameters

Conclusion