Data-Driven Chemistry

Like most scientists, chemists are drowning in data from laboratory experiments and from calculations. We are developing machine-learning tools to automate the analysis of quantum-chemistry calculations. Another area in need of automation is the development of quantitative structure-property relationships, particularly for flexible molecules.


Matt Sigman (Utah), Tom Rovis (Columbia), Steven Fletcher (Oxford)

Key Papers

A Quantitative Metric for Organic Radical Persistence Using Thermodynamic and Kinetic Features.

Sowndarya, S. S. V.; St. John, P. C.; Paton, R. S. Chem. Sci. 2021, Advance Article, DOI: 10.1039/D1SC02770K

Real-time Prediction of 1H and 13C Chemical Shifts with DFT accuracy using a 3D Graph Neural Network.

Guan, Y.; Sowndarya, S. S. V.; Gallegos, L. C.; St. John, P. C.; Paton, R. S. Chem. Sci. 2021, 12, 12012-12026.


CASCADE stands for ChemicAl Shift CAlculation with DEep learning. It is a stereochemically aware graph network for the prediction of NMR chemical shifts. The model was trained against 8,000 DFT structures, followed by transfer learning with experimental spectra. A web server provides access to CASCADE predictions from SMILES or from structures drawn in the graphical interface. An automated workflow executes 3D structure embedding and MMFF conformer searching. The full ensemble of optimized conformations is then passed to a trained graph neural network to predict the NMR chemical shifts (in ppm) for C and H atoms. The underlying data and code are available. The program is described in the publication: Real-time Prediction of 1H and 13C Chemical Shifts with DFT accuracy using a 3D Graph Neural Network


DBSTEP is a Python package for obtaining DFT-Based Steric Parameters from 3-dimensional chemical structures. It can parse the outputs from most computational chemistry programs and other common molecular structure file formats. Steric properties can be obtained either exactly or by using a Cartesian grid. The grid approach makes it possible to featurize a molecular isodensity surface (DBSTEP can process wavefunction files) rather than relying on classical atomic radii. Currently, traditional Sterimol parameters (L, Bmin, Bmax) and percent buried volume parameters are implemented, as well as our novel steric parameter vectors Sterimol2vec and vol2vec. The package can be used on the command line or imported into a Python script to collect steric parameters as part of a computational workflow.
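To illustrate the grid-based idea described above, here is a minimal sketch of a percent buried volume estimate on a Cartesian grid. It is not DBSTEP's implementation or API: the function name, the argument layout, and the default grid spacing are assumptions made for this example, and atomic radii are simply supplied by the caller. The 3.5 Å default is a commonly used sphere radius for buried volume.

```python
import math

def percent_buried_volume(center, atoms, sphere_radius=3.5, spacing=0.2):
    """Estimate percent buried volume by counting grid points.

    Hypothetical helper for illustration only (not part of DBSTEP).
    center: (x, y, z) of the sphere center, in angstroms.
    atoms:  list of ((x, y, z), radius) tuples, in angstroms.
    """
    cx, cy, cz = center
    n = int(math.ceil(sphere_radius / spacing))
    r2_sphere = sphere_radius ** 2
    inside = 0   # grid points that fall inside the sphere
    buried = 0   # of those, points that lie within any atomic radius
    for i in range(-n, n + 1):
        x = cx + i * spacing
        for j in range(-n, n + 1):
            y = cy + j * spacing
            for k in range(-n, n + 1):
                z = cz + k * spacing
                if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 > r2_sphere:
                    continue
                inside += 1
                for (ax, ay, az), radius in atoms:
                    if (x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2 <= radius ** 2:
                        buried += 1
                        break
    return 100.0 * buried / inside

if __name__ == "__main__":
    # Made-up geometry: one atom of radius 1.7 placed 2.0 angstroms from the center.
    example_atoms = [((2.0, 0.0, 0.0), 1.7)]
    print(percent_buried_volume((0.0, 0.0, 0.0), example_atoms))
```

A finer grid spacing gives a better estimate at a higher cost. An isodensity-based variant would replace the atomic-radius test with a lookup of the electron density at each grid point.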

Importance of Engineered and Learned Molecular Representations in Predicting Organic Reactivity, Selectivity, and Chemical Properties.

Gallegos, L. C.; Luchini, G.; St. John, P. C.; Kim, S.; Paton, R. S. Acc. Chem. Res. 2021, 54, 827–836


GoodVibes is a Python program that computes quasi-harmonic thermochemical data and potential energy surface diagrams from frequency calculations at a given temperature and concentration, corrected for the effects of vibrational scaling factors. All partition functions (electronic, translational, rotational and vibrational) are recomputed and can be corrected to any temperature or concentration. The first public version of GoodVibes was released in 2016 and it has undergone several revisions since, during which time it has been used by many groups around the world. The program is described in the publication: GoodVibes: automated thermochemistry for heterogeneous computational chemistry data
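As an illustration of what a quasi-harmonic correction involves, the sketch below computes the vibrational entropy of a single mode using one widely used scheme, Grimme's quasi-RRHO interpolation between a harmonic oscillator and a free rotor. The choice of this scheme, the 100 cm-1 cutoff, and the average moment of inertia of 1e-44 kg m^2 are assumptions for this example, as are all function names. It is not GoodVibes code and may not match the program's defaults.

```python
import math

H = 6.62607015e-34     # Planck constant, J s
KB = 1.380649e-23      # Boltzmann constant, J/K
C_CM = 2.99792458e10   # speed of light, cm/s
R = 8.314462618        # gas constant, J/(mol K)

def harmonic_entropy(freq_cm, temperature=298.15):
    """Harmonic-oscillator entropy of one mode, in J/(mol K)."""
    x = H * C_CM * freq_cm / (KB * temperature)
    # -expm1(-x) equals 1 - exp(-x), evaluated accurately for small x
    return R * (x / math.expm1(x) - math.log(-math.expm1(-x)))

def free_rotor_entropy(freq_cm, temperature=298.15, b_av=1.0e-44):
    """Free-rotor entropy for a mode of the same frequency, in J/(mol K)."""
    mu = H / (8.0 * math.pi ** 2 * freq_cm * C_CM)   # moment of inertia, kg m^2
    mu_eff = mu * b_av / (mu + b_av)                 # damped by the average moment
    arg = 8.0 * math.pi ** 3 * mu_eff * KB * temperature / H ** 2
    return R * (0.5 + 0.5 * math.log(arg))

def quasi_rrho_entropy(freq_cm, temperature=298.15, cutoff_cm=100.0):
    """Interpolate between the two models with a frequency-dependent weight."""
    w = 1.0 / (1.0 + (cutoff_cm / freq_cm) ** 4)
    return (w * harmonic_entropy(freq_cm, temperature)
            + (1.0 - w) * free_rotor_entropy(freq_cm, temperature))

if __name__ == "__main__":
    for freq in (20.0, 100.0, 500.0):
        print(freq, harmonic_entropy(freq), quasi_rrho_entropy(freq))
```

Low-frequency modes, whose harmonic entropy grows without bound as the frequency approaches zero, are treated mostly as free rotors, while high-frequency modes keep their harmonic values. A vibrational scaling factor would be applied by multiplying each frequency before calling these functions.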

[Zenodo] [GitHub]

wSterimol is a program that generates Boltzmann-weighted Sterimol steric parameters for conformationally flexible substituents and integrates with PyMOL. It contains an automated computational workflow that computes multidimensional Sterimol parameters. For flexible molecules or substituents, the program generates and optimizes a conformational ensemble and produces Boltzmann-weighted Sterimol parameters. It has been developed as a PyMOL plugin and can be run from within the graphical user interface. The wSterimol code is described in more detail in Conformational Effects on Physical-Organic Descriptors – the Case of Sterimol Steric Parameters
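The Boltzmann weighting mentioned above can be sketched in a few lines. This is a generic illustration, not the wSterimol code: the function names are hypothetical, energies are assumed to be in kcal/mol, and the numbers in the usage example are made up.

```python
import math

R_KCAL = 0.0019872041  # gas constant, kcal/(mol K)

def boltzmann_weights(energies, temperature=298.15):
    """Return normalized Boltzmann populations for conformer energies (kcal/mol)."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]

def boltzmann_average(values, energies, temperature=298.15):
    """Population-weighted average of a per-conformer property."""
    weights = boltzmann_weights(energies, temperature)
    return sum(w * v for w, v in zip(weights, values))

if __name__ == "__main__":
    # Made-up example: three conformers with relative energies and B1-like values
    energies = [0.0, 0.5, 2.0]
    b1_values = [1.9, 2.3, 3.1]
    print(boltzmann_weights(energies))
    print(boltzmann_average(b1_values, energies))
```

The same averaging applies to each Sterimol parameter in turn. Only energy differences matter, so relative or absolute conformer energies give the same weights.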

[Zenodo] [GitHub]

Effects of substituents X and Y on the NMR chemical shifts of 2-(4-X phenyl)-5-Y pyrimidines.

Yuan, H.; Chen, P.-W.; Li, M.-Y.; Zhang, Y.; Peng, Z.-W.; Liu, W.; Paton, R. S.; Cao, C. J. Mol. Struct. 2020, 1204, 127489

GoodVibes: automated thermochemistry for heterogeneous computational chemistry data.

Luchini, G.; Alegre-Requena, J. V.; Funes-Ardoiz, I.; Paton, R. S. F1000Research 2020, 9, 291

Prediction of homolytic bond dissociation enthalpies for organic molecules at near chemical accuracy with sub-second computational cost.

St. John, P. C.; Guan, Y.; Kim, Y.; Kim, S.; Paton, R. S. Nat. Commun. 2020, 11, 2328