The size of chemical compound space is too large to be probed exhaustively. This leads high-throughput protocols to drastically subsample and results in sparse and nonuniform datasets. Rather than arbitrarily selecting compounds, we systematically explore chemical space according to the target property of interest. We first perform importance sampling by introducing a Markov chain Monte Carlo scheme across compounds. We then train a machine learning (ML) model on the sampled data to expand the region of chemical space probed. Our boosting procedure enhances the number of compounds by a factor of 2 to 10, enabled by the ML model's coarse-grained representation, which both simplifies the structure-property relationship and reduces the size of chemical space. The ML model correctly recovers linear relationships between transfer free energies. These linear relationships correspond to features that are global to the dataset, marking the region of chemical space up to which predictions are reliable; this is a more robust alternative to the predictive variance. Bridging coarse-grained simulations with ML gives rise to an unprecedented database of drug-membrane insertion free energies for 1.3 million compounds.

Controlled exploration of chemical space by machine learning of coarse-grained representations / Hoffmann, C.; Menichetti, R.; Kanekal, K. H.; Bereau, T. - In: Physical Review E. - ISSN 2470-0045. - 100:3 (2019), p. 033302. [10.1103/PhysRevE.100.033302]

Controlled exploration of chemical space by machine learning of coarse-grained representations

Menichetti R.
2019-01-01


Use this identifier to cite or link to this document: https://hdl.handle.net/11572/315288
