## AI applications to science, beyond biology

## Scientists have just set up a roadmap for the future of ML applied to chemistry and material sciences, in an article that also tracks the state of the art along several fronts of AI

Over the last decade we have witnessed a revolution in machine learning (ML) and artificial intelligence (AI), and over the last 5 years its unprecedented impact on other sciences. While the best-known case is probably AlphaFold and its impact on biology, other fields have also felt the potential of these new computational technologies. In particular, computational chemistry and computational materials science have begun a transition toward what could soon be AlphaFold-like paradigm shifts, whereby traditional ways of running calculations are being superseded by quicker, easier, and frequently more precise methods stemming from direct applications of ML models.

Just now, a group of scientists working on the application of ML to materials science and chemistry published a review/opinion (or “roadmap”) article about the emerging subfield and its possible immediate evolution. The article discusses viewpoints on current and anticipated difficulties in several ongoing and emerging applications of ML to computational chemistry and materials science.

You’ll surely find this article and my blog summary interesting even if you work with ML models but not with chemistry and materials, as the article explores the state of the art on many fronts of the broader field.

The article discusses 5 main issues that encompass more concrete points, such as the creation of better and faster ML-based force fields trained on data or on accurate but slow-to-compute quantum calculations, new ways to compute exchange-correlation functionals for density-functional theory, ML-based solutions to the many-body problem, and ways to deal with the large amounts of data required for proper training of ML models.

In a nutshell, the article covers the following points, for each of which I highlight one or two key ideas:

## 1. Predicting material properties

**Using machine learning to accelerate computational materials design**

Fostered by the success of AlphaFold, we are now experiencing a surge in the application of ML to protein design, right now a hot and very promising topic (example here).

It turns out that ML can also be used to design new materials, which is the topic the article opens with. And it’s not just about designing the positions of atoms in space, but also the resulting electronic properties, as the first section of the article discusses.

**Machine learning for material property prediction**

As in any other domain of science where ML has made an impact, it can help to predict properties better than regular models do, or with the same accuracy but far faster.

In fact, ML for predicting physical or chemical properties has been around for decades, especially using rather simple neural networks. See for example the early programs to compute NMR chemical shifts of molecules, or the several applications documented in this (by the way excellent, I have it at home!) 1999 book:

Of course we now have far better network architectures, activation functions, and training methods; and, just as important, databases are nowadays orders of magnitude bigger and much easier to access than 23 years ago when this book came out. All these aspects are discussed in other sections of the article.

**Predicting thermodynamically stable materials**

One of the key practical outcomes of materials science is, of course, the discovery and development of new materials with useful properties. One very desirable property is durability, or in other words stability. Scientists are interested in predicting the compositions and crystal structures of stable materials that can be synthesized in a lab, and ML models can certainly help with this.

**Learning rules for materials properties and functions**

This can be done, for example, through interpretable neural networks or through networks that do symbolic regression, as applied here to quantum calculations or also more broadly in science.

The specific point about interpretable ML models is also touched upon later on in the article and in this blog entry.

**Deep learning for spectroscopy**

Spectroscopy deals with the interaction of radiation with matter, especially how to retrieve information about a piece of matter from spectra that describe its effect on radiation. Getting this information, or simulating the interaction, is not trivial, and ML could help here of course.

**Machine learning for disordered systems**

Disordered materials are those characterized by extreme structural and chemical disorder. Glasses, plastics and amorphous crystals are examples of disordered materials, which are of course the focus of much research and development.

## 2. Construction of accurate force fields and beyond

**Machine learning for molecular quantum simulations**

In molecular simulations, forces are propagated into accelerations and motions to create “movies” of how the atoms of a system are likely to move together, giving rise to the properties of a piece of matter, and hence for example the function of a protein or the flexibility of a material. (To know more about simulations in general, check out the introduction of this article.)

Computing the force fields using classical equations requires lots of parameters and intensive calculations, which neural nets can simplify (and accelerate) tremendously. See for example this paper on an “all purpose” network that computes potential energies and their derivatives (i.e. forces) for small molecules.
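To make the “forces into accelerations into motions” idea concrete, here’s a minimal sketch of a velocity Verlet integrator, one common way of propagating a simulation. Everything in it is an assumption for illustration: the harmonic force stands in for a real (classical or ML-based) force field, and the names `harmonic_force` and `velocity_verlet` are made up for this post.

```python
import numpy as np

def harmonic_force(x, k=1.0):
    """Toy force from the potential E = 0.5 * k * x**2 (stand-in for a real force field)."""
    return -k * x

def velocity_verlet(x, v, force_fn, mass=1.0, dt=0.01, n_steps=1000):
    """Propagate positions and velocities; return the position trajectory and final velocity."""
    traj = np.empty((n_steps + 1,) + np.shape(x))
    traj[0] = x
    f = force_fn(x)
    for step in range(1, n_steps + 1):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update from the current force
        x = x + dt * v_half                # full-step position update
        f = force_fn(x)                    # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        traj[step] = x
    return traj, v

# One particle released at x = 1 with zero velocity, followed for about one period (2*pi).
x0 = np.array([1.0])
v0 = np.array([0.0])
traj, v_final = velocity_verlet(x0, v0, harmonic_force, dt=0.01, n_steps=628)
print(traj[-1], v_final)
```

In a real simulation, `force_fn` is where a force field comes in, and the frames stored in `traj` are the “movie”.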

Many similar networks exist, and more are coming up.

**Bayesian machine learning for microscopic interactions**

Bayesian ML allows for adaptive models that can better describe systems in simulations. Thanks to the availability of reliable and efficient software packages for quantum calculations, microscopic data can be generated in abundance and then used for fitting flexible interatomic potential models, as opposed to macroscopic observables, which cannot. The idea isn’t new, as it has been applied to non-ML methods, but ML methods push it to the maximum: the ML model behaves as a non-parametric regressor that imposes few or no constraints on the mathematical form of the interaction and relies directly on actual data. A Bayesian technique then imposes a prior in the form of a distribution over functions and uses the data and the ML model to provide predictions.
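As a rough illustration of the “prior over functions plus data” idea, here’s a minimal Gaussian process regression sketch written with plain NumPy. It is not the method of any specific paper: the Morse curve is a made-up stand-in for reference energies from quantum calculations, and the kernel, length scale and function names are my own assumptions.

```python
import numpy as np

def rbf_kernel(a, b, length=0.3, variance=1.0):
    """Squared-exponential kernel: the prior assumption that energies vary smoothly."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean Gaussian process."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(x_test, x_test)) - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def morse(r, depth=1.0, a=1.5, r0=1.0):
    """Hypothetical 'reference' pair energy, standing in for quantum data."""
    return depth * (1.0 - np.exp(-a * (r - r0))) ** 2 - depth

r_train = np.linspace(0.7, 2.5, 12)      # distances where reference data exist
e_train = morse(r_train)
r_test = np.array([1.0, 1.45, 4.0])      # the last point lies far from all data
mean, std = gp_posterior(r_train, e_train, r_test)
print(mean, std)
```

The useful part is the second output: the predicted uncertainty stays small between training points and grows back toward the prior far from them, which tells you where the model should not be trusted.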

**Spectroscopically accurate potential energy surfaces from machine learning**

As defined above, spectroscopy deals with the interaction of radiation with matter, and one usually uses spectroscopic methods to retrieve information about a piece of matter. One of the main ways to extract this information from spectra is by simulating them. At the most basic level this requires solving the Schrödinger equation, but this is very hard to achieve from first principles. For some calculations this is even harder, because one needs to know the so-called potential energy surface, which is a hyperdimensional surface that quantifies the potential energy for different configurations (positions and states) of the atoms and electrons that make up the system. Today, several ML methods can assist these calculations for small molecules, and the goal is to improve these methods and make them suitable for larger systems.

**High-dimensional neural network potential energy surfaces in chemistry and materials science**

Following from the previous point, neural networks that compute potential energies have lots of broader applications, for example to quickly (possibly even interactively) test the energies and conformations of molecules and sets of molecules.

**Transferable neural network force fields**

As described above, molecular simulations are about propagating forces into accelerations and motions to create “movies” of how the atoms of a system are likely to move together, giving rise to the properties that we are interested in. Quantum calculations have become a workhorse of computational organic chemistry and are the most accurate kinds of *ab initio* simulations, i.e. simulations that do not rely on the parametrization of specific atoms, bonds, etc. The problem with these calculations is that they are extremely expensive, certainly much more so than classical molecular mechanics, although the latter has the drawback of needing to be parametrized.

It turns out that the same ML models that produce potential energies can provide forces, because the force acting on an atom is the negative gradient of the energy with respect to that atom’s spatial coordinates. There is an emerging field exploiting this to produce new force fields that run nearly at the speed of the classical calculations but are nearly as accurate as the quantum calculations. Here’s one of the most famous examples:
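As a side note on the energy-to-force relation mentioned above (this is my own toy sketch, not the example just referred to): given any function that returns an energy for a set of atomic positions, forces follow as the negative gradient. Below, a Lennard-Jones energy in reduced units stands in for a trained ML energy model, and the gradient is taken by central finite differences; actual ML force fields would typically use automatic differentiation instead. The names `pair_energy` and `forces_from_energy` are hypothetical.

```python
import numpy as np

def pair_energy(positions):
    """Toy stand-in for a learned energy model: Lennard-Jones in reduced units."""
    n = len(positions)
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(positions[i] - positions[j])
            energy += 4.0 * (r ** -12 - r ** -6)
    return energy

def forces_from_energy(energy_fn, positions, h=1e-5):
    """Forces as the negative gradient of the energy, via central finite differences."""
    forces = np.zeros_like(positions)
    for idx in np.ndindex(*positions.shape):
        up = positions.copy()
        down = positions.copy()
        up[idx] += h
        down[idx] -= h
        forces[idx] = -(energy_fn(up) - energy_fn(down)) / (2.0 * h)
    return forces

positions = np.array([[0.0, 0.0, 0.0],
                      [1.5, 0.0, 0.0],
                      [0.0, 1.3, 0.0]])
forces = forces_from_energy(pair_energy, positions)
print(forces)
```

Because the energy depends only on interatomic distances, the forces over all atoms should sum to (numerically) zero, which is a handy sanity check for any energy model used this way.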

**Integrated machine learning models: electronic structure accuracy beyond local potentials**

Modeling atomic-scale systems with explicit consideration of the electrons becomes increasingly predictive as methods advance. Still, the accessible simulation time and length scales are constrained by the expensive computing requirements and the steep scaling with the number of electrons included in the simulation. ML approaches can fit structure-property relationships using a relatively flexible functional form and a small number of reference computations. Many research lines seek to close the gap between the capabilities of electronic structure calculations and their data-driven counterparts. The main strategy they follow is to adapt either the atomistic features that are used as input or the mathematical structure of the model itself, so as to reflect the underlying physics of the problem and the specific structure of the target property.

## 3. Solving the many-body problem with machine learning

**Unifying machine learning and electronic structure methods**

Part of the roadmap commented on above is about predicting potential energy surfaces with ML and then forces to simulate a system’s mechanics, and/or predicting the chemical properties directly with ML. These models do not explicitly model the electrons of the system, i.e. they do not consider the quantum mechanics, but clearly ML has the potential to assist science here too.

As this section of the article describes, there has been a recent surge of ML methods applied to quantum chemistry: predicting electron densities, distributions, spins, Hamiltonians and wavefunctions. If ML can have on quantum calculations the same impact it had, say, on protein structure prediction or molecular mechanics force fields, we can expect immense advances in practical applications.

**Using machine learning to find new density functionals**

Density functionals are an essential part of quantum calculations, and ML is now impacting their calculation. I recently discussed work on this that even Google and DeepMind are pursuing:

**Machine learning Kohn–Sham exchange–correlation potentials**

This is exactly what the above articles deal with, as these calculations can be much accelerated with ML. In particular, Google’s article above utilized an interesting method based on symbolic regression, which makes the overall calculation more interpretable and also easily embeddable into other software packages.

**Deep-learning quantum Monte Carlo for molecules**

As mentioned above, the properties of molecules and materials can in principle be described with the Schrödinger equation, but this is extremely costly and requires approximations. As the number of electrons in the system increases, the main challenge is finding approximations that strike a good balance between accuracy and computational cost. We have seen earlier that many ML methods attempt to predict the outcome of quantum calculations directly, and on the other hand there are *ab initio* ML methods that aim to directly simplify the quantum calculations themselves. This section is about the latter, and includes examples like this one that approaches a solution of the Schrödinger equation by alternating between optimizing parameters and sampling from the wave function to generate data:
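To make that alternation between sampling and optimization concrete, here’s a toy variational Monte Carlo sketch of my own (not the method of the work referred to above, which uses deep networks as wave functions). It assumes a 1D harmonic oscillator with Hamiltonian H = -1/2 d²/dx² + 1/2 x² in reduced units and a one-parameter trial wave function psi(x) = exp(-alpha x²); the exact answer is alpha = 0.5 with energy 0.5. All names and settings are illustrative assumptions.

```python
import numpy as np

def local_energy(x, alpha):
    """Local energy (H psi)/psi for psi(x) = exp(-alpha * x**2)."""
    return alpha + x ** 2 * (0.5 - 2.0 * alpha ** 2)

def metropolis(x, alpha, rng, n_sweeps=20, step=1.0):
    """Move walkers so that they sample |psi|^2 = exp(-2 * alpha * x**2)."""
    for _ in range(n_sweeps):
        proposal = x + step * rng.normal(size=x.shape)
        log_ratio = -2.0 * alpha * (proposal ** 2 - x ** 2)
        accept = np.log(rng.random(x.shape)) < log_ratio
        x = np.where(accept, proposal, x)
    return x

def optimize(alpha=1.0, n_walkers=2000, n_iters=60, lr=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_walkers)
    for _ in range(n_iters):
        x = metropolis(x, alpha, rng)      # sampling step: generate data from the wave function
        e_loc = local_energy(x, alpha)
        dlog = -x ** 2                     # derivative of ln(psi) with respect to alpha
        grad = 2.0 * (np.mean(e_loc * dlog) - np.mean(e_loc) * np.mean(dlog))
        alpha -= lr * grad                 # optimization step: lower the energy
    x = metropolis(x, alpha, rng)
    return alpha, np.mean(local_energy(x, alpha))

alpha_opt, energy = optimize()
print(alpha_opt, energy)
```

The deep-learning versions replace the single parameter `alpha` with the weights of a neural network that represents the many-electron wave function, but the loop has the same two alternating steps.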

**Disordered quantum systems**

Disordered systems, already introduced above, might also benefit from studies through quantum calculations, and again their acceleration through ML methods is beneficial.

The lack of regularity in these systems introduces an additional challenge, because it might well happen that the training set misses arrangements that are feasible, simply because there are too many to account for all of them. Still, there is progress; for example, the article describes some ML models that, once trained on datasets including many random realizations, can generate accurate predictions for previously unseen instances.

## 4. Big data for machine learning

**Challenges and perspectives for interoperability and reuse of heterogenous data collections**

Better known from structural biology, big data analytics and ML approaches are being increasingly applied to various problems in the chemical and material sciences. Even high-throughput screening, typically associated with finding biologically functional molecules, is increasingly used to discover chemicals and materials in large-scale datasets.

This section of the article discusses the NOMAD (NOvel MAterials Discovery) laboratory, a European effort to offer an open platform for sharing data within the entire community of chemists and material scientists. NOMAD allows users to upload results produced with most programs for quantum calculations, hosting over a hundred million calculations contributed by researchers and gathered from other databases:

Such a resource obviously serves to train new ML models for chemistry and material sciences, and also to compare the performance of different methodologies, find trends in data, etc.

The article further discusses the limited interoperability that arises when dealing with such vast amounts of data, especially when they come from different sources. These problems are inherently linked to reproducibility, which is a problem not only of experimental science but also of computational sciences.

**The AFLOW framework for computational materials data and design**

This section discusses another large data repository called AFLOW, this one centered in the US and differing from the above one in that it performs its own calculations. Input structures are either generated from experimentally observed materials or built from crystallographic prototypes that are then decorated with different elements to produce large libraries of related hypothetical materials; AFLOW then runs quantum calculations on them and stores the results for later retrieval. Different AFLOW submodules further calculate various properties that get archived too.

Just like the data in NOMAD, AFLOW enables training of ML models, discovery of trends, etc. Currently it has over 3.5 million entries, each with over 200 calculated properties; all data can be accessed programmatically through APIs.

## 5. Frontier developments of machine learning in materials science

**Adaptive learning strategies for electronic structure calculations**

Adaptive learning aims to enable rapid and efficient navigation of the vast parameter spaces typically associated with the training of ML models. It’s no wonder that it’s also emerging in chemical and materials informatics.

The basis of adaptive learning is carrying out the learning procedure with an algorithm that can autonomously select data points from the wide unexplored or unknown area in an optimal way that decreases the number of steps for convergence and/or the number of training points used, without sacrificing the model’s capacity for prediction and generalization. When dealing with quantum calculations as the training data, adaptive learning can help tremendously, as such data is expensive to produce. This section of the article discusses ways to achieve adaptive learning precisely when using quantum calculations, particularly stressing some ways to achieve this and, crucially, how to prepare the input data.
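Here’s a minimal sketch of what such a loop can look like, under assumptions of my own: a small Gaussian process acts as the model, its predicted uncertainty decides which candidate to label next, and a cheap toy function (`expensive_calculation`, a hypothetical name) stands in for the costly quantum calculation. This is just one of several possible selection strategies, not the specific procedure described in the article.

```python
import numpy as np

def expensive_calculation(x):
    """Stand-in for a costly quantum calculation (toy function, an assumption)."""
    return np.sin(3.0 * x) + 0.5 * x

def kernel(a, b, length=0.4):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_predict(x_train, y_train, x_query, jitter=1e-6):
    """Mean and standard deviation of a zero-mean Gaussian process with unit prior variance."""
    K = kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    K_s = kernel(x_train, x_query)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

pool = np.linspace(0.0, 4.0, 201)          # candidate inputs we could label
x_lab = np.array([0.0, 4.0])               # start with just two labeled points
y_lab = expensive_calculation(x_lab)

for _ in range(15):
    _, std = gp_predict(x_lab, y_lab, pool)
    pick = pool[np.argmax(std)]            # query where the model is least certain
    x_lab = np.append(x_lab, pick)
    y_lab = np.append(y_lab, expensive_calculation(pick))

mean, _ = gp_predict(x_lab, y_lab, pool)
max_error = np.max(np.abs(mean - expensive_calculation(pool)))
print(len(x_lab), max_error)
```

The point of the design is that the expensive function is only called on the points the model itself asks for, 17 in total here, rather than on the whole pool of 201 candidates.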

Just as animal behavior can be shaped through the iterative application of reward and punishment, reinforcement learning consists of iteratively training models within a context that rewards correct predictions. The environment thus “selects” for ML models that maximize reward. This kind of learning is more applicable to the tuning of programs that must achieve goals over successive steps, as you might have seen in methods that train virtual robots to walk, for example.

There are so far not too many applications of reinforcement learning in physics, chemistry, and biology, but the authors of this section put forward some potential applications. In particular, since reinforcement learning excels at control of dynamics, it could be applied to ML models that predict the effects of actions, such as changing temperatures, applying electric fields, etc.
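For readers who have never seen reinforcement learning in code, here’s a minimal tabular Q-learning sketch on a toy control problem that I made up for illustration (it is not taken from the article): an agent can heat or cool a system by one discrete “temperature” level per step and is rewarded only when it reaches a target level. The environment, the reward and all parameter values are assumptions.

```python
import numpy as np

N_STATES, TARGET = 7, 5      # discretized temperature levels and the level we want to reach
ACTIONS = (-1, +1)           # action 0 cools by one level, action 1 heats by one level

def step(state, action_idx):
    """Toy environment: reward 1 only when the target level is reached."""
    new_state = int(np.clip(state + ACTIONS[action_idx], 0, N_STATES - 1))
    done = new_state == TARGET
    return new_state, (1.0 if done else 0.0), done

def train(n_episodes=2000, lr=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, len(ACTIONS)))     # estimated value of each action in each state
    for _ in range(n_episodes):
        state = int(rng.integers(N_STATES))
        for _ in range(20):
            if state == TARGET:
                break
            if rng.random() < epsilon:         # explore with a random action
                action = int(rng.integers(len(ACTIONS)))
            else:                              # otherwise exploit the current estimates
                action = int(np.argmax(q[state]))
            new_state, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * np.max(q[new_state]))
            q[state, action] += lr * (target - q[state, action])
            state = new_state
    return q

q_table = train()
policy = np.argmax(q_table, axis=1)            # best action per state after training
print(policy)
```

Nobody tells the agent which way the target lies; the learned policy (heat below the target, cool above it) emerges only from the rewards collected along the way.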

**Interpretability of machine learning models in physical sciences**

The last section of the article deals with a problem that is actually central to all ML approaches: interpretability. A trained model can perfectly reproduce the training data and correctly predict the test data, and it can even make new predictions correctly. But what has it truly learned about the underlying physics and chemistry that we as humans can leverage?

This very interesting section of the article, and I mean this not just for scientists but for everybody using ML models, explains that the literature on interpretability is quite vast yet there’s no full consensus. In particular, there’s no clear consensus on what exact fundamental questions need to be asked, let alone clear quantities that can be measured to infer what the model has learned. The article then goes on to explain two points associated with interpretability: transparency and explainability.

Transparency connects directly to the fact that in science, a phenomenon is considered to be totally understood when a predictive mathematical law is formulated that can in principle work with no exception, at least in a given domain of applicability, and here’s where ML through symbolic regression can help most. Moreover, such laws are usually expected to be relatively simple, such that we can relate them to fundamental physics or chemistry. I discussed some examples of this in this article:

The other aspect, explainability, refers to the possibility of at least inspecting a model that is in general too complex to be grasped by the human mind (working as a “black box”) to investigate, and ideally thus reveal, how inputs and outputs are connected inside of it, for example by testing which inputs affect the output to a larger extent.

Scientists ideally need to reach deep transparency and/or explainability in order to truly trust an ML model as they would trust a simple analytical model. The article ends by discussing current and future work and challenges about this. Thus, again, very interesting for everybody working on ML models.
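One simple way of testing which inputs affect the output most is permutation importance: shuffle one input at a time and measure how much the model’s error grows. Here’s a minimal sketch under my own assumptions; `black_box` is a hypothetical stand-in for a trained model, and the data are synthetic.

```python
import numpy as np

def black_box(X):
    """Stand-in for a trained model we cannot read directly (hypothetical)."""
    return 3.0 * X[:, 0] + 0.5 * X[:, 1] ** 2   # note that it ignores the third input

def permutation_importance(model, X, y, rng, n_repeats=10):
    """Average increase in mean squared error when one input column is shuffled."""
    base = np.mean((model(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores[j] += np.mean((model(X_perm) - y) ** 2) - base
    return scores / n_repeats

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                       # three synthetic input features
y = black_box(X) + 0.1 * rng.normal(size=500)       # synthetic targets with a little noise
importance = permutation_importance(black_box, X, y, rng)
print(importance)
```

In this toy case the first input dominates, the second matters less, and the third gets an importance of exactly zero because the model never uses it. That is the kind of insight explainability methods aim for, even though it says nothing yet about why the model depends on those inputs.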