Machine learning solves complex solvent selection challenge


Graduate student Surajudeen Omolabake with a liquid-handling robot used in a recent project that applied machine learning to speed the selection of green solvent mixtures for extracting valuable molecules from plant biomass. Scientists trained a computer model to select 40 mixtures predicted to have the best properties, then used the liquid-handling robot to test them. The results of the experiments were then used to further train the model. Chelsea Mamott/Wisconsin Energy Institute

Plant fibers contain valuable chemicals that can be used to make biofuels, plastics, medicines, and other products, but separating and purifying them is challenging, especially without using toxic solvents.

Now University of Wisconsin–Madison scientists have used machine learning to streamline the process of finding the best solvents for the job, balancing selectivity, efficiency, and environmental impact.


Developed through a collaboration between scientists with the Great Lakes Bioenergy Research Center (GLBRC) and Wisconsin Energy Institute, the process uses Bayesian experimental design, a framework for making experiments more efficient and informative in uncertain situations.

The Bayesian framework uses statistical models to guess what a design space looks like based on existing knowledge and to decide which areas of the space to explore next. By balancing exploration of unknown areas and exploitation of promising ones, the framework can be used to improve predictive models.

That means instead of testing thousands of mixtures, researchers can focus on dozens of the most promising candidates, said Shannon Stahl, a professor of chemistry who led the project with Reid Van Lehn, a professor of chemical and biological engineering.

Van Lehn said the process, which combines computer modeling with lab experimentation, could speed innovation of bioproducts as well as pharmaceuticals.

“We think of this as a flexible technique that could be applied in multiple different contexts,” Van Lehn said. “And this is really just a proof of concept.”

Green solvents for plant-based products

What is Bayesian experimental design?

Imagine you want to find the best campsite in a park that’s shrouded in fog. You can hike to different points, but each trip takes time and energy, so you want to make them count. You know there is some high ground in one corner of the park, but you don’t know anything about the rest: it might be a high plateau or a swamp full of alligators. You can explore the unknown regions or exploit the promising ones. The goal of Bayesian optimization is to choose which spot to evaluate next based on what you already know. It does this by building a probabilistic guess of the landscape from known measurements and using a formula, called an acquisition function, to decide where to look next. The idea is that you don’t need to know all the details about unpromising areas, so it’s better to focus resources on the promising ones.
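The fog analogy can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code. It assumes a one-dimensional park, a Gaussian process with a squared-exponential kernel as the probabilistic guess, and an upper confidence bound as the acquisition function. The function names, the length scale, and the `kappa` weight are hypothetical choices made for this example.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential kernel: nearby spots have similar elevation."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)   # K^-1 y
    v = solve(K, k)        # K^-1 k
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, v)), 0.0)
    return mean, math.sqrt(var)

def next_point(xs, ys, candidates, kappa=2.0):
    """Acquisition step: pick the candidate with the highest mean + kappa * std.
    A larger kappa (an assumed tuning knob) favors exploring unknown areas."""
    def ucb(x):
        mean, std = gp_posterior(xs, ys, x)
        return mean + kappa * std
    return max(candidates, key=ucb)

# Two hikes so far, both near the corner with known high ground.
known_x, known_elev = [0.0, 0.1], [1.0, 0.9]
grid = [i / 20 for i in range(21)]
greedy = next_point(known_x, known_elev, grid, kappa=0.0)   # exploit only
curious = next_point(known_x, known_elev, grid, kappa=2.0)  # reward uncertainty
print(f"exploit-only pick: {greedy}, uncertainty-aware pick: {curious}")
```

With `kappa=0.0` the next hike stays near the known high ground. With a larger `kappa` it heads into the fog, where the model is least certain.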

Lignin, one of the three main components of plant cell walls, contains valuable aromatic compounds (molecules with rings of carbon atoms) typically produced from petroleum.

These bioproducts can be used to make plant-based plastics, fibers, pharmaceuticals, and other products, but that requires separating them from other byproducts in the broths used to break down the biomass.

Liquid-liquid extraction is a way to separate and purify chemicals using two liquids that don't mix, like water and oil. When the liquids are stirred or shaken together, the target substance will dissolve in its preferred solvent. Once the liquids separate into layers, the heavier liquid can be drained from the bottom.
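The mass balance behind a single extraction step can be sketched as follows. This is a simplified illustration, assuming two fully immiscible liquids, complete equilibrium, and a constant partition coefficient `kp` defined as the concentration in the extracting solvent divided by the concentration in the feed liquid. The function and parameter names are hypothetical.

```python
def fraction_extracted(kp, vol_solvent, vol_feed):
    """Fraction of the solute that ends up in the extracting solvent after
    one equilibrium contact, assuming kp = C_solvent / C_feed."""
    return kp * vol_solvent / (kp * vol_solvent + vol_feed)

def fraction_remaining(kp, vol_solvent, vol_feed, n_contacts):
    """Fraction left in the feed after repeated contacts with fresh solvent."""
    return (1.0 - fraction_extracted(kp, vol_solvent, vol_feed)) ** n_contacts

# Illustrative numbers only: equal volumes, solute prefers the solvent 9 to 1.
print(fraction_extracted(9.0, 1.0, 1.0))        # share captured in one contact
print(fraction_remaining(9.0, 1.0, 1.0, 2))     # share left after two contacts
```

The stronger the target's preference for the extracting solvent, the more of it is captured in each contact, which is why the choice of solvent matters so much.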

The key is finding the right solvent, which is challenging when there’s more than one product.

“The mixture of the bioproducts contains a lot of different chemicals,” said Amy Qin, a senior research specialist with Dow Chemical who collaborated on the project while earning her PhD at UW–Madison. “Some of the chemical properties are very similar to each other, which makes it difficult to separate one from the other.”

Chlorinated solvents are effective at dissolving these products, but they are also toxic, which limits their use in industrial biorefineries.

So the researchers set out to find something less hazardous.


Detail of the liquid-handling robot used to test solvent mixtures selected by the computer model. Chelsea Mamott/Wisconsin Energy Institute

Knowing they would need a blend of green solvents to match the performance and versatility of chlorinated solvents, the team identified eight candidates, including water, alcohols, and ethers.

But with a nearly infinite number of possible combinations, traditional trial-and-error methods weren't practical.
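A quick count shows how fast the options multiply. The sketch below assumes, purely for illustration, that each mixture is built from fixed percentage increments of the eight solvents. The article does not say how the researchers discretized the space, and the count ignores practical constraints such as whether a blend actually forms two separate layers.

```python
from math import comb

def count_mixtures(n_solvents, steps):
    """Number of ways to divide a mixture into `steps` equal increments
    across `n_solvents` components (the stars-and-bars formula)."""
    return comb(steps + n_solvents - 1, n_solvents - 1)

for steps in (10, 20, 100):
    increment = 100 / steps
    print(f"{increment:g}% increments: {count_mixtures(8, steps):,} mixtures")
```

Even at coarse 10% increments, eight solvents yield 19,448 possible blends, and halving the increment to 5% pushes the count to 888,030.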

The researchers used Bayesian experimental design to train a computer model to predict a certain property — in this case a measure of a substance’s preference between two solvents.
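A property of this kind is commonly expressed as a partition coefficient, and the project's schematic refers to log10 Kp values. The sketch below assumes the standard definition: the ratio of a compound's equilibrium concentrations in the two liquid layers, reported on a log scale. Which layer goes in the numerator, and the helper names, are assumptions made for this example.

```python
import math

def log10_kp(conc_solvent_phase, conc_aqueous_phase):
    """log10 of the partition coefficient, assuming
    Kp = C_solvent_phase / C_aqueous_phase at equilibrium.
    Positive values mean the compound prefers the solvent phase."""
    return math.log10(conc_solvent_phase / conc_aqueous_phase)

def separation_factor(kp_target, kp_impurity):
    """Ratio of two partition coefficients: values far from 1 mean the
    solvent system can tell the two compounds apart."""
    return kp_target / kp_impurity

# Illustrative concentrations only (same units in both phases).
print(log10_kp(50.0, 0.5))              # strong preference for the solvent phase
print(separation_factor(100.0, 10.0))   # target extracted far more readily
```

A single number per compound and mixture gives the model a concrete target to predict and gives the lab a concrete value to measure.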

The framework has three stages.

The first step (design) is to identify a set of solvent mixtures, in this case some combination of eight green solvents. Then researchers test a selection of solvent mixtures the model suggests to get the actual values (observe). Those values are then used to train the model to improve its predictions (learn).

In selecting batches for experimental tests, researchers have to balance two principles: exploration and exploitation.

In the initial exploration rounds, they selected mixtures with high uncertainty.

“I just know very little about how that mixture will perform,” Van Lehn said. “If I do an experiment for that mixture, I’ll learn a lot, and my model will improve substantially.”

Diagram illustrating a Bayesian experimental design process with three main interconnected stages: Design, Learn, and Observe. The Design stage consists of two parts: "Optimize to obtain the first batch sample" (showing 3D surface plots with principal components) and "Iteratively generate fantasy samples" (with a circular workflow of obtaining labels, updating models, and optimization). It includes COSMO-RS simulation for log10 Kp values.

Schematic of the Bayesian experimental design framework used to aid selection of green solvent mixtures. Courtesy of the authors

As the model becomes more accurate, researchers shift to exploitation, focusing on the mixtures predicted to have the best performance.
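The design, observe, and learn cycle, with its shift from exploration to exploitation, can be sketched as a simple loop. This is a toy version under stated assumptions: mixtures are reduced to a single number between 0 and 1, one point is chosen per round rather than a batch of 40, the "model" is a crude nearest-neighbor guess, and `hidden_truth` is a synthetic stand-in for a lab measurement. None of these names or choices come from the study.

```python
import math

def hidden_truth(x):
    """Synthetic stand-in for a real lab measurement."""
    return math.sin(3 * x) + 0.5 * x

def predict(data, x):
    """Crude surrogate: value of the nearest measured point, with the
    distance to that point serving as a rough uncertainty."""
    nearest_x, nearest_y = min(data, key=lambda p: abs(p[0] - x))
    return nearest_y, abs(nearest_x - x)

def run_campaign(rounds=6):
    candidates = [i / 50 for i in range(51)]
    data = [(0.0, hidden_truth(0.0)), (1.0, hidden_truth(1.0))]
    for r in range(rounds):
        # Weight on uncertainty decays to zero: explore first, exploit later.
        kappa = 10.0 * (1 - r / max(rounds - 1, 1))
        measured = {x for x, _ in data}
        pool = [x for x in candidates if x not in measured]
        # Design: score each untested point by prediction plus weighted uncertainty.
        choice = max(pool, key=lambda x: predict(data, x)[0]
                     + kappa * predict(data, x)[1])
        # Observe: run the (simulated) experiment.
        result = hidden_truth(choice)
        # Learn: add the measurement so later predictions improve.
        data.append((choice, result))
    return data

history = run_campaign()
best_x, best_y = max(history, key=lambda p: p[1])
print(f"best measured point: x={best_x}, value={best_y:.3f}")
```

Early rounds send the search to points far from anything measured. By the final round the uncertainty weight is zero, and the loop simply tests next to its best result so far.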

For the observation phase, the researchers used a liquid-handling robot that could test 40 samples at a time. But taking advantage of that batch automation presented another challenge: how to choose 40 different mixtures that result in the biggest improvements to the model?

Normally with this type of machine learning, you choose a data point that is expected to maximize improvement. Once the model updates, you choose another point, which is likely very different.

Choosing mixtures one at a time isn’t very efficient, but neither is picking the top 40.

“All 40 points would probably be mixtures that are similar because the model is uncertain of its predictions for some region of the space of possible mixtures,” Van Lehn said. “(So) it’s unlikely to add much information compared to one mixture.”


To get around this problem, the researchers created a loop in the design phase using a physics-based model known as COSMO-RS to sequentially generate 40 “fantasy samples” that temporarily update the model. COSMO-RS predictions aren’t as accurate as experiments, but they are good enough to identify the mixtures most likely to improve performance.

The inner loop is used only to guide the choice of mixtures for each batch; the model itself is trained only on experimental data.
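The difference between picking the top-scoring points and building a batch with fantasy samples can be illustrated with a toy example. Everything here is an assumption made for illustration: mixtures are reduced to one number, the surrogate is a crude nearest-neighbor guess, and `cheap_physics_estimate` is a hypothetical stand-in for a physics-based prediction such as COSMO-RS, not a call to the real software.

```python
import math

def cheap_physics_estimate(x):
    """Hypothetical stand-in for a fast, approximate physics-based prediction."""
    return math.sin(3 * x)

def predict(data, x):
    """Crude surrogate: nearest measured value, with distance as uncertainty."""
    nearest_x, nearest_y = min(data, key=lambda p: abs(p[0] - x))
    return nearest_y, abs(nearest_x - x)

def score(data, x, kappa=5.0):
    mean, uncertainty = predict(data, x)
    return mean + kappa * uncertainty

def top_k_batch(data, candidates, k):
    """Naive batch: the k best-scoring candidates under the current model."""
    return sorted(candidates, key=lambda x: score(data, x), reverse=True)[:k]

def fantasy_batch(data, candidates, k):
    """Sequential batch: after each pick, temporarily add a fantasy label
    so the next pick accounts for what the first would teach the model."""
    temp = list(data)  # copy, so the real training data is never modified
    batch = []
    for _ in range(k):
        seen = {x for x, _ in temp}
        pool = [x for x in candidates if x not in seen]
        pick = max(pool, key=lambda x: score(temp, x))
        batch.append(pick)
        temp.append((pick, cheap_physics_estimate(pick)))
    return batch  # the fantasy labels in `temp` are discarded here

def spread(batch):
    return max(batch) - min(batch)

measured = [(0.0, 0.0), (0.1, 0.3)]
grid = [i / 100 for i in range(101)]
naive = top_k_batch(measured, grid, 5)
diverse = fantasy_batch(measured, grid, 5)
print(f"naive batch spans {spread(naive):.2f}, fantasy batch spans {spread(diverse):.2f}")
```

In this toy setup the naive batch crowds into one uncertain corner, while the fantasy-sample batch spreads across the space. The measured data is left untouched, so only real experiments would feed the eventual training step.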

A side-by-side comparison showed the new model is far more accurate than COSMO-RS on its own, according to results published in the journal ACS Sustainable Chemistry & Engineering.

GLBRC researchers are now working to apply the framework to other reaction conditions, such as temperature and pressure, and to consider practical parameters like cost and whether a solvent can be easily recovered for reuse.

“Bayesian optimization is not a new concept in academia, but I feel like, especially in those experimental labs or even in industry, not a lot of people use Bayesian optimization to guide either optimization or experimental design process,” Qin said. “We’re just showing this application as a showcase on this complex problem.”