A New Era for Big Data and Chromatography

November 1, 2017
Gabriel Vivó-Truyols

LCGC Europe

Volume 30, Issue 11
Page Number: 615–616

We have entered a new stage in the era of accelerations. Moore’s law continues to hold, exponentially increasing the available computing power. Other accelerations are also remarkable, particularly the easy access to cloud computing and the spread of artificial intelligence into practically all sectors of our society.

And what about chromatography? Does Moore’s law apply in this discipline? My answer is yes! If one plots the amount of data produced by chromatographic instruments against time, an exponential trend is clearly visible. We have gone from producing a few kilobytes per gas chromatography (GC) chromatogram in the 1960s to a few gigabytes per modern high-resolution liquid chromatography–mass spectrometry (LC–MS) chromatogram. The advent of MS and, particularly, high-resolution MS has been a clear contributor, but sophisticated instrumentation, such as comprehensive two-dimensional chromatography systems and second-order detectors such as ion mobility mass spectrometers, has also contributed.
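
As a rough check on the scale of this growth, the doubling time implied by the two figures above can be sketched in a few lines of Python. The specific values (about 5 kB in 1965 and about 5 GB in 2017) are assumptions chosen only to stand in for “a few kilobytes” and “a few gigabytes”; two points cannot establish an exponential trend, they only show what doubling time such a trend would imply.

```python
import math

def implied_doubling_time(size_start, size_end, years):
    """Years per doubling if growth between the two sizes were purely exponential."""
    doublings = math.log2(size_end / size_start)
    return years / doublings

# Assumed round numbers (not from the article): ~5 kB in 1965, ~5 GB in 2017.
KB, GB = 1e3, 1e9
doubling_years = implied_doubling_time(5 * KB, 5 * GB, 2017 - 1965)
print(f"{doubling_years:.1f} years per doubling")
```

Under these assumptions the data volume per chromatogram doubles roughly every two and a half years, which is the same order of magnitude as the doubling period usually quoted for Moore’s law.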

Chemometrics and Chromatography

During the late 1960s and the 1970s, research at the interface between statistics and analytical chemistry started to gain momentum, planting the seed of a new discipline in analytical chemistry: chemometrics. Professor Massart in Europe and Professor Kowalski in America (to name just two of these pioneers) were visionaries at that time, helping to make chemometrics a mature discipline in the 1980s and the 1990s.

Most of the chemometric techniques (but not all) were developed with the application of multivariate statistics, and this idea grew along with the development of “multivariate” instrumentation. The application of partial least squares (PLS) to model properties of interest with near infrared (NIR) spectroscopy is one of the most cited examples. Overtones in NIR make it impossible to isolate a single wavelength that describes the property of interest, but taking into account all the wavelengths together (that is, multivariate information) can help to model such a property of interest.

In chromatography, chemometrics has been successful at solving many problems, yet its popularity has not been as high as in other disciplines. There are two main reasons for this. First, chemometrics is viewed as a complementary (but not an “essential”) technique. In chromatography we want to create enough resolution to separate peaks, and the call to use multivariate techniques to separate mathematically what the chromatographic column could not is viewed as a kind of “last resort” option. Second, spectroscopic instrumentation shows a much higher repeatability than chromatography, which suffers from effects such as column ageing, pumping instabilities, and sample injection variability. These effects are particular to chromatography; they jeopardize the use of multivariate techniques and make the alignment of peaks necessary.

Despite this, some important areas have developed at the interface of chromatography and chemometrics. For example, fitting peak models to mathematically separate overlapped peaks (peak deconvolution) is still an option. A review paper published in 2001 included a long table with more than 100 peak models (1). Alternatively, peak resolution techniques that do not force the peaks to follow a mathematical model are viable when peaks are not completely resolved and multichannel detection is available (with different variants [2,3]). Other areas have experienced important developments. In retention prediction, the so-called “Abraham models” (4) use molecular descriptors to give an approximate idea of the retention times of molecules without any laboratory experiment. Methods for chromatographic optimization (5) are now well accepted in the chromatographic community.
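
To make the idea of model-based peak deconvolution concrete, here is a minimal Python sketch. It is a deliberately simplified case and not the method of any of the cited papers: it assumes Gaussian peak shapes (only one of the many peak models available) and assumes the retention times and widths are already known, so that only the two peak heights need to be estimated by linear least squares. The function names and all numerical values are hypothetical.

```python
import math

def gaussian(t, center, sigma):
    """Unit-height Gaussian peak shape evaluated at time t."""
    return math.exp(-0.5 * ((t - center) / sigma) ** 2)

def fit_two_peak_heights(times, signal, centers, sigmas):
    """Least-squares heights of two overlapped Gaussian peaks.

    Positions and widths are taken as known, so the problem is linear in
    the two heights and reduces to a 2x2 system of normal equations.
    """
    g1 = [gaussian(t, centers[0], sigmas[0]) for t in times]
    g2 = [gaussian(t, centers[1], sigmas[1]) for t in times]
    a11 = sum(x * x for x in g1)
    a12 = sum(x * y for x, y in zip(g1, g2))
    a22 = sum(y * y for y in g2)
    b1 = sum(x * s for x, s in zip(g1, signal))
    b2 = sum(y * s for y, s in zip(g2, signal))
    det = a11 * a22 - a12 * a12
    # Cramer's rule for the 2x2 normal equations.
    h1 = (b1 * a22 - b2 * a12) / det
    h2 = (a11 * b2 - a12 * b1) / det
    return h1, h2

# Synthetic, noise-free chromatogram (all values made up for illustration).
times = [i * 0.01 for i in range(301)]         # 0.00 to 3.00 min
centers, sigmas = (1.40, 1.55), (0.08, 0.08)   # resolution of about 0.5
true_heights = (10.0, 4.0)
signal = [true_heights[0] * gaussian(t, centers[0], sigmas[0])
          + true_heights[1] * gaussian(t, centers[1], sigmas[1])
          for t in times]

h1, h2 = fit_two_peak_heights(times, signal, centers, sigmas)
print(f"recovered heights: {h1:.3f}, {h2:.3f}")
```

With real data the positions, widths, and peak shape parameters are usually unknown as well, which turns the fit into a nonlinear optimization and makes the choice of peak model matter much more.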

Big (Chromatographic) Data and Artificial Intelligence

Despite the development of methods for data treatment in separation science, there is still a gap between the potential and the actual information obtained from (raw) chromatographic data. With Moore’s law applying to separation science, this gap is becoming wider. There are several reasons for this.

First, as the data grow exponentially, so does the demand for automation. During the 1980s and 1990s, the need for automation was not as high as it is now. We dealt with chromatograms containing several peaks, and human supervision was feasible when applying peak resolution techniques such as multivariate curve resolution–alternating least squares (MCR-ALS) (2). Nowadays, with thousands or tens of thousands of peaks produced (together with the high-resolution MS data), it is not viable to ask for user intervention as before; automation is now a must.

Second, we need to re-educate ourselves in the way we conceive and manage the concept of information. This can be rooted in a basic statistical concept: probability. To understand this, we have to revisit the ideas of the great mathematicians of the 18th century. For Bayes, Bernoulli, and Laplace, probability represented a degree of belief: how strongly they thought that something was true (6). If we conceive “information” as a probability distribution (either continuous or discrete), we are able to manage and update this information (that is, probability) with the data at hand in a very elegant way.

For example, suppose we want to know whether a certain toxic compound is present in a blood sample. We may think about an LC–MS method and try to find a peak at a certain retention time and at a certain mass-to-charge ratio (m/z). Our answer will be “the compound is there” or “the compound is not there”, basically depending on the peak height around this m/z value. However, if we apply the Bayesian view described earlier, this information changes to “the probability that the compound is there is x%”. This changes everything, because we can then update this probability based on new data or evidence that we have at hand.

Suppose that we have information about the number of isomers that may appear at this m/z channel. If the m/z value is a common one (so many other molecules may appear at it), a peak at this particular m/z carries little information, which lowers the probability that the molecule is there (7). Suppose that we then add information about isotopes, adducts, retention times, other experiments, and so on. Because our answer was never a definitive “the compound is there” but a probabilistic one, we can keep on updating these probabilities based on the new data considered. As the data grow exponentially, it is a matter of updating the probability, that is, the information about the variable that we want to infer.
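
The updating described above can be sketched with Bayes’ theorem in a few lines of Python. Every number below (the prior and the probabilities of observing each piece of evidence with and without the compound) is hypothetical and chosen only to show the mechanics; the sketch also assumes that the pieces of evidence are independent of each other given the presence or absence of the compound, which real data may not satisfy.

```python
def bayes_update(prior, p_evidence_if_present, p_evidence_if_absent):
    """Posterior P(present | evidence) from Bayes' theorem."""
    numerator = p_evidence_if_present * prior
    denominator = numerator + p_evidence_if_absent * (1.0 - prior)
    return numerator / denominator

# Hypothetical prior belief that the compound is in the sample.
probability = 0.10

# (description, P(evidence | present), P(evidence | absent)); all values assumed.
evidence = [
    ("peak at expected m/z", 0.95, 0.30),    # common m/z, so often seen anyway
    ("retention time matches", 0.90, 0.10),
    ("isotope pattern matches", 0.85, 0.05),
]

for description, p_present, p_absent in evidence:
    # Yesterday's posterior becomes today's prior.
    probability = bayes_update(probability, p_present, p_absent)
    print(f"after '{description}': P(present) = {probability:.3f}")
```

With these assumed numbers, the peak at a common m/z value raises the probability only from 10% to about 26%, whereas the more discriminating retention time and isotope evidence bring it to about 98%. Each new piece of evidence simply updates the probability already held.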

Bayesian statistics is experiencing a revolution and is being applied in many areas of our society. Big data applications (from self-driving cars to web searches) use Bayesian statistics to manage (and update) information and make automated decisions. The concept of Bayesian statistics is deeply rooted in artificial intelligence. The reason is simple: In the time of Bayes, there were no computers to solve the complicated equations that arise from Bayes’ theorem, and the amount of information available was limited. In the internet era, we now have both the information and the computing power to digest it. Moreover, by treating information as a probability distribution, models can be updated easily as new information becomes available.

I expect a revolution in the use of artificial intelligence in the chromatography area. There is a need to digest this enormous quantity of data, sort it out, and show it to the chromatographer in an efficient manner. However, this will require re-educating ourselves on how we define the concept of information.

References

  1. V.B. Di Marco and G.G. Bombi, J. Chromatogr. A 931, 1–30 (2001).
  2. A. de Juan and R. Tauler, Anal. Chim. Acta 500, 195–210 (2003).
  3. R. Bro, Chemom. Intell. Lab. Syst. 36, 149–171 (1997).
  4. M.H. Abraham, in Quantitative Treatment of Solute/Solvent Interaction, P. Politzer and J.S. Murray, Eds. (Elsevier, Amsterdam, The Netherlands, 1994), pp. 83–134.
  5. A.M. Siouffi and R. Phan-Tan-Luu, J. Chromatogr. A 892, 75–106 (2000).
  6. D.S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial (Oxford University Press, 2012).
  7. M. Woldegebriel and G. Vivó-Truyols, Anal. Chem. 88, 9843–9849 (2016). 

Gabriel Vivó-Truyols is a principal scientist at Tecnometrix, in Ciutadella de Menorca, Spain, and a guest researcher in the Analytical Chemistry Group at the University of Amsterdam, in The Netherlands.