November 1, 2017

LCGC Europe-11-01-2017
Volume 30
Issue 11

A New Era for Big Data and Chromatography

We have entered a new stage in the era of accelerations. Moore’s law continues to hold, exponentially increasing the computing power available. Other accelerations are equally remarkable, particularly the easy access to cloud computing and the expanding influence of artificial intelligence in practically all sectors of our society.

And what about chromatography? Does Moore’s law apply in this discipline? My answer is yes! If one examines the amount of data produced by chromatographic instruments versus time, an exponential trend is clearly visible. We have moved from producing a few kilobytes from a gas chromatography (GC) chromatogram in the 1960s to a few gigabytes from a modern high-resolution liquid chromatography–mass spectrometry (LC–MS) chromatogram. The advent of MS and, particularly, high-resolution MS has been a clear contributor, but sophisticated instrumentation, such as comprehensive two-dimensional chromatography systems and second-order detectors (for example, ion mobility mass spectrometers), has also contributed.
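As a rough check on the exponential claim, the growth from kilobytes to gigabytes can be turned into an implied doubling time. The sketch below assumes purely exponential growth between two points. The specific sizes and years are illustrative assumptions rather than measured values, and `doubling_time_years` is a hypothetical helper name.

```python
import math

def doubling_time_years(size_start_bytes, size_end_bytes, years_elapsed):
    """Years per doubling, assuming pure exponential growth between two points."""
    doublings = math.log2(size_end_bytes / size_start_bytes)
    return years_elapsed / doublings

# Illustrative assumptions, not measured values:
# about 2 kB per GC chromatogram around 1965,
# about 2 GB per high-resolution LC-MS chromatogram around 2017.
start_bytes = 2e3
end_bytes = 2e9
years = 2017 - 1965

t_double = doubling_time_years(start_bytes, end_bytes, years)
print(f"Implied doubling time: {t_double:.1f} years")
```

Under these assumed figures, the data volume per chromatogram doubles roughly every two and a half years, which is of the same order as the doubling period usually quoted for Moore’s law.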

Chemometrics and Chromatography

During the late 1960s and the 1970s, research at the interface between statistics and analytical chemistry started to gain momentum, planting the seed of a new discipline in analytical chemistry: chemometrics. Professor Massart in Europe and Professor Kowalski in America (to name just two of these pioneers) were visionaries at that time, helping to make chemometrics a mature discipline in the 1980s and the 1990s.

Most of the chemometric techniques (but not all) were developed with the application of multivariate statistics, and this idea grew along with the development of “multivariate” instrumentation. The application of partial least squares (PLS) to model properties of interest with near infrared (NIR) spectroscopy is one of the most cited examples. Overtones in NIR make it impossible to isolate a single wavelength that describes the property of interest, but taking into account all the wavelengths together (that is, multivariate information) can help to model such a property of interest.

In chromatography, chemometrics has been successful at solving many problems, yet its popularity has not been as high as in other disciplines. There are two main reasons for this. First, chemometrics is viewed as a complementary (but not an “essential”) technique. In chromatography we want to create enough resolution to separate peaks, and the call to use multivariate techniques to separate mathematically what the chromatographic column could not is viewed as a kind of “last resort” option. Second, spectroscopic instrumentation is much more repeatable than chromatographic instrumentation, which suffers from effects such as column ageing, pumping instabilities, and variability in sample injection. These effects are particular to chromatography; they jeopardize the use of multivariate techniques and make the alignment of peaks necessary.

Despite this, some important areas have developed at the interface of chromatography and chemometrics. For example, peak models can be used for peak deconvolution, which separates overlapped peaks mathematically. A review paper published in 2001 included a long table with more than 100 peak models (1). Alternatively, peak resolution techniques that do not force the peaks to follow a mathematical model are viable when peaks are not completely resolved and multichannel detection is available (several variants exist [2,3]). Other areas have experienced important developments. In retention prediction, the so-called “Abraham models” (4) use molecular descriptors to give an approximate idea of the retention times of molecules without any laboratory experiment. Methods for chromatographic optimization (5) are now well accepted in the chromatographic community.
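To make the idea of model-based peak deconvolution concrete, here is a minimal sketch under strong simplifying assumptions. It uses a Gaussian peak shape (only one of the many models tabulated in reference 1), treats peak positions and widths as known, and works on a noise-free synthetic signal. Under those assumptions only the two peak heights remain unknown, and they follow from a linear least-squares fit. The function names and all numerical values are hypothetical. A practical deconvolution would also fit positions and widths by non-linear optimization.

```python
import math

def gaussian(t, height, center, sigma):
    """Gaussian peak model evaluated at time t."""
    return height * math.exp(-0.5 * ((t - center) / sigma) ** 2)

def fit_two_heights(times, signal, peak_a, peak_b):
    """Least-squares heights of two overlapped peaks with known (center, sigma).

    Solves the 2x2 normal equations for signal ~ ha * ga(t) + hb * gb(t),
    where ga and gb are unit-height Gaussian profiles.
    """
    ga = [gaussian(t, 1.0, *peak_a) for t in times]
    gb = [gaussian(t, 1.0, *peak_b) for t in times]
    saa = sum(a * a for a in ga)
    sbb = sum(b * b for b in gb)
    sab = sum(a * b for a, b in zip(ga, gb))
    say = sum(a * y for a, y in zip(ga, signal))
    sby = sum(b * y for b, y in zip(gb, signal))
    det = saa * sbb - sab * sab
    ha = (say * sbb - sby * sab) / det
    hb = (sby * saa - say * sab) / det
    return ha, hb

# Synthetic, noise-free chromatogram with two strongly overlapped peaks
# (hypothetical values; the separation is far below baseline resolution).
times = [i * 0.05 for i in range(201)]   # 0 to 10 min
peak_a = (4.8, 0.3)                      # (center, sigma) in minutes
peak_b = (5.2, 0.3)
signal = [gaussian(t, 10.0, *peak_a) + gaussian(t, 6.0, *peak_b) for t in times]

ha, hb = fit_two_heights(times, signal, peak_a, peak_b)
print(f"Recovered heights: {ha:.3f} and {hb:.3f}")
```

Because the synthetic signal is built from the same model that is fitted, the true heights are recovered. With real data, noise and model mismatch limit how well overlapped peaks can be separated this way.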

Big (Chromatographic) Data and Artificial Intelligence

Despite the development of methods for data treatment in separation science, there is still a gap between the potential and the actual information obtained from (raw) chromatographic data. With Moore’s law applying to separation science, this gap is becoming wider. There are several reasons for this.

First, as the data grow exponentially, the demand for automation grows with them. During the 1980s and 1990s, the need for automation was not as high as it is now. We dealt with chromatograms containing several peaks, and human supervision was feasible when applying peak resolution techniques such as multivariate curve resolution–alternating least squares (MCR-ALS) (2). Nowadays, with thousands or tens of thousands of peaks produced (along with high-resolution MS data), it is not viable to ask for user intervention as before; automation is now a must.

Second, we need to re-educate ourselves in the way we conceive and manage the concept of information. This can be rooted in a basic statistical concept: probability. To understand this, we have to revisit the theory of the great mathematicians of the 18th century. For Bayes, Bernoulli, and Laplace, probability represented a degree of belief: how much they thought that something was true (6). If we conceive “information” as a probability distribution (either continuous or discrete), we are able to manage and update this information (that is, probability) with the data at hand in a very elegant way.

For example, suppose we want to know whether a certain toxic compound is present in a blood sample. We may develop an LC–MS method and look for a peak at a certain retention time and a certain mass-to-charge ratio (m/z). Our answer will be “the compound is there” or “the compound is not there”, basically depending on the peak height around this m/z value. However, if we apply the Bayesian view described earlier, this answer changes to “the probability that the compound is there is x%”. This changes everything, because we can then update this probability based on new data or evidence at hand. Suppose that we have information about the number of isomers that may appear at this m/z channel. If the m/z value is common (so many other molecules may appear at it), the information carried by this particular m/z peak is not that high, which lowers the probability that the molecule is there (7). Suppose that we add information about isotopes, adducts, retention times, other experiments, and so on. Because our answer was never a definitive “the compound is there” but a probabilistic one, we can keep updating these probabilities as new data are considered. As the data grow exponentially, it is a matter of updating the probability, that is, the information about the variable that we want to infer.
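The updating described above can be sketched in a few lines. This is only an illustration of the mechanics of Bayes’ theorem for a yes/no hypothesis, not the method of reference 7. The prior and all likelihood values are hypothetical, the `update` function is an illustrative helper, and the pieces of evidence are assumed to be conditionally independent so that they can be applied one after another.

```python
def update(prior, p_evidence_if_present, p_evidence_if_absent):
    """One application of Bayes' theorem for a yes/no hypothesis.

    Returns P(present | evidence) from the prior P(present) and the
    likelihood of the evidence under each hypothesis.
    """
    numerator = p_evidence_if_present * prior
    denominator = numerator + p_evidence_if_absent * (1.0 - prior)
    return numerator / denominator

# All numbers below are hypothetical, chosen only to show the mechanics.
# Each entry: (description, P(evidence | present), P(evidence | absent)).
evidence = [
    # A common m/z is weak evidence: other molecules often produce it too.
    ("peak at expected m/z", 0.95, 0.30),
    ("isotope pattern matches", 0.90, 0.10),
    ("retention time within window", 0.85, 0.05),
]

probability = 0.01  # assumed prior: the compound is rarely present
history = [probability]
for description, p_if_present, p_if_absent in evidence:
    # Assumes each piece of evidence is conditionally independent of the others.
    probability = update(probability, p_if_present, p_if_absent)
    history.append(probability)
    print(f"{description}: P(present) = {probability:.3f}")
```

With these assumed numbers, the m/z match alone moves the probability only from 1% to about 3%, because many other molecules share that m/z. Adding the isotope pattern and the retention time raises it to roughly 83%. The answer is never a plain yes or no, so further evidence can always be folded in the same way.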

Bayesian statistics is experiencing a revolution and is being applied in many disciplines in our society. Big data applications (from self-driving cars to web searches) use Bayesian statistics to manage (and update) information and make automated decisions. The concept of Bayesian statistics is deeply rooted in artificial intelligence. The reason is simple: In the time of Bayes, we did not have the computers to solve the complicated equations that arise from Bayes’ theorem, and the amount of information available was limited. In the internet era, we now have both the information and the computing power to digest it. Moreover, by treating information as a probability distribution, models can be updated easily as new information becomes available.

I expect a revolution in the use of artificial intelligence in the chromatography area. There is a need to digest this enormous quantity of data, sort it out, and show it to the chromatographer in an efficient manner. However, this will require re-educating ourselves on how we define the concept of information.

References

(1) V.B. Di Marco and G.G. Bombi, J. Chromatogr. A 931, 1–30 (2001).
(2) A. de Juan and R. Tauler, Anal. Chim. Acta 500, 195–210 (2003).
(3) R. Bro, Chemom. Intell. Lab. Syst. 36, 149–171 (1997).
(4) M.H. Abraham, in Quantitative Treatments of Solute/Solvent Interactions, P. Politzer and J.S. Murray, Eds. (Elsevier, Amsterdam, The Netherlands, 1994), pp. 83–134.
(5) A.M. Siouffi and R. Phan-Tan-Luu, J. Chromatogr. A 892, 75–106 (2000).
(6) D.S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial (Oxford University Press, 2012).
(7) M. Woldegebriel and G. Vivó-Truyols, Anal. Chem. 88, 9843–9849 (2016).

Gabriel Vivó-Truyols is a principal scientist at Tecnometrix, in Ciutadella de Menorca, Spain, and a guest researcher at the Analytical Chemistry Group at the University of Amsterdam, in The Netherlands.
