June 2, 2026

LCGC Blog: Who Will Handle the Data? Training Data Wranglers in Analytical Chemistry

Key Takeaways

High-dimensional GC×GC/LC×LC/HRMS workflows demand training in chemometrics and multivariate reasoning, because visual peak-centric evaluation cannot scale to thousands of features.
A scaffolded unit integrated hypothesis testing and error concepts with preprocessing decisions such as baseline correction, normalization, and retention-time alignment before introducing PCA, HCA, DA, and feature selection.
Browser-based platforms reduced computational and coding barriers across heterogeneous student backgrounds, enabling focus on analytical intent, data structure, and consequences of processing choices.
GustieChrom exercises used HPLC caffeine data to show how integration boundaries, blank subtraction, and calibration design propagate into quantitative bias and uncertainty.
MetaboAnalyst exploration with GC×GC–MS essential oil peak tables linked filtering, transformation, and scaling to distribution changes and downstream clustering/ordination, stimulating peer comparison of alternative pipelines.

In this month's LCGC Blog, Katelynn Perrault describes a new initiative to promote the importance and practice of chemometrics through an active-learning programme that provides hands-on experience with chemometric workflows using real analytical datasets.

Modern separation science is no longer limited by our ability to separate compounds. It is limited more often by our ability to interpret the data we generate. Techniques like GC×GC, LC×LC, and high-resolution mass spectrometry routinely produce datasets containing thousands of features, far exceeding what can be meaningfully evaluated by visual inspection alone. Yet, many students are still trained to recognize peaks rather than to interrogate data. As chemometrics and multivariate analysis become essential tools for extracting chemical insight, the question is no longer whether students should learn these approaches, but rather how we prepare them to use them effectively.

After I introduced chemometrics in my Advanced Analytical Chemistry course in 2025, the feedback was consistent and striking. Of all the topics we covered, students were most intrigued by chemometrics. While they recognized its relevance, they felt disconnected from the material because they had no opportunity to apply it themselves within our lecture format. They didn’t just want to learn what these tools were; they wanted to use them. Given that chemistry is inherently a discipline that lends itself to active learning, I decided it would be worthwhile to transform my lecture classroom into a data laboratory for a class period, where students could attempt to solve problems using chemometrics. I developed an active learning experience that would give students a hands-on “playground” to explore chemometric strategies using real data.

As I began designing the activity, one of the biggest challenges was selecting tools that would be both accessible and effective for a diverse group of students. The class included primarily senior chemistry majors, along with a few juniors and Master’s students, all with varying levels of experience in coding and data analysis. While many powerful chemometric workflows rely on programming environments, I wanted to avoid creating barriers for students who had little or no coding background. I initially explored software platforms that offered free trials or free student licenses but quickly encountered a practical limitation: the laptops students rely on for everyday coursework are not typically configured for computationally intensive data processing. Rather than forcing a solution that might be cumbersome, I shifted toward browser-based tools that could run efficiently on standard machines. By providing chromatographic datasets and .csv outputs compatible with these platforms, I was able to create a more inclusive and seamless experience that allowed students to focus on learning the concepts rather than navigating technical hurdles.

To prepare students for this activity, I structured a four-class unit that progressively built from foundational statistical concepts to applied chemometric workflows. The first class focused on core statistical principles, including hypothesis testing, outlier analysis, t-tests, ANOVA, variance, normality, and statistical error. In the second class, I introduced chemometrics with an emphasis on data structure and preprocessing strategies, including defining analytical goals (targeted versus nontargeted), baseline correction, smoothing, peak detection, deconvolution, normalization, and retention time alignment. The third class shifted toward data processing and interpretation, covering both supervised and unsupervised approaches such as principal component analysis (PCA), hierarchical cluster analysis (HCA), discriminant analysis (DA), and feature selection strategies. These three lectures laid the conceptual groundwork, culminating in a fourth class period dedicated to an active learning activity where students could apply these concepts directly to real datasets.

I selected a two-part activity that employed two browser-based tools. The tools allowed students to engage directly with chromatographic data without requiring specialized software or computational resources. The activity was designed as a guided, exploratory learning experience that balanced structure with flexibility. I created a page in Blackboard that would be the activity guide. Step-by-step instructions provided students with an accessible entry point into key tasks, while open-ended prompts encouraged them to test different approaches, compare outcomes, and reflect on how their choices influenced results. By emphasizing experimentation rather than a single “correct” workflow, the activity helped students develop intuition for chemometric methods and a deeper understanding of how data processing decisions shape analytical conclusions.

Active Chemometrics Practice File - Download Here

Essential Oils Data Set - Download Here

For foundational data preprocessing, students used GustieChrom¹, a lightweight, web-based platform for visualizing chromatograms and performing tasks such as peak picking and integration. GustieChrom was developed by Prof. Dwight Stoll at Gustavus Adolphus College and was released in 2026 (Version 1.4). This provided an intuitive entry point into how data processing decisions directly influence quantitative outcomes. In the activity, students uploaded .csv files of single-peak caffeine chromatograms from HPLC. They performed peak integration in the platform and constructed a calibration curve in Excel to determine the concentration of an unknown sample. They then repeated the activity by (1) choosing “poor” integration boundaries, (2) applying a blank subtraction, and (3) adding additional standards to their calibration curve. I think they genuinely enjoyed trying to pick “poor” peak integration boundaries; it is rare that instructors ask them to do their worst at a task. These steps allowed students to explore how different strategies impacted quantitative results.
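The calibration exercise can be sketched in a few lines of Python. This is a minimal illustration rather than the class workflow: the concentrations and peak areas are invented (they are not the caffeine dataset), the 10% area loss assigned to a “poor” integration is an assumption, and the function names `fit_line` and `quantify` are hypothetical. The sketch fits an ordinary least-squares line to the standards, inverts it for the unknown, and then repeats the calculation with a blank subtraction and with a clipped peak area.

```python
from statistics import mean

# Hypothetical caffeine standards: concentration (mg/L) and integrated peak area.
# These numbers are invented for illustration; they are not the class dataset.
STANDARDS_CONC = [5.0, 10.0, 20.0, 40.0]
STANDARDS_AREA = [52.0, 101.0, 203.0, 398.0]
BLANK_AREA = 2.0      # assumed area integrated from a blank injection
UNKNOWN_AREA = 150.0  # assumed area of the unknown with sensible boundaries


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def quantify(unknown_area, conc, areas, blank=0.0):
    """Subtract the blank from every area, fit the line, and invert it."""
    corrected = [a - blank for a in areas]
    slope, intercept = fit_line(conc, corrected)
    return (unknown_area - blank - intercept) / slope


good = quantify(UNKNOWN_AREA, STANDARDS_CONC, STANDARDS_AREA)
blanked = quantify(UNKNOWN_AREA, STANDARDS_CONC, STANDARDS_AREA, blank=BLANK_AREA)
# "Poor" boundaries that clip the peak tails lose area (assumed 10% here).
poor = quantify(UNKNOWN_AREA * 0.90, STANDARDS_CONC, STANDARDS_AREA)

print(f"sensible boundaries: {good:.2f} mg/L")
print(f"blank subtracted:    {blanked:.2f} mg/L")
print(f"clipped integration: {poor:.2f} mg/L")
```

In this sketch, a constant blank subtracted from every injection is absorbed by the fitted intercept, so it leaves the result unchanged. It would matter if the blank were applied inconsistently or the line were forced through the origin. The clipped integration of the unknown alone biases the reported concentration low.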

Students then transitioned to MetaboAnalyst,² an established, freely available platform that supports data normalization, statistical analysis, and visualization tools such as PCA and hierarchical clustering. Students worked with a GC×GC–MS dataset of sandalwood essential oils, chosen because it is complex yet comes from a sample type that is somewhat familiar from everyday life. I suspect that datasets from food, beverages, personal care products, or other everyday products would spark similar enthusiasm. After uploading a preprocessed .csv peak table into the analysis platform, students first evaluated data integrity and compatibility, reinforcing the importance of proper data structure. They then explored a range of preprocessing strategies, including filtering, normalization, transformation, and scaling. A particularly impactful feature of the platform allowed students to visualize these choices in real time through box-and-whisker plots for each analyte, directly illustrating how different approaches reshape data distributions and influence downstream analysis. Building on these decisions, students generated unsupervised models, including principal component analysis (PCA) and hierarchical cluster analysis (HCA), to identify patterns and draw conclusions about similarities and differences among the essential oil samples. To further encourage exploration, students were prompted to test additional statistical tools beyond those covered in class and to compare their findings using example datasets within the platform. MetaboAnalyst provides a range of ¹H NMR, LC–MS, and GC–MS data that can be used openly, reinforcing transferable data analysis skills.
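To make the effect of transformation and scaling concrete, here is a minimal sketch using only the Python standard library. It is not MetaboAnalyst's implementation: the peak table is invented (it is not the essential oil dataset), the sample names and the helper functions `log_transform`, `autoscale`, and `nearest` are hypothetical, and log transformation followed by autoscaling is just one common pair of preprocessing choices. The sketch asks which sample lies closest to `oil_A1` by Euclidean distance, the kind of distance that underlies many clustering methods, before and after preprocessing.

```python
import math
from statistics import mean, stdev

# Hypothetical peak table: each row is a sample, each column an analyte peak area.
# Values are invented for illustration; they are not the essential oil dataset.
PEAK_TABLE = {
    "oil_A1": [120000.0, 850.0, 40.0],
    "oil_A2": [118000.0, 900.0, 42.0],
    "oil_B1": [121000.0, 300.0, 95.0],
    "oil_B2": [119000.0, 320.0, 90.0],
}


def log_transform(table):
    """Apply log10 to every peak area to compress the dynamic range."""
    return {s: [math.log10(v) for v in row] for s, row in table.items()}


def autoscale(table):
    """Mean-center each analyte and divide by its sample standard deviation."""
    columns = list(zip(*table.values()))
    centers = [mean(c) for c in columns]
    spreads = [stdev(c) for c in columns]
    return {
        s: [(v - m) / sd for v, m, sd in zip(row, centers, spreads)]
        for s, row in table.items()
    }


def nearest(table, sample):
    """Return the other sample with the smallest Euclidean distance."""
    others = (s for s in table if s != sample)
    return min(others, key=lambda s: math.dist(table[s], table[sample]))


raw_neighbor = nearest(PEAK_TABLE, "oil_A1")
scaled_neighbor = nearest(autoscale(log_transform(PEAK_TABLE)), "oil_A1")
print("nearest to oil_A1 using raw areas:      ", raw_neighbor)
print("nearest to oil_A1 after log + autoscale:", scaled_neighbor)
```

With the raw areas, the most abundant analyte dominates the distance, so `oil_A1` pairs with `oil_B2`. After log transformation and autoscaling, the two low-abundance analytes carry equal weight and `oil_A1` pairs with `oil_A2`. The same shift in weighting is what changes PCA and HCA results when students alter the preprocessing pipeline.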

From my perspective, the activity was a clear success. Students remained engaged for the full 80-min class period, so much so that I ultimately had to ask them to stop working so the next class could come in. The idea of chemometrics as a “playground” resonated strongly, and this was reflected in positive feedback on course evaluations. I did not fully anticipate the degree to which the activity would become collaborative. Because students pursued different analytical pathways, testing various preprocessing and processing strategies, it naturally created opportunities for comparison and discussion. Students began asking one another what choices they made, why they made them, and what alternative approaches might reveal. These peer-to-peer exchanges led to thoughtful conversations about the impact of methodological decisions, and ultimately, deeper conceptual understanding. Observing this dynamic was particularly rewarding, as it underscored that students were engaging with the material in a more meaningful and self-directed way than would have been possible through lecture or instructor demonstration alone.

Reflecting on the experience, one of the most unexpected outcomes was how much I enjoyed the class itself. For the full 80 min, I was on my feet, circulating, asking students what they were trying, and having them walk me through their results. It was an energizing shift from a traditional lecture format. In many ways, it reminded me of my own introduction to multivariate analysis in graduate school: the long buildup of working through data, followed by that moment when an ordination plot finally reveals structure and meaning. Those moments of clarity, when patterns emerge and the data begin to “make sense”, are powerful. It was rewarding to see students experience a version of that in real time. Opportunities like this are often rare in content-heavy courses, where there is constant pressure to move quickly through material. This upper-level elective created space to explore alongside my students, to revisit concepts from a fresh perspective, and to engage with the material in a more dynamic way. It was a reminder that learning can be both rigorous and genuinely enjoyable for students and instructors alike.

I recognize that this is just one approach, and I would be very interested to hear how others are introducing chemometrics and data analysis in their own classrooms, especially at the undergraduate level where this content is not standard. I am happy to share additional materials or discuss how this activity might be adapted to different courses or student populations. At the same time, I would welcome the opportunity to learn from others, whether through new activity ideas, alternative teaching strategies, or recommendations for browser-based tools that can lower barriers to engaging with complex data. As our field continues to generate increasingly rich datasets, finding accessible and effective ways to teach data interpretation is a shared challenge, and one that benefits from collective input across the separation science community.

Acknowledgments

I would like to thank my colleagues Dr. Dwight Stoll (Gustavus Adolphus College), Dr. Pierre-Hugues Stefanuto (University of Liège), and Dr. Anais Rodrigues (LECO Corporation) for their support in helping me develop these activities by sharing their knowledge of the GustieChrom and MetaboAnalyst platforms. Thank you also to Cynthia Cheung, who originally collected the essential oil dataset at Chaminade University of Honolulu. This dataset can be freely shared with those who want example data to use in their classrooms for similar activities.

References

1. Stoll, D. GustieChrom (Version 1.4). https://homepages.gac.edu/~dstoll/GustieChrom.html (accessed 2026-03-03).
2. Pang, Z.; Lu, Y.; Zhou, G.; et al. MetaboAnalyst 6.0: Towards a Unified Platform for Metabolomics Data Processing, Analysis and Interpretation. Nucleic Acids Res. 2024, 52 (W1), W398–W406. DOI: 10.1093/nar/gkae253
