
LCGC Blog: Who Will Handle the Data? Training Data Wranglers in Analytical Chemistry
Key Takeaways
- High-dimensional GC×GC/LC×LC/HRMS workflows demand training in chemometrics and multivariate reasoning, because visual peak-centric evaluation cannot scale to thousands of features.
- A scaffolded unit integrated hypothesis testing and error concepts with preprocessing decisions such as baseline correction, normalization, and retention-time alignment before introducing PCA, HCA, DA, and feature selection.
In this month's LCGC Blog, Katelynn Perrault describes a new initiative to promote the importance and practice of chemometrics through an active-learning programme that provides hands-on experience with chemometric workflows using real analytical datasets.
Modern separation science is no longer limited by our ability to separate compounds. It is limited more often by our ability to interpret the data we generate. Techniques like GC×GC, LC×LC, and high-resolution mass spectrometry routinely produce datasets containing thousands of features, far exceeding what can be meaningfully evaluated by visual inspection alone. Yet, many students are still trained to recognize peaks rather than to interrogate data. As chemometrics and multivariate analysis become essential tools for extracting chemical insight, the question is no longer whether students should learn these approaches, but rather how we prepare them to use them effectively.
After I introduced chemometrics in my Advanced Analytical Chemistry course in 2025, the feedback was consistent and striking. Of all the topics we covered, students were most intrigued by chemometrics. Although they recognized its relevance, they felt disconnected from the material because they had no opportunity to apply it themselves within our lecture format. They didn’t just want to learn what these tools were; they wanted to use them. Given that chemistry is a discipline that lends itself to active learning, I decided it would be worthwhile to transform my lecture classroom into a data laboratory for one class period, where students could try to solve problems using chemometrics. The active learning experience I developed gave students a hands-on “playground” to explore chemometric strategies using real data.
As I began designing the activity, one of the biggest challenges was selecting tools that would be both accessible and effective for a diverse group of students. The class included primarily senior chemistry majors, along with a few juniors and Master’s students, all with varying levels of experience in coding and data analysis. While many powerful chemometric workflows rely on programming environments, I wanted to avoid creating barriers for students who had little or no coding background. I initially explored software platforms that offered free trials or free student licenses but quickly encountered a practical limitation: the laptops students rely on for everyday coursework are not typically configured for computationally intensive data processing. Rather than forcing a solution that might be cumbersome, I shifted toward browser-based tools that could run efficiently on standard machines. By providing chromatographic datasets and .csv outputs compatible with these platforms, I was able to create a more inclusive and seamless experience that allowed students to focus on learning the concepts rather than navigating technical hurdles.
To prepare students for this activity, I structured a four-class unit that progressively built from foundational statistical concepts to applied chemometric workflows. The first class focused on core statistical principles, including hypothesis testing, outlier analysis, t-tests, ANOVA, variance, normality, and statistical error. In the second class, I introduced chemometrics with an emphasis on data structure and preprocessing strategies, including defining analytical goals (targeted versus nontargeted), baseline correction, smoothing, peak detection, deconvolution, normalization, and retention time alignment. The third class shifted toward data processing and interpretation, covering both supervised and unsupervised approaches such as principal component analysis (PCA), hierarchical cluster analysis (HCA), discriminant analysis (DA), and feature selection strategies. These three lectures laid the conceptual groundwork, culminating in a fourth class period dedicated to an active learning activity where students could apply these concepts directly to real datasets.
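For readers who do work in code, the path from preprocessing to an unsupervised model described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the workflow used in class (the activity itself relied on browser-based tools). The peak table is synthetic, the function names are hypothetical, and total-area normalization followed by autoscaling is just one of several reasonable preprocessing choices.

```python
import numpy as np


def total_area_normalize(X):
    """Scale each sample (row) so that its peak areas sum to 1."""
    return X / X.sum(axis=1, keepdims=True)


def autoscale(X):
    """Mean-center each feature (column) and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def pca_scores(X, n_components=2):
    """Return PCA scores and explained-variance ratios via SVD.

    X is assumed to be column-centered already (see autoscale).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Synthetic stand-in for a peak table: 6 samples x 5 features, two groups.
# These numbers are invented for illustration and are not real data.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[10, 5, 1, 8, 3], scale=0.3, size=(3, 5))
group_b = rng.normal(loc=[4, 9, 6, 8, 3], scale=0.3, size=(3, 5))
peak_table = np.vstack([group_a, group_b])

normalized = total_area_normalize(peak_table)
X = autoscale(normalized)
scores, explained = pca_scores(X)

# Each row of `scores` is one sample; plotting column 0 against column 1
# gives the familiar PCA scores plot.
print(scores.round(2))
print(explained.round(3))
```

Swapping `autoscale` for mean-centering alone, or skipping the normalization step, changes the scores plot. That sensitivity to preprocessing is the kind of comparison the activity was designed to encourage.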
I designed a two-part activity built around two browser-based tools, which allowed students to engage directly with chromatographic data without requiring specialized software or computational resources. The activity was a guided, exploratory learning experience that balanced structure with flexibility. I created a page in Blackboard to serve as the activity guide. Step-by-step instructions provided students with an accessible entry point into key tasks, while open-ended prompts encouraged them to test different approaches, compare outcomes, and reflect on how their choices influenced results. By emphasizing experimentation rather than a single “correct” workflow, the activity helped students develop intuition for chemometric methods and a deeper understanding of how data processing decisions shape analytical conclusions.
For foundational data preprocessing, students used
Students then transitioned to
From my perspective, the activity was a clear success. Students remained engaged for the full 80-min class period, so much so that I ultimately had to ask them to stop working so the next class could come in. The idea of chemometrics as a “playground” resonated strongly, and this was reflected in positive feedback on course evaluations. I did not fully anticipate the degree to which the activity would become collaborative. Because students pursued different analytical pathways, testing various preprocessing and processing strategies, it naturally created opportunities for comparison and discussion. Students began asking one another what choices they made, why they made them, and what alternative approaches might reveal. These peer-to-peer exchanges led to thoughtful conversations about the impact of methodological decisions, and ultimately, deeper conceptual understanding. Observing this dynamic was particularly rewarding, as it underscored that students were engaging with the material in a more meaningful and self-directed way than would have been possible through lecture or instructor demonstration alone.
Reflecting on the experience, one of the most unexpected outcomes was how much I enjoyed the class itself. For the full 80 min, I was on my feet, circulating, asking students what they were trying, and having them walk me through their results. It was an energizing shift from a traditional lecture format. In many ways, it reminded me of my own introduction to multivariate analysis in graduate school: the long buildup of working through data, followed by that moment when an ordination plot finally reveals structure and meaning. Those moments of clarity, when patterns emerge and the data begin to “make sense”, are powerful. It was rewarding to see students experience a version of that in real time. Opportunities like this are often rare in content-heavy courses, where there is constant pressure to move quickly through material. This upper-level elective created space to explore alongside my students, to revisit concepts from a fresh perspective, and to engage with the material in a more dynamic way. It was a reminder that learning can be both rigorous and genuinely enjoyable for students and instructors alike.
I recognize that this is just one approach, and I would be very interested to hear how others are introducing chemometrics and data analysis in their own classrooms, especially at the undergraduate level where this content is not standard. I am happy to share additional materials or discuss how this activity might be adapted to different courses or student populations. At the same time, I would welcome the opportunity to learn from others, whether through new activity ideas, alternative teaching strategies, or recommendations for browser-based tools that can lower barriers to engaging with complex data. As our field continues to generate increasingly rich datasets, finding accessible and effective ways to teach data interpretation is a shared challenge, and one that benefits from collective input across the separation science community.
Acknowledgments
I would like to thank my colleagues Dr. Dwight Stoll (Gustavus Adolphus College), Dr. Pierre-Hugues Stefanuto (University of Liège), and Dr. Anais Rodrigues (LECO Corporation) for their support in helping me develop these activities by sharing their knowledge of the GustieChrom and MetaboAnalyst platforms. Thank you also to Cynthia Cheung, who originally collected the Essential Oil dataset at Chaminade University of Honolulu. This dataset can be freely shared with those who want example data to use in their classrooms for similar activities.
References
- Stoll, D. GustieChrom (Version 1.4). https://homepages.gac.edu/~dstoll/GustieChrom.html (accessed 2026-03-03).
- Pang, Z.; Lu, Y.; Zhou, G.; et al. MetaboAnalyst 6.0: Towards a Unified Platform for Metabolomics Data Processing, Analysis and Interpretation. Nucleic Acids Res. 2024, 52 (W1), W398–W406. DOI: 10.1093/nar/gkae253




