April 14, 2026

AI/ML in Practice: Enhanced Identification Probability for Non-Targeted Workflows

Key Takeaways

  • Calibrant-free RTI prediction addresses retention-time nontransferability across chromatographic conditions, enabling method-independent confidence refinement beyond spectral similarity alone.
  • Dual RF regressors learn orthogonal RTI surrogates: RTI_MF from 790 fingerprints and RTI_CNL from empirical MS/MS neutral-loss patterns plus precursor mass.

Saer Samanipour of the University of Amsterdam, The Netherlands, used machine learning (ML)-predicted retention time indices to estimate true-match probabilities, boosting identification confidence when retention time calibrants are unavailable.

What was the rationale behind applying machine learning (ML) models to improve identification probability in the absence of retention time calibrants in your paper, "Machine Learning for Enhanced Identification Probability in RPLC/HRMS Nontargeted Workflows" (1)?
The primary rationale was to address the critical challenge of securing high chemical identification confidence (IC) in non-targeted analysis (NTA) (2–6) when retention time (RT) data are absent or non-transferable because of varying chromatographic conditions. The core solution was an approach that calculates the class probability of true positives (P(TP)) for individual spectral matches. It does so by leveraging calibrant-free predicted retention time indices (RTIs) derived from three ML models (two based on the quantitative structure-retention relationship [QSRR]) to enhance the overall identification probability (IP).

What is innovative about using predicted retention time indices (RTIs) and class probability of true positives (P(TPs)) compared to traditional spectral library matching methods in nontargeted analysis (NTA)?
The innovation lies in providing a calibrant-free, method-independent, and transferable approach to enhance identification confidence. Unlike traditional methods that rely solely on spectral similarity and often struggle with ambiguity from multiple reference spectra, this methodology integrates predicted RTIs to calculate a statistically robust P(TP) for each spectral match. By averaging the P(TP) for all reference spectra associated with a candidate compound, the overall identification probability (IP) is calculated, accounting for both spectral evidence and expected chromatographic behavior.
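The averaging step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `identification_probability`, the candidate names, and the P(TP) values are all hypothetical.

```python
from statistics import mean

def identification_probability(p_tp_by_candidate):
    """Average the per-spectrum P(TP) values of each candidate compound."""
    return {
        name: mean(values)
        for name, values in p_tp_by_candidate.items()
        if values  # skip candidates with no matched reference spectra
    }

# Hypothetical P(TP) values, one per reference spectrum matched to a candidate.
matches = {
    "candidate_A": [0.9, 0.8, 0.7],
    "candidate_B": [0.2, 0.4],
}

ip = identification_probability(matches)
for name, value in ip.items():
    print(f"{name}: IP = {value:.2f}")
```

A candidate supported by several high-probability reference spectra ends up with a higher IP than one whose matches are individually weak.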

Quantitatively, the ML-integrated approach significantly improved identification performance: the average IPs for pesticides increased by 54.5%, 52.1%, and 46.7% when spiked in blank, 10x diluted, and 100x diluted tea matrices, respectively, compared to solely using library matching.

What practical benefits can separation scientists gain from incorporating these machine learning models into routine workflows, especially when working with complex sample matrices or trace-level analytes?
The practical benefits for separation scientists include significantly higher chemical identification confidence (IC) and better recall for trace-level analytes, particularly in complex sample matrices. The ML-aided workflow demonstrated strong recall in the analysis of pesticides spiked in diluted black tea matrices. Furthermore, the calibrant-free nature of the predicted RTIs and the use of a harmonized scale are crucial for cross-laboratory data integration and harmonization, a key challenge in large-scale NTA studies. This is a critical benefit for non-targeted screening of trace contaminants such as pesticides and per- and polyfluoroalkyl substances (PFAS) in environmental samples.

How can machine learning models predict RTIs without the need for calibrants, and what are the benefits for chromatographic workflows?
The ML models predict RTIs without requiring in-run calibrants by employing the QSRR principle. This is achieved using two random forest (RF) regression models. The first predicts a structure-based RTI (RTI_MF) from molecular fingerprints (MFs), which are 2D chemical descriptors. The second predicts a fragmentation-based RTI (RTI_CNL) from empirical cumulative neutral loss (CNL) masses derived from tandem mass spectrometry (MS/MS) spectra. The key benefit is a calibrant-free, method-independent, and transferable harmonized RTI scale, which allows data from different chromatographic methods or laboratories to be integrated.

What role do molecular fingerprints (MF) and cumulative neutral losses (CNL) play in training ML models for RTI prediction in HRMS datasets?
MFs and CNLs serve as the essential input features for the two QSRR regression models. MFs are 2D chemical descriptors derived from the structure (SMILES) and are used in Model 1, which was trained on 4,713 calibrants, to predict the structure's expected RTI value (RTI_MF). Conversely, Model 2, which was trained on a set of 485,577 query MS/MS spectra, uses the masses of neutral losses derived from empirical MS/MS spectra to predict an actual RTI value (RTI_CNL). Crucially, the difference between these two predicted values, known as the RTI error, is a central feature used in the final Model 3 to classify individual spectral matches as a true positive (TP) or true negative (TN).

How does incorporating ML-derived RTI predictions improve the identification probability (IP) of compounds compared to traditional library-based spectral matching?
Incorporating ML-derived RTI predictions improves IP by providing a structure- and fragmentation-informed filter on ambiguous spectral matches. The ML workflow generates an RTI error by calculating the difference between the structure-derived RTI and the fragmentation-derived RTI. This error is a powerful feature in Model 3 (a k-nearest neighbors classification model): a smaller RTI error strongly correlates with a TP spectral match. By using Model 3 to compute the P(TP) and then averaging the P(TP) for all reference spectra, the overall IP is refined. This process effectively removes false positives by filtering out structural candidates whose predicted chromatographic behavior does not match their observed MS/MS fragmentation pattern, yielding more reliable and less ambiguous annotations.
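As a rough illustration of the RTI error, the sketch below compares one fragmentation-derived RTI (from the query spectrum) with the structure-derived RTIs of two candidate structures. The sign convention (structure-based minus fragmentation-based), the function name `rti_error`, and all numeric values are assumptions made for this example and are not taken from the paper.

```python
def rti_error(rti_mf, rti_cnl):
    """Signed difference between structure-based and fragmentation-based RTI.

    The sign convention is an assumption made for this sketch.
    """
    return rti_mf - rti_cnl

# Hypothetical fragmentation-based RTI predicted from one query MS/MS spectrum.
query_rti_cnl = 500.0

# Hypothetical structure-based RTIs for two candidate structures.
candidates = [
    {"name": "candidate_1", "rti_mf": 512.0},
    {"name": "candidate_2", "rti_mf": 230.0},
]

for c in candidates:
    c["rti_error"] = rti_error(c["rti_mf"], query_rti_cnl)

# A smaller absolute error means structure and spectrum agree better.
ranked = sorted(candidates, key=lambda c: abs(c["rti_error"]))
for c in ranked:
    print(c["name"], c["rti_error"])
```

Here the first candidate would be the more plausible match on chromatographic grounds, although in the workflow described the error is one input to the classifier rather than a standalone ranking.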

In what ways does the k-nearest neighbors (k-NN) algorithm contribute to reducing false positives/negatives in complex sample matrices like tea extracts?
The k-NN classifier is the final step of three sequential ML models. The first is a random forest (RF) regression model (Model 1) that predicts a structure-based RTI_MF on a harmonized scale using 790 preselected MFs derived from the compound's SMILES. The second (Model 2) is also an RF regression model, which predicts the RTI_CNL from the experimental MS/MS spectrum and the monoisotopic mass of the parent ion (MS1). The final model (Model 3) is a k-NN binary classification model, which calculates the P(TP) for an individual reference spectral match using the RTI error, the monoisotopic mass, and four parameters from the ULSA spectral matching algorithm.
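To make the idea of a k-NN class probability concrete, here is a minimal pure-Python sketch. It estimates P(TP) as the fraction of the k nearest labelled matches that are true positives, which is the usual uniform-weight k-NN probability. The value of k, the feature scaling, the Euclidean distance, and every number in the toy training set are assumptions for illustration only and do not reproduce the published model.

```python
import math

def knn_p_tp(query, training, k=3):
    """Fraction of the k nearest training matches that are labelled TP (1)."""
    nearest = sorted(training, key=lambda row: math.dist(query, row[0]))[:k]
    return sum(label for _, label in nearest) / k

# Hypothetical features per spectral match, pre-scaled to the 0-1 range:
# (RTI error, monoisotopic mass, four spectral-matching parameters).
training = [
    ((0.05, 0.30, 0.9, 0.8, 0.9, 0.7), 1),
    ((0.10, 0.35, 0.8, 0.9, 0.8, 0.8), 1),
    ((0.08, 0.32, 0.9, 0.9, 0.7, 0.9), 1),
    ((0.70, 0.30, 0.3, 0.2, 0.4, 0.3), 0),
    ((0.80, 0.40, 0.2, 0.3, 0.3, 0.2), 0),
    ((0.60, 0.33, 0.4, 0.3, 0.2, 0.4), 0),
]

# One match resembling the TP examples, one resembling the others.
p_good = knn_p_tp((0.07, 0.31, 0.85, 0.85, 0.8, 0.8), training)
p_bad = knn_p_tp((0.75, 0.35, 0.25, 0.25, 0.3, 0.3), training)
print(p_good, p_bad)
```

Because k-NN relies on distances, features on very different scales (such as mass versus a similarity score) would normally be scaled first; the toy values above are already on a common scale.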

How does class probability of true positives (P(TP)) enhance decision-making confidence in compound identification during nontargeted screening?
The P(TP) fundamentally shifts the basis of annotation confidence from a simple, often ambiguous, spectral matching score to a statistically rigorous probability of correctness. By calculating the P(TP) for each individual reference spectral match (using Model 3), the entire identification process is improved. The study then measures ambiguity in a candidate compound by calculating an average P(TP) for the compound hit. This final metric, the identification probability (IP), provides a robust and transferable confidence score that accounts for multiple reference spectra and helps address the inherent ambiguity in traditional spectral library searches. A decision threshold (e.g., P(TP) > 0.50) can be applied to retain or exclude a hit, providing a clear basis for decision-making.
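The thresholding step mentioned above could look like the following sketch. The 0.50 cutoff is the example value quoted in the text; the function name `retain_hits` and the hit probabilities are hypothetical.

```python
THRESHOLD = 0.50  # example cutoff quoted in the text

def retain_hits(p_tp_by_hit, threshold=THRESHOLD):
    """Keep only hits whose P(TP) is strictly above the threshold."""
    return {hit: p for hit, p in p_tp_by_hit.items() if p > threshold}

# Hypothetical P(TP) values for three spectral hits.
hits = {"hit_1": 0.82, "hit_2": 0.50, "hit_3": 0.12}

kept = retain_hits(hits)
print(kept)
```

With a strict "greater than" comparison, a hit sitting exactly at 0.50 is excluded. Whether the boundary should be inclusive is a choice the analyst would need to make.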

What are the key challenges in applying ML models to chromatographic data, and how does this study address them in terms of accuracy and reproducibility?
The key challenges in applying ML to chromatographic data are the limited transferability across various instrumental and chromatographic conditions and the need for structurally diverse training data. The study addresses the challenge of limited transferability by ensuring the models provide a calibrant-free, transferable RTI on a harmonized scale, which was achieved by training on three comparable RPLC RTI systems. For example, Model 2 achieved an R² of 0.8824 on the testing data set. To handle the need for structural diversity, Model 1 used stratification based on MFs rather than RTIs, which ensures the training accounts for chemical and structural diversity, resulting in a mean relative error (MRE) of 27.53% on untrained calibrants.
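For readers less familiar with the two metrics quoted above, the sketch below shows how a coefficient of determination and a mean relative error are commonly computed. The MRE formula used here (mean of |predicted − observed| / |observed|, in percent) is a common definition and is an assumption, since the interview does not state the exact formula. The RTI values are made up and do not come from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def mean_relative_error(observed, predicted):
    """MRE in percent (assumed definition); observed values must be non-zero."""
    errors = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical observed and predicted RTI values.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

r2 = r_squared(observed, predicted)
mre = mean_relative_error(observed, predicted)
print(f"R2 = {r2:.4f}, MRE = {mre:.2f}%")
```

The two metrics answer different questions: R² describes how much of the variance in the observed values the predictions capture, while MRE describes the typical size of an individual error relative to the true value.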

References
1. Ngan, H. L.; Turkina, V.; van Herwerden, D.; Yan, H.; Cai, Z.; Samanipour, S. Machine Learning for Enhanced Identification Probability in RPLC/HRMS Nontargeted Workflows. Anal. Chem. 2025, 97 (33), 18028–18035. DOI: https://doi.org/10.1021/acs.analchem.5c01873
2. Manz, K. E.; Feerick, A.; Braun, J. M.; Feng, Y. L.; Hall, A.; Koelmel, J.; Manzano, C.; Newton, S. R.; Pennell, K. D.; Place, B. J.; Godri Pollitt, K. J.; Prasse, C.; Young, J. A. Non-Targeted Analysis (NTA) and Suspect Screening Analysis (SSA): A Review of Examining the Chemical Exposome. J. Exposure Sci. Environ. Epidemiol. 2023, 33, 524–536. DOI: https://doi.org/10.1038/s41370-023-00574-6
3. Samanipour, S.; Barron, L. P.; van Herwerden, D.; Praetorius, A.; Thomas, K. V.; O’Brien, J. W. Exploring the Chemical Space of the Exposome: How Far Have We Gone? JACS Au 2024, 4 (7), 2412–2425. DOI: https://doi.org/10.1021/jacsau.4c00220
4. Guo, Z.; Zhu, Z.; Huang, S.; Wang, J. Non-Targeted Screening of Pesticides for Food Analysis Using Liquid Chromatography High-Resolution Mass Spectrometry—A Review. Food Addit. Contam., Part A 2020, 37 (7), 1180–1201. DOI: https://doi.org/10.1080/19440049.2020.1753890
5. Zweigle, J.; Bugsel, B.; Zwiener, C. FindPFΔS: Non-Target Screening for PFAS—Comprehensive Data Mining for MS2 Fragment Mass Differences. Anal. Chem. 2022, 94 (30), 10788–10796. DOI: https://doi.org/10.1021/acs.analchem.2c01521
6. Grunfeld, D. A.; Gilbert, D.; Hou, J.; Jones, A. M.; Lee, M. J.; Kibbey, T. C. G.; O’Carroll, D. M. Underestimated Burden of Per- and Polyfluoroalkyl Substances in Global Surface Waters and Groundwaters. Nat. Geosci. 2024, 17, 340–346. DOI: https://doi.org/10.1038/s41561-024-01402-8