
- November/December 2025
- Volume 2
- Issue 9
- Pages: 13–14
A Novel Machine Learning Method for Predicting Retention Time of Small Molecule Pharmaceutical Compounds Across Reversed-phase HPLC Columns
Key Takeaways
- Traditional QSRR models are limited to single-column predictions, hindering adaptability across diverse LC setups in pharmaceutical settings.
- The new ML-based approach predicts retention times using analyte structures, column selectivity descriptors, and mobile phase conditions, enhancing adaptability.
Jessica Lin and Zhenqi (Pete) Shi from Genentech describe a novel machine learning approach to predicting retention times for small molecule pharmaceutical compounds across reversed-phase high performance liquid chromatography (HPLC) columns.
What specific challenges or unmet needs in pharmaceutical chromatography inspired you to embark on this research project (1)?
The use of reversed-phase high performance liquid chromatography (RP-HPLC) in pharmaceutical chemistry, manufacturing, and controls (CMC) often involves the development of multiple methods in different laboratories on the same project for drug substance (DS) and drug product (DP), as well as re-development with process change, which requires frequent method bridging. Existing computational tools frequently fail to predict retention times (Rts) accurately across different stationary phases (SPs) and mobile phases (MPs) (2–7).
This presents a major bottleneck in streamlining method development and life cycle management, and in improving the efficiency of pharmaceutical CMC. We identified this unmet need and addressed it by creating a computational framework that could generalize retention time prediction across these diverse liquid chromatography (LC) setups, enabling method bridging and simplification of chromatographic workflows.
Why was it important to move beyond traditional single-column quantitative structure retention relationship (QSRR) models for retention time prediction, especially in the context of pharmaceutical method development?
Traditional QSRR models are often constrained to single-column retention predictions (8), limiting their utility in pharmaceutical settings where methods often require adaptation across SPs and MPs. Single-column QSRR models fail to account for the diverse nature of column selectivity and solvent interactions. Moving beyond them, so that predictions remain robust and generalizable across columns, reduces experimental burden and enables seamless adoption across laboratories and equipment, which makes it an important topic in pharmaceutical CMC.
What limitations have you encountered when using traditional QSRR models for predicting retention times, and how do you think this multi-column machine learning (ML-based) approach addresses those issues?
Traditional QSRR models (3,8) are trained on a database of selected analytes and stationary phases, making them highly limited and specific. They lack the flexibility to predict retention time on new SPs and MPs outside the training database, especially when ionization states of analytes vary with pH and when the new stationary phase is significantly different in nature compared to the one in the original model. Our approach uses SP selectivity descriptors (9) to take into consideration SPs of different properties and leverages a retrainable ML model framework. This allows it to predict retention times without requiring retention data on the new SPs and MPs, making it more adaptive to new LC conditions.
How does your approach differ from previously published retention time transfer models, and what makes it more generalizable across different columns and conditions?
Our approach employs advanced ML techniques to predict retention times solely based on analyte structures, column selectivity descriptors, and mobile phase conditions, rather than relying on existing Rt databases and correlation of retention time among those databases. This makes it highly adaptable across SPs and MPs. These features collectively make it more versatile and generalizable.
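As a rough illustration of the input structure described above, the sketch below combines analyte descriptors, column selectivity descriptors, and mobile phase conditions into one feature vector and passes it to a model. It is a minimal sketch under stated assumptions, not the authors' implementation: the interview does not specify the model or the descriptors, so the class names, the `build_features` and `predict_rt` helpers, every numeric value, and the nearest-neighbour regressor (a simple stand-in for the ML model) are hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Column:
    """A stationary phase described only by numeric selectivity descriptors."""
    name: str
    selectivity: tuple  # hypothetical values; real ones would come from a column database


@dataclass(frozen=True)
class MobilePhase:
    """Mobile phase conditions used as model inputs (illustrative choice)."""
    ph: float
    organic_fraction: float


def build_features(analyte_descriptors, column, mobile_phase):
    """Concatenate analyte, column, and mobile phase inputs into one vector."""
    return [
        *analyte_descriptors,
        *column.selectivity,
        mobile_phase.ph,
        mobile_phase.organic_fraction,
    ]


def predict_rt(query, training_rows, k=1):
    """Average the retention times of the k nearest training rows.

    A stand-in for a trained ML model. Features are left unscaled here to
    keep the sketch short; a real model would normalize them first.
    """
    ranked = sorted(training_rows, key=lambda row: dist(query, row[0]))
    nearest = ranked[:k]
    return sum(rt for _, rt in nearest) / len(nearest)


# Hypothetical training data: the same analyte measured on two columns.
mp = MobilePhase(ph=3.0, organic_fraction=0.4)
col_a = Column("hypothetical column A", (1.00, 0.00, -0.10, 0.00, 0.10))
col_b = Column("hypothetical column B", (0.70, 0.05, 0.20, 0.03, 0.40))
analyte = [2.5, 180.2]  # made-up analyte descriptors

training = [
    (build_features(analyte, col_a, mp), 6.2),  # made-up retention times (min)
    (build_features(analyte, col_b, mp), 4.1),
]

# A new column is represented by its descriptors alone, so no retention
# data measured on it is needed to form the query.
col_new = Column("hypothetical column C", (0.95, 0.01, -0.05, 0.00, 0.15))
estimate = predict_rt(build_features(analyte, col_new, mp), training, k=1)
```

The point of the design is that a new column enters the model as a set of descriptors rather than as measured retention data, which is what allows prediction on columns absent from the training set.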
This approach eliminates the need for pre-existing retention time data on the target column. How game-changing could that be for developing or transferring methods across different laboratories or phases of drug development?
Eliminating the need for pre-existing retention time data significantly reduces the resources and time required for method development or transfer. It removes the requirement to acquire data on the target column, thereby enabling predictions for a greater number of columns and improving the method development success rate.
In pharmaceutical CMC settings, method robustness and reproducibility are critical. How could this framework improve method transferability and impurity profile tracking across development stages?
The proposed framework allows for robust retention time predictions/transfer across multiple SPs and MPs. In CMC settings, this helps maintain consistency in identifying impurities regardless of method changes. The predictive power reduces discrepancies in impurity profiles, ensuring greater alignment between research/development findings and large-scale manufacturing. This reduces risks and improves analytical consistency across the development stages.
Do you foresee regulatory or quality-control challenges in adopting a predictive framework like this into validated analytical procedures?
We view this as a development tool for streamlined method development and for facilitating method transfer across different laboratories. It can indicate which columns are worth investigating and which are not, although its predictions will still need to be empirically verified. Thus, we do not foresee any regulatory or quality control challenges regarding the use of predicted retention time for any good manufacturing practice (GMP) and/or quality decision-making.
With column selectivity data being publicly accessible, for example, on hplccolumns.org, how valuable is that transparency to you in terms of selecting or switching columns in your laboratory?
Public access to selectivity data is invaluable. It democratizes access to critical chromatographic properties, allowing researchers to quickly evaluate and select columns best suited for their analytes. This transparency complements the ML-based approach, enabling informed decisions around method optimization/transfer, even in resource-constrained laboratories.
This model focuses on small-molecule pharmaceuticals. Do you think the framework could be adapted or extended to cover biopharmaceutical modalities like peptides or small polar metabolites?
Extending this framework to biopharmaceuticals is possible but would need additional customization to account for the unique physicochemical properties of peptides or small polar metabolites. Factors like secondary structure stability, hydrophobicity, and post-translational modifications would need to be incorporated into the ML model for peptides or metabolites.
What would make you or your laboratory adopt a predictive tR model such as this? Are there specific hurdles, for example, integration, regulatory validation, or data format, that need to be addressed?
A significant driver would be seamless software integration into existing lab workflows and chromatographic platforms. There are no regulatory concerns at the moment, given that the intended purpose of the ML model is for development. In addition, ensuring compatibility with diverse data formats used by various LC instruments would alleviate barriers to adoption and promote widespread use.
Do you have any predictions for the use of ML in practice in biopharmaceutical analysis?
The authors agree that machine learning holds immense potential in biopharmaceutical analysis, from accelerating purification development for complex biologics to optimizing impurity profiling in biotherapeutics. ML-driven tools could integrate with analytical quality-by-design (QbD) workflows, forecast product stability under stressed conditions, and even predict immunogenicity based on structural data. As the field matures, an improved understanding of how high-order structure impacts those molecular descriptors will be a critical linchpin for further advancement of ML for biopharmaceuticals.
References
(1) Shi, Z. Q.; Yi, Y. Y.; Madrigal, E.; et al. A Generalizable Methodology for Predicting Retention Time of Small Molecule Pharmaceutical Compounds Across Reversed-phase HPLC Columns. J. Chromatogr. A 2025, 1742, 465628. DOI: 10.1016/j.chroma.2024.465628
(2) Bouwmeester, R.; Martens, L.; Degroeve, S. Generalized Calibration Across Liquid Chromatography Setups for Generic Prediction of Small-Molecule Retention Times. Anal. Chem. 2020, 92 (9), 6571–6578. DOI: 10.1021/acs.analchem.0c00233
(3) Fedorova, E. S.; Matyushin, D. D.; Plyushchenko, I. V.; Stavrianidi, A. N.; Buryak, A. K. Deep Learning for Retention Time Prediction in Reversed-phase Liquid Chromatography. J. Chromatogr. A 2022, 1664, 462792. DOI: 10.1016/j.chroma.2021.462792
(4) Stanstrup, J.; Neumann, S.; Vrhovsek, U. PredRet: Prediction of Retention Time by Direct Mapping Between Multiple Chromatographic Systems. Anal. Chem. 2015, 87 (18), 9421–9428. DOI: 10.1021/acs.analchem.5b02287
(5) Wiczling, P.; Kamedulska, A. Comparison of Chromatographic Stationary Phases Using a Bayesian-Based Multilevel Model. Anal. Chem. 2024, 96 (3), 1310–1319. DOI: 10.1021/acs.analchem.3c04697
(6) Wiczling, P.; Kubik, L.; Kaliszan, R. Maximum A Posteriori Bayesian Estimation of Chromatographic Parameters by Limited Number of Experiments. Anal. Chem. 2015, 87 (14), 7241–7249. DOI: 10.1021/acs.analchem.5b01195
(7) Zhang, Y.; Liu, F.; Li, X. Q.; et al. Generic and Accurate Prediction of Retention Times in Liquid Chromatography by Post-projection Calibration. Commun. Chem. 2024, 7 (1), 54. DOI: 10.1038/s42004-024-01135-0
(8) Domingo-Almenara, X.; Guijas, C.; Billings, E.; et al. The METLIN Small Molecule Dataset for Machine Learning-based Retention Time Prediction. Nat. Commun. 2019, 10 (1), 5811. DOI: 10.1038/s41467-019-13680-7
(9) HPLC Columns. HPLC Columns Home Page. https://www.hplccolumns.org/
Jessica Lin earned her Ph.D. in analytical chemistry from the University of Michigan, Ann Arbor, USA, in 2013. She began her industry career as a CMC analytical scientist, gaining valuable experience at Amgen and Gilead before joining Genentech in 2016. At Genentech, her career has progressed through diverse roles, including serving as a CMC analytical lead for clinical projects and spearheading analytical methodology and technology development. Most recently, her work has focused on the intersection of science and technology, where she leads initiatives in CMC digitalization and data science.
Zhenqi (Pete) Shi graduated from Duquesne University in 2009 with a degree in pharmaceutical sciences. After a postdoc, he joined Lilly, where he championed real-time sensing and machine learning for process optimization across drug substance and product development. He led the team delivering real-time monitoring and RTRt for Verzenio, Lilly’s first continuous manufacturing process for drug products. In 2021, he joined Genentech as an analytical CMC lead, where he is also responsible for building an early-phase PAT program. So far, he has authored 50+ peer-reviewed articles and has been a key leader in PAT and ML-focused activities across professional symposia and non-profit organizations, such as ETC. In his spare time, Shi enjoys woodworking, camping, and family time.




