News|Articles|October 30, 2025

Predicting Retention Times and Peak Widths for Oligonucleotide Separations

Key Takeaways

  • The workflow integrates automation, scalability, and predictive modeling to address the complexity of oligonucleotide analysis, enhancing method development and impurity profiling.
  • It provides data-driven predictions of retention time, peak width, and resolution, reducing trial-and-error and supporting efficient method development.

Torgny Fornstedt, Jörgen Samuelsson, and Martin Enmark discuss a novel machine learning workflow for oligonucleotide analysis that is helping to enhance method development and impurity profiling efficiency.

A novel machine-learning (ML)-based workflow that integrates automation, scalability, and predictive modeling to meet the growing complexity of oligonucleotide analysis has been developed. By providing data-driven predictions of retention time, peak width, and resolution, this new approach offers chromatographers the potential to accelerate method development, improve impurity profiling, and readily adapt to new chemistries and separation conditions.

What was the main motivation behind developing a machine learning-based workflow for oligonucleotide chromatography (1)?

Torgny Fornstedt: The increasing analytical complexity of therapeutic oligonucleotides, particularly those with diverse modifications such as phosphorothioates, demanded a new strategy for handling the vast data sets required to characterize these compounds. Conventional, manual approaches were too limited and error-prone for this task. To address this, we developed a machine learning (ML)-based workflow that can systematically process thousands of chromatograms, automatically extracting high-quality retention and peak-width data. This enables prediction not only of retention times but also of peak widths, and thereby of peak resolution, even for unseen sequences. The overarching goal was to create a scalable and robust framework that supports efficient, reliable method development and better reflects the wide chemical diversity of therapeutic oligonucleotides.

What is innovative about this approach compared to traditional methods used to predict oligonucleotide retention times and peak widths?

Jörgen Samuelsson: The innovation lies in integrating rigorous data quality assurance with advanced machine learning into a single, semi-automatic workflow. Unlike traditional approaches that rely on manual peak assignment and limited data sets, this method systematically curates large-scale chromatographic data through automated quality checks and rule-based preprocessing. This minimizes false positives, ensures consistent retention time and peak width determination, and delivers high-quality training sets for model development. Coupling these curated data sets with benchmarking of multiple ML algorithms improves prediction accuracy—highlighting gradient boosting as especially effective—while also extending predictions to peak width and resolution, which have not previously been addressed in oligonucleotide analysis. Moreover, the workflow uncovers sequence-dependent trends, such as the influence of composition and terminal nucleobases, that are difficult to access with conventional methods, while providing a scalable framework for efficient method development.

In what practical ways could this workflow help chromatographers in method development or routine analysis of oligonucleotides?

Martin Enmark: The workflow reduces trial-and-error in the laboratory by predicting retention and resolution in silico, allowing chromatographers to assess in advance whether gradients will separate key impurities. Rule-based preprocessing ensures consistent data handling and more reliable peak identification. The approach also scales seamlessly with quality control (QC) pipelines handling thousands of chromatograms, where the massive amount of data further strengthens the predictive models. In the end, this saves time, reduces workload, and supports more robust method development and routine analysis.
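
As a minimal sketch of how a predicted retention time and peak width for each of two species translate into a resolution estimate, the Python snippet below applies the standard half-height resolution formula, Rs = 1.18 (t2 − t1)/(w1 + w2). The function name and the example numbers are illustrative assumptions, not code or values from the study. The 1.5 cutoff is the common rule of thumb for baseline separation of roughly Gaussian peaks.

```python
def resolution_half_height(t1, w1, t2, w2):
    """Resolution between two peaks from retention times and half-height widths.

    Uses Rs = 1.18 * |t2 - t1| / (w1 + w2); all inputs must share one time unit.
    """
    if w1 <= 0 or w2 <= 0:
        raise ValueError("peak widths must be positive")
    return 1.18 * abs(t2 - t1) / (w1 + w2)


# Hypothetical predicted values (minutes) for a product and a nearby impurity.
rs = resolution_half_height(t1=10.0, w1=0.10, t2=10.3, w2=0.12)
print(f"predicted Rs = {rs:.2f}")
print("baseline separated" if rs >= 1.5 else "risk of coelution")
```

Running such a check over predicted values for several candidate gradients would give a quick ranking of conditions before any injection is made.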

How does this research address common challenges faced when analyzing modified oligonucleotides, such as phosphorothioated variants, in complex samples?

TF: To effectively model oligonucleotide separation behavior, we need data that capture the relevant modifications as well as descriptors in the machine learning model that represent these modifications. In this study, we introduced descriptors to capture partially phosphorothioated sequences as well as phosphodiester (P=O) impurities. If other modifications are of interest, the training data must include them and the model descriptors must be adjusted accordingly.

Phosphorothioate (PS)-rich oligonucleotides generate many impurities and broad, heterogeneous peaks. Our workflow predicts in advance where coelution is likely, allowing chromatographers to adjust conditions proactively. By including PS sequences in the training sets, we ensure that their distinct behavior is captured. The workflow also highlights limitations under shallow gradients, where complementary methods such as mass spectrometry (MS) detection can be helpful.

How adaptable is the proposed workflow to other ion-pair reagents or stationary phases beyond tributylamine and C18 columns?

JS: Because the workflow is data-driven, it can be retrained for other ion-pair reagents or stationary phases as soon as suitable data sets are available, and earlier studies have already demonstrated this flexibility. While some rule-based components may need adjustment, the overall framework is broadly transferable. Moreover, machine learning can reveal new retention patterns in systems less dominated by size and charge effects. A limitation is that weaker ion-pair systems may struggle with partial separation of diastereomers, particularly in highly phosphorothioated oligonucleotides.

What level of retention time reproducibility and signal-to-noise ratio (S/N) is required for the rule-based data acquisition method to be effective?

ME: For the rule-based method to be effective, retention times should be reproducible to within a few seconds over the full run, and the signal-to-noise ratio (S/N) must be sufficient for reliable peak detection and width determination. Higher S/N directly improves model robustness. For low-abundance peaks such as phosphodiester (P=O) impurities, MS in selected ion monitoring (SIM) mode can boost S/N, though at the expense of broader impurity coverage. Under these conditions, the workflow delivers consistent and scalable results.
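
An acceptance rule of this kind can be sketched as a simple check. In the Python example below, the function name, the 3 s retention-time tolerance, and the S/N threshold of 10 are hypothetical placeholders chosen for illustration; the interview does not state the actual rules or thresholds used in the study. S/N is taken here as peak height divided by the standard deviation of a peak-free baseline segment, which is one common convention among several.

```python
import statistics


def passes_quality_rules(retention_times_s, peak_height, baseline_noise,
                         max_rt_spread_s=3.0, min_snr=10.0):
    """Accept a peak only if replicate retention times agree and S/N is adequate.

    retention_times_s: retention times (seconds) of the same peak in replicate runs.
    baseline_noise: detector readings from a peak-free baseline segment.
    The two thresholds are illustrative assumptions, not published values.
    """
    rt_spread = max(retention_times_s) - min(retention_times_s)
    noise_sd = statistics.stdev(baseline_noise)
    snr = peak_height / noise_sd if noise_sd > 0 else float("inf")
    return rt_spread <= max_rt_spread_s and snr >= min_snr


# Hypothetical replicate data for one peak.
noise = [0.1, -0.2, 0.15, -0.05, 0.0, 0.1, -0.1]
print(passes_quality_rules([612.4, 613.1, 612.9], peak_height=5.0, baseline_noise=noise))
print(passes_quality_rules([612.4, 619.0, 612.9], peak_height=5.0, baseline_noise=noise))
```

The first call has tightly grouped retention times and passes; the second includes a replicate several seconds off and is rejected.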

How does the machine learning model handle coeluting oligonucleotides, and how accurate is the peak deconvolution using multiple Gaussian probability density functions (PDFs) in complex mixtures?

TF: Coelution is addressed through systematic preprocessing and peak deconvolution. Elution profiles are fitted to PDFs, and overlapping peaks are resolved using multiple Gaussian models, with the number of components determined by an F-test. Applied to data sets of nearly 900 sequences per gradient, this approach yielded reproducible retention times and peak widths even in complex mixtures, providing a solid basis for reliable resolution predictions.
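
The model-selection step can be illustrated with a short Python sketch, assuming each Gaussian contributes three fitted parameters (area, center, and width) and that the comparison is the usual extra-sum-of-squares F-test for nested least-squares fits. The function names and the residual values in the usage lines are hypothetical; the fitting itself would be done by a nonlinear least-squares routine and is not shown.

```python
import math


def gaussian_sum(t, components):
    """Evaluate a sum of Gaussian peaks at time t.

    components: list of (area, center, sigma) tuples, one per peak.
    """
    return sum(
        area / (sigma * math.sqrt(2.0 * math.pi))
        * math.exp(-0.5 * ((t - center) / sigma) ** 2)
        for area, center, sigma in components
    )


def residual_sum_of_squares(times, signal, components):
    """Sum of squared differences between the measured signal and the model."""
    return sum((y - gaussian_sum(t, components)) ** 2
               for t, y in zip(times, signal))


def f_statistic(rss_simple, rss_complex, n_points, n_components_simple,
                n_components_complex, params_per_peak=3):
    """Extra-sum-of-squares F statistic for two nested Gaussian-sum fits."""
    p_simple = n_components_simple * params_per_peak
    p_complex = n_components_complex * params_per_peak
    df_num = p_complex - p_simple      # parameters added by the larger model
    df_den = n_points - p_complex      # residual degrees of freedom
    if df_num <= 0 or df_den <= 0:
        raise ValueError("need added parameters and positive residual df")
    f_value = ((rss_simple - rss_complex) / df_num) / (rss_complex / df_den)
    return f_value, df_num, df_den


# Hypothetical residuals from fitting one vs. two Gaussians to 100 data points.
f_value, df_num, df_den = f_statistic(10.0, 4.0, n_points=100,
                                      n_components_simple=1,
                                      n_components_complex=2)
print(f"F = {f_value:.1f} on ({df_num}, {df_den}) degrees of freedom")
```

The additional component would be kept only if the F value exceeds the critical value of the F distribution for those degrees of freedom at the chosen significance level; otherwise the simpler model is retained.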

Given the lower prediction accuracy for P=O sequences under shallow gradients, how would you recommend optimizing gradient conditions to support model accuracy for such oligonucleotide types?

JS: P=O impurities often follow complex retention patterns, making them harder to predict. Better data quality is key. Collecting additional data and training dedicated models for P=O sequences improves performance, while MS SIM mode can boost S/N for weak signals. Together these strategies enhance reproducibility and strengthen predictions under challenging conditions such as shallow gradients.

Can the workflow integrate mass spectrometry data, for example, in SIM mode for improved peak detection and resolution modeling in real-time or near-real-time applications?

ME: Yes. MS data, especially in SIM mode, can greatly enhance sensitivity and selectivity. Their integration improves peak detection and width determination, strengthens resolution predictions, and enables near-real-time monitoring in QC or process development. Since the workflow is semi-automatic and scalable, incorporating MS data is straightforward.

How scalable is the semi-automatic data extraction and model training process when dealing with tens of thousands of chromatograms, such as in QC or development pipelines?

TF: The workflow is built for scalability. Rule-based preprocessing and efficient ML algorithms such as gradient boosting enable handling of tens of thousands of chromatograms, well beyond the ~900 sequences per gradient already demonstrated. A future goal is to extend the system to account for column performance variations and even column switching, to further enhance robustness.

What specific sequence features or modifications (phosphorothioation, length, guanine–cytosine [GC] content) have the highest impact on retention time predictions, and are these interpretable from the model?

ME: Key factors include phosphorothioation vs. P=O modifications, overall sequence length, and GC content, with higher GC often linked to broader peaks. Terminal motifs, such as cytosine at the ends, can also strongly influence retention. These features are captured and interpretable in the models, providing chromatographers with valuable insight into why specific sequences behave as they do.
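
To make these features concrete, the following Python sketch computes a few simple descriptors from a sequence string. The notation is an assumption made for illustration: bases are written 5' to 3', and an asterisk after a base marks a phosphorothioate linkage to the next base. The descriptor names are hypothetical, and the descriptors used in the published models may differ.

```python
def sequence_descriptors(seq):
    """Compute simple descriptors from an oligonucleotide string.

    Assumed notation: bases A, C, G, T/U written 5' to 3', with '*' after a
    base marking a phosphorothioate linkage to the next base; every other
    linkage is treated as phosphodiester (P=O).
    """
    bases = [ch for ch in seq.upper() if ch in "ACGTU"]
    if len(bases) < 2:
        raise ValueError("need at least two nucleotides")
    n_linkages = len(bases) - 1
    n_ps = seq.count("*")
    return {
        "length": len(bases),
        "gc_fraction": sum(b in "GC" for b in bases) / len(bases),
        "ps_fraction": n_ps / n_linkages,
        "five_prime_base": bases[0],
        "three_prime_base": bases[-1],
    }


# Hypothetical partially phosphorothioated 11-mer.
descriptors = sequence_descriptors("C*G*T*ACGTAC*G*T")
print(descriptors)
```

A table of such descriptors, one row per sequence, is the kind of input a tree-based model such as gradient boosting can use, and its feature importances can then be inspected to see which descriptors drive the predictions.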

For impurity profiling, how reliable are the model-generated predictions of peak widths and resolutions under user-defined gradients, especially for non-standard or proprietary oligonucleotide formats?

JS: Retention times are predicted with high accuracy, while peak width and resolution remain more challenging and are still under development. For non-standard or proprietary formats, accuracy depends on similarity to the training data; retraining improves performance, but even without it, the models provide useful guidance for ranking conditions and optimizing methods.

Reference

(1) Samuelsson, J.; Enmark, M.; Szabados, G.; et al. Improved Workflow for Constructing Machine Learning Models: Predicting Retention Times and Peak Widths in Oligonucleotide Separation. J. Chromatogr. A 2025, 1747, 465746. DOI: 10.1016/j.chroma.2025.465746

Torgny Fornstedt is a professor of analytical chemistry at Karlstad University, Sweden, and head of the Fundamental Separation Science Group. His research focuses on the fundamentals of liquid chromatography, mechanistic adsorption studies, and predictive modeling, with particular emphasis on biopharmaceuticals and therapeutic oligonucleotides. He has authored almost 200 peer-reviewed publications and book chapters, serves on several editorial boards, and is an invited speaker at international symposia. His work bridges fundamental science with industrial applications.

Jörgen Samuelsson is an associate professor of analytical chemistry at Karlstad University. His research interests include chromatographic theory, adsorption modeling, and the integration of machine learning into separation science. He has published widely on ion-pair chromatography of oligonucleotides and peptides, with a particular focus on retention modeling, resolution prediction, and mechanistic method development. He has co-developed new computational workflows for predicting oligonucleotide separations and collaborates with both academia and industry on advanced biopharmaceutical analysis.

Martin Enmark is a researcher in analytical chemistry at Karlstad University, with particular expertise and interest in industry–academia collaborations. His recent research focuses on data-driven and computational approaches to chromatography, including machine learning models for oligonucleotide separations. He has contributed to the development of novel semi-automated workflows for processing large-scale chromatographic data sets, enabling predictive method development and impurity profiling. Enmark has co-authored several influential publications advancing both the theory and practice of liquid chromatography, and continues to explore innovative approaches to preparative and analytical separations.

Fundamental Separation Science Group: www.FSSG.se
