News|Articles|September 23, 2025

Developing Deep Learning Models to Identify More Peaks in GC–MS Chromatograms

Author: Will Wetzel

Key Takeaways

  • MASSISTANT predicts molecular structures from GC-EI-MS spectra using SELFIES encoding, overcoming traditional database limitations.
  • SELFIES encoding ensures valid molecular predictions, unlike SMILES, enhancing de novo structure prediction accuracy.

A recent study introduced MASSISTANT, a deep learning model designed to predict de novo molecular structures directly from low-resolution gas chromatography with electron ionization mass spectrometry (GC-EI-MS) spectra using self-referencing embedded strings (SELFIES) encoding (1). Traditional EI-MS interpretation relies on database matching or expert-driven analysis, which is time-consuming and often inconclusive for unknown compounds. MASSISTANT addresses this challenge by learning fragmentation patterns to generate chemically valid structures (1).

LCGC International spoke to the author of the study, John Mommers, who is a scientist of chromatography and mass spectrometry (MS) at Envalior, about the key challenges faced when developing MASSISTANT and how it can be integrated in GC-EI-MS workflows.

What key challenges in interpreting EI-MS spectra motivated the development of the deep learning model (MASSISTANT), and how does it differ from traditional database-dependent modeling approaches?

The main reason we started developing a deep learning model to predict molecular structures from electron impact mass spectra was to identify more peaks in our GC–MS chromatograms. For peak identification, we mainly rely on large databases containing over one million spectra of known molecules. However, many peaks may still remain unknown because they show no match in the database. Especially for regulatory cases, such as food contact materials, peak identification is of critical importance. Manual interpretation of spectra is often difficult, time-consuming, requires expert (GC–MS and product) knowledge, and does not always result in a positive identification. Our model now assists in the interpretation of unknown peaks, speeding up the identification process and enabling the identification of more peaks.

Why did you choose SELFIES encoding as the molecular representation for this model, and how does it improve structure prediction compared to other encodings like simplified molecular input line entry system (SMILES)?

We chose SELFIES because it guarantees syntactically valid molecules, something that SMILES cannot ensure in generative modeling. When a model outputs SMILES, even small prediction errors can produce invalid or chemically impossible structures, so the validity of de novo predictions must be checked afterwards. By contrast, SELFIES are 100% robust: every SELFIES sequence corresponds to a valid chemical structure. This makes SELFIES well-suited for de novo structure prediction directly from spectra.
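The validity guarantee can be illustrated with a toy sketch. This is not the actual SELFIES grammar (the open-source `selfies` Python package implements the real encoder and decoder); it only shows the underlying idea: if the decoder clamps every requested bond order to the remaining valence of the previous atom, then any token sequence whatsoever decodes to a valence-valid structure.

```python
# Toy sketch of SELFIES-style robustness (not the real SELFIES grammar):
# each token requests an atom plus a bond order, and the decoder clamps
# the bond order to the remaining valence of the previous atom, so every
# token sequence decodes to a valence-valid chain.

VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def decode(tokens):
    """Decode (atom_symbol, requested_bond_order) tokens into a chain."""
    atoms, bonds, free = [], [], []   # free = remaining valence per atom
    for symbol, order in tokens:
        if atoms:
            # Clamp the requested bond so neither atom is overbonded.
            order = min(order, free[-1], VALENCE[symbol])
            if order == 0:            # previous atom saturated: stop early
                break
            free[-1] -= order
            bonds.append(order)
        atoms.append(symbol)
        free.append(VALENCE[symbol] - (order if bonds else 0))
    return atoms, bonds
```

For example, the "impossible" request `[("C", 1), ("O", 2), ("C", 3)]` is silently repaired: the C=O double bond saturates the oxygen, so the third token is dropped and a valid C=O fragment is returned instead of an overbonded structure.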

Your results showed a large performance increase when using a chemically homogeneous subset of the National Institute of Standards and Technology (NIST) data set. Can you explain why data set curation had such a strong impact on model prediction accuracy?

Fragmentation patterns in EI-MS depend strongly on molecular class and composition. Training on the full data set exposes the model to a very diverse chemical space with many competing fragmentation pathways, making generalization difficult. Restricting the training data to a chemically homogeneous subset yields a more consistent fragmentation behavior and less structural variability, allowing the model to learn relationships between spectra and structures and thereby improving accuracy.

We are now working on a hierarchical approach: first classifying an unknown spectrum into the appropriate chemical subset, then applying a model trained specifically on that subset.

While MASSISTANT achieved up to a 54% exact match rate, about half of predictions were still partial or approximate. What do you see as the main limitations of the current model, and how might future iterations address them?

In my opinion, the main limitations of the current model originate from both the nature of EI-MS data and the availability of high-quality training data. EI-MS is a hard ionization technique that causes extensive fragmentation (including the loss of neutral fragments, so loss of information) and often the absence of the molecular ion peak, which means that some structural information is missing from the spectrum. This makes full reconstruction difficult, or even impossible, especially for larger or more complex molecules. In addition, the availability of high-quality EI-MS data is limited. One million spectra may seem like a lot but compared to the entire chemical space of interest it is only a tiny fraction, which reduces the model’s ability to generalize.

We could also make use of synthetically generated spectra to expand the training set, and improve prediction accuracy further by integrating complementary techniques such as retention indices (chromatographic data) and infrared (IR) spectroscopy. By combining information from multiple analytical sources, future models could achieve more reliable structure prediction.

How do you envision MASSISTANT being integrated into GC-EI-MS workflows in research labs, and what types of scientific fields or industries would benefit the most from this modeling tool?

For now, we use MASSISTANT as a complementary tool within our existing GC–MS workflow. The spectra of unknown peaks that fail to match the database are directly imported (as a list of unknown spectra in JCAMP [Joint Committee on Atomic and Molecular Physical Data] format) and analyzed by MASSISTANT. The model then provides a candidate structure for each unknown, offering clues about its overall structure and key substructures, and serving as a starting point for further interpretation. I am convinced that these types of predictive/generative deep learning tools will soon be fully integrated into analytical GC–MS software. Such integration would allow automated structural suggestions, thereby reducing interpretation time, increasing throughput, and improving identification quality and accuracy.
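As an illustration of the import step, here is a minimal sketch of extracting an EI peak list from a JCAMP-DX record. The field labels follow the JCAMP-DX convention; real files vary, and this is not MASSISTANT's actual reader.

```python
def read_jcamp_peaks(text):
    """Minimal sketch: pull (m/z, intensity) pairs from a JCAMP-DX record
    containing a ##PEAK TABLE=(XY..XY) block.  A production reader would
    also handle compressed ##XYDATA forms and multi-spectrum files."""
    peaks, in_table = [], False
    for line in text.splitlines():
        line = line.strip()
        if line.upper().startswith("##PEAK TABLE"):
            in_table = True            # peak pairs follow on the next lines
            continue
        if line.startswith("##"):      # any other labeled record ends the table
            in_table = False
            continue
        if in_table and line:
            for pair in line.replace(";", " ").split():
                mz, intensity = pair.split(",")
                peaks.append((float(mz), float(intensity)))
    return peaks
```

Each unknown spectrum parsed this way becomes a simple peak list that a downstream model can consume.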

Sectors that would benefit most include regulatory material testing (for example, food contact materials), forensics and toxicology, and pharmaceutical research/metabolomics. For example, in food-contact material assessments, migration extracts are analyzed and peaks above regulatory thresholds (for example, >10 µg/kg food simulant, according to EU regulations) must be evaluated. If a peak cannot be identified and assessed, it is treated as potentially hazardous, which can block material approval. In such cases, tools like MASSISTANT can play an important role by providing candidate structures or functional groups.

Beyond predicting whole molecular structures, MASSISTANT also generates information about substructures and functional groups. How valuable is this capability for real-world compound identification?

As stated before, full reconstruction of a molecule based only on its low-resolution EI mass spectrum is difficult, or even impossible, especially for larger or more complex molecules. Even if the prediction of the exact molecule is incorrect, the model is often capable of predicting key substructures, including functional groups and atoms. These predictions can already provide important clues about the real molecule and thereby support the interpretation process. For example, recognizing an aromatic ring, a carbonyl group, or a primary amine can immediately narrow down the list of possible candidates. Moreover, in some cases it is not strictly necessary to predict the exact molecule; knowing which substructures or functional groups are present can already provide valuable insights.

Do you see potential in expanding MASSISTANT beyond molecules under 600 Da, or coupling it with other spectrometric or chromatographic techniques to broaden its applicability?

We intentionally restricted the data set to molecules with a molecular weight of ≤600 Da. This range reflects the practical operating limits of standard GC-EI-MS. In addition, larger molecules are both more complex and less represented in the data set, which would lead to lower prediction accuracy.

Coupling multiple analytical techniques for structure prediction of unknown peaks in GC-EI-MS chromatograms indeed has great potential. A low-hanging fruit is the incorporation of available retention index data from gas chromatography, as it provides information orthogonal to the mass spectrum and is obtained automatically when performing GC–MS (so no extra analysis is required). Retention indices could be used as an additional feature during model training to help the model learn relationships between spectra, structures, and chromatographic behavior.
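As a purely hypothetical sketch of that idea (the unit-mass binning, the 600 Da cap, and the retention-index scaling here are illustrative assumptions, not the published model's input format), a spectrum and a retention index could be combined into a single feature vector:

```python
def make_model_input(peaks, retention_index, max_mz=600, ri_scale=4000.0):
    """Hypothetical sketch: combine a low-resolution EI spectrum with a
    retention index into one fixed-length feature vector.  The spectrum
    is binned to unit m/z and normalized to the base peak; the retention
    index, scaled to roughly [0, 1], is appended as one extra feature."""
    bins = [0.0] * (max_mz + 1)
    for mz, intensity in peaks:
        idx = int(round(mz))
        if 0 <= idx <= max_mz:
            bins[idx] += intensity          # accumulate into unit-mass bins
    top = max(bins) or 1.0                  # base-peak normalization
    spectrum = [b / top for b in bins]
    return spectrum + [retention_index / ri_scale]
```

A model trained on such vectors would see the chromatographic behavior alongside the fragmentation pattern, which is the orthogonality the answer above describes.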

Reference

  1. Mommers, J.; Barta, L.; Pietrasik, M.; Wilbik, A. MASSISTANT: A Deep Learning Model for De Novo Molecular Structure Prediction from EI-MS Spectra via SELFIES Encoding. J. Chromatogr. A 2025, 1759, 466216. DOI: 10.1016/j.chroma.2025.466216
