May 12, 2025

Evaluating the Accuracy of Mass Spectrometry Spectral Databases

Mass spectrometry (MS) can be effective in identifying unknown compounds, though identification becomes complicated when a compound’s spectrum is absent from known databases. Researchers aimed to test spectral prediction for MS databases using electron ionization (EI)–MS.

Researchers from the University of Maryland and the Czech Academy of Sciences tested the ability of spectral prediction algorithms to predict spectra for mass spectrometry (MS) databases. Their findings were published in Analytical Chemistry (1).

MS is widely regarded as one of the most effective analytical methods for identifying unknown compounds, and it is used across fields such as toxicology and education. This ubiquity can be attributed to the high sensitivity and specificity of EI–MS, especially when used alongside gas chromatography (GC). MS identifies unknown compounds by measuring the mass-to-charge ratio (m/z) of ions in a sample.

While MS is useful, it comes with limitations, especially for unknown compound identification. Notably, MS spectral databases rely on previously established reference data. If a compound’s spectrum does not exist within known databases, identifying that compound can become difficult. Multiple MS spectral prediction algorithms have been created to address this limitation, though refinement work remains to be done.
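To make the dependence on reference data concrete, the following is a minimal sketch of a library search that scores a query spectrum against stored reference spectra. It assumes plain cosine similarity over peaks at matching integer m/z values, which is one common way to compare spectra and is not necessarily a metric used in the study. The function names, compound labels, and peak values are hypothetical and exist only for illustration.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, library):
    """Return (name, score) for the library entry most similar to the query."""
    scored = [(name, cosine_similarity(query, ref)) for name, ref in library.items()]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy library: made-up peaks, not real reference spectra.
library = {
    "compound_A": {41: 30.0, 57: 100.0, 71: 45.0},
    "compound_B": {44: 100.0, 74: 60.0, 102: 10.0},
}

# A query that resembles compound_A scores close to 1 against it.
known_query = {41: 28.0, 57: 100.0, 71: 50.0}
print(best_match(known_query, library))

# A query with no peaks in common with any entry scores 0 everywhere,
# so the search cannot propose a meaningful identification.
unknown_query = {30: 100.0}
print(best_match(unknown_query, library))
```

The second query shows the limitation described above: when no reference spectrum resembles the unknown, the search has nothing useful to return, which is the gap that predicted spectra are meant to fill.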

In this study, the scientists evaluated the accuracy of the neural electron-ionization mass spectrometry (NEIMS) spectral prediction algorithm. This algorithm was trained on EI–MS spectra from roughly 300,000 molecules. After training and validation, the algorithm was reported to predict spectra quickly, with varying degrees of accuracy. The analyses focused on monosubstituted α-amino acids, given their significance as targets for astrobiology, synthetic biology, and diverse biomedical applications. The scientists hoped to inform those using generated spectra to detect unknown biomolecules.

While NEIMS performed well for the molecules and measures it was trained on, accuracy decreased for molecules (amino acids) outside of the training set and when measured in other ways. The pattern was consistent across all four accuracy metrics and all three libraries tested, though to varying degrees. Given the small proportion of amino acid spectra in the National Institute of Standards and Technology (NIST)–MS database (~0.01%), this finding may be intuitive; that said, the scientists aimed to quantify the degree to which the problem manifested.

The data also showed that neither derivatization nor physicochemical properties (molecular weight and hydrophobicity) correlate with accuracy. This may stem from the small fraction of amino acids within the NEIMS training set, though no clear insights emerged into why NEIMS struggles to reliably predict MS spectra for these amino acids. In terms of derivatization, the NEIMS algorithm proved just as accurate for “free” amino acids as for their derivatized counterparts.
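One way to check whether a property such as molecular weight tracks prediction accuracy is a rank correlation. The sketch below computes Spearman’s rank correlation in plain Python. The article does not say which statistic the authors used, so the choice of Spearman’s coefficient is an assumption, and the weights and accuracy scores are invented placeholder numbers rather than data from the study.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for illustration only (not results from the study).
molecular_weights = [100.0, 120.0, 140.0, 160.0, 180.0]
accuracy_scores = [0.4, 0.7, 0.3, 0.6, 0.5]

rho = spearman(molecular_weights, accuracy_scores)
print(f"Spearman rho = {rho:.2f}")
```

A coefficient near zero, as with these placeholder numbers, would indicate no monotonic relationship between the property and accuracy; values near 1 or -1 would indicate a strong one.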

The scientists found that predicted spectra were inaccurate for amino acids beyond the algorithm’s training data. However, these inaccuracies could not be explained by physicochemical differences or by the derivatization state of the amino acids measured. As such, the scientists highlighted the need both to improve current machine learning-based approaches and to further optimize ab initio spectral prediction algorithms, so that databases can be expanded to structures beyond what is currently experimentally possible, including theoretical molecules.

Once MS spectral prediction algorithms are validated, there will be a critical need for comprehensive libraries of predicted spectra for unknown and theoretical molecules (amino acids). Once reliable theoretical databases of predicted mass spectra are established, they will greatly expand the potential search space for an unknown molecule in a sample. These tools would have broad uses; the scientists cited examples such as informing NASA mission data and expanding public health surveillance (2,3). Beyond amino acids, extending predictions to other classes of biomolecules would further advance many disciplines.

References

(1) Brown, S. M.; Allgair, E.; Kryštůfek, R. Mapping the Edges of Mass Spectral Prediction: Evaluation of Machine Learning EIMS Prediction for Xeno Amino Acids. Anal. Chem. 2025. DOI: 10.1021/acs.analchem.5c00286

(2) Sarli, B.; Bowman, E.; Cataldo, G.; et al. NASA’s Capture, Containment, and Return System: Bringing Mars Samples to Earth. Acta Astronaut. 2024, 223, 270–303. DOI: 10.1016/j.actaastro.2024.05.048

(3) Lasch, P.; Stämmler, M.; Schneider, A. A MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly Pathogenic Microorganisms from the Robert Koch-Institute (RKI). DOI: 10.5281/zenodo.163517
