Machine Learning and Nontargeted Liquid Chromatography–Mass Spectrometry to Assess Ecotoxicity

LCGC Europe, January 2023, Volume 36, Issue 01
Pages: 29–31

Anneli Kruve and Pilleriin Peets, from Stockholm University in Sweden, discuss their latest research on using machine learning and nontargeted liquid chromatography–mass spectrometry (LC–MS) to assess ecotoxicity.

Q. Why is there an incentive to predict the ecotoxicity of unidentified chemicals in water?

Anneli Kruve: Most water samples are very complex and contain thousands of chemicals. Identifying all of these chemicals is very complicated and time-consuming, or even impossible. As a result, only 1–20% of the features detected in water samples with liquid chromatography–high-resolution mass spectrometry (LC–HRMS) are actually identified. These identified chemicals explain only a small fraction of the toxicity of the samples. To close the gap in toxicity evaluation, we need to pay attention to the features that are not identified. Predicting toxicity from tandem mass spectrometry (MS2) spectra helps to pinpoint the features with higher toxicity, as well as to evaluate the mixture toxicity of the sample. The most toxic features can then be prioritized for further identification, which could later help to design environmental remediation strategies.

Q. What techniques are currently used to assess ecotoxicity and what are the disadvantages of these techniques?

Pilleriin Peets: One way to learn about a sample’s toxicity is to run in vivo or in vitro toxicity tests on the whole water sample or on a fractionated sample, an approach also called effect-directed analysis. These approaches are time-consuming and expensive, and they do not pinpoint the chemicals contributing to the toxicity. In addition, ethical questions arise when lethality is evaluated for organisms such as fish.

An alternative possibility is to try to identify the individual features detected in the sample and then find the experimental toxicity information for these chemicals (if available) from databases, such as CompTox (1). If this is not possible, one could also use in silico predictions, read‑across, or quantitative structure-activity relationship (QSAR) methods to predict toxicity (2). The problem is that these methods rely on the correct structural identification of the detected features.

Q. Your team developed a technique combining nontargeted screening using LC–MS and machine learning (3). Why did you think this approach would be successful and what benefits does it offer the analyst?

AK: First, toxicity predictions have until now been limited to the small fraction of features that are structurally identified in LC–HRMS analysis. Clearly, an approach is also needed to estimate the toxicity of the chemicals left unidentified.

Second, the empirical analytical properties of chemicals and their toxic effects are connected through the structure of these chemicals. So the idea was to skip the structure in the middle and go directly from empirical analytical data to adverse effect predictions. This idea was supported by two key pieces of information: (i) functional groups of chemicals, called structural alerts, are already known to be related to the toxicity of the chemical, and (ii) the fragments and neutral losses in an MS2 spectrum are directly related to the functional groups of the chemicals. Our idea links these two observations to predict ecotoxicity.

This approach offers the benefit that the ecotoxicity can also be predicted for the unidentified features in LC–HRMS.
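To make the link between an MS2 spectrum and functional groups concrete, the following minimal sketch computes neutral losses as the mass difference between the precursor and each fragment and matches them against a few common losses. It is an illustration only, not part of MS2Tox: the function name, the tolerance, and the spectrum values are hypothetical, and singly charged ions are assumed.

```python
# Monoisotopic masses (Da) of a few common neutral losses.
NEUTRAL_LOSSES = {
    "H2O": 18.010565,
    "NH3": 17.026549,
    "CO": 27.994915,
    "CO2": 43.989829,
}


def annotate_neutral_losses(precursor_mz, fragment_mzs, tol=0.005):
    """Return (fragment m/z, loss name) pairs whose precursor-fragment
    mass difference matches a listed neutral loss within tol (Da).

    Assumes singly charged ions, so m/z differences equal mass differences.
    """
    hits = []
    for mz in fragment_mzs:
        diff = precursor_mz - mz
        for name, mass in NEUTRAL_LOSSES.items():
            if abs(diff - mass) <= tol:
                hits.append((mz, name))
    return hits


# Hypothetical spectrum; the values are invented for illustration.
precursor = 180.0655
fragments = [162.0549, 136.0757, 91.0542]
print(annotate_neutral_losses(precursor, fragments))
# -> [(162.0549, 'H2O'), (136.0757, 'CO2')]
```

A water loss, for example, hints at a hydroxyl group and a CO2 loss at a carboxylic acid, which is the kind of structural information a model can use without knowing the full structure.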

Q. What were the main challenges you encountered when developing this method and how did you overcome them?

PP: One of the main challenges in machine learning for toxicity predictions is the lack of high-quality experimental LC50 values that are comparable with each other. In addition, despite the many databases for mass spectrometry data, relatively few MS2 spectra are publicly available, and even fewer belong to chemicals that also have toxicity data. Although both the CompTox (for toxicity) (1) and MassBank (for MS2 data) (4) databases contain tens of thousands of measurements, the overlap between the MS2 data and fish LC50 values is fewer than 300 unique chemicals, which is clearly insufficient for training machine learning models. To overcome these roadblocks, we combined the LC50 values for three fish species (fathead minnow, bluegill, and rainbow trout) with correlated endpoint values. In addition, we trained our model on structural fingerprints calculated from the simplified molecular-input line-entry system (SMILES) (5) representation of the structure, instead of on MS2 data.

These strategies increased the dataset to 800 individual chemicals and made it possible to train predictive machine learning models. The chemicals that had MS2 data available were used to validate the approach; for these, the structural fingerprints were predicted from the MS2 spectra with the SIRIUS+CSI:FingerID software (6,7).

Q. What are the limitations of this approach and are you planning to continue to develop this research?

AK: One of the limitations is that MS2Tox only predicts ecotoxicity as fish LC50 values. We are interested in expanding towards other endpoints, including human-relevant ones. In addition, we are exploring other possibilities to extract toxicity-relevant information from the empirical analytical data.


Toxicity alone is insufficient to evaluate the risk posed by a chemical. We also need information about exposure, that is, the concentration of these chemicals. We will soon be launching an extension to MS2Tox that will also allow the concentration of unidentified chemicals to be estimated.
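One common way to combine toxicity and exposure is the toxic unit, the ratio of a chemical's concentration to its LC50, summed over all chemicals under the assumption of concentration addition. The sketch below illustrates that calculation; it is not described in the interview as the authors' method, and the feature identifiers, concentrations, and LC50 values are invented.

```python
def toxic_unit(feature):
    """Toxic unit of one feature: concentration divided by LC50 (same units)."""
    return feature["conc_mg_l"] / feature["lc50_mg_l"]


def mixture_toxic_units(features):
    """Sum of toxic units, assuming concentration addition (an assumption)."""
    return sum(toxic_unit(f) for f in features)


def rank_by_contribution(features):
    """Feature ids ordered from the largest to the smallest toxic unit."""
    ranked = sorted(features, key=toxic_unit, reverse=True)
    return [f["id"] for f in ranked]


# Hypothetical features with estimated concentration and predicted LC50.
features = [
    {"id": "F1", "conc_mg_l": 0.02, "lc50_mg_l": 1.0},
    {"id": "F2", "conc_mg_l": 0.5, "lc50_mg_l": 100.0},
    {"id": "F3", "conc_mg_l": 0.001, "lc50_mg_l": 0.01},
]

print(f"{mixture_toxic_units(features):.3f}")  # -> 0.125
print(rank_by_contribution(features))          # -> ['F3', 'F1', 'F2']
```

In this invented example the feature with the lowest concentration (F3) contributes most to the sum because its LC50 is also the lowest, which is why neither toxicity nor concentration is sufficient on its own.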

Q. Can you elaborate on the machine learning method you used?

PP: We predicted the LC50 values for fish, and LC50 is a continuous parameter. We therefore tested and optimized the parameters for several regression algorithms. Several of the tested algorithms yielded models that performed equally well. The final model that is available in the R package MS2Tox is based on a gradient-boosting algorithm. This algorithm is quite similar to the well-known random forest regression, but the individual regression trees each have only limited learning power and are built in a stepwise manner, each one on top of the previous ones.
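The stepwise construction described above can be sketched in a few lines. The example below is a deliberately simplified gradient-boosting regressor in plain Python, not the MS2Tox implementation: each "tree" is a single split on one fingerprint bit, and each new tree is fitted to the residuals left by the previous ones. The fingerprints, the target values, and all function names are hypothetical.

```python
def fit_stump(X, residuals):
    """Pick the fingerprint bit whose 0/1 split best fits the residuals.

    Returns (bit, value_if_0, value_if_1), or None if no bit splits the data.
    """
    best = None
    for bit in range(len(X[0])):
        on = [r for x, r in zip(X, residuals) if x[bit] == 1]
        off = [r for x, r in zip(X, residuals) if x[bit] == 0]
        if not on or not off:
            continue
        m_on, m_off = sum(on) / len(on), sum(off) / len(off)
        sse = sum((r - m_on) ** 2 for r in on) + sum((r - m_off) ** 2 for r in off)
        if best is None or sse < best[0]:
            best = (sse, bit, m_off, m_on)
    return None if best is None else best[1:]


def fit_boosting(X, y, n_trees=200, learning_rate=0.1):
    """Build weak trees one after another, each on the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        bit, v0, v1 = stump
        stumps.append(stump)
        # A small learning rate keeps each tree's contribution weak.
        preds = [p + learning_rate * (v1 if x[bit] else v0) for x, p in zip(X, preds)]
    return base, stumps, learning_rate


def predict(model, x):
    """Baseline plus the shrunken contributions of all stumps."""
    base, stumps, lr = model
    return base + lr * sum(v1 if x[bit] else v0 for bit, v0, v1 in stumps)


# Invented 4-bit fingerprints and invented log-scale LC50 values.
X = [[1, 0, 0, 1], [1, 1, 1, 0], [0, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 1], [0, 0, 0, 0]]
y = [0.5, 0.0, 1.5, 2.0, 0.0, 2.0]

model = fit_boosting(X, y)
print([round(predict(model, x), 2) for x in X])
```

A random forest would instead grow many deep trees independently and average them; here the trees are shallow and depend on each other, which is the difference highlighted in the answer above.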

Q. Could this approach be useful in other areas of environmental screening?

AK: Yes, there is good reason to believe that empirical analytical data contain a lot of information about the biological activity, persistence, and exposure of the detected chemicals. We foresee that in the next 10 years empirical analytical information will be increasingly used in environmental modelling.

In addition, MS2Tox can be used in different environmental applications. In this paper (3), we investigated water samples, but the approach can also work for soil samples, biota, food, or even human biofluids if human exposomics is of interest.

Q. Is it possible to use nontargeted screening in environmental analysis on a routine basis?

A: Nontargeted analysis is increasingly used in routine monitoring. At the “NTS workshop on analytical techniques and implementation”, organized by the Norman Network in Odense, Denmark, in November 2022, the possibility of using nontargeted screening for regulatory purposes was discussed extensively, with regulators from several EU countries taking part. In some EU countries, regulators already use nontargeted screening for routine monitoring, either on its own or in parallel with targeted analysis. The analysis part of nontargeted screening is fairly well established and routinely applicable. What needs more attention is how the data are used.

Q. Anything else you would like to add?

PP: The current version of MS2Tox is available as an R package for everyone to use and test (8). We welcome feedback and are happy to improve the package further.


  1. CompTox Chemicals Dashboard.
  2. Raies, A. B.; Bajic, V. B. In Silico Toxicology: Computational Methods for the Prediction of Chemical Toxicity. WIREs Computational Molecular Science 2016, 6 (2), 147–172. DOI: 10.1002/wcms.1240
  3. Peets, P.; Wang, W.-C.; Macleod, M.; et al. MS2Tox Machine Learning Tool for Predicting the Ecotoxicity of Unidentified Chemicals in Water by Nontarget LC-HRMS. Environmental Science and Technology 2022, 56 (22), 15508–15517. DOI: 10.1021/acs.est.2c02536
  4. MassBank Database.
  5. Weininger, D. SMILES, A Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28 (1), 31–36. DOI: 10.1021/ci00057a005
  6. SIRIUS.
  7. Dührkop, K.; Fleischauer, M.; Ludwig, M.; et al. SIRIUS 4: A Rapid Tool for Turning Tandem Mass Spectra into Metabolite Structure Information. Nature Methods 2019, 16, 299–302. DOI: 10.1038/s41592-019-0344-8
  8. MS2TOX.

Anneli Kruve graduated in 2011 from the University of Tartu, Estonia, and continued her studies as a postdoc at Technion, Israel. She was a Humboldt fellow at Freie Universität Berlin, Germany (2017–2018). In 2019, she joined Stockholm University, Sweden, where she is in charge of the mass spectrometry laboratory. Her field of study lies in the fundamentals and applications of mass spectrometry.

Pilleriin Peets graduated in 2020 from the University of Tartu and continued as a postdoctoral fellow at Stockholm University. During her studies, she also worked for five years at the Estonian Environmental Research Centre as an analytical chemist. At Stockholm University her research includes developing machine learning-based models for predicting ecotoxicity from mass spectrometric data.