Turning Metabolomics Data Processing from a “Black Box” to a “White Box”

LCGC Supplements, Hot Topics in Mass Spectrometry, Volume 40, Issue s9
Pages: 20–22

Extracting thousands of metabolic features from liquid chromatography–mass spectrometry (LC–MS)–based metabolomics data is not easy. Although many feature extraction algorithms have been developed over the past few decades, automated feature extraction is still not a “white box” process. For instance, it is challenging to quickly determine the optimal parameters for the best feature extraction outcome. It is also impossible to extract every true metabolic feature. Moreover, there is contamination from false metabolic features of different sources, such as signal noise and in-source fragmentation. Our laboratory has recently developed a suite of bioinformatics tools to address these metabolic peak-picking challenges. The goal is to improve the peak-picking outcome quality, so we can effectively obtain biological information from the metabolomics data.

Liquid chromatography–mass spectrometry (LC–MS) is a prominent analytical platform that has been widely used in untargeted metabolomics. Owing to its high sensitivity and throughput, LC–MS-based metabolomics generates a large amount of data, containing the identities of and quantitative information for thousands of metabolic features from a single biological sample.

The first step in processing the megabytes or gigabytes of metabolomics data is to automatically extract all the metabolic features. The past decade has witnessed the development of many bioinformatics programs for automated metabolic feature extraction. Some of the most commonly used academic software packages are XCMS, MS-DIAL, and MZmine 2 (1–3). These data processing programs have been successfully applied in thousands of metabolomics studies, revealing the metabolic signatures associated with aging, cancer, and host-microbiome interactions, among others (4). However, none of the existing metabolomics peak-picking algorithms can completely fulfill its task, mainly because of the complexity and large scale of LC–MS-based metabolomics data. In particular, given the diverse chemical structures and broad concentration ranges (sub-femtomolar to sub-millimolar) (5) of metabolites, metabolic features present various extracted ion chromatogram (EIC) peak shapes. As a result, it is challenging to find peak-picking parameters that recognize all of the true metabolic features, and some high-confidence true metabolic features are often missed by the automated peak-picking process. The same challenge also causes various false metabolic features to be picked up and included in the feature list. These false features can arise from signal noise, system contamination, and in-source fragmentation. The combination of an incomplete list of true features and many false features diminishes the peak-picking quality and makes the downstream data interpretation problematic. Unfortunately, because of the large size of metabolomics data and the many features that can be picked up, it is very difficult to manually check the results and fix the problems. All these issues show that current metabolomics peak picking is still a “black box” process.

Figure 1 illustrates the workflow of metabolomics peak picking and summarizes the abovementioned issues that arise during automated peak picking with conventional data processing software. Over the past two years, our work has aimed to overcome these hurdles in metabolomics data processing so that we can obtain high-quality metabolomics data for downstream metabolite annotation and biological interpretation (Figure 2). Notably, all the developed programs are freely available on GitHub (https://github.com/HuanLab), where they are regularly revised and updated.

Measuring Optimal Data-Processing Parameters: Paramounter

The first challenge in processing metabolomics data is choosing an appropriate set of peak-picking parameters. Essentially, the parameters in all data processing programs are a set of thresholds defining what types of chromatographic peaks should be recognized as metabolic features. The parameters rely heavily on the LC–MS analytical conditions and can dramatically influence the number and quality of the extracted features. Given the different chromatographic peak shapes of metabolic features, it is difficult to directly decide the best parameters unless users manually check all the features. Conventionally, researchers use the design of experiments (DOE) strategy to test many different parameter combinations until the best performance is reached (6). However, the DOE approach is time-consuming and does not provide a mechanistic explanation of the chosen parameters. To address this challenge, we approached the question from a different angle. Because the parameters are a set of definitions based on the features’ chromatographic attributes, we reasoned that it should be possible to measure these parameters directly if we can plot the distributions of the chromatographic attributes from the raw LC–MS data. The distribution plots allow us to turn optimization from a time-consuming DOE trial into a direct measurement. To achieve this, we studied the peak-picking algorithms of five commonly used open-access data processing programs, namely XCMS, MS-DIAL, MZmine 2, El-MAVEN, and OpenMS (1–3,7,8). From these algorithms, we identified four universal parameters: mass tolerance, peak height, peak width, and instrumental shift. We then developed an R script, Paramounter, to automatically measure these universal parameters directly from the raw data (9). More importantly, these universal parameters can be automatically translated to the specific parameters of each data processing program. In this way, Paramounter replaces time-consuming DOE trials with a direct measurement.
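
The idea of measuring a parameter from a distribution, rather than searching for it, can be illustrated with a minimal Python sketch. It is not Paramounter's implementation (which is an R script). The function names, the use of the trace median as the reference m/z, and the 95th-percentile cutoff are all assumptions made for illustration. Each input trace is a hypothetical list of m/z readings for the same ion across consecutive scans.

```python
import math
import statistics


def ppm_deviations(mz_trace):
    """Absolute ppm deviation of each m/z reading from the median of its trace."""
    center = statistics.median(mz_trace)
    return [abs(mz - center) / center * 1e6 for mz in mz_trace]


def measure_mass_tolerance(mz_traces, percentile=95):
    """Pool ppm deviations over all traces and return the requested percentile.

    Uses the nearest-rank percentile, so the result is always an observed value.
    The percentile cutoff is an illustrative assumption, not a published setting.
    """
    pooled = sorted(d for trace in mz_traces for d in ppm_deviations(trace))
    if not pooled:
        raise ValueError("no m/z readings supplied")
    rank = max(1, math.ceil(percentile / 100 * len(pooled)))
    return pooled[rank - 1]
```

Calling `measure_mass_tolerance` on traces collected from a raw data file would return a single ppm value read off the pooled distribution, which could then be entered as the mass tolerance in a peak-picking program.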

Rescuing True Features Missed by Peak Picking Algorithms: JPA

Although peak-picking algorithms can automatically extract thousands of metabolic features from raw LC–MS data, many true metabolic features are missed because of the diversity in mass accuracy, peak shapes, and abundances. At present, no existing feature extraction algorithm can pick up 100% of the features. It is important to recognize the high cost of missing true features in metabolomics. If a true feature relevant to the given biological question is missed, it will be very hard to retrieve later, and the downstream biological interpretation will be affected. By manually investigating missed real metabolic features, we found that many have high-quality tandem mass spectrometry (MS/MS) spectra collected during the LC–MS analysis using data-dependent acquisition. To rescue these true positive features, we proposed an integrated feature extraction method for features with good and bad chromatographic peak shapes (10). Our results show that a significant number of high-quality metabolic features can be extracted using our integrated approach. To streamline automatic peak picking, precise quantification, and metabolite annotation, we further developed joint metabolic feature extraction (JPA) and automated metabolite annotation. JPA integrates three peak-picking strategies for the comprehensive extraction of true positive features (11). The first strategy is centWave, a conventional algorithm that extracts features with Gaussian peaks. The second strategy rescues the metabolic features that are missed by the first algorithm but have associated MS/MS spectra. The third strategy is a targeted approach that extracts metabolites with known m/z values and retention times. This module can further rescue the metabolic features that are important to the given biological question but not found by conventional peak picking. JPA can drastically reduce the number of missed true features and achieve much higher metabolite coverage, with twofold more metabolic features.
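
The way the three strategies combine into one feature list can be sketched in Python as follows. This is only an illustration of the merging logic, not JPA's code: the `Feature` class, the function names, and the default tolerances (10 ppm, 15 s) are hypothetical, and the sketch assumes that each strategy has already produced a list of (m/z, retention time) pairs. The real workflow would also need to extract and verify each rescued signal in the raw data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    mz: float
    rt: float      # retention time in seconds
    source: str    # which strategy contributed the feature


def matches(mz_a, rt_a, mz_b, rt_b, ppm_tol, rt_tol):
    """True if two (m/z, RT) pairs agree within both tolerances."""
    ppm_error = abs(mz_a - mz_b) / mz_b * 1e6
    return ppm_error <= ppm_tol and abs(rt_a - rt_b) <= rt_tol


def joint_feature_list(picked, msms_precursors, targets, ppm_tol=10.0, rt_tol=15.0):
    """Merge conventional peak picking, MS/MS-based rescue, and a targeted list.

    A candidate from a later strategy is added only if no feature already in
    the list matches it, so each strategy rescues what the earlier ones missed.
    """
    features = [Feature(mz, rt, "peak_picking") for mz, rt in picked]
    later_strategies = (("msms_rescue", msms_precursors), ("targeted", targets))
    for label, candidates in later_strategies:
        for mz, rt in candidates:
            already_found = any(
                matches(mz, rt, f.mz, f.rt, ppm_tol, rt_tol) for f in features
            )
            if not already_found:
                features.append(Feature(mz, rt, label))
    return features
```

Keeping the `source` label on every feature makes it easy to report afterwards how many features each strategy contributed.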

Filtering Out Background Noise: EVA

An important step in processing metabolomics data is to manually check the extracted metabolic features to ensure that they are not false. A significant portion of false metabolic features are noise signals. These false features are extracted because their peak shapes accidentally pass the peak-picking criteria. Although these features can be easily recognized by a trained expert, manually checking thousands of metabolic features in a raw metabolomics data file takes days. It is even less realistic if the study is made up of tens or hundreds of samples. Because the pattern of false features can be easily recognized by humans, our laboratory sought to replace manual labor with automated recognition through deep learning. In particular, we leveraged a state-of-the-art convolutional neural network (CNN) and trained a CNN model with over 25,000 manually recognized EIC peaks from data of various sample types, LC–MS configurations, and spectra acquisition rates (12). The trained model was then developed into a Windows application, termed EVA (short for evaluation). The diversity of the training EIC plots ensures that EVA has over 90% accuracy in true and false positive feature recognition. Furthermore, EVA has a user-friendly interface, easily accessed by users with minimal programming expertise, to filter out false positive features with a few clicks.

Recognizing Features from In-Source Fragmentation: ISFrag

Besides background noise, false metabolic features can also arise from in-source fragmentation (ISF), where metabolic precursor ions are inevitably fragmented during ionization. These fragment ions are then falsely recognized as true metabolic features. In general, it is difficult to differentiate ISF features from true metabolic features in an LC–MS analysis because both can have good chromatographic peak shapes. Traditionally, ISF features are identified by common neutral losses (loss of H2O, CO2, and NH3, for example) and by MS/MS spectral similarity comparison against the chemical standards of the parent metabolites. However, the diverse neutral loss patterns and the limited availability of MS/MS spectra for metabolite standards make the confirmation of ISF features hard. To address this challenge, we developed a program that can automatically identify ISF features in a de novo manner (13). In brief, we rely on three useful patterns for automated ISF feature recognition: (a) ISF features coelute with their precursor ions; (b) the m/z of an ISF feature appears in the MS/MS spectrum of its precursor ion; and (c) ISF features and their parent features have similar fragmentation patterns. Following these rules, we developed ISFrag, an R package that can precisely recognize ISF features. Tested using a standard mixture of 125 metabolites, ISFrag achieves 100% accuracy in identifying the ISF features.
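
The three patterns above can be expressed as a simple rule check, sketched here in Python. This is an illustration under stated assumptions rather than ISFrag's implementation (which is an R package). The dictionary layout of a feature, the function names, the retention time and m/z tolerances, the cosine score used for pattern (c), and its 0.5 cutoff are all hypothetical choices. The requirement that the fragment be lighter than its parent is an added sanity check, not one of the three patterns.

```python
import math


def has_peak(spectrum, target_mz, mz_tol=0.01):
    """True if any (m/z, intensity) peak lies within mz_tol of target_mz."""
    return any(abs(mz - target_mz) <= mz_tol for mz, _ in spectrum)


def cosine_score(spec_a, spec_b, mz_tol=0.01):
    """Cosine similarity of two spectra; each peak of spec_b is matched at most once."""
    used, dot = set(), 0.0
    for mz_a, int_a in spec_a:
        candidates = [
            j for j, (mz_b, _) in enumerate(spec_b)
            if j not in used and abs(mz_a - mz_b) <= mz_tol
        ]
        if candidates:
            # Pair with the most intense unmatched peak inside the tolerance window.
            best = max(candidates, key=lambda j: spec_b[j][1])
            used.add(best)
            dot += int_a * spec_b[best][1]
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_isf_candidate(parent, child, rt_tol=2.0, min_score=0.5):
    """Apply patterns (a) to (c) to a parent/child pair of features.

    Each feature is a dict with keys "mz", "rt" (seconds), and "msms",
    where "msms" is a list of (m/z, intensity) tuples.
    """
    coelutes = abs(parent["rt"] - child["rt"]) <= rt_tol                 # pattern (a)
    in_parent_msms = has_peak(parent["msms"], child["mz"])               # pattern (b)
    similar = cosine_score(child["msms"], parent["msms"]) >= min_score   # pattern (c)
    lighter = child["mz"] < parent["mz"]  # added sanity check
    return lighter and coelutes and in_parent_msms and similar
```

In practice, the check would be run over every pair of coeluting features, and any feature flagged as a candidate would be annotated as an in-source fragment of its parent rather than reported as an independent metabolite.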


The development of bioinformatics tools is critical to the further advancement of metabolomics in the post-genomic era of biology. While developments in instrumentation and analytical methods are critical to generating high-quality metabolomics data, it is equally important to develop bioinformatics tools that enable efficient extraction of biologically relevant metabolic information for downstream data interpretation. As demonstrated in this short article, advanced bioinformatics programs can significantly improve the quality of metabolomics data, making its biological applications more convenient and giving greater confidence in the results. We hope that this article can also help researchers become more aware of the bioinformatics challenges in metabolomics, and encourage the metabolomics community to further develop bioinformatics solutions to address them.


(1) C.A. Smith, E.J. Want, G. O’Maille, R. Abagyan, and G. Siuzdak, Anal. Chem. 78, 779–787 (2006). DOI: 10.1021/ac051437y.

(2) T. Pluskal, S. Castillo, A. Villar-Briones, and M. Orešič, BMC Bioinform. 11, 1–11 (2010). DOI: 10.1186/1471-2105-11-395.

(3) H. Tsugawa et al., Nat. Methods 12, 523–526 (2015). DOI: 10.1038/nmeth.3393.

(4) C.H. Johnson, J. Ivanisevic, and G. Siuzdak, Nat. Rev. Mol. Cell Bio. 17, 451–459 (2016). DOI: 10.1038/nrm.2016.25.

(5) D.S. Wishart et al., Nucleic Acids Res. 35, D521–D526 (2007). DOI: 10.1093/nar/gkl923.

(6) M. Eliasson et al., Anal. Chem. 84, 6869–6876 (2012). DOI: 10.1021/ac301482k.

(7) H.L. Röst et al., Nat. Methods 13, 741–748 (2016). DOI: 10.1038/nmeth.3959.

(8) E. Melamud, L. Vastag, and J.D. Rabinowitz, Anal. Chem. 82, 9818–9826 (2010). DOI: 10.1021/ac1021166.

(9) J. Guo, S. Shen, and T. Huan, Anal. Chem. 94, 4260–4268 (2022). DOI: 10.1021/acs.analchem.1c04758.

(10) Y. Hu, B. Cai, and T. Huan, Anal. Chem. 91, 14433–14441 (2019). DOI: 10.1021/acs.analchem.9b02980.

(11) J. Guo et al., Metabolites 12, 212 (2022). DOI: 10.3390/metabo12030212.

(12) J. Guo et al., Anal. Chem. 93, 12181–12186 (2021). DOI: 10.1021/acs.analchem.1c01309.

(13) J. Guo, S. Shen, S. Xing, H. Yu, and T. Huan, Anal. Chem. 93, 10243–10250 (2021). DOI: 10.1021/acs.analchem.1c01644.

Jian Guo and Tao Huan are with the Department of Chemistry at the University of British Columbia, in Vancouver, British Columbia, Canada. Direct correspondence to: thuan@chem.ubc.ca