In November 2022, OpenAI’s launch of ChatGPT transformed Artificial Intelligence (AI),1 and more specifically generative AI, from science fiction into an everyday reality. While there is no doubt that generative AI will be a disruptive technology across a multitude of industries, AI techniques like machine learning (ML) and deep learning have been used in the scientific laboratory for years, particularly in drug discovery and development. In fact, several highly publicized collaborations between pharmaceutical companies and AI technology providers have started to yield results.2,3 For example, Insilico Medicine entered clinical trials with the first new therapeutic for which AI was used both to identify the target and to generate the design.4 Other recent partnerships, like the collaboration between Imperial College, BASF, and Sterling Pharma, are deploying AI to improve continuous manufacturing processes.5,6 While scientific organizations are recognizing and capitalizing on the potential of AI in many ways, there remain many challenges in the present-day analytical laboratory that AI can be leveraged to overcome. AI has the potential to change the way we do science inside and outside the laboratory, but before that ambition can be realized, organizations need to unlock their data to turn AI from science fiction into science.
Identifying the problem
The first step in leveraging AI is identifying the problem that needs to be solved and what data is available to inform the solution. AI is particularly well suited to data-rich processes, where understanding the data that has been generated can inform and improve that process over time. For example, instrument telemetry data can feed AI models that warn of, and ultimately even prevent, common run failures. Analytical instruments produce vast amounts of telemetry data: instrument readouts that contain no intellectual property (IP) about what is being run, but describe the instrument conditions before, during, and after runs. Of course, the reason for instrumental analysis is the scientific data, and AI can help here too. AI techniques like anomaly detection can identify common analytical challenges in chromatographic data by monitoring baselines and detected peaks for variances from typical patterns, such as spikes, baseline noise, and retention time drift. A human-in-the-loop approach would allow chromatographers to confirm or reject the algorithm’s suggestion that there is a trace impurity, gas in the line, or that it is time for a new column. Over time, the algorithm learns how the lab-specific workflow, method, and compounds affect the various chromatogram properties, becoming more accurate and improving its recommendations.
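To make the anomaly-detection idea concrete, here is a minimal sketch in pure Python. It flags spike-like outliers in a chromatogram trace and checks a peak for retention time drift. All function names, thresholds, and the toy trace are illustrative assumptions, not a real chromatography algorithm; a production system would use far more sophisticated models.

```python
from statistics import mean, stdev

def detect_spikes(signal, threshold=2.0):
    """Flag points whose deviation from the trace mean exceeds `threshold`
    standard deviations -- a deliberately simple stand-in for the kind of
    anomaly detection described above."""
    mu, sigma = mean(signal), stdev(signal)
    return [i for i, y in enumerate(signal) if abs(y - mu) > threshold * sigma]

def retention_time_drifted(reference_rt, observed_rt, tolerance=0.1):
    """Return True when a peak's retention time (in minutes) has drifted
    beyond the allowed tolerance from the reference method value."""
    return abs(observed_rt - reference_rt) > tolerance

# A mostly flat toy trace with one injected spike at index 4.
trace = [0.10, 0.10, 0.12, 0.09, 5.00, 0.11, 0.10, 0.10, 0.10, 0.10]
print(detect_spikes(trace))                # -> [4]
print(retention_time_drifted(3.20, 3.45))  # -> True
```

In a human-in-the-loop deployment, each flagged index or drifted peak would become a suggestion for the chromatographer to confirm or reject, and those confirmations would feed back into the model.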
Addressing common chromatographic challenges is one potential application of AI-driven improvements in the laboratory, but there are numerous other opportunities. There are many tedious and human error-prone steps in an analytical workflow that are better suited to AI-powered automation, such as data review and instrument maintenance. Additionally, complex and highly manual tasks, such as setting up runs for method development and data analysis that are currently done in third-party applications such as Excel, can benefit from AI solutions. Identifying a problem that AI could solve is likely the easiest step in the process. The harder challenge is getting the data.
Building the data set
The challenge is that, to build AI, ML, and advanced analytics solutions, data scientists need large data sets to train their models. Given the focus on generating reproducible, high-quality data in science, analytical applications seem an obvious opportunity for AI. Unfortunately, the reality is that the scientific data needed to train AI models is often siloed in disparate systems. In recognition of these challenges and the growing value of data science, the FAIR Guiding Principles for scientific data management and stewardship were published in 2016. They focus on the “machine-actionability” of data: it needs to be findable, accessible, interoperable, and reusable.7 Open standards like the Allotrope Ontologies and Data Model seek to break down these silos by creating “linked data that standardizes experimental parameters so we can remove human error and enhance scientific reproducibility.”8 While Allotrope and other open standards like AnIML9 and mzML10 help translate scientific data into a common ontology, the reality is that this alone does not address all the challenges organizations face.
Open standards like mzML are useful for further data processing through third-party industry solutions, but as organizations’ appetite for AI-driven solutions grows, so will their need for complete, contextualized data sets. Conversion into a common ontology does not by itself enable organizations to aggregate and correlate processed results with the associated instrument telemetry data, since that telemetry is often not saved with the result files. Given the variety of analytical techniques and other data sources like Laboratory Information Management Systems (LIMS) and Electronic Laboratory Notebooks (ELNs), it is difficult for organizations to make meaningful associations across all the different data types in their data lake through open standards alone.
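The correlation problem can be sketched in a few lines: if processed results and telemetry share a run identifier, they can be joined into the complete, contextualized records a model can learn from. The records and field names below are invented for illustration and do not reflect any particular vendor or standard schema; the hard part in practice is that this shared identifier often does not survive export.

```python
# Hypothetical data: processed results and instrument telemetry that
# happen to share a run identifier (all field names are illustrative).
results = [
    {"run_id": "R-001", "analyte": "caffeine", "peak_area": 15321.4},
    {"run_id": "R-002", "analyte": "caffeine", "peak_area": 14988.0},
]
telemetry = {
    "R-001": {"column_pressure_psi": 5210, "column_temp_c": 40.1},
    "R-002": {"column_pressure_psi": 6950, "column_temp_c": 40.0},
}

def contextualize(results, telemetry):
    """Join each processed result with the telemetry captured for the same
    run, producing complete, contextualized records for downstream models."""
    return [{**r, **telemetry.get(r["run_id"], {})} for r in results]

contextualized = contextualize(results, telemetry)
print(contextualized[0])
```

With records like these, a data scientist could, for example, ask whether elevated column pressure correlates with shifting peak areas, a question neither data set can answer on its own.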
Automating data pipelines
Rather than relying on manual data conversion and upload, organizations need automated data pipelines that bring data from laboratory systems into their data lakes. With real-time data pipelines that upload, catalog, transform, and store data in compliance with the FAIR data principles, organizations can utilize different data science methodologies and tools. Dedicated pipelines have the added advantage that telemetry data, scientific raw data, and processed results can all be uploaded and associated, giving a complete picture of the data as it was acquired. Robust data pipelines ensure the data is fully contextualized and avoid what is often referred to as a data swamp: the less-than-ideal state in which massive repositories of data exist but are unusable.11
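The catalog/transform/store stages described above can be sketched as composable functions. This is a toy, in-memory illustration of the pattern under assumed names and an invented field mapping, not a real pipeline framework: cataloging adds provenance metadata (findability), transforming maps instrument-specific fields onto a shared vocabulary (interoperability), and storing writes the contextualized record to the lake.

```python
import hashlib
import json
from datetime import datetime, timezone
from functools import partial

def catalog(payload):
    """Attach catalog metadata (checksum, ingest timestamp) so the record
    is findable and its provenance is traceable."""
    raw = json.dumps(payload["data"], sort_keys=True).encode()
    payload["catalog"] = {
        "checksum": hashlib.sha256(raw).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return payload

def transform(payload):
    """Map instrument-specific field names onto a shared vocabulary so
    records from different sources are interoperable (mapping is invented)."""
    mapping = {"rt_min": "retention_time_minutes", "resp": "detector_response"}
    payload["data"] = {mapping.get(k, k): v for k, v in payload["data"].items()}
    return payload

def store(payload, lake):
    """Write the contextualized record into the (here, in-memory) data lake."""
    lake[payload["source_file"]] = payload
    return payload

# Run one record through the pipeline stages in order.
lake = {}
record = {"source_file": "run_0042.raw",
          "data": {"rt_min": 3.2, "resp": 15321.4}}
for stage in (catalog, transform, partial(store, lake=lake)):
    record = stage(record)
```

A production pipeline would trigger on file arrival, handle versioning and access control, and write to durable storage, but the stage-by-stage shape is the same.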
As Kate Wearden recently described in The Digital Revolution: The Connected Lab of the Future, connectivity is critical to realizing the Lab of the Future. Today, multiple mechanisms exist to export data from Waters data systems to open standards or to map and parse data into custom data pipelines. While these solutions meet the needs of many customers today, they are often disconnected, manual processes. Our future vision is to eliminate the burden organizations face in maintaining these solutions and to provide contextualized, AI-ready data through automated data pipelines.
We are only beginning to scratch the surface of what AI could do to improve laboratory operations and aid scientific discovery. Given the strategic importance of chromatography, there is significant potential to improve laboratory operations and derive insights by incorporating chromatography telemetry and analytical data into an organization’s digital strategy. To realize this potential, organizations need to focus on how to FAIRify their data, making it available to scientists and data scientists alike. Robust, automated data pipelines will bring data science to scientists so that they can use their data to solve problems that matter.