Profiles in Practice Series: The High-Speed State of Information and Data Management

January 1, 2006

By Michael P. Balogh

LCGC Europe

Volume 19

Issue 1

Pages: 29–32

The effects of increased data demand coupled with the torrential data outflow of our instruments can overwhelm even the most IT-savvy.

Informatics and data management was an obvious and intuitive topic to include in the last Conference on Small Molecule Science (www.CoSMoScience.org). Indeed, the favourable response of attendees made it clear that the topic needs to be included, in its various forms, each year.

Todd Neville is a senior technical solutions scientist at IBM Healthcare and Life Sciences (White Plains, New York, USA). He is also a featured contributor to this month's column. Neville participated in one of the conference workshops where, he relates, a colleague's comment provided positive proof of data management's central role in industry. Remarking on changing industry trends, the colleague said that until fairly recently he would weigh a job candidate's training and experience in chemistry as the most important factor, but that now the foremost consideration is the candidate's IT expertise.

In his opening summary of a report, our other featured contributor, Michael Elliott of Atrium Research (Wilton, Connecticut, USA), characterizes a related dilemma in scientific data management. He writes that, "During the period from 1998 to 2002, electronic records generated by laboratory instrumentation and techniques grew at an annual rate of 27% per year. During this same time period, the number of graduates with science degrees increased only by 3% per year."¹
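To put those two rates side by side, the short sketch below compounds each one over the period. It assumes four year-on-year steps between 1998 and 2002 and treats both rates as constant; neither assumption comes from the Atrium report, and the function name is illustrative only.

```python
def growth_factor(annual_rate: float, years: int) -> float:
    """Overall multiplier after compounding annual_rate for the given number of years."""
    return (1 + annual_rate) ** years


YEARS = 4  # assumption: four year-on-year steps from 1998 to 2002

records = growth_factor(0.27, YEARS)    # electronic records, 27% per year
graduates = growth_factor(0.03, YEARS)  # science graduates, 3% per year

print(f"records grew {records:.2f}x, graduates grew {graduates:.2f}x")
print(f"records per graduate grew {records / graduates:.2f}x")
```

Under these assumptions the volume of records grows to roughly 2.6 times its starting level, while the number of graduates rises by about 13%.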

The demands of data management are fast outstripping our ability to meet them. High-resolution, mass-accurate data can issue from a modern mass spectrometer at a prodigious 1 GB/h. Moreover, we like these data. As discussed in an earlier column,² they are generated not just by life science investigators but, increasingly, by industry for high-volume processes such as characterizing the presence of metabolites and their biotransformations. So enormous data files are here to stay. But the effects of increased data demand coupled with the torrential data outflow of our instruments can overwhelm even the most IT-savvy. After 180 days of operation, five mass spectrometers, each producing 24 GB of data per day, will present you with a need to store, retrieve, sort and otherwise make sense of 21.6 terabytes (TB).
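The storage figure above is simple arithmetic, and it can be useful to rerun it for your own laboratory. The sketch below reproduces the calculation, assuming continuous acquisition at 1 GB/h and decimal units (1 TB = 1000 GB); the function and constant names are illustrative.

```python
GB_PER_TB = 1000  # assumption: decimal units, which is what the 21.6 TB figure implies


def total_terabytes(instruments: int, gb_per_day: float, days: int) -> float:
    """Total data volume in TB for a group of instruments running every day."""
    return instruments * gb_per_day * days / GB_PER_TB


gb_per_day = 1 * 24  # 1 GB/h, acquiring around the clock

print(total_terabytes(instruments=5, gb_per_day=gb_per_day, days=180))
```

Changing the number of instruments, the daily output or the retention period shows how quickly the total scales.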

But before we consider the future consequence of data overload, we must first address a more immediate problem. That problem, as Neville tells it, is one of classic labour market economics: worker dislocation. Many of his clients tell him they reflexively resort to CD-based or DVD-based operations when they face high data volumes. In addition to being impractical, such a strategy effectively turns a highly qualified PhD chemist into a librarian, resulting in a lamentable loss of scientific expertise. So, as Neville asserts, the battlefront of the future is what happens after the data are collected: the necessities of integration, communication and security. IBM offers consulting services that evaluate its clients' requirements vis-à-vis software applications and the ability of developers to satisfy those requirements. Thus, IBM can define storage and computational solutions tailored to its clients' needs.

Neville spends much time with his clients discussing their needs before he even attempts to offer solutions:

We usually step in after the client has adopted MassLynx or Xcalibur to manipulate data. IBM can help define the requirements from the perspective of a scientist seeking a research goal and design solutions based on computing, storing, communicating, managing, archiving and regulatory compliance. Typically, a great deal of my time is spent with the researcher. I define the life cycle of the data generated so that I can create a system to manage it (that is, collect, compute, store and archive it) most efficiently. Each of these elements is a separate chapter in the design of a solution. For example, computing entails 64-bit versus 32-bit chip design, parallel versus SMP applications, memory bandwidths, rate of data generation, user skill sets, file systems, and Ethernet media versus InfiniBand versus Fibre Channel connections. Storage and archival involve considering the data life cycle, hierarchical storage management versus current disk prices, logical and physical migration strategies, tape media and various disk media [such as S-ATA, SCSI and so forth].

A number of Elliott's observations from his report "Managing Scientific Data" (February 2005) are represented here. I highly recommend the report to those interested in informatics and regret that I can only highlight it in this column. Aside from drawing its conclusions from extensive data, the report discusses trends toward developing data standards, regulation and compliance, and integration and security issues. To his credit, Elliott takes the time to couch his commentary in an eminently readable, plain-English style, and includes definitions and tutorials.

His report describes the movement from paper to electronic laboratory notebooks (ELN) and clinical electronic data capture (EDC) systems as contributing significantly to the recent 27% annual growth in data. Usefully, the report defines the often confused terms "data management," "information management," "knowledge management" and "content management." Data management is perhaps one of the most nonspecific terms in information technology. It describes everything from data analysis systems to laboratory information management systems (LIMS). In the report's usage, data management refers to the process and systems involved in acquiring, storing, organizing, searching and retrieving data. It takes data from the "disparate and unorganized" to the "logical and organized." Information management consists of the processes and systems involved in the use, analysis and exploitation of data. Knowledge management refers to the process of sharing and distributing information assets throughout an organization. Finally, content management describes the process of integrating asset management companywide.

The first question in any data scenario must address what we intend to do with the data we collect. Figure 1 depicts a simple scenario, an initial step towards understanding the process. Unlike e-mail, for instance, which imparts its message and thereafter serves little further purpose, the value of on-line data increases over time as the biological, pharmaceutical and physicochemical measurements continue to amass within a data file. But this increase in value comes at the significant cost of ensuring the data's accessibility. Given the burgeoning nature of the data files, and the length of time over which they must be accessed, a solution might include some form of hierarchical storage management. Thus, some smaller percentage of the data are immediately accessible, or "active," while the remainder, in successive stages, are in-process or earmarked for long-term archiving.
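The tiering idea described above can be sketched in a few lines. The example below assigns a data file to an "active," "in-process" or "archive" tier according to how recently it was accessed. The 90-day and 365-day thresholds, the class and the function names are all hypothetical; a real hierarchical storage policy would be set by the data's life cycle and any regulatory retention rules.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical thresholds, chosen only for illustration.
ACTIVE_DAYS = 90
IN_PROCESS_DAYS = 365


@dataclass
class DataFile:
    name: str
    last_accessed: date


def storage_tier(data_file: DataFile, today: date) -> str:
    """Pick a storage tier from the number of days since the file was last accessed."""
    age_days = (today - data_file.last_accessed).days
    if age_days <= ACTIVE_DAYS:
        return "active"       # fast disk, immediately accessible
    if age_days <= IN_PROCESS_DAYS:
        return "in-process"   # slower, cheaper storage
    return "archive"          # long-term media such as tape


today = date(2006, 1, 1)
files = [
    DataFile("run_0412", date(2005, 12, 1)),
    DataFile("run_0230", date(2005, 6, 1)),
    DataFile("run_0007", date(2004, 1, 1)),
]
for data_file in files:
    print(data_file.name, storage_tier(data_file, today))
```

Age since last access is only one possible criterion; project status or data classification could drive the same decision.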

Figure 1: In drug discovery, data is growing faster than it can be turned into information. (Image courtesy of Atrium Research.)

Often, how the data are made available becomes a hurdle in itself. The complexity of files associated with a single injection varies widely: Bruker's NMR (Billerica, Massachusetts, USA), Agilent Technologies' ChemStation (Wilmington, Delaware, USA), and Waters' MassLynx produce numerous associated files, a separate set for each injection. Yet close to the reverse is true with some other applications, such as Applied Biosystems' Analyst (Foster City, California, USA), where a single file can include repeated injections, making attempts to unify the upstream output difficult if not impossible. Therefore, many users carefully evaluate a device based upon the relative accessibility of its control features and data output. As I reported in the September 2005 column, the accessibility consideration was of paramount concern to the high-speed synthesis operation designed by Neurogen (Branford, Connecticut, USA). After considering all competitive applications, Neurogen decided to adopt MassLynx (Waters Corporation, Milford, Massachusetts, USA) software, because of its inherent accessibility and high degree of interface compatibility with the company's Web-tracking and data management system.³
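The difference between the two file layouts is easier to see in code. The sketch below maps both onto one common per-injection record, which is the kind of unification a downstream data system needs. Everything in it is hypothetical: the record fields, the function names and the file names do not correspond to any vendor's actual format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class InjectionRecord:
    sample_id: str
    injection: int     # 1-based count of injections for this sample
    source_file: str


def from_file_per_injection(files_by_sample: Dict[str, List[str]]) -> List[InjectionRecord]:
    """Layout A (hypothetical): every injection is written to its own file."""
    return [
        InjectionRecord(sample_id, index, path)
        for sample_id, paths in files_by_sample.items()
        for index, path in enumerate(paths, start=1)
    ]


def from_multi_injection_file(path: str, sample_ids: List[str]) -> List[InjectionRecord]:
    """Layout B (hypothetical): one file holds repeated injections, listed in run order."""
    counts: Dict[str, int] = {}
    records = []
    for sample_id in sample_ids:
        counts[sample_id] = counts.get(sample_id, 0) + 1
        records.append(InjectionRecord(sample_id, counts[sample_id], path))
    return records


unified = (
    from_file_per_injection({"S1": ["S1_001.raw", "S1_002.raw"]})
    + from_multi_injection_file("batch.dat", ["S2", "S2", "S3"])
)
for record in unified:
    print(record)
```

In practice the hard part is reading each vendor's files in the first place; the sketch assumes that step has already produced a sample identifier for every injection.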

When a device is integral to the operation, but only marginally compatible with the data-handling platform, a bit of surgery is indicated. Sierra Analytics (Modesto, California, USA) performs such surgery. David Stranz, Sierra's co-founder and president, has served as the bioinformatics interest group organizer for the American Society for Mass Spectrometry (ASMS, Santa Fe, New Mexico) for the past three years. He recently joined the CoSMoS advisory board. From my discussions with him at the conference last August, it's clear that a number of factors must be examined in any fruitful discussion of data management and informatics in general.

Table 1: This month's featured scientists.

The lack of an industry standard for data exchange impels Sierra's hybridization and customization service. Standards can be "de jure," in which case oversight and change are decreed by a standing professional organization such as the American Society for Testing and Materials (ASTM, West Conshohocken, Pennsylvania, USA). Or they can be "de facto," in which case they have been adopted almost universally (the case with Microsoft Windows). Finally, standards can be "mandated" by regulatory decree. Though the reasons behind standardization can vary, in every instance standardization implies cooperation between and among competitor companies. Unfortunately, in our industry, such cooperation has proved an elusive goal, and it's unlikely we'll see groundbreaking standardization like that which spawned the Musical Instrument Digital Interface (MIDI) in the early 1980s. Even so, recent attempts to unify the variety of data outputs and make them amenable to common analysis have prompted a body of scientists to develop an open, generic format based on Extensible Markup Language (XML) specifically for the various MS outputs. Called mzXML, this effort is intended exclusively for proteomic work.⁴ Security, however, continues to be a leading concern when using XML-based platforms, and the World Wide Web Consortium (W3C) has undertaken some initiatives in encryption and digital signatures.¹
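The appeal of an open XML format is that any general-purpose tool can read it. As an illustration, the sketch below parses a small, simplified fragment written in the style of mzXML, in which each scan's peak list is stored as base64-encoded, network-byte-order 32-bit floats in m/z and intensity pairs. The fragment omits the namespace and most of the metadata of a real file, and its element and attribute names are given from memory, so check them against the published schema before relying on them. The helper names are hypothetical.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def encode_peaks(pairs):
    """Pack (m/z, intensity) pairs as big-endian 32-bit floats, then base64-encode them."""
    flat = [value for pair in pairs for value in pair]
    return base64.b64encode(struct.pack(f">{len(flat)}f", *flat)).decode("ascii")


def decode_peaks(text):
    """Reverse encode_peaks: base64 text back to a list of (m/z, intensity) tuples."""
    raw = base64.b64decode(text)
    values = struct.unpack(f">{len(raw) // 4}f", raw)
    return list(zip(values[0::2], values[1::2]))


# Simplified, namespace-free fragment built only for this example.
SAMPLE = f"""<msRun>
  <scan num="1" msLevel="1" peaksCount="2">
    <peaks precision="32" byteOrder="network">{encode_peaks([(100.0, 250.0), (200.5, 1000.0)])}</peaks>
  </scan>
</msRun>"""


def read_scans(xml_text):
    """Return one dictionary per scan with its number, MS level and decoded peaks."""
    root = ET.fromstring(xml_text)
    scans = []
    for scan in root.iter("scan"):
        scans.append({
            "num": int(scan.get("num")),
            "ms_level": int(scan.get("msLevel")),
            "peaks": decode_peaks(scan.find("peaks").text.strip()),
        })
    return scans


for scan in read_scans(SAMPLE):
    print(scan["num"], scan["ms_level"], scan["peaks"])
```

Only the Python standard library is used here, which is the point: once the data are in an open format, no vendor software is needed to get at them.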

Regulatory and compliance issues are important, even in areas traditionally outside of regulatory control. As this column reported in February,⁵ recent years have seen a spirited initiative by a composite industry group to encourage the FDA to embrace risk-based practice for validation rather than the layered, prescribed regulation currently in place.⁶ The Atrium Research management report includes an extensive review of regulatory requirements, as does a comprehensive IBM Redbook publication, "Installation Qualification of IBM Systems and Storage for FDA Regulated Companies" (www.ibm.com/redbooks),⁷ which provides various forms and a discussion of requirements from an industry perspective. Because computer validation and data storage apply to all parts of the regulated industry, a current reference work by Robert McDowall should also be of interest,⁸ especially to those employing Empower and Millennium software platforms.

Figure 2: Data life cycle management by data classification. (Image courtesy of Atrium Research.)

In recent years, providers of scientific instrumentation and services have focused on informatics, both in proteomics and small-molecule practice. Mergers and acquisitions, which combine the strengths of the companies they involve, are paving the way for major changes. Some of the pioneers have departed the scene or metamorphosed into different entities offering different goods and services, an effect similar to that displayed when many MS manufacturers merged in the 1990s.⁹ The Elliott report compiles a market space overview of the current companies based upon their ability to operate in the scientific data management arena. Predictably, major names such as Waters rank prominently when product performance is viewed in the context of its ability to execute in the scientific market space. But what might surprise you is the positioning of IBM alongside such companies. Yet the explanation is straightforward. Like the major companies in our industry, IBM has answered the challenge of efficiently managing vast and ever increasing amounts of data. To this end, aside from addressing client needs through its consulting service, it has developed DiscoveryLink, a powerful integrated data application.

Thermo Electron Corporation (Waltham, Massachusetts, USA) is another example of a well-established manufacturer made stronger through its mergers and acquisitions. A long-time maker of analytical instruments, Thermo began producing mass spectrometers when it acquired Finnigan. The company further evolved when it acquired Innaphase, which had established relationships with some of Thermo's competitors already. This created what might be described in marketing parlance as a homogeneous landscape. Thus, Thermo has maintained a visible if not a leadership position despite earlier attempts at comprehensive informatics with its now defunct eRecordManager (eRM) data management product. It has done so through relationships and purchases such as Galactic (spectroscopy software development) and the development of Sequest, a leading life science library search engine.

Industry leaders have assembled fully encompassing capabilities. For example, after years of developing its Millennium and Empower applications, Waters adopted the Micromass-developed MassLynx data system. Waters then further enhanced its market position by acquiring two informatics providers: Creon and NuGenesis. In the next position, Agilent entered into a 2004 partnership with Scientific Software Incorporated (SSI). Agilent relabelled the SSI CyberLAB products under the name Cerity ECM (CECM) and so divided the market landscape between itself and SSI. SSI sells the product itself, as SSI ECMS, for general, nonscientific content management. Finally, EMC enjoys a commanding position in enterprise-level documentation management, primarily in life sciences, having acquired Documentum in 2003.

The world leader in data management services, IBM occupies a unique market position. Its DiscoveryLink software can integrate with numerous data management systems and can develop middleware "wrappers" to suit individual needs. Unfortunately, the impressive power of DiscoveryLink has gone unnoticed by some, an effect of competition with Oracle.

The consultancy concept addresses the needs of small-molecule scientists and pharmaceutical manufacturers, in addition to those pursuing life science endeavours. To satisfy data demands that increase exponentially, while acknowledging the disparity between current platform capabilities and data management needs, major corporations have invested in complementary technologies. The next few years promise to be one of the more interesting periods in recent analytical science. The pent-up demand for improved informatics will continue to ignite interest and creativity in our industry. The far-reaching changes it brings will be rivalled only, perhaps, by the early 1990s commercial development of atmospheric ionization and LC–MS itself.

"MS in Practice" editor Michael P. Balogh is principal scientist, LC–MS technology development at Waters Corp. (Milford, Massachusetts, USA); an adjunct professor and visiting scientist at Roger Williams University (Bristol, Rhode Island, USA); and a member of LCGC Europe's Editorial Advisory Board. Direct correspondence about this column to "MS in Practice", LCGC Europe, Advanstar House, Park West, Sealand Road, Chester CH1 4RN, UK or e-mail: dhills@advanstar.com