A number of Elliott's observations from his report "Managing Scientific Data" (February 2005) are represented here. I highly
recommend the report to those interested in informatics and regret that I can only highlight it in this column. Aside from
drawing its conclusions from extensive data, the report discusses trends toward developing data standards, regulation and
compliance, and integration and security issues. To his credit, Elliott takes the time to couch his commentary in an eminently
readable, plain-English style, and includes definitions and tutorials.
His report describes the movement from paper to electronic laboratory notebooks (ELN) and clinical electronic data capture
(EDC) systems as contributing significantly to the recent 27% proliferation in data. Usefully, the report defines the often
confused terms "data management," "information management," "knowledge management" and "content management." Data management
is perhaps one of the most nonspecific terms in information technology. It describes everything from data analysis systems
to laboratory information management systems (LIMS). More specifically, data management refers to the processes and systems involved in
acquiring, storing, organizing, searching and retrieving data. It takes data from the "disparate and unorganized" to the "logical
and organized." Information management consists of the processes and systems involved in the use, analysis and exploitation
of data. Knowledge management refers to the process of sharing and distributing information assets throughout an organization.
Finally, content management describes the process of integrating asset management companywide.
 Figure 1: In drug discovery, data is growing faster than it can be turned into information. (Image courtesy of Atrium Research.)
The first question in any data scenario must address what we intend to do with the data we collect. Figure 1 depicts a simple
scenario, an initial step towards understanding the process. Unlike e-mail, for instance, which imparts its message and thereafter
serves little further purpose, the value of on-line data increases over time as the biological, pharmaceutical and physicochemical
measurements continue to amass within a data file. But this increase in value comes at the significant cost of ensuring the
data's accessibility. Given the burgeoning nature of the data files, and the length of time over which they must be accessed,
a solution might include some form of hierarchical storage management. Thus, some smaller percentage of the data are immediately
accessible, or "active," while the remainder, in successive stages, are in-process or earmarked for long-term archiving.
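To make the tiering idea concrete, here is a minimal sketch of how files might be assigned to storage tiers by how recently they were accessed. It is illustrative only: the `DataFile` record, the `storage_tier` function and the 90-day and 365-day thresholds are assumptions made for this example, not anything taken from Elliott's report or from a particular storage product.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical thresholds; a real policy would follow the
# organization's own retention and compliance rules.
ACTIVE_DAYS = 90
IN_PROCESS_DAYS = 365


@dataclass
class DataFile:
    name: str
    last_accessed: date


def storage_tier(data_file: DataFile, today: date) -> str:
    """Return the tier a file belongs to, based on days since last access."""
    age_days = (today - data_file.last_accessed).days
    if age_days <= ACTIVE_DAYS:
        return "active"
    if age_days <= IN_PROCESS_DAYS:
        return "in-process"
    return "archive"


if __name__ == "__main__":
    today = date(2005, 12, 1)
    files = [
        DataFile("assay_recent.dat", date(2005, 11, 15)),
        DataFile("assay_older.dat", date(2005, 3, 1)),
        DataFile("assay_old.dat", date(2003, 6, 1)),
    ]
    for f in files:
        print(f.name, "->", storage_tier(f, today))
```

In practice the deciding criterion need not be access date alone; project status or regulatory retention periods could drive the same kind of rule.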
Often, how the data are made available becomes a hurdle in itself. The complexity of files associated with a single injection
varies widely: Bruker's NMR (Billerica, Massachusetts, USA), Agilent Technologies' ChemStation (Wilmington, Delaware, USA),
and Waters' MassLynx have numerous associated files, one per injection. Yet close to the reverse is true with some other applications,
such as Applied Biosystems' Analyst (Foster City, California, USA), where a single file can include repeated injections, making
attempts to unify the upstream output difficult if not impossible. Therefore, many users carefully evaluate a device based
upon the relative accessibility of its control features and data output. As I reported in the September 2005 column, the accessibility
consideration was of paramount concern to the high-speed synthesis operation designed by Neurogen (Branford, Connecticut,
USA). After considering all competitive applications, Neurogen decided to adopt MassLynx (Waters Corporation, Milford, Massachusetts,
USA) software, because of its inherent accessibility and high degree of interface compatibility with the company's Web-tracking
and data management system.3
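The mismatch between one-file-per-injection and many-injections-per-file layouts can be pictured with a small sketch that maps both onto a common list of injection records. This shows only the indexing problem, not how to read any vendor's format: the `Injection` record, both helper functions and the file names are hypothetical, and real instrument files are proprietary and generally require the vendor's own tools to open.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Injection:
    source_file: str  # file that holds the injection's data
    index: int        # position within that file (0 if one per file)


def from_per_injection_files(paths: List[str]) -> List[Injection]:
    """Layout A: every file holds exactly one injection."""
    return [Injection(path, 0) for path in paths]


def from_multi_injection_file(path: str, n_injections: int) -> List[Injection]:
    """Layout B: a single file holds several injections in sequence."""
    return [Injection(path, i) for i in range(n_injections)]


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    records = from_per_injection_files(["sample_001.dat", "sample_002.dat"])
    records += from_multi_injection_file("batch_A.dat", 3)
    for record in records:
        print(record)
```

Once every injection has the same kind of record, downstream tracking can treat the two layouts alike. The difficult part, which this sketch leaves out, is extracting the injection count and the data themselves from each vendor's files.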