The advent of Fourier transform infrared (FT-IR) spectroscopy made digital spectra available and
opened the possibility of using computers to compare a single spectrum against a reference database containing thousands of
spectra, thereby allowing enormous efficiency gains in the comparison of unknown spectra to reference materials. Various algorithms
can be used to create a hit quality index (HQI), which is a measure of how well the query spectrum compares against each reference
spectrum. However, HQI does not tell the whole story and specifically does not tell us much about the quality of the match
between query and reference spectra. In a ranked list of database hits, the difference, or gap, in HQI between two successive
hits appears to be a good indicator of the quality of a match. The presence or absence of a significant gap between the first
two or more hits has implications for match quality. While intuitively we might consider the highest ranked hit to be the
"best" match, several similar HQI scores followed by a significant gap can mean a cluster of hits that are similar but not
exact matches. This article looks at the possibility of using an assessment of the gap between hits to determine the quality
of a match, what represents a significant gap, and when this assessment can fail.
Spectral searching is a tool commonly used to identify unknown materials and is occasionally used to help classify or interpret
unknown materials. Algorithms are used to make a comparison between the unknown spectrum and each spectrum in the reference
database. The algorithms return a number called the hit quality index (HQI) and the results are ranked by their HQI. Different
software packages use different numbering systems for their HQI, even when the same algorithm is used. For example, the Euclidean
distance search generates an HQI of 0.0 for the best possible match and the square root of 2 for the worst possible match.
Many companies rescale these numbers so that 100 represents the best possible match and 0 the worst. For this
article, we will use 100–0 as our scale, with 100 being the best possible match. How the algorithms work is not part of this
article, but the nature of the algorithms means that each reference spectrum matches the unknown spectrum to some degree.
There is no chemical intelligence built into the algorithms; they simply generate an HQI for each comparison and rank the
results by HQI. This means there is always a best match, regardless of whether the unknown material is represented in the
database or not.
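As a rough illustration of the scale described above, the following Python sketch computes a Euclidean distance between two spectra and rescales it to 100–0. It assumes that each spectrum is a list of non-negative intensities on a common wavenumber axis, that both are normalized to unit vector length before comparison, and that the rescaling is linear. The function names and the linear rescaling are illustrative assumptions, not the formula of any particular software package.

```python
import math


def unit_normalize(spectrum):
    """Scale a spectrum so that its vector length is 1."""
    norm = math.sqrt(sum(y * y for y in spectrum))
    if norm == 0:
        raise ValueError("spectrum has no intensity")
    return [y / norm for y in spectrum]


def euclidean_distance(query, reference):
    """Distance between unit-normalized spectra: 0.0 (best) to sqrt(2) (worst)
    when all intensities are non-negative."""
    if len(query) != len(reference):
        raise ValueError("spectra must share the same axis")
    q = unit_normalize(query)
    r = unit_normalize(reference)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))


def rescaled_hqi(query, reference):
    """Assumed linear rescaling: distance 0 -> 100, distance sqrt(2) -> 0."""
    return 100.0 * (1.0 - euclidean_distance(query, reference) / math.sqrt(2.0))


def rank_hits(query, library):
    """Score every reference and sort by HQI, best first.
    `library` is a hypothetical dict of name -> spectrum."""
    hits = [(name, rescaled_hqi(query, ref)) for name, ref in library.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    unknown = [0.1, 0.8, 0.3, 0.05]
    library = {
        "ref_a": [0.1, 0.8, 0.3, 0.05],
        "ref_b": [0.9, 0.1, 0.0, 0.2],
    }
    for name, hqi in rank_hits(unknown, library):
        print(f"{name}: {hqi:.1f}")
```

Note that `rank_hits` returns a first-ranked entry for any non-empty library, whether or not the unknown is actually represented in it, which is the behavior the paragraph above describes.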
Because we always have a number one match, the presence of a number one match is of no value in determining the quality of
a match. To determine the quality of a match, we have relied on either the actual value of the HQI or on a visual comparison
of the sample and reference spectra. There are potential problems with both methods. The actual value of the HQI can be misleading
because there are a number of factors (1) that can significantly reduce the value of the HQI. Common factors include baseline
problems, purge problems, and the presence of other components in the sample. In addition, a high HQI value is not necessarily
an indication of an exact match because there is always the possibility of several compounds in the database that are similar
to the unknown, but not an exact match. Those who visually compare the sample and reference spectra can also be misled, particularly
if they see that the first hit matches their unknown well and fail to look at any additional hits. The first hit may
indeed be a good match, but if several similar compounds in the database all match well, the result more likely
means that the unknown has been classified rather than identified exactly.
To evaluate any search results, we should always compare several spectra from the top matches to our unknown spectrum.
This article will look at another measure to evaluate the search results, specifically the gap or difference between successive
HQIs.
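The gap measure can be sketched in a few lines of Python. The example below assumes HQIs on the 100–0 scale and takes the gap as the simple difference between successive ranked values. The `threshold` argument is a placeholder chosen by the caller, not a value established in this article, and the function names are hypothetical.

```python
def successive_gaps(hqis):
    """Return the HQI difference between each hit and the next-ranked hit."""
    ranked = sorted(hqis, reverse=True)  # best hit first
    return [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]


def hits_above_first_gap(hqis, threshold):
    """Count the hits ranked above the first gap of at least `threshold`.

    Returns 1 when the top hit stands alone, a larger number when a cluster
    of similar hits precedes the gap, and None when no gap is large enough.
    """
    for index, gap in enumerate(successive_gaps(hqis)):
        if gap >= threshold:
            return index + 1
    return None


if __name__ == "__main__":
    lone_hit = [98.0, 85.0, 84.5, 84.0]   # one hit, then a gap
    cluster = [95.0, 94.5, 94.0, 80.0]    # three similar hits, then a gap
    print(successive_gaps(lone_hit), hits_above_first_gap(lone_hit, 5.0))
    print(successive_gaps(cluster), hits_above_first_gap(cluster, 5.0))
```

Returning the number of hits above the gap, rather than only a yes or no answer, keeps the distinction between a single standout hit and a cluster of similar hits visible.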
Experimental
Databases used include ST Japan's Aldrich/ICHEM complete ATR FT-IR library (36,639 compounds) and EPA-NIST Vapor Phase library
(5228 compounds).
Sample spectra were acquired from various sources. The search software used was ACD/Labs Spectrus Optical Workbook and UVIR
Manager.
Algorithms used were the Euclidean distance and first derivative Euclidean distance.
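For readers unfamiliar with the derivative variant, here is a minimal Python sketch. It assumes the first derivative is approximated by point-to-point differences on an evenly spaced axis and that both derivative spectra are normalized to unit vector length before the distance is taken. Commercial packages may compute the derivative and the normalization differently, so the function names and details here are illustrative only.

```python
import math


def unit_normalize(values):
    """Scale a vector so that its length is 1."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0:
        raise ValueError("vector has zero length")
    return [v / norm for v in values]


def euclidean_distance(query, reference):
    """Euclidean distance between unit-normalized vectors."""
    if len(query) != len(reference):
        raise ValueError("spectra must share the same axis")
    q = unit_normalize(query)
    r = unit_normalize(reference)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))


def first_difference(spectrum):
    """Approximate the first derivative by successive differences."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def derivative_euclidean_distance(query, reference):
    """Euclidean distance computed on first-difference spectra."""
    return euclidean_distance(first_difference(query), first_difference(reference))


if __name__ == "__main__":
    spectrum = [0.1, 0.5, 0.9, 0.4, 0.1]
    offset = [y + 0.3 for y in spectrum]  # same spectrum with a raised baseline
    print(euclidean_distance(spectrum, offset))
    print(derivative_euclidean_distance(spectrum, offset))
```

In this toy case, a constant baseline offset changes the plain Euclidean distance but leaves the derivative distance essentially at zero, because the offset cancels in the differences.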
Each IR spectrum was searched by Euclidean distance and by the first derivative Euclidean distance. A total of 18 spectra
were searched, giving 36 search results. Of the 18 spectra, eight were pure compounds run in a laboratory, six
were vapor-phase spectra, and four were mixtures. The mixtures were created by digitally adding two spectra together.
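Digital addition of two spectra can be sketched as a point-by-point sum. The example assumes both spectra are absorbance values sampled on the same wavenumber axis. The optional weights are a hypothetical extension for unequal proportions; with the default weights of 1.0 the function performs the plain addition described above.

```python
def add_spectra(spectrum_a, spectrum_b, weight_a=1.0, weight_b=1.0):
    """Return the point-by-point weighted sum of two spectra.

    Both spectra must be sampled at the same points; the weights are an
    illustrative option and default to a plain sum.
    """
    if len(spectrum_a) != len(spectrum_b):
        raise ValueError("spectra must share the same axis")
    return [weight_a * a + weight_b * b for a, b in zip(spectrum_a, spectrum_b)]


if __name__ == "__main__":
    component_1 = [1.0, 2.0, 0.5]
    component_2 = [0.5, 0.5, 1.5]
    print(add_spectra(component_1, component_2))
```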