Artificial Intelligence for the Chromatography Community

Lately there’s been an awful lot of talk about ChatGPT – the Large Language Model (LLM) generative artificial intelligence engine from OpenAI. Trained on 45 TB (terabytes) of text and the first LLM to surpass 100 billion parameters (variables learned during training), ChatGPT-3 (2020) is a powerful, high-profile tool that is changing the way we interact with digital technology for learning and information retrieval. Or so we are told.

For the purposes of this exercise, I accessed ChatGPT-4 (March 2023) in the period between 25th May 2023 and 3rd June 2023. The training dataset size and the number of parameters generated are not publicly available for this version.

But what have LLMs ever done for chromatographers? Moreover, can a digital refugee like me use the tool effectively to advance my work in separation science?

I’m really a newcomer to this technology, having recently made my first foray under the guidance of my son, who “simply couldn’t function without it” as a Physics undergraduate. Apparently, it’s an amazing assistant for writing new computer code! Anyway, after a few lessons in the basics, and having realised that interactions with ChatGPT need to be conversations rather than “one and done” questions, I started my investigation. Unlike other articles on the subject, which show the questions, the responses from ChatGPT and a commentary on the suitability of the answers, I’ve focused my research on what the LLM can deliver to the chromatography community and what needs to be done to get there. Of course, if you want to see the answers to each question, you can simply enter the questions/conversations as I’ve documented them. A further qualification is that I deliberately didn’t make this a “narrow and deep”, quantitative-type search. Rather, I wanted to explore various types of interaction that may yield useful information for chromatographers, as should become obvious as we progress.

My first searches related to troubleshooting issues and the questioning was as follows:

  • What causes peak tailing of late eluting compounds in gas chromatography?
  • Specifically, what causes peak tailing for only late eluting compounds?
  • What if the compound is a polyaromatic hydrocarbon?
  • Is there any manufacturers advice about dealing with peak tailing when analysing acidic compounds?
  • What are the recommended analysis conditions for 2-ethyl hexanoic acid?

The responses were very general, with nothing specific that I could put into action if I had this issue in the laboratory. I had to get more specific with each question to elicit something practical. There was nothing here that I couldn’t have obtained from similar Google searches.

  • How often should I change the liner in GC system?
  • Any specific advice on this subject for food analysis?
  • What liner would you recommend for pesticide screening of vegetable extracts?
  • Can you give me a specific part number?

Whilst these questions were answered with some good general information, once again there was nothing actionable. The search for a specific liner part number yielded only the names of major GC manufacturers and a suggestion to contact them.

  • Can you guide me on how to check my GC inlet for leaks?
  • Will there be gas flowing even if the GC is turned off?
  • I was asking the question with reference to your answer of leak checking.
  • How can the leak detection fluid detect a gas leak if there is no gas flowing?

Here, it was obvious that the LLM was agglomerating information from various elements of its dataset (which is current to September 2021) and “building the response” on the fly. Essentially, it was asking me to use leak detection fluid around the inlet with the GC (and therefore the gas flows) turned off. The final suggestion would not have been usable in terms of a practical process for leak checking a GC inlet; in fact, I would worry about the safety of some of the recommendations.

  • What extractables might I see when a cellulose syringe filter is used to filter 0.1M hydrochloric acid?
  • What extractables might I see when a cellulose syringe filter is used to filter hexane?

I have to say, whilst the answers to these two questions were still general, there were helpful suggestions for the types of extractables that may be seen when each of the different solvents is used. I appreciate that this is a niche question; however, it’s the type of information that is perhaps less accessible from a general web search.

In terms of providing anything useful for troubleshooting problems in the lab, I’m not seeing any benefit to the use of AI. ChatGPT-4 tells me that it doesn’t have access to the specific tables of information or literature that might help it give more specific responses. Whilst the responses may give the less experienced chromatographer some “directions of further investigation”, they tend to be wide-reaching and sometimes contradictory, so the “trust” that a user needs to have in the technology is not being established in these early exchanges. As a colleague recently described it to me, you tend to put more weight behind someone’s response when you know that what they have told you in the past has been correct or useful. I’m not feeling this with ChatGPT at this point. Furthermore, the LLM seems very quick to “fold” when challenged. Even when an answer is more or less accurate, a challenge brings an apology and an alternative suggestion, which is often less accurate than the previous one!

Let’s see what happens when I search for information which may be of help when planning a separation or analysing data.

  • What is the SMILES string for loratadine?
  • Are you sure this is correct - ChemSpider tells me something different?
  • How can that be correct when the molecular formula is C22H23N2O2Cl?
  • I still believe this is incorrect - can you recalculate, please?

What I got back was pretty much nonsense, and it didn’t get any better with some “coaching”. Let’s try something more straightforward.

  • What is the SMILES string for octadecanoic acid?

Bingo – correct the first time. Further research is clearly needed on the limits of the model’s ability to produce useful SMILES information for more complex molecules.

  • What is this compound? CN(C)C1=CC=C(C=C1)O

For the benefit of readers, this is the SMILES string for 4-Dimethylaminophenol.

  • Are you sure this is correct?
  • Are you sure that the SMILES string corresponds to norepinephrine?

Norepinephrine is (R)-4-(2-amino-1-hydroxyethyl)benzene-1,2-diol (noradrenaline), and really nothing like the compound I was interested in. Once again, I’m losing trust in the model’s ability.
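One way to test a reply like this without relying on the model is to compare molecular formulas: norepinephrine is C8H11NO3, so a string containing a single oxygen cannot be right. The following Python sketch derives a formula from a SMILES string. It is only an illustration under narrow assumptions: the function names are hypothetical, and the parser handles just the simple subset used in this article (uppercase organic-subset atoms, explicit double bonds, branches and single-digit ring closures; no brackets, charges or lowercase aromatic atoms). For real work a cheminformatics toolkit such as RDKit or Open Babel would be the safer choice.

```python
from collections import Counter

# Default valences for the handful of elements this sketch supports.
VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "P": 3,
           "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def formula_from_smiles(smiles):
    """Count atoms, including implicit hydrogens, for a simple SMILES string."""
    atoms, bonded = [], []   # element symbol and summed bond order per atom
    branch_stack = []        # atoms to return to when a branch closes
    open_rings = {}          # ring digit -> (atom index, bond order)
    prev, order, i = None, 1, 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:                 # ring closure: add the bond
                other, ring_order = open_rings.pop(ch)
                closing = max(order, ring_order)
                bonded[prev] += closing
                bonded[other] += closing
            else:                                # ring opening: remember it
                open_rings[ch] = (prev, order)
            order = 1
        elif ch.isupper():
            symbol = smiles[i:i + 2] if smiles[i:i + 2] in VALENCE else ch
            if symbol not in VALENCE:
                raise ValueError(f"unsupported atom: {symbol}")
            atoms.append(symbol)
            bonded.append(0)
            if prev is not None:                 # bond to the previous atom
                bonded[prev] += order
                bonded[-1] += order
            prev, order = len(atoms) - 1, 1
            i += len(symbol) - 1
        else:
            raise ValueError(f"unsupported character: {ch}")
        i += 1
    counts = Counter(atoms)
    # Implicit hydrogens fill whatever valence the explicit bonds leave unused.
    counts["H"] = sum(VALENCE[a] - b for a, b in zip(atoms, bonded))
    return counts


def hill_formula(counts):
    """Format counts as C, then H, then the remaining elements alphabetically."""
    order = [e for e in ("C", "H") if counts[e]]
    order += sorted(e for e in counts if e not in ("C", "H") and counts[e])
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


print(hill_formula(formula_from_smiles("CN(C)C1=CC=C(C=C1)O")))  # C8H11NO
print(hill_formula(formula_from_smiles("C" * 18 + "(=O)O")))     # C18H36O2
```

The first result, C8H11NO, is the formula of 4-dimethylaminophenol and is two oxygens short of norepinephrine; the second is octadecanoic acid.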

  • What compound has the CAS number 375-22-4?
  • Are you sure that is correct?
  • I thought that CAS number was for heptafluorobutyric acid.

Turns out the model was not that great at looking up CAS numbers either. Eventually, it agreed with me after suggesting dipropylene glycol and pyrroloquinoline quinone disodium salt in the previous two responses. I then wondered, would it fare any better with physicochemical data?

  • What is the pKa of dodecanoic acid?
    • Answer: 4.88

This is close to the literature value that I found (Barratt MD; Toxicol In Vitro 10: 85-94 (1996)).

  • What is the LogP value?
    • Answer: 4.92

This is close to the ACD Labs predicted value that I found via ChemSpider (5.03).

  • What is the LogD value at pH 7.4?
    • Answer: 4.88

For comparison, the predicted values I found were 2.09 (ACD Labs, via ChemSpider) and 2.06 (Chemaxon).

  • Are you sure about the last answer?

The response indicates that it is not possible to calculate the correct value.

  • Can you not calculate the LogD value at pH 7.4 given that the pKa value is 4.88?
    • Answer: 4.92
  • So, what would the LogD value be at pH 5.5?
    • Answer: 4.92

For comparison, the predicted values I found were 3.89 (ACD Labs, via ChemSpider) and 3.82 (Chemaxon).

Here, I decided to include the full response from ChatGPT-4:

“To estimate the LogD value at pH 5.5 for dodecanoic acid (lauric acid), we need to consider its acid dissociation constant (pKa) and the ionization behavior at that pH.

The pKa of dodecanoic acid is approximately 4.88. At pH 5.5, which is lower than the pKa, the majority of dodecanoic acid will be in its protonated form (RCOOH) rather than the deprotonated form (RCOO-).

Since LogD represents the distribution coefficient between octanol and water, we need to consider the relative solubilities of the protonated and deprotonated forms in those phases.

As dodecanoic acid is more soluble in the aqueous phase when protonated, the LogD value at pH 5.5 is expected to be lower than the LogP value. While I don't have access to specific data or software to provide an exact LogD value at pH 5.5, it is likely to be lower than the LogP value of 4.92, reflecting the increased solubility of the protonated form in the aqueous phase at lower pH.”

As you can see, the logic is very flawed (pH 5.5 is above, not below, a pKa of 4.88, and it is the deprotonated form that favours the aqueous phase), and at this point I decided to turn my attention to other search types. I appreciate that all the above information could have been found from a Google search, but I wanted to see whether there was merit in exploring how deep we could go with conversations regarding optimisation of eluent pH. We never got out of the shallow end.
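The calculation I was asking for is not difficult. For a monoprotic acid, and assuming that only the neutral form partitions into octanol, the usual textbook relationship is LogD = LogP - log10(1 + 10^(pH - pKa)). The Python sketch below applies it to the model’s own LogP and pKa values. The function names are hypothetical, and the simplification (no partitioning of the ionised form, no ion pairing) is an assumption of this sketch rather than anything stated by ChatGPT or the prediction packages.

```python
import math


def log_d(log_p, pka, ph):
    """LogD of a monoprotic acid, assuming only the neutral form enters octanol."""
    return log_p - math.log10(1 + 10 ** (ph - pka))


def fraction_ionised(pka, ph):
    """Fraction of a monoprotic acid present in the deprotonated form."""
    return 1 / (1 + 10 ** (pka - ph))


# LogP and pKa exactly as ChatGPT-4 supplied them for dodecanoic acid.
LOG_P, PKA = 4.92, 4.88

for ph in (5.5, 7.4):
    print(f"pH {ph}: LogD = {log_d(LOG_P, PKA, ph):.2f}, "
          f"ionised = {fraction_ionised(PKA, ph):.1%}")
```

With these inputs the estimate is about 4.2 at pH 5.5, where roughly 81% of the acid is ionised, and about 2.4 at pH 7.4. The figures will not match the ChemSpider and Chemaxon predictions exactly, since those rest on their own LogP and pKa estimates, but they fall with increasing pH in the same way.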

Let’s try something more specific on method development for HPLC.

  • Can you suggest some starting conditions to separate dodecanoic acid from its closely eluting homologues in reversed phase HPLC?
  • This answer is very generic, can you suggest something with dodecanoic acid and tetradecanoic acid specifically?

The responses were very general and spectacularly misleading. For example, the following recommendation:

“Adjust the composition of the mobile phase to optimize the separation. For example, you can start with a gradient program ranging from 70% to 90% water and 30% to 10% acetonitrile over a suitable time period (e.g., 20-30 minutes).”

  • Is there any literature you can cite with these compounds being separated by HPLC?

Here, I was pointed in the direction of PubChem and Google Scholar. Nothing trustworthy or useful resulted from this conversation.

Okay, I’m giving this thing one last try. How about a buffer calculation?

  • Calculate the pH of a solution prepared from adding 0.0175 mol of citric acid and 0.0025 mol of dihydrogen citrate to 1.000 L of water.
    • Answer: 1.756
  • Can you repeat the calculation using the Henderson−Hasselbalch equation?

Following some reasonably sensible logic, the final part of the response was:

“Substituting the given values:

pH = 3.128 + log((0.0025 + [H+])/0.0175)

To solve this equation, we need the concentration of H+ ions. However, without additional information about the solution or the pH of the initial solution, it is not possible to provide an exact pH value using the Henderson-Hasselbalch equation alone.”

  • If I said the correct answer is pH 2.63, can you figure out where you went wrong?
    • Answer: 3.268

My head hurts.
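For the record, the 2.63 figure can be reproduced in a few lines. The Python sketch below treats only the first dissociation of citric acid, using the pKa of 3.128 that appears in the quoted response, and solves Ka = x(0.0025 + x)/(0.0175 - x) for x = [H+] rather than dropping x as the simple Henderson-Hasselbalch form does. Ignoring the second and third dissociations, activity effects and the autoionisation of water is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def buffer_ph_exact(pka, acid_molar, base_molar):
    """pH of a weak acid / conjugate base mixture from the full equilibrium.

    Solves Ka = x (base + x) / (acid - x) for x = [H+], i.e. the quadratic
    x^2 + (base + Ka) x - Ka * acid = 0, and takes the positive root.
    """
    ka = 10 ** -pka
    b = base_molar + ka
    c = -ka * acid_molar
    x = (-b + math.sqrt(b * b - 4 * c)) / 2
    return -math.log10(x)


def buffer_ph_simple(pka, acid_molar, base_molar):
    """Henderson-Hasselbalch estimate using the nominal concentrations only."""
    return pka + math.log10(base_molar / acid_molar)


PKA1 = 3.128                   # first pKa of citric acid, as in the response
ACID, BASE = 0.0175, 0.0025    # mol in 1.000 L, so also mol/L

print(f"exact:  pH {buffer_ph_exact(PKA1, ACID, BASE):.2f}")
print(f"simple: pH {buffer_ph_simple(PKA1, ACID, BASE):.2f}")
```

The exact treatment gives pH 2.63, while the simple form gives about 2.28, because in a solution this dilute the dissociation of the acid is too large to neglect.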

I have many more types of searches to try, but in terms of my confidence in my correspondent’s responses, I would have more trust in the responses of P.T. Barnum or Walter Mitty.

I must admit that I’m very much a newbie to so-called “AI”. It’s possible that I just haven’t stumbled upon what LLMs are good for, in terms of helping the aged chromatographer. If you’ve had better experiences with “AI” for analytical sciences, please write and let me know. Perhaps I’m not focussing in on the areas where the model can shine. I dearly wanted this exercise to be a rabbit hole down which I disappeared for a conversation with a learned digital colleague. Either I got hung up on the test questions which establish early trust, or I’m just a digital refugee looking for a digital native to hold my hand and introduce me to the wonders of the new age. One thing is for sure, I am going to tell my son to check his code very carefully – who knows what might happen when he executes his programs!
