Proteoforms: A New Separation Dilemma

August 1, 2017
Fred E. Regnier, JinHee Kim
Special Issues
Volume 35, Issue 8
Page Number: 510-511

Proteins are the workhorses of cells, which obviously requires a high level of complexity. But how complex? Originally, it was thought there would be a close relationship between the ~20,000 protein-coding genes in the human genome and the number of expressed proteins. Wrong! Through a variety of new methods, including mass spectrometry (MS) sequencing, it is now predicted that there could be 250,000 to 1 million proteins in the human proteome (1).

But what does this have to do with chromatography? Liquid chromatography (LC) has played a pivotal role in discovering, identifying, and quantifying the components in living systems for more than a century. The question being explored here is whether that role is likely to continue or whether LC will become a historical footnote, as the MS community suggests.

First, what is a proteoform? We know that during protein synthesis a protein-coding gene provides the blueprint for a family of closely related structural isoforms arising from small, regulated variations in their synthesis involving alternative splicing (2) and more than 200 types of post-translational modification (PTM) (3). This process can lead to a proteoform family of 100 members (4), many of which differ in biological function. The human genome gets more “bang” per protein-coding gene in this way. Smith and Kelleher proposed the name “proteoform” for these structural isoforms in 2013 (5).  

An important issue is how these high levels of proteoform complexity were predicted. The idea arose from the identification of splice variant sites and large numbers of PTMs in peptides derived from trypsin digests, often supported by top-down sequence analysis of intact proteins by MS (6). The use of gas-phase ions to identify sites and types of modifications in the primary structure of a protein is of great value, but it must be accompanied by analysis of the structure, function, and interaction partners (7,8) of proteoforms in vivo. This combined analysis is needed because life occurs in an aqueous world, not in the gas phase.

There is the impression that the discovery, isolation, and characterization of proteins are highly evolved. Actually, fewer than 100,000 human proteins have probably been isolated and characterized. If the predicted number of proteoforms is accurate, fewer than half have been isolated and characterized. Protein isolation is inefficient. A breakthrough in separation technology is needed.
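The scale of the gap can be checked with simple arithmetic. The sketch below uses only the figures quoted in this article (~20,000 protein-coding genes, 250,000 to 1 million predicted proteoforms, and an upper bound of 100,000 characterized proteins); the function and constant names are illustrative assumptions, not part of any published method.

```python
# Back-of-the-envelope check of the proteome figures quoted in the text.
# All names here are hypothetical; the numbers come from the article itself.

GENES = 20_000                             # approximate protein-coding genes
PROTEOFORM_ESTIMATES = (250_000, 1_000_000)  # predicted range of proteoforms
CHARACTERIZED_UPPER_BOUND = 100_000        # "fewer than 100,000" characterized


def average_family_size(n_proteoforms: int, n_genes: int = GENES) -> float:
    """Mean number of proteoforms per protein-coding gene."""
    return n_proteoforms / n_genes


def characterized_fraction(
    n_proteoforms: int, n_characterized: int = CHARACTERIZED_UPPER_BOUND
) -> float:
    """Upper bound on the fraction of proteoforms isolated and characterized."""
    return n_characterized / n_proteoforms


for estimate in PROTEOFORM_ESTIMATES:
    print(
        f"{estimate:,} proteoforms: "
        f"{average_family_size(estimate):.1f} per gene on average, "
        f"at most {characterized_fraction(estimate):.0%} characterized"
    )
```

Under these assumptions, the characterized fraction is at most 40% at the low estimate and at most 10% at the high estimate, which is consistent with the "fewer than half" statement above.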

Protein peak capacities are no more than a few hundred in most forms of LC, suggesting that each peak from a 1-million-component mixture could potentially contain 1000 or more proteins. Multidimensional separation methods are an obvious approach, but comprehensive structure analysis of a 200 × 200 fraction set to find proteoforms would be formidable. That has always been a problem. Obviously, tweaks to particle size, theoretical plates, and peak capacity will not solve this problem either. Moreover, the structure selectivity of ion-exchange chromatography, hydrophobic interaction chromatography, reversed-phase LC, and immobilized metal affinity chromatography is poor. Proteins of completely different structure coelute.
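The peak-capacity argument can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that components spread evenly across the available peaks and that a comprehensive two-dimensional separation achieves the ideal product of its two peak capacities; real separations fall short of both assumptions, so these numbers are optimistic. The function names are hypothetical.

```python
# Rough estimate of how crowded each peak or fraction would be.
# Assumes an even spread of components and an ideal 2D product rule.

N_PROTEOFORMS = 1_000_000  # upper proteoform estimate quoted in the text


def mean_components_per_peak(n_components: int, peak_capacity: int) -> float:
    """Average number of components per peak if they were spread evenly."""
    if peak_capacity <= 0:
        raise ValueError("peak_capacity must be positive")
    return n_components / peak_capacity


def ideal_2d_peak_capacity(first_dimension: int, second_dimension: int) -> int:
    """Ideal peak capacity of a comprehensive 2D separation (product rule)."""
    return first_dimension * second_dimension


# One-dimensional LC at several assumed peak capacities.
for capacity in (200, 500, 1000):
    load = mean_components_per_peak(N_PROTEOFORMS, capacity)
    print(f"peak capacity {capacity}: about {load:.0f} proteins per peak")

# The 200 x 200 fraction set mentioned in the text.
fractions = ideal_2d_peak_capacity(200, 200)
load_2d = mean_components_per_peak(N_PROTEOFORMS, fractions)
print(f"{fractions} fractions: about {load_2d:.0f} proteins per fraction")
```

Even in this idealized case, the 200 × 200 scheme produces 40,000 fractions that each still hold about 25 proteins, which illustrates why the downstream structure analysis is the formidable part.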

Probing deeper, there is hope for this seemingly intractable problem. The fact that proteoforms arise from a single gene means they are cognates with multiple, identical structural features. A stationary phase that could recognize these shared features would make it possible to capture a proteoform family, theoretically reducing a 1-million-component mixture to fewer than 100 components in a single step. This possibility is of enormous significance. Species of no interest would be rejected, while the selected proteins would likely be structurally related, with the exception of a few nonspecifically bound (NSB) proteins. In this scenario, the poor structure-specific selectivity of current LC columns would be an asset, fractionating family members based on other structural features. Moreover, top-down MS would identify structural differences and NSB proteins.

The big question is how to obtain such a magical, structure-selective stationary phase. Surprisingly, such phases already exist: an immobilized polyclonal antibody (pAb) interrogates multiple features (epitopes) of a protein, making it highly probable that features common to all proteoforms in a family would be recognized and selected. Family-specific monoclonal antibodies (mAbs) do the same, but recognize only a single shared epitope.

Production of a pAb targeting common proteoform family epitopes can be achieved by using any member of an existing family as an immunogen. Thousands of pAbs are already available.

Proteins that have never been isolated present a larger problem. There is no family member to use as an immunogen. The new field of antibody-based proteomics (9–11) addresses this problem by using protein fragment libraries to obtain immunogens. The rationale is that the DNA sequence of a protein-coding gene predicts fragments of 6–15 amino acids from a protein family that, when synthesized and attached to a large immunogen, will sometimes produce antibodies that recognize common epitopes of the family.

Based on the need for fractionation in determining the structure and function of so many proteins, the future of LC in the life sciences seems bright, but with some conditions. Clearly, the acquisition and use of affinity selectors is a major opportunity. The application of a family-selective phase in the first fractionation step would allow rejection of untargeted proteins while directing those of interest into higher-order fractionation steps. Fortunately, the engineering and production of the requisite antibodies for implementing this approach to protein analysis are receiving increasing attention (9–11). Finally, new ways must be found to use affinity selectors in protein fractionation that circumvent covalent immobilization. Covalently binding ~20,000 different affinity selectors to achieve the goals noted above is not a realistic prospect.

References

  1. T. Vacik and I. Raska, Protoplasma 254, 1201–1206 (2017).
  2. D. Gawron, E. Ndah, K. Gevaert, and P. Van Damme, Mol. Syst. Biol. 12, 858 (2016).
  3. G. Qing, Q. Lu, Y. Xiong, L. Zhang, H. Wang, X. Li, X. Liang, and T. Sun, Adv. Mater. 29, 1604670 (2017).
  4. E.A. Ponomarenko, E.V. Poverennaya, E.V. Ilgisonis, M.A. Pyatnitskiy, A.T. Kopylov, V.G. Zgoda, A.V. Lisitsa, and A.I. Archakov, Int. J. Anal. Chem. 2016, 7436849 (2016).
  5. L.M. Smith and N.L. Kelleher, Nat. Methods 10, 187 (2013).
  6. K.R. Durbin, L. Fornelli, R.T. Fellers, P.F. Doubleday, M. Narita, and N.L. Kelleher, J. Proteome Res. 15(3), 976–982 (2016).
  7. E.M. Phizicky and S. Fields, Microbiol. Rev. 59(1), 94–123 (1995).
  8. M.R. Arkin, Y. Tang, and J.A. Wells, Chem. Biol. 21(9), 1102–1114 (2014).
  9. M. Uhlen and F. Pontén, Mol. Cell. Proteomics 4, 384–393 (2005).
  10. M. Uhlen, Mol. Cell. Proteomics 6, 1455–1456 (2007).
  11. M. Uhlen, E. Bjorling, C. Agaton, C.A. Szigyarto, B. Amini, E. Andersen, A.C. Andersson, P. Angelidou, A. Asplund, and C. Asplund, Mol. Cell. Proteomics 4, 1920–1932 (2005).

Fred E. Regnier and JinHee Kim are with Novilytic at the Kurz Purdue Technology Center (KPTC) in West Lafayette, Indiana.