In this column, we introduce the basics of today’s approaches to intact protein dissociation with mass spectrometry (MS), or top-down sequencing (as opposed to the more conventional peptide-based “bottom-up” sequencing). We also discuss where future improvements might occur, the advantages and limitations of top-down sequencing, possible applications, and why it has become such an important and actively pursued research area.
Protein sequencing has long been performed using wet chemistry, such as the well-known Edman degradation method (1–3). However, such approaches have gradually given way to more automated instrumental methods, especially those involving mass spectrometry (MS) in a variety of arrangements and configurations (4–10). It may be fair to say that MS is among the predominant and most widely used analytical techniques today. This is especially true for protein sequencing, proteomics, protein identification, characterization of post-translational modifications (PTMs), and myriad other applications. For biopharma, these may include single protein characterization, biopharmaceutical method development, biosimilar comparison to proprietary biopharmaceuticals, characterization of antibody–drug conjugates (ADCs), host-cell protein detection, and so forth. In addition, more applications for protein characterization are being pursued by MS and liquid chromatography–MS (LC–MS) because of the advent of newer and improved MS instrumentation for characterizing proteins and the ability to use a range of fragmentation methods. Some of those methods include electron transfer dissociation (ETD), electron capture dissociation (ECD), electron transfer in the higher-energy collisional dissociation (HCD) cell (EThcD), ultraviolet photodissociation (UVPD), surface induced dissociation (SID), and others.
In this installment, we introduce, in a broad sense, the basics of today’s approaches for doing intact protein dissociation with MS or top-down sequencing (that is, rather than more conventional peptide-based bottom-up sequencing). We also discuss where future improvements might occur, advantages and limitations of using top-down sequencing, possible applications, and why top-down sequencing has become such an important and pursued research area for many. It is hard to attend any recent meetings on MS where this topic is not presented, debated, and discussed. Top-down sequencing has become a hot topic, and somehow it manages to raise the excitement of many (experts as well as novices) in the desire to make it work with 100% sequencing efficiency for any and all proteins, peptides, antibodies, antibody–drug conjugates, and other biopolymers (for example, lipids, oligonucleotides, polysaccharides, and related high- and low-molecular-weight materials). There is also a push to make top-down sequencing effective for trace levels of analytes including not just commercial biopharmaceuticals but biologically active biosamples with accurate and precise sequencing and quantitation.
Top-Down Sequencing and the Current State of the Art
Perhaps one of the very best (simplified) overviews of protein MS today can be found on Wikipedia, together with its suggested reading list (11). There are, of course, various reviews of protein MS and a few that are specifically targeted at top-down sequencing (12–15). Top-down sequencing is just one of many possible approaches for determining the N-to-C sequence of all the amino acids in any protein or peptide, and for identifying their PTMs as well, whether these are oxidation, glycosylation, deamidation, or other changes to the original amino acids. Aside from Edman degradation, which is not used very often today, the currently used methods include bottom-up sequencing, middle-down sequencing, and middle-out sequencing. Some of these measure the intact molecular weight (MW) or ion of the protein (middle-down sequencing, middle-out sequencing, top-down sequencing) while others do not (bottom-up sequencing), as shown in Figure 1. Bottom-up sequencing has been the longest used MS-based approach for protein sequencing and proteomics, though it has some substantial inherent problems (16–19).
Figure 1: Schematic representation comparing top-down and bottom-up sequencing for identifying a specific protein or proteins of interest. (Adapted and reprinted with permission from references 12 and 14.)
Although it is hard to argue with the impact of bottom-up sequencing on a large range of research areas, a disadvantage of the strategy is that it rarely links the MW of the parent protein to its digested component peptides. As a result, it has no foolproof way of determining whether all of the peptide fragments have been identified (19). It relies on a database of known proteins to match its peptide map to a list of most-likely parent proteins, or hits, and it ranks these hits without knowing for certain the MW of the parent protein. This may be sufficient for protein identification, but not necessarily for complete sequencing. Middle-out sequencing measures the MW of every protein in a complex mixture before performing peptide mapping. Together with reliable databases of known proteins and their peptide maps, this makes it possible to unambiguously identify a known protein. To fully sequence a protein that is not in any database, one has to isolate and characterize every peptide in that protein’s digest, and the summed MWs of those peptides (after allowing for the water added at each cleaved peptide bond) must add up to that of the molecular ion of the corresponding parent protein. This approach will work, but only if all of the peptides have been isolated and sequenced by high performance liquid chromatography–electrospray ionization-MS (HPLC–ESI-MS) or other available methods. A further problem with middle-out and bottom-up sequencing is that they are intensive in materials, reagents, columns, and MS time, and are expensive to perform in terms of labor and reagents.
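The mass-balance requirement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequence and its tryptic peptides are hypothetical toy examples, the function names are ours, no PTMs are modeled, and monoisotopic masses are used throughout. Each hydrolyzed peptide bond adds one water molecule, so a digest of n peptides carries (n - 1) more waters than the intact chain.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # monoisotopic mass of H2O (Da)


def neutral_mass(sequence):
    """Monoisotopic neutral mass of an unmodified linear chain."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def digest_accounts_for_protein(peptides, intact_mass, tol_da=0.01):
    """Return True if the peptide masses reassemble to the intact mass.

    Joining n peptides back into one chain removes (n - 1) waters.
    """
    reassembled = sum(neutral_mass(p) for p in peptides)
    reassembled -= (len(peptides) - 1) * WATER
    return abs(reassembled - intact_mass) <= tol_da


# Hypothetical toy protein and its tryptic peptides (cleavage after K and R).
PROTEIN = "ACDKEFGHRILMNK"
PEPTIDES = ["ACDK", "EFGHR", "ILMNK"]

intact = neutral_mass(PROTEIN)
print(digest_accounts_for_protein(PEPTIDES, intact))      # every peptide recovered
print(digest_accounts_for_protein(PEPTIDES[1:], intact))  # one peptide missing
```

With a measured intact mass in hand, a failed balance flags that at least one peptide (or a modification) is still unaccounted for, which is exactly the check that bottom-up sequencing alone cannot make.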
Middle-down sequencing, on the other hand (Figure 1), relies on measuring the intact MW of a protein and then digesting it into larger sized polypeptides (that is, rather than smaller sized trypsin-digested peptides). By putting together all of these larger polypeptides to produce the known MW of the parent protein, the complete sequence can be more readily deduced. For antibodies, in essence, the use of IdeS or related enzymes to digest them into large pieces, and then separating and characterizing these large pieces is a form of middle-down sequencing (20–22).
So, how does top-down sequencing differ from middle-down, middle-out, and bottom-up sequencing? Top-down sequencing is an MS method that measures the mass of the intact protein (the “top” part of top-down sequencing) and then dissociates the gaseous intact protein into product ions (the “down” part), from which its sequence can be derived using software tools. Top-down sequencing endeavors to determine the entire sequence with 100% accuracy and reliability. Ideally, it should do this with the minimal amount of protein and the fewest manipulations possible, and it should perform well on a complex mixture of proteins. Top-down sequencing should also indicate all PTMs, including disulfide bridge positions, oxidations, deamidation, glycosylation positions and sequences, and any other modifications. Quantitation information would be a real bonus, but it is not required for absolute sequencing or many proteomics needs (23,24).
For absolute identification of any new or known protein or antibody, one must have a top-down sequencing method that identifies all modified and unmodified amino acids and their sequence in the protein backbone, and that confirms they add up to the measured MW of the intact protein. This approach will facilitate 100% identification of new, unknown proteins, the confirmation of structures for known proteins in any database, and batch-to-batch comparisons of biopharmaceuticals in any firm or regulatory laboratory, and it will simplify the direct comparison of biosimilars to proprietary biopharmaceuticals. Of course, having a successful and reliable top-down sequencing method also makes it simpler than ever to characterize any new protein from any biological source. These are incredibly useful and important applications of a final, 100% successful top-down sequencing method using MS.
Complete sequence determination is not typically required for top-down proteomics because protein identification is a primary goal. By revealing the minimum number of proteoforms present in a sample, and separating many distinct forms for individual analyses, an analyst is better able to differentiate splice variants that are present from those that are not present and to quickly estimate the number of distinct species present.
Several questions arise about the status of top-down sequencing: Where are we on this road to nirvana? Why have we not gotten there yet? How do we get there in the future?
Current Approaches for 100% Top-Down Sequencing
For at least the past decade, this area of MS has been one of intense interest, devotion, and energy for several research groups around the world. These groups are usually led by highly respected, knowledgeable, and serious mass spectrometrists. The groups include the Consortium for Top-Down Proteomics (CTDP). Separation scientists provide the front-end separations that permit mass spectrometrists to introduce a single protein at a time for top-down sequencing. However, no separations method, even one that resolves every protein entering the mass spectrometer from any coeluted proteins, can compensate for an MS method that does not provide 100% sequencing. Today, separation is generally not a problem, and the lingering problems are really within the MS domain. The issue is not how much of a given protein is needed or the limits of detection in sequencing, but rather the methods of sequencing. Nor is it an inability to get any protein into the MS instrument, since we have been studying and identifying 150 kDa monoclonal antibodies (mAbs) by MS for at least two decades now.
The issue is sequencing, of course, knowing which amino acids and modifications are present and in what specific sequence, from the N- to the C-termini. It is now relatively simple to do for smaller peptides but becomes amazingly more difficult as the MW increases from a small peptide to a protein to mAbs. If all we needed was to sequence peptides of a few thousand daltons, we could do that all the time. However, if we want to sequence all proteins at 50 kDa, with 100% sequence coverage, we cannot quite yet do that reliably. Despite impressive efforts by researchers such as Tsybin (Spectroswiss) and Marshall, who have demonstrated moderate sequence coverage of mAbs of 150 kDa, it is still a considerable challenge (25,26). Even for smaller proteins of about 30 kDa, sometimes it is possible to get very close to 100% coverage, but with other proteins (of the same MW), it is not trivial. Why is that? And, why is it that as the MW of a protein gets larger, the percent of successful sequencing gets smaller? This is the “Holy Grail”: to understand why this is the current status of top-down sequencing, and how to solve the seeming impasse so that the final method will succeed at doing 100% top-down sequencing.
Major Fragmentation Approaches for Improving Top-Down Sequencing: Past and Present
Top-down proteomics will obviously rely on the most successful modes of top-down sequencing, which in turn depend, in large part, on which fragmentation modes and approaches are most useful for realizing maximum sequence coverage. Success also depends on the resolution available on a given MS instrument; in general, the higher resolution instruments, such as orbital ion trap, Fourier transform ion cyclotron resonance (FT-ICR), and quadrupole time-of-flight (Q-TOF) systems, have proven to be the most successful. Most of the recent literature has tended to use FT-ICR and orbital ion trap systems, together with multiple fragmentation modes. Even so, a high price tag alone does not yet guarantee 100% top-down sequencing for any and all proteins. Success may come not from spending more on the highest resolution MS instruments but rather from figuring out how best to present the protein molecules to the various dissociation techniques.
There are some, perhaps more than some, other lingering questions that must be answered before the world will have a fully reliable and effective top-down sequencing method for any and all proteins of any size or MW. Some of these, not necessarily in their order of importance, include
Where Is Top-Down Sequencing Today?
There have been impressive advances in the field of top-down sequencing, almost too many to cover comprehensively in this paper. Lower MW proteins almost always exhibit greater percent sequencing (>50%, often approaching 100%); coverage is strongly, though not invariably, MW dependent. Nobody has yet conclusively demonstrated why this is so, but many believe it could be related to how we choose to prepare the protein, solution conditions, pH, the active or inactive state of the protein, fragmentation routes (ETD, ECD, collision-induced dissociation [CID], HCD, UVPD, and others), timing of fragmentation, and so forth. Nobody has yet published a specific set of conditions that will guarantee a given percent fragmentation or sequencing for any and all proteins. Perhaps this is an unrealistic goal. Sequencing peptides by tandem MS is somewhat dependent on the amino acid sequence, and therefore it is likely that top-down sequencing of intact proteins will show a sequence dependence (and maybe even a secondary or tertiary structure dependence).
Because top-down sequencing is typically performed with ESI, the multiply charged molecules should be considered. Most fragmentation modes show an increase in fragmentation efficiency (conversion of precursor ions to product ions) with higher charged precursors, but this does not always translate to higher sequence coverage. Increasing analyte charge via “supercharging” or using denaturing solution conditions or both could be considered. Which methods or combination of dissociation methods (ECD, CID, UVPD, and so forth) yield the optimal results? We have not yet found the right or most useful path through these woods, though there are some very encouraging signposts already recognized (27–34).
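Because the multiply charged ions produced by ESI underlie everything that follows, a short worked example of the charge-state arithmetic may help. The sketch below assumes positive-mode ions charged only by added protons, and uses a hypothetical neutral mass of 27,000 Da; the function names are ours. Two adjacent peaks in the charge-state envelope are enough to recover both the charge and the neutral mass.

```python
PROTON = 1.007276  # mass of a proton (Da)


def mz(neutral_mass, z):
    """m/z of an [M + zH]z+ ion."""
    return (neutral_mass + z * PROTON) / z


def charge_from_adjacent_peaks(mz_higher_charge, mz_lower_charge):
    """Charge z of the lower-m/z peak, given its neighbor at charge z - 1.

    From m1 = M/z + p and m2 = M/(z - 1) + p it follows that
    z = (m2 - p) / (m2 - m1).
    """
    return round((mz_lower_charge - PROTON) / (mz_lower_charge - mz_higher_charge))


def neutral_mass_from_peak(mz_value, z):
    """Neutral mass recovered from one peak of known charge."""
    return z * (mz_value - PROTON)


# Hypothetical 27,000 Da protein observed at the 31+ and 30+ charge states.
assumed_mass = 27000.0
peak_31 = mz(assumed_mass, 31)
peak_30 = mz(assumed_mass, 30)

z = charge_from_adjacent_peaks(peak_31, peak_30)
print(z, round(neutral_mass_from_peak(peak_31, z), 3))
```

Real spectra add isotope distributions, adducts, and noise, so practical deconvolution software does considerably more than this; the sketch only shows why a higher charge moves a large protein into a lower, more tractable m/z range.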
McLafferty (35) suggested several years ago that gas-phase proteins perhaps assume a “spaghetti ball” configuration that prevents almost all fragmentation routes from realizing high percent sequencing. Proteins larger than 30 kDa are difficult to analyze by top-down sequencing, perhaps because secondary and tertiary structures interfere with complete fragmentation. The technique does not work on every protein, whatever its MW. It is not clear whether native and denatured proteins both assume this temporary “spaghetti ball” conformation, or which form of the protein would lead to more efficient top-down sequencing. Loo (36) published a study that compared top-down sequencing of a protein that exists as a homotetramer complex in its native state and as a monomer in a denaturing solution. Percent sequence coverage was similar for the two conditions, but the regions from which product ions originated were different, presumably because the three-dimensional (3D) structures of the two forms are different. However, most studies that examine native proteins by MS do not consider top-down sequencing (37,38). There is also the lingering issue that most top-down sequencing has a high limit of detection (LOD), which can be a serious problem when pursuing top-down proteomics. Finally, some proteins just do not undergo any reasonable or acceptable extent of fragmentation. The method can be protein dependent, which makes it less useful for top-down proteomics if one is not seeing all or nearly all of the important proteins in a particular biofluid, for example.
The Types of Fragmentations Used in Maximizing Top-Down Sequencing
Of all the operational parameters that may help maximize percent sequence coverage, the choice of fragmentation routes, the location of fragmentation, and other parameters of the MS operation seem paramount. Current, cutting-edge technologies very often involve high-resolution MS (HRMS) in the form of FT-ICR-MS, high-resolution Q-TOF, orbital ion trap, and other possible arrangements. High resolution aids charge state assignment for large, multiply charged product ions. Figure 2 illustrates a previous state-of-the-art approach, using ETD MS-MS for intact 150 kDa mAbs, which was used for measuring intact molecular ion distributions as well as for top-down sequencing (34,39). In this instance, there were small differences in sequence coverage depending on which MS system was used, but to date nothing more than 50% has been realized for intact mAbs. On the Orbitrap Velos Pro MS system, there was 32.7% total sequence coverage for a typical mAb, adalimumab, while on the Orbitrap Elite MS system, there was 27% total sequence coverage with three typical mAbs: adalimumab, trastuzumab, and panitumumab.
Figure 2: Previous state-of-the-art MS for top-down sequencing: ETD and MS-MS of intact mAbs. (Adapted with permission from Neil Kelleher, Northwestern University [34,39].)
However, mAbs may present an unusual challenge for top-down sequencing because of their complex structure: a “Y”-shaped protein composed of two heavy chains and two light chains linked by an array of disulfide bonds and decorated with variable numbers of glycans. Beyond the problems of addressing larger MW proteins, even proteins of the same MW may well have different sequencing efficiencies under the very same top-down sequencing conditions. This all suggests that the sequence and conformations of each protein may directly affect the sequencing efficiency in top-down sequencing, whatever the operational conditions (fragmentation approaches) may be.
Over the years, researchers started with one approach, such as CID because of its ready availability, and then moved on to other modes, such as ETD and ECD, as these were invented or commercialized and shown to be efficient for top-down sequencing. Today, UVPD at 193 nm, as demonstrated by the work of Brodbelt’s group, appears to be a most promising advance for top-down sequencing (29,30,40,41). Figure 3 illustrates one of the approaches used for UVPD with an orbital trap mass spectrometer (40). At the moment, the best approach appears to be performing UVPD in the ion trap (it can also be done in the HCD cell), where the molecular ions, once formed, can be held for varying periods of time while they undergo fragmentation, and then sending the products to the orbital trap analyzer for high-resolution measurement (40).
Figure 3: Implementation of UVPD on an Orbitrap Fusion Tribrid mass spectrometer. (Adapted with permission from J.S. Brodbelt, University of Texas at Austin.)
UVPD appears to be the most efficient fragmentation approach today for top-down sequencing, often giving >90% sequence coverage for many proteins. That number, once again, decreases as the MW of a protein increases. Figure 4, using green fluorescent protein (GFP) as an example, illustrates what might be considered typical UVPD results: the normally observed, parabolic-shaped distribution of peaks representing the multiply charged molecules in the mass spectrum, followed by selection of the 31+ precursor for subsequent fragmentation (41). There is extensive fragmentation along the backbone, yielding a collection of ions from dissociation of different bonds throughout the protein (denoted as a-, b-, and c-type product ions, which retain the N-terminus, and x-, y-, and z-type product ions, which retain the C-terminus). Specialized software must then interpret all of these ions and put together the most likely pattern of fragmentations for all possible combinations of amino acids and PTMs. While absolute sequencing of the type indicated is crucial, identification of all PTMs is equally crucial for real applications in top-down proteomics or for single biopharmaceutical and biosimilar products, whether these are glycovariants, phosphorylated amino acids, disulfide linkages, deamidations, and so forth.
Figure 4: UVPD of green fluorescent protein (238 residues, 27 kDa). (Adapted with permission from J.S. Brodbelt, University of Texas at Austin.)
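To make the a/b/c and x/y/z nomenclature concrete, the following sketch computes the nominal singly charged product ion ladders for a short, hypothetical peptide. It is an illustration only, under several assumptions: monoisotopic masses, no PTMs, a single added proton, and the conventional even-electron ion definitions (a = b - CO, c = b + NH3, x = y + CO - H2, z = y - NH3). Radical variants that differ by one hydrogen, which are commonly observed with electron-based dissociation and UVPD, are not modeled, and the function name is ours.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # proton (Da)
WATER = 18.010565   # H2O (Da)
CO = 27.994915      # carbon monoxide (Da)
NH3 = 17.026549     # ammonia (Da)
H2 = 2.015650       # two hydrogen atoms (Da)


def fragment_ions(sequence):
    """Return {label: m/z} for singly charged a/b/c and x/y/z ions.

    The index i counts residues from the N-terminus for a/b/c ions
    and from the C-terminus for x/y/z ions.
    """
    n = len(sequence)
    ions = {}
    for i in range(1, n):
        n_term = sum(RESIDUE_MASS[aa] for aa in sequence[:i])
        c_term = sum(RESIDUE_MASS[aa] for aa in sequence[n - i:])
        b = n_term + PROTON
        y = c_term + WATER + PROTON
        ions[f"a{i}"] = b - CO
        ions[f"b{i}"] = b
        ions[f"c{i}"] = b + NH3
        ions[f"x{i}"] = y + CO - H2
        ions[f"y{i}"] = y
        ions[f"z{i}"] = y - NH3
    return ions


# Hypothetical toy peptide; print its ladder sorted by m/z.
for label, value in sorted(fragment_ions("ACDK").items(), key=lambda kv: kv[1]):
    print(f"{label:>3s}  {value:10.4f}")
```

For an intact protein the same bookkeeping applies, but the product ions are large and multiply charged, which is why high resolving power and dedicated software are needed to assign them.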
Figure 5 illustrates near-complete sequence coverage by UVPD for two proteins, ubiquitin (8.6 kDa) and myoglobin (17 kDa) (27). Near-complete fragmentation of proteins up to 29 kDa (96%) is also achieved by UVPD, including the unambiguous localization of a single-residue mutation and several modifications on the protein Pin1. The 5-ns, high-energy activation afforded by UVPD exhibits far less precursor charge state dependence than conventional collision-based and electron-based dissociation methods (for example, CID, CAD, ETD, and ECD).
Figure 5: UVPD spectra of (a) the 11+ charge state of ubiquitin and (b) the 20+ charge state of myoglobin (27). (Reprinted with permission of the copyright holder, American Chemical Society and JACS, Washington, DC, USA.)
Again, the lower the MW, the higher the percent coverage usually observed. In studying the individual chains of a mAb, these numbers are usually <50%, perhaps 40%. Figure 6 shows a similar UVPD analysis of carbonic anhydrase (29 kDa, 259 residues), with coverage of 87% for the 34+ charge state and a laser pulse of 0.6 mJ (27,29). Coverage is shown on the indicated sequence, with blue or red lines denoting whether a cleavage is of the b/y, a/x, or c/z type. The numbers are indicated for each ion observed under these conditions. Coverage also varies from protein to protein, with MS conditions, and with the molecular ions being fragmented (29). It is possible that there is no single set of universal MS conditions that will maximize percent coverage for any and all proteins. It may be necessary to derive these on a case-by-case basis, which makes maximized top-down proteomics more difficult.
Figure 6: UVPD of carbonic anhydrase, 259 residues (29 kDa). (Adapted with permission from J.S. Brodbelt.)
Remaining Hurdles in Improving and Advancing Top-Down Sequencing Methods
There are several areas that remain to be developed and refined to allow top-down sequencing to approach 99% sequence coverage for >99% of all proteins. One of these relates to improved software algorithms for the most advanced fragmentation approaches. There are many examples of software for interpreting top-down mass spectra reported in the literature, but few are widely used by the community. As top-down sequencing becomes more mature, it is likely that software development will follow.
Supercharging is a promising approach for making available more highly charged molecular ions, which may fragment more efficiently (31,32). Such improvements may not yield orders-of-magnitude gains in top-down sequencing efficiency, but perhaps 20% or more improved coverage may be realized. A recent paper described improved top-down sequencing using a novel reagent to increase protein charging (42). In that work, the fragmentation of six proteins (8.6 to 66.5 kDa) using ECD in a 7-T FT-ICR-MS system was studied as a function of charge state. 1,2-Butylene carbonate was added to ESI solutions to form protein ions with extremely high charge densities. For all six proteins, cleavage of 85–99% of all interresidue sites was identified, and fragmentation efficiencies of 75–95% were obtained from tandem MS of the highest charge states that could be readily isolated under these conditions.
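Percentages such as “85–99% of all interresidue sites” can be reproduced from a list of assigned product ions. The sketch below assumes one common definition of sequence coverage: the fraction of the N - 1 backbone sites between adjacent residues that is explained by at least one matched fragment. Individual studies may define coverage differently, and the fragment list shown is hypothetical.

```python
N_TERMINAL = {"a", "b", "c"}  # ion types that retain the N-terminus
C_TERMINAL = {"x", "y", "z"}  # ion types that retain the C-terminus


def sequence_coverage(n_residues, fragments):
    """Percent of interresidue sites cleaved by at least one fragment.

    fragments is an iterable of (ion_type, index) pairs such as ("b", 12).
    Site k lies between residues k and k + 1, counted from the N-terminus.
    """
    n_sites = n_residues - 1
    cleaved = set()
    for ion_type, index in fragments:
        if not 1 <= index <= n_sites:
            continue  # ignore fragments that map to no backbone site
        if ion_type in N_TERMINAL:
            cleaved.add(index)
        elif ion_type in C_TERMINAL:
            cleaved.add(n_residues - index)
        else:
            raise ValueError(f"unknown ion type: {ion_type!r}")
    return 100.0 * len(cleaved) / n_sites


# Hypothetical assignments for a 10-residue chain. b3 and y7 report on the
# same site, so three distinct sites out of nine are covered.
matched = [("b", 3), ("y", 7), ("c", 5), ("z", 2)]
print(f"{sequence_coverage(10, matched):.1f}%")
```

Under this definition, complementary ions (for example, b3 and y7 in a 10-residue chain) count only once, so adding a second fragmentation mode raises coverage only when it cleaves sites that the first mode missed.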
Eventually, top-down sequencing needs to be interfaced with on-line chromatographic separations, perhaps two-dimensional liquid chromatography (2D-LC) or 3D-LC, to baseline resolve the majority of proteins before the MS step. Ideally, each protein should be baseline resolved from the others; otherwise, interpretation of the MS data becomes more difficult. One would like a final system that permits the bulk of the proteins in the sample to be identified and sequenced, so as to match known sequences or characterize an unknown protein. In addition, one would like an approach that provides absolute quantitation of each newly identified protein, using a perhaps isotopically labeled internal standard protein, after its sequence has been determined. Perhaps this is unrealistic to ask for currently, but one can dream.
Conclusions and Future Directions
We do not yet understand enough about how a typical protein behaves within the MS system, in terms of conformational changes before fragmentation, or how to control that behavior for improved fragmentation. It is possible that by using ion mobility spectrometry (IMS) and collisional cross section (CCS) information, we may be able to define the conformation of a typical protein just before it fragments, especially one that yields the most desired fragmentation result. Also, one would have thought that high-intensity UVPD at some wavelength in the vacuum UV region would have been sufficient to realize 100% sequencing for all proteins. That has not yet happened. We do not yet know how to control or influence protein conformations in the vapor phase of a typical MS study. We may well have optimized everything else but this one, single issue. If and when we can do this, it may well be that all proteins, of any size or shape in solution, can be 100% sequenced with all PTMs defined. At that point, we will have validated McLafferty’s original 2007 suggestions about what is holding up 100% sequencing abilities. We will then have an MS method that surpasses all others ever described, perhaps replacing bottom-up sequencing, middle-down sequencing, and all others for the better. It will then become commonly used in those laboratories that have the needed MS instrumentation, sample preparation capabilities, and software and computers to handle the incredible amount of fragmentation information that will produce 100% sequencing, with all PTMs intact from the original, native protein in solution. That day is coming!
We appreciate and acknowledge the technical assistance and information provided to us in the preparation of this column, especially from Neil Kelleher, Jenny Brodbelt, John Engen, Jeff Agar, Paul Danis, and others involved in the developing history of top-down sequencing. Mention should be made of the Consortium for Top-Down Proteomics; more information can be found online at its website (43).
Joseph A. Loo is a professor in the Department of Biological Chemistry, David Geffen School of Medicine at UCLA and in the Department of Chemistry and Biochemistry at the University of California, Los Angeles.
Ira S. Krull is a Professor Emeritus with the Department of Chemistry and Chemical Biology at Northeastern University in Boston, Massachusetts, and a member of LCGC’s editorial advisory board.
Anurag S. Rathore is a professor in the Department of Chemical Engineering at the Indian Institute of Technology in Delhi, India.