Fig. 1. Phylogenetic tree of hCoV-19 virus (COVID-19)
A phylogenetic tree for microbes is like a human family tree. The key difference is that a family tree lists generations and siblings, whereas a phylogenetic tree illustrates how different strains of a microbe are related. A strain is identified by its unique set of accumulated mutations. A group of related strains is called a clade (Fig. 1). The vertical lines of the phylogenetic tree show the relationships among strains or clades. The entries are evenly spaced in the vertical direction, so the vertical spacing carries no meaning. The length of the horizontal lines, however, is important. It is termed the evolutionary distance (ED). The longer the line, the farther the strain has diverged from its parent. For example, Clade GV has moved farther away from its parent Clade G than its sibling Clade GH has. The formulae for calculating evolutionary distances are complex and not standardized, so the distances are mostly interpreted qualitatively. As we will see below, more mutations and longer EDs do not necessarily translate into better survivability for the virus. GH, with a shorter ED, is far more prevalent than GV. In fact, GV seems to be at an evolutionary dead end under the present analysis.
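To give a feel for what a distance between two strains can mean, here is a minimal sketch of one of the simplest possible measures: the fraction of positions at which two aligned sequences of equal length differ. This is an illustration only. It is not the formula behind Fig. 1, the function name p_distance and the toy sequences are made up for the example, and real tools use more elaborate models.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of positions at which two aligned sequences differ.

    Illustrative assumption: both sequences are already aligned and have
    the same, non-zero length. Real evolutionary-distance formulae are
    more complex than this.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the positions where the two letters disagree.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Toy sequences that differ at one of eight positions.
print(p_distance("ACGTACGT", "ACGTTCGT"))  # 0.125
```

A larger value means the two sequences have diverged more, which matches the qualitative reading of a longer horizontal line in the tree.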
Nomenclature of viruses
The names of a few viral strains are listed in Figure 1. Each name has four parts separated by slashes. The first part is the type of virus: hCoV-19 is the coronavirus that causes COVID-19. Incidentally, researchers prefer the name SARS-CoV-2 for the COVID-19 virus. The second and fourth parts are the location and year in which the strain was discovered. The third part is the ID of the strain. For example, the strain hCoV-19/Belgium/UGent-200/2020 refers to hCoV-19 isolated in Belgium in 2020, and UGent-200 is the ID. Presumably, the strain was first identified at Ghent University. Note that the complete four-part name uniquely identifies a strain, even though individual parts can repeat. There can be thousands of hCoV-19 strains from Belgium in 2020, but each must have a unique ID for that location and year. Further, the location can be a country or any other region. For instance, one of the strains in Figure 1 is from Wuhan.
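The four-part naming scheme can be sketched in a few lines of Python. This is a minimal illustration, assuming the parts are always separated by slashes as in the example above. The names StrainName and parse_strain_name are hypothetical and do not come from any database or library.

```python
from typing import NamedTuple


class StrainName(NamedTuple):
    """The four parts of a strain name (field names are illustrative)."""
    virus: str
    location: str
    strain_id: str
    year: int


def parse_strain_name(name: str) -> StrainName:
    """Split a name such as 'hCoV-19/Belgium/UGent-200/2020' into its parts."""
    parts = name.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 slash-separated parts, got {len(parts)}: {name!r}")
    virus, location, strain_id, year = parts
    return StrainName(virus, location, strain_id, int(year))


strain = parse_strain_name("hCoV-19/Belgium/UGent-200/2020")
print(strain.location, strain.strain_id, strain.year)
```

Because only the full four-part combination is unique, two strains are the same strain only if all four fields match.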
Nomenclature of mutations
We are now ready to tie together some of the concepts we learnt in the previous sections, including genomic mutations, protein structure and single-letter amino acid symbols. Recall that mutations occur in the genome (DNA or RNA). However, the mutations do not affect viruses or humans directly. Instead, it is the changed protein coded by the mutated genome that influences both. Therefore, mutations are commonly studied at the protein level rather than the genome level. Further, we had discussed that proteins are formed by the end-to-end chaining of amino acids and that the amino acids are counted from left to right in the chain. Now, take the example of a hypothetical protein TEST that is formed by chaining four amino acids: T (threonine), E (glutamic acid), S (serine) and T (threonine). Suppose that a mutation in the genome changes the coded protein from TEST to PEST, where P is proline. That mutation is termed T1P, i.e., the amino acid at location #1 changed from T to P. If a significant population of the virus further mutates from P to R (arginine) while both mutations coexist, then the mutation is called T1P-R, i.e., TEST mutated to PEST and REST. Similarly, if BANK and BAND converge to BANG, then the mutation is called K-D4G, i.e., K and D mutated to G at location #4.
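The basic naming rule (old letter, location, new letter) can be sketched as a short Python function that compares two protein sequences letter by letter. This is a minimal illustration. The function name name_substitutions is made up for the example, and the sketch assumes both sequences have the same length, so it does not handle the combined forms such as T1P-R or K-D4G.

```python
from typing import List


def name_substitutions(original: str, mutated: str) -> List[str]:
    """Return names such as 'T1P' for every position where the letters differ.

    Locations are counted from 1, left to right, as in the text.
    Assumption: the two sequences have equal length (substitutions only).
    """
    if len(original) != len(mutated):
        raise ValueError("this sketch handles substitutions only; lengths must match")
    names = []
    for location, (old, new) in enumerate(zip(original, mutated), start=1):
        if old != new:
            names.append(f"{old}{location}{new}")
    return names


print(name_substitutions("TEST", "PEST"))  # ['T1P']
print(name_substitutions("BANK", "BANG"))  # ['K4G']
```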
Clades of COVID-19
Figure 1 illustrates the major clades of COVID-19 found thus far. A sample strain for each clade is also included in the figure. Most of these clades, especially GH, are large. Currently there are over 30,000 strains in GH. hCoV-19 split into S and L by the end of 2019. L further evolved into G and V. While S and V have been quiescent recently, Clade G is dominating infections worldwide. Clade G was rare before March 2020 but has accounted for nearly 75% of all virus samples tested in labs worldwide since June 2020. Interestingly, a single mutation, D614G, appears to have made the virus so much more infectious. Clade G is named for this mutation from D (aspartic acid) to G (glycine). D614G is located in the spike protein, which sits on the surface of the virus. Spike is instrumental in anchoring the virus to the cell and gaining entry. In less than one year since the COVID-19 outbreak, G has divided into subclades GH (mutations D614G + Q57H), GR (D614G + N-G204R) and GV (D614G + S-A222V). In the last two names, the prefixes N- and S- indicate the protein that carries the mutation (nucleocapsid and spike, respectively). This exemplifies how quickly and aggressively the virus is mutating. The analysis suggests that GH will subdivide further soon.
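The way Clade G and its subclades are defined by marker mutations can be sketched as a simple lookup. This is an illustration under stated assumptions: the mutation labels are written exactly as in the paragraph above, the names SUBCLADE_MARKERS and assign_g_clade are hypothetical, and the sketch covers only G, GH, GR and GV, not the S, L and V clades.

```python
from typing import Iterable, Optional

# Marker that, together with D614G, defines each subclade of G (from the text).
SUBCLADE_MARKERS = {
    "GH": "Q57H",
    "GR": "N-G204R",
    "GV": "S-A222V",
}


def assign_g_clade(mutations: Iterable[str]) -> Optional[str]:
    """Return 'G', 'GH', 'GR' or 'GV', or None if D614G is absent."""
    found = set(mutations)
    if "D614G" not in found:
        return None  # not in Clade G; other clades are outside this sketch
    for clade, marker in SUBCLADE_MARKERS.items():
        if marker in found:
            return clade  # first matching subclade marker wins
    return "G"


print(assign_g_clade(["D614G", "Q57H"]))  # GH
print(assign_g_clade(["D614G"]))          # G
```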
Fig. 2. Phylogenetic tree showing the evolution of spike protein of hCoV-19 virus
Evolution of COVID-19
Figure 2 tracks the evolution of the spike protein of the hCoV-19 virus. While the viral RNA codes for numerous proteins, the analysis has been restricted to the spike protein because of its significance in developing anti-COVID vaccines. It is the segment of the virus most visible to our immune system and therefore a target for vaccines. Approximately 50,000 strains from all seven major clades were included in the analysis. The trend shows that the proteins in Clade GH are changing most rapidly. That is consistent with the number of samples collected in labs worldwide over the last few months. The model predicts that GH will split into subclades in the next 4-6 months. Even though Clade GV has accumulated more mutations than GH, those changes do not increase its potency. Therefore, its prevalence is expected to keep declining.
(UK, Dec 2020)
A radically new strain of COVID-19 emerged in the United Kingdom in December 2020. It is an interesting case study of the need for, and the challenges of, tracking viruses. What makes the new strain unique is that entire amino acids have been deleted. This is the first such incident since the outbreak of COVID-19. In the past, mutations have only changed the amino acids at certain locations. We had discussed insertion/deletion of single nucleotides in the section Viral Strains and Immunity. Insertion/deletion of entire amino acids is rare and can be devastating. To put it in perspective, deletion of a single amino acid in the flu virus caused the 2009 Swine Flu epidemic. The changes in the novel UK COVID-19 virus are N501Y, A570D, P681H, T716I, S982A, D1118H and ΔH69/ΔV70. To reiterate, N501Y means that N (asparagine) has been replaced with Y (tyrosine) at amino acid location 501, and so on for the other mutations. ΔH69/ΔV70 means that H (histidine) and V (valine) have been deleted at locations 69 and 70, respectively. N501Y is of particular concern because it is located inside the part of the spike protein that binds to the human cell, called the RBD (receptor binding domain). A570D lies close to the RBD in the chain. Will N501Y, A570D and ΔH69/ΔV70 make the strain self-limiting or devastating (like the Swine Flu)? Only time will tell.
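The two kinds of labels in this list, substitutions such as N501Y and deletions such as ΔH69, can be told apart mechanically. The following is a minimal sketch using regular expressions. The function name parse_mutation and the dictionary layout are assumptions made for the example, and a paired label such as ΔH69/ΔV70 is assumed to be split on the slash first.

```python
import re

# Substitution: old letter, location, new letter (e.g. N501Y).
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")
# Deletion: the Greek letter delta, deleted letter, location (e.g. ΔH69).
DELETION = re.compile(r"^Δ([A-Z])(\d+)$")


def parse_mutation(label: str) -> dict:
    """Classify one mutation label and extract its parts."""
    match = SUBSTITUTION.match(label)
    if match:
        old, location, new = match.groups()
        return {"kind": "substitution", "old": old, "location": int(location), "new": new}
    match = DELETION.match(label)
    if match:
        old, location = match.groups()
        return {"kind": "deletion", "old": old, "location": int(location), "new": None}
    raise ValueError(f"unrecognized mutation label: {label!r}")


labels = ["N501Y", "A570D", "P681H", "T716I", "S982A", "D1118H"]
labels += "ΔH69/ΔV70".split("/")  # split the paired deletion into two labels
for label in labels:
    print(label, parse_mutation(label))
```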
Pereson, M et al. (2020). Evolutionary analysis of SARS-CoV-2 spike protein for its different clades. bioRxiv. doi.org/10.1101/2020.11.24.396671.
Plante, J et al. (2020). Spike mutation D614G alters SARS-CoV-2 fitness. Nature. doi.org/10.1038/s41586-020-2895-3.
Tang, X et al. (2020). On the origin and continuing evolution of SARS-CoV-2. National Science Review, 7(6), 1012-1013. doi.org/10.1093/nsr/nwaa036.