Alignment of beta-coronavirus Spike proteins to a SARS-CoV structure

In this example, we align a set of beta-coronavirus Spike proteins to each other and to the protein structure of a SARS-CoV Spike.

The set of coronavirus Spike proteins are those that had their RBD domains studied by Lekto and Munster (2019), and are from the Genbank accessions in Supplementary figure 1 of their pre-print. The un-aligned set of Spike protein sequences is in input_files/beta_coronavirus_Spikes.fa. Here are the first few lines:

[1]:
! head -n 4 input_files/beta_coronavirus_Spikes.fa
>AY278741|SARS Urbani|human
MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFGNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFNCTFEYISDAFSLDVSEKSGNFKHLREFVFKNKDGFLYVYKGYQPIDVVRDLPSGFNTLKPIFKLPLGINITNFRAILTAFSPAQDIWGTSAAAYFVGYLKPTTFMLKYDENGTITDAVDCSQNPLAELKCSVKSFEIDKGIYQTSNFRVVPSGDVVRFPNITNLCPFGEVFNATKFPSVYAWERKKISNCVADYSVLYNSTFFSTFKCYGVSATKLNDLCFSNVYADSFVVKGDDVRQIAPGQTGVIADYNYKLPDDFMGCVLAWNTRNIDATSTGNYNYKYRYLRHGKLRPFERDISNVPFSPDGKPCTPPALNCYWPLNDYGFYTTTGIGYQPYRVVVLSFELLNAPATVCGPKLSTDLIKNQCVNFNFNGLTGTGVLTPSSKRFQPFQQFGRDVSDFTDSVRDPKTSEILDISPCSFGGVSVITPGTNASSEVAVLYQDVNCTDVSTAIHADQLTPAWRIYSTGNNVFQTQAGCLIGAEHVDTSYECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMSLGADSSIAYSNNTIAIPTNFSISITTEVMPVSMAKTSVDCNMYICGDSTECANLLLQYGSFCTQLNRALSGIAAEQDRNTREVFAQVKQMYKTPTLKYFGGFNFSQILPDPLKPTKRSFIEDLLFNKVTLADAGFMKQYGECLGDINARDLICAQKFNGLTVLPPLLTDDMIAAYTAALVSGTATAGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKQIANQFNKAISQIQESLTTTSTALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQAAPHGVVFLHVTYVPSQERNFTTAPAICHEGKAYFPREGVFVFNGTSWFITQRNFFSPQIITTDNTFVSGNCDVVIGIINNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYVWLGFIAGLIAIVMVTILLCCMTSCCSCLKGACSCGSCCKFDEDDSEPVLKGVKLHYT
>KF367457|WIV1|bat
MKLLVLVFATLVSSYTIEKCLDFDDRTPPANTQFLSSHRGVYYPDDIFRSNVLHLVQDHFLPFDSNVTRFITFGLNFDNPIIPFKDGIYFAATEKSNVIRGWVFGSTMNNKSQSVIIMNNSTNLVIRACNFELCDNPFFVVLKSNNTQIPSYIFNNAFNCTFEYVSKDFNLDLGEKPGNFKDLREFVFRNKDGFLHVYSGYQPISAASGLPTGFNALKPIFKLPLGINITNFRTLLTAFPPRPDYWGTSAAAYFVGYLKPTTFMLKYDENGTITDAVDCSQNPLAELKCSVKSFEIDKGIYQTSNFRVAPSKEVVRFPNITNLCPFGEVFNATTFPSVYAWERKRISNCVADYSVLYNSTSFSTFKCYGVSATKLNDLCFSNVYADSFVVKGDDVRQIAPGQTGVIADYNYKLPDDFTGCVLAWNTRNIDATQTGNYNYKYRSLRHGKLRPFERDISNVPFSPDGKPCTPPAFNCYWPLNDYGFYITNGIGYQPYRVVVLSFELLNAPATVCGPKLSTDLIKNQCVNFNFNGLTGTGVLTPSSKRFQPFQQFGRDVSDFTDSVRDPKTSEILDISPCSFGGVSVITPGTNTSSEVAVLYQDVNCTDVPVAIHADQLTPSWRVHSTGNNVFQTQAGCLIGAEHVDTSYECDIPIGAGICASYHTVSSLRSTSQKSIVAYTMSLGADSSIAYSNNTIAIPTNFSISITTEVMPVSMAKTSVDCNMYICGDSTECANLLLQYGSFCTQLNRALSGIAVEQDRNTREVFAQVKQMYKTPTLKDFGGFNFSQILPDPLKPTKRSFIEDLLFNKVTLADAGFMKQYGECLGDINARDLICAQKFNGLTVLPPLLTDDMIAAYTAALVSGTATAGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKQIANQFNKAISQIQESLTTTSTALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQAAPHGVVFLHVTYVPSQERNFTTAPAICHEGKAYFPREGVFVFNGTSWFITQRNFFSPQIITTDNTFVSGSCDVVIGIINNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYVWLGFIAGLIAIVMVTILLCCMTSCCSCLKGACSCGSCCKFDEDDSEPVLKGVKLHYT

We would like to align these Spike proteins to each and to the protein chains in a cryo-EM structure of the SARS-CoV Spike, PDB 6crv. That PDB structure is at input_files/6crv.pdb, and has three chains (A, B, and C) each corresponding to monomers in the Spike trimer. Here are the first few lines of the PDB file:

[2]:
! head -n 8 input_files/6crv.pdb
HEADER    VIRAL PROTEIN                           19-MAR-18   6CRV
TITLE     SARS SPIKE GLYCOPROTEIN, STABILIZED VARIANT, C3 SYMMETRY
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: SPIKE GLYCOPROTEIN,FIBRITIN;
COMPND   3 CHAIN: A, B, C;
COMPND   4 SYNONYM: S GLYCOPROTEIN,E2,PEPLOMER PROTEIN;
COMPND   5 ENGINEERED: YES;
COMPND   6 MUTATION: YES

Finally, we need to choose a “reference” sequence in the alignment. All gaps in the alignment are stripped relative to this reference, and in addition to the PDB numbering we will get sites in the reference in 1, 2, … numbering. Since we are most interested in the 2019-nCoV from Wuhan, we will choose that as our reference sequence. It’s sequence header in input_files/beta_coronavirus_Spikes.fa contains the unique substring Wuhan-CoV, which we will use to identify it.

Now we run pdb_prot_align, sending the output files to the subdirectory ./output_files/ (which needs to have already been created):

[3]:
! pdb_prot_align --protsfile input_files/beta_coronavirus_Spikes.fa \
                 --refprot_regex Wuhan-CoV \
                 --pdbfile input_files/6crv.pdb \
                 --chain_ids A B C \
                 --outprefix output_files/beta_coronavirus_Spike

Running `pdb_prot_align` 0.5.0

Parsing PDB input_files/6crv.pdb chains A B C
For chain A, parsed 881 residues, ranging from 18 to 1120 in PDB numbering.
For chain B, parsed 881 residues, ranging from 18 to 1120 in PDB numbering.
For chain C, parsed 881 residues, ranging from 18 to 1120 in PDB numbering.

Read 30 sequences from input_files/beta_coronavirus_Spikes.fa
Reference protein is of length 1273 and has the following header:
MN908947|Wuhan-CoV|human

Using `mafft` to align sequences to output_files/beta_coronavirus_Spike_unstripped_alignment.fa
Stripping gaps relative to reference MN908947|Wuhan-CoV|human
Dropping PDB chains from alignment
Writing gap-stripped alignment to output_files/beta_coronavirus_Spike_alignment.fa

Writing CSV with detailed information to output_files/beta_coronavirus_Spike_sites.csv

Program complete.

The alignment output file (output_files/beta_coronavirus_Spike_alignment.fa) has the Spike alignment with all gaps stripped relative to the reference sequence. It does not include the PDB protein (you can change that with the --drop_pdb option; also there is a file with the PDB and no gaps stripped at output_files/beta_coronavirus_Spike_unstripped_alignment.fa). Here are the first few lines:

[4]:
! head -n 4 output_files/beta_coronavirus_Spike_alignment.fa
>AY278741|SARS Urbani|human
MFIFLLFLTL--DRCTTFDDVQAPNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHT-------FGNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHT----MIFDNAFNCTFEYISDAFSLDVSEKSGNFKHLREFVFKNKDGFLYVYKGYQPIDVVRDLPSGFNTLKPIFKLPLGINITNFRAILTAFS------PAQDIWGTSAAAYFVGYLKPTTFMLKYDENGTITDAVDCSQNPLAELKCSVKSFEIDKGIYQTSNFRVVPSGDVVRFPNITNLCPFGEVFNATKFPSVYAWERKKISNCVADYSVLYNSTFFSTFKCYGVSATKLNDLCFSNVYADSFVVKGDDVRQIAPGQTGVIADYNYKLPDDFMGCVLAWNTRNIDATSTGNYNYKYRYLRHGKLRPFERDISNVPFSPDGKPCTPP-ALNCYWPLNDYGFYTTTGIGYQPYRVVVLSFELLNAPATVCGPKLSTDLIKNQCVNFNFNGLTGTGVLTPSSKRFQPFQQFGRDVSDFTDSVRDPKTSEILDISPCSFGGVSVITPGTNASSEVAVLYQDVNCTDVSTAIHADQLTPAWRIYSTGNNVFQTQAGCLIGAEHVDTSYECDIPIGAGICASYHTVSL----LRSTSQKSIVAYTMSLGADSSIAYSNNTIAIPTNFSISITTEVMPVSMAKTSVDCNMYICGDSTECANLLLQYGSFCTQLNRALSGIAAEQDRNTREVFAQVKQMYKTPTLKYFGGFNFSQILPDPLKPTKRSFIEDLLFNKVTLADAGFMKQYGECLGDINARDLICAQKFNGLTVLPPLLTDDMIAAYTAALVSGTATAGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKQIANQFNKAISQIQESLTTTSTALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQAAPHGVVFLHVTYVPSQERNFTTAPAICHEGKAYFPREGVFVFNGTSWFITQRNFFSPQIITTDNTFVSGNCDVVIGIINNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYVWLGFIAGLIAIVMVTILLCCMTSCCSCLKGACSCGSCCKFDEDDSEPVLKGVKLHYT
>KF367457|WIV1|bat
KLLVLVFATL--EKCLDFDDRTPPANTQFLSSHRGVYYPDDIFRSNVLHLVQDHFLPFDSNVTRFITFGLN-------FDNPIIPFKDGIYFAATEKSNVIRGWVFGSTMNNKSQSVIIMNNSTNLVIRACNFELCDNPFFVVLKSNNTQIPS----YIFNNAFNCTFEYVSKDFNLDLGEKPGNFKDLREFVFRNKDGFLHVYSGYQPISAASGLPTGFNALKPIFKLPLGINITNFRTLLTAFP------PRPDYWGTSAAAYFVGYLKPTTFMLKYDENGTITDAVDCSQNPLAELKCSVKSFEIDKGIYQTSNFRVAPSKEVVRFPNITNLCPFGEVFNATTFPSVYAWERKRISNCVADYSVLYNSTSFSTFKCYGVSATKLNDLCFSNVYADSFVVKGDDVRQIAPGQTGVIADYNYKLPDDFTGCVLAWNTRNIDATQTGNYNYKYRSLRHGKLRPFERDISNVPFSPDGKPCTPP-AFNCYWPLNDYGFYITNGIGYQPYRVVVLSFELLNAPATVCGPKLSTDLIKNQCVNFNFNGLTGTGVLTPSSKRFQPFQQFGRDVSDFTDSVRDPKTSEILDISPCSFGGVSVITPGTNTSSEVAVLYQDVNCTDVPVAIHADQLTPSWRVHSTGNNVFQTQAGCLIGAEHVDTSYECDIPIGAGICASYHTVSS----LRSTSQKSIVAYTMSLGADSSIAYSNNTIAIPTNFSISITTEVMPVSMAKTSVDCNMYICGDSTECANLLLQYGSFCTQLNRALSGIAVEQDRNTREVFAQVKQMYKTPTLKDFGGFNFSQILPDPLKPTKRSFIEDLLFNKVTLADAGFMKQYGECLGDINARDLICAQKFNGLTVLPPLLTDDMIAAYTAALVSGTATAGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKQIANQFNKAISQIQESLTTTSTALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQAAPHGVVFLHVTYVPSQERNFTTAPAICHEGKAYFPREGVFVFNGTSWFITQRNFFSPQIITTDNTFVSGSCDVVIGIINNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYVWLGFIAGLIAIVMVTILLCCMTSCCSCLKGACSCGSCCKFDEDDSEPVLKGVKLHYT

However, the really “precious” information is in the output CSV file, output_files/beta_coronavirus_Spike_sites.csv. Here are the first few lines of that file:

[5]:
! head -n 15 output_files/beta_coronavirus_Spike_sites.csv
isite,wildtype,pdb_chain,pdb_site,pdb_wildtype,entropy,n_effective,amino_acid,frequency
1,M,,,,1.90540,3.74613,A,0.00000
1,M,,,,1.90540,3.74613,C,0.00000
1,M,,,,1.90540,3.74613,D,0.00000
1,M,,,,1.90540,3.74613,E,0.00000
1,M,,,,1.90540,3.74613,F,0.03846
1,M,,,,1.90540,3.74613,G,0.00000
1,M,,,,1.90540,3.74613,H,0.00000
1,M,,,,1.90540,3.74613,I,0.46154
1,M,,,,1.90540,3.74613,K,0.15385
1,M,,,,1.90540,3.74613,L,0.00000
1,M,,,,1.90540,3.74613,M,0.26923
1,M,,,,1.90540,3.74613,N,0.00000
1,M,,,,1.90540,3.74613,P,0.00000
1,M,,,,1.90540,3.74613,Q,0.00000

This CSV file has the following columns: - isite: 1, 2, … numbering of our reference sequence (in this case, the Spike fof the 2019-nCoV from Wuhan) - wildtype: wildtype residue in the reference sequence - pdb_chain: the PDB chain(s) to which the residue aligns, if any (note PDB chains are in tidy format). - pdb_site: the PDB site to which the residue aligns in PDB numbering, if any - pdb_wildtype: the wildtype residue in the PDB, if any - entropy: entropy at site in alignment, in bits. - n_effective: number of effective amino acids at site - amino_acid: amino-acid identity, in tidy format - frequency: frequency of amino-acid in alignment.

The entropy / effective amino acids includes all sequences in the alignment except the PDB chains (see the --drop_pdb option). Gaps are not included in these calculations by default (see the --ignore_gaps option).

Here are some lines more in the middle of the file, where there is PDB information:

[6]:
! head -n 2010 output_files/beta_coronavirus_Spike_sites.csv | tail -n 35
42,V,B,46,I,0.21084,1.15736,R,0.00000
42,V,B,46,I,0.21084,1.15736,S,0.00000
42,V,B,46,I,0.21084,1.15736,T,0.00000
42,V,B,46,I,0.21084,1.15736,V,0.03333
42,V,B,46,I,0.21084,1.15736,W,0.00000
42,V,B,46,I,0.21084,1.15736,Y,0.00000
42,V,C,46,I,0.21084,1.15736,A,0.00000
42,V,C,46,I,0.21084,1.15736,C,0.00000
42,V,C,46,I,0.21084,1.15736,D,0.00000
42,V,C,46,I,0.21084,1.15736,E,0.00000
42,V,C,46,I,0.21084,1.15736,F,0.00000
42,V,C,46,I,0.21084,1.15736,G,0.00000
42,V,C,46,I,0.21084,1.15736,H,0.00000
42,V,C,46,I,0.21084,1.15736,I,0.96667
42,V,C,46,I,0.21084,1.15736,K,0.00000
42,V,C,46,I,0.21084,1.15736,L,0.00000
42,V,C,46,I,0.21084,1.15736,M,0.00000
42,V,C,46,I,0.21084,1.15736,N,0.00000
42,V,C,46,I,0.21084,1.15736,P,0.00000
42,V,C,46,I,0.21084,1.15736,Q,0.00000
42,V,C,46,I,0.21084,1.15736,R,0.00000
42,V,C,46,I,0.21084,1.15736,S,0.00000
42,V,C,46,I,0.21084,1.15736,T,0.00000
42,V,C,46,I,0.21084,1.15736,V,0.03333
42,V,C,46,I,0.21084,1.15736,W,0.00000
42,V,C,46,I,0.21084,1.15736,Y,0.00000
43,F,A,47,F,0.46900,1.38415,A,0.00000
43,F,A,47,F,0.46900,1.38415,C,0.00000
43,F,A,47,F,0.46900,1.38415,D,0.00000
43,F,A,47,F,0.46900,1.38415,E,0.00000
43,F,A,47,F,0.46900,1.38415,F,0.90000
43,F,A,47,F,0.46900,1.38415,G,0.00000
43,F,A,47,F,0.46900,1.38415,H,0.00000
43,F,A,47,F,0.46900,1.38415,I,0.00000
43,F,A,47,F,0.46900,1.38415,K,0.00000

The information in this file is extremely useful if you want to plot evolutionary conservation or amino-acid identities onto the structure. But do that, you need to use a protein viewer such as pymol or nglview. Some of the other examples show how to do that.

Note that you can also drop the reference protein from the alignment with the --drop_refprot option:

[7]:
! pdb_prot_align --protsfile input_files/beta_coronavirus_Spikes.fa \
                 --refprot_regex Wuhan-CoV \
                 --pdbfile input_files/6crv.pdb \
                 --chain_ids A B C \
                 --outprefix output_files/beta_coronavirus_Spike_drop_refprot \
                 --drop_refprot True \
                 --ignore_gaps False

Running `pdb_prot_align` 0.5.0

Parsing PDB input_files/6crv.pdb chains A B C
For chain A, parsed 881 residues, ranging from 18 to 1120 in PDB numbering.
For chain B, parsed 881 residues, ranging from 18 to 1120 in PDB numbering.
For chain C, parsed 881 residues, ranging from 18 to 1120 in PDB numbering.

Read 30 sequences from input_files/beta_coronavirus_Spikes.fa
Reference protein is of length 1273 and has the following header:
MN908947|Wuhan-CoV|human

Using `mafft` to align sequences to output_files/beta_coronavirus_Spike_drop_refprot_unstripped_alignment.fa
Stripping gaps relative to reference MN908947|Wuhan-CoV|human
Dropping PDB chains from alignment
Dropping reference protein from alignment
Writing gap-stripped alignment to output_files/beta_coronavirus_Spike_drop_refprot_alignment.fa

Writing CSV with detailed information to output_files/beta_coronavirus_Spike_drop_refprot_sites.csv

Program complete.

[8]:
! grep -c '>' output_files/beta_coronavirus_Spike_drop_refprot_alignment.fa
29
[9]:
! head -n 15 output_files/beta_coronavirus_Spike_drop_refprot_sites.csv
isite,wildtype,pdb_chain,pdb_site,pdb_wildtype,entropy,n_effective,amino_acid,frequency
1,M,,,,2.21904,4.65583,-,0.13793
1,M,,,,2.21904,4.65583,A,0.00000
1,M,,,,2.21904,4.65583,C,0.00000
1,M,,,,2.21904,4.65583,D,0.00000
1,M,,,,2.21904,4.65583,E,0.00000
1,M,,,,2.21904,4.65583,F,0.03448
1,M,,,,2.21904,4.65583,G,0.00000
1,M,,,,2.21904,4.65583,H,0.00000
1,M,,,,2.21904,4.65583,I,0.41379
1,M,,,,2.21904,4.65583,K,0.13793
1,M,,,,2.21904,4.65583,L,0.00000
1,M,,,,2.21904,4.65583,M,0.20690
1,M,,,,2.21904,4.65583,N,0.00000
1,M,,,,2.21904,4.65583,P,0.00000