Influenza HAs (H3, H4, H14 subtypes)

In this example, we align a small set of influenza hemagglutinins (HAs): one each from the H3, H4, and H14 subtypes.

The unaligned HA sequences are in input_files/HA_H3_H4_H14.fa:

[1]:
! cat input_files/HA_H3_H4_H14.fa
>cds:CAA29337 A/England/321/1977 1977// HA
MKTIIALSYIFCQVLAQNLPGNDNSTATLCLAHHAVPNGTLVKTITNDQIEVTNATELVQSSSTGRICDSPHRILDGKNCTLIDALLGDPHCDGFQNEKWDLFVERSKAFSNCYPYDVPDYASLRSLVASSGTLEFINEGFNWTGVTQNGGSYACKRGPDNSFFSRLNWLYKSESTYPVLNVTMPNNDNFDKLYIWGVHHPSTDKEQTKLYVQASGRVTVSTKRSQQTIIPNVGSRPWVRGLSSRISIYWTIVKPGDILLINSNGNLIAPRGYFKIRTGKSSIMRSDAPIGTCSSECITPNGSIPNDKPFQNVNKITYGACPKYVKQNTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMIDGWYGFRHQNSEGTGQAADLKSTQAAIDQINGKLNRVIEKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTRRQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVVLLGFIMWACQKGNIRCNICI
>cds:BAA14332 A/duck/Czechoslovakia/1956 1956// HA
MLSIVILFLLIAENSSQNYTGNPVICMGHHAVANGTMVKTLADDQVEVVTAQELVESQNLPELCPSPLRLVDGQTCDIINGALGSPGCDHLNGAEWDVFIERPNAVDTCYPFDVPEYQSLRSILANNGKFEFIAEEFQWNTVKQNGKSGACKRANVDDFFNRLNWLVKSDGNAYPLQNLTKINNGDYARLYIWGVHHPSTSTEQTNLYKNNPGRVTVSTKTSQTSVVPDIGSRPLVRGQSGRVSFYWTIVEPGDLIVFNTIGNLIAPRGHYKLNNQKKSTILNTAIPIGSCVSKCHTDKGSLSTTKPFQNISRIAVGDCPRYVKQGSLKLATGMRNIPEKASRGLFGAIAGFIENGWQGLIDGWYGFRHQNAEGTGTAADLKSTQAAIDQINGKLNRLIEKTNDKYHQIEKEFEQVEGRIQDLENYVEDTKIDLWSYNAELLVALENQHTIDVTDSEMNKLFERVRRQLRENAEDKGNGCFEIFHKCDNNCIESIRNGTYDHDIYRDEAINNRFQIQGVKLTQGYKDIILWISFSISCFLLVALLLAFILWACQNGNIRCQICI
>cds:ABI84453 A/mallard/Astrakhan/263/1982 1982// HA
MIALILVALALSHTAYSQITNGTTGNPIICLGHHAVENGTSVKTLTDNHVEVVSAKELVETNHTDELCPSPLKLVDGQDCDLINGALGSPGCDRLQDTTWDVFIERPTAVDTCYPFDVPDYQSLRSILASSGSLEFIAEQFTWNGVKVDGSSSACLRGGRNSFFSRLNWLTKATNGNYGPINVTKENTGSYVRLYLWGVHHPSSDNEQTDLYKVATGRVTVSTRSDQISIVPNIGSRPRVRNQSGRISIYWTLVNPGDSIIFNSIGNLIAPRGHYKISKSTKSTVLKSDKRIGSCTSPCLTDKGSIQSDKPFQNVSRIAIGNCPKYVKQGSLMLATGMRNIPGKQAKGLFGAIAGFIENGWQGLIDGWYGFRHQNAEGTGTAADLKSTQAAIDQINGKLNRLIEKTNEKYHQIEKEFEQVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDVTDSEMNKLFERVRRQLRENAEDQGNGCFEIFHQCDNNCIESIRNGTYDHNIYRDEAINNRIKINPVTLTMGYKDIILWISFSMSCFVFVALILGFVLWACQNGNIRCQICI
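As a quick sanity check on an input file like this one, the records can be read and their lengths printed. The following is a minimal sketch using only the standard library. The `read_fasta` helper and the toy records are hypothetical and are not part of pdb_prot_align.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy records (hypothetical, not the real HA sequences) keep the example short.
toy = StringIO(">seq1 toy\nMKTII\nALSY\n>seq2 toy\nMLSIV\n")
records = list(read_fasta(toy))
for header, seq in records:
    print(header, len(seq))
```

To run this on the real input, replace `toy` with `open("input_files/HA_H3_H4_H14.fa")`.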

We would like to align these HA proteins to each other and to the protein chains of a trimer of the H3 HA in PDB 4o5n. The PDB entry contains only a monomer, so a full trimer was generated using makemultimer.py and saved as input_files/4o5n_trimer.pdb. Chains A, C, and E correspond to HA1, and chains B, D, and F correspond to HA2. Here are the first few lines of the PDB file:

[2]:
! head -n 13 input_files/4o5n_trimer.pdb
REMARK  Multimer expanded from BIOMT matrix in pdb file 4O5N
REMARK  by MakeMultimer.py (watcut.uwaterloo.ca/makemultimer)
REMARK
REMARK  -------------------------------------------------------------
REMARK  Chain  original  1st resid.  last resid.  1st atom  last atom
REMARK  -------------------------------------------------------------
REMARK      A         A           9          325         1       2498
REMARK      C         A           9          325         1       2498
REMARK      E         A           9          325         1       2498
REMARK      B         B           1          173         1       1431
REMARK      D         B           1          173         1       1431
REMARK      F         B           1          173         1       1431
REMARK  -------------------------------------------------------------
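The REMARK table above records which original chain each trimer chain was copied from. A minimal sketch of reading that mapping programmatically follows. It assumes the column layout shown above (chain, original chain, then four integers). The regular expression and variable names are illustrative only.

```python
import re
from collections import defaultdict

# Header lines copied from the output above.
remark_lines = """\
REMARK  Chain  original  1st resid.  last resid.  1st atom  last atom
REMARK  -------------------------------------------------------------
REMARK      A         A           9          325         1       2498
REMARK      C         A           9          325         1       2498
REMARK      E         A           9          325         1       2498
REMARK      B         B           1          173         1       1431
REMARK      D         B           1          173         1       1431
REMARK      F         B           1          173         1       1431
REMARK  -------------------------------------------------------------
""".splitlines()

# One-letter chain, one-letter original chain, then four integer columns.
row = re.compile(r"^REMARK\s+(\S)\s+(\S)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s*$")

copies = defaultdict(list)  # original chain -> chains derived from it
residue_range = {}          # chain -> (first residue, last residue)
for line in remark_lines:
    m = row.match(line)
    if m is None:
        continue  # column titles and separator lines do not match
    chain, original, first, last = m.group(1), m.group(2), int(m.group(3)), int(m.group(4))
    copies[original].append(chain)
    residue_range[chain] = (first, last)

print(dict(copies))
print(residue_range["A"], residue_range["B"])
```

The grouping matches the text: the three copies of original chain A are the HA1 chains, and the three copies of original chain B are the HA2 chains.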

For our reference sequence in the alignment, we choose the H3 HA from A/England/321/1977.

Run pdb_prot_align, sending the output files to the subdirectory ./output_files/ (which must already exist). We add the --reorder option to our call of mafft:

[3]:
! pdb_prot_align --protsfile input_files/HA_H3_H4_H14.fa \
                 --refprot_regex A/England/321/1977 \
                 --pdbfile input_files/4o5n_trimer.pdb \
                 --chain_ids A B C D E F \
                 --outprefix output_files/HA_H3_H4_H14 \
                 --mafft "mafft --reorder"

Running `pdb_prot_align` 0.5.0

Parsing PDB input_files/4o5n_trimer.pdb chains A B C D E F
For chain A, parsed 317 residues, ranging from 9 to 325 in PDB numbering.
For chain B, parsed 173 residues, ranging from 1 to 173 in PDB numbering.
For chain C, parsed 317 residues, ranging from 9 to 325 in PDB numbering.
For chain D, parsed 173 residues, ranging from 1 to 173 in PDB numbering.
For chain E, parsed 317 residues, ranging from 9 to 325 in PDB numbering.
For chain F, parsed 173 residues, ranging from 1 to 173 in PDB numbering.

Read 3 sequences from input_files/HA_H3_H4_H14.fa
Reference protein is of length 566 and has the following header:
cds:CAA29337 A/England/321/1977 1977// HA

Using `mafft` to align sequences to output_files/HA_H3_H4_H14_unstripped_alignment.fa
Stripping gaps relative to reference cds:CAA29337 A/England/321/1977 1977// HA
Dropping PDB chains from alignment
Writing gap-stripped alignment to output_files/HA_H3_H4_H14_alignment.fa

Writing CSV with detailed information to output_files/HA_H3_H4_H14_sites.csv

Program complete.
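The --refprot_regex argument in the call above identifies the reference by matching its FASTA header. The sketch below is not the tool's own code. It only illustrates why the pattern should match exactly one header, under the assumption that matching behaves like Python's `re.search`. The `find_reference` helper is hypothetical.

```python
import re

# Headers copied from the input FASTA file shown earlier.
headers = [
    "cds:CAA29337 A/England/321/1977 1977// HA",
    "cds:BAA14332 A/duck/Czechoslovakia/1956 1956// HA",
    "cds:ABI84453 A/mallard/Astrakhan/263/1982 1982// HA",
]


def find_reference(headers, pattern):
    """Return the single header matched by `pattern`, or raise if ambiguous."""
    matches = [h for h in headers if re.search(pattern, h)]
    if len(matches) != 1:
        raise ValueError(f"{pattern!r} matched {len(matches)} headers, expected 1")
    return matches[0]


reference = find_reference(headers, "A/England/321/1977")
print(reference)

# A pattern such as "HA" appears in every header, so it is ambiguous.
try:
    find_reference(headers, "HA")
except ValueError as err:
    print(err)
```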

The alignment output file (output_files/HA_H3_H4_H14_alignment.fa) has the HA alignment with all gaps stripped relative to the reference sequence:

[4]:
! cat output_files/HA_H3_H4_H14_alignment.fa
>cds:CAA29337 A/England/321/1977 1977// HA
MKTIIALSYIFCQVLAQNLPGNDNSTATLCLAHHAVPNGTLVKTITNDQIEVTNATELVQSSSTGRICDSPHRILDGKNCTLIDALLGDPHCDGFQNEKWDLFVERSKAFSNCYPYDVPDYASLRSLVASSGTLEFINEGFNWTGVTQNGGSYACKRGPDNSFFSRLNWLYKSESTYPVLNVTMPNNDNFDKLYIWGVHHPSTDKEQTKLYVQASGRVTVSTKRSQQTIIPNVGSRPWVRGLSSRISIYWTIVKPGDILLINSNGNLIAPRGYFKIRTGKSSIMRSDAPIGTCSSECITPNGSIPNDKPFQNVNKITYGACPKYVKQNTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMIDGWYGFRHQNSEGTGQAADLKSTQAAIDQINGKLNRVIEKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTRRQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVVLLGFIMWACQKGNIRCNICI
>cds:BAA14332 A/duck/Czechoslovakia/1956 1956// HA
MLSIVILFLLIAENSSQNYTGN----PVICMGHHAVANGTMVKTLADDQVEVVTAQELVESQNLPELCPSPLRLVDGQTCDIINGALGSPGCDHLNGAEWDVFIERPNAVDTCYPFDVPEYQSLRSILANNGKFEFIAEEFQWNTVKQNGKSGACKRANVDDFFNRLNWLVKSDNAYPLQNLTKINNGDYARLYIWGVHHPSTSTEQTNLYKNNPGRVTVSTKTSQTSVVPDIGSRPLVRGQSGRVSFYWTIVEPGDLIVFNTIGNLIAPRGHYKLNNKKSTILNTAIPIGSCVSKCHTDKGSLSTTKPFQNISRIAVGDCPRYVKQGSLKLATGMRNIPEKASRGLFGAIAGFIENGWQGLIDGWYGFRHQNAEGTGTAADLKSTQAAIDQINGKLNRLIEKTNDKYHQIEKEFEQVEGRIQDLENYVEDTKIDLWSYNAELLVALENQHTIDVTDSEMNKLFERVRRQLRENAEDKGNGCFEIFHKCDNNCIESIRNGTYDHDIYRDEAINNRFQIQGVKLTQGYKDIILWISFSISCFLLVALLLAFILWACQNGNIRCQICI
>cds:ABI84453 A/mallard/Astrakhan/263/1982 1982// HA
MIALILVALALSHTATNGTTGN----PIICLGHHAVENGTSVKTLTDNHVEVVSAKELVETNHTDELCPSPLKLVDGQDCDLINGALGSPGCDRLQDTTWDVFIERPTAVDTCYPFDVPDYQSLRSILASSGSLEFIAEQFTWNGVKVDGSSSACLRGGRNSFFSRLNWLTKATGNYGPINVTKENTGSYVRLYLWGVHHPSSDNEQTDLYKVATGRVTVSTRSDQISIVPNIGSRPRVRNQSGRISIYWTLVNPGDSIIFNSIGNLIAPRGHYKISKTKSTVLKSDKRIGSCTSPCLTDKGSIQSDKPFQNVSRIAIGNCPKYVKQGSLMLATGMRNIPGKQAKGLFGAIAGFIENGWQGLIDGWYGFRHQNAEGTGTAADLKSTQAAIDQINGKLNRLIEKTNEKYHQIEKEFEQVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDVTDSEMNKLFERVRRQLRENAEDQGNGCFEIFHQCDNNCIESIRNGTYDHNIYRDEAINNRIKINPVTLTMGYKDIILWISFSMSCFVFVALILGFVLWACQNGNIRCQICI
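Stripping gaps relative to the reference means dropping every alignment column in which the reference has a gap, so that the remaining columns correspond one-to-one to reference sites. The following is a minimal sketch of that idea on a toy alignment. The function name and the toy sequences are hypothetical, and this is not necessarily how pdb_prot_align implements the step.

```python
def strip_gaps_to_reference(aligned, ref_name, gap="-"):
    """Drop every alignment column where the reference sequence has a gap."""
    ref = aligned[ref_name]
    if any(len(seq) != len(ref) for seq in aligned.values()):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns to keep: those where the reference has a residue.
    keep = [i for i, char in enumerate(ref) if char != gap]
    return {name: "".join(seq[i] for i in keep) for name, seq in aligned.items()}


# Toy alignment (hypothetical): column 2 is an insertion relative to "ref".
toy_alignment = {
    "ref":  "MK-TIA",
    "seq2": "MLSIV-",
    "seq3": "M-AT-A",
}
stripped = strip_gaps_to_reference(toy_alignment, "ref")
for name, seq in stripped.items():
    print(name, seq)
```

After stripping, the reference has no gaps, while other sequences can still contain gaps where they lack a residue that the reference has.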

However, the most useful information is in the output CSV file, output_files/HA_H3_H4_H14_sites.csv, which has one row per site and amino acid. Here are some lines of that file:

[5]:
! head -n 5000 output_files/HA_H3_H4_H14_sites.csv | tail -n 35
99,K,E,83,K,1.58496,3.00000,F,0.00000
99,K,E,83,K,1.58496,3.00000,G,0.00000
99,K,E,83,K,1.58496,3.00000,H,0.00000
99,K,E,83,K,1.58496,3.00000,I,0.00000
99,K,E,83,K,1.58496,3.00000,K,0.33333
99,K,E,83,K,1.58496,3.00000,L,0.00000
99,K,E,83,K,1.58496,3.00000,M,0.00000
99,K,E,83,K,1.58496,3.00000,N,0.00000
99,K,E,83,K,1.58496,3.00000,P,0.00000
99,K,E,83,K,1.58496,3.00000,Q,0.00000
99,K,E,83,K,1.58496,3.00000,R,0.00000
99,K,E,83,K,1.58496,3.00000,S,0.00000
99,K,E,83,K,1.58496,3.00000,T,0.33333
99,K,E,83,K,1.58496,3.00000,V,0.00000
99,K,E,83,K,1.58496,3.00000,W,0.00000
99,K,E,83,K,1.58496,3.00000,Y,0.00000
100,W,A,84,W,0.00000,1.00000,A,0.00000
100,W,A,84,W,0.00000,1.00000,C,0.00000
100,W,A,84,W,0.00000,1.00000,D,0.00000
100,W,A,84,W,0.00000,1.00000,E,0.00000
100,W,A,84,W,0.00000,1.00000,F,0.00000
100,W,A,84,W,0.00000,1.00000,G,0.00000
100,W,A,84,W,0.00000,1.00000,H,0.00000
100,W,A,84,W,0.00000,1.00000,I,0.00000
100,W,A,84,W,0.00000,1.00000,K,0.00000
100,W,A,84,W,0.00000,1.00000,L,0.00000
100,W,A,84,W,0.00000,1.00000,M,0.00000
100,W,A,84,W,0.00000,1.00000,N,0.00000
100,W,A,84,W,0.00000,1.00000,P,0.00000
100,W,A,84,W,0.00000,1.00000,Q,0.00000
100,W,A,84,W,0.00000,1.00000,R,0.00000
100,W,A,84,W,0.00000,1.00000,S,0.00000
100,W,A,84,W,0.00000,1.00000,T,0.00000
100,W,A,84,W,0.00000,1.00000,V,0.00000
100,W,A,84,W,0.00000,1.00000,W,1.00000
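The numbers in these rows can be checked by hand. The header row is not shown here, so the labels used below are an assumption: the sixth and seventh fields are read as a per-site entropy and an effective number of amino acids, and the last two as an amino acid and its frequency. Under that reading, the values are consistent with Shannon entropy in bits and with 2 raised to that entropy, as this sketch shows.

```python
import math


def site_entropy_bits(freqs):
    """Shannon entropy in bits of a list of amino-acid frequencies."""
    return sum(-p * math.log2(p) for p in freqs if p > 0)


def n_effective(freqs):
    """Effective number of amino acids, taken here as 2 ** entropy."""
    return 2 ** site_entropy_bits(freqs)


# Site 99: K and T each appear at frequency 0.33333 in the excerpt. The
# remaining third is assumed to belong to one amino acid in rows not shown.
site_99 = [1 / 3, 1 / 3, 1 / 3]
# Site 100: W is the only amino acid with nonzero frequency.
site_100 = [1.0]

for name, freqs in [("site 99", site_99), ("site 100", site_100)]:
    print(name, round(site_entropy_bits(freqs), 5), round(n_effective(freqs), 5))
```

The rounded results agree with the 1.58496 and 3.00000 listed for site 99 and the 0.00000 and 1.00000 listed for site 100.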