utils

Utilities for aligning proteins to PDB.

pdb_prot_align.utils.align_prots_mafft(prots, *, mafft='mafft')[source]

Align protein sequences with mafft.

Parameters
  • prots (list) – List of sequences as Biopython.SeqRecord.SeqRecord objects.

  • mafft (str) – Command that resolves to mafft executable, potentially with arbitrary additional options, such as mafft --reorder.

Returns

The protein alignment.

Return type

Bio.Align.MultipleSeqAlignment

pdb_prot_align.utils.aligned_site_map(aln, heads)[source]

Get mapping between sequential site numbers of aligned sequences.

Parameters
  • aln (Bio.Align.MultipleSeqAlignment) – Alignment of sequences

  • heads (list) – Headers of sequences of interest in aln, must be at least two.

Returns

Columns are sequence names in heads. There is a row for alignment positions that are non-gapped in at least one sequence in heads, and the entry in the row gives the number of that site in 1, 2, … numbering of the sequence, or NaN if there is no residue for sequence at that position in alignment.

Return type

pandas.DataFrame

Example

Create and parse an alignment:

>>> with tempfile.TemporaryFile('w+') as f:
...     _ = f.write(textwrap.dedent(
...             '''
...             >seq1
...             GK-AC-L
...             >seq2
...             -K-AM-L
...             >seq3
...             G-NAC-L
...             >seq4
...             GK-ACSI
...             '''
...             ))
...     f.flush()
...     _ = f.seek(0)
...     aln = Bio.AlignIO.read(f, 'fasta')

Show the alignment site map. The gymnastics with converting to str and replacing ‘nan’ are to make output format the same in both pandas >= and < 1.0 and earlier versions so doctest passes in both:

>>> (aligned_site_map(aln, ['seq1', 'seq2', 'seq3'])
...  .astype(str).replace('nan', '<NA>'))
   seq1  seq2  seq3
0     1  <NA>     1
1     2     1  <NA>
2  <NA>  <NA>     2
3     3     2     3
4     4     3     4
5     5     4     5
pdb_prot_align.utils.alignment_to_count_df(fasta_alignment, *, ignore_gaps=False, as_freqs=False, add_entropy_neff=False)[source]

Convert FASTA alignment to counts data frame.

Parameters
  • fasta_alignment (str) – FASTA file with alignment.

  • ignore_gaps (bool) – Ignore gap characters (‘-‘) in counts.

  • as_freqs (bool) – Return frequencies rather than counts.

  • add_entropy_neff (bool) – Add columns with site entropy (bits) and number effective amino acids. Requires as_freqs to be True.

Returns

First column (‘isite’) is 1, 2, … site numbering. Other columns give counts of each observed alignment character at that site.

Return type

pandas.DataFrame

Example

>>> with tempfile.NamedTemporaryFile('w+') as f:
...     _ = f.write(textwrap.dedent(
...             '''
...             >seq1
...             GK-AC-L
...             >seq2
...             -K-AM-L
...             >seq3
...             G-NAC-L
...             >seq4
...             GK-ACSI
...             '''
...             ))
...     f.flush()
...     counts = alignment_to_count_df(f.name)
...     counts_nogaps = alignment_to_count_df(f.name, ignore_gaps=True)
...     freqs_nogaps = alignment_to_count_df(f.name,
...                                          ignore_gaps=True,
...                                          as_freqs=True)
...     freqs_with_ent = alignment_to_count_df(f.name,
...                                            ignore_gaps=True,
...                                            as_freqs=True,
...                                            add_entropy_neff=True,
...                                            )
>>> counts
   isite  -  A  C  G  I  K  L  M  N  S
0      1  1  0  0  3  0  0  0  0  0  0
1      2  1  0  0  0  0  3  0  0  0  0
2      3  3  0  0  0  0  0  0  0  1  0
3      4  0  4  0  0  0  0  0  0  0  0
4      5  0  0  3  0  0  0  0  1  0  0
5      6  3  0  0  0  0  0  0  0  0  1
6      7  0  0  0  0  1  0  3  0  0  0
>>> counts_nogaps
   isite  A  C  G  I  K  L  M  N  S
0      1  0  0  3  0  0  0  0  0  0
1      2  0  0  0  0  3  0  0  0  0
2      3  0  0  0  0  0  0  0  1  0
3      4  4  0  0  0  0  0  0  0  0
4      5  0  3  0  0  0  0  1  0  0
5      6  0  0  0  0  0  0  0  0  1
6      7  0  0  0  1  0  3  0  0  0
>>> freqs_nogaps.round(2)
   isite    A     C    G     I    K     L     M    N    S
0      1  0.0  0.00  1.0  0.00  0.0  0.00  0.00  0.0  0.0
1      2  0.0  0.00  0.0  0.00  1.0  0.00  0.00  0.0  0.0
2      3  0.0  0.00  0.0  0.00  0.0  0.00  0.00  1.0  0.0
3      4  1.0  0.00  0.0  0.00  0.0  0.00  0.00  0.0  0.0
4      5  0.0  0.75  0.0  0.00  0.0  0.00  0.25  0.0  0.0
5      6  0.0  0.00  0.0  0.00  0.0  0.00  0.00  0.0  1.0
6      7  0.0  0.00  0.0  0.25  0.0  0.75  0.00  0.0  0.0
>>> (freqs_with_ent.drop(columns=['entropy', 'n_effective']) ==
...  freqs_nogaps).all(None)
True
>>> freqs_with_ent[['entropy', 'n_effective']].round(3)
   entropy  n_effective
0    0.000        1.000
1    0.000        1.000
2    0.000        1.000
3    0.000        1.000
4    0.811        1.755
5    0.000        1.000
6    0.811        1.755
pdb_prot_align.utils.pdb_seq_to_number(pdbfile, chain_ids, chain_identity='require_same')[source]

Get sequence and number of chain(s) for some protein in PDB file.

Parameters
  • pdbfile (str) – Path to existing PDB file.

  • chain_ids (list) – List of chains in PDB file. All these chains must correspond to the same underlying molecule (i.e., be monomers in homo-oligomer).

  • chain_identity ({'union', 'intersection', 'require_same'}) – How to parse chains. They are required to share the same wildtype at all sites they have in common. If all sites are not shared between all chains, take the union, the intersection, or raise an error if they are not exactly the same.

Returns

Columns are:
  • ’sequential’ : sequential 1, 2, … numbering of residues

  • ’pdb_site’ : PDB file residue number

  • ’wildtype’ : wildtype residue identity (1-letter code)

Return type

pandas.DataFrame

pdb_prot_align.utils.strip_gaps_to_ref(aln, ref_head)[source]

Strip gaps in alignment relative to reference.

Parameters
  • aln (Bio.Align.MultipleSeqAlignment) – The alignment of sequences.

  • ref_head (str) – Header of reference sequence in aln.

Returns

Alignment as list of 2-tuples of strings (header, sequence) with all gaps relative to reference removed.

Return type

list

Example

>>> with tempfile.TemporaryFile('w+') as f:
...     _ = f.write(textwrap.dedent(
...             '''
...             >seq1
...             GK-AC-L
...             >seq2
...             -K-AM-L
...             >seq3
...             G-NAC-L
...             >seq4
...             GK-ACSI
...             '''
...             ))
...     f.flush()
...     _ = f.seek(0)
...     aln = Bio.AlignIO.read(f, 'fasta')
>>> strip_gaps_to_ref(aln, 'seq1')  
[('seq1', 'GKACL'),
 ('seq2', '-KAML'),
 ('seq3', 'G-ACL'),
 ('seq4', 'GKACI')]