utils¶
Utilities for aligning proteins to PDB.
-
pdb_prot_align.utils.
align_prots_mafft
(prots, *, mafft='mafft')[source]¶ Align protein sequences with
mafft
.- Parameters
prots (list) – List of sequences as Biopython.SeqRecord.SeqRecord objects.
mafft (str) – Command that resolves to
mafft
executable, potentially with arbitrary additional options, such asmafft --reorder
.
- Returns
The protein alignment.
- Return type
Bio.Align.MultipleSeqAlignment
-
pdb_prot_align.utils.
aligned_site_map
(aln, heads)[source]¶ Get mapping between sequential site numbers of aligned sequences.
- Parameters
aln (Bio.Align.MultipleSeqAlignment) – Alignment of sequences
heads (list) – Headers of sequences of interest in aln, must be at least two.
- Returns
Columns are sequence names in heads. There is a row for alignment positions that are non-gapped in at least one sequence in heads, and the entry in the row gives the number of that site in 1, 2, … numbering of the sequence, or NaN if there is no residue for sequence at that position in alignment.
- Return type
pandas.DataFrame
Example
Create and parse an alignment:
>>> with tempfile.TemporaryFile('w+') as f: ... _ = f.write(textwrap.dedent( ... ''' ... >seq1 ... GK-AC-L ... >seq2 ... -K-AM-L ... >seq3 ... G-NAC-L ... >seq4 ... GK-ACSI ... ''' ... )) ... f.flush() ... _ = f.seek(0) ... aln = Bio.AlignIO.read(f, 'fasta')
Show the alignment site map. The gymnastics with converting to str and replacing ‘nan’ are to make output format the same in both pandas >= and < 1.0 and earlier versions so doctest passes in both:
>>> (aligned_site_map(aln, ['seq1', 'seq2', 'seq3']) ... .astype(str).replace('nan', '<NA>')) seq1 seq2 seq3 0 1 <NA> 1 1 2 1 <NA> 2 <NA> <NA> 2 3 3 2 3 4 4 3 4 5 5 4 5
-
pdb_prot_align.utils.
alignment_to_count_df
(fasta_alignment, *, ignore_gaps=False, as_freqs=False, add_entropy_neff=False)[source]¶ Convert FASTA alignment to counts data frame.
- Parameters
fasta_alignment (str) – FASTA file with alignment.
ignore_gaps (bool) – Ignore gap characters (‘-‘) in counts.
as_freqs (bool) – Return frequencies rather than counts.
add_entropy_neff (bool) – Add columns with site entropy (bits) and number effective amino acids. Requires as_freqs to be True.
- Returns
First column (‘isite’) is 1, 2, … site numbering. Other columns give counts of each observed alignment character at that site.
- Return type
pandas.DataFrame
Example
>>> with tempfile.NamedTemporaryFile('w+') as f: ... _ = f.write(textwrap.dedent( ... ''' ... >seq1 ... GK-AC-L ... >seq2 ... -K-AM-L ... >seq3 ... G-NAC-L ... >seq4 ... GK-ACSI ... ''' ... )) ... f.flush() ... counts = alignment_to_count_df(f.name) ... counts_nogaps = alignment_to_count_df(f.name, ignore_gaps=True) ... freqs_nogaps = alignment_to_count_df(f.name, ... ignore_gaps=True, ... as_freqs=True) ... freqs_with_ent = alignment_to_count_df(f.name, ... ignore_gaps=True, ... as_freqs=True, ... add_entropy_neff=True, ... )
>>> counts isite - A C G I K L M N S 0 1 1 0 0 3 0 0 0 0 0 0 1 2 1 0 0 0 0 3 0 0 0 0 2 3 3 0 0 0 0 0 0 0 1 0 3 4 0 4 0 0 0 0 0 0 0 0 4 5 0 0 3 0 0 0 0 1 0 0 5 6 3 0 0 0 0 0 0 0 0 1 6 7 0 0 0 0 1 0 3 0 0 0 >>> counts_nogaps isite A C G I K L M N S 0 1 0 0 3 0 0 0 0 0 0 1 2 0 0 0 0 3 0 0 0 0 2 3 0 0 0 0 0 0 0 1 0 3 4 4 0 0 0 0 0 0 0 0 4 5 0 3 0 0 0 0 1 0 0 5 6 0 0 0 0 0 0 0 0 1 6 7 0 0 0 1 0 3 0 0 0 >>> freqs_nogaps.round(2) isite A C G I K L M N S 0 1 0.0 0.00 1.0 0.00 0.0 0.00 0.00 0.0 0.0 1 2 0.0 0.00 0.0 0.00 1.0 0.00 0.00 0.0 0.0 2 3 0.0 0.00 0.0 0.00 0.0 0.00 0.00 1.0 0.0 3 4 1.0 0.00 0.0 0.00 0.0 0.00 0.00 0.0 0.0 4 5 0.0 0.75 0.0 0.00 0.0 0.00 0.25 0.0 0.0 5 6 0.0 0.00 0.0 0.00 0.0 0.00 0.00 0.0 1.0 6 7 0.0 0.00 0.0 0.25 0.0 0.75 0.00 0.0 0.0 >>> (freqs_with_ent.drop(columns=['entropy', 'n_effective']) == ... freqs_nogaps).all(None) True >>> freqs_with_ent[['entropy', 'n_effective']].round(3) entropy n_effective 0 0.000 1.000 1 0.000 1.000 2 0.000 1.000 3 0.000 1.000 4 0.811 1.755 5 0.000 1.000 6 0.811 1.755
-
pdb_prot_align.utils.
pdb_seq_to_number
(pdbfile, chain_ids, chain_identity='require_same')[source]¶ Get sequence and number of chain(s) for some protein in PDB file.
- Parameters
pdbfile (str) – Path to existing PDB file.
chain_ids (list) – List of chains in PDB file. All these chains must correspond to the same underlying molecule (i.e., be monomers in homo-oligomer).
chain_identity ({'union', 'intersection', 'require_same'}) – How to parse chains. They are required to share the same wildtype at all sites they have in common. If all sites are not shared between all chains, take the union, the intersection, or raise an error if they are not exactly the same.
- Returns
- Columns are:
’sequential’ : sequential 1, 2, … numbering of residues
’pdb_site’ : PDB file residue number
’wildtype’ : wildtype residue identity (1-letter code)
- Return type
pandas.DataFrame
-
pdb_prot_align.utils.
strip_gaps_to_ref
(aln, ref_head)[source]¶ Strip gaps in alignment relative to reference.
- Parameters
aln (Bio.Align.MultipleSeqAlignment) – The alignment of sequences.
ref_head (str) – Header of reference sequence in aln.
- Returns
Alignment as list of 2-tuples of strings (header, sequence) with all gaps relative to reference removed.
- Return type
list
Example
>>> with tempfile.TemporaryFile('w+') as f: ... _ = f.write(textwrap.dedent( ... ''' ... >seq1 ... GK-AC-L ... >seq2 ... -K-AM-L ... >seq3 ... G-NAC-L ... >seq4 ... GK-ACSI ... ''' ... )) ... f.flush() ... _ = f.seek(0) ... aln = Bio.AlignIO.read(f, 'fasta') >>> strip_gaps_to_ref(aln, 'seq1') [('seq1', 'GKACL'), ('seq2', '-KAML'), ('seq3', 'G-ACL'), ('seq4', 'GKACI')]