genbank

Extract information from GenBank-format files.

pdb_prot_align.genbank.genbank_to_feature_df(gb, multi_features, *, feature_color_map=None, featureless_sites='NaN')

Convert a GenBank record with annotated features to a data frame mapping sites to features.

Parameters
  • gb (Bio.SeqRecord.SeqRecord or name of GenBank file) – Record whose features are annotated by their type.

  • multi_features ({'last', 'first', 'all', 'error'}) – How to handle a site that falls in multiple features: assign it to the ‘last’ or ‘first’ feature (as ordered in gb) that contains it, assign it to ‘all’ features that contain it, or raise an ‘error’.

  • feature_color_map (None or dict) – If not None, add a column mapping each feature to a color. The dict must have a key for each feature type in gb.

  • featureless_sites ({'NaN', 'drop'}) – How to handle sites without features: keep them with a feature of NaN, or drop them from the returned data frame.
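
The rules for multi_features and featureless_sites can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name assign_features, the (type, start, end) tuples, and the use of None in place of NaN are all assumptions made for this sketch. Coordinates are 0-based and end-exclusive, as in Bio.SeqFeature.FeatureLocation.

```python
def assign_features(seq, features, multi_features, featureless_sites="NaN"):
    """Hypothetical helper: return (isite, amino_acid, feature) rows.

    `features` is an ordered list of (type, start, end) tuples with
    0-based, end-exclusive coordinates. None stands in for NaN.
    """
    if multi_features not in {"first", "last", "all", "error"}:
        raise ValueError(f"invalid multi_features: {multi_features!r}")
    if featureless_sites not in {"NaN", "drop"}:
        raise ValueError(f"invalid featureless_sites: {featureless_sites!r}")

    rows = []
    for i, amino_acid in enumerate(seq):
        isite = i + 1  # sites are numbered from 1
        # Feature types covering this site, in the order they were given
        hits = [ftype for ftype, start, end in features if start <= i < end]
        if not hits:
            if featureless_sites == "NaN":
                rows.append((isite, amino_acid, None))
            continue
        if multi_features == "error" and len(hits) > 1:
            raise ValueError(f"site {isite} in multiple features")
        if multi_features == "first":
            hits = hits[:1]
        elif multi_features == "last":
            hits = hits[-1:]
        # 'all' keeps every hit, giving one row per site-feature pair
        rows.extend((isite, amino_acid, ftype) for ftype in hits)
    return rows


# Same sequence and feature coordinates as the example on this page
seq = "MGKLLIT"
features = [("1to6", 0, 6), ("1to2", 0, 2), ("4to6", 3, 6)]
last_rows = assign_features(seq, features, "last")
all_rows = assign_features(seq, features, "all")
```

With these inputs, the feature assigned to each site under 'last' and 'all' matches the feature column in the example on this page, with None where the data frame shows NaN.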

Returns

Data frame with columns ‘isite’ (1-based site number), ‘amino_acid’, ‘feature’, and, if feature_color_map is not None, ‘color’. If multi_features is ‘all’, the returned data frame is tidy: a site that is in several features gets one row per feature.

Return type

pandas.DataFrame
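
To make the tidy layout concrete, the sketch below collapses one-row-per-site-feature data back into a list of features per site. It uses plain tuples rather than a data frame so that it needs no third-party packages; the rows are copied from the 'all' output in the example on this page, with None standing in for NaN, and the variable names are chosen for this illustration only.

```python
# (isite, amino_acid, feature) rows, as in the 'all' output of the example
tidy_rows = [
    (1, "M", "1to6"), (1, "M", "1to2"),
    (2, "G", "1to6"), (2, "G", "1to2"),
    (3, "K", "1to6"),
    (4, "L", "1to6"), (4, "L", "4to6"),
    (5, "L", "1to6"), (5, "L", "4to6"),
    (6, "I", "1to6"), (6, "I", "4to6"),
    (7, "T", None),
]

# Group the tidy rows by site; featureless sites keep an empty list
features_by_site = {}
for isite, _amino_acid, feature in tidy_rows:
    site_features = features_by_site.setdefault(isite, [])
    if feature is not None:
        site_features.append(feature)
```

Because each row holds exactly one site-feature pair, grouping by ‘isite’ is enough to recover all features for a site, and the order within each list follows the feature order in gb.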

Example

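>>> import Bio.SeqFeature
>>> import Bio.SeqRecord
>>> from pdb_prot_align.genbank import genbank_to_feature_df
>>> gb = Bio.SeqRecord.SeqRecord(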
...         seq='MGKLLIT',
...         id='example',
...         name='example',
...         description='example',
...         features=[Bio.SeqFeature.SeqFeature(
...                     type='1to6',
...                     location=Bio.SeqFeature.FeatureLocation(0, 6)),
...                   Bio.SeqFeature.SeqFeature(
...                     type='1to2',
...                     location=Bio.SeqFeature.FeatureLocation(0, 2)),
...                   Bio.SeqFeature.SeqFeature(
...                     type='4to6',
...                     location=Bio.SeqFeature.FeatureLocation(3, 6)),
...                   ],
...         )
>>> genbank_to_feature_df(gb, 'last')
   isite amino_acid feature
0      1          M    1to2
1      2          G    1to2
2      3          K    1to6
3      4          L    4to6
4      5          L    4to6
5      6          I    4to6
6      7          T     NaN
>>> genbank_to_feature_df(gb, 'last', featureless_sites='drop')
   isite amino_acid feature
0      1          M    1to2
1      2          G    1to2
2      3          K    1to6
3      4          L    4to6
4      5          L    4to6
5      6          I    4to6
>>> genbank_to_feature_df(gb, 'first')
   isite amino_acid feature
0      1          M    1to6
1      2          G    1to6
2      3          K    1to6
3      4          L    1to6
4      5          L    1to6
5      6          I    1to6
6      7          T     NaN
>>> genbank_to_feature_df(gb, 'all')
    isite amino_acid feature
0       1          M    1to6
1       1          M    1to2
2       2          G    1to6
3       2          G    1to2
4       3          K    1to6
5       4          L    1to6
6       4          L    4to6
7       5          L    1to6
8       5          L    4to6
9       6          I    1to6
10      6          I    4to6
11      7          T     NaN
>>> genbank_to_feature_df(gb, 'error')
Traceback (most recent call last):
    ...
ValueError: site 1 in multiple features
>>> feature_color_map = {'1to6': 'red', '1to2': 'blue', '4to6': 'green'}
>>> genbank_to_feature_df(gb, 'last', feature_color_map=feature_color_map)
   isite amino_acid feature  color
0      1          M    1to2   blue
1      2          G    1to2   blue
2      3          K    1to6    red
3      4          L    4to6  green
4      5          L    4to6  green
5      6          I    4to6  green
6      7          T     NaN    NaN