genbank
Extract information from Genbank-format files.
pdb_prot_align.genbank.genbank_to_feature_df(gb, multi_features, *, feature_color_map=None, featureless_sites='NaN')
Convert a Genbank record with annotated features to a map of sites to features.
- Parameters
gb (Bio.SeqRecord.SeqRecord or name of Genbank file) – Contains features annotated by their type.
multi_features ({'last', 'first', 'all', 'error'}) – What to do if a site is in multiple features? Assign it to ‘last’ or ‘first’ feature (as ordered in gb) that it is part of, assign it to ‘all’ features it is part of, or raise an ‘error’.
feature_color_map (None or dict) – If not None, add a column mapping each feature to a color. The dict must have a key for each feature type in gb.
featureless_sites ({'NaN', 'drop'}) – How to handle sites without features. Keep them with a feature of NaN, or drop them from the returned data frame.
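The multi_features and featureless_sites options can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the helper name assign_features is hypothetical, features are given as plain (type, start, end) tuples with 0-based, half-open coordinates (mirroring Bio.SeqFeature.FeatureLocation(start, end)), and None stands in for NaN.

```python
def assign_features(seq, features, multi_features, featureless_sites='NaN'):
    """Hypothetical helper: map each 1-based site to its feature type(s).

    `features` is a list of (type, start, end) tuples, ordered as in the
    record, with 0-based half-open coordinates. Returns a list of
    (isite, amino_acid, feature) rows; `None` stands in for NaN.
    """
    if multi_features not in {'last', 'first', 'all', 'error'}:
        raise ValueError(f"invalid multi_features: {multi_features}")
    if featureless_sites not in {'NaN', 'drop'}:
        raise ValueError(f"invalid featureless_sites: {featureless_sites}")

    rows = []
    for isite, aa in enumerate(seq, start=1):
        # Site `isite` has 0-based index isite - 1, so it lies in
        # [start, end) exactly when start < isite <= end.
        hits = [ftype for ftype, start, end in features
                if start < isite <= end]
        if not hits:
            if featureless_sites == 'NaN':
                rows.append((isite, aa, None))
            continue  # 'drop': emit no row for this site
        if len(hits) > 1 and multi_features == 'error':
            raise ValueError(f"site {isite} in multiple features")
        if multi_features == 'first':
            hits = hits[:1]
        elif multi_features == 'last':
            hits = hits[-1:]
        # 'all' keeps every hit, giving one row per (site, feature) pair
        rows.extend((isite, aa, ftype) for ftype in hits)
    return rows


# Same sequence and feature layout as the documented example
example_features = [('1to6', 0, 6), ('1to2', 0, 2), ('4to6', 3, 6)]
for row in assign_features('MGKLLIT', example_features, 'last'):
    print(row)
```

Under these assumptions, 'last' and 'first' differ only in which end of the ordered hit list is kept, so the order of features in gb determines the result.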
- Returns
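Columns are ‘isite’ (sequential site numbers starting at 1), ‘amino_acid’, ‘feature’, and optionally ‘color’ if feature_color_map is not None. If mapping sites to multiple features, the returned data frame is tidy: each row holds one site paired with one feature, so a site in several features appears on several rows.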
- Return type
pandas.DataFrame
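As a rough illustration of the tidy layout, the sketch below collapses tidy rows back into one list of features per site. It uses plain tuples rather than a pandas.DataFrame so that it runs without dependencies. The helper name features_by_site is hypothetical, None stands in for NaN, and the rows are copied from the 'all' example in this section.

```python
# Rows as (isite, amino_acid, feature), copied from the 'all' example output
tidy_rows = [
    (1, 'M', '1to6'), (1, 'M', '1to2'),
    (2, 'G', '1to6'), (2, 'G', '1to2'),
    (3, 'K', '1to6'),
    (4, 'L', '1to6'), (4, 'L', '4to6'),
    (5, 'L', '1to6'), (5, 'L', '4to6'),
    (6, 'I', '1to6'), (6, 'I', '4to6'),
    (7, 'T', None),
]


def features_by_site(rows):
    """Hypothetical helper: collapse tidy rows to {isite: [features]}."""
    by_site = {}
    for isite, _aa, feature in rows:
        site_features = by_site.setdefault(isite, [])
        if feature is not None:  # featureless sites keep an empty list
            site_features.append(feature)
    return by_site


for isite, feats in features_by_site(tidy_rows).items():
    print(isite, feats)
```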
Example
>>> import Bio.SeqFeature
>>> import Bio.SeqRecord
>>> from pdb_prot_align.genbank import genbank_to_feature_df
>>> gb = Bio.SeqRecord.SeqRecord(
...         seq='MGKLLIT',
...         id='example',
...         name='example',
...         description='example',
...         features=[Bio.SeqFeature.SeqFeature(
...                     type='1to6',
...                     location=Bio.SeqFeature.FeatureLocation(0, 6)),
...                   Bio.SeqFeature.SeqFeature(
...                     type='1to2',
...                     location=Bio.SeqFeature.FeatureLocation(0, 2)),
...                   Bio.SeqFeature.SeqFeature(
...                     type='4to6',
...                     location=Bio.SeqFeature.FeatureLocation(3, 6)),
...                   ],
...         )
>>> genbank_to_feature_df(gb, 'last')
   isite amino_acid feature
0      1          M    1to2
1      2          G    1to2
2      3          K    1to6
3      4          L    4to6
4      5          L    4to6
5      6          I    4to6
6      7          T     NaN
>>> genbank_to_feature_df(gb, 'last', featureless_sites='drop')
   isite amino_acid feature
0      1          M    1to2
1      2          G    1to2
2      3          K    1to6
3      4          L    4to6
4      5          L    4to6
5      6          I    4to6
>>> genbank_to_feature_df(gb, 'first')
   isite amino_acid feature
0      1          M    1to6
1      2          G    1to6
2      3          K    1to6
3      4          L    1to6
4      5          L    1to6
5      6          I    1to6
6      7          T     NaN
>>> genbank_to_feature_df(gb, 'all')
    isite amino_acid feature
0       1          M    1to6
1       1          M    1to2
2       2          G    1to6
3       2          G    1to2
4       3          K    1to6
5       4          L    1to6
6       4          L    4to6
7       5          L    1to6
8       5          L    4to6
9       6          I    1to6
10      6          I    4to6
11      7          T     NaN
>>> genbank_to_feature_df(gb, 'error')
Traceback (most recent call last):
...
ValueError: site 1 in multiple features
>>> feature_color_map = {'1to6': 'red', '1to2': 'blue', '4to6': 'green'}
>>> genbank_to_feature_df(gb, 'last', feature_color_map=feature_color_map)
   isite amino_acid feature  color
0      1          M    1to2   blue
1      2          G    1to2   blue
2      3          K    1to6    red
3      4          L    4to6  green
4      5          L    4to6  green
5      6          I    4to6  green
6      7          T     NaN    NaN