targets¶
Defines Targets, which holds Target objects that define alignment targets. Each Target has some Feature regions.
- class alignparse.targets.Feature(*, name, seq, start, end)[source]¶
Bases:
object
A sequence feature within a Target sequence.
- Parameters:
- name¶
Name of feature.
- Type:
str
- seq¶
Sequence of feature.
- Type:
str
- length¶
Length of feature.
- Type:
int
- class alignparse.targets.Target(*, seqrecord, req_features=frozenset({}), opt_features=frozenset({}), allow_extra_features=False)[source]¶
Bases:
object
A single target sequence.
- Parameters:
seqrecord (Bio.SeqRecord.SeqRecord) – BioPython sequence record of target. Must have seq, name, and features attributes. Currently only handles + strand features.
req_features (set or other iterable) – Required features in seqrecord.
opt_features (set or other iterable) – Optional features in seqrecord.
allow_extra_features (bool) – Can seqrecord have features not in req_features or opt_features?
- seq¶
Full sequence of target.
- Type:
str
- name¶
Name of target.
- Type:
str
- length¶
Length of sequence.
- Type:
int
- feature_names¶
List of names of all features.
- Type:
list
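The interplay of req_features, opt_features, and allow_extra_features can be sketched as follows. This is only an illustration of the rule described in the parameters above, not the library's code: check_features is a hypothetical helper, and the feature names are made up.

```python
def check_features(found, req_features=frozenset(), opt_features=frozenset(),
                   allow_extra_features=False):
    """Hypothetical helper: validate the feature names found in a record."""
    found = set(found)
    req = set(req_features)
    opt = set(opt_features)
    # Every required feature must be present.
    missing = req - found
    if missing:
        raise ValueError(f"missing required features: {sorted(missing)}")
    # Features that are neither required nor optional count as "extra".
    extra = found - req - opt
    if extra and not allow_extra_features:
        raise ValueError(f"unexpected extra features: {sorted(extra)}")
    return sorted(found)


# Made-up feature names, for illustration only.
names = check_features(["gene", "barcode"],
                       req_features={"gene"},
                       opt_features={"barcode"})
```

With these inputs the call succeeds; adding an unlisted feature such as "spacer" would raise unless allow_extra_features is True.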
- classmethod get_name(seqrecord)[source]¶
Get name of target from sequence record.
- Parameters:
seqrecord (Bio.SeqRecord.SeqRecord) – Sequence record as passed to Target.
- Returns:
Name parsed from seqrecord.
- Return type:
str
- has_feature(name)[source]¶
Check if a feature is defined for this target.
- Parameters:
name (str) – Name of Feature.
- Returns:
True if target has feature of this name, False otherwise.
- Return type:
bool
- image(*, color_map=None, feature_labels=None, plots_indexing='genbank')[source]¶
Get image of the target.
- Parameters:
color_map (None or dict) – To specify colors for each feature, provide a dict mapping feature names to colors. Otherwise automatically chosen.
feature_labels (None or dict) – Map feature names to text labels shown on plot. Otherwise features just labeled by name.
plots_indexing ({'biopython', 'genbank'}) – Does image use 0-based (‘biopython’) or 1-based (‘genbank’) indexing of nucleotide sites?
- Returns:
Image of target, which has .plot and .plot_with_bokeh methods: https://edinburgh-genome-foundry.github.io/DnaFeaturesViewer
- Return type:
dna_features_viewer.GraphicRecord.GraphicRecord
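The two plots_indexing conventions differ in how a span of sites is numbered. The sketch below assumes the usual conventions (BioPython locations are 0-based and half-open, GenBank locations are 1-based and inclusive); the function names are hypothetical and are not part of alignparse.

```python
def biopython_to_genbank(start, end):
    """Convert a 0-based half-open span [start, end) to 1-based inclusive."""
    return start + 1, end


def genbank_to_biopython(first, last):
    """Convert a 1-based inclusive span first..last to 0-based half-open."""
    return first - 1, last


# The first ten nucleotides of a target under each convention.
span_genbank = biopython_to_genbank(0, 10)
span_biopython = genbank_to_biopython(1, 10)
```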
- class alignparse.targets.Targets(*, seqsfile, feature_parse_specs, allow_extra_features=False, seqsfileformat='genbank', allow_clipped_muts_seqs=False, ignore_feature_parse_specs_keys=None, select_target_names=None)[source]¶
Bases:
object
Collection of Target sequences.
- Parameters:
seqsfile (str or list) – Name of file specifying the targets, or list of such files. So if multiple targets they can all be in one file or in separate files.
feature_parse_specs (dict or str) – How Targets.parse_alignment() parses alignments. Specify dict or name of YAML file. Keyed by names of targets, with values being target-level dicts keyed by feature names. The feature-level dicts have two keys:
- ‘filter’: dict keyed by ‘clip5’, ‘clip3’, ‘mutation_nt_count’, and ‘mutation_op_count’ giving max clipping at each end, number of nucleotide mutations, and number of cs tag mutation operations allowed for feature. If ‘filter’ itself or any of the keys are missing, the value is set to zero. If the value is None (‘null’ in YAML notation), then no filter is applied.
- ‘return’: str or list of strings indicating what to return for this feature. If ‘return’ is absent or the value is None (‘null’ in YAML notation), nothing is returned for this feature. Otherwise list one or more of ‘sequence’, ‘mutations’, ‘accuracy’, ‘cs’, ‘clip5’, and ‘clip3’ to get the sequence, mutation string, accuracy, cs tag, or number of clipped nucleotides from each end.
In addition, target-level dicts should have keys ‘query_clip5’ and ‘query_clip3’ which give the max amount that can be clipped from each end of the query prior to the alignment. Use a value of None (‘null’ in YAML notation) to have no filter on this clipping. Filters will be applied in the order the features appear in the feature_parse_specs.
allow_extra_features (bool) – Can targets have features not in feature_parse_specs?
seqsfileformat ({'genbank'}) – Format of seqsfile. Currently, ‘genbank’ is the only supported option. The GenBank Flat File format is described here, but not all fields are required. The documentation includes examples that show what fields should typically be included. GenBank files can be readily generated using several sequence editing programs, such as ApE or Benchling.
allow_clipped_muts_seqs (bool) – Returning sequence or mutations for features where non-zero clipping is allowed is dangerous, since as described in Targets.parse_alignment() these will only be for the unclipped region and so are easy to misinterpret. So you must explicitly set this option to True in order to allow return of mutations / sequences for features with clipping allowed; otherwise you’ll get an error if you try to recover such sequences / mutations.
ignore_feature_parse_specs_keys (None or list) – Ignore these target-level keys in feature_parse_specs. Useful for YAML with default keys that don’t represent actual targets.
select_target_names (None or list) – If None, the created object is for all sequences in seqsfile. Otherwise pass a list with names of just the sequences of interest.
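As an illustration of the feature_parse_specs layout described above, here is a sketch of a dict for one target, plus a helper that fills in the stated defaults. The target name, feature names, and the fill_feature_defaults helper are all hypothetical. The helper only mirrors the rules in the parameter description (missing filter keys become zero, None means no filter, absent ‘return’ means nothing is returned) and is not the library's implementation.

```python
FILTER_KEYS = ("clip5", "clip3", "mutation_nt_count", "mutation_op_count")

# Hypothetical target ("my_target") and features ("gene", "barcode").
specs = {
    "my_target": {
        "query_clip5": 10,    # at most 10 nt clipped from query 5' end
        "query_clip3": None,  # None: no filter on 3' query clipping
        "gene": {
            "filter": {"mutation_nt_count": 30, "mutation_op_count": None},
            "return": ["mutations", "accuracy"],
        },
        "barcode": {
            # No 'filter' key: every filter value defaults to zero.
            "return": "sequence",
        },
    }
}


def fill_feature_defaults(feature_spec):
    """Hypothetical helper: make the documented defaults explicit."""
    filt = feature_spec.get("filter", {})
    # Missing keys default to 0; an explicit None (no filter) is preserved.
    filled_filter = {key: filt.get(key, 0) for key in FILTER_KEYS}
    ret = feature_spec.get("return")
    if ret is None:
        ret = []
    elif isinstance(ret, str):
        ret = [ret]
    return {"filter": filled_filter, "return": list(ret)}


gene_spec = fill_feature_defaults(specs["my_target"]["gene"])
barcode_spec = fill_feature_defaults(specs["my_target"]["barcode"])
```

Because zero is the default, a feature with no ‘filter’ entry tolerates no clipping and no mutations; relaxing a filter has to be done explicitly.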
- target_names¶
List of names of all targets.
- Type:
list
- target_seqs¶
Keyed by target name, value is sequence as str.
- Type:
dict
- align(queryfile, alignmentfile, mapper)[source]¶
Align query sequences to targets.
- Parameters:
queryfile (str) – The query sequences to align (FASTQ or FASTA, can be gzipped).
alignmentfile (str) – SAM file created by mapper with alignments of queries to the target sequences within this Targets object.
mapper (alignparse.minimap2.Mapper) – Mapper that runs minimap2. Alignment options set when creating this mapper.
- align_and_parse(df, mapper, outdir, *, name_col='name', queryfile_col='queryfile', group_cols=None, to_csv=False, overwrite=False, multi_align='primary', filtered_cs=False, skip_sups=True, ncpus=-1)[source]¶
Align query sequences and then parse alignments.
Note
This is a convenience method to run Targets.align() and Targets.parse_alignment() on multiple queries and collate the results.
It also allows multiple queries to be handled simultaneously using multiprocessing.
- Parameters:
df (pandas.DataFrame) – Data frame with information on queries to align.
mapper (alignparse.minimap2.Mapper) – Mapper that runs minimap2. Alignment options set when creating this mapper.
outdir (str) – Name of directory with created alignments and parsing files. Created if it does not exist.
name_col (str) – Column in df with the name of each set of queries.
queryfile_col (str) – Column in df with FASTQ file with queries.
group_cols (None or str or list) – Columns in df used to “group” results. These columns are in all created data frames. For instance, might specify different libraries or samples.
to_csv (bool) – Write CSV files rather than return data frames. Useful to avoid reading large data frames into memory.
overwrite (bool) – If some of the created output files already exist, do we overwrite them or raise an error?
multi_align ({'primary'}) – How to handle multiple alignments. Currently only option is ‘primary’, which ignores all secondary alignments.
filtered_cs (bool) – Add cs tag that failed the filter to filtered dataframe along with filter reason. Allows for more easily investigating why reads are failing the filters.
skip_sups (bool) – Whether or not to skip supplementary alignments when parsing. Supplementary alignments are additional possible alignments for a read that arise when the read is potentially chimeric. The default is to skip these alignments and not parse them.
ncpus (int) – Number of CPUs to use; -1 means all available.
- Returns:
(readstats, aligned, filtered) – Same meaning as for Targets.parse_alignment() except the data frames / CSV files all have additional columns indicating name of each query set (name_col) as well as any group_cols.
- Return type:
tuple
- feature_parse_specs(returntype)[source]¶
Get the feature parsing specs.
Note
Filters will be applied in the order they are listed in the feature_parse_specs yaml file or dict. Once a read fails a filter, other filters will not be applied. As such, it is recommended to have features with filters for 5’ and 3’ clipping listed first.
- Parameters:
returntype ({'dict', 'yaml'}) – Return a Python dict or a YAML string representation.
- Returns:
The feature parsing specs set by the feature_parse_specs at Targets initialization, but with any missing default values explicitly filled in.
- Return type:
dict or str
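The ordering rule in the note above (filters are applied in listed order, and checking stops at the first failure) can be sketched like this. The first_failed_filter function and the filter names are hypothetical, used only to show why listing clipping filters first changes the reported reason; the library's actual filter-reason strings may look different.

```python
def first_failed_filter(ordered_filters, observed):
    """Return the name of the first failed filter, or None if all pass.

    ordered_filters: list of (name, max_allowed) pairs in the order they
        should be applied; max_allowed of None means no filter.
    observed: dict mapping filter name to the value seen for one read.
    """
    for name, max_allowed in ordered_filters:
        if max_allowed is not None and observed[name] > max_allowed:
            # Stop at the first failure; later filters are never checked.
            return name
    return None


# Hypothetical filter names; clipping filters are listed first.
filters = [("gene_clip5", 0), ("gene_clip3", 0), ("gene_mutation_nt_count", 5)]
read = {"gene_clip5": 3, "gene_clip3": 0, "gene_mutation_nt_count": 9}
reason = first_failed_filter(filters, read)
```

Here the read fails both the clipping and the mutation-count filter, but only the clipping failure is reported because it comes first.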
- features_to_parse(targetname, feature_or_name='feature')[source]¶
Features to parse for a target.
- Parameters:
targetname (str) – Name of target.
feature_or_name ({'feature', 'name'}) – Get the Feature objects themselves or their names.
- Returns:
Features to parse for this target, as specified in Targets.feature_parse_specs().
- Return type:
list
- parse_alignment(samfile, multi_align='primary', to_csv=False, csv_dir=None, overwrite_csv=False, filtered_cs=False, skip_sups=True)[source]¶
Parse alignment features as specified in feature_parse_specs.
- Parameters:
samfile (str) – SAM file with minimap2 alignments with cs tag, typically created by Targets.align().
multi_align ({'primary'}) – How to handle multiple alignments. Currently only option is ‘primary’, which ignores all secondary alignments.
to_csv (bool) – Return CSV file names rather than return data frames. Useful to avoid reading large data frames into memory.
csv_dir (None or str) – If to_csv is True, name of directory to which we write CSV files (created if needed). If None, write to current directory.
overwrite_csv (bool) – If using to_csv, do we overwrite existing CSV files or raise an error if they already exist?
filtered_cs (bool) – Add cs tag that failed the filter to filtered dataframe along with filter reason. Allows for more easily investigating why reads are failing the filters.
skip_sups (bool) – Whether or not to skip supplementary alignments when parsing. Supplementary alignments are additional possible alignments for a read that arise when the read is potentially chimeric. The default is to skip these alignments and not parse them.
- Returns:
(readstats, aligned, filtered) –
readstats is pandas.DataFrame with numbers of unmapped reads, and for each target the number of mapped reads that are validly aligned and that fail filters in feature_parse_specs.
aligned is a dict keyed by name of each target. Entries are pandas.DataFrame with rows for each validly aligned read. Rows give query name, query clipping at each end of alignment, and any feature-level info specified for return in feature_parse_specs in columns with names equal to feature suffixed by ‘_sequence’, ‘_mutations’, ‘_accuracy’, ‘_cs’, ‘_clip5’, and ‘_clip3’.
filtered is a dict keyed by name of each target. Entries are pandas.DataFrame with a row for each filtered aligned read giving the query name and the reason it was filtered. If filtered_cs is True, then a column is added to each filtered pandas.DataFrame with the cs tag that failed the filter.
If to_csv is True, then aligned and filtered give names of CSV files holding data frames.
- Return type:
tuple
Note
The cs tags are in the short format returned by minimap2; see here for details: https://lh3.github.io/minimap2/minimap2.html
When parsing features, if an insertion occurs between two features, it is assigned to the end of the first feature.
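For readers unfamiliar with the short cs format, the sketch below splits a cs string into its operations and counts the non-match ones. The grammar follows the minimap2 manual (‘:’ identical run, ‘*’ substitution, ‘+’ insertion, ‘-’ deletion, ‘~’ intron). The function names are hypothetical, and treating every non-match operation as one mutation operation is an assumption, not a statement of how the ‘mutation_op_count’ filter is computed.

```python
import re

# One alternative per operation type in the short cs format.
CS_OP = re.compile(
    r":[0-9]+"                        # run of identical bases
    r"|\*[acgtn]{2}"                  # substitution: target base, query base
    r"|\+[acgtn]+"                    # insertion relative to target
    r"|-[acgtn]+"                     # deletion relative to target
    r"|~[acgtn]{2}[0-9]+[acgtn]{2}"   # intron (spliced alignments)
)


def split_cs(cs):
    """Split a short-format cs string into a list of operations."""
    ops = CS_OP.findall(cs)
    if "".join(ops) != cs:
        raise ValueError(f"not a valid short-format cs string: {cs!r}")
    return ops


def count_mutation_ops(cs):
    """Count operations other than identical runs (an assumed definition)."""
    return sum(1 for op in split_cs(cs) if not op.startswith(":"))


# Example string taken from the minimap2 manual.
ops = split_cs(":6-ata:10+gtc:4*at:3")
n_ops = count_mutation_ops(":6-ata:10+gtc:4*at:3")
```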
Returned sequences, mutation strings, and cs tags are only for the portion of the feature that aligns, and do not indicate clipping, which you instead get in the ‘_clip*’ columns. The sequences are simply what the cs tag implies; indels / mutations are not indicated in this column. Mutation strings are space-delimited with these operations in 1-based (1, 2, …) numbering from start of the feature:
- ‘A2G’ : substitution at site 2 from A to G
- ‘ins5TAA’ : insertion of ‘TAA’ starting at site 5
- ‘del5to6’ : deletion of sites 5 to 6, inclusive
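A mutation string in the notation above can be taken apart with a few regular expressions. This is a minimal sketch for downstream analysis; parse_mutations is a hypothetical helper, not a function provided by alignparse, and it assumes upper-case nucleotide letters as in the examples.

```python
import re

SUB = re.compile(r"^([A-Z])(\d+)([A-Z])$")   # e.g. A2G
INS = re.compile(r"^ins(\d+)([A-Z]+)$")      # e.g. ins5TAA
DEL = re.compile(r"^del(\d+)to(\d+)$")       # e.g. del5to6


def parse_mutations(mutation_str):
    """Parse a space-delimited mutation string into a list of tuples."""
    parsed = []
    for token in mutation_str.split():
        m = SUB.match(token)
        if m:
            # (kind, 1-based site, original base, new base)
            parsed.append(("substitution", int(m.group(2)), m.group(1), m.group(3)))
            continue
        m = INS.match(token)
        if m:
            # (kind, 1-based site where insertion starts, inserted sequence)
            parsed.append(("insertion", int(m.group(1)), m.group(2)))
            continue
        m = DEL.match(token)
        if m:
            # (kind, first deleted site, last deleted site), both inclusive
            parsed.append(("deletion", int(m.group(1)), int(m.group(2))))
            continue
        raise ValueError(f"unrecognized mutation: {token!r}")
    return parsed


muts = parse_mutations("A2G ins5TAA del5to6")
```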
The returned accuracy is the average accuracy of the aligned query sites as calculated from the Q-values, and is nan if there are no aligned query sites.
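The accuracy calculation can be illustrated under the standard Phred assumption that a Q-value q corresponds to an error probability of 10^(-q/10). The mean_accuracy function below is a hypothetical sketch of that averaging, not the library's own routine, which may differ in details.

```python
import math


def mean_accuracy(qvals):
    """Average per-site accuracy from Phred Q-values; nan if no sites."""
    if not qvals:
        return math.nan
    # Per-site accuracy is 1 minus the Phred error probability.
    return sum(1 - 10 ** (-q / 10) for q in qvals) / len(qvals)


# Q10 is 90% accurate and Q20 is 99% accurate, so the mean is 0.945.
acc = mean_accuracy([10, 20])
```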
- plot(*, sharex=True, ax_width=5, ax_height=3, hspace=0.4, **kwargs)[source]¶
Plot all the targets.
Note
For more customizable plots, call Target.image() for individual targets.
- Parameters:
sharex (bool) – Share x-axis among plots for each target?
ax_width (float) – Width of each axis in inches.
ax_height (float) – Height of each axis in inches.
hspace (float) – Vertical space between axes as fraction of ax_height.
**kwargs – Keyword arguments passed to Target.image().
- Returns:
Figure showing all targets.
- Return type:
matplotlib.pyplot.figure