targets

Defines Targets, which holds Target objects that define alignment targets. Each Target has some Feature regions.

class alignparse.targets.Feature(*, name, seq, start, end)[source]

Bases: object

A sequence feature within a Target sequence.

Parameters:
  • name (str) – Name of feature.

  • seq (str) – Sequence of feature.

  • start (int) – Feature start in Target, using Python-like 0, … numbering.

  • end (int) – Feature end in Target using Python-like 0, … numbering.

name

Name of feature.

Type:

str

seq

Sequence of feature.

Type:

str

start

Feature start in Target, using Python-like 0, … numbering.

Type:

int

end

Feature end in Target using Python-like 0, … numbering.

Type:

int

length

Length of feature.

Type:

int
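The start, end, and length attributes above can be illustrated with a minimal sketch. It assumes that "Python-like" numbering means a 0-based interval whose end is exclusive, as in Python slicing; the text above does not state this explicitly. `FeatureSketch` is a hypothetical stand-in written for illustration, not the library's Feature class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSketch:
    """Hypothetical stand-in for a feature (not alignparse's Feature)."""

    name: str
    seq: str
    start: int  # 0-based start in the target
    end: int  # assumed exclusive, as in Python slicing

    @property
    def length(self):
        # Under the half-open assumption, length is simply end - start.
        return self.end - self.start


target_seq = "ATGGCATTTAAC"
# A feature covering target sites 3..8 (0-based), i.e. the slice [3:9].
gene = FeatureSketch(name="gene", seq=target_seq[3:9], start=3, end=9)
print(gene.seq, gene.length)
```

Under this assumption the feature sequence always equals `target_seq[start:end]`, so its length matches `len(seq)`.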

class alignparse.targets.Target(*, seqrecord, req_features=frozenset({}), opt_features=frozenset({}), allow_extra_features=False)[source]

Bases: object

A single target sequence.

Parameters:
  • seqrecord (Bio.SeqRecord.SeqRecord) – BioPython sequence record of target. Must have seq, name, and features attributes. Currently only handles + strand features.

  • req_features (set or other iterable) – Required features in seqrecord.

  • opt_features (set or other iterable) – Optional features in seqrecord.

  • allow_extra_features (bool) – Can seqrecord have features not in req_features or opt_features?

seq

Full sequence of target.

Type:

str

name

Name of target.

Type:

str

length

Length of sequence.

Type:

int

features

List of all features as Feature objects.

Type:

list

feature_names

List of names of all features.

Type:

list

get_feature(name)[source]

Get Feature by name.

Parameters:

name (str) – Name of Feature.

Returns:

Returns the feature, or raises ValueError if no such feature.

Return type:

Feature

classmethod get_name(seqrecord)[source]

Get name of target from sequence record.

Parameters:

seqrecord (Bio.SeqRecord.SeqRecord) – Sequence record as passed to Target.

Returns:

Name parsed from seqrecord.

Return type:

str

has_feature(name)[source]

Check if a feature is defined for this target.

Parameters:

name (str) – Name of Feature.

Returns:

True if target has feature of this name, False otherwise.

Return type:

bool

image(*, color_map=None, feature_labels=None, plots_indexing='genbank')[source]

Get image of the target.

Parameters:
  • color_map (None or dict) – To specify colors for each feature, provide a dict mapping feature names to colors. Otherwise automatically chosen.

  • feature_labels (None or dict) – Map feature names to text labels shown on plot. Otherwise features just labeled by name.

  • plots_indexing ({'biopython', 'genbank'}) – Does image use 0-based (‘biopython’) or 1-based (‘genbank’) indexing of nucleotide sites?

Returns:

Image of target, which has .plot and .plot_with_bokeh methods: https://edinburgh-genome-foundry.github.io/DnaFeaturesViewer

Return type:

dna_features_viewer.GraphicRecord.GraphicRecord
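The two plots_indexing conventions differ only in how nucleotide sites are numbered. The sketch below shows the conversion, assuming 0-based intervals are end-exclusive (Python slicing) and 1-based intervals are end-inclusive (GenBank style). Both helper functions are hypothetical and are not part of alignparse.

```python
def to_genbank_site(site):
    """Convert a 0-based site index to its 1-based label."""
    return site + 1


def to_genbank_interval(start, end):
    """Convert a 0-based, end-exclusive interval to 1-based, end-inclusive.

    The start shifts by one; the end keeps the same number because the
    exclusive 0-based end and the inclusive 1-based end coincide.
    """
    return start + 1, end


print(to_genbank_site(0))
print(to_genbank_interval(3, 9))
```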

class alignparse.targets.Targets(*, seqsfile, feature_parse_specs, allow_extra_features=False, seqsfileformat='genbank', allow_clipped_muts_seqs=False, ignore_feature_parse_specs_keys=None, select_target_names=None)[source]

Bases: object

Collection of Target sequences.

Parameters:
  • seqsfile (str or list) – Name of file specifying the targets, or list of such files. Multiple targets can all be in one file or be spread across separate files.

  • feature_parse_specs (dict or str) –

    How Targets.parse_alignment() parses alignments. Specify a dict or the name of a YAML file. The top-level dict is keyed by target name, and each value is a target-level dict keyed by feature name. Each feature-level dict has two keys:

    • ‘filter’: dict keyed by ‘clip5’, ‘clip3’, ‘mutation_nt_count’, and ‘mutation_op_count’ giving the maximum clipping allowed at each end, the maximum number of nucleotide mutations, and the maximum number of cs tag mutation operations allowed for the feature. If ‘filter’ itself or any of the keys are missing, the value is set to zero. If the value is None (‘null’ in YAML notation), then no filter is applied.

    • ‘return’: str or list of strings indicating what to return for this feature. If ‘return’ is absent or the value is None (‘null’ in YAML notation), nothing is returned for this feature. Otherwise list one or more of ‘sequence’, ‘mutations’, ‘accuracy’, ‘cs’, ‘clip5’, and ‘clip3’ to get the sequence, mutation string, accuracy, cs tag, or number of clipped nucleotides from each end.

    In addition, target-level dicts should have keys ‘query_clip5’ and ‘query_clip3’ which give the max amount that can be clipped from each end of the query prior to the alignment. Use a value of None (‘null’ in YAML notation) to have no filter on this clipping. Filters will be applied in the order the features appear in the feature_parse_specs.

  • allow_extra_features (bool) – Can targets have features not in feature_parse_specs?

  • seqsfileformat ({'genbank'}) – Format of seqsfile. Currently, ‘genbank’ is the only supported option. Not all fields of the GenBank Flat File format are required; the documentation includes examples that show which fields should typically be included. GenBank files can be readily generated using several sequence editing programs, such as ApE or Benchling.

  • allow_clipped_muts_seqs (bool) – Returning sequences or mutations for features where non-zero clipping is allowed is dangerous: as described in Targets.parse_alignment(), they cover only the unclipped region and so are easy to misinterpret. You must therefore explicitly set this option to True to allow return of mutations / sequences for features with clipping allowed; otherwise you’ll get an error if you try to recover such sequences / mutations.

  • ignore_feature_parse_specs_keys (None or list) – Ignore these target-level keys in feature_parse_specs. Useful for YAML with default keys that don’t represent actual targets.

  • select_target_names (None or list) – If None, the created object is for all sequences in seqsfile. Otherwise pass a list with names of just the sequences of interest.
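The default rules described for feature_parse_specs can be made concrete with a small sketch. The target name `my_target` and the feature names `gene` and `barcode` are invented for illustration, and `fill_defaults` is a hypothetical helper that only mimics the stated rules (missing filter keys become zero, an explicit None means no filter, a missing ‘return’ means nothing is returned). It is not the library's own code, and it assumes that ‘filter’, when present, is a dict.

```python
import copy

FILTER_KEYS = ("clip5", "clip3", "mutation_nt_count", "mutation_op_count")
TARGET_LEVEL_KEYS = ("query_clip5", "query_clip3")

# Hypothetical target and feature names, for illustration only.
example_specs = {
    "my_target": {
        "query_clip5": 10,
        "query_clip3": None,  # None: no limit on 3' query clipping
        "gene": {
            "filter": {"mutation_nt_count": 30, "mutation_op_count": None},
            "return": ["mutations", "accuracy"],
        },
        "barcode": {"return": "sequence"},  # no 'filter' key at all
    }
}


def fill_defaults(specs):
    """Return a copy of specs with the documented defaults made explicit."""
    filled = copy.deepcopy(specs)
    for target_spec in filled.values():
        for key, feature_spec in target_spec.items():
            if key in TARGET_LEVEL_KEYS:
                continue  # query clipping limits are not feature dicts
            # Missing 'filter' or missing filter keys default to zero;
            # keys explicitly set to None are left as None (no filter).
            filters = feature_spec.setdefault("filter", {})
            for filter_key in FILTER_KEYS:
                filters.setdefault(filter_key, 0)
            # Missing or None 'return' means nothing is returned.
            returns = feature_spec.get("return")
            if returns is None:
                returns = []
            elif isinstance(returns, str):
                returns = [returns]
            feature_spec["return"] = list(returns)
    return filled


filled_specs = fill_defaults(example_specs)
print(filled_specs["my_target"]["barcode"])
```

Note the practical consequence of the zero default: a feature with no ‘filter’ entry tolerates no clipping and no mutations at all, so permissive behavior has to be requested explicitly with None.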

targets

List of all Target objects.

Type:

list

target_names

List of names of all targets.

Type:

list

target_seqs

Keyed by target name, value is sequence as str.

Type:

dict

align(queryfile, alignmentfile, mapper)[source]

Align query sequences to targets.

Parameters:
  • queryfile (str) – The query sequences to align (FASTQ or FASTA, can be gzipped).

  • alignmentfile (str) – SAM file created by mapper with alignments of queries to the target sequences within this Targets object.

  • mapper (alignparse.minimap2.Mapper) – Mapper that runs minimap2. Alignment options set when creating this mapper.

align_and_parse(df, mapper, outdir, *, name_col='name', queryfile_col='queryfile', group_cols=None, to_csv=False, overwrite=False, multi_align='primary', filtered_cs=False, skip_sups=True, ncpus=-1)[source]

Align query sequences and then parse alignments.

Note

This is a convenience method to run Targets.align() and Targets.parse_alignment() on multiple queries and collate the results.

It also allows multiple queries to be handled simultaneously using multiprocessing.

Parameters:
  • df (pandas.DataFrame) – Data frame with information on queries to align.

  • mapper (alignparse.minimap2.Mapper) – Mapper that runs minimap2. Alignment options set when creating this mapper.

  • outdir (str) – Name of directory with created alignments and parsing files. Created if it does not exist.

  • name_col (str) – Column in df with the name of each set of queries.

  • queryfile_col (str) – Column in df with FASTQ file with queries.

  • group_cols (None or str or list) – Columns in df used to “group” results. These columns are in all created data frames. For instance, might specify different libraries or samples.

  • to_csv (bool) – Write CSV files rather than return data frames. Useful to avoid reading large data frames into memory.

  • overwrite (bool) – If some of the created output files already exist, do we overwrite them or raise an error?

  • multi_align ({'primary'}) – How to handle multiple alignments. Currently only option is ‘primary’, which ignores all secondary alignments.

  • filtered_cs (bool) – Add cs tag that failed the filter to filtered dataframe along with filter reason. Allows for more easily investigating why reads are failing the filters.

  • skip_sups (bool) – Whether or not to skip supplementary alignments when parsing. Supplementary alignments are additional possible alignments for a read that arise because the read is potentially chimeric. The default is to skip these alignments and not parse them.

  • ncpus (int) – Number of CPUs to use; -1 means all available.

Returns:

(readstats, aligned, filtered) – Same meaning as for Targets.parse_alignment() except the data frames / CSV files all have additional columns indicating name of each query set (name_col) as well as any group_cols.

Return type:

tuple

feature_parse_specs(returntype)[source]

Get the feature parsing specs.

Note

Filters will be applied in the order they are listed in the feature_parse_specs yaml file or dict. Once a read fails a filter, other filters will not be applied. As such, it is recommended to have features with filters for 5’ and 3’ clipping listed first.

Parameters:

returntype ({'dict', 'yaml'}) – Return a Python dict or a YAML string representation.

Returns:

The feature parsing specs set by the feature_parse_specs at Targets initialization, but with any missing default values explicitly filled in.

Return type:

dict or str
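The note on filter order can be sketched as a loop that stops at the first failing filter. This is a simplified illustration under stated assumptions: `first_failed_filter`, the feature names, and the format of the returned reason string are all hypothetical, and the library's actual filtering logic may differ in detail.

```python
def first_failed_filter(feature_filters, observed):
    """Return a reason string for the first failing filter, else None.

    feature_filters maps feature name -> {filter name: max allowed or None},
    in the order the filters should be applied (dicts preserve insertion
    order). observed maps feature name -> {filter name: observed value}.
    """
    for feature, filters in feature_filters.items():
        for filter_name, max_allowed in filters.items():
            if max_allowed is None:
                continue  # None means this filter is not applied
            if observed[feature][filter_name] > max_allowed:
                # Stop here: later filters are never evaluated.
                return f"{feature} {filter_name}"
    return None


# Hypothetical features; the clipping filter is listed first.
filters = {
    "termini5": {"clip5": 4},
    "gene": {"mutation_nt_count": 2},
}
read = {
    "termini5": {"clip5": 6},
    "gene": {"mutation_nt_count": 5},
}
print(first_failed_filter(filters, read))
```

This read violates both filters, but only the first one is reported, which is why listing the clipping filters first makes the reported reasons easier to interpret.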

features_to_parse(targetname, feature_or_name='feature')[source]

Features to parse for a target.

Parameters:
  • targetname (str) – Name of target.

  • feature_or_name ({'feature', 'name'}) – Get the Feature objects themselves or their names.

Returns:

Features to parse for this target, as specified in Targets.feature_parse_specs().

Return type:

list

get_target(name)[source]

Get Target by name.

Parameters:

name (str) – Name of Target.

Returns:

Returns the target, or raises ValueError if no such target.

Return type:

Target

parse_alignment(samfile, multi_align='primary', to_csv=False, csv_dir=None, overwrite_csv=False, filtered_cs=False, skip_sups=True)[source]

Parse alignment features as specified in feature_parse_specs.

Parameters:
  • samfile (str) – SAM file with minimap2 alignments with cs tag, typically created by Targets.align().

  • multi_align ({'primary'}) – How to handle multiple alignments. Currently only option is ‘primary’, which ignores all secondary alignments.

  • to_csv (bool) – Return CSV file names rather than data frames. Useful to avoid reading large data frames into memory.

  • csv_dir (None or str) – If to_csv is True, name of directory to which we write CSV files (created if needed). If None, write to current directory.

  • overwrite_csv (bool) – If using to_csv, do we overwrite existing CSV files or raise an error if they already exist?

  • filtered_cs (bool) – Add cs tag that failed the filter to filtered dataframe along with filter reason. Allows for more easily investigating why reads are failing the filters.

  • skip_sups (bool) – Whether or not to skip supplementary alignments when parsing. Supplementary alignments are additional possible alignments for a read that arise because the read is potentially chimeric. The default is to skip these alignments and not parse them.

Returns:

(readstats, aligned, filtered)

  • readstats is pandas.DataFrame with numbers of unmapped reads, and for each target the number of mapped reads that are validly aligned and that fail filters in feature_parse_specs.

  • aligned is a dict keyed by name of each target. Entries are pandas.DataFrame with rows for each validly aligned read. Rows give query name, query clipping at each end of alignment, and any feature-level info specified for return in feature_parse_specs in columns with names equal to feature suffixed by ‘_sequence’, ‘_mutations’, ‘_accuracy’, ‘_cs’, ‘_clip5’, and ‘_clip3’.

  • filtered is a dict keyed by name of each target. Entries are pandas.DataFrame with a row for each filtered aligned read giving the query name and the reason it was filtered. If filtered_cs is True, a column with the cs tag that failed the filter is added to each filtered pandas.DataFrame.

If to_csv is True, then aligned and filtered give names of CSV files holding data frames.

Return type:

tuple

Note

The cs tags are in the short format returned by minimap2; see here for details: https://lh3.github.io/minimap2/minimap2.html

When parsing features, if an insertion occurs between two features, it is assigned to the end of the first feature.

Returned sequences, mutation strings, and cs tags are only for the portion of the feature that aligns, and do not indicate clipping, which you instead get in the ‘_clip*’ columns. The sequences are simply what the cs tag implies; indels / mutations are not indicated in this column. Mutation strings are space-delimited with these operations in 1-based (1, 2, …) numbering from start of the feature:

  • ‘A2G’ : substitution at site 2 from A to G

  • ‘ins5TAA’ : insertion of ‘TAA’ starting at site 5

  • ‘del5to6’ : deletion of sites 5 to 6, inclusive
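Downstream code often needs these mutation strings in structured form. The sketch below parses the three operation types listed above with regular expressions. `parse_mutations` and its tuple layout are hypothetical, not part of alignparse, and the patterns assume uppercase letters for nucleotides.

```python
import re

_SUB = re.compile(r"([A-Z])(\d+)([A-Z])")  # e.g. A2G
_INS = re.compile(r"ins(\d+)([A-Z]+)")  # e.g. ins5TAA
_DEL = re.compile(r"del(\d+)to(\d+)")  # e.g. del5to6


def parse_mutations(mutation_str):
    """Parse a space-delimited mutation string into a list of tuples.

    Sites keep the 1-based numbering used in the mutation string.
    """
    parsed = []
    for op in mutation_str.split():
        match = _SUB.fullmatch(op)
        if match:
            wildtype, site, mutant = match.groups()
            parsed.append(("substitution", int(site), wildtype, mutant))
            continue
        match = _INS.fullmatch(op)
        if match:
            site, inserted = match.groups()
            parsed.append(("insertion", int(site), inserted))
            continue
        match = _DEL.fullmatch(op)
        if match:
            first, last = match.groups()
            parsed.append(("deletion", int(first), int(last)))
            continue
        raise ValueError(f"unrecognized mutation operation: {op!r}")
    return parsed


print(parse_mutations("A2G ins5TAA del5to6"))
```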

The returned accuracy is the average accuracy of the aligned query sites as calculated from the Q-values, and is nan if there are no aligned query sites.
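As a rough illustration of averaging accuracy from Q-values, the sketch below assumes the standard Phred relation, in which a Q-value Q corresponds to an error probability of 10^(-Q/10). `mean_accuracy` is a hypothetical helper; the library's own calculation may differ in detail.

```python
import math


def mean_accuracy(qvals):
    """Average accuracy over sites given numeric Phred Q-values.

    Returns nan when there are no sites, mirroring the behavior
    described above for features with no aligned query sites.
    """
    if not qvals:
        return math.nan
    error_probs = [10 ** (-q / 10) for q in qvals]
    return 1 - sum(error_probs) / len(error_probs)


print(mean_accuracy([10, 20]))
print(mean_accuracy([]))
```

Because the average is taken over error probabilities rather than over Q-values, a single low-quality site lowers the result far more than averaging the Q-values first would suggest.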

plot(*, sharex=True, ax_width=5, ax_height=3, hspace=0.4, **kwargs)[source]

Plot all the targets.

Note

For more customizable plots, call Target.image() for individual targets.

Parameters:
  • sharex (bool) – Share x-axis among plots for each target?

  • ax_width (float) – Width of each axis in inches.

  • ax_height (float) – Height of each axis in inches.

  • hspace (float) – Vertical space between axes as fraction of ax_height.

  • **kwargs – Keyword arguments passed to Target.image().

Returns:

Figure showing all targets.

Return type:

matplotlib.figure.Figure

write_fasta(fastafile)[source]

Write all targets to a FASTA file.

Parameters:

fastafile (str or writable file-like object) – Write targets to this file.
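For orientation, a minimal sketch of writing name-to-sequence pairs (like those in target_seqs) as FASTA records to a writable file-like object. `write_fasta_sketch` is hypothetical and writes each sequence on a single line; the exact layout produced by write_fasta may differ.

```python
import io


def write_fasta_sketch(target_seqs, handle):
    """Write each target as a '>name' header line followed by its sequence."""
    for name, seq in target_seqs.items():
        handle.write(f">{name}\n{seq}\n")


# Hypothetical target names and sequences.
buffer = io.StringIO()
write_fasta_sketch({"target1": "ACGT", "target2": "GGCC"}, buffer)
print(buffer.getvalue())
```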