cs_tag¶
Parse SAM entries with cs
short tag from
minimap2 output.
See here for details on
short cs
tag format.
- class alignparse.cs_tag.Alignment(sam_alignment, *, introns_to_deletions=False, target_seqs=None)[source]¶
Bases:
object
Process a SAM alignment with a
cs
tag to extract features.- Parameters:
sam_alignment (pysam.AlignedSegment) – Aligned segment from pysam, must have a short format
cs
tag (see https://lh3.github.io/minimap2/minimap2.html). This must be a mapped read, you will get an error if unmapped.introns_to_deletions (bool) – Convert all introns in the
cs
tag to deletions.target_seqs (dict) – Required if introns_to_deletions is True. Is keyed by target names with values target sequences as str.
- query_name¶
Name of query in alignment.
- Type:
str
- target_name¶
Name of alignment target.
- Type:
str
- cs¶
The
cs
tag.- Type:
str
- query_clip5¶
Length at 5’ end of query not in alignment.
- Type:
int
- query_clip3¶
Length at 3’ end of query not in alignment.
- Type:
int
- target_clip5¶
Length at 5’ end of target not in alignment.
- Type:
int
- target_lastpos¶
Last position of alignment in target (exclusive).
- Type:
int
- orientation¶
Does query align to the target (+) or reverse complement (-).
- Type:
{‘+’, ‘-‘}
- extract_cs(start, end)[source]¶
Extract
cs
tag corresponding to feature in target.- Parameters:
start (int) – Start of feature in target in 0, 1, … numbering.
end (int) – End of feature in target (not inclusive of this site).
- Returns:
If feature is not present in alignment, return None. Otherwise, return (cs, clip5, clip3) where cs is the
cs
string for the aligned portion of the feature, and clip5 and clip3 are the lengths that must be clipped off the end of the feature to get to that alignment.- Return type:
tuple or None
Note
If an insertion is at the boundary of two features, it is assigned as being at the end of the first feature.
- get_accuracy(targetstart, targetend)[source]¶
Get accuracy of part of query aligned to a region of the target.
- Parameters:
targetstart (int) – Start of region in target in 0, 1, … numbering.
targetend (int) – End of region in target (not inclusive of this site).
- Returns:
The accuracy of the region, computed as the average of the Q-scores for all aligned sites in the query, or NaN if there are no query sites aligned to this region.
- Return type:
float or NaN
- alignparse.cs_tag.cs_introns_to_deletions(cs, targetseq)[source]¶
Convert introns to deletions in
cs
tag.- Parameters:
cs (str) – Short-format
cs
tag.targetseq (str) – Region of target sequenced covered by cs alignment.
- Returns:
Version of cs where all introns have been converted to deletions.
- Return type:
str
Examples
>>> cs_introns_to_deletions(':3-ggaac:2', 'ATGGGAACAT') ':3-ggaac:2' >>> cs_introns_to_deletions(':3~gg5ac:2', 'ATGGGAACAT') ':3-ggaac:2' >>> cs_introns_to_deletions(':2*ga~gg6ta:2-gc:2~at4cg:3', ... 'ATGGGCCTATTGCTAATCGAAA') ':2*ga-ggccta:2-gc:2-atcg:3'
- alignparse.cs_tag.cs_op_len_target(cs_op, *, invalid='raise')[source]¶
Get length of valid
cs
operation.- Parameters:
cs_op (str) – A single operation in a short
cs
tag.invalid ({'raise', 'ignore'}) – If cs_string is not a valid string, raise an error or ignore it and return None.
- Returns:
Length of given cs_op. This length is based on the target sequence, so insertions in the query have length 0 and deletions are the length of the target sequence deleted from the query. ‘None’ if cs_op is invalid and invalid is ignore.
- Return type:
int or None
Example
>>> cs_op_len_target('*nt') 1 >>> cs_op_len_target(':45') 45 >>> cs_op_len_target('-at') 2 >>> cs_op_len_target('+gc') 0 >>> cs_op_len_target('*nt:45') Traceback (most recent call last): ... ValueError: invalid `cs_op` of *nt:45 >>> cs_op_len_target('*nt:45', invalid='ignore') is None True
- alignparse.cs_tag.cs_op_type(cs_op, *, invalid='raise')[source]¶
Get type of
cs
operation.- Parameters:
cs_op (str) – A single operation in a short
cs
tag.invalid ({'raise', 'ignore'}) – If cs_string is not a valid string, raise an error or ignore it and return None.
- Returns:
Type of
cs
operation, or None if cs_string invalid and invalid is ‘ignore’.- Return type:
{‘substitution’, ‘insertion’, ‘deletion’, ‘identity’, None}
Example
>>> cs_op_type('*nt') 'substitution' >>> cs_op_type(':45') 'identity' >>> cs_op_type('*nt:45') Traceback (most recent call last): ... ValueError: invalid `cs_op` of *nt:45 >>> cs_op_type('*nt:45', invalid='ignore') is None True
- alignparse.cs_tag.cs_to_mutation_str(cs, offset=0)[source]¶
Convert
cs
tag to a descriptive string of mutations.- Parameters:
cs (str) – A
cs
tag.offset (int) – Mutations are numbered in 1, … numbering plus this offset.
- Returns:
Space-delimited string of form ‘A5T G86A ins7ACG del19to24’ for all mutations specified in cs.
- Return type:
str
Example
>>> cs_to_mutation_str(':4*nt-tc:2+ga:6') 'del6to7 ins10GA' >>> cs_to_mutation_str(':4*at-tc:2+ga:6') 'A5T del6to7 ins10GA' >>> cs_to_mutation_str(':4*at-tc:2+ga:6', offset=2) 'A7T del8to9 ins12GA' >>> cs_to_mutation_str(':45') ''
Note
Mutation strings use “human readable” indexing, so the first nucleotide of the sequence is 1 and deletions are inclusive of the last number.
Changes from ambiguous nucleotides to any other identity are not considered mutations in the returned strings.
- alignparse.cs_tag.cs_to_nt_mutation_count(cs)[source]¶
Count the number of nucleotide mutations in
cs
tag.- Parameters:
cs (str) – cs string
- Returns:
Number of nucleotides that are mutated. Insertions / deletions are counted as the number of nucleotides in the indel. Changes from an ambiguous nucleotide to are not considered mutations.
- Return type:
int
Example
>>> cs_to_nt_mutation_count(':4*nt-tc:2+g') 3 >>> cs_to_nt_mutation_count(':4*gt-tc:2+g') 4
- alignparse.cs_tag.cs_to_op_mutation_count(cs)[source]¶
Count the number of mutation operations in
cs
tag.- Parameters:
cs (str) – The
cs
tag.- Returns:
Number of mutation operations in the query sequence. Each indel or substitution is counted as a single mutation operation regardless of how many mutations it contains. Changes from ambiguous nucleotides to another nucleotide are not counted.
- Return type:
int
Example
>>> cs_to_op_mutation_count(':4*nt-tc:2+g') 2 >>> cs_to_op_mutation_count(':4*gt-tc:2+g') 3
- alignparse.cs_tag.cs_to_sequence(cs, seq)[source]¶
Convert
cs
tag to a sequence.- Parameters:
cs (str) – cs string
seq (str) – Sequence of target for region corresponding to cs string.
- Returns:
Nucleotide sequence generated by applying cs to seq.
- Return type:
str
Example
>>> cs_to_sequence(':4*nt-tc:2+g:2', 'CGGANTCCAAT') 'CGGATCAGAT'
- alignparse.cs_tag.split_cs(cs_string, *, invalid='raise', allow_intron=False)[source]¶
Split a short
cs
tag into its constituent operations.- Parameters:
cs_string (str) – The short
cs
tag.invalid ({'raise', 'ignore'}) – If cs_string is not a valid string, raise an error or ignore it and return None.
allow_intron (bool) – Are introns allowed as
cs
operations?
- Returns:
Tuple of the individual
cs
operations, or None if invalid cs_string and invalid is ‘ignore’.- Return type:
tuple or None
Example
>>> split_cs(':32*nt*na:10-gga:5+aaa:10') (':32', '*nt', '*na', ':10', '-gga', ':5', '+aaa', ':10')
>>> split_cs('bad:32*nt*na:10-gga:5', invalid='ignore') is None True
>>> split_cs('bad:32*nt*na:10-gga:5') Traceback (most recent call last): ... ValueError: invalid `cs_string` of bad:32*nt*na:10-gga:5