cs_tag

Parse SAM entries with cs short tag from minimap2 output.

See here for details on short cs tag format.

class alignparse.cs_tag.Alignment(sam_alignment, *, introns_to_deletions=False, target_seqs=None)[source]

Bases: object

Process a SAM alignment with a cs tag to extract features.

Parameters:
  • sam_alignment (pysam.AlignedSegment) – Aligned segment from pysam, must have a short format cs tag (see https://lh3.github.io/minimap2/minimap2.html). This must be a mapped read, you will get an error if unmapped.

  • introns_to_deletions (bool) – Convert all introns in the cs tag to deletions.

  • target_seqs (dict) – Required if introns_to_deletions is True. Is keyed by target names with values target sequences as str.

query_name

Name of query in alignment.

Type:

str

target_name

Name of alignment target.

Type:

str

cs

The cs tag.

Type:

str

query_clip5

Length at 5’ end of query not in alignment.

Type:

int

query_clip3

Length at 3’ end of query not in alignment.

Type:

int

target_clip5

Length at 5’ end of target not in alignment.

Type:

int

target_lastpos

Last position of alignment in target (exclusive).

Type:

int

orientation

Does query align to the target (+) or reverse complement (-).

Type:

{‘+’, ‘-‘}

extract_cs(start, end)[source]

Extract cs tag corresponding to feature in target.

Parameters:
  • start (int) – Start of feature in target in 0, 1, … numbering.

  • end (int) – End of feature in target (not inclusive of this site).

Returns:

If feature is not present in alignment, return None. Otherwise, return (cs, clip5, clip3) where cs is the cs string for the aligned portion of the feature, and clip5 and clip3 are the lengths that must be clipped off the end of the feature to get to that alignment.

Return type:

tuple or None

Note

If an insertion is at the boundary of two features, it is assigned as being at the end of the first feature.

get_accuracy(targetstart, targetend)[source]

Get accuracy of part of query aligned to a region of the target.

Parameters:
  • targetstart (int) – Start of region in target in 0, 1, … numbering.

  • targetend (int) – End of region in target (not inclusive of this site).

Returns:

The accuracy of the region, computed as the average of the Q-scores for all aligned sites in the query, or NaN if there are no query sites aligned to this region.

Return type:

float or NaN

alignparse.cs_tag.cs_introns_to_deletions(cs, targetseq)[source]

Convert introns to deletions in cs tag.

Parameters:
  • cs (str) – Short-format cs tag.

  • targetseq (str) – Region of target sequenced covered by cs alignment.

Returns:

Version of cs where all introns have been converted to deletions.

Return type:

str

Examples

>>> cs_introns_to_deletions(':3-ggaac:2', 'ATGGGAACAT')
':3-ggaac:2'
>>> cs_introns_to_deletions(':3~gg5ac:2', 'ATGGGAACAT')
':3-ggaac:2'
>>> cs_introns_to_deletions(':2*ga~gg6ta:2-gc:2~at4cg:3',
...                  'ATGGGCCTATTGCTAATCGAAA')
':2*ga-ggccta:2-gc:2-atcg:3'
alignparse.cs_tag.cs_op_len_target(cs_op, *, invalid='raise')[source]

Get length of valid cs operation.

Parameters:
  • cs_op (str) – A single operation in a short cs tag.

  • invalid ({'raise', 'ignore'}) – If cs_string is not a valid string, raise an error or ignore it and return None.

Returns:

Length of given cs_op. This length is based on the target sequence, so insertions in the query have length 0 and deletions are the length of the target sequence deleted from the query. ‘None’ if cs_op is invalid and invalid is ignore.

Return type:

int or None

Example

>>> cs_op_len_target('*nt')
1
>>> cs_op_len_target(':45')
45
>>> cs_op_len_target('-at')
2
>>> cs_op_len_target('+gc')
0
>>> cs_op_len_target('*nt:45')
Traceback (most recent call last):
...
ValueError: invalid `cs_op` of *nt:45
>>> cs_op_len_target('*nt:45', invalid='ignore') is None
True
alignparse.cs_tag.cs_op_type(cs_op, *, invalid='raise')[source]

Get type of cs operation.

Parameters:
  • cs_op (str) – A single operation in a short cs tag.

  • invalid ({'raise', 'ignore'}) – If cs_string is not a valid string, raise an error or ignore it and return None.

Returns:

Type of cs operation, or None if cs_string invalid and invalid is ‘ignore’.

Return type:

{‘substitution’, ‘insertion’, ‘deletion’, ‘identity’, None}

Example

>>> cs_op_type('*nt')
'substitution'
>>> cs_op_type(':45')
'identity'
>>> cs_op_type('*nt:45')
Traceback (most recent call last):
...
ValueError: invalid `cs_op` of *nt:45
>>> cs_op_type('*nt:45', invalid='ignore') is None
True
alignparse.cs_tag.cs_to_mutation_str(cs, offset=0)[source]

Convert cs tag to a descriptive string of mutations.

Parameters:
  • cs (str) – A cs tag.

  • offset (int) – Mutations are numbered in 1, … numbering plus this offset.

Returns:

Space-delimited string of form ‘A5T G86A ins7ACG del19to24’ for all mutations specified in cs.

Return type:

str

Example

>>> cs_to_mutation_str(':4*nt-tc:2+ga:6')
'del6to7 ins10GA'
>>> cs_to_mutation_str(':4*at-tc:2+ga:6')
'A5T del6to7 ins10GA'
>>> cs_to_mutation_str(':4*at-tc:2+ga:6', offset=2)
'A7T del8to9 ins12GA'
>>> cs_to_mutation_str(':45')
''

Note

Mutation strings use “human readable” indexing, so the first nucleotide of the sequence is 1 and deletions are inclusive of the last number.

Changes from ambiguous nucleotides to any other identity are not considered mutations in the returned strings.

alignparse.cs_tag.cs_to_nt_mutation_count(cs)[source]

Count the number of nucleotide mutations in cs tag.

Parameters:

cs (str) – cs string

Returns:

Number of nucleotides that are mutated. Insertions / deletions are counted as the number of nucleotides in the indel. Changes from an ambiguous nucleotide to are not considered mutations.

Return type:

int

Example

>>> cs_to_nt_mutation_count(':4*nt-tc:2+g')
3
>>> cs_to_nt_mutation_count(':4*gt-tc:2+g')
4
alignparse.cs_tag.cs_to_op_mutation_count(cs)[source]

Count the number of mutation operations in cs tag.

Parameters:

cs (str) – The cs tag.

Returns:

Number of mutation operations in the query sequence. Each indel or substitution is counted as a single mutation operation regardless of how many mutations it contains. Changes from ambiguous nucleotides to another nucleotide are not counted.

Return type:

int

Example

>>> cs_to_op_mutation_count(':4*nt-tc:2+g')
2
>>> cs_to_op_mutation_count(':4*gt-tc:2+g')
3
alignparse.cs_tag.cs_to_sequence(cs, seq)[source]

Convert cs tag to a sequence.

Parameters:
  • cs (str) – cs string

  • seq (str) – Sequence of target for region corresponding to cs string.

Returns:

Nucleotide sequence generated by applying cs to seq.

Return type:

str

Example

>>> cs_to_sequence(':4*nt-tc:2+g:2', 'CGGANTCCAAT')
'CGGATCAGAT'
alignparse.cs_tag.split_cs(cs_string, *, invalid='raise', allow_intron=False)[source]

Split a short cs tag into its constituent operations.

Parameters:
  • cs_string (str) – The short cs tag.

  • invalid ({'raise', 'ignore'}) – If cs_string is not a valid string, raise an error or ignore it and return None.

  • allow_intron (bool) – Are introns allowed as cs operations?

Returns:

Tuple of the individual cs operations, or None if invalid cs_string and invalid is ‘ignore’.

Return type:

tuple or None

Example

>>> split_cs(':32*nt*na:10-gga:5+aaa:10')
(':32', '*nt', '*na', ':10', '-gga', ':5', '+aaa', ':10')
>>> split_cs('bad:32*nt*na:10-gga:5', invalid='ignore') is None
True
>>> split_cs('bad:32*nt*na:10-gga:5')
Traceback (most recent call last):
...
ValueError: invalid `cs_string` of bad:32*nt*na:10-gga:5