minimap2

Runs minimap2 aligner.

class dms_tools2.minimap2.Alignment

Bases: tuple

Alignment of a query to a target.

property additional

List of additional Alignment objects, useful for multiple alignments.

property cigar_str

CIGAR in PAF long format

property q_en

Alignment end in query (0 based).

property q_len

Total length of query prior to any clipping.

property q_st

Alignment start in query (0 based).

property r_en

Alignment end in target (0 based).

property r_len

Total length of target prior to any clipping.

property r_st

Alignment start in target (0 based).

property score

Alignment score.

property strand

1 if aligns in forward polarity, -1 if in reverse.

property target

Name of target to which query was aligned.

class dms_tools2.minimap2.Mapper(targetfile, options, *, prog='minimap2', target_isoforms={})[source]

Bases: object

Class to run minimap2 and get results.

Args:
targetfile (str)

FASTA file with target (reference) to which we align reads.

options (list)

Command line options to minimap2. For recommended options, for different situations, see:

prog (str or None)

Path to minimap2 executable. Class is only tested against version >=2.11 but may work with other versions too.

target_isoforms (dict)

Sometimes targets might be be isoforms of each other. You can specify that with this dict, which is keyed by target names and has values that are sets of other targets. If targets M1 and M2 are isoforms, set target_isoforms={‘M1’:[‘M2’], ‘M2’:[‘M1’]}. This argument is just used to set the target_isoforms attribute, but isn’t used during alignment.

Attributes:
targetfile (str)

Target (reference) file set at initialization.

targetseqs (OrderedDict)

Sequences in targetfile. Keys are sequence names, values are sequences as strings. In same order as listing in targetfile.

prog (str)

Path to minimap2 set at initialization.

options (list)

Options to minimap2 set at initialization.

version (str)

Version of minimap2.

target_isoforms (dict)

Isoforms for each target. This is the value set by target_isoforms at initialization plus ensuring that each target is listed as an isoform of itself.

Here is an example where we align a few reads to two target sequences.

First, we generate a few target sequences:

>>> targetlen = 200
>>> random.seed(1)
>>> targets = collections.OrderedDict()
>>> for i in [1, 2]:
...     targets['target{0}'.format(i)] = ''.join(random.choice(NTS)
...             for _ in range(targetlen))

Now we generate some queries. One is a random sequence that should not align, and the other two are substrings of the targets into which we have introduced a single mutation or indel. The names of the queries give their target, start in query, end in query, cigar string:

>>> queries = {'randseq':''.join(random.choice(NTS) for _ in range(180))}
>>> for qstart, qend, mut in [(0, 183, 'mut53'), (36, 194, 'del140')]:
...     target = random.choice(list(targets.keys()))
...     qseq = targets[target][qstart : qend]
...     mutsite = int(mut[3 : ])
...     if 'mut' in mut:
...         wt = qseq[mutsite]
...         mut = random.choice([nt for nt in NTS if nt != wt])
...         cigar = ('=' + qseq[ : mutsite] + '*' + wt.lower() +
...                  mut.lower() + '=' + qseq[mutsite + 1 : ])
...         qseq = qseq[ : mutsite] + mut + qseq[mutsite + 1 : ]
...     elif 'del' in mut:
...         cigar = ('=' + qseq[ : mutsite] + '-' +
...                  qseq[mutsite].lower() + '=' + qseq[mutsite + 1 : ])
...         qseq = qseq[ : mutsite] + qseq[mutsite + 1 : ]
...     queryname = '_'.join(map(str, [target, qstart, qend, cigar]))
...     queries[queryname] = qseq

Now map the queries to the targets:

>>> TempFile = functools.partial(tempfile.NamedTemporaryFile, mode='w')
>>> with TempFile() as targetfile, TempFile() as queryfile:
...     _ = targetfile.write('\n'.join('>{0}\n{1}'.format(*tup)
...                          for tup in targets.items()))
...     targetfile.flush()
...     _ = queryfile.write('\n'.join('>{0}\n{1}'.format(*tup)
...                         for tup in queries.items()))
...     queryfile.flush()
...     mapper = Mapper(targetfile.name, OPTIONS_CODON_DMS)
...     mapper2 = Mapper(targetfile.name, OPTIONS_CODON_DMS,
...             target_isoforms={'target1':{'target2'},
...                              'target2':{'target1'}})
...     alignments = mapper.map(queryfile.name)
>>> mapper.targetseqs == targets
True

Now make sure we find the expected alignments:

>>> set(alignments.keys()) == set(q for q in queries if q != 'randseq')
True
>>> matched = []
>>> for (query, a) in alignments.items():
...     expected = query.split('_')
...     matched.append(a.target == expected[0])
...     matched.append([a.r_st, a.r_en] == list(map(int, expected[1 : 3])))
...     matched.append([a.q_st, a.q_en] == [0, len(queries[query])])
...     matched.append(a.cigar_str == expected[3])
...     matched.append(a.strand == 1)
>>> all(matched)
True

Test out the target_isoform argument:

>>> mapper.target_isoforms == {'target1':{'target1'}, 'target2':{'target2'}}
True
>>> mapper2.target_isoforms == {'target1':{'target1', 'target2'},
...         'target2':{'target1', 'target2'}}
True
map(queryfile, *, outfile=None, introns_to_gaps=True, shift_indels=True, check_alignments=True)[source]

Map query sequences to target.

Aligns query sequences to targets. Adds --c --cs=long arguments to options to get a long CIGAR string, and returns results as a dictionary, and optionally writes them to a PAF file.

This is not a memory-efficient implementation as a lot is read into memory. So if you have very large queries or targets, that may pose a problem.

Args:
queryfile (str)

FASTA file with query sequences to align. Headers should be unique.

outfile (None or str)

Name of output file containing alignment results in PAF format if a str is provided. Provide None if you don’t want to create a permanent alignment file.

introns_to_gaps (bool)

If there are introns in the alignment CIGARs, convert to gaps by running through intronsToGaps().

shift_indels (bool)

Pass alignments through shiftIndels(). Only applies to alignments returned in dict, does not affect any results in outfile.

check_alignments (bool)

Run all alignments through checkAlignment() before returning, and raise error if any are invalid. This is a good debugging check, but costs time.

Returns:

A dict where keys are the name of each query in queryfile for which there is an alignment (there are no keys for queries that do not align). If there is a single alignment for a query, the value is that Alignment object. There can be multiple primary alignments (see here). If there are multiple alignments, the value is the Alignment with the highest score, and the remaining alignments are listed in the Alignment.additional attribute of that “best” alignment.

class dms_tools2.minimap2.MutationCaller(mapper, *, transcriptconverter=None, targetindex=1, target_clip=0, query_softclip=0)[source]

Bases: object

Call Mutations from Alignment.

Attributes:
mapper (Mapper)

The Mapper used to create the alignments.

transcriptconverter (dms_tools2.seqnumbering.TranscriptConverter or None)

Use if the alignments made by mapper are to transcripts (or partial transcripts) and you want to convert the numbering of mutations to 1, 2, … numbering of the chromosomes containing these transcripts, plus indicate amino-acid substitutions in any CDSs. The target names in mapper should match the mRNA names in transcriptconverter. If you use a transcriptconverter, targetindex must be 1.

targetindex (int)

Number assigned to first nucleotide of target when numbering mutations. A value of 1 means that the first nucleotide of the target is position 1.

target_clip (int)

Ignore any mutations that occur within this many nucleotides of the termini of target. If an indel includes any nucleotides that are not ignored, then the full indel is reported.

query_softclip (int)

Ignore any mutations that occur within this many nucleotides of the termini of the query and are soft clipped in the alignment. If an indel includes any nucleotides not ignored, then the full indel is reported.

Here is an example. First, create an Alignment that corresponds to the following:

target: --ATGCATGAAT--CGAAT
query:  cgATGaAcG--TatCt---
>>> with tempfile.NamedTemporaryFile(mode='w') as targetfile:
...         _ = targetfile.write('>target\nATGCATGGATCGAAT')
...         targetfile.flush()
...         mapper = Mapper(targetfile.name, OPTIONS_CODON_DMS)
>>> a = Alignment(q_st=2, q_en=14, q_len=14, strand=1,
...         r_st=0, r_en=12, r_len=15, score=16, target='target',
...         cigar_str='=ATG*ca=A*tc=G-aa=T+at=C*gt', additional=[])

Also create some Q-values. Just to make things simple for this example, we use unrealistic Q-values. For aligned sites they are equal to the site number in the target; for un-aligned sites they are 50.

>>> qvals = numpy.array([50, 50, 1, 2, 3, 4, 5, 6, 7, 10, 50, 50, 11, 12])

Now call mutations using MutationCaller.call:

>>> mutcaller = MutationCaller(mapper)
>>> muts = mutcaller.call(a, qvals)
>>> muts.substitutions()
['C4A', 'T6C', 'G12T']
>>> muts.substitutions(returnval='site')
[4, 6, 12]
>>> muts.deletions()
['del8to9', 'del13to15']
>>> muts.deletions(returnval='site')
[8, 13]
>>> muts.insertions()
['ins1len2', 'ins11len2']
>>> muts.insertions(returnval='site')
[1, 11]

Check that homo-polymer lengths are correct:

>>> muts.deletions(returnval='homopolymer_length')
[2, 2]
>>> muts.insertions(returnval='homopolymer_length')
[1, 1]

Check that Q-values are also correct:

>>> numpy.allclose(muts.substitutions(returnval='accuracy'),
...     list(map(dms_tools2.pacbio.qvalsToAccuracy, [4, 6, 12])))
True
>>> numpy.allclose(muts.insertions(returnval='accuracy'), list(
...     map(dms_tools2.pacbio.qvalsToAccuracy, [[50, 50], [50, 50]])))
True
>>> numpy.allclose(muts.deletions(returnval='accuracy'),
...     [dms_tools2.pacbio.qvalsToAccuracy(10), math.nan], equal_nan=True)
True

Illustrate targetindex by re-calling with 0-based idexing:

>>> mutcaller_index0 = MutationCaller(mapper, targetindex=0)
>>> muts_index0 = mutcaller_index0.call(a, qvals)
>>> muts_index0.substitutions()
['C3A', 'T5C', 'G11T']
>>> muts_index0.deletions()
['del7to8', 'del12to14']
>>> muts_index0.insertions()
['ins0len2', 'ins10len2']

Use target_clip to ignore mutations near target termini:

>>> mutcaller_targetclip2 = MutationCaller(mapper, target_clip=2)
>>> muts_targetclip2 = mutcaller_targetclip2.call(a, qvals)
>>> muts_targetclip2.substitutions()
['C4A', 'T6C', 'G12T']
>>> muts_targetclip2.deletions()
['del8to9', 'del13to15']
>>> muts_targetclip2.insertions()
['ins11len2']
>>> mutcaller_targetclip4 = MutationCaller(mapper, target_clip=4)
>>> muts_targetclip4 = mutcaller_targetclip4.call(a, qvals)
>>> muts_targetclip4.substitutions()
['T6C']
>>> muts_targetclip4.deletions()
['del8to9']
>>> muts_targetclip4.insertions()
['ins11len2']

Use query_softclip to ignore clipped regions in query:

>>> mutcaller_querysoftclip = MutationCaller(mapper, query_softclip=3)
>>> muts_querysoftclip = mutcaller_querysoftclip.call(a, qvals)
>>> muts_querysoftclip.substitutions()
['C4A', 'T6C', 'G12T']
>>> muts_querysoftclip.deletions()
['del8to9', 'del13to15']
>>> muts_querysoftclip.insertions()
['ins11len2']

Now use transcriptconverter to get mutations numbered in chromosome coordinates along with amino-acid substitutions. First, initialize a dms_tools2.seqnumbering.TranscriptConverter:

>>> with tempfile.NamedTemporaryFile(mode='r+') as genbankfile:
...     _ = genbankfile.write('''
... LOCUS       chromosome                17 bp    DNA              UNK 01-JAN-1980
... FEATURES             Location/Qualifiers
...      mRNA            3..17
...                      /label="target"
...      CDS             3..14
...                      /label="prot"
... ORIGIN
...         1 CAATGCATGG ATCGAAT
... //
... ''')
...     genbankfile.flush()
...     _ = genbankfile.seek(0)
...     transcriptconverter = dms_tools2.seqnumbering.TranscriptConverter(genbankfile)

Now create a MutationCaller that uses transcriptconverter:

>>> mutcaller_tc = MutationCaller(mapper, transcriptconverter=transcriptconverter)

Use it to call the mutations in chromosome coordinates. Note that the amino-acid substitutions that result from a nucleotide point substitution are also indicated in the results:

>>> muts_tc = mutcaller_tc.call(a, qvals)
>>> muts_tc.substitutions()
['chromosome-C6A_prot-His2Asn', 'chromosome-T8C', 'chromosome-G14T']
>>> muts_tc.substitutions(returnval='site')
[6, 8, 14]
>>> muts_tc.deletions()
['chromosome-del10to11', 'chromosome-del15to17']
>>> muts_tc.deletions(returnval='site')
[10, 15]
>>> muts_tc.insertions()
['chromosome-ins3len2', 'chromosome-ins13len2']
>>> muts_tc.insertions(returnval='site')
[3, 13]
call(a, qvals=None)[source]

Call mutations in alignment.

Args:
a (Alignment)

Call mutations in this alignment.

qvals (None or numpy array)

Array of Q-values for the entire query used to build alignment, not just aligned region.

Returns:

A Mutations object holding the mutations. If you are not using transcriptconverter, mutations are named like this:

  • “A60T” for substitution (at site 60)

  • “ins5len10” for insertion (starts at 5, length 10)

  • “del12to14” for deletion (12, 13, and 14 deleted)

where the number refers to the alignment target in mapper.

If you are using transcriptconverter, then mutations are named like this:

  • “fluNS-A61T_fluNS1-Asp12Val” indicating substitution at site 61 in the “fluNS” chromosome causing amino-acid substitution Asp12Val in the “fluNS1” CDS. If there is no amino-acid substitution, the name will just be “fluNS-A61T”.

  • “fluNS-ins4len10” for insertion (again, now in coordinates of the chromosome, which is “fluNS”.

  • “fluNS-del11to13” for deletion.

class dms_tools2.minimap2.MutationConsensus(*, n_mut=2, min_acc=0.999, min_sub_frac=0.3, min_indel_frac=0.5, group_indel_frac=0.9, homopolymer_calling={'n_mut': 3})[source]

Bases: object

Takes consensus of several Mutations.

Designed for when you have called Mutations for several sequences thought to represent the same template. The method make some use of statistical information in the accuracies computed from quality scores, but is heuristic. This is because it is designed for cases where causes other than simple sequencing error (e.g., mis-assigned sequences, true mix of templates, errors during library preparation) may be present.

Initialize a MutationConsensus object with arguments below, which become attributes of the object. Then call consensus mutations with MutationConsensus.callConsensus method.

Args:
n_mut (int)

At least this many sequences must have mutation to call it in consensus.

min_acc (float)

Completely ignore any mutations with accuracy lower than this. Should be a threshold that just gets rid of really low-accuracy ones. Is not applied to mutations with an accuracy of NaN (math.nan).

min_sub_frac (float)

More than this fraction of sequences must have substitution (point mutation) to call it as present.

min_indel_frac (float)

More than this fraction of sequences must have an indel to call it as present. By default this is more stringent than min_sub_frac because indel sequencing errors are more common and deletions can be enriched during PCR.

group_indel_frac (float)

If other overlap by >= this fraction of their total net length with the most common indel, group them together as the most common indel. Designed to handle this case where alignment issues slightly change called boundaries of long indels.

homopolymer_calling (dict)

More stringent calling of single-nucleotide indels in hompolymers, which are defined as runs of >= 3 consecutive nucleotides. Rationale is that main mode of PacBio sequencing errors is short indels in homopolymers. Can contain entries keyed by “n_mut” and “min_indel_frac”. Then for single-nucleotide indels in >= 3 nt homopolymers, increases calling stringency to these new parameters. If no entry in dict, just uses main values set for other mutations, so supply empty dict if you want to no homopolymer correction.

Mutations are called from the list of Mutations passed to MutationConsensus.callConsensus as follows:

  1. Ignore any mutations with accuracy less then min_acc, but do not eliminate mutations with accuracy that is math.nan.

  2. If there are less then n_mut Mutations passed, and they have no mutations, call as wildtype.

  3. If there are less than n_mut Mutations passed and some have mutations, call as ambiguous.

  4. If there are at least n_mut Mutations passed but less than n_mut of them have a mutation, call as wildtype.

  5. If there are at least n_mut Mutations that contain a specific mutation and the fraction of Mutations that have this mutation is >= min_sub_frac (for point mutations) or min_indel_frac (for indels), then call the sequence as having the mutation.

  6. Otherwise call as wildtype.

Here is an example.

First, define some Mutations:

>>> m_wt = Mutations(
...         substitution_tuples=[],
...         insertion_tuples=[],
...         deletion_tuples=[])
>>> m_A1G_high_acc = Mutations(
...         substitution_tuples=[(1, 'A1G', 30)],
...         insertion_tuples=[],
...         deletion_tuples=[])
>>> m_A1G_low_acc = Mutations(
...         substitution_tuples=[(1, 'A1G', 9)],
...         insertion_tuples=[],
...         deletion_tuples=[])

Initialize a MutationConsensus with default args:

>>> mutcons = MutationConsensus()

A single Mutations with no mutation is called wildtype, and so returns an empty string:

>>> mutcons.callConsensus([m_wt], 'substitutions')
''
>>> mutcons.callConsensus([m_wt], 'insertions')
''
>>> mutcons.callConsensus([m_wt], 'deletions')
''

Same for two with no mutations:

>>> mutcons.callConsensus([m_wt, m_wt], 'substitutions')
''

A single Mutations with mutations is called as ambiguous:

>>> mutcons.callConsensus([m_A1G_high_acc], 'substitutions')
'unknown'

One with mutations and one without is called wildtype:

>>> mutcons.callConsensus([m_A1G_high_acc, m_wt], 'substitutions')
''

Two high-quality mutation calls are give a consensus mutation:

>>> mutcons.callConsensus([m_A1G_high_acc, m_A1G_high_acc], 'substitutions')
'A1G'

But it two low-quality mutation calls do not:

>>> mutcons.callConsensus([m_A1G_low_acc, m_A1G_low_acc], 'substitutions')
''

Having two mutation calls and one wildtype call gives mutation:

>>> mutcons.callConsensus([m_A1G_high_acc, m_A1G_high_acc, m_wt],
...         'substitutions')
'A1G'

Unless the wildtype is in sufficient excess:

>>> mutcons.callConsensus([m_A1G_high_acc] * 2 + [m_wt] * 8,
...         'substitutions')
''

Example with two substitutions:

>>> m_A1G_T2A_high_acc = Mutations(
...         substitution_tuples=[(1, 'A1G', 30), (2, 'T2A', 30)],
...         insertion_tuples=[],
...         deletion_tuples=[])
>>> mutcons.callConsensus([m_A1G_T2A_high_acc, m_A1G_T2A_high_acc],
...         'substitutions')
'A1G T2A'

Example where one sequence has two mutations (enough to be called) and the other only has one (not enough to be called):

>>> mutcons.callConsensus([m_A1G_T2A_high_acc, m_A1G_high_acc],
...         'substitutions')
'A1G'

Use the include_stats option to get more detailed output:

>>> mutcons.callConsensus([m_A1G_T2A_high_acc, m_A1G_T2A_high_acc,
...         m_A1G_high_acc, m_wt], 'substitutions', include_stats=True)
'A1G_(3/4) T2A_(2/4)'

Group sufficiently overlapping indels:

>>> m_shortdel = Mutations(
...         substitution_tuples=[],
...         insertion_tuples=[],
...         deletion_tuples=[(8, 9, 'del8to9', math.nan, 1)])
>>> m_longdel = Mutations(
...         substitution_tuples=[],
...         insertion_tuples=[],
...         deletion_tuples=[(8, 18, 'del8to18', math.nan, 1)])
>>> mutcons.callConsensus([m_shortdel] * 2, 'deletions')
'del8to9'
>>> mutcons.callConsensus([m_longdel] * 2, 'deletions')
'del8to18'
>>> m_overlaplongdel = Mutations(
...         substitution_tuples=[],
...         insertion_tuples=[],
...         deletion_tuples=[(9, 17, 'del9to17', math.nan, 1)])
>>> mutcons.callConsensus([m_longdel, m_shortdel], 'deletions')
''
>>> mutcons.callConsensus([m_wt, m_longdel] + [m_overlaplongdel] * 4,
...         'deletions')
'del9to17'

Demonstrate min_acc filter:

>>> mutcons_no_min_acc = MutationConsensus(min_acc=0)
>>> mutcons.callConsensus([m_wt, m_A1G_low_acc] * 6, 'substitutions')
''
>>> mutcons_no_min_acc.callConsensus([m_wt, m_A1G_low_acc] * 6, 'substitutions')
'A1G'

Demonstrate homopolymer_calling:

>>> m_homopolymer_indel = Mutations(
...         substitution_tuples=[],
...         insertion_tuples=[(13, 1, 'ins13len1', math.nan, 3)],
...         deletion_tuples=[(9, 10, 'del9to10', math.nan, 3),
...                          (5, 5, 'del5to5', math.nan, 3)])
>>> mutcons.callConsensus([m_homopolymer_indel] * 2, 'insertions')
''
>>> mutcons.callConsensus([m_homopolymer_indel] * 2, 'deletions')
'del9to10'
>>> mutcons.callConsensus([m_homopolymer_indel] * 3, 'insertions')
'ins13len1'
>>> mutcons.callConsensus([m_homopolymer_indel] * 3, 'deletions')
'del5to5 del9to10'
>>> mutcons.callConsensus([m_homopolymer_indel] * 3 + [m_wt] * 4, 'deletions')
''
>>> mutcons.callConsensus([m_homopolymer_indel] * 3 + [m_wt] * 2, 'deletions')
'del5to5 del9to10'
callConsensus(mutationlist, mutation_type, *, include_stats=False)[source]

Calls consensus from Mutations.

The calling is done using the criteria described in the main docs for MutationConsensus.

Args:
mutationlist (list)

List of one or more Mutations from which we call consensus.

mutation_type (str)

Type of mutation to call. Should be one of ‘substitutions’, ‘insertions’, or ‘deletions’.

include_stats (bool)

Include statistics on number of CCSs that have mutation for each called mutation.

Returns:

A str giving the result. If consensus of mutations cannot be called, returns the string “unknown”. Otherwise returns empty string if no consensus mutations, or a string giving the mutations separated by spaces the same way they are returned by the methods of Mutations. Mutations are sorted first by number of times observed, and then alphabetically.

class dms_tools2.minimap2.Mutations(*, substitution_tuples, insertion_tuples, deletion_tuples, acc_not_q=False)[source]

Bases: object

Class to hold mutations.

Holds three types of mutations: substitutions (point mutations), insertions, and deletions.

The numbering scheme used to define the mutations (e.g., 0-based, 1-based) is determined upstream when deciding how to number the mutations passed at initialization.

When initializing, set Q-values to math.nan if they are not known. The Q-values are used to calculate accuracy of mutations via dms_tools2.pacbio.qvalsToAccuracy(). Likewise, pass math.nan for homopolymer lengths if they are not known.

Args:
substitution_tuples (list)

Substitutions (i, mut_str, q) where i is the site number, mut_str is a string describing substitution, and q is the Q-value.

insertion_tuples (list)

Insertions (i, ins_len, mut_str, qs, hplen) where i is site after insertion, inslen is insertion length, mut_str is string describing insertion, qs is numpy array of Q-values, and hplen is length of homopolymer in which insertion occurs.

deletion_tuples (list)

Deletions (istart, iend, mut_str, q, hplen) where istart is first site of deletion, iend is last site of deletion (so a single nucleotide deletion has istart == iend), mut_str is string describing deletion , q is the Q-value of the site immediately after the deletion in accordance with PacBio CCS specification, and hplen is length of homopolymer in which deletion occurs.

acc_not_q (bool)

If True, then the tuples used to pass the mutations should give the accuracy rather than the Q-values.

You can also create objects using Mutations.from_str.

After initialization, use the methods described below to get information about the mutations. All of the methods that return lists of mutations do them ordered by site (first to last).

We keep track of the length of the homopolymers in which indels are found because the main error mode of PacBio sequencing is indels in homopolymers.

Here is an example. Note that Q-value of 20 indicates an accuracy of 0.99, and a Q-value of 30 indicates an accuracy of 0.999:

>>> muts = Mutations(
...         substitution_tuples=[(1, 'A1T', math.nan),
...             (15, 'C15A', 30), (13, 'G13A', 20)],
...         insertion_tuples=[(5, 2, 'ins5len2', [20, 30], 1)],
...         deletion_tuples=[(8, 10, 'del8to10', 20, 2)])
>>> muts.substitutions()
['A1T', 'G13A', 'C15A']
>>> muts.substitutions(returnval='site')
[1, 13, 15]
>>> muts.insertions()
['ins5len2']
>>> muts.insertions(returnval='site')
[5]
>>> muts.deletions()
['del8to10']
>>> muts.deletions(returnval='site')
[8]

Now with some filtering on accuracy:

>>> muts.substitutions(min_acc=0.99)
['G13A', 'C15A']
>>> muts.substitutions(min_acc=0.995)
['C15A']
>>> muts.insertions(min_acc=0.991)
['ins5len2']
>>> muts.insertions(min_acc=0.999)
[]
>>> muts.deletions(min_acc=0.95)
['del8to10']
>>> muts.deletions(min_acc=0.999)
[]

Now get lengths of insertions / deletions:

>>> muts.deletions(returnval='length')
[3]
>>> muts.insertions(returnval='length')
[2]

Get homopolymer lengths:

>>> muts.deletions(returnval='homopolymer_length')
[2]
>>> muts.insertions(returnval='homopolymer_length')
[1]

Show how we can get string representations with repr and str:

>>> str(muts) == repr(muts)
True
>>> str(muts)
'A1T; G13A (accuracy = 0.990000); C15A (accuracy = 0.999000); ins5len2 (accuracy = 0.994500, homopolymer length = 1); del8to10 (accuracy = 0.990000, homopolymer length = 2)'

Note that when Q-values or homopolymer lengths are not provided, then the aren’t printed in the str representation:

>>> str(Mutations(substitution_tuples=[], insertion_tuples=[],
...         deletion_tuples=[(8, 10, 'del8to10', math.nan, math.nan),
...                          (12, 14, 'del12to14', math.nan, 4)]))
'del8to10; del12to14 (homopolymer length = 4)'

Also show what these representations look like when there are no mutations:

>>> str(Mutations(substitution_tuples=[], insertion_tuples=[],
...         deletion_tuples=[]))
'no mutations'
deletions(*, returnval='mutation', min_acc=None, min_acc_filter_nan=True)[source]

List of deletions.

Args:
min_acc (float or None)

Only include deletions with >= this accuracy.

min_acc_filter_nan (bool)

If using min_acc filter, do we also remove mutations with an accuracy that is math.nan?

returnval (str)

Type of value to return in list:

  • “mutation”: Strings giving mutations, where “del12to13” means deletion of nucleotides 12 to 13, inclusive.

  • “site”: Sites of where deletions start.

  • “length”: Integers giving deletion lengths.

  • “accuracy”: Numbers giving accuracy of each mutation. Accuracy is for first nucleotide after deletion.

  • “homopolymer_length”: Integers giving length of homopolymer in which deletion is found.

Returns:

List of mutations or other value specified by returnval.

classmethod from_str(mut_str)[source]

Create Mutations from str representation.

Args:
mut_str (str)

String representation as returned by str(mut) where mut is a Mutations object. See the examples below for more details.

Returns:

Mutations object.

Examples of creating Mutations.from_str. First, a single substitution without an accuracy:

>>> m = Mutations.from_str('A1T')
>>> str(m)
'A1T'
>>> str(m) == str(Mutations(substitution_tuples=[(1, 'A1T', math.nan)],
...         deletion_tuples=[], insertion_tuples=[]))
True

And now substitutions with Q-values:

>>> m2 = Mutations.from_str('A1T (accuracy = 0.99); G3A (accuracy = 0.999)')
>>> m2 == Mutations.from_str(str(m2))
True

Now including deletions:

>>> m3 = Mutations.from_str(
...         'del1to3 (accuracy = 0.99); '
...         'A6T (accuracy = 0.995); '
...         'del8to9 (accuracy = 0.999, homopolymer length = 2)')
>>> str(m3)
'A6T (accuracy = 0.995000); del1to3 (accuracy = 0.990000); del8to9 (accuracy = 0.999000, homopolymer length = 2)'

An insertion:

>>> m4 = Mutations.from_str('ins5len2 (accuracy = 0.99, homopolymer length = 3)')
>>> str(m4)
'ins5len2 (accuracy = 0.990000, homopolymer length = 3)'

Example of a null mutation:

>>> m_null = Mutations.from_str('no mutations')
>>> m_null2 = Mutations.from_str('')
>>> m_null == m_null2
True
>>> str(m_null) == str(m_null2)
True
>>> str(m_null)
'no mutations'
insertions(*, returnval='mutation', min_acc=None, min_acc_filter_nan=True)[source]

List of insertions.

Args:
min_acc (float or None)

Only include insertions with >= this accuracy.

min_acc_filter_nan (bool)

If using min_acc filter, do we also remove mutations with an accuracy that is math.nan?

returnval (str)

Type of value to return in list:

  • “mutation”: Strings giving mutations, where “ins10len20” means insertion of length 20 immediately before site 10.

  • “site”: Sites of insertions.

  • “length”: Integers giving insertion lengths.

  • “accuracy”: Numbers giving accuracy of each mutation. Accuracy of insertion is averaged over its length.

  • “homopolymer_length”: Integers giving length of homopolymer in which insertion is found.

Returns:

List of mutations or other value specified by returnval.

substitutions(*, returnval='mutation', min_acc=None, min_acc_filter_nan=True)[source]

List of substitutions or associated values.

Args:
min_acc (float or None)

Only include substitutions with >= this accuracy.

min_acc_filter_nan (bool)

If using min_acc filter, do we also remove mutations with an accuracy that is math.nan?

returnval (str)

Type of value to return in list:

  • “mutation”: Strings giving mutations, where “A1T” means site 1 is mutated from A to T.

  • “site”: Sites of substitutions.

  • “accuracy”: Numbers giving accuracy of each mutation.

Returns:

List of mutations or other value specified by returnval.

dms_tools2.minimap2.OPTIONS_CODON_DMS = ['--for-only', '-A2', '-B4', '-O12', '-E2', '--secondary=no', '--end-bonus=13']

options argument to Mapper that works well for codon-mutant libraries such as those created for deep mutational scanning. Indels are highly penalized as they are not expected, and settings are used that effectively call strings of consecutive nucleotide mutations as expected during codon mutagenesis. The queries are assumed to be in the same orientation as the target.

dms_tools2.minimap2.OPTIONS_VIRUS_W_DEL = ['-x', 'splice', '-un', '-C0', '--splice-flank=no', '--mask-level=1', '--secondary=no', '--for-only', '--end-seed-pen=2', '--end-bonus=1']

options argument to Mapper that works well for libraries that contain viral genes expected to potentially have both point mutations (at nucleotide, not codon level) and some longer deletions. The queries are assumed to be in the same orientation as the target. These options resemble those suggested for PacBio IsoSeq but use -C0 and -un to avoid splice-site preference as we want to trick the aligner into thinking long deletions are spliced introns.

class dms_tools2.minimap2.TargetVariants(variantfiles, mapper, *, variantsites_min_acc=None)[source]

Bases: object

After alignment, assign to one of several target variants.

Useful if you aligned against one set of targets, but in reality the queries could align to several different point mutant variants of that target. Use this class to take the alignments and see if they instead exactly match a variant of the target.

Initialize to specify the target variants, then classify using TargetVariants.call.

Args:
variantfiles (dict)

Specifies FASTA files giving the target variants. Each file must have the same target names as in mapper.targetfile. These are the variants to which we compare the queries, and they must be point mutants (same length) as the targets for mapper. Currently only works when there are two sets of variants.

mapper (Mapper)

The mapper used to make the alignments. Used to check that the sequences specified in variantfiles are proper point mutant variants of the alignment targets.

variantsites_min_acc (float or None)

Minimum required accuracy (computed from Q-values) required for calling by TargetVariants.call.

Attributes:
variantfiles (dict)

See above.

mapper (Mapper)

See above.

variantsites_min_acc (float or None)

See above.

variantnames (list)

Alphabeticized list of the variant names that key variantfiles.

targetnames (list)

Alphabeticized list of the target names.

variantseqs (dict)

The actual sequences in variantfiles. Keyed by each variant in variantnames, then keyed by each name in targetnames, and values are the sequences for that variant of that target.

variablesites (dict)

Keyed by target names, each value is a list of all sites that differ between target variants, sorted and in 0, 1, … indexing.

sitevariants (dict)

sitevariants[target][variant] is a list that gives the identity of variant for target ` at each site in variablesites[target].

Here is a short example.

First, create two target seqs, and two variants that each differ at two positions:

>>> target1_wt  = 'ATGCATGAA'
>>> target1_var = 'ATCCATGTA'
>>> target2_wt  = 'GATACCCGG'
>>> target2_var = 'GCTACCCCG'

Now write these to two targetfiles, initialize Mapper with the wildtype target sets, initialize TargetVariants with wildtype and variant target sets:

>>> TempFile = functools.partial(tempfile.NamedTemporaryFile, mode='w')
>>> with TempFile() as wtfile, TempFile() as varfile:
...     _ = wtfile.write('>target1\n{0}\n>target2\n{1}'.format(
...                      target1_wt, target2_wt))
...     wtfile.flush()
...     _ = varfile.write('>target1\n{0}\n>target2\n{1}'.format(
...                       target1_var, target2_var))
...     varfile.flush()
...     mapper = Mapper(wtfile.name, OPTIONS_CODON_DMS)
...     targetvars = TargetVariants(
...             {'wildtype':wtfile.name, 'variant':varfile.name},
...             mapper, variantsites_min_acc=0.99)
>>> targetvars.variantnames
['variant', 'wildtype']
>>> targetvars.targetnames
['target1', 'target2']
>>> sorted(targetvars.variablesites.items())
[('target1', [2, 7]), ('target2', [1, 7])]
>>> targetvars.sitevariants == {
...         'target1':{'wildtype':['G', 'A'], 'variant':['C', 'T']},
...         'target2':{'wildtype':['A', 'G'], 'variant':['C', 'C']}}
True

Now test on some alignments. First, one that matches the wildtype of target2:

>>> a_wildtype = Alignment(q_st=0, q_en=8, q_len=8, strand=1,
...         r_st=1, r_en=9, r_len=9, score=16, target='target2',
...         additional=[], cigar_str='=ATACCCGG')
>>> (variant, a_new) = targetvars.call(a_wildtype)
>>> variant
'wildtype'
>>> a_wildtype == a_new
True

Now one that matches the variant of target2:

>>> a_variant = a_wildtype._replace(cigar_str='*ac=TACCC*gc=G')
>>> (variant, a_new) = targetvars.call(a_variant)
>>> variant
'variant'
>>> a_variant == a_new
False
>>> a_new.cigar_str
'=CTACCCCG'

Now one that matches the variant of target2, but also has another mutation:

>>> a_variant = a_wildtype._replace(cigar_str='*ac=TA*ca=CC*gc=G')
>>> (variant, a_new) = targetvars.call(a_variant)
>>> variant
'variant'
>>> a_variant == a_new
False
>>> a_new.cigar_str
'=CTA*ca=CCCG'

Now one that is mixed (doesn’t match either wildtype or variant):

Now one that is mixed (doesn’t match either wildtype or variant):

>>> a_mixed = a_wildtype._replace(cigar_str='=ATACCC*gc=G')
>>> (variant, a_new) = targetvars.call(a_mixed)
>>> variant
'mixed'
>>> a_mixed == a_new
True

Now an alignment that only spans some variable sites:

>>> a_var_partial = Alignment(q_st=0, q_en=7, q_len=7, strand=1,
...         r_st=2, r_en=9, r_len=9, score=14, target='target2',
...         additional=[], cigar_str='=TACCC*gc=G')
>>> (variant, a_new) = targetvars.call(a_var_partial)
>>> variant
'partial variant'
>>> a_var_partial == a_new
False
>>> a_new.cigar_str
'=TACCCCG'

Now alignments that do and do not pass the accuracy threshold:

>>> a_qvals = Alignment(q_st=1, q_en=9, q_len=9, strand=1,
...         r_st=1, r_en=9, r_len=9, score=16, target='target2',
...         additional=[], cigar_str='=ATACCCGG')
>>> qvals_high = numpy.array([30] * 9)
>>> (variant, a_new) = targetvars.call(a_qvals, qvals_high)
>>> variant
'wildtype'
>>> qvals_low = numpy.array([30, 10] + [30] * 7)
>>> (variant, a_new) = targetvars.call(a_qvals, qvals_low)
>>> variant
'low accuracy'

Now an alignment that does not span any variable sites:

>>> a_unknown = Alignment(q_st=0, q_en=5, q_len=5, strand=1,
...         r_st=2, r_en=7, r_len=9, score=12, target='target2',
...         additional=[], cigar_str='=TACCC')
>>> (variant, a_new) = targetvars.call(a_unknown)
>>> a_unknown == a_new
True
>>> variant
'unknown'
call(a, qvals=None)[source]

Call target variant for an alignment.

Args:
a (Alignment)

The alignment (built with mapper) for which we want to call the target variant.

qvals (None or numpy array)

Array of all Q-values for entire query used to build alignment, not just aligned region. If not None and variantsites_min_acc attribute is not None, then an accuracy requirement is imposed.

Returns:

The 2-tuple (variant, new_a). Possible values are:

  • If a does not cover any of the variable sites, then variant is “unknown” and new_a is just a.

  • If any of the variable sites in a don’t meet the accuracy threshold of variantsites_min_acc, then variant is “low accuracy” and new_a is just a.

  • If all of the variable sites present in a don’t exactly match one of the variants, then variant is “mixed” and new_a is just a.

  • If a exactly matches one of the target variants at all variable sites, then variant is a variant in TargetVariant.variantnames and new_a is a version of a in which any mismatches relative to this target variant have been removed from the Alignment.cigar_str attribute.

  • If a only covers some of the variable sites but all of these match one of the target variants, then variant is “partial <variant>” where <variant> is a variant in TargetVariant.variantnames, and new_a is a version of a in which any mismatches relative to this target variant havea been removed from the Alignment.cigar_str attribute.

dms_tools2.minimap2.checkAlignment(a, target, query)[source]

Checks alignment is valid given target and query.

Arguments:
a (Alignment)

The alignment.

target (str)

The target to which query is aligned.

query (str)

The query aligned to target.

Returns:

True if a is a valid alignment of query to target, False otherwise. Being valid does not mean an alignment is good, just that the start / ends and CIGAR in a are valid.

>>> target = 'ATGCAT'
>>> query = 'TACA'
>>> a_valid = Alignment(target='target', r_st=1, r_en=5,
...         r_len=6, q_st=0, q_en=4, q_len=4, strand=1,
...         cigar_str='=T*ga=CA', additional=[], score=-1)
>>> checkAlignment(a_valid, target, query)
True
>>> a_invalid = a_valid._replace(r_st=0, r_en=4)
>>> checkAlignment(a_invalid, target, query)
False
>>> target = 'ATGCAT'
>>> a_valid2 = Alignment(target='target', r_st=0, r_en=6,
...         r_len=6, q_st=0, q_en=6, q_len=6, strand=1,
...         cigar_str='=A+tgc-tgc=AT', additional=[], score=-1)
>>> checkAlignment(a_valid2, target, target)
True
dms_tools2.minimap2.cigarToQueryAndTarget(cigar)[source]

Returns (query, target) specified by PAF long CIGAR.

Cannot handle CIGAR strings with intron operations. To check those, you first need to run through intronsToGaps().

Args:
cigar (str)

CIGAR string.

Returns:

The 2-tuple (query, target), where each is a string giving the encoded query and target.

>>> cigarToQueryAndTarget('=AT*ac=G+at=AG-ac=T')
('ATCGATAGT', 'ATAGAGACT')
>>> cigarToQueryAndTarget('=A+tgc-tgc=AT')
('ATGCAT', 'ATGCAT')
dms_tools2.minimap2.iTargetToQuery(a, i)[source]

Gets index in query aligned to target index.

Args:
a (Alignment)

The alignment.

i (int)

Index in target in 0-based numbering.

Returns:

Index in query that aligns to site i in target, or None if there is not an alignment at that site.

>>> a = Alignment(target='target', r_st=1, r_en=9, r_len=9,
...         q_st=3, q_en=10, q_len=7, strand=1,
...         cigar_str='=T*ga=CA-ga=T+c=T',
...         additional=[], score=-1)
>>> iTargetToQuery(a, 0) is None
True
>>> iTargetToQuery(a, 1)
3
>>> iTargetToQuery(a, 4)
6
>>> iTargetToQuery(a, 6) is None
True
>>> iTargetToQuery(a, 7)
7
>>> iTargetToQuery(a, 8)
9
>>> iTargetToQuery(a, 9) is None
True
dms_tools2.minimap2.intronsToGaps(cigar, target)[source]

Converts introns to gaps in CIGAR string.

If you run minimap2, it reports introns differently than gaps in the target. This function converts the intron notation to gaps. This is useful if you are using the introns as an ad-hoc way to identify long gaps.

Args:
cigar (str)

PAF long CIGAR string, format is detailed here.

target (str)

The exact portion of the target aligned to the query in cigar.

>>> target = 'ATGGAACTAGCATCTAG'
>>> cigar = '=A+ca=TG-g=A*ag=CT~ag5at=CTAG'
>>> intronsToGaps(cigar, target)
'=A+ca=TG-g=A*ag=CT-agcat=CTAG'
dms_tools2.minimap2.mutateSeq(wtseq, mutations, insertions, deletions)[source]

Mutates sequence and gets CIGAR.

Primarily useful for simulations.

In the mutation specifications below, 0-based numbering is used. Sequence characters are upper case nucleotides. Operations are applied in the order: mutations, insertions, deletions. So a deletion can overwrite a mutation or insertion. The entire insertion counts as just one added site when indexing the deletions. You will get an error if deletions and insertions overlap.

Arguments:
wtseq (str)

The wildtype sequence.

mutations (list)

List of point mutations in form (i, mut) where i is site and mut is mutant amino acid (can be same as wildtype).

deletions (list)

List of deletion locations in form (istart, iend).

insertions (list)

List of insertions in form (i, seqtoinsert).

Returns:

The 2-tuple (mutantseq, cigar) where cigar is the CIGAR in PAF long format.

Here is an example:

>>> wtseq = 'ATGGAATGA'
>>> (mutantseq, cigar) = mutateSeq(wtseq, [], [], [])
>>> mutantseq == wtseq
True
>>> cigar == '=' + wtseq
True
>>> (mutantseq, cigar) = mutateSeq(wtseq,
...         [(0, 'C'), (1, 'T'), (3, 'A')],
...         [(8, 'TAC')], [(5, 2)])
>>> mutantseq
'CTGAAGTACA'
>>> cigar
'*ac=TG*ga=A-at=G+tac=A'
dms_tools2.minimap2.numAligned(cigar)[source]

Gets number of aligned nucleotides from PAF long CIGAR.

Args:
cigar (str)

CIGAR str.

Returns:

The number of aligned nucleotides in the cigar, where a nucleotide is considered aligned if it is either a match or a point mutation, but not if it is an indel.

Example: the CIGAR below has 3 matches, a deletion, 5 matches, 2 mutations, 2 matches, an insertion, 3 matches, 1 mutation, and 2 matches. So this counts as 3 + 5 + 2 + 2 + 3 + 1 + 2 = 18 aligned nucleotides:

>>> numAligned('=ACT-gata=AGTCA*ta*ga=TA+tta=GCA*ca=GT')
18
dms_tools2.minimap2.numExactMatches(cigar)[source]

Number exactly matched nucleotides in long CIGAR.

>>> numExactMatches('=ATG-aca=A*gc+ac=TAC')
7
dms_tools2.minimap2.parsePAF(paf_file, targets=None, introns_to_gaps=False)[source]

Parse *.paf file as created by minimap2.

paf_file is assumed to be created with the minimap2 options -c --cs=long, which creates long cs tags with the CIGAR string.

PAF format is described here. PAF long CIGAR string format from minimap2 is detailed here.

Args:
paf_file (str or iterator)

If str, should be name of *.paf file. Otherwise should be an iterator that returns lines as would be read from a PAF file.

targets (dict or None)

If not None, is dict keyed by target names, with values target sequences. You need to provide this if minimap2 is run with a --splice option and introns_to_gaps is True.

introns_to_gap (bool)

Pass CIGAR strings through intronsToGaps(). Requires you to provide targets.

Returns:

A generator that yields query / alignments on each line in paf_file. Returned as the 2-tuple (query_name, a), where query_name is a str giving the name of the query sequence, and a is an Alignment.

Here is a short example:

>>> paf_file = io.StringIO('\t'.join([
...         'myquery', '10', '0', '10', '+', 'mytarget',
...         '20', '5', '15', '9', '10', '60',
...         'cs:Z:=ATG*ga=GAACAT', 'AS:i:7']))
>>> alignments = [tup for tup in parsePAF(paf_file)]
>>> len(alignments)
1
>>> (queryname, alignment) = alignments[0]
>>> queryname
'myquery'
>>> alignment.target
'mytarget'
>>> (alignment.r_st, alignment.r_en)
(5, 15)
>>> (alignment.q_st, alignment.q_en)
(0, 10)
>>> alignment.strand
1
>>> alignment.cigar_str
'=ATG*ga=GAACAT'
>>> alignment.q_len
10
>>> alignment.score
7

Now an example of using targets and introns_to_gaps. You can see that this option converts the ~gg5ac to -ggaac in the cigar_str attribute:

>>> targets = {'mytarget':'ATGGGAACAT'}
>>> paf_file = io.StringIO('\t'.join([
...         'myquery', '9', '0', '9', '+', 'mytarget',
...         '10', '1', '10', '?', '4', '60',
...         'cs:Z:=TG~gg5ac=AT', 'AS:i:2']))
>>> a_keep_introns = [tup for tup in parsePAF(paf_file)][0][1]
>>> _ = paf_file.seek(0)
>>> a_introns_to_gaps = [tup for tup in parsePAF(paf_file,
...         targets=targets, introns_to_gaps=True)][0][1]
>>> a_keep_introns.cigar_str
'=TG~gg5ac=AT'
>>> a_introns_to_gaps.cigar_str
'=TG-ggaac=AT'
dms_tools2.minimap2.removeCIGARmutations(cigar, muts_to_remove)[source]

Removes point mutations from CIGAR string.

Args:
cigar (str)

Long format CIGAR string.

muts_to_remove (dict)

Dict keyed by site number in 0-based numbering of target starting at first target position in cigar, values are nucleotides that we want to make the new target identity for the CIGAR at that site. All of these nucleotides must be the wildtype in the current CIGAR mutation.

Returns:

New CIGAR string expected if cigar was actually to the target where the wildtype identity is what is given by the mutation.

>>> cigar = '=AT-gca=T*at=G+ca*ga=T*at'
>>> muts_to_remove = {6:'T', 8:'A'}
>>> removeCIGARmutations(cigar, muts_to_remove)
'=AT-gca=TTG+ca=AT*at'
dms_tools2.minimap2.shiftIndels(cigar)[source]

Shifts indels to consistent position.

In some cases it is ambiguous where to place insertions / deletions in the CIGAR string. This function moves them to a consistent location (as far forward as possible).

Args:
cigar (str)

PAF long CIGAR string, format is detailed here.

Returns:

A version of cigar with indels shifted as far forward as possible.

>>> shiftIndels('=AAC-atagcc=GGG-ac=T')
'=AA-catagc=CGGG-ac=T'
>>> shiftIndels('=AAC-atagac=GGG-acg=AT')
'=A-acatag=ACGG-gac=GAT'
>>> shiftIndels('=TCC+c=TCAGA+aga=CT')
'=T+c=CCTC+aga=AGACT'
dms_tools2.minimap2.trimCigar(side, cigar)[source]

Trims a nucleotide from CIGAR string.

Currently just trims one site.

Args:
side (str)

“start” trim from start, “end” to trim from end.

cigar (str)

PAF long CIGAR string, format is detailed here.

Returns:

A version of cigar with a single site trimmed from start or end.

>>> trimCigar('start', '=ATG')
'=TG'
>>> trimCigar('end', '=ATG')
'=AT'
>>> trimCigar('start', '*ac=TG')
'=TG'
>>> trimCigar('end', '=AT*ag')
'=AT'
>>> trimCigar('start', '-aac=TG')
'-ac=TG'
>>> trimCigar('end', '=TG+aac')
'=TG+aa'