minimap2

Runs minimap2 aligner. minimap2 must be version 2.17 or higher.

class alignparse.minimap2.Mapper(options, *, prog='minimap2', min_version='2.17', check_cs=True, retain_tags=None)[source]

Bases: object

Run minimap2.

Parameters:
  • options (tuple or list) – Command line options to minimap2. For instance, see OPTIONS_CODON_DMS or OPTIONS_VIRUS_W_DEL.

  • prog (str) – Path to minimap2 executable.

  • min_version (str) – Minimum version of minimap2. Mapper has only been tested with versions >= 2.17.

  • check_cs (bool) – Check that options includes --cs to output the short cs tag.

  • retain_tags (None or list) – Retain these tags in query sequences in output alignments.

options

Options to minimap2.

Type:

list

prog

Path to minimap2 executable.

Type:

str

version

Version of minimap2.

Type:

str

retain_tags

List of tags to retain.

Type:

None or list

Note

If retain_tags is set, then Mapper.map_to_sam() searches for tags in query FASTQ headers like the np` tags here:

@m54228_181120_212724/4194377/ccs   np:i:45

and retains them in the samfile alignment output. Such tags are present in FASTQ files created by PacBio ccs version 4.0. For the above header, you’d use retain_tags=[‘np’].

Examples

First, create toy example target and query file:

>>> tempdir = tempfile.TemporaryDirectory()
>>> targetfile = os.path.join(tempdir.name, 'targets.fasta')
>>> queryfile = os.path.join(tempdir.name, 'queries.fastq')
>>> with open(targetfile, 'w') as f:
...     _ = f.write(textwrap.dedent('''
...         >refseq
...         ATGCAANNNGATGCATAGTATTAGCATAAATAGGATAGCCATAAGGTTACTGCATAAGGGTAT
...         ''').strip())
>>> with open(queryfile, 'w') as f:
...     _ = f.write(textwrap.dedent('''
...         @m54228_181120_212724/4194376/ccs   np:i:127
...         ATGCAAAATGATGCATAGTATTAGCATAAATAGGATAGCCATAAGGTTACTGCATAAGAGTAT
...         +
...         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...         ''').strip())

Now map not retaining tags from the FASTQ (this is default behavior):

>>> samfile = os.path.join(tempdir.name, 'alignments.sam')
>>> mapper = Mapper(OPTIONS_CODON_DMS)
>>> mapper.map_to_sam(targetfile, queryfile, samfile)
>>> for a in pysam.AlignmentFile(samfile):
...     tag_names = [tup[0] for tup in a.get_tags()]
...     print(a.tostring())  
m54228_181120_212724/4194376/ccs 0 refseq 1 1 63M * 0 0
ATGCAAAATGATGCATAGTATTAGCATAAATAGGATAGCCATAAGGTTACTGCATAAGAGTAT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NM:i:4 ms:i:111 AS:i:111 nn:i:3 tp:A:P cm:i:7 s1:i:41 s2:i:0 de:f:0.0635
cs:Z::6*na*na*nt:49*ga:4 rl:i:0
>>> print(tag_names)
['NM', 'ms', 'AS', 'nn', 'tp', 'cm', 's1', 's2', 'de', 'cs', 'rl']

Now map retaining the np tag in the FASTQ:

>>> samfile_tags = os.path.join(tempdir.name, 'alignments_tags.sam')
>>> mapper_tags = Mapper(OPTIONS_CODON_DMS, retain_tags=['np'])
>>> mapper_tags.map_to_sam(targetfile, queryfile, samfile_tags)
>>> for a in pysam.AlignmentFile(samfile_tags):
...     tag_names = [tup[0] for tup in a.get_tags()]
...     print(a.tostring())  
m54228_181120_212724/4194376/ccs 0 refseq 1 1 63M * 0 0
ATGCAAAATGATGCATAGTATTAGCATAAATAGGATAGCCATAAGGTTACTGCATAAGAGTAT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NM:i:4 ms:i:111 AS:i:111 nn:i:3 tp:A:P cm:i:7 s1:i:41 s2:i:0 de:f:0.0635
cs:Z::6*na*na*nt:49*ga:4 rl:i:0 np:i:127
>>> print(tag_names)  
['NM', 'ms', 'AS', 'nn', 'tp', 'cm', 's1', 's2', 'de', 'cs', 'rl', 'np']

Remove the temporary directory for the example:

>>> tempdir.cleanup()
map_to_sam(targetfile, queryfile, samfile)[source]

Map query sequences to targets and output SAM file.

Parameters:
  • targetfile (str) – Alignment targets (reference), typically FASTA (can be gzipped).

  • queryfile (str) – Queries to align, typically FASTA or FASTQ (can be gzipped).

  • samfile (str) – Name of created SAM file (should have suffix .sam).

alignparse.minimap2.OPTIONS_CODON_DMS = ('-A2', '-B4', '-O12', '-E2', '--end-bonus=13', '--secondary=no', '--cs')

minimap2 options for codon-mutant libraries.

Note

These options work well for sequences expected to have codon-level mutations (3 nucleotides in a row) and not expected to have many indels.

Options have the following meaning / rationale:
  • -A2 : score for a match

  • -B4 : penalty for a mismatch

  • -O12 : gap opening penalty (high, because few indels expected)

  • -E2 : gap extension penalty

  • --end-bonus=13 : favor codon mutation over gap at termini.

  • --secondary=no : do not output secondary alignments.

  • --cs : add short cs tag ((see https://github.com/lh3/minimap2#cs)

Type:

tuple

alignparse.minimap2.OPTIONS_VIRUS_W_DEL = ('-xsplice:hq', '-un', '-C0', '--splice-flank=no', '-M=1', '--end-seed-pen=2', '--end-bonus=2', '--secondary=no', '--cs')

minimap2 options for viral genes.

Note

These options work well for sequences expected to have nucleotide mutations and long internal deletions. It handles long deletions by having minimap2 treat them like introns. This means that in the output, introns should be parsed like deletions.

Options have the following meaning / rationale:
  • -xsplice:hq : preset for long-read high-quality spliced alignment.

  • -un : do not try to find canonical splice sites.

  • -C0 : no cost for non-canonical splice sites.

  • --splice-flank=no : no assumptions about base next to splice donor.

  • -M=1 : mark secondary chain that overlaps with another by this much.

  • --end-seed-pen=2 : described as helping avoid tiny terminal exons.

  • --end-bonus=2 : bonus for extending to end of alignment.

  • --secondary=no : do not output secondary alignments.

  • --cs : add short cs tag ((see https://github.com/lh3/minimap2#cs)

Type:

tuple