minimap2¶
Runs minimap2 aligner.
minimap2
must be version 2.17 or higher.
- class alignparse.minimap2.Mapper(options, *, prog='minimap2', min_version='2.17', check_cs=True, retain_tags=None)[source]¶
Bases:
object
Run minimap2.
- Parameters:
options (tuple or list) – Command line options to
minimap2
. For instance, seeOPTIONS_CODON_DMS
orOPTIONS_VIRUS_W_DEL
.prog (str) – Path to
minimap2
executable.min_version (str) – Minimum version of
minimap2
.Mapper
has only been tested with versions >= 2.17.check_cs (bool) – Check that options includes
--cs
to output the shortcs
tag.retain_tags (None or list) – Retain these tags in query sequences in output alignments.
- options¶
Options to
minimap2
.- Type:
list
- prog¶
Path to
minimap2
executable.- Type:
str
- version¶
Version of
minimap2
.- Type:
str
- retain_tags¶
List of tags to retain.
- Type:
None or list
Note
If retain_tags is set, then
Mapper.map_to_sam()
searches for tags in query FASTQ headers like the np` tags here:@m54228_181120_212724/4194377/ccs np:i:45
and retains them in the samfile alignment output. Such tags are present in FASTQ files created by PacBio
ccs
version 4.0. For the above header, you’d use retain_tags=[‘np’].Examples
First, create toy example target and query file:
>>> tempdir = tempfile.TemporaryDirectory() >>> targetfile = os.path.join(tempdir.name, 'targets.fasta') >>> queryfile = os.path.join(tempdir.name, 'queries.fastq') >>> with open(targetfile, 'w') as f: ... _ = f.write(textwrap.dedent(''' ... >refseq ... ATGCAANNNGATGCATAGTATTAGCATAAATAGGATAGCCATAAGGTTACTGCATAAGGGTAT ... ''').strip()) >>> with open(queryfile, 'w') as f: ... _ = f.write(textwrap.dedent(''' ... @m54228_181120_212724/4194376/ccs np:i:127 ... ATGCAAAATGATGCATAGTATTAGCATAAATAGGATAGCCATAAGGTTACTGCATAAGAGTAT ... + ... ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ... ''').strip())
Now map not retaining tags from the FASTQ (this is default behavior):
>>> samfile = os.path.join(tempdir.name, 'alignments.sam') >>> mapper = Mapper(OPTIONS_CODON_DMS) >>> mapper.map_to_sam(targetfile, queryfile, samfile) >>> for a in pysam.AlignmentFile(samfile): ... tag_names = [tup[0] for tup in a.get_tags()] ... print(a.tostring()) m54228_181120_212724/4194376/ccs 0 refseq 1 1 63M * 0 0 ATGCAAAATGATGCATAGTATTAGCATAAATAGGATAGCCATAAGGTTACTGCATAAGAGTAT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NM:i:4 ms:i:111 AS:i:111 nn:i:3 tp:A:P cm:i:7 s1:i:41 s2:i:0 de:f:0.0635 cs:Z::6*na*na*nt:49*ga:4 rl:i:0 >>> print(tag_names) ['NM', 'ms', 'AS', 'nn', 'tp', 'cm', 's1', 's2', 'de', 'cs', 'rl']
Now map retaining the np tag in the FASTQ:
>>> samfile_tags = os.path.join(tempdir.name, 'alignments_tags.sam') >>> mapper_tags = Mapper(OPTIONS_CODON_DMS, retain_tags=['np']) >>> mapper_tags.map_to_sam(targetfile, queryfile, samfile_tags) >>> for a in pysam.AlignmentFile(samfile_tags): ... tag_names = [tup[0] for tup in a.get_tags()] ... print(a.tostring()) m54228_181120_212724/4194376/ccs 0 refseq 1 1 63M * 0 0 ATGCAAAATGATGCATAGTATTAGCATAAATAGGATAGCCATAAGGTTACTGCATAAGAGTAT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NM:i:4 ms:i:111 AS:i:111 nn:i:3 tp:A:P cm:i:7 s1:i:41 s2:i:0 de:f:0.0635 cs:Z::6*na*na*nt:49*ga:4 rl:i:0 np:i:127 >>> print(tag_names) ['NM', 'ms', 'AS', 'nn', 'tp', 'cm', 's1', 's2', 'de', 'cs', 'rl', 'np']
Remove the temporary directory for the example:
>>> tempdir.cleanup()
- map_to_sam(targetfile, queryfile, samfile)[source]¶
Map query sequences to targets and output SAM file.
- Parameters:
targetfile (str) – Alignment targets (reference), typically FASTA (can be gzipped).
queryfile (str) – Queries to align, typically FASTA or FASTQ (can be gzipped).
samfile (str) – Name of created SAM file (should have suffix
.sam
).
- alignparse.minimap2.OPTIONS_CODON_DMS = ('-A2', '-B4', '-O12', '-E2', '--end-bonus=13', '--secondary=no', '--cs')¶
minimap2
options for codon-mutant libraries.Note
These options work well for sequences expected to have codon-level mutations (3 nucleotides in a row) and not expected to have many indels.
- Options have the following meaning / rationale:
-A2
: score for a match-B4
: penalty for a mismatch-O12
: gap opening penalty (high, because few indels expected)-E2
: gap extension penalty--end-bonus=13
: favor codon mutation over gap at termini.--secondary=no
: do not output secondary alignments.--cs
: add shortcs
tag ((see https://github.com/lh3/minimap2#cs)
- Type:
tuple
- alignparse.minimap2.OPTIONS_VIRUS_W_DEL = ('-xsplice:hq', '-un', '-C0', '--splice-flank=no', '-M=1', '--end-seed-pen=2', '--end-bonus=2', '--secondary=no', '--cs')¶
minimap2
options for viral genes.Note
These options work well for sequences expected to have nucleotide mutations and long internal deletions. It handles long deletions by having
minimap2
treat them like introns. This means that in the output, introns should be parsed like deletions.- Options have the following meaning / rationale:
-xsplice:hq
: preset for long-read high-quality spliced alignment.-un
: do not try to find canonical splice sites.-C0
: no cost for non-canonical splice sites.--splice-flank=no
: no assumptions about base next to splice donor.-M=1
: mark secondary chain that overlaps with another by this much.--end-seed-pen=2
: described as helping avoid tiny terminal exons.--end-bonus=2
: bonus for extending to end of alignment.--secondary=no
: do not output secondary alignments.--cs
: add shortcs
tag ((see https://github.com/lh3/minimap2#cs)
- Type:
tuple