pacbio¶
Tools for processing PacBio sequencing data.
-
class
dms_tools2.pacbio.
CCS
(samplename, ccsfile, reportfile)[source]¶ Bases:
object
Class to handle results of
ccs
.Holds results of PacBio
ccs
. Has been tested on output ofccs
version 3.0.0.This class reads all data into memory, and so you may need a lot of RAM if ccsfile is large.
- Args:
- samplename (str)
Sample or sequencing run
- ccsfile (str)
File created by
ccs
that holds the CCSs. Theccs
program outputs BAM files. However, you can also pass FASTQ files generated from these BAM files usingsamtools bam2fq -T np,rq <bamfile>
(note that-T np,rq
flag which is needed to preserve the number of passes and accuracy flags). The file format is determined from the file extension, and can be*.bam
,*.fastq
,*.fq
,*.fastq.gz
, or*.fq.gz
.- reportfile (str or None)
Report file created by
ccs
, or None if you have no reports.
- Attributes:
- samplename (str)
Name set at initialization
- ccsfile (str)
ccs
BAM file set at initialization- reportfile (str or None)
ccs
report file set at initialization- zmw_report (pandas.DataFrame or None):
ZMW stats in reportfile, or None if no reportfile. Columns are status, number, percent, and fraction.
- subread_report (pandas.DataFrame or None)
Like zmw_report but for subreads.
- df (pandas.DataFrame)
The CCSs in ccsfile. Each row is a different CCS On creation, there will be the following columns (you can modify to add more):
“name”: the name of the CCS
“samplename”: the sample as set via samplename
“CCS”: the circular consensus sequence
“CCS_qvals”: the Q-values as a numpy array
“passes”: the number of passes of the CCS
“CCS_accuracy”: the accuracy of the CCS
“CCS_length”: the length of the CCS
Here is an example.
First, define the sequences, quality scores, and names for 3 example sequences. The names indicate the barcodes, the accuracy of the barcode, and the polarity. Two of the sequences have the desired termini and a barcode. The other does not. Note that the second sequence has an extra nucleotide at each end, this will turn out to be fine with the match_str we write. The second sequence is also reverse complemented:
>>> termini5 = 'ACG' >>> termini3 = 'CTT' >>> ccs_seqs = [ ... {'name':'barcoded_TTC_0.999_plus', ... 'seq':termini5 + 'TTC' + 'ACG' + termini3, ... 'qvals':'?' * 12, ... }, ... {'name':'barcoded_AGA_0.995_minus', ... 'seq':dms_tools2.utils.reverseComplement( ... 'T' + termini5 + 'AGA' + 'GCA' + termini3 + 'A'), ... 'qvals':''.join(reversed('?' * 4 + '5?9' + '?' * 7)), ... }, ... {'name':'invalid', ... 'seq':'GGG' + 'CAT' + 'GCA' + termini3, ... 'qvals':'?' * 12, ... } ... ] >>> for iccs in ccs_seqs: ... iccs['accuracy'] = qvalsToAccuracy(iccs['qvals'], encoding='sanger')
Now place these in a block of text that meets the CCS SAM specification:
>>> sam_template = '\t'.join([ ... '{0[name]}', ... '4', '*', '0', '255', '*', '*', '0', '0', ... '{0[seq]}', ... '{0[qvals]}', ... 'np:i:6', ... 'rq:f:{0[accuracy]}', ... ]) >>> samtext = '\n'.join([sam_template.format(iccs) for ... iccs in ccs_seqs])
Create small SAM file with these sequences, then convert to BAM file used to initialize a
CCS
(note this requiressamtools
to be installed):>>> samfile = '_temp.sam' >>> bamfile = '_temp.bam' >>> with open(samfile, 'w') as f: ... _ = f.write(samtext) >>> _ = subprocess.check_call(['samtools', 'view', ... '-b', '-o', bamfile, samfile]) >>> ccs = CCS('test', bamfile, None) >>> os.remove(samfile)
We also sometimes create the BAM files created by PacBio
ccs
to FASTQ. Do that usingsamtools bam2fq -T np,rq
to keep flags with number of passes and overall read quality:>>> fastq_data = subprocess.check_output( ... ['samtools', 'bam2fq', '-T', 'np,rq', bamfile])
Show how the resulting FASTQ data keeps the np and rq tags:
>>> print(fastq_data.decode('utf-8').strip().replace('\t', ' ')) @barcoded_TTC_0.999_plus np:i:6 rq:f:0.999 ACGTTCACGCTT + ???????????? @barcoded_AGA_0.995_minus np:i:6 rq:f:0.998144 TAAGTGCTCTCGTA + ???????9?5???? @invalid np:i:6 rq:f:0.999 GGGCATGCACTT + ????????????
Write the FASTQ to a file, and check that
CCS
initialized from the FASTQ is the same as one from the BAM:>>> fastqfile = '_temp.fastq' >>> gzfastqfile = '_temp.fastq.gz' >>> with open(fastqfile, 'wb') as f: ... _ = f.write(fastq_data) >>> with gzip.open(gzfastqfile, 'wb') as f: ... _ = f.write(fastq_data) >>> ccs_fastq = CCS('test', fastqfile, None) >>> ccs_gzfastq = CCS('test', gzfastqfile, None) >>> pandas.testing.assert_frame_equal(ccs_fastq.df, ccs.df) >>> pandas.testing.assert_frame_equal(ccs_gzfastq.df, ccs.df) >>> os.remove(fastqfile) >>> os.remove(gzfastqfile) >>> os.remove(bamfile)
Check ccs.df has correct names, samplename, CCS sequences, and columns:
>>> set(ccs.df.name) == {s['name'] for s in ccs_seqs} True >>> all(ccs.df.samplename == 'test') True >>> set(ccs.df.CCS) == {s['seq'] for s in ccs_seqs} True >>> set(ccs.df.columns) == {'CCS', 'CCS_qvals', 'name', ... 'passes', 'CCS_accuracy', 'CCS_length', 'samplename'} True
Use
matchSeqs()
to match sequences with expected termini and define barcodes and reads in these:>>> match_str = (termini5 + '(?P<barcode>N{3})' + ... '(?P<read>N+)' + termini3) >>> ccs.df = matchSeqs(ccs.df, match_str, 'CCS', 'barcoded')
This matching adds new columns to the new ccs.df:
>>> set(ccs.df.columns) >= {'barcode', 'barcode_qvals', ... 'barcode_accuracy', 'read', 'read_qvals', ... 'read_accuracy', 'barcoded', 'barcoded_polarity'} True
Now make sure df indicates that the correct sequences are barcoded, and that they have the correct barcodes:
>>> bc_names = sorted([s['name'] for s in ccs_seqs if ... 'barcoded' in s['name']]) >>> ccs.df = ccs.df.sort_values('barcode') >>> (ccs.df.query('barcoded').name == bc_names).all() True >>> barcodes = [x.split('_')[1] for x in bc_names] >>> (ccs.df.query('barcoded').barcode == barcodes).all() True >>> (ccs.df.query('not barcoded').barcode == ['']).all() True >>> barcode_accuracies = [float(x.split('_')[2]) for x in bc_names] >>> numpy.allclose(ccs.df.query('barcoded').barcode_accuracy, ... barcode_accuracies, atol=1e-4) True >>> numpy.allclose(ccs.df.query('barcoded').barcode_accuracy, ... [qvalsToAccuracy(qvals) for qvals in ... ccs.df.query('barcoded').barcode_qvals]) True >>> numpy.allclose(ccs.df.query('not barcoded').barcode_accuracy, ... -1, atol=1e-4) True >>> barcoded_polarity = [{'plus':1, 'minus':-1}[x.split('_')[3]] ... for x in bc_names] >>> (ccs.df.query('barcoded').barcoded_polarity == barcoded_polarity).all() True
-
class
dms_tools2.pacbio.
TerminiVariantTag
¶ Bases:
tuple
Variant tag at termini.
-
property
nucleotides
¶ A dict keyed variant nucleotides, values variant name.
-
property
site
¶ Site of tag in termini (0, 1, … numbering).
-
property
termini
¶ Location of tag: termini5 or termini3.
-
property
-
class
dms_tools2.pacbio.
TerminiVariantTagCaller
(features, *, variants=['variant_1', 'variant_2'], trim_termini)[source]¶ Bases:
object
Call variant tags at termini of CCSs.
- Args:
- features (list)
List of BioPython SeqFeature objects. Any features with a type attribute of variant_tag are taken to specify variant tags. These should consist of a single nucleotide, and have qualifiers that give the nucleotide for each variant. The features list should also have features with type attributes termini5 and termini3 used to determine which termin each tag falls in.
- variants (list)
List of variant names, must have nucleotide for variant specified in qualifier for each variant tag.
- trim_termini (int)
The amount trimmed from the 5’ termini of termini5 and the 3’ termini of termini3 when these are passed to
TerminiVariantTagCaller.call
.
- Attributes:
- variant_tags (list)
List of
TerminiVariantTag
objects.- variants (list)
List of variant names set on initialization.
- trim_termini (int)
Value set as argument on initialization.
- termini5 (Bio.SeqFeature.SeqFeature)
The 5’ termini in features
- termini3 (Bio.SeqFeature.SeqFeature)
The 3’ termini in features
Here is an example. First, create list of features that has a single-nucleotide variant tag for each of two possible variants (‘variant_1’ and ‘variant_2’) in each termini:
>>> SeqFeature = Bio.SeqFeature.SeqFeature >>> FeatureLocation = Bio.SeqFeature.FeatureLocation >>> features = [ ... SeqFeature(type='termini5', location=FeatureLocation(0, 147)), ... SeqFeature(type='termini3', location=FeatureLocation(1303, 1342)), ... SeqFeature(type='variant_tag', location=FeatureLocation(32, 33), ... qualifiers={'variant_1':['A'], 'variant_2':['G']}), ... SeqFeature(type='variant_tag', location=FeatureLocation(1310, 1311), ... qualifiers={'variant_1':['T'], 'variant_2':['C']}) ... ]
Now initialize the
TerminiVariantTagCaller
:>>> caller = TerminiVariantTagCaller(features, trim_termini=4) >>> caller.variants ['variant_1', 'variant_2'] >>> int(caller.termini5.location.start) 0 >>> int(caller.termini5.location.end) 147 >>> int(caller.termini3.location.start) 1303 >>> int(caller.termini3.location.end) 1342 >>> len(caller.variant_tags) 2 >>> caller.variant_tags[0].termini 'termini5' >>> caller.variant_tags[0].site 32 >>> caller.variant_tags[0].nucleotides == {'A':'variant_1', 'G':'variant_2'} True >>> caller.variant_tags[1].termini 'termini3' >>> caller.variant_tags[1].site 7 >>> caller.variant_tags[1].nucleotides == {'T':'variant_1', 'C':'variant_2'} True
Do some example variant calling:
>>> caller.call({'termini5':'GGCGTCACACTTTGCTATGCCATAGCATATTTATCC', ... 'termini3':'AGATCGGTAGAGCGTCGTGTAGGGAAAGAGTGTGG'}) 'variant_1' >>> caller.call({'termini5':'GGCGTCACACTTTGCTATGCCATAGCATGTTTATCC', ... 'termini3':'AGATCGGCAGAGCGTCGTGTAGGGAAAGAGTGTGG'}) 'variant_2' >>> caller.call({'termini5':'GGCGTCACACTTTGCTATGCCATAGCATGTTTATCC', ... 'termini3':'AGATCGGTAGAGCGTCGTGTAGGGAAAGAGTGTGG'}) 'mixed' >>> caller.call({'termini5':'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC', ... 'termini3':'AGATCGGTAGAGCGTCGTGTAGGGAAAGAGTGTGG'}) 'invalid' >>> caller.call({'termini5':'GGC', 'termini3':'AGAT'}) 'unknown' >>> caller.call({}) 'unknown'
-
call
(termini_seqs)[source]¶ Call variant identity.
- Args:
- termini_seqs (dict, namedtuple, pandas row)
Some object that has attributes that can be accessed as termini5 and termini3. These termini are assumed to have the amount specified by
TerminiVariantTagCaller.trim_termini
trimmed from the 5’ termini of termini5 and the 3’ termini of termini3.
- Returns:
A str that can be any of the following:
If all tag sites in termini match the same variant, return the name of that variant.
If different tag sites match different variants, return “mixed”.
If any tag sites have a nucleotide that matches no known variants, return “invalid”.
If termini_seqs lacks a termini or has a termini that is too short to contain the tag sites, return “unknown”.
-
checkTagNucleotides
(amplicon)[source]¶ Check amplicon carrying tags has right ambiguous nucleotides.
- Arguments:
- amplicon (BioPython SeqRecord)
The full amplicon that contains the termini and variant tags
This method checks that the amplicon correctly has IUPAC ambiguous nucleotides that cover the possible diversity at the site of each variant tag. If so, it does nothing and returns None. If not, it raises a ValueError.
-
dms_tools2.pacbio.
alignSeqs
(df, mapper, query_col, aligned_col, *, add_alignment=True, add_target=True, add_n_trimmed=True, add_n_additional=True, add_n_additional_difftarget=True, targetvariants=None, mutationcaller=None, overwrite=True, paf_file=None)[source]¶ Align sequences in a dataframe to target sequence(s).
- Arguments:
- df (pandas DataFrame)
Data frame in which one column holds sequences to match. There also must be a column named “name” with unique names.
- mapper (
dms_tools2.minimap2.Mapper
) Align using the
dms_tools2.minimap2.Mapper.map
function of mapper. Target sequence(s) to which we align are specified when initializing mapper.- query_col (str)
Name of column in df with query sequences to align. If we are to use Q-values, there must also be a column with this name suffixed by “_qvals”.
- aligned_col (str)
Name of column added to df. Elements of column are True if query_col aligns, and False otherwise.
- add_alignment (bool)
Add column with the
dms_tools2.minimap2.Alignment
.- add_target (bool)
Add column giving target (reference) to which sequence aligns.
- add_n_trimmed (bool)
Add columns giving number of nucleotides trimmed from ends of both the query and target in the alignment.
- add_n_additional (bool)
Add column specifying the number of additional alignments.
- targetvariants (
dms_tools2.minimap2.TargetVariants
) Call target variants of aligned genes using the call function of this object. Note that this also adjusts the returned alignments / CIGAR if a variant is called. If the variantsites_min_acc attribute is not None, then df must have a column with the name of query_col suffixed by ‘_qvals’ that gives the Q-values to compute accuracies.
- mutationcaller (
dms_tools2.minimap2.MutationCaller
) Call mutations of aligned genes using the call function of this object. Note that any target variant mutations are handled first and then removed and not called here.
- add_n_additional_difftarget (bool)
Add columns specifying number of additional alignments to a target other than the one in the primary alignment.
- overwrite (bool)
If True, we overwrite any existing columns to be created that already exist. If False, raise an error if any of the columns already exist.
- paf_file (None or str)
If a str, is the name of the PAF file created by mapper (see outfile argument of
dms_tools2.minimap2.Mapper.map
) Otherwise this file is not saved.
- Returns:
A copy of df with new columns added. The exact columns to add are specified by the calling arguments. Specifically:
We always add a column with the name given by aligned_col that is True if there was an alignment and False otherwise.
If add_alignment is True, add column named aligned_col suffixed by “_alignment” that gives the alignment as a
dms_tools2.minimap2.Alignment
object, or None if there is no alignment. Note that if there are multiple alignments, then this is the “best” alignment, and the remaining alignments are in thedms_tools2.minimap2.Alignment.additional
attribute.If add_target is True, add column named aligned_col suffixed by “_target” that gives the target to which the sequence aligns in the “best” alignment, or an empty string if no alignment.
If add_n_trimmed is True, add column named aligned_col suffixed by “_n_trimmed_query_start”, “_n_trimmed_query_end”, “_n_trimmed_target_start”, and “_n_trimmed_target_end” that give the number of nucleotides trimmed from the query and target in the “best” alignment. Are all zero if the zero if the alignment is end-to-end. Are -1 if no alignment.
If add_n_additional is True, add column named aligned_col suffixed by “_n_additional” that gives the number of additional alignments (in
dms_tools2.minimap2.Alignment.additional
), or -1 if there is no alignment.If add_n_additional_difftarget is True, add column named aligned_col suffixed by “_n_additional_difftarget” that gives the number of additional alignments to different targets that are not isoforms, or -1 if if there is no alignment. See the target_isoforms attribute of
dms_tools2.minimap2.Mapper
.If targetvariants is not None, add a column named aligned_col suffixed by “_target_variant” that has the values returned for that alignment by
dms_tools2.minimap2.TargetVariants.call
, or an empty string if no alignment.If mutationcaller is not None, column named aligned_col suffixed by “_mutations” giving the
dms_tools2.minimap2.Mutations
object returned bydms_tools2.minimap2.MutationCaller.call
, or None if there is no alignment.
-
dms_tools2.pacbio.
matchAndAlignCCS
(ccslist, mapper, *, termini5, gene, spacer, umi, barcode, termini3, termini5_fuzziness=0, gene_fuzziness=0, spacer_fuzziness=0, umi_fuzziness=0, barcode_fuzziness=0, termini3_fuzziness=0, targetvariants=None, mutationcaller=None, terminiVariantTagCaller=None, tagged_termini_remove_indels=True, rc_barcode_umi=True)[source]¶ Identify CCSs that match pattern and align them.
This is a convenience function that runs
matchSeqs()
andalignSeqs()
for a common use case. It takes one or moreCCS
objects, looks for CCS sequences in them that match a specific pattern, and aligns them to targets. It returns a pandas data frame with all the results. The CCS sequences are assumed to be molecules that have the following structure, although potentially in either orientation:5'-...-termini5-gene-spacer-umi-barcode-termini3-...-3'
As indicated by the
...
, there can be sequence before and after our expected pattern that we ignore. The gene element is the aligned to the targets. The full CCS is also aligned in the absence of the pattern matching.- Args:
- ccslist (
CCS
object or list of them) Analyze the CCS’s in the df attributes. If there are multiple
CCS
objectes, they are concatenated. However, they must have the same columns.- mapper (
dms_tools2.minimap2.Mapper
) Mapper used to perform alignments.
- termini5 (str or None)
Expected sequence at 5’ end as str that can be compiled to regex object. Passed through
re_expandIUPAC()
. For instance, make it ‘ATG|CTG’ if the sequence might start with either ATG or CTG. Set to None if no expected 5’ termini.- gene (str)
Like termini5 but gives the gene to match. For instance, ‘N+’ if the gene can be arbitrary sequence and length.
- spacer (str or None)
Like termini5, but for the spacer after gene.
- umi (str or None)
Like termini5, but for UMI.
- barcode (str or None)
Like termini5, but for barcode. For instance, ‘N{10}’ if 10-nucleotide barcode.
- termini3 (str or None)
Like termini5, but for termini3.
- termini5_fuzziness, …, termini3_fuzziness (int)
The matching for the sequence patterns uses regex, which enables fuzzy matching. Set termini5_fuzziness to enable a specific number of differences (can be insertion, deletion, or mismatch) when matching termini5. Likewise for gene_fuzziness, etc. Note that the fuzzy matching uses the BESTMATCH flag to try to find the best fuzzy match. Note also that you can not both use fuzzy matching charcters in the strings to match (e.g., termini5 and set fuzziness to a value > 0: choose one or the other way to specify fuzzy matches.
- targetvariants (
dms_tools2.minimap2.TargetVariants
) Call target variants. See docs for same argument to
alignSeqs()
.- mutationcaller (
dms_tools2.minimap2.MutationCaller
) Call mutations. See docs for same argument to
alignSeqs()
.- terminiVariantTagCaller (
TerminiVariantTagCaller
) Call variants in termini.
- tagged_termini_remove_indels (bool)
If terminiVariantTagCaller is being used and this, is True, then use remove_indels flag when calling matchSeqs for the termini. This is useful if using fuzzy matching from the termini, as it aids in the calling of tags as it doesn’t cause indels to misplace the tag. Has no meaning if terminiVariantTagCaller is not being used.
- rc_barcode_umi (bool)
Do we reverse complement the barcode and UMI in the returned data frame relative to the orientation of the gene. Typically this is desirable because actual barcode sequencing goes in the reverse direction of the gene.
- ccslist (
- Returns:
A pandas dataframe that will have all columns already in the df attribute of the input
CCS
objects with the following columns added:barcoded: True if CCS matches full expected pattern, False otherwise.
barcoded_polarity: 1 of the match is in the polarity of the CCS, -1 if to the reverse complement, 0 if no match.
Columns named termini5, gene, spacer, UMI, barcode, and termini3 (except if any of these elements are None). If barcoded is True for that CCS, these columns give the sequence for that element. If it is False, they are empty strings. There are likewise columns with these same names suffixed with “_accuracy” that give the CCS accuracy for that element, and columns suffixed with “_qvals” that give the quality scores for the elements.
For each of termini5, spacer, and termini3 that are not None, a column named has_termini5, etc that indicates if that element is matched in isolate even if the full pattern is not matched.
gene_aligned is True if the CCS matches the expected pattern (is barcoded), and gene can further be aligned using mapper. It is False otherwise.
gene_aligned_alignment, gene_aligned_target, gene_aligned_n_trimmed_query_start, gene_aligned_n_trimmed_query_end, gene_aligned_n_trimmed_target_start, gene_aligned_n_trimmed_target_end, gene_aligned_n_additional, and gene_aligned_n_additional_difftarget give the
dms_tools2.minimap2.Alignment
, the alignment target, number of nucleotides trimmed from ends of the query gene or target, the number of additional alignments if gene_aligned, and the number of additional alignments to different targets (see target_isoforms attribute ofdms_tools2.minimap2.Mapper
). If the gene is not aligned, these are None, empty strings, or -1.If targetvariants is not None, column named gene_aligned_target_variant giving target variant returned by
dms_tools2.minimap2.TargtVariants.call
.If mutationcaller is not None, column named gene_aligned_mutations giving the
dms_tools2.minimap2.Mutations
object returned bydms_tools2.minimap2.MutationCaller.call
, or None if there is no alignment.If terminiVariantTagCaller is not None, column named termini_variant giving the termini variant returned by
TerminiVariantTagCaller.call
, or the str “unknown” if both termini are not matched.CCS_aligned is True if the CCS can be aligned using mapper even if a gene cannot be matched, and False otherwise. CCS_aligned_alignment and CCS_aligned_target give the
dms_tools2.minimap2.Alignment
(or None) and the target (or empty string).
-
dms_tools2.pacbio.
matchSeqs
(df, match_str, col_to_match, match_col, *, add_polarity=True, add_group_cols=True, remove_indels=[], add_accuracy=True, add_qvals=True, expandIUPAC=True, overwrite=False)[source]¶ Identify sequences in a dataframe that match a specific pattern.
- Args:
- df (pandas DataFrame)
Data frame with column holding sequences to match.
- match_str (str)
A string that can be passed to regex.compile that gives the pattern that we are looking for, with target subsequences as named groups. See also the expandIUPAC parameter, which simplifies writing match_str. If None we just return df. Note that we use regex rather than re, so fuzzy matching is enabled. Note that the matching uses the BESTMATCH flag to find the best match.
- col_to_match (str)
Name of column in df that contains the sequences to match.
- match_col (str)
Name of column added to df. Elements of columns are True if col_to_match matches match_str for that row, and False otherwise.
- add_polarity (bool)
Add a column specifying the polarity of the match?
- add_group_cols (bool)
Add columns with the sequence of every group in match_str?
- remove_indels (list)
Only meaningful if match_str specifies to allow fuzzy matching for a group, and add_group_cols is True. Then for each named group in match_str, in sequence for that group that is added to the returned df, indicate indels by adding a - gap character for deletions, and removing the inserted nucleotide called by regex if there is an insertion.
- add_accuracy (bool)
For each group in the match, add a column giving the accuracy of that group’s sequence? Only used if add_group_cols is True.
- add_qvals (bool)
For each group in the match, add a column giving the Q values for that group’s sequence? Only used if add_group_cols is True.
- expandIUPAC (bool)
Use IUPAC code to expand ambiguous nucleotides (e.g., “N”) by passing match_str through the
re_expandIUPAC()
function.- overwrite (bool)
If True, we overwrite any existing columns to be created that already exist. If False, raise an error if any of the columns already exist.
- Returns:
A copy of df with new columns added. The exact columns to add are specified by the calling arguments. Specifically:
We always add a column with the name given by match_col that is True if there was a match and False otherwise.
If add_polarity is True, add a column that is match_col suffixed by “_polarity” which is 1 if the match is directly to the sequence in col_to_match, and -1 if it is to the reverse complement of this sequence. The value is 0 if there is no match.
If add_group_cols is True, then for each group in match_str specified using the re group naming syntax, add a column with that group name that gives the sequence matching that group. These sequences are empty strings if there is no match. These added sequences are in the polarity of the match, so if the sequence in match_col has to be reverse complemented for a match, then these sequences will be the reverse complement that matches. Additionally, when add_group_cols is True:
If add_accuracy is True, we also add a column suffixed by “_accuracy” that gives the accuracy of that group as computed from the Q-values. The value -1 if there is match for that row. Adding accuracy requires a colum in df with the name given by match_col suffixed by “_qvals.”
If add_qvals is True, we also add a column suffixed by “_qvals” that gives the Q-values for that sequence. Adding these Q-values requires that there by a column in df with the name given by match_col suffixed by “_qvals”. The Q-values are in the form of a numpy array, or an empty numpy array if there is no match for that row.
See docs for
CCS
for example uses of this function.Here is a short example that uses the fuzzy matching of the regex model for the polyA tail:
>>> gene = 'ATGGCT' >>> polyA = 'AAAACAAAA' >>> df_in = pandas.DataFrame({'CCS':[gene + polyA]}) >>> match_str = '(?P<gene>N+)(?P<polyA>AA(A{5,}){e<=1}AA)' >>> df = matchSeqs(df_in, match_str, 'CCS', 'matched', ... add_accuracy=False, add_qvals=False) >>> expected = df.assign(gene=gene, polyA=polyA, ... matched=True, matched_polarity=1) >>> (df.sort_index(axis=1) == expected.sort_index(axis=1)).all().all() True
Here is a short example with fuzzy matching that uses the remove_indels option. First, do not remove the indels:
>>> termini5 = 'ACAT' >>> termini3 = 'ATAC' >>> match_str2 = '^(?P<termini5>AAT){e<=1}(?P<gene>ATGGCT){e<=1}(?P<termini3>ATGAC){e<=1}$' >>> df_in2 = pandas.DataFrame({'CCS':[termini5 + gene + termini3]}) >>> df2 = matchSeqs(df_in2, match_str2, 'CCS', 'matched', ... add_accuracy=False, add_qvals=False) >>> df2.gene.values[0] == gene True >>> df2.termini5.values[0] == termini5 True >>> df2.termini3.values[0] == termini3 True
Now remove the indels in just termini3:
>>> df2_rm = matchSeqs(df_in2, match_str2, 'CCS', 'matched', ... remove_indels=['termini3'], ... add_accuracy=False, add_qvals=False) >>> df2_rm.gene.values[0] == gene True >>> df2_rm.termini5.values[0] == termini5 True >>> df2_rm.termini3.values[0] == termini3 False
Now remove indels in both termini:
>>> df2_rm2 = matchSeqs(df_in2, match_str2, 'CCS', 'matched', ... remove_indels=['termini5', 'termini3'], ... add_accuracy=False, add_qvals=False) >>> df2_rm2.gene.values[0] == gene True >>> df2_rm2.termini5.values[0] == termini5 False >>> df2_rm2.termini5.values[0] 'AAT' >>> df2_rm2.termini3.values[0] == termini3 False
-
dms_tools2.pacbio.
qvalsToAccuracy
(qvals, encoding='numbers', no_avg=False)[source]¶ Converts set of quality scores into average accuracy.
- Args:
- qvals (numpy array or number or str)
List of Q-values, assumed to be Phred scores. For how they are encoded, see encoding.
- encoding (str)
If it is “numbers” then qvals should be a numpy array giving the Q-values, or a number with one Q-value. If it is “sanger”, then qvals is a string, with the score being the ASCII value minus 33.
- no_avg (bool)
Compute the accuracies of individual Q-values rather than the average of the array or list.
- Returns:
A number giving the average accuracy, or nan if qvals is empty.
Note that the probability \(p\) of an error at a given site is related to the Q-value \(Q\) by \(Q = -10 \log_{10} p\).
>>> qvals = numpy.array([13, 77, 93]) >>> round(qvalsToAccuracy(qvals), 3) == 0.983 True >>> round(qvalsToAccuracy(qvals[1 : ]), 3) == 1 True >>> qvalsToAccuracy(numpy.array([])) nan
>>> qvals_str = '.n~' >>> round(qvalsToAccuracy(qvals_str, encoding='sanger'), 3) == 0.983 True
>>> round(qvalsToAccuracy(15), 3) == 0.968 True
>>> [round(a, 5) for a in qvalsToAccuracy(qvals, no_avg=True)] == [0.94988, 1, 1] True
-
dms_tools2.pacbio.
re_expandIUPAC
(re_str)[source]¶ Expand IUPAC ambiguous nucleotide codes in re search string.
Simplifies writing re search strings that include ambiguous nucleotide codes.
- Args:
- re_str (str)
String appropriate to be passed to regex.compile.
- Returns:
A version of re_str where any characters not in the group names that correspond to upper-case ambiguous nucleotide codes are expanded according to their definitions in the IUPAC code.
>>> re_str = '^(?P<termini5>ATG)(?P<cDNA>N+)A+(?P<barcode>N{4})$' >>> re_expandIUPAC(re_str) '^(?P<termini5>ATG)(?P<cDNA>[ACGT]+)A+(?P<barcode>[ACGT]{4})$'
-
dms_tools2.pacbio.
summarizeCCSreports
(ccslist, report_type, plotfile, plotminfrac=0.005)[source]¶ Summarize and plot CCS reports.
- Args:
- ccslist (CCS object or list of them)
CCS objects to summarize
- report_type (str “zmw” or “subread”)
Which type of report to summarize
- plotfile (str or None)
Name of created bar plot, or None if you want to return the created plot.
- plotminfrac (float)
Only plot status categories with >= this fraction in at least one CCS
Returns:
If plotfile is a str, returns a pandas DataFrame aggregating the reports and creates plotfile.
If plotfile is None, returns the 2-tuple containing the data frame and the plot.