ccs¶
Tools for inspecting output of the PacBio ccs
program:
https://github.com/PacificBiosciences/unanimity/blob/develop/doc/PBCCS.md
- class alignparse.ccs.Summaries(df, *, name_col='name', fastq_col='fastq', report_col='report', ncpus=-1)[source]¶
Bases:
object
Summaries of
ccs
runs.- Parameters:
df (pandas.DataFrame) – Data frame giving information on runs of
ccs
being summarized.name_col (str) – Column in df with name of
ccs
run.fastq_col (str) – Column in df with FASTQ file for run, appropriate for passing to
Summary
as fastqfile.report_col (str or None) – Column in df with report for run, appropriate for passing to
Summary
as reportfile. Set to None if no reports. If there are no reports, then ZMW stats are not available.ncpus (int) – Number of CPUs to use; -1 means all available. Useful as processing all the FASTQ files to compute read accuracies can take a while.
- ccs_stats(variable)[source]¶
Get CCS stats for all runs.
- Parameters:
variable ({'length', 'passes', 'accuracy'}) – Variable for which we get stats. You will get an error if
Summaries.has_stat()
is not true for variable.- Returns:
Data frame with columns of ‘name’ (holding run name) and the value of variable giving variable value for all CCSs.
- Return type:
pandas.DataFrame
- has_stat(variable)[source]¶
Do the summaries contain a statistic?
- Parameters:
variable ({'length', 'passes', 'accuracy'}) – Variable for which we want statistics.
- Returns:
True if all runs have statistics for variable, False otherwise.
- Return type:
bool
- has_zmw_stats()[source]¶
Are ZMW stats available?
- Returns:
True if all runs have ZMW stats, False otherwise.
- Return type:
bool
- plot_ccs_stats(variable, *, trim_frac=0.005, bins=25, histogram_stat='count', maxcol=None, panelsize=1.75)[source]¶
Plot histograms of CCS stats for all runs.
- Parameters:
variable ({'length', 'passes', 'accuracy'}) – Variable for which we plot stats. You will get an error if
Summaries.has_stat()
is not true for variable.trim_frac (float) – Trim this amount of the bottom and top fraction from the data before plotting. Useful if outliers greatly extend scale.
bins (int) – Number of histogram binds
histogram_stat ({'count', 'density'}) – Plot the count of CCSs or their density normalized for each run.
maxcol (None or int) – Max number of columns in faceted plot.
panelsize (float) – Size of each plot panel.
- Returns:
A panel of histograms.
- Return type:
plotnine.ggplot.ggplot
- plot_zmw_stats(**kwargs)[source]¶
Plot of ZMW stats for all runs.
Note
Raises an error if
Summaries.has_zmw_stats()
is not True.- Parameters:
**kwargs (dict) – Keyword arguments passed to
Summaries.zmw_stats()
.- Returns:
Stacked bar graph of ZMW stats for each run.
- Return type:
plotnine.ggplot.ggplot
- zmw_stats(*, minfailfrac=0.01, groupsuccess=True)[source]¶
Get ZMW stats for all runs.
Note
Raises an error if
Summaries.has_zmw_stats()
is not True.- Parameters:
minfailfrac (float) – Group failure categories with <= this fraction ZMWs for all runs.
groupsuccess (bool) – Group all success categories into ‘Success – CCS generated’.
- Returns:
Data frame with stats on ZMW status for all runs.
- Return type:
pandas.DataFrame
- class alignparse.ccs.Summary(name, fastqfile, reportfile)[source]¶
Bases:
object
Summary of a single
ccs
run.- Parameters:
name (str) – Name of the
ccs
run.fastqfile (str) – FASTQ file with circular consensus sequences, typically generated by PacBio
ccs
program. Can be gzipped. Number of passes is determined from thenp
tag in the FASTQ comments if present.reportfile (str or None) – Report file generated by
ccs
using--reportFile
flag, or None if no such report available.
- name¶
Name of
ccs
run.- Type:
str
- fastqfile¶
FASTQ file with the circular consensus sequences.
- Type:
str
- reportfile¶
ccs
report file. Currently handles reports in formats generated byccs
versions 3.* and 4.0.- Type:
str or None
- zmw_stats¶
Stats on ZMW extracted from reportfile.
- Type:
pandas.DataFrame or None
- passes¶
Lists number of passes for all circular consensus sequences, or None if this information is not available in a
np
tag in fastqfile.- Type:
numpy.array or None
- accuracy¶
Lists accuracies for all circular consensus sequences. This is the average accuracy across sites computed from Q-values in fastqfile.
- Type:
numpy.array
- length¶
Lists lengths for all circular consensus sequences.
- Type:
numpy.array
- alignparse.ccs.get_ccs_stats(fastqfile, *, pass_tag='np')[source]¶
Get basic statistics about circular consensus sequences.
- Parameters:
fastqfile (str) – FASTQ file with circular consensus sequences, generated from BAM output of
ccs
.pass_tag (str) – Tag in FASTQ file header giving number of passes (if present).
- Returns:
The 3-tuple (passes, accuracy, length) contains numpy arrays giving number of passes, accuracies, and lengths for sequences in fastqfile. The accuracy (average across sequence) and length are computed from the quality scores and sequence in fastqfile; the number of passes must be provided as pass_tag and are returned as None if that tag is missing from any read in fastqfile.
- Return type:
collections.namedtuple
Example
Test with
np
tag in format fromccs
version 5.0.>>> fastqfile = tempfile.NamedTemporaryFile(mode='w') >>> _ = fastqfile.write(textwrap.dedent(''' ... @m54228_190118_102822/4194373/ccs np=18 ... GGTACCACACTCTTTCCCTACACGACGCTCTGCCGATCTCGGCCATTACGTGTTTTATCTA ... + ... ~~~~{~~~~~~~c~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~i~~~~~~~( ... @m54228_190118_102822/4194374/ccs np=51 ... GCACGGCGTCACACTTTGCTATGCCATAGCATGTTTATCCATAAGATTAGCGGATCCTACCT ... + ... ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ... ''').lstrip()) >>> fastqfile.flush() >>> get_ccs_stats(fastqfile.name) CCS_Stats(passes=array([18, 51]), accuracy=array([0.99672907, 1. ]), length=array([61, 62])) >>> fastqfile.close()
Test with
np
tag in format fromccs
version 4.0.>>> fastqfile = tempfile.NamedTemporaryFile(mode='w') >>> _ = fastqfile.write(textwrap.dedent(''' ... @m54228_190118_102822/4194373/ccs np:i:18 ... GGTACCACACTCTTTCCCTACACGACGCTCTGCCGATCTCGGCCATTACGTGTTTTATCTA ... + ... ~~~~{~~~~~~~c~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~i~~~~~~~( ... @m54228_190118_102822/4194374/ccs np:i:51 ... GCACGGCGTCACACTTTGCTATGCCATAGCATGTTTATCCATAAGATTAGCGGATCCTACCT ... + ... ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ... ''').lstrip()) >>> fastqfile.flush() >>> get_ccs_stats(fastqfile.name) CCS_Stats(passes=array([18, 51]), accuracy=array([0.99672907, 1. ]), length=array([61, 62])) >>> fastqfile.close()
Similar test without
np
tag in FASTQ file:>>> fastqfile = tempfile.NamedTemporaryFile(mode='w') >>> _ = fastqfile.write(textwrap.dedent(''' ... @m54228_190118_102822/4194373/ccs ... GGTACCACACTCTTTCCCTACACGACGCTCTGCCGATCTCGGCCATTACGTGTTTTATCTA ... + ... ~~~~{~~~~~~~c~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~i~~~~~~~( ... @m54228_190118_102822/4194374/ccs ... GCACGGCGTCACACTTTGCTATGCCATAGCATGTTTATCCATAAGATTAGCGGATCCTACCT ... + ... ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ... ''').lstrip()) >>> fastqfile.flush() >>> get_ccs_stats(fastqfile.name) CCS_Stats(passes=None, accuracy=array([0.99672907, 1. ]), length=array([61, 62])) >>> fastqfile.close()
- alignparse.ccs.report_to_stats(reportfile)[source]¶
Parse ZMW statistics from report file.
- Parameters:
reportfile (str) – Report file generated by
ccs
using--reportFile
flag usingccs
version 3.* and 4.0, or the--report-file
flag usingccs
version 5.0.- Returns:
A data frame with the statistics.
- Return type:
pandas.DataFrame
Example
An example of the
ccs
version 6.0.0 output:>>> pd.set_option('display.max_columns', 10) # show many columns >>> reportfile = tempfile.NamedTemporaryFile(mode='w') >>> _ = reportfile.write(textwrap.dedent(''' ... ZMWs input : 3739007 ... ... ZMWs pass filters : 1669857 (44.66%) ... ZMWs fail filters : 2069150 (55.34%) ... ZMWs shortcut filters : 0 (0.000%) ... ... ZMWs with tandem repeats : 1096 (0.053%) ... ... Exclusive failed counts ... Below SNR threshold : 53110 (2.567%) ... Median length filter : 0 (0.000%) ... Lacking full passes : 1382995 (66.84%) ... Heteroduplex insertions : 30137 (1.456%) ... Coverage drops : 2193 (0.106%) ... Insufficient draft cov : 9055 (0.438%) ... Draft too different : 71 (0.003%) ... Draft generation error : 22575 (1.091%) ... Draft above --max-length : 235 (0.011%) ... Draft below --min-length : 227 (0.011%) ... Reads failed polishing : 0 (0.000%) ... Empty coverage windows : 264 (0.013%) ... CCS did not converge : 525 (0.025%) ... CCS below minimum RQ : 567763 (27.44%) ... Unknown error : 0 (0.000%) ... ... Additional passing metrics ... ZMWs missing adapters : 403623 (24.17%) ... ''').lstrip()) >>> reportfile.flush() >>> report_to_stats(reportfile.name) status number fraction 0 Success -- CCS generated 1669857 0.446604 1 Failed -- Below SNR threshold 53110 0.014204 2 Failed -- Median length filter 0 0.000000 3 Failed -- Lacking full passes 1382995 0.369883 4 Failed -- Heteroduplex insertions 30137 0.008060 5 Failed -- Coverage drops 2193 0.000587 6 Failed -- Insufficient draft cov 9055 0.002422 7 Failed -- Draft too different 71 0.000019 8 Failed -- Draft generation error 22575 0.006038 9 Failed -- Draft above --max-length 235 0.000063 10 Failed -- Draft below --min-length 227 0.000061 11 Failed -- Reads failed polishing 0 0.000000 12 Failed -- Empty coverage windows 264 0.000071 13 Failed -- CCS did not converge 525 0.000140 14 Failed -- CCS below minimum RQ 567763 0.151849 15 Failed -- Unknown error 0 0.000000 16 Failed -- shortcut filter 0 0.000000 >>> reportfile.close()
An example of the
ccs
version 5.0.0 output:>>> pd.set_option('display.max_columns', 10) # show many columns >>> reportfile = tempfile.NamedTemporaryFile(mode='w') >>> _ = reportfile.write(textwrap.dedent(''' ... ZMWs input : 535101 ... ... ZMWs pass filters : 325574 (60.84%) ... ZMWs fail filters : 209527 (39.16%) ... ZMWs shortcut filters : 0 (0.00%) ... ... ZMWs with tandem repeats : 98 (0.02%) ... ... Exclusive counts for ZMWs failing filters: ... Below SNR threshold : 0 (0.00%) ... Median length filter : 0 (0.00%) ... Lacking full passes : 94869 (45.28%) ... Heteroduplex insertions : 4373 (2.09%) ... Coverage drops : 526 (0.25%) ... Insufficient draft cov : 2783 (1.33%) ... Draft too different : 1436 (0.69%) ... Draft generation error : 5426 (2.59%) ... Draft above --max-length : 469 (0.22%) ... Draft below --min-length : 152 (0.07%) ... Reads failed polishing : 0 (0.00%) ... Empty coverage windows : 136 (0.06%) ... CCS did not converge : 44 (0.02%) ... CCS below minimum RQ : 99782 (47.62%) ... Unknown error : 0 (0.00%) ... ''').lstrip()) >>> reportfile.flush() >>> report_to_stats(reportfile.name) status number fraction 0 Success -- CCS generated 325574 0.607902 1 Failed -- Below SNR threshold 0 0.000000 2 Failed -- Median length filter 0 0.000000 3 Failed -- Lacking full passes 94869 0.177137 4 Failed -- Heteroduplex insertions 4373 0.008165 5 Failed -- Coverage drops 526 0.000982 6 Failed -- Insufficient draft cov 2783 0.005196 7 Failed -- Draft too different 1436 0.002681 8 Failed -- Draft generation error 5426 0.010131 9 Failed -- Draft above --max-length 469 0.000876 10 Failed -- Draft below --min-length 152 0.000284 11 Failed -- Reads failed polishing 0 0.000000 12 Failed -- Empty coverage windows 136 0.000254 13 Failed -- CCS did not converge 44 0.000082 14 Failed -- CCS below minimum RQ 99782 0.186310 15 Failed -- Unknown error 0 0.000000 16 Failed -- shortcut filter 0 0.000000 >>> reportfile.close()
An example of the
ccs
version 4.0.0 output:>>> pd.set_option('display.max_columns', 10) # show many columns >>> reportfile = tempfile.NamedTemporaryFile(mode='w') >>> _ = reportfile.write(textwrap.dedent(''' ... ZMWs input (A) : 686919 ... ZMWs generating CCS (B) : 182500 (26.57%) ... ZMWs filtered (C) : 504419 (73.43%) ... ... Exclusive ZMW counts for (C): ... No usable subreads : 0 (0.00%) ... Below SNR threshold : 0 (0.00%) ... Lacking full passes : 344375 (68.27%) ... Heteroduplexes : 1056 (0.21%) ... Min coverage violation : 2035 (0.40%) ... Draft generation error : 7636 (1.51%) ... Draft above --max-length : 49 (0.01%) ... Draft below --min-length : 2 (0.00%) ... Lacking usable subreads : 0 (0.00%) ... CCS did not converge : 0 (0.00%) ... CCS below minimum RQ : 149315 (29.60%) ... Unknown error : 0 (0.00%) ... ''').lstrip()) >>> reportfile.flush() >>> report_to_stats(reportfile.name) status number fraction 0 Success -- CCS generated 182500 0.265660 1 Failed -- No usable subreads 0 0.000000 2 Failed -- Below SNR threshold 0 0.000000 3 Failed -- Lacking full passes 344375 0.501297 4 Failed -- Heteroduplexes 1056 0.001537 5 Failed -- Min coverage violation 2035 0.002962 6 Failed -- Draft generation error 7636 0.011116 7 Failed -- Draft above --max-length 49 0.000071 8 Failed -- Draft below --min-length 2 0.000003 9 Failed -- Lacking usable subreads 0 0.000000 10 Failed -- CCS did not converge 0 0.000000 11 Failed -- CCS below minimum RQ 149315 0.217354 12 Failed -- Unknown error 0 0.000000 >>> reportfile.close()
An example of the
ccs
version 3.1.0 output:>>> reportfile = tempfile.NamedTemporaryFile(mode='w') >>> _ = reportfile.write(textwrap.dedent(''' ... ZMW Yield ... Success -- CCS generated,242220,45.57% ... Failed -- Below SNR threshold,0,0.00% ... Failed -- No usable subreads,4877,0.92% ... Failed -- Insert size too long,35,0.00% ... Failed -- Insert size too small,0,0.00% ... Failed -- Not enough full passes,180620,33.98% ... Failed -- Too many unusable subreads,1,0.00% ... Failed -- CCS did not converge,23,0.00% ... Failed -- CCS below minimum predicted accuracy,103801,19.53% ... Failed -- Unknown error during processing,0,0.00% ... ... ... Subread Yield ... Success - Used for CCS,10972010,89.06% ... Failed -- Below SNR threshold,0,0.00% ... Failed -- Alpha/Beta mismatch,171,0.00% ... Failed -- Below minimum quality,0,0.00% ... Failed -- Filtered by size,144209,1.17% ... Failed -- Identity too low,2745750,22.29% ... Failed -- Z-Score too low,0,0.00% ... Failed -- From ZMW with too few passes,274296,2.23% ... Failed -- Other,928871,7.54% ... ''').lstrip()) >>> reportfile.flush() >>> report_to_stats(reportfile.name) status number fraction 0 Success -- CCS generated 242220 0.4557 1 Failed -- Below SNR threshold 0 0.0000 2 Failed -- No usable subreads 4877 0.0092 3 Failed -- Insert size too long 35 0.0000 4 Failed -- Insert size too small 0 0.0000 5 Failed -- Not enough full passes 180620 0.3398 6 Failed -- Too many unusable subreads 1 0.0000 7 Failed -- CCS did not converge 23 0.0000 8 Failed -- CCS below minimum predicted accuracy 103801 0.1953 9 Failed -- Unknown error during processing 0 0.0000 >>> reportfile.close()
An example of the
ccs
version 3.4.1 output:>>> reportfile = tempfile.NamedTemporaryFile(mode='w') >>> _ = reportfile.write(textwrap.dedent(''' ... ZMW Yield ... Success (without retry) -- CCS generated,202033,29.44% ... Success (with retry) -- CCS generated,2,0.00% ... Failed -- Below SNR threshold,0,0.00% ... Failed -- No usable subreads,2093,0.31% ... Failed -- Insert size too long,10,0.01% ... Failed -- Insert size too small,79,0.01% ... Failed -- Not enough full passes,343876,50.12% ... Failed -- Too many unusable subreads,0,0.00% ... Failed -- CCS did not converge,0,0.00% ... Failed -- CCS below minimum predicted accuracy,138083,20.12% ... Failed -- Unknown error during processing,0,0.00% ... ... ... ''').lstrip()) >>> reportfile.flush() >>> report_to_stats(reportfile.name) status number fraction 0 Success (without retry) -- CCS generated 202033 0.2944 1 Success (with retry) -- CCS generated 2 0.0000 2 Failed -- Below SNR threshold 0 0.0000 3 Failed -- No usable subreads 2093 0.0031 4 Failed -- Insert size too long 10 0.0001 5 Failed -- Insert size too small 79 0.0001 6 Failed -- Not enough full passes 343876 0.5012 7 Failed -- Too many unusable subreads 0 0.0000 8 Failed -- CCS did not converge 0 0.0000 9 Failed -- CCS below minimum predicted accuracy 138083 0.2012 10 Failed -- Unknown error during processing 0 0.0000 >>> reportfile.close()