ccs

Tools for inspecting output of the PacBio ccs program: https://github.com/PacificBiosciences/unanimity/blob/develop/doc/PBCCS.md

class alignparse.ccs.Summaries(df, *, name_col='name', fastq_col='fastq', report_col='report', ncpus=-1)[source]

Bases: object

Summaries of ccs runs.

Parameters:
  • df (pandas.DataFrame) – Data frame giving information on runs of ccs being summarized.

  • name_col (str) – Column in df with name of ccs run.

  • fastq_col (str) – Column in df with FASTQ file for run, appropriate for passing to Summary as fastqfile.

  • report_col (str or None) – Column in df with report for run, appropriate for passing to Summary as reportfile. Set to None if no reports. If there are no reports, then ZMW stats are not available.

  • ncpus (int) – Number of CPUs to use; -1 means all available. Useful as processing all the FASTQ files to compute read accuracies can take a while.

summaries

List of Summary objects for each run.

Type:

list

ccs_stats(variable)[source]

Get CCS stats for all runs.

Parameters:

variable ({'length', 'passes', 'accuracy'}) – Variable for which we get stats. You will get an error if Summaries.has_stat() is not true for variable.

Returns:

Data frame with columns of ‘name’ (holding run name) and the value of variable giving variable value for all CCSs.

Return type:

pandas.DataFrame

has_stat(variable)[source]

Do the summaries contain a statistic?

Parameters:

variable ({'length', 'passes', 'accuracy'}) – Variable for which we want statistics.

Returns:

True if all runs have statistics for variable, False otherwise.

Return type:

bool

has_zmw_stats()[source]

Are ZMW stats available?

Returns:

True if all runs have ZMW stats, False otherwise.

Return type:

bool

plot_ccs_stats(variable, *, trim_frac=0.005, bins=25, histogram_stat='count', maxcol=None, panelsize=1.75)[source]

Plot histograms of CCS stats for all runs.

Parameters:
  • variable ({'length', 'passes', 'accuracy'}) – Variable for which we plot stats. You will get an error if Summaries.has_stat() is not true for variable.

  • trim_frac (float) – Trim this amount of the bottom and top fraction from the data before plotting. Useful if outliers greatly extend scale.

  • bins (int) – Number of histogram binds

  • histogram_stat ({'count', 'density'}) – Plot the count of CCSs or their density normalized for each run.

  • maxcol (None or int) – Max number of columns in faceted plot.

  • panelsize (float) – Size of each plot panel.

Returns:

A panel of histograms.

Return type:

plotnine.ggplot.ggplot

plot_zmw_stats(**kwargs)[source]

Plot of ZMW stats for all runs.

Note

Raises an error if Summaries.has_zmw_stats() is not True.

Parameters:

**kwargs (dict) – Keyword arguments passed to Summaries.zmw_stats().

Returns:

Stacked bar graph of ZMW stats for each run.

Return type:

plotnine.ggplot.ggplot

zmw_stats(*, minfailfrac=0.01, groupsuccess=True)[source]

Get ZMW stats for all runs.

Note

Raises an error if Summaries.has_zmw_stats() is not True.

Parameters:
  • minfailfrac (float) – Group failure categories with <= this fraction ZMWs for all runs.

  • groupsuccess (bool) – Group all success categories into ‘Success – CCS generated’.

Returns:

Data frame with stats on ZMW status for all runs.

Return type:

pandas.DataFrame

class alignparse.ccs.Summary(name, fastqfile, reportfile)[source]

Bases: object

Summary of a single ccs run.

Parameters:
  • name (str) – Name of the ccs run.

  • fastqfile (str) – FASTQ file with circular consensus sequences, typically generated by PacBio ccs program. Can be gzipped. Number of passes is determined from the np tag in the FASTQ comments if present.

  • reportfile (str or None) – Report file generated by ccs using --reportFile flag, or None if no such report available.

name

Name of ccs run.

Type:

str

fastqfile

FASTQ file with the circular consensus sequences.

Type:

str

reportfile

ccs report file. Currently handles reports in formats generated by ccs versions 3.* and 4.0.

Type:

str or None

zmw_stats

Stats on ZMW extracted from reportfile.

Type:

pandas.DataFrame or None

passes

Lists number of passes for all circular consensus sequences, or None if this information is not available in a np tag in fastqfile.

Type:

numpy.array or None

accuracy

Lists accuracies for all circular consensus sequences. This is the average accuracy across sites computed from Q-values in fastqfile.

Type:

numpy.array

length

Lists lengths for all circular consensus sequences.

Type:

numpy.array

alignparse.ccs.get_ccs_stats(fastqfile, *, pass_tag='np')[source]

Get basic statistics about circular consensus sequences.

Parameters:
  • fastqfile (str) – FASTQ file with circular consensus sequences, generated from BAM output of ccs.

  • pass_tag (str) – Tag in FASTQ file header giving number of passes (if present).

Returns:

The 3-tuple (passes, accuracy, length) contains numpy arrays giving number of passes, accuracies, and lengths for sequences in fastqfile. The accuracy (average across sequence) and length are computed from the quality scores and sequence in fastqfile; the number of passes must be provided as pass_tag and are returned as None if that tag is missing from any read in fastqfile.

Return type:

collections.namedtuple

Example

Test with np tag in format from ccs version 5.0.

>>> fastqfile = tempfile.NamedTemporaryFile(mode='w')
>>> _ = fastqfile.write(textwrap.dedent('''
...   @m54228_190118_102822/4194373/ccs np=18
...   GGTACCACACTCTTTCCCTACACGACGCTCTGCCGATCTCGGCCATTACGTGTTTTATCTA
...   +
...   ~~~~{~~~~~~~c~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~i~~~~~~~(
...   @m54228_190118_102822/4194374/ccs np=51
...   GCACGGCGTCACACTTTGCTATGCCATAGCATGTTTATCCATAAGATTAGCGGATCCTACCT
...   +
...   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...   ''').lstrip())
>>> fastqfile.flush()
>>> get_ccs_stats(fastqfile.name)  
CCS_Stats(passes=array([18, 51]),
          accuracy=array([0.99672907, 1.        ]),
          length=array([61, 62]))
>>> fastqfile.close()

Test with np tag in format from ccs version 4.0.

>>> fastqfile = tempfile.NamedTemporaryFile(mode='w')
>>> _ = fastqfile.write(textwrap.dedent('''
...   @m54228_190118_102822/4194373/ccs np:i:18
...   GGTACCACACTCTTTCCCTACACGACGCTCTGCCGATCTCGGCCATTACGTGTTTTATCTA
...   +
...   ~~~~{~~~~~~~c~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~i~~~~~~~(
...   @m54228_190118_102822/4194374/ccs np:i:51
...   GCACGGCGTCACACTTTGCTATGCCATAGCATGTTTATCCATAAGATTAGCGGATCCTACCT
...   +
...   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...   ''').lstrip())
>>> fastqfile.flush()
>>> get_ccs_stats(fastqfile.name)  
CCS_Stats(passes=array([18, 51]),
          accuracy=array([0.99672907, 1.        ]),
          length=array([61, 62]))
>>> fastqfile.close()

Similar test without np tag in FASTQ file:

>>> fastqfile = tempfile.NamedTemporaryFile(mode='w')
>>> _ = fastqfile.write(textwrap.dedent('''
...   @m54228_190118_102822/4194373/ccs
...   GGTACCACACTCTTTCCCTACACGACGCTCTGCCGATCTCGGCCATTACGTGTTTTATCTA
...   +
...   ~~~~{~~~~~~~c~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~i~~~~~~~(
...   @m54228_190118_102822/4194374/ccs
...   GCACGGCGTCACACTTTGCTATGCCATAGCATGTTTATCCATAAGATTAGCGGATCCTACCT
...   +
...   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...   ''').lstrip())
>>> fastqfile.flush()
>>> get_ccs_stats(fastqfile.name)  
CCS_Stats(passes=None,
          accuracy=array([0.99672907, 1.        ]),
          length=array([61, 62]))
>>> fastqfile.close()
alignparse.ccs.report_to_stats(reportfile)[source]

Parse ZMW statistics from report file.

Parameters:

reportfile (str) – Report file generated by ccs using --reportFile flag using ccs version 3.* and 4.0, or the --report-file flag using ccs version 5.0.

Returns:

A data frame with the statistics.

Return type:

pandas.DataFrame

Example

An example of the ccs version 6.0.0 output:

>>> pd.set_option('display.max_columns', 10)  # show many columns
>>> reportfile = tempfile.NamedTemporaryFile(mode='w')
>>> _ = reportfile.write(textwrap.dedent('''
...     ZMWs input               : 3739007
...
...     ZMWs pass filters        : 1669857 (44.66%)
...     ZMWs fail filters        : 2069150 (55.34%)
...     ZMWs shortcut filters    : 0 (0.000%)
...
...     ZMWs with tandem repeats : 1096 (0.053%)
...
...     Exclusive failed counts
...     Below SNR threshold      : 53110 (2.567%)
...     Median length filter     : 0 (0.000%)
...     Lacking full passes      : 1382995 (66.84%)
...     Heteroduplex insertions  : 30137 (1.456%)
...     Coverage drops           : 2193 (0.106%)
...     Insufficient draft cov   : 9055 (0.438%)
...     Draft too different      : 71 (0.003%)
...     Draft generation error   : 22575 (1.091%)
...     Draft above --max-length : 235 (0.011%)
...     Draft below --min-length : 227 (0.011%)
...     Reads failed polishing   : 0 (0.000%)
...     Empty coverage windows   : 264 (0.013%)
...     CCS did not converge     : 525 (0.025%)
...     CCS below minimum RQ     : 567763 (27.44%)
...     Unknown error            : 0 (0.000%)
...
...     Additional passing metrics
...     ZMWs missing adapters    : 403623 (24.17%)
...     ''').lstrip())
>>> reportfile.flush()
>>> report_to_stats(reportfile.name) 
                                status  number  fraction
0             Success -- CCS generated 1669857  0.446604
1        Failed -- Below SNR threshold   53110  0.014204
2       Failed -- Median length filter       0  0.000000
3        Failed -- Lacking full passes 1382995  0.369883
4    Failed -- Heteroduplex insertions   30137  0.008060
5             Failed -- Coverage drops    2193  0.000587
6     Failed -- Insufficient draft cov    9055  0.002422
7        Failed -- Draft too different      71  0.000019
8     Failed -- Draft generation error   22575  0.006038
9   Failed -- Draft above --max-length     235  0.000063
10  Failed -- Draft below --min-length     227  0.000061
11    Failed -- Reads failed polishing       0  0.000000
12    Failed -- Empty coverage windows     264  0.000071
13      Failed -- CCS did not converge     525  0.000140
14      Failed -- CCS below minimum RQ  567763  0.151849
15             Failed -- Unknown error       0  0.000000
16           Failed -- shortcut filter       0  0.000000
>>> reportfile.close()

An example of the ccs version 5.0.0 output:

>>> pd.set_option('display.max_columns', 10)  # show many columns
>>> reportfile = tempfile.NamedTemporaryFile(mode='w')
>>> _ = reportfile.write(textwrap.dedent('''
...     ZMWs input               : 535101
...
...     ZMWs pass filters        : 325574 (60.84%)
...     ZMWs fail filters        : 209527 (39.16%)
...     ZMWs shortcut filters    : 0 (0.00%)
...
...     ZMWs with tandem repeats : 98 (0.02%)
...
...     Exclusive counts for ZMWs failing filters:
...     Below SNR threshold      : 0 (0.00%)
...     Median length filter     : 0 (0.00%)
...     Lacking full passes      : 94869 (45.28%)
...     Heteroduplex insertions  : 4373 (2.09%)
...     Coverage drops           : 526 (0.25%)
...     Insufficient draft cov   : 2783 (1.33%)
...     Draft too different      : 1436 (0.69%)
...     Draft generation error   : 5426 (2.59%)
...     Draft above --max-length : 469 (0.22%)
...     Draft below --min-length : 152 (0.07%)
...     Reads failed polishing   : 0 (0.00%)
...     Empty coverage windows   : 136 (0.06%)
...     CCS did not converge     : 44 (0.02%)
...     CCS below minimum RQ     : 99782 (47.62%)
...     Unknown error            : 0 (0.00%)
...     ''').lstrip())
>>> reportfile.flush()
>>> report_to_stats(reportfile.name)
                                status  number  fraction
0             Success -- CCS generated  325574  0.607902
1        Failed -- Below SNR threshold       0  0.000000
2       Failed -- Median length filter       0  0.000000
3        Failed -- Lacking full passes   94869  0.177137
4    Failed -- Heteroduplex insertions    4373  0.008165
5             Failed -- Coverage drops     526  0.000982
6     Failed -- Insufficient draft cov    2783  0.005196
7        Failed -- Draft too different    1436  0.002681
8     Failed -- Draft generation error    5426  0.010131
9   Failed -- Draft above --max-length     469  0.000876
10  Failed -- Draft below --min-length     152  0.000284
11    Failed -- Reads failed polishing       0  0.000000
12    Failed -- Empty coverage windows     136  0.000254
13      Failed -- CCS did not converge      44  0.000082
14      Failed -- CCS below minimum RQ   99782  0.186310
15             Failed -- Unknown error       0  0.000000
16           Failed -- shortcut filter       0  0.000000
>>> reportfile.close()

An example of the ccs version 4.0.0 output:

>>> pd.set_option('display.max_columns', 10)  # show many columns
>>> reportfile = tempfile.NamedTemporaryFile(mode='w')
>>> _ = reportfile.write(textwrap.dedent('''
...     ZMWs input          (A)  : 686919
...     ZMWs generating CCS (B)  : 182500 (26.57%)
...     ZMWs filtered       (C)  : 504419 (73.43%)
...
...     Exclusive ZMW counts for (C):
...     No usable subreads       : 0 (0.00%)
...     Below SNR threshold      : 0 (0.00%)
...     Lacking full passes      : 344375 (68.27%)
...     Heteroduplexes           : 1056 (0.21%)
...     Min coverage violation   : 2035 (0.40%)
...     Draft generation error   : 7636 (1.51%)
...     Draft above --max-length : 49 (0.01%)
...     Draft below --min-length : 2 (0.00%)
...     Lacking usable subreads  : 0 (0.00%)
...     CCS did not converge     : 0 (0.00%)
...     CCS below minimum RQ     : 149315 (29.60%)
...     Unknown error            : 0 (0.00%)
...     ''').lstrip())
>>> reportfile.flush()
>>> report_to_stats(reportfile.name)
                                status  number  fraction
0             Success -- CCS generated  182500  0.265660
1         Failed -- No usable subreads       0  0.000000
2        Failed -- Below SNR threshold       0  0.000000
3        Failed -- Lacking full passes  344375  0.501297
4             Failed -- Heteroduplexes    1056  0.001537
5     Failed -- Min coverage violation    2035  0.002962
6     Failed -- Draft generation error    7636  0.011116
7   Failed -- Draft above --max-length      49  0.000071
8   Failed -- Draft below --min-length       2  0.000003
9    Failed -- Lacking usable subreads       0  0.000000
10      Failed -- CCS did not converge       0  0.000000
11      Failed -- CCS below minimum RQ  149315  0.217354
12             Failed -- Unknown error       0  0.000000
>>> reportfile.close()

An example of the ccs version 3.1.0 output:

>>> reportfile = tempfile.NamedTemporaryFile(mode='w')
>>> _ = reportfile.write(textwrap.dedent('''
...     ZMW Yield
...     Success -- CCS generated,242220,45.57%
...     Failed -- Below SNR threshold,0,0.00%
...     Failed -- No usable subreads,4877,0.92%
...     Failed -- Insert size too long,35,0.00%
...     Failed -- Insert size too small,0,0.00%
...     Failed -- Not enough full passes,180620,33.98%
...     Failed -- Too many unusable subreads,1,0.00%
...     Failed -- CCS did not converge,23,0.00%
...     Failed -- CCS below minimum predicted accuracy,103801,19.53%
...     Failed -- Unknown error during processing,0,0.00%
...
...
...     Subread Yield
...     Success - Used for CCS,10972010,89.06%
...     Failed -- Below SNR threshold,0,0.00%
...     Failed -- Alpha/Beta mismatch,171,0.00%
...     Failed -- Below minimum quality,0,0.00%
...     Failed -- Filtered by size,144209,1.17%
...     Failed -- Identity too low,2745750,22.29%
...     Failed -- Z-Score too low,0,0.00%
...     Failed -- From ZMW with too few passes,274296,2.23%
...     Failed -- Other,928871,7.54%
...     ''').lstrip())
>>> reportfile.flush()
>>> report_to_stats(reportfile.name)
                                           status  number  fraction
0                        Success -- CCS generated  242220    0.4557
1                   Failed -- Below SNR threshold       0    0.0000
2                    Failed -- No usable subreads    4877    0.0092
3                  Failed -- Insert size too long      35    0.0000
4                 Failed -- Insert size too small       0    0.0000
5                Failed -- Not enough full passes  180620    0.3398
6            Failed -- Too many unusable subreads       1    0.0000
7                  Failed -- CCS did not converge      23    0.0000
8  Failed -- CCS below minimum predicted accuracy  103801    0.1953
9       Failed -- Unknown error during processing       0    0.0000
>>> reportfile.close()

An example of the ccs version 3.4.1 output:

>>> reportfile = tempfile.NamedTemporaryFile(mode='w')
>>> _ = reportfile.write(textwrap.dedent('''
...     ZMW Yield
...     Success (without retry) -- CCS generated,202033,29.44%
...     Success (with retry)    -- CCS generated,2,0.00%
...     Failed -- Below SNR threshold,0,0.00%
...     Failed -- No usable subreads,2093,0.31%
...     Failed -- Insert size too long,10,0.01%
...     Failed -- Insert size too small,79,0.01%
...     Failed -- Not enough full passes,343876,50.12%
...     Failed -- Too many unusable subreads,0,0.00%
...     Failed -- CCS did not converge,0,0.00%
...     Failed -- CCS below minimum predicted accuracy,138083,20.12%
...     Failed -- Unknown error during processing,0,0.00%
...
...
...     ''').lstrip())
>>> reportfile.flush()
>>> report_to_stats(reportfile.name)  
                                           status  number fraction
0        Success (without retry) -- CCS generated  202033   0.2944
1        Success (with retry)    -- CCS generated       2   0.0000
2                   Failed -- Below SNR threshold       0   0.0000
3                    Failed -- No usable subreads    2093   0.0031
4                  Failed -- Insert size too long      10   0.0001
5                 Failed -- Insert size too small      79   0.0001
6                Failed -- Not enough full passes  343876   0.5012
7            Failed -- Too many unusable subreads       0   0.0000
8                  Failed -- CCS did not converge       0   0.0000
9  Failed -- CCS below minimum predicted accuracy  138083   0.2012
10      Failed -- Unknown error during processing       0   0.0000
>>> reportfile.close()