Single-cell virus sequencing

This example shows how to use alignparse to process full-length PacBio CCSs of viral cDNAs generated in single-cell virus sequencing of influenza-infected cells. Specifically, it processes a snippet of data taking from Russell et al, 2019, which was generated by PacBio sequencing of viral cDNAs with unique molecular identifiers (UMIS) and cell barcodes added using a 10X Chromium (v2 reagents). This notebook aligns these CCSs to the influenza transcripts and parses the barcodes / UMIs and viral mutations.

Set up for analysis

Import necessary Python modules. We use alignparse for most of the operations, plotnine for ggplot2-like plotting:

[1]:
import os
import warnings

import numpy

import pandas as pd

from plotnine import *

import alignparse.ccs
import alignparse.consensus
import alignparse.minimap2
import alignparse.targets
from alignparse.constants import CBPALETTE

Suppress warnings that clutter output:

[2]:
warnings.simplefilter("ignore")

Directory for output:

[3]:
outdir = "./output_files/"
os.makedirs(outdir, exist_ok=True)

Target amplicons

The alignment targets consist of PCR amplicons covering the influenza virus mRNAs (WSN strain) with a 5’ termini corresponding to the primer binding site, the 3’ polyA tail used for reverse transcription, the UMI and cell barcode, and the common 3’ termini. There are also variant tags distinguishing the synonymously tagged viral variants (see Russell et al, 2019).

First, let’s look at the Genbank file holding the targets. The targets are defined in Genbank Flat File format Below we show the first 85 lines (the actual file is quite large as it defined all 10 influenza mRNAs):

[4]:
targetfile = "input_files/flu_WSN_amplicons.gb"

nlines_to_show = 85
with open(targetfile) as f:
    print("".join(next(f) for _ in range(nlines_to_show)))
LOCUS       fluPB2                  2395 bp    DNA              UNK 01-JAN-1980
DEFINITION  fluPB2.
ACCESSION   fluPB2
VERSION     fluPB2
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     termini5        1..34
     variant_tag_1   143..143
     variant_tag_2   149..149
     variant_tag_3   2163..2163
     variant_tag_4   2171..2171
     sequenced_mRNA  35..2316
     final_mRNA_nts  2317..2319
     polyA           2320..2347
     UMI             2348..2357
     cellbarcode     2358..2373
     termini3        2374..2395
ORIGIN
        1 gcgaaagcag gtcaattata ttcaatatgg aaagaataaa agaactaagg aatctaatgt
       61 cgcagtctcg cactcgcgag atactcacaa aaaccaccgt ggaccatatg gccataatca
      121 agaagtacac atcaggaaga cargagaara acccagcact taggatgaaa tggatgatgg
      181 caatgaaata tccaattaca gcagacaaga ggataacgga aatgattcct gagagaaatg
      241 agcagggaca aactttatgg agtaaaatga atgacgccgg atcagaccga gtgatggtat
      301 cacctctggc tgtgacatgg tggaatagga atggaccagt gacaagtaca gttcattatc
      361 caaaaatcta caaaacttat tttgaaaaag tcgaaaggtt aaaacatgga acctttggcc
      421 ctgtccattt tagaaaccaa gtcaaaatac gtcgaagagt tgacataaat cctggtcatg
      481 cagatctcag tgccaaagag gcacaggatg taatcatgga agttgttttc cctaacgaag
      541 tgggagccag gatactaaca tcggaatcgc aactaacgac aaccaaagag aagaaagaag
      601 aactccaggg ttgcaaaatt tctcctctga tggtggcata catgttggag agagaactgg
      661 tccgcaaaac gagattcctc ccagtggctg gtggaacaag cagtgtgtac attgaagtgt
      721 tgcatttgac ccaaggaaca tgctgggaac agatgtacac tccaggaggg gaggcgagga
      781 atgatgatgt tgatcaaagc ttaattattg ctgctagaaa catagtaaga agagccacag
      841 tatcagcaga tccactagca tctttattgg agatgtgcca cagcacgcag attggtggaa
      901 taaggatggt aaacatcctt aggcagaacc caacagaaga gcaagccgtg gatatttgca
      961 aggctgcaat gggactgaga attagctcat ccttcagttt tggtggattc acatttaaga
     1021 gaacaagcgg atcatcagtc aagagagagg aagaggtgct tacgggcaat cttcagacat
     1081 tgaagataag agtgcatgag ggatatgaag agttcacaat ggttgggaga agagcaacag
     1141 ctatactcag aaaagcaacc aggagattga ttcagctgat agtgagtggg agagacgaac
     1201 agtcgattgc cgaagcaata attgtggcca tggtattttc acaagaggat tgtatgataa
     1261 aagcagttag aggtgacctg aatttcgtca atagggcgaa tcagcgattg aatcccatgc
     1321 accaactttt gagacatttt cagaaggatg caaaggtgct ctttcaaaat tggggaattg
     1381 aatccatcga caatgtgatg ggaatgatcg ggatattgcc cgacatgact ccaagcaccg
     1441 agatgtcaat gagaggagtg agaatcagca aaatgggggt agatgagtat tccagcgcgg
     1501 agaagatagt ggtgagcatt gaccgttttt tgagagttag ggaccaacgt gggaatgtac
     1561 tactgtctcc cgaggagatc agtgaaacac agggaacaga gaaactgaca ataacttact
     1621 catcgtcaat gatgtgggag attaatggtc ctgaatcagt gttggtcaat acctatcagt
     1681 ggatcatcag aaactgggaa actgttaaaa ttcagtggtc ccagaatcct acaatgctgt
     1741 acaataaaat ggaatttgag ccatttcagt ctttagttcc aaaggccgtt agaggccaat
     1801 acagtgggtt tgtgagaact ctgttccaac aaatgaggga tgtgcttggg acatttgata
     1861 ccgctcagat aataaaactt cttcccttcg cagccgctcc accaaagcaa agtagaacgc
     1921 agttctcctc attgactata aatgtgaggg gatcaggaat gagaatactt gtaaggggca
     1981 attctccagt attcaactac aacaagacca ctaaaagact cacagttctc ggaaaggatg
     2041 ctggcccttt aactgaagac ccagatgaag gcacagctgg agttgagtcc gcagttctga
     2101 gaggattcct cattctgggc aaagaagaca ggagatatgg accagcatta agcataaatg
     2161 aaytgagcaa ycttgcgaaa ggagagaagg ctaatgtgct aattgggcaa ggagacgtgg
     2221 tgttggtaat gaaacggaaa cggaactcta gcatacttac tgacagccag acagcgacca
     2281 aaagaattcg gatggccatc aattagtgtc gaatagttta aaaaaaaaaa aaaaaaaaaa
     2341 aaaaaaannn nnnnnnnnnn nnnnnnnnnn nnnagatcgg aagagcgtcg tgtag
//
LOCUS       fluPB1                  2395 bp    DNA              UNK 01-JAN-1980
DEFINITION  fluPB1.
ACCESSION   fluPB1
VERSION     fluPB1
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     termini5        1..22
     variant_tag_1   146..146
     variant_tag_2   152..152
     variant_tag_3   2159..2159
     variant_tag_4   2162..2162
     sequenced_mRNA  23..2316
     final_mRNA_nts  2317..2319
     polyA           2320..2347
     UMI             2348..2357
     cellbarcode     2358..2373
     termini3        2374..2395
ORIGIN
        1 gcgaaagcag gcaaaccatt tgaatggatg tcaatccgac tttacttttc ttaaaagtgc
       61 cagcacaaaa tgctataagc acaactttcc cttatactgg agaccctcct tacagccatg

We also have the YAML file specifying how to parse the features in the alignments. Note how this file uses the YAML syntax for defaults to avoid repeating the same specifications for each target:

[5]:
feature_parse_specs_file = "input_files/flu_WSN_feature_parse_specs.yaml"
with open(feature_parse_specs_file) as f:
    print(f.read())
# default for genes with two variant tags
default_2tags: &default_2tags
  query_clip5: 4
  query_clip3: 4
  termini5:
    filter:
      clip5: 4
      mutation_nt_count: 1
      mutation_op_count: null
  sequenced_mRNA:
    filter:
      mutation_nt_count: null
      mutation_op_count: 10
    return: [mutations, accuracy]
  final_mRNA_nts:
    filter:
      mutation_nt_count: null
      mutation_op_count: 2
  polyA:
    filter:
      mutation_nt_count: 12
      mutation_op_count: 4
  UMI:
    return: [sequence, accuracy]
  cellbarcode:
    return: [sequence, accuracy]
  termini3:
    filter:
      clip5: 4
      mutation_nt_count: 1
      mutation_op_count: null
  variant_tag_1:
    filter:
      mutation_nt_count: 1
      mutation_op_count: null
    return: sequence
  variant_tag_2:
    filter:
      mutation_nt_count: 1
      mutation_op_count: null
    return: sequence

# default for genes with four variant tags
default_4tags: &default_4tags
  <<: *default_2tags
  variant_tag_3:
    filter:
      mutation_nt_count: 1
      mutation_op_count: null
    return: sequence
  variant_tag_4:
    filter:
      mutation_nt_count: 1
      mutation_op_count: null
    return: sequence

# use defaults to define specs for actual genes
fluPB2:
  <<: *default_4tags

fluPB1:
  <<: *default_4tags

fluPA:
  <<: *default_4tags

fluHA:
  <<: *default_4tags

fluNP:
  <<: *default_4tags

fluNA:
  <<: *default_4tags

fluM1:
  <<: *default_4tags

fluM2:
  <<: *default_2tags

fluNS1:
  <<: *default_4tags

fluNS2:
  <<: *default_2tags

We now read the targets into an alignparse.targets.Targets object with the feature-parsing specs (ignoring the keys that define thet defaults):

[6]:
targets = alignparse.targets.Targets(
    seqsfile=targetfile,
    feature_parse_specs=feature_parse_specs_file,
    ignore_feature_parse_specs_keys=["default_2tags", "default_4tags"],
    allow_extra_features=True,
)

Plot the Targets:

[7]:
_ = targets.plot(ax_width=10)
_images/flu_virus_seq_example_14_0.png

Several things of note about how these targets are defined. These all relate to the fact that the most difficult region to parse is the polyA tail and the UMI / cell barcode–the reason being that the polyA tail is not sequenced very accurately (it is a homopolymer) plus appears to be somewhat impure in the actual primers, so is difficult to align correctly. This is compounded by the fact that we don’t know the actual UMI sequence and so don’t want indels of the polyA tail to “bleed” into the UMI or gene alignment. Therefore:

  • We define a separate feature for the final nucleotides (last 3) of the mRNA upstream of the sequenced mRNA, so polyA indels don’t get assigned to the sequenced mRNA too often.

  • We define the polyA to be just 28 nucleotides in the alignment even though the length in the primer is 30, so that deletions don’t affect the UMI.

These empirically help with the alilgnments.

PacBio CCSs

Now let’s look at the PacBio CCSs. First, define a data frame with the information on the PacBio runs (here we just have one):

[8]:
pacbio_runs = pd.DataFrame(
    {"name": ["flu_WSN"], "fastq": ["input_files/flu_WSN_ccs.fastq"]}
)

pacbio_runs
[8]:
name fastq
0 flu_WSN input_files/flu_WSN_ccs.fastq

Create an alignparse.ccs.Summaries object:

[9]:
ccs_summaries = alignparse.ccs.Summaries(pacbio_runs, report_col=None)

Statistics on the CCSs (length, number of subread passes, quality):

[10]:
# NBVAL_IGNORE_OUTPUT
for stat in ["length", "passes", "accuracy"]:
    if ccs_summaries.has_stat(stat):
        p = ccs_summaries.plot_ccs_stats(stat)
        p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
        _ = p.draw()
    else:
        print(f"No {stat} statistics available.")

Align and parse

First, we create an alignparse.minimap2.Mapper to run minimap2, which is used for the alignments. We use minimap2 options that are tailored for viruses with potentially large internal deletions (which are handled like spliced introns); these options are specified by alignparse.minimap2.OPTIONS_VIRUS_W_DEL:

[11]:
# NBVAL_IGNORE_OUTPUT

mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_VIRUS_W_DEL)

print(
    f"Using `minimap2` {mapper.version} with these options:\n"
    + " ".join(mapper.options)
)
Using `minimap2` 2.22-r1101 with these options:
-xsplice:hq -un -C0 --splice-flank=no -M=1 --end-seed-pen=2 --end-bonus=2 --secondary=no --cs

Now use Targets.align_and_parse to align and parse the CCSs. First, create the output directory:

[12]:
align_and_parse_outdir = os.path.join(outdir, "flu_WSN_align_and_parse")

Now do the alignments and parsing:

[13]:
readstats, aligned, filtered = targets.align_and_parse(
    df=pacbio_runs,
    mapper=mapper,
    outdir=align_and_parse_outdir,
    name_col="name",
    queryfile_col="fastq",
    overwrite=True,  # overwrite any existing output
    ncpus=-1,  # use all available CPUs
)

First, let’s look at the read alignment stats:

[14]:
readstats
[14]:
name category count
0 flu_WSN filtered fluPB2 15
1 flu_WSN aligned fluPB2 50
2 flu_WSN filtered fluPB1 34
3 flu_WSN aligned fluPB1 165
4 flu_WSN filtered fluPA 36
5 flu_WSN aligned fluPA 173
6 flu_WSN filtered fluHA 2
7 flu_WSN aligned fluHA 8
8 flu_WSN filtered fluNP 6
9 flu_WSN aligned fluNP 23
10 flu_WSN filtered fluNA 3
11 flu_WSN aligned fluNA 18
12 flu_WSN filtered fluM1 7
13 flu_WSN aligned fluM1 14
14 flu_WSN filtered fluM2 1
15 flu_WSN aligned fluM2 2
16 flu_WSN filtered fluNS1 3
17 flu_WSN aligned fluNS1 20
18 flu_WSN filtered fluNS2 0
19 flu_WSN aligned fluNS2 0
20 flu_WSN unmapped 108

Plot these stats:

[15]:
# NBVAL_IGNORE_OUTPUT
p = (
    ggplot(
        readstats.assign(
            category=lambda x: pd.Categorical(
                x["category"], x["category"].unique(), ordered=True
            ),
            is_aligned=lambda x: x["category"].str.contains("aligned"),
        ),
        aes("category", "count", fill="is_aligned"),
    )
    + geom_bar(stat="identity")
    + facet_wrap("~ name", nrow=1)
    + theme(
        axis_text_x=element_text(angle=90),
        figure_size=(0.3 * len(readstats), 2.5),
        panel_grid_major_x=element_blank(),  # no vertical grid lines
    )
    + scale_fill_manual(values=CBPALETTE)
)
_ = p.draw()

Now let’s look at the reason that reads that were filtered failed to align. Recall that filtered holds a data frame for each target:

[16]:
for target in targets.target_names[:1]:
    print(f"First few lines of `filtered` for {target}:")
    display(filtered[target].head())
First few lines of `filtered` for fluPB2:
name query_name filter_reason
0 flu_WSN m54228_171201_172448/5308608/ccs termini5 clip5
1 flu_WSN m54228_171201_172448/6554482/ccs polyA mutation_nt_count
2 flu_WSN m54228_171201_172448/7143925/ccs polyA mutation_nt_count
3 flu_WSN m54228_171201_172448/9437481/ccs final_mRNA_nts mutation_op_count
4 flu_WSN m54228_171201_172448/9830886/ccs polyA mutation_nt_count

Concatenate the information for all of the targets and plot reasons why mapped reads were filtered:

[17]:
# NBVAL_IGNORE_OUTPUT
p = (
    ggplot(
        pd.concat([df.assign(gene=gene) for gene, df in filtered.items()]).assign(
            gene=lambda x: pd.Categorical(x["gene"], x["gene"].unique(), ordered=True)
        ),
        aes("filter_reason"),
    )
    + geom_bar()
    + facet_wrap("~ gene", ncol=5)
    + theme(
        axis_text_x=element_text(angle=90),
        figure_size=(9, 4),
        panel_grid_major_x=element_blank(),  # no vertical grid lines
    )
)

_ = p.draw()

Now let’s look at the first few aligned reads for the first few genes:

[18]:
for target in targets.target_names[:4]:
    print(f"\nFirst few entries in `aligned` for {target}:")
    display(aligned[target].head())

First few entries in `aligned` for fluPB2:
name query_name query_clip5 query_clip3 sequenced_mRNA_mutations sequenced_mRNA_accuracy UMI_sequence UMI_accuracy cellbarcode_sequence cellbarcode_accuracy variant_tag_1_sequence variant_tag_2_sequence variant_tag_3_sequence variant_tag_4_sequence
0 flu_WSN m54228_171201_172448/5243086/ccs 0 0 del279to1904 C2077A C2157A C2214A 1.0 CAATAGGTAC 1.0 AGTAGTCACTTCAAGC 1.0 G G C C
1 flu_WSN m54228_171201_172448/5833039/ccs 0 0 del117to2070 1.0 AGCGCTCTTT 1.0 ATAAGGTGACCTCTCA 1.0 A A T T
2 flu_WSN m54228_171201_172448/5898400/ccs 0 0 G67T del152to1989 1.0 TGTCACTGGC 1.0 TCGGAGCTGGTCTTAG 1.0 G G C C
3 flu_WSN m54228_171201_172448/6160599/ccs 0 0 G242T C430G del551to2282 1.0 CTCGGTCCGC 1.0 GTGCGCATGGAAATCA 1.0 G G
4 flu_WSN m54228_171201_172448/6161088/ccs 0 0 del152to1989 G2057T G2118T 1.0 GGTTGGCGCT 1.0 CGGAGCTGGTCTTTAG 1.0 G G C C

First few entries in `aligned` for fluPB1:
name query_name query_clip5 query_clip3 sequenced_mRNA_mutations sequenced_mRNA_accuracy UMI_sequence UMI_accuracy cellbarcode_sequence cellbarcode_accuracy variant_tag_1_sequence variant_tag_2_sequence variant_tag_3_sequence variant_tag_4_sequence
0 flu_WSN m54228_171201_172448/4784702/ccs 0 0 del171to2016 del2268to2268 1.000000 GCGTATCCTG 1.000000 CCTTCGTGATGATTGC 1.000000 T C C T
1 flu_WSN m54228_171201_172448/4785077/ccs 0 0 del180to2032 1.000000 GTGCATGTGT 1.000000 TTCAGCCTGTTCAAGC 1.000000 T C C T
2 flu_WSN m54228_171201_172448/4915675/ccs 0 0 del171to2016 del2070to2070 1.000000 CACAAACTAT 1.000000 GCACAGCTGACTGATC 1.000000 T C C T
3 flu_WSN m54228_171201_172448/5111911/ccs 0 0 ins397T C468T T1612C 0.999384 AGCGTGCCAC 0.999999 ATCCTGGCTGCGTCCA 0.999998 T C C T
4 flu_WSN m54228_171201_172448/5112319/ccs 0 0 del258to2001 1.000000 CAACGATGAC 1.000000 GGCCACAACGCGAATG 1.000000 T C C T

First few entries in `aligned` for fluPA:
name query_name query_clip5 query_clip3 sequenced_mRNA_mutations sequenced_mRNA_accuracy UMI_sequence UMI_accuracy cellbarcode_sequence cellbarcode_accuracy variant_tag_1_sequence variant_tag_2_sequence variant_tag_3_sequence variant_tag_4_sequence
0 flu_WSN m54228_171201_172448/4587763/ccs 0 0 del117to1927 C2159A 1.000000 AAACTAGTGG 1.0 AAGACATGATCAGTGT 1.0 T T C
1 flu_WSN m54228_171201_172448/4653834/ccs 0 0 del179to1839 1.000000 AGCACACATA 1.0 TCCCATTGAAAACACA 1.0 C T A T
2 flu_WSN m54228_171201_172448/4719236/ccs 0 0 C42A del179to1839 1.000000 ACTGAAACAA 1.0 TCCCGTTGAAAACACA 1.0 C T A T
3 flu_WSN m54228_171201_172448/4784775/ccs 0 0 del66to177 del265to1996 C2028T 0.999981 AATAGTTCCC 1.0 GGCAAACGAGACGCAA 1.0 A T
4 flu_WSN m54228_171201_172448/4850043/ccs 0 0 A64G ins402A del1063to1067 del1073to2180 0.999704 CATCATATTG 1.0 TGTGGTCCTACTGATC 1.0 T C

First few entries in `aligned` for fluHA:
name query_name query_clip5 query_clip3 sequenced_mRNA_mutations sequenced_mRNA_accuracy UMI_sequence UMI_accuracy cellbarcode_sequence cellbarcode_accuracy variant_tag_1_sequence variant_tag_2_sequence variant_tag_3_sequence variant_tag_4_sequence
0 flu_WSN m54228_171201_172448/4522378/ccs 0 0 G397A ins1145A del1251to1722 1.000000 TAGACCCTTT 1.00000 GAGTACACTATCGCGC 1.0 C G
1 flu_WSN m54228_171201_172448/6947580/ccs 0 0 del108to1459 1.000000 CTTTGGGGGG 0.99998 GTAGGTTTGTCACTAG 1.0 G T
2 flu_WSN m54228_171201_172448/7144230/ccs 2 0 C260A C262A ins288G del471to471 C1052A ins1450T 0.999627 CGTACGGCAG 1.00000 AAACTGGCTAGGAGCT 1.0 A A A C
3 flu_WSN m54228_171201_172448/7340424/ccs 0 0 G213A ins663A G1715A del1718to1722 0.999847 GCTTGCCGGC 1.00000 GGTCACTCTGCGTTAT 1.0 A A A C
4 flu_WSN m54228_171201_172448/8520141/ccs 0 0 G9T C153A del771to771 A1253C del1256to1722 0.999810 GTGCCAAGGA 1.00000 GAAAGGCTGGTACGAG 1.0 C G