fastq

Tools for processing FASTQ files.

dms_variants.fastq.iterate_fastq(filename, *, trim=None, check_pair=None, qual_format='str')

Iterate over a FASTQ file.

Parameters:
  • filename (str) – FASTQ file name, can be gzipped (extension .gz).

  • trim (int or None) – If not None, trim reads and Q scores to be no longer than this.

  • check_pair ({1, 2, None}) – If not None, check that each read is read 1 or read 2 (as specified) whenever the header gives this information. Assumes Casava 1.8 or SRA header format.

  • qual_format ({'str', 'array'}) – Return the quality scores as a string of ASCII characters ('str') or as an array of integers ('array').

Yields:

namedtuple

The entries in the tuple are (in order):
  • id : read id

  • seq : read sequence

  • qs : Q scores (qual_format parameter determines format)

  • fail : did read fail chastity filter? (None if no filter info)
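
The id and fail entries both come from the FASTQ header line. As a rough illustration only, the sketch below parses a Casava 1.8 style header of the form `@<id> <read>:<is_filtered>:<control>:<index>`. The names `HeaderInfo` and `parse_casava_header` are hypothetical and are not part of dms_variants; the library's own parsing may differ, for example in how it handles SRA headers.

```python
from typing import NamedTuple, Optional


class HeaderInfo(NamedTuple):
    """Hypothetical container for parsed header fields (not a dms_variants type)."""
    read_id: str
    read_number: Optional[int]
    fail: Optional[bool]


def parse_casava_header(header: str) -> HeaderInfo:
    """Parse a Casava 1.8 style header into id, read number, and filter flag."""
    if not header.startswith('@'):
        raise ValueError(f"header does not start with '@':\n{header}")
    fields = header[1:].split()
    if not fields:
        raise ValueError(f"header has no read id:\n{header}")
    read_id = fields[0]
    if len(fields) < 2:
        # No description field, so read number and filter status are unknown.
        return HeaderInfo(read_id, None, None)
    parts = fields[1].split(':')
    read_number = int(parts[0]) if parts[0] in {'1', '2'} else None
    # 'Y' means the read was filtered out, i.e. it failed the chastity filter.
    fail = {'Y': True, 'N': False}.get(parts[1]) if len(parts) > 1 else None
    return HeaderInfo(read_id, read_number, fail)


info = parse_casava_header(
    '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 2:Y:0:CGATGT')
print(info)
```

A check such as check_pair=1 then amounts to comparing the parsed read number against the expected value and raising ValueError on a mismatch.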

Example

>>> import tempfile
>>> from dms_variants.fastq import iterate_fastq
>>> f = tempfile.NamedTemporaryFile(mode='w')
>>> _ = f.write(
...         '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984 1:N:0:CGATGT\n'
...         'ATGCAATTG\n'
...         '+\n'
...         'GGGGGIIII\n'
...         '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 1:Y:0:CGATGT\n'
...         'ACGCTATTC\n'
...         '+\n'
...         'GHGGGIKII\n'
...         '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 2:Y:0:CGATGT\n'
...         'ACGCTATTC\n'
...         '+\n'
...         'GHGGGIKII\n'
...         )
>>> f.flush()
>>> try:
...     for tup in iterate_fastq(f.name, trim=5, check_pair=1):
...         print(tup)
... except ValueError as e:
...     print(f"ValueError: {e}")
... 
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984',
           seq='ATGCA',
           qs='GGGGG',
           fail=False)
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985',
           seq='ACGCT',
           qs='GHGGG',
           fail=True)
ValueError: header not for R1:
@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 2:Y:0:CGATGT
>>> for tup in iterate_fastq(f.name, qual_format='array'):
...     print(tup)
... 
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984',
           seq='ATGCAATTG',
           qs=array([38, 38, 38, 38, 38, 40, 40, 40, 40]),
           fail=False)
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985',
           seq='ACGCTATTC',
           qs=array([38, 39, 38, 38, 38, 40, 42, 40, 40]),
           fail=True)
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985',
           seq='ACGCTATTC',
           qs=array([38, 39, 38, 38, 38, 40, 42, 40, 40]),
           fail=True)
>>> f.close()

dms_variants.fastq.iterate_fastq_pair(r1filename, r2filename, *, r1trim=None, r2trim=None, qual_format='str')

Iterate over paired R1 and R2 FASTQ files.

Parameters:
  • r1filename (str) – R1 FASTQ file name, can be gzipped (extension .gz).

  • r2filename (str) – R2 FASTQ file name, can be gzipped (extension .gz).

  • r1trim (int or None) – If not None, trim R1 reads and Q scores to be no longer than this.

  • r2trim (int or None) – If not None, trim R2 reads and Q scores to be no longer than this.

  • qual_format ({'str', 'array'}) – Return the quality scores as a string of ASCII characters ('str') or as an array of integers ('array').

Yields:

namedtuple

The entries in the tuple are (in order):
  • id : read id

  • r1_seq : R1 read sequence

  • r2_seq : R2 read sequence

  • r1_qs : R1 Q scores (qual_format parameter determines format)

  • r2_qs : R2 Q scores (qual_format parameter determines format)

  • fail : did either read fail chastity filter? (None if no info)
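
To make the pairing logic concrete, here is a minimal sketch of how two streams of already-parsed reads could be combined into tuples with the fields listed above. It is an illustration under stated assumptions, not the library's implementation: `pair_reads` is a hypothetical helper, each input read is assumed to be a plain `(id, seq, qs, fail)` tuple, and the rule for combining the two fail flags (None only when neither read has filter info) is an assumption.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical single-read record: (id, seq, qs, fail)
Read = Tuple[str, str, str, Optional[bool]]
Pair = Tuple[str, str, str, str, str, Optional[bool]]


def pair_reads(r1_reads: Iterable[Read], r2_reads: Iterable[Read],
               r1trim: Optional[int] = None,
               r2trim: Optional[int] = None) -> Iterator[Pair]:
    """Yield (id, r1_seq, r2_seq, r1_qs, r2_qs, fail) for matched reads."""
    r2_iter = iter(r2_reads)
    for r1_id, r1_seq, r1_qs, r1_fail in r1_reads:
        r2 = next(r2_iter, None)
        if r2 is None:
            raise ValueError("R1 has more reads than R2")
        r2_id, r2_seq, r2_qs, r2_fail = r2
        if r1_id != r2_id:
            raise ValueError(f"read ids do not match: {r1_id} vs {r2_id}")
        # Assumed rule: None only if neither read carries filter info.
        if r1_fail is None and r2_fail is None:
            fail = None
        else:
            fail = bool(r1_fail or r2_fail)
        # Slicing with None leaves the read untrimmed.
        yield (r1_id, r1_seq[:r1trim], r2_seq[:r2trim],
               r1_qs[:r1trim], r2_qs[:r2trim], fail)
    if next(r2_iter, None) is not None:
        raise ValueError("R2 has more reads than R1")


r1 = [('read_1984', 'ATGCAATTG', 'GGGGGIIII', False),
      ('read_1985', 'ACGCTATTC', 'GHGGGIKII', True)]
r2 = [('read_1984', 'CAGCATA', 'AGGGGII', False),
      ('read_1985', 'CTGAATA', 'GHBGGIK', True)]
pairs = list(pair_reads(r1, r2, r1trim=8, r2trim=5))
for pair in pairs:
    print(pair)
```

Trimming is applied to the sequence and the Q scores with the same slice, so the two always stay the same length.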

Example

>>> import tempfile
>>> from dms_variants.fastq import iterate_fastq_pair
>>> f1 = tempfile.NamedTemporaryFile(mode='w')
>>> _ = f1.write(
...         '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984 1:N:0:CGATGT\n'
...         'ATGCAATTG\n'
...         '+\n'
...         'GGGGGIIII\n'
...         '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 1:Y:0:CGATGT\n'
...         'ACGCTATTC\n'
...         '+\n'
...         'GHGGGIKII\n'
...         )
>>> f1.flush()
>>> f2 = tempfile.NamedTemporaryFile(mode='w')
>>> _ = f2.write(
...         '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984 2:N:0:CGATGT\n'
...         'CAGCATA\n'
...         '+\n'
...         'AGGGGII\n'
...         '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 2:Y:0:CGATGT\n'
...         'CTGAATA\n'
...         '+\n'
...         'GHBGGIK\n'
...         )
>>> f2.flush()
>>> for tup in iterate_fastq_pair(f1.name, f2.name, r1trim=8, r2trim=5,
...                               qual_format='array'):
...     print(tup)
... 
FastqPairEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984',
               r1_seq='ATGCAATT',
               r2_seq='CAGCA',
               r1_qs=array([38, 38, 38, 38, 38, 40, 40, 40]),
               r2_qs=array([32, 38, 38, 38, 38]),
               fail=False)
FastqPairEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985',
               r1_seq='ACGCTATT',
               r2_seq='CTGAA',
               r1_qs=array([38, 39, 38, 38, 38, 40, 42, 40]),
               r2_qs=array([38, 39, 33, 38, 38]),
               fail=True)
>>> for tup in iterate_fastq_pair(f1.name, f2.name, r1trim=8, r2trim=5):
...     print(tup)
... 
FastqPairEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984',
               r1_seq='ATGCAATT',
               r2_seq='CAGCA',
               r1_qs='GGGGGIII',
               r2_qs='AGGGG',
               fail=False)
FastqPairEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985',
               r1_seq='ACGCTATT',
               r2_seq='CTGAA',
               r1_qs='GHGGGIKI',
               r2_qs='GHBGG',
               fail=True)
>>> f1.close()
>>> f2.close()

dms_variants.fastq.qual_str_to_array(q_str, *, offset=33)

Convert quality score string to array of integers.

Parameters:
  • q_str (str) – Quality score string.

  • offset (int) – Offset in ASCII encoding of Q-scores.

Returns:

Array of integer quality scores.

Return type:

numpy.ndarray

Example

>>> from dms_variants.fastq import qual_str_to_array
>>> qual_str_to_array('!I:0G')
array([ 0, 40, 25, 15, 38])
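
The conversion above is the standard Phred decoding: each character's ASCII code minus the offset gives the Q score. The pure-Python sketch below shows the same arithmetic without numpy. The helper names `phred_scores` and `error_probability` are hypothetical, the result is a plain list rather than a numpy.ndarray, and the check for negative scores is an addition of this sketch rather than documented library behavior.

```python
from typing import List


def phred_scores(q_str: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Q scores (ASCII code - offset)."""
    scores = [ord(char) - offset for char in q_str]
    if any(q < 0 for q in scores):
        # A negative score usually means the wrong offset was assumed.
        raise ValueError(f"character below offset {offset} in {q_str!r}")
    return scores


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


scores = phred_scores('!I:0G')
print(scores)
```

For example, 'I' has ASCII code 73, so with offset 33 it decodes to Q = 40, which corresponds to an error probability of about 1 in 10,000.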