fastq
Tools for processing FASTQ files.
- dms_variants.fastq.iterate_fastq(filename, *, trim=None, check_pair=None, qual_format='str')
Iterate over a FASTQ file.
- Parameters:
filename (str) – FASTQ file name, can be gzipped (extension .gz).
trim (int or None) – If not None, trim reads and Q scores to be no longer than this.
check_pair ({1, 2, None}) – If not None, check that each read is read 1 or read 2 (as specified) when the header gives this information. Assumes Casava 1.8 or SRA header format.
qual_format ({'str', 'array'}) – Return the quality scores as a string of ASCII codes ('str') or as an array of numbers ('array').
- Yields:
namedtuple –
- The entries in the tuple are (in order):
id : read id
seq : read sequence
qs : Q scores (qual_format parameter determines format)
fail : did read fail chastity filter? (None if no filter info)
Example
>>> import tempfile
>>> from dms_variants.fastq import iterate_fastq
>>> f = tempfile.NamedTemporaryFile(mode='w')
>>> _ = f.write(
...     '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984 1:N:0:CGATGT\n'
...     'ATGCAATTG\n'
...     '+\n'
...     'GGGGGIIII\n'
...     '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 1:Y:0:CGATGT\n'
...     'ACGCTATTC\n'
...     '+\n'
...     'GHGGGIKII\n'
...     '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 2:Y:0:CGATGT\n'
...     'ACGCTATTC\n'
...     '+\n'
...     'GHGGGIKII\n'
...     )
>>> f.flush()

>>> try:
...     for tup in iterate_fastq(f.name, trim=5, check_pair=1):
...         print(tup)
... except ValueError as e:
...     print(f"ValueError: {e}")
...
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984', seq='ATGCA', qs='GGGGG', fail=False)
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985', seq='ACGCT', qs='GHGGG', fail=True)
ValueError: header not for R1: @DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 2:Y:0:CGATGT

>>> for tup in iterate_fastq(f.name, qual_format='array'):
...     print(tup)
...
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984', seq='ATGCAATTG', qs=array([38, 38, 38, 38, 38, 40, 40, 40, 40]), fail=False)
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985', seq='ACGCTATTC', qs=array([38, 39, 38, 38, 38, 40, 42, 40, 40]), fail=True)
FastqEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985', seq='ACGCTATTC', qs=array([38, 39, 38, 38, 38, 40, 42, 40, 40]), fail=True)

>>> f.close()
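
The check_pair and fail values above both come from the second field of a Casava 1.8 header (read number, then Y/N filter flag). The following is a minimal standalone sketch of that parsing, not the dms_variants implementation: the helper name parse_casava_header and its regular expression are assumptions made for illustration, and SRA-style headers are not handled.

```python
import re

# Hypothetical helper, NOT part of dms_variants: illustrates how a
# Casava 1.8 header encodes the read number and chastity-filter flag.
# Layout assumed: "@<read id> <read>:<filtered Y/N>:<control>:<index>"
_CASAVA_HEADER = re.compile(
    r"^@(?P<id>\S+) (?P<read>[12]):(?P<filtered>[YN]):\d+:\S*$"
)


def parse_casava_header(header):
    """Return (read_id, read_number, fail) for a Casava 1.8 header line."""
    match = _CASAVA_HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"not a Casava 1.8 header: {header}")
    # 'Y' means the read was filtered, i.e. it failed the chastity filter.
    return match["id"], int(match["read"]), match["filtered"] == "Y"


print(parse_casava_header("@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 2:Y:0:CGATGT"))
# ('DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985', 2, True)
```

A caller that expects read 1 could compare the returned read number against 1 and raise an error on a mismatch, which is the kind of check the ValueError in the example reports.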
- dms_variants.fastq.iterate_fastq_pair(r1filename, r2filename, *, r1trim=None, r2trim=None, qual_format='str')
Iterate over paired R1 and R2 FASTQ files.
- Parameters:
r1filename (str) – R1 FASTQ file name, can be gzipped (extension .gz).
r2filename (str) – R2 FASTQ file name, can be gzipped (extension .gz).
r1trim (int or None) – If not None, trim R1 reads and Q scores to be no longer than this.
r2trim (int or None) – If not None, trim R2 reads and Q scores to be no longer than this.
qual_format ({'str', 'array'}) – Return the quality scores as a string of ASCII codes ('str') or as an array of numbers ('array').
- Yields:
namedtuple –
- The entries in the tuple are (in order):
id : read id
r1_seq : R1 read sequence
r2_seq : R2 read sequence
r1_qs : R1 Q scores (qual_format parameter determines format)
r2_qs : R2 Q scores (qual_format parameter determines format)
fail : did either read fail chastity filter? (None if no info)
Example
>>> import tempfile
>>> from dms_variants.fastq import iterate_fastq_pair
>>> f1 = tempfile.NamedTemporaryFile(mode='w')
>>> _ = f1.write(
...     '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984 1:N:0:CGATGT\n'
...     'ATGCAATTG\n'
...     '+\n'
...     'GGGGGIIII\n'
...     '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 1:Y:0:CGATGT\n'
...     'ACGCTATTC\n'
...     '+\n'
...     'GHGGGIKII\n'
...     )
>>> f1.flush()
>>> f2 = tempfile.NamedTemporaryFile(mode='w')
>>> _ = f2.write(
...     '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984 2:N:0:CGATGT\n'
...     'CAGCATA\n'
...     '+\n'
...     'AGGGGII\n'
...     '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985 2:Y:0:CGATGT\n'
...     'CTGAATA\n'
...     '+\n'
...     'GHBGGIK\n'
...     )
>>> f2.flush()

>>> for tup in iterate_fastq_pair(f1.name, f2.name, r1trim=8, r2trim=5,
...                               qual_format='array'):
...     print(tup)
...
FastqPairEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984', r1_seq='ATGCAATT', r2_seq='CAGCA', r1_qs=array([38, 38, 38, 38, 38, 40, 40, 40]), r2_qs=array([32, 38, 38, 38, 38]), fail=False)
FastqPairEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985', r1_seq='ACGCTATT', r2_seq='CTGAA', r1_qs=array([38, 39, 38, 38, 38, 40, 42, 40]), r2_qs=array([38, 39, 33, 38, 38]), fail=True)

>>> for tup in iterate_fastq_pair(f1.name, f2.name, r1trim=8, r2trim=5):
...     print(tup)
...
FastqPairEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984', r1_seq='ATGCAATT', r2_seq='CAGCA', r1_qs='GGGGGIII', r2_qs='AGGGG', fail=False)
FastqPairEntry(id='DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1985', r1_seq='ACGCTATT', r2_seq='CTGAA', r1_qs='GHGGGIKI', r2_qs='GHBGG', fail=True)

>>> f1.close()
>>> f2.close()
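
To make the pairing and trimming behaviour concrete, here is a rough standalone sketch of walking two FASTQ streams in lockstep. It is not the dms_variants code: read_records and iterate_pairs are hypothetical names, the records are assumed to span exactly four lines each, and raising ValueError on mismatched ids or unequal read counts is an assumption rather than documented behaviour.

```python
import io
from itertools import zip_longest


def read_records(handle):
    """Yield (header, seq, qual) from a FASTQ handle with 4-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def iterate_pairs(r1_handle, r2_handle, r1trim=None, r2trim=None):
    """Hypothetical sketch: yield (id, r1_seq, r2_seq, r1_qs, r2_qs)."""
    records = zip_longest(read_records(r1_handle), read_records(r2_handle))
    for rec1, rec2 in records:
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 have different numbers of reads")
        # The read id is the header text before the first space, minus '@'.
        id1 = rec1[0][1:].split()[0]
        id2 = rec2[0][1:].split()[0]
        if id1 != id2:
            raise ValueError(f"read ids differ: {id1} vs {id2}")
        # Slicing with None keeps the full string, so None means "no trim".
        yield (id1, rec1[1][:r1trim], rec2[1][:r2trim],
               rec1[2][:r1trim], rec2[2][:r2trim])


r1 = io.StringIO("@read1 1:N:0:CGATGT\nATGCAATTG\n+\nGGGGGIIII\n")
r2 = io.StringIO("@read1 2:N:0:CGATGT\nCAGCATA\n+\nAGGGGII\n")
pairs = list(iterate_pairs(r1, r2, r1trim=8, r2trim=5))
print(pairs)
# [('read1', 'ATGCAATT', 'CAGCA', 'GGGGGIII', 'AGGGG')]
```

Trimming keeps the first r1trim or r2trim characters of both the sequence and its quality string, so the two always stay the same length.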
- dms_variants.fastq.qual_str_to_array(q_str, *, offset=33)
Convert quality score string to array of integers.
- Parameters:
q_str (str) – Quality score string.
offset (int) – Offset in ASCII encoding of Q-scores.
- Returns:
Array of integer quality scores.
- Return type:
numpy.ndarray
Example
>>> from dms_variants.fastq import qual_str_to_array
>>> qual_str_to_array('!I:0G')
array([ 0, 40, 25, 15, 38])
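
The conversion amounts to subtracting the offset from each character's ASCII code. The pure-Python sketch below shows that arithmetic without NumPy. The function names are hypothetical and this is not the library's implementation. The second helper applies the standard Phred relation P = 10^(-Q/10) to turn scores into error probabilities.

```python
def qual_str_to_list(q_str, offset=33):
    """Hypothetical pure-Python analogue: ASCII code minus offset per character."""
    return [ord(char) - offset for char in q_str]


def error_probabilities(q_scores):
    """Convert Phred Q scores to error probabilities via P = 10 ** (-Q / 10)."""
    return [10 ** (-q / 10) for q in q_scores]


scores = qual_str_to_list("!I:0G")
print(scores)
# [0, 40, 25, 15, 38]

probs = error_probabilities(scores)
```

With the default offset of 33, '!' (ASCII 33) maps to Q0 and 'I' (ASCII 73) maps to Q40, matching the first two values in the example above.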