sra

Functions for downloading / handling data from the Sequence Read Archive (SRA).

dms_tools2.sra.fastqFromSRA(samples, fastq_dump, fastqdir, aspera=None, overwrite=False, passonly=True, no_downloads=False, ncpus=1)

Download data from SRA and extract FASTQ files.

Currently only works for runs containing paired-end reads.

Args:
samples (pandas.DataFrame)

A dataframe that must have columns named run and name. The run column gives SRA run accessions (e.g., SRR5241726), and the name column gives the name for the run used in the final FASTQ files. Will be modified to include R1 and R2 columns.

fastq_dump (str)

Path to fastq-dump executable. Requires a version >= 2.8.

fastqdir (str)

Directory in which to place the FASTQ files. Created if it does not already exist.

aspera (None or 2-tuple)

If None, use fastq-dump for downloads, which is slower. Downloads are faster with aspera; to use it, specify the 2-tuple (ascp, asperakey), where ascp is the path to the ascp executable and asperakey is the key.

overwrite (bool)

Whether to overwrite an output file that already exists rather than use the existing one. If False and all output files already exist, nothing is done, and fastq_dump does not even need to be a valid path.

passonly (bool)

Keep only reads with a passing READ_FILTER value.

no_downloads (bool)

If True, do not actually download the files, but instead just add the columns to samples.

ncpus (int)

Use this many CPUs to parallelize downloads. The number is reduced if it exceeds the maximum available.
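
The capping of ncpus described above can be pictured as follows. This is a minimal sketch, not the dms_tools2 implementation: the helper name effective_ncpus is hypothetical, and it assumes that "max available" means the count reported by os.cpu_count().

```python
import os


def effective_ncpus(ncpus):
    """Cap a requested CPU count at the number the machine reports.

    Hypothetical helper for illustration; not part of dms_tools2.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    # os.cpu_count() can return None when the count cannot be determined;
    # fall back to a single CPU in that case.
    available = os.cpu_count() or 1
    return min(ncpus, available)


print(effective_ncpus(1))      # a request of 1 is never reduced
print(effective_ncpus(10**6))  # an oversized request is capped
```

Under this assumption, passing a large ncpus is harmless, since the request is simply reduced to what the machine reports.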

Result:

Upon completion, the directory fastqdir contains files of the form <name>_R1.fastq.gz and <name>_R2.fastq.gz for all names in samples. These file names are added to samples as the columns R1 and R2. Note that only the file names are added, not the directory names.
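
Because R1 and R2 hold file names only, full paths have to be rebuilt by joining them with fastqdir. The sketch below shows one way to do that, and to check whether every expected file is already present, which is the situation in which overwrite=False results in nothing being done. The helper names expected_fastqs and all_outputs_exist and the sample names are hypothetical and not part of dms_tools2; only the <name>_R1.fastq.gz / <name>_R2.fastq.gz pattern comes from the description above.

```python
import os
import tempfile


def expected_fastqs(fastqdir, name):
    """Return full paths of the R1 and R2 files expected for `name`."""
    r1 = os.path.join(fastqdir, f"{name}_R1.fastq.gz")
    r2 = os.path.join(fastqdir, f"{name}_R2.fastq.gz")
    return r1, r2


def all_outputs_exist(fastqdir, names):
    """Return True if both FASTQ files exist for every name."""
    return all(
        os.path.isfile(path)
        for name in names
        for path in expected_fastqs(fastqdir, name)
    )


# Demonstrate with a temporary directory and empty placeholder files.
with tempfile.TemporaryDirectory() as fastqdir:
    names = ["sampleA", "sampleB"]  # hypothetical sample names
    before = all_outputs_exist(fastqdir, names)
    for name in names:
        for path in expected_fastqs(fastqdir, name):
            open(path, "wb").close()  # create an empty stand-in file
    after = all_outputs_exist(fastqdir, names)

print(before, after)  # prints: False True
```

The same joining step would apply to values read from the R1 and R2 columns of samples, since those columns store the file names without the directory.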