Question: How do I prepare Sequence Read Archive (SRA) data from NCBI for Cell Ranger?
Answer: One benefit of publicly available sequencing data is the ability to reanalyze data generated by other researchers. The primary source of these datasets in the United States is the Sequence Read Archive (SRA) maintained by NCBI. Using these data in Cell Ranger requires some pre-processing. Before downloading SRA data, identify the platform and the chemistry version used to generate the data. The following procedure has been tested on Chromium v2 and v3 chemistry.
First, use the NCBI fastq-dump utility with the --split-files argument to retrieve the FASTQ files. The command may look like this:
fastq-dump --split-files --gzip SRR6334436
The output would be two FASTQ files, SRR6334436_1.fastq.gz and SRR6334436_2.fastq.gz.
The number of FASTQ files we retrieved is consistent with what is reported in the metadata of SRR6334436: This run has 2 reads per spot.
Based on the read length, we can infer that SRR6334436_1.fastq.gz is Read 1 and SRR6334436_2.fastq.gz is Read 2 (see Sequencing Recommendation for 3' Gene Expression Assays).
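One way to check read lengths before assigning read types is to look at the first record of each file. The sketch below is illustrative only: the function name first_read_length is hypothetical, and it assumes gzip-compressed FASTQ files with standard four-line records and a first read that is representative of the file. For Chromium 3' chemistry, Read 1 (cell barcode plus UMI) is typically 26 bases for v2 and 28 bases for v3, while Read 2 is the longer cDNA read; confirm against the sequencing recommendations for the assay in question.

```shell
# Hypothetical helper: print the length of the first sequence in a gzipped FASTQ file.
# Assumes standard 4-line FASTQ records, so the sequence sits on line 2.
first_read_length() {
    gzip -cd -- "$1" | awk 'NR == 2 { print length($0); exit }'
}

# Report the first read length for each file from the run, skipping missing files.
for f in SRR6334436_1.fastq.gz SRR6334436_2.fastq.gz; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(first_read_length "$f")"
    fi
done
```

The shorter of the two lengths should belong to Read 1. If both files report the same length, the run metadata is the better guide.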
Sometimes, the authors may upload 3 or 4 reads for a sample by including index reads. For example, if we try to retrieve FASTQ files from SRR9291388:
fastq-dump --split-files --gzip SRR9291388
The output would be three FASTQ files: SRR9291388_1.fastq.gz, SRR9291388_2.fastq.gz, and SRR9291388_3.fastq.gz.
The number of FASTQ files we retrieved is consistent with what is reported in the metadata of SRR9291388: This run has 3 reads per spot.
Based on the read length and the attributes listed in the metadata, we know that:
SRR9291388_1.fastq.gz is Read 1
SRR9291388_2.fastq.gz is Read 2
SRR9291388_3.fastq.gz is Index 1
Cell Ranger requires FASTQ file names to follow the bcl2fastq file naming convention:
[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz
Read Type is one of:
I1: Sample index read (optional)
I2: Sample index read (optional)
R1: Read 1 (required)
R2: Read 2 (required)
Therefore, for Cell Ranger:
- incompatible file name: SRR9291388_1.fastq.gz
- compatible file name: SRR9291388_S1_L001_R1_001.fastq.gz
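A quick way to tell the two cases apart is to test each name against a pattern. The following is a minimal sketch, not Cell Ranger's own validation: the function name is_bcl2fastq_name is hypothetical, and the regular expression is an assumption derived from the convention described here (sample name, sample number, three-digit lane, read type, and the 001 suffix).

```shell
# Hypothetical helper: succeed if a file name matches the bcl2fastq-style pattern.
# The pattern is an assumption drawn from the naming convention, not from Cell Ranger itself.
is_bcl2fastq_name() {
    printf '%s\n' "$1" |
        grep -Eq '^.+_S[0-9]+_L[0-9]{3}_(R1|R2|I1|I2)_001\.fastq\.gz$'
}

# Classify the two example names from this article.
for name in SRR9291388_1.fastq.gz SRR9291388_S1_L001_R1_001.fastq.gz; do
    if is_bcl2fastq_name "$name"; then
        echo "compatible:   $name"
    else
        echo "incompatible: $name"
    fi
done
```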
Changing the file names will allow Cell Ranger (version 2.1.1 or later) to accept these data as input. Note that only the R1 and R2 FASTQ files are required for Cell Ranger; the I1 and I2 FASTQ files are optional.
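The renaming can be scripted. The sketch below is one possible approach rather than an official tool: the function names cellranger_name and rename_sra_run are hypothetical, and it assumes a single sample (S1), a single lane (L001), and that you have already confirmed which fastq-dump suffix (_1, _2, _3) corresponds to which read type.

```shell
# Hypothetical helper: build a bcl2fastq-style file name for one run accession.
# Assumes a single sample (S1) and a single lane (L001).
cellranger_name() {
    local run="$1" read_type="$2"
    printf '%s_S1_L001_%s_001.fastq.gz\n' "$run" "$read_type"
}

# Hypothetical helper: rename RUN_1.fastq.gz, RUN_2.fastq.gz, ... in the current
# directory, using the read types given in order (for example: R1 R2 I1).
rename_sra_run() {
    local run="$1"
    shift
    local i=1 read_type src
    for read_type in "$@"; do
        src="${run}_${i}.fastq.gz"
        if [ -f "$src" ]; then
            mv -- "$src" "$(cellranger_name "$run" "$read_type")"
        else
            echo "skipping missing file: $src" >&2
        fi
        i=$((i + 1))
    done
}

# Example for the 3-read run discussed above: _1 is R1, _2 is R2, _3 is I1.
rename_sra_run SRR9291388 R1 R2 I1
```

For a 2-read run such as SRR6334436, the call would pass only R1 and R2. The order of read types is the part to verify per dataset, since submitters do not always upload reads in the same order.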
For more information on FASTQ format requirements, please see Specifying FASTQ files.
Related article: Why do I not see any fastqs (or see incomplete fastqs) for the SRA of my interest?