Question: How do I prepare Sequence Read Archive (SRA) data from NCBI for Cell Ranger?
Answer: One benefit of publicly available sequencing data is the ability to reanalyze datasets generated by other researchers. The primary source of these public datasets in the United States is the Sequence Read Archive (SRA) maintained by NCBI. Using SRA data in Cell Ranger requires some pre-processing. Before downloading SRA data, first identify the platform and the chemistry version used to generate the data. The following procedure has been tested on Chromium v2 and v3 chemistry.
First, use the NCBI fastq-dump utility with the --split-files argument to retrieve the FASTQ files. The command may look like this:
fastq-dump --split-files --gzip SRR6334436
The output would be two FASTQ files:
SRR6334436_1.fastq.gz
SRR6334436_2.fastq.gz
The number of FASTQ files we retrieved is consistent with what is reported in the metadata of SRR6334436: This run has 2 reads per spot.
Based on the read length, we can infer that SRR6334436_1.fastq.gz is Read 1 and SRR6334436_2.fastq.gz is Read 2 (see Sequencing Recommendation for 3' Gene Expression Assays).
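One way to check the read lengths before assigning read types is to tabulate the sequence lengths in the first records of each file. The following is a minimal sketch, assuming gzip-compressed FASTQ files with exactly four lines per record; the function name read_length_summary and the default sample size of 1000 records are illustrative choices, not part of Cell Ranger or the SRA Toolkit.

```shell
# Hypothetical helper: tabulate sequence lengths in the first N records
# of a gzipped FASTQ file. Assumes four lines per record.
# $1: gzipped FASTQ file; $2: number of records to sample (default 1000)
read_length_summary() {
    gzip -dc "$1" \
        | head -n "$(( ${2:-1000} * 4 ))" \
        | awk 'NR % 4 == 2 { print length($0) }' \
        | sort -n | uniq -c \
        | awk '{ print $2 "\t" $1 }'   # columns: read length, record count
}

# Example usage after the fastq-dump step:
# read_length_summary SRR6334436_1.fastq.gz
# read_length_summary SRR6334436_2.fastq.gz
```

In 3' Gene Expression libraries, Read 1 carries the cell barcode and UMI and is typically the shorter read (for example, 26 bp for v2 or 28 bp for v3 chemistry), while Read 2 holds the transcript insert. Submitters sometimes sequence or trim reads differently, so the lengths should be checked against the run metadata rather than assumed.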
Sometimes, the authors may upload 3 or 4 reads for a sample by including index reads. For example, if we try to retrieve FASTQ files from SRR9291388:
fastq-dump --split-files --gzip SRR9291388
The output would be three FASTQ files:
SRR9291388_1.fastq.gz
SRR9291388_2.fastq.gz
SRR9291388_3.fastq.gz
The number of FASTQ files we retrieved is consistent with what is reported in the metadata of SRR9291388: This run has 3 reads per spot.
Based on the read length and the attributes listed in the metadata, we know that:
- SRR9291388_1.fastq.gz is Read 1
- SRR9291388_2.fastq.gz is Read 2
- SRR9291388_3.fastq.gz is Index 1
Cell Ranger requires FASTQ file names to follow the bcl2fastq file naming convention:
[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz
Where Read Type is one of:
- I1: Sample index read (optional)
- I2: Sample index read (optional)
- R1: Read 1 (required)
- R2: Read 2 (required)
NOTE: Cell Ranger v4.0 and later also accepts file names that omit the lane component (_L00[Lane Number]).
Therefore, for Cell Ranger:
- incompatible file name: SRR9291388_1.fastq.gz
- compatible file name: SRR9291388_S1_L001_R1_001.fastq.gz
- compatible file name: SRR9291388_S1_R1_001.fastq.gz (for Cell Ranger v4.0 and later)
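The naming pattern can also be written as a regular expression to check file names before running Cell Ranger. This is a rough sketch, not a check that Cell Ranger itself performs: the function name is_cellranger_fastq_name is hypothetical, the lane component is treated as optional (which only holds for Cell Ranger v4.0 and later), and any sample number after _S is accepted, as in bcl2fastq output.

```shell
# Hypothetical helper: return success if a file name matches the
# bcl2fastq-style pattern described above.
# Assumption: the _L00N lane component is optional (Cell Ranger v4.0+).
is_cellranger_fastq_name() {
    printf '%s\n' "$(basename "$1")" \
        | grep -Eq '^.+_S[0-9]+_(L[0-9]{3}_)?(R1|R2|I1|I2)_001\.fastq\.gz$'
}

# Example usage:
# is_cellranger_fastq_name SRR9291388_1.fastq.gz || echo "needs renaming"
```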
Changing the file names will allow Cell Ranger (version >=2.1.1) to accept these data as input. Note that only R1 and R2 FASTQ files are required for Cell Ranger. I1 and/or I2 FASTQ files are optional.
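The renaming itself can be scripted. The sketch below is one possible approach, assuming the files are in the current directory, a single sample (S1), and a single lane (L001); the function name rename_sra_fastq is hypothetical, and the mapping from fastq-dump read number to read type must be supplied by the user based on the read lengths and run metadata.

```shell
# Hypothetical helper: rename one fastq-dump output file to a
# bcl2fastq-style name. Assumes sample number S1 and lane L001.
# $1: SRA run accession, $2: fastq-dump read number, $3: R1, R2, I1, or I2
rename_sra_fastq() {
    case "$3" in
        R1|R2|I1|I2) ;;
        *) echo "unknown read type: $3" >&2; return 1 ;;
    esac
    src="${1}_${2}.fastq.gz"
    dest="${1}_S1_L001_${3}_001.fastq.gz"
    if [ ! -f "$src" ]; then
        echo "missing input file: $src" >&2
        return 1
    fi
    mv -- "$src" "$dest"
}

# Example for SRR9291388, using the read assignment inferred earlier:
# rename_sra_fastq SRR9291388 1 R1
# rename_sra_fastq SRR9291388 2 R2
# rename_sra_fastq SRR9291388 3 I1
```

The read-number-to-read-type mapping differs between submissions, so it should be confirmed for each run rather than reused blindly. After renaming, the part of the file name before _S1 (here the accession) serves as the sample name.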
For more information on FASTQ format requirements, please see Specifying FASTQ files.
Related article: Why do I not see any fastqs (or see incomplete fastqs) for the SRA of my interest?