Question: How do I prepare Sequence Read Archive (SRA) data from NCBI for Cell Ranger?
Answer: One of the benefits of openly shared data in the sequencing age is the ability to reanalyze data sets generated by other researchers. In the United States, the primary source of these publicly available data sets is the Sequence Read Archive (SRA), maintained by NCBI. Using these data in Cell Ranger requires some pre-processing. Before downloading SRA data, identify the platform and the chemistry version used to generate them. The following procedure has been tested on Chromium v2 and v3 chemistry.
First, use the NCBI fastq-dump utility with the --split-files argument to retrieve the FASTQ files. The command may look like this:
fastq-dump --split-files SRR6334436
The output would be two FASTQ files:
SRR6334436_1.fastq
SRR6334436_2.fastq
Cell Ranger requires FASTQ file names to follow the bcl2fastq file naming convention:
[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz
Where Read Type is one of:
I1: Sample index read (optional)
R1: Read 1
R2: Read 2
incompatible: SRR6334436_1.fastq
compatible: SRR6334436_S1_L001_R1_001.fastq
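The naming rule can also be checked mechanically before running Cell Ranger. The following is a minimal sketch, not part of Cell Ranger itself: is_cellranger_name is a hypothetical helper, and the pattern assumes that the .gz suffix is optional, because the compatible example above is uncompressed.

```shell
# Hypothetical helper: succeeds (exit status 0) when a file name
# matches [Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq(.gz).
is_cellranger_name() {
    # Assumption: ".gz" is treated as optional here.
    printf '%s\n' "$1" | grep -Eq '^.+_S1_L00[0-9]_(I1|R1|R2)_001\.fastq(\.gz)?$'
}

# Classify the two example names from this article.
for f in SRR6334436_1.fastq SRR6334436_S1_L001_R1_001.fastq; do
    if is_cellranger_name "$f"; then
        echo "compatible: $f"
    else
        echo "incompatible: $f"
    fi
done
```

The check looks only at the file name. It does not confirm that the reads inside the file match the stated read type.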
Changing the file names in this way allows Cell Ranger (version 2.1.1 or later) to accept these data as input.
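The rename can be scripted when there are many runs to process. The sketch below rests on assumptions that should be verified for each data set: rename_sra_fastq is a hypothetical helper, the _1 file is taken to hold Read 1 and the _2 file Read 2, and all reads are assigned to lane 1. If a run was deposited with an index read, fastq-dump may produce more than two files and this mapping may not apply.

```shell
# Hypothetical helper: rename fastq-dump output for one SRA run
# to the bcl2fastq naming convention.
# Assumptions: _1 is Read 1, _2 is Read 2, and all reads are lane 1.
rename_sra_fastq() {
    acc="$1"
    for n in 1 2; do
        src="${acc}_${n}.fastq"
        dst="${acc}_S1_L001_R${n}_001.fastq"
        if [ -f "$src" ]; then
            mv -- "$src" "$dst"
        else
            # Leave a note on stderr rather than failing on a missing file.
            echo "skipping missing file: $src" >&2
        fi
    done
}

# Usage, from the directory that holds the downloaded files:
#   rename_sra_fastq SRR6334436
```

For the example run, this would turn SRR6334436_1.fastq and SRR6334436_2.fastq into SRR6334436_S1_L001_R1_001.fastq and SRR6334436_S1_L001_R2_001.fastq.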
For more information on FASTQ format requirements, please see Specifying FASTQ files.