Question: The Zheng et al. paper (linked here) shows that the input files for v1 chemistry data consist of I1 and I2 chunk files and RA files. How can I use these v1 data formats with the latest Cell Ranger versions?
Answer: v1 chemistry FASTQ files were generated using the old demultiplex pipeline format described here. The RA files are interleaved FASTQ files, in which the R1 and R2 reads alternate within a single file.
The 14bp GemCode barcode sequence is deposited in the I1 file reads.
The 8bp sample index is found in the I2 files.
The RA files contain both R1 and R2: the 98bp cDNA sequence (R1) and the 10bp UMI sequence (R2).
Solution (i): One solution is to use the BAM file output here and the bamtofastq tool from here to convert the BAM to FASTQ files. The reason for doing this is that many public archives, such as SRA, do not accept FASTQ files from our version 1 chemistry because the barcode is on an index read.
Download the data with curl, for example:
curl -O http://cf.10xgenomics.com/samples/cell-exp/1.1.0/frozen_pbmc_donor_a/frozen_pbmc_donor_a_possorted_genome_bam.bam
curl -O http://cf.10xgenomics.com/samples/cell-exp/1.1.0/frozen_pbmc_donor_a/frozen_pbmc_donor_a_possorted_genome_bam_index.bam.bai
Run bamtofastq:
bamtofastq --cr11 frozen_pbmc_donor_a_possorted_genome_bam.bam ./fastqs
Now we have a directory of FASTQ files that will look something like this:
tree fastqs/
fastqs/
└── gemgroup001
    ├── bamtofastq_S1_L000_I1_001.fastq.gz
    ├── bamtofastq_S1_L000_I1_002.fastq.gz
    ├── bamtofastq_S1_L000_R1_001.fastq.gz
    ├── bamtofastq_S1_L000_R1_002.fastq.gz
    ├── bamtofastq_S1_L000_R2_001.fastq.gz
    ├── bamtofastq_S1_L000_R2_002.fastq.gz
    ├── bamtofastq_S1_L000_R3_001.fastq.gz
    └── bamtofastq_S1_L000_R3_002.fastq.gz
Next, run cellranger count:
cellranger count --id=pbmc_v1data --fastqs=./fastqs/gemgroup001 --transcriptome=/references/refdata-cellranger-GRCh38-3.0.0
Solution (ii): Rename the files according to the recent chemistry guidelines to run the latest Cell Ranger pipeline. The current format of input files uses the following assignments:
R1 = transcript (98bp)
R2 = 14bp barcode (from the original I1 file)
R3 = 10bp UMI
I1 = sample index (from the original I2 file)
1) Use a custom script to split the interleaved RA file into separate R1 (cDNA) and R2 (UMI) files. The UMI file takes the R3 read type in step 3.
2) Follow the Illumina file naming conventions here, because the latest Cell Ranger pipelines expect the FASTQ files to be named according to the bcl2fastq convention. The cellranger mkfastq pipeline generates FASTQ files in the following file naming format:
[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz
If the file names do not match this naming format, Cell Ranger may report errors such as "Invalid prefix combination where no input FASTQs were found for the requested parameters".
3) Assign the following read types to the files, so that there are four files for each tile:
R1 = transcript (98bp)
R2 = barcode (14bp)
R3 = UMI (10bp)
I1 = sample index (8bp)
For example:
sample1_S1_L001_R1_001.fastq.gz
sample1_S1_L001_R2_001.fastq.gz
sample1_S1_L001_R3_001.fastq.gz
sample1_S1_L001_I1_001.fastq.gz
4) Run cellranger count as in Solution (i), making appropriate changes to the file paths.
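Steps 1 to 3 above can be sketched as a single shell function. This is only an illustration, not a supported 10x Genomics tool: the function name convert_v1_chunk and its arguments are hypothetical, the input files are assumed to be gzip-compressed, and the RA file is assumed to use standard 8-line interleaving (a 4-line R1 record followed by the 4-line R2 record of the same read pair). It handles one set of RA, I1, and I2 files at a time.

```shell
# Sketch: convert one v1 file set (RA, I1, I2) into bcl2fastq-style names.
# Assumption: RA is gzip-compressed and interleaved as 4 lines of R1 (cDNA)
# followed by 4 lines of R2 (UMI) for each read pair.
# Arguments (all hypothetical names): sample, lane (plain integer such as 1),
# RA file, old I1 file (barcode), old I2 file (sample index), output directory.
convert_v1_chunk() {
  local sample="$1" lane="$2" ra="$3" i1="$4" i2="$5" outdir="$6"
  local prefix
  prefix="${outdir}/${sample}_S1_L$(printf '%03d' "$lane")"
  mkdir -p "$outdir"

  # Step 1: de-interleave RA. Lines 1-4 of every 8 go to R1 (transcript),
  # lines 5-8 go to R3 (UMI).
  gzip -dc "$ra" | awk -v r1="${prefix}_R1_001.fastq" -v r3="${prefix}_R3_001.fastq" \
    '{ if ((NR - 1) % 8 < 4) { print > r1 } else { print > r3 } }'
  gzip -f "${prefix}_R1_001.fastq" "${prefix}_R3_001.fastq"

  # Steps 2-3: the old I1 (14bp barcode) becomes R2,
  # and the old I2 (8bp sample index) becomes I1.
  cp "$i1" "${prefix}_R2_001.fastq.gz"
  cp "$i2" "${prefix}_I1_001.fastq.gz"
}
```

A call such as convert_v1_chunk sample1 1 RA.fastq.gz I1.fastq.gz I2.fastq.gz ./fastqs (the three input names are placeholders) would write sample1_S1_L001_R1_001.fastq.gz, sample1_S1_L001_R2_001.fastq.gz, sample1_S1_L001_R3_001.fastq.gz, and sample1_S1_L001_I1_001.fastq.gz into ./fastqs.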
Please note that all of the above files (i.e., R1, R2, R3, and I1) need to be present in the directory path to run cellranger count successfully for v1 chemistry (a related article on this can be found here).
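Because a missing read type only surfaces once the pipeline starts, it may help to check the directory first. The following is a minimal sketch, assuming files follow the [Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz pattern described above. The function name check_v1_fastqs is hypothetical.

```shell
# Sketch: confirm that R1, R2, R3 and I1 files exist for a sample before
# running cellranger count. Function name and arguments are illustrative.
check_v1_fastqs() {
  local dir="$1" sample="$2" read_type missing=0
  for read_type in R1 R2 R3 I1; do
    # Match any lane and file number, e.g. sample1_S1_L001_R1_001.fastq.gz
    if ! ls "$dir"/"${sample}"_S1_L*_"${read_type}"_*.fastq.gz >/dev/null 2>&1; then
      echo "missing ${read_type} file for ${sample} in ${dir}" >&2
      missing=1
    fi
  done
  # Exit status 0 means all four read types were found.
  return "$missing"
}
```

For the bamtofastq output in Solution (i), the call would be check_v1_fastqs ./fastqs/gemgroup001 bamtofastq.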
Thus, if formatted properly, the single-cell v1 chemistry data can be analyzed with newer versions of Cell Ranger. One can also specify the chemistry on the Cell Ranger command line using --chemistry=SC3Pv1. This parameter is mandatory in Cell Ranger v4 and above. More information on running cellranger count can be found here.
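As a small illustration of passing the chemistry explicitly, the cellranger count call can be wrapped so that the flag is never forgotten. The helper name run_v1_count is hypothetical, and the options are only those already used in this article.

```shell
# Sketch: wrap cellranger count so the v1 chemistry is always stated.
# run_v1_count is an illustrative helper, not part of Cell Ranger.
# Arguments: run id, FASTQ directory, transcriptome reference directory.
run_v1_count() {
  local id="$1" fastqs="$2" transcriptome="$3"
  cellranger count \
    --id="$id" \
    --fastqs="$fastqs" \
    --transcriptome="$transcriptome" \
    --chemistry=SC3Pv1
}
```

With the paths from Solution (i), this would be called as run_v1_count pbmc_v1data ./fastqs/gemgroup001 /references/refdata-cellranger-GRCh38-3.0.0.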
Note: If the dataset has 5bp UMIs, please use Cell Ranger v3.1 or earlier, as the older versions did not have strict checks on the UMI length.
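To find out which UMI length a dataset has, one option is to tabulate the read lengths in the UMI (R3) file. The sketch below assumes a gzip-compressed FASTQ with 4-line records. The function name fastq_read_lengths is hypothetical.

```shell
# Sketch: print "length count" pairs for the reads in a gzipped FASTQ file,
# e.g. the R3 (UMI) file, to see whether the UMIs are 10bp or 5bp.
fastq_read_lengths() {
  # The sequence is line 2 of every 4-line FASTQ record.
  gzip -dc "$1" \
    | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (len in n) print len, n[len] }' \
    | sort -n
}
```

Running fastq_read_lengths sample1_S1_L001_R3_001.fastq.gz prints one line per observed read length, followed by the number of reads with that length.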
Products: Single Cell Gene Expression
Disclaimer: This workaround is provided for instructional purposes only. 10x Genomics does not support or guarantee the instructions above.