Question: The Zheng et al. paper from here shows that the input files for v1 chemistry data are I1 and I2 chunk files and RA files. How can this v1 data format be used with the latest cellranger versions?
Answer: V1 chemistry FASTQ files were generated using the old demultiplex pipeline format described here. The RA files are interleaved FASTQ files in which the R1 and R2 records alternate.
The 14bp GemCode barcode sequences are stored in the I1 file reads.
The 8bp sample index is found in the I2 files.
The RA reads consist of both R1 and R2: a 98bp cDNA sequence and a 10bp UMI sequence. So both read 1 and read 2 are in the RA file.
Solution (i): Use the BAM file output here and the bamtofastq tool from here to convert the BAM to FASTQ files. The reason for doing this is that many public archives, such as SRA, do not accept FASTQ files from our version 1 chemistry, because the barcode is on an index read.
Download the data with curl, e.g.:
curl -O http://cf.10xgenomics.com/samples/cell-exp/1.1.0/frozen_pbmc_donor_a/frozen_pbmc_donor_a_possorted_genome_bam.bam
curl -O http://cf.10xgenomics.com/samples/cell-exp/1.1.0/frozen_pbmc_donor_a/frozen_pbmc_donor_a_possorted_genome_bam_index.bam.bai
Run bamtofastq:
bamtofastq --cr11 frozen_pbmc_donor_a_possorted_genome_bam.bam ./fastqs
Now we have a directory of FASTQ files that will look something like this:
tree fastqs/
fastqs/
└── gemgroup001
    ├── bamtofastq_S1_L000_I1_001.fastq.gz
    ├── bamtofastq_S1_L000_I1_002.fastq.gz
    ├── bamtofastq_S1_L000_R1_001.fastq.gz
    ├── bamtofastq_S1_L000_R1_002.fastq.gz
    ├── bamtofastq_S1_L000_R2_001.fastq.gz
    ├── bamtofastq_S1_L000_R2_002.fastq.gz
    ├── bamtofastq_S1_L000_R3_001.fastq.gz
    └── bamtofastq_S1_L000_R3_002.fastq.gz
Next, run cellranger count:
cellranger count --id=pbmc_v1data --fastqs=./fastqs/gemgroup001 --transcriptome=/references/refdata-cellranger-GRCh38-3.0.0
Solution (ii): Rename the files according to the current chemistry guidelines so that they can be run through the latest cellranger pipeline. The expected input files have the following assignments:
R1 = transcript (98bp)
R2 = 14bp barcode (from I1)
R3 = 10bp UMI
I1 = sample index
1) Use a custom script to split the interleaved RA file into separate read 1 (cDNA) and read 2 (UMI) files.
2) Follow the Illumina file naming conventions here, because the latest Cell Ranger pipelines expect the FASTQ files to be named according to the bcl2fastq convention. The cellranger mkfastq pipeline generates FASTQ files with the following naming format:
[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz
If the file names do not match this naming format, errors may occur, such as "Invalid prefix combination where no input FASTQs were found for the requested parameters".
3) Assign the following read types to the files:
R1 = transcript (98bp)
R2 = barcode (14bp)
R3 = UMI (10bp)
I1 = sample index (8bp)
4) Run cellranger count as in Solution (i), adjusting the paths as needed. Please note that all of the above files (i.e. R1, R2, R3 and I1) need to be present in the directory to run cellranger count successfully for v1 chemistry (a related article on this can be found here).
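The RA split in step 1 could be sketched in Python roughly as follows. This is only an illustration, not a 10x Genomics tool: it assumes the RA file is gzip-compressed and that its records strictly alternate, with the 98bp cDNA record first and the 10bp UMI record second. The function names and all file paths are hypothetical.

```python
import gzip
from itertools import islice


def read_fastq_records(handle):
    """Yield FASTQ records as lists of four lines (header, sequence, '+', quality)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_interleaved(ra_path, first_out, second_out):
    """Write alternating records of ra_path to first_out and second_out.

    Assumption: the first record of each pair is the cDNA read and the
    second is the UMI read. Returns the number of pairs written.
    """
    pairs = 0
    with gzip.open(ra_path, "rt") as ra, \
            gzip.open(first_out, "wt") as out1, \
            gzip.open(second_out, "wt") as out2:
        records = read_fastq_records(ra)
        for first in records:
            second = next(records, None)
            if second is None:
                raise ValueError("odd number of records; file is not cleanly interleaved")
            out1.writelines(first)
            out2.writelines(second)
            pairs += 1
    return pairs
```

With the read type assignments from step 3, a call might look like split_interleaved("sample_RA.fastq.gz", "sample_S1_L001_R1_001.fastq.gz", "sample_S1_L001_R3_001.fastq.gz"), where the input and output names are placeholders. It is worth checking a few output records for the expected read lengths before running cellranger count.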
Thus, if formatted properly, single cell v1 chemistry data can be analyzed with newer versions of Cell Ranger. The chemistry can also be specified on the cellranger command line using --chemistry=SC3Pv1. More information on running cellranger count can be found here.
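Before running cellranger count, the file names in the FASTQ directory can be checked against the naming format and the four required read types described above. The following is a minimal sketch of such a check. The regular expression is an assumption based on the naming pattern and the bamtofastq listing shown earlier, not Cell Ranger's own validation logic, and the function name is hypothetical.

```python
import re

# Assumed pattern: [Sample Name]_S<n>_L<3-digit lane>_<read type>_<3-digit chunk>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI]\d)_(?P<chunk>\d{3})\.fastq\.gz$"
)

# Read types that must all be present for v1 chemistry, per the list above.
V1_READ_TYPES = {"R1", "R2", "R3", "I1"}


def missing_v1_read_types(file_names):
    """Return the sorted v1 read types that have no correctly named FASTQ file."""
    found = set()
    for name in file_names:
        match = FASTQ_NAME.match(name)
        if match:
            found.add(match.group("read"))
    return sorted(V1_READ_TYPES - found)
```

For example, passing os.listdir("./fastqs/gemgroup001") to missing_v1_read_types should return an empty list when R1, R2, R3 and I1 files are all present and correctly named. Files whose names do not match the pattern are ignored, so a non-empty result can indicate either a missing file or a misnamed one.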