Question: I am analyzing my Single Cell Gene Expression data with Cell Ranger (or Multiome data with Cell Ranger ARC). It produces the following error message:
ERROR: We detected a mixture of different R1 lengths ([{min}-{max}]), which breaks assumptions in how UMIs are tabulated and corrected.
Can you tell me how to resolve this issue and why it is required that we set the R1 length to the shortest observed R1 length?
Answer: In most cases, demultiplexing using the mkfastq pipeline will not lead to variable read lengths. However, the following scenarios may produce unequal read lengths:
- If you used Illumina's bcl2fastq software tool directly to generate the FASTQ files and enabled the adapter trimming option.
- If you trimmed the reads prior to running the analysis.
- If you sequenced the same library on separate flow cells with different run parameters.
If the cause of the error is scenario (1) above, we recommend rerunning the demultiplexing without the adapter trimming option. If it is (2), you should run the analysis with the original FASTQ files without any preprocessing. If it is (3), you can trim the R1 reads to the shortest observed R1 length, N (N must be at least 26 bp), by providing N to the --r1-length argument if you are running the count pipeline, or via the r1-length parameter in the multi config CSV if you are running the multi pipeline.
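To find the value of N for scenario (3), you can tabulate the R1 lengths in your FASTQ files yourself. The following is a minimal sketch, not part of Cell Ranger: the function names and the file names in the usage note are hypothetical, and it assumes standard 4-line FASTQ records, either plain text or gzip-compressed.

```python
import gzip
from collections import Counter
from pathlib import Path


def r1_length_counts(fastq_paths):
    """Count read lengths across one or more R1 FASTQ files (plain or gzipped)."""
    counts = Counter()
    for path in fastq_paths:
        path = Path(path)
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt") as handle:
            for line_number, line in enumerate(handle):
                # In a 4-line FASTQ record, the sequence is the second line.
                if line_number % 4 == 1:
                    counts[len(line.rstrip("\r\n"))] += 1
    return counts


def shortest_r1_length(fastq_paths, minimum=26):
    """Return the shortest observed R1 length, or raise if it is below `minimum`."""
    counts = r1_length_counts(fastq_paths)
    if not counts:
        raise ValueError("no reads found")
    shortest = min(counts)
    if shortest < minimum:
        raise ValueError(f"shortest R1 length {shortest} is below {minimum} bp")
    return shortest
```

For example, calling `shortest_r1_length(["run1_R1.fastq.gz", "run2_R1.fastq.gz"])` on two hypothetical files would return the value to pass as N to --r1-length. Scanning every read can be slow on large files, so sampling the first few million lines of each file is a reasonable shortcut if the run parameters were constant within a flow cell.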
Below are some details on why we require setting R1 to the shortest observed R1 length:
Taking the 3'v3.1 Gene Expression assay as an example, a total R1 length of 28 bp is recommended to capture both the 16 bp 10x barcode and the 12 bp UMI. Shown below is the structure of the R1 and R2 reads for the final library. The 16 bp 10x barcode is shown in green and the 12 bp UMI is shown in red.
Starting from Cell Ranger 5, we added a check of the read length across all R1 reads. The reason for this check is to protect against double-counting variable-length UMIs as different UMIs. For example, with 3'v3.1 gene expression data, if you have R1 reads with both 26 bp and 28 bp lengths, some reads will have UMIs that are 10 bp long and some will have the expected 12 bp UMIs. Suppose there are two reads, Read A and Read B, from the same molecule:
- Read A has an R1 that is 26 bp long, including a 10 bp UMI, for example, AACCGGTTAA
- Read B has an R1 that is 28 bp long, including a 12 bp UMI, for example, AACCGGTTAACC
Because the UMIs differ as a result of the different R1 lengths, Cell Ranger will treat them as separate UMIs, which can lead to double-counting. To prevent double-counting in such cases, we require that all UMIs be the same length.
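The Read A and Read B example can be reproduced with a short sketch. This is an illustration only, not Cell Ranger's actual counting code: it uses exact matching on (barcode, UMI) pairs, does not model UMI correction, and the 16 bp barcode sequence and function names are made up for the example.

```python
BARCODE_LENGTH = 16  # 10x barcode length for 3'v3.1, as described above


def split_r1(r1_sequence, r1_length=None):
    """Split an R1 read into (barcode, UMI), optionally trimming R1 first."""
    if r1_length is not None:
        r1_sequence = r1_sequence[:r1_length]
    return r1_sequence[:BARCODE_LENGTH], r1_sequence[BARCODE_LENGTH:]


def count_distinct_umis(r1_reads, r1_length=None):
    """Count distinct (barcode, UMI) pairs using exact matching only."""
    return len({split_r1(read, r1_length) for read in r1_reads})


barcode = "ACGTACGTACGTACGT"       # hypothetical 16 bp barcode
read_a = barcode + "AACCGGTTAA"    # 26 bp R1, 10 bp UMI
read_b = barcode + "AACCGGTTAACC"  # 28 bp R1, 12 bp UMI

untrimmed = count_distinct_umis([read_a, read_b])
trimmed = count_distinct_umis([read_a, read_b], r1_length=26)
print(untrimmed, trimmed)  # prints "2 1"
```

Without trimming, the two reads from one molecule are counted as two UMIs. Trimming both R1 reads to 26 bp gives them the same 10 bp UMI, so the molecule is counted once.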
Sometimes, you may end up with R1 reads from 3' Gene Expression data that are only 26 bp long. Unless the complexity of the library is very high, using a 10 bp UMI instead of a 12 bp UMI should have minimal impact. The fraction of reads incorrectly flagged as chimeras due to UMI collisions will be slightly higher. However, the overall results should be fine.
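To get a rough sense of why the impact is small, you can compare the number of possible UMI sequences at each length. The sketch below is a back-of-the-envelope estimate, not a description of how Cell Ranger detects collisions: it assumes UMIs are drawn uniformly at random from A, C, G, and T, which real libraries only approximate, and the function names and molecule count are hypothetical.

```python
def umi_space(umi_length):
    """Number of possible UMI sequences of a given length over A, C, G, T."""
    return 4 ** umi_length


def expected_colliding_pairs(n_molecules, umi_length):
    """Expected number of molecule pairs sharing a UMI, assuming uniform random UMIs."""
    pairs = n_molecules * (n_molecules - 1) / 2
    return pairs / umi_space(umi_length)


# A 10 bp UMI has 4**10 = 1,048,576 possible sequences;
# a 12 bp UMI has 4**12 = 16,777,216, which is 16 times as many.
ratio = umi_space(12) // umi_space(10)
```

Under this simplified model, the expected number of colliding pairs among a fixed number of molecules sharing a barcode is 16 times higher with a 10 bp UMI than with a 12 bp UMI, but it stays small in absolute terms unless the number of molecules is very large.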