Question: I am analyzing my Single Cell Gene Expression data with Cell Ranger (or Multiome data with Cell Ranger ARC). It produces the following error message:
ERROR: We detected a mixture of different R1 lengths ([{min}-{max}]), which breaks assumptions in how UMIs are tabulated and corrected.
Can you tell me how to resolve this issue and why it is required that we set the R1 length to the shortest observed R1 length?
Answer: In most cases, demultiplexing with the mkfastq pipeline does not produce variable read lengths. However, the following scenarios can produce unequal read lengths:
- If you use Illumina's bcl2fastq or bcl-convert software tools directly to generate the FASTQ files and use the adapter trimming option.
- If you trimmed the reads prior to running analysis.
- If you sequenced the same library on separate flow cells with different run parameters.
If the cause of the error is reason (1) above, we recommend rerunning the demultiplexing without the adapter trimming option. If it is (2), run the analysis with the original FASTQ files without any preprocessing. If it is (3), you can trim the R1 reads to N bases, where N is the shortest observed R1 length, by providing N to the --r1-length argument if you are running the count pipeline, or via the r1-length parameter in the multi config CSV if you are running the multi pipeline. N must be equal to or longer than the minimum R1 length for the chemistry (e.g. 26 bp for 5'v2).
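To choose N, it helps to tabulate the R1 lengths actually present in your FASTQ files. The following is a minimal Python sketch, not part of Cell Ranger; the function names read_lengths and r1_length_summary are hypothetical, and it assumes standard four-line FASTQ records.

```python
import gzip
from collections import Counter


def read_lengths(fastq_lines):
    """Count sequence lengths from an iterable of FASTQ lines (4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            counts[len(line.strip())] += 1
    return counts


def r1_length_summary(path):
    """Tabulate R1 lengths in one FASTQ file (gzip-compressed or plain text)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return read_lengths(handle)


# Tiny in-memory example: one 28 bp read and one 26 bp read.
example = [
    "@read_a", "A" * 28, "+", "I" * 28,
    "@read_b", "A" * 26, "+", "I" * 26,
]
counts = read_lengths(example)
shortest = min(counts)
print(dict(counts), "shortest R1 length:", shortest)
```

Run r1_length_summary on each R1 file, take the minimum length across all files, and use that value as N provided it meets the minimum for your chemistry.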
Below are some details on why we require setting R1 to the shortest observed R1 length.
3'v3 Gene Expression example
Take the 3'v3.1 Gene Expression assay as an example. A total R1 length of 28 bp is recommended to capture both the 16 bp 10x barcode and the 12 bp UMI. Shown below is the structure of the R1 and R2 reads for the final library. The 16 bp 10x barcode is shown in green and the 12 bp UMI is shown in red.
Cell Ranger v5 adds a check for read length across all R1 reads. The reason for this check is to protect against double-counting variable-length UMIs as different UMIs. For example, with 3'v3.1 gene expression data, if you have R1 reads with both 26 bp and 28 bp lengths, some reads will have UMIs that are 10 bp long and some will have the expected 12 bp UMI. Consider 2 reads, Read A and Read B, stemming from the same insert molecule as follows.
- Read A R1 is 26 bp long, which results in a 10 bp UMI with sequence AACCGGTTAA
- Read B R1 is 28 bp long, which gives the expected 12 bp UMI with sequence AACCGGTTAACC
Because the UMIs are different, due to different R1 lengths, Cell Ranger will treat them as separate UMIs, which can lead to double-counting. To prevent double-counting in such cases, we require UMIs be the same length.
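The Read A / Read B example can be reproduced with a few lines of Python. This is only an illustration of the idea, not Cell Ranger's actual UMI handling; the split_r1 helper and the barcode sequence are hypothetical, while the UMI sequences and lengths come from the example above.

```python
BARCODE_LEN = 16  # 10x barcode length for 3'v3.1, per the text above


def split_r1(r1, r1_length=None):
    """Split an R1 read into (barcode, UMI), optionally trimming R1 first."""
    if r1_length is not None:
        r1 = r1[:r1_length]
    return r1[:BARCODE_LEN], r1[BARCODE_LEN:]


barcode = "ACGTACGTACGTACGT"  # hypothetical 16 bp barcode, for illustration only
read_a = barcode + "AACCGGTTAA"    # 26 bp R1 -> 10 bp UMI
read_b = barcode + "AACCGGTTAACC"  # 28 bp R1 -> 12 bp UMI

# Without trimming, the two reads yield two distinct UMI strings.
untrimmed = {split_r1(r) for r in (read_a, read_b)}
# Trimming both to the shortest observed length (26) collapses them to one.
trimmed = {split_r1(r, r1_length=26) for r in (read_a, read_b)}
print(len(untrimmed), len(trimmed))  # prints: 2 1
```

Trimming every R1 to the same length is what makes the two reads resolve to a single barcode and UMI pair.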
Sometimes, you may end up with R1 reads from 3' Gene Expression data that are only 26 bp. Unless the complexity of the library is very high, a 10 bp UMI should have minimal impact compared to a 12 bp UMI. The fraction of reads incorrectly flagged as UMI duplicates due to UMI collision will be slightly higher, and the effect is a slight depression in UMI counts.
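To get a rough feel for the size of this effect, the sketch below estimates how many molecules would be lost to UMI collisions for 10 bp and 12 bp UMIs. It rests on simplifying assumptions that are not from this article: UMIs are drawn uniformly at random, collisions only matter among molecules of the same gene in the same cell, and the molecule count of 10,000 is a hypothetical, deliberately large value.

```python
def expected_distinct_umis(n_molecules, umi_length):
    """Expected number of distinct UMIs among n molecules with uniformly random UMIs."""
    space = 4 ** umi_length  # number of possible UMI sequences
    return space * (1 - (1 - 1 / space) ** n_molecules)


n = 10_000  # hypothetical number of molecules from one gene in one cell
for umi_length in (10, 12):
    distinct = expected_distinct_umis(n, umi_length)
    lost = 1 - distinct / n  # fraction of molecules hidden by collisions
    print(f"{umi_length} bp UMI: space = {4 ** umi_length:,}, "
          f"fraction lost to collisions = {lost:.4%}")
```

Under these assumptions the loss stays below one percent even for the 10 bp UMI, and it shrinks further for genes with fewer molecules per cell, which is consistent with the "slight depression" described above.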
What about 5'v2, 5'v1 and 3'v2 data?
For 5'v2, 5'v1 and 3'v2 data, R1 is already 26 bases, with 16 bases corresponding to the 10x barcode and 10 bases corresponding to the UMI. The UMI space is already small. We strongly recommend generating FASTQs with the expected 10 base UMIs.
Depending on the complexity of the library and study aims, shorter UMI lengths, e.g. 9 bases, may be conceptually acceptable. However, cellranger will error out for reads shorter than the minimum for each chemistry.
The minimum read lengths for the different chemistries are:
SFRP - read1: 26, read2: 30, index1: 0
SC5P-R2 - read1: 26, read2: 25, index1: 0
SC5P-PE - read1: 81, read2: 25, index1: 0
SC3Pv1 - read1: 25, read2: 10, index1: 14
SC3Pv2 - read1: 26, read2: 25, index1: 0
SC3Pv3 - read1: 26, read2: 25, index1: 0
SC3Pv3LT - read1: 26, read2: 25, index1: 0
SC3Pv3HT - read1: 26, read2: 25, index1: 0
In such cases, one workaround is to add Ns to the R1 read to meet the minimum length requirement. For recommended read lengths, see table in https://kb.10xgenomics.com/hc/en-us/articles/5568819041805.
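As a minimal sketch of that workaround, the hypothetical pad_r1 helper below appends N bases to a read that is shorter than the required minimum. The choice of "#" (Phred quality 2 in Phred+33 encoding) for the padded quality characters is an assumption, not a 10x recommendation.

```python
def pad_r1(seq, qual, min_length, pad_qual="#"):
    """Pad a read with N bases (and matching quality characters) up to min_length."""
    missing = max(0, min_length - len(seq))
    return seq + "N" * missing, qual + pad_qual * missing


# A 25 bp read padded to the 26 bp minimum listed above.
seq, qual = pad_r1("ACGT" * 6 + "A", "I" * 25, min_length=26)
print(len(seq), seq[-1], qual[-1])  # prints: 26 N #
```

Reads that already meet the minimum are returned unchanged, so the helper can be applied to every record in a file.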
Other tips
Be sure to understand why read lengths were less than the sequencing requirements. In general, we recommend assessing sequencing quality by running FastQC on FASTQ file sets, for which instructions are at https://kb.10xgenomics.com/hc/en-us/articles/360048465032. Please write to support@10xgenomics.com if you have unexpected results and follow instructions in https://kb.10xgenomics.com/hc/en-us/articles/360001673231.
Last updated: November 18, 2022