Question: How does Cell Ranger correct barcode sequencing errors?
Answer: The known barcodes for a given assay chemistry are stored in an "inclusion list" file (see also What is a barcode inclusion list (formerly barcode whitelist)?). For example, there are roughly 737,000 barcodes in the inclusion list for the Single Cell 3' v2 and V(D)J assays, and ~3 million barcodes for the Single Cell 3' v3 and v3.1 chemistries.
Cell Ranger uses the following algorithm to correct putative barcode sequences against the inclusion list:
- Count the observed frequency in the dataset of every barcode on the inclusion list.
- For every observed barcode in the dataset that is not on the inclusion list:
  - For every inclusion list sequence at Hamming distance 1 from the observed barcode:
    - Compute the posterior probability that the observed barcode originated from that inclusion list barcode with a sequencing error at the differing base (based on the base Q score).
  - Replace the observed barcode with the inclusion list barcode that has the highest posterior probability, provided that probability exceeds 0.975.
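
The steps above can be sketched in Python as follows. This is an illustrative sketch, not Cell Ranger's actual implementation. The function names (`count_inclusion_barcodes`, `correct_barcode`), the `pseudocount` parameter, and the exact form of the prior (observed count plus a pseudocount) are assumptions made for the example. Only the Hamming-distance-1 search, the use of the base Q score, and the 0.975 threshold come from the description above.

```python
POSTERIOR_THRESHOLD = 0.975  # threshold stated in the algorithm description


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def count_inclusion_barcodes(observed_barcodes, inclusion_list):
    """Count how often each inclusion-list barcode appears exactly in the data."""
    counts = {bc: 0 for bc in inclusion_list}
    for bc in observed_barcodes:
        if bc in counts:
            counts[bc] += 1
    return counts


def correct_barcode(observed, quals, inclusion_counts, pseudocount=1.0):
    """Return the corrected barcode, or None if no confident correction exists.

    ASSUMPTION: the prior for each candidate is its observed count plus a
    pseudocount; the likelihood is the error probability at the differing base.
    """
    if observed in inclusion_counts:
        return observed  # already on the inclusion list, nothing to correct

    likelihoods = {}
    for pos, q in enumerate(quals):
        p_err = phred_to_error_prob(q)
        for base in "ACGT":
            if base == observed[pos]:
                continue
            # Candidate differs from the observed barcode at exactly one base.
            candidate = observed[:pos] + base + observed[pos + 1:]
            if candidate in inclusion_counts:
                prior = inclusion_counts[candidate] + pseudocount
                likelihoods[candidate] = prior * p_err

    total = sum(likelihoods.values())
    if total == 0:
        return None  # no inclusion-list barcode within Hamming distance 1

    best = max(likelihoods, key=likelihoods.get)
    posterior = likelihoods[best] / total
    return best if posterior > POSTERIOR_THRESHOLD else None


# Toy data: 4-base barcodes for readability (real barcodes are longer).
counts = count_inclusion_barcodes(
    ["AAAA"] * 100 + ["CCCC"] * 5, ["AAAA", "CCCC", "AAAT"]
)
print(correct_barcode("AAAC", [30, 30, 30, 10], counts))  # AAAA
```

In this toy run, `AAAC` has two neighbors on the inclusion list (`AAAA` and `AAAT`), but `AAAA` is far more abundant, so its posterior is about 0.99 and the barcode is corrected. If the two neighbors were equally abundant, neither would exceed 0.975 and the barcode would be left uncorrected.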
The corrected barcodes are used for all downstream analysis and output files. In the output BAM file, the original uncorrected barcode is encoded in the CR tag, and the corrected barcode sequence is encoded in the CB tag. Reads that cannot be assigned a corrected barcode will not have a CB tag.
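
As a rough illustration of how these tags can be inspected, the sketch below pulls the CR and CB values out of a SAM-formatted text record (for example, a line produced by converting the BAM to text). The helper name `barcode_tags` and the sample records are made up for this example. The `-1` suffix on the sample CB value is an assumption about how the corrected barcode typically appears in Cell Ranger output, not something stated above.

```python
def barcode_tags(sam_line):
    """Return (CR, CB) from one SAM record; CB is None when the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # Optional fields follow the 11 mandatory SAM columns as TAG:TYPE:VALUE.
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        if tag in ("CR", "CB"):
            tags[tag] = value
    return tags.get("CR"), tags.get("CB")


# Hypothetical records: 11 mandatory columns followed by optional tags.
mandatory = ["read1", "0", "chr1", "100", "255", "4M", "*", "0", "0", "ACGT", "FFFF"]
corrected = "\t".join(mandatory + ["CR:Z:AAAC", "CB:Z:AAAA-1"])
uncorrected = "\t".join(mandatory + ["CR:Z:GGGG"])

print(barcode_tags(corrected))    # ('AAAC', 'AAAA-1')
print(barcode_tags(uncorrected))  # ('GGGG', None)
```

A read whose CB comes back as `None` is one that could not be assigned a corrected barcode.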
Note: the content here can also be found on the Gene Expression Algorithms page.