Question: How does Cell Ranger correct barcode sequencing errors?
Answer: The known barcodes for a given assay chemistry are stored in a "whitelist" file (see also "What is a barcode whitelist?"). For example, the whitelist contains roughly 737,000 barcodes for the Single Cell 3' v2 and V(D)J assays, and ~3 million barcodes for the Single Cell 3' v3 and v3.1 chemistries.
Cell Ranger uses the following algorithm to correct putative barcode sequences against the whitelist:
- Count the observed frequency in the dataset of every barcode on the whitelist.
- For every observed barcode in the dataset that is not on the whitelist:
  - For every whitelist sequence that is 1-Hamming-distance away:
    - Compute the posterior probability that the observed barcode originated from the whitelist barcode with a sequencing error at the differing base (based on the base Q score).
  - Replace the observed barcode with the whitelist barcode with the highest posterior probability that exceeds 0.975.
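The steps above can be sketched in Python as follows. This is an illustrative sketch, not Cell Ranger's actual code: the function names, the `pseudocount` parameter, and the exact form of the prior (whitelist barcode frequency plus a pseudocount) are assumptions made for the example. The error likelihood uses the standard Phred relation, where a base with quality Q has error probability 10^(-Q/10).

```python
from collections import Counter

THRESHOLD = 0.975  # posterior cutoff stated in the text above
BASES = "ACGT"


def phred_to_error_prob(q):
    """Standard Phred relation: probability that a base call is wrong."""
    return 10 ** (-q / 10)


def count_whitelisted(observed, whitelist):
    """Step 1: count how often each whitelist barcode is observed exactly."""
    return Counter(bc for bc in observed if bc in whitelist)


def correct_barcode(seq, quals, whitelist, counts, pseudocount=1.0):
    """Return the corrected barcode, or None if no candidate is confident enough.

    `quals` holds one Phred score per base of `seq`. The pseudocount-based
    prior is an assumption of this sketch (hypothetical parameter).
    """
    if seq in whitelist:
        return seq  # already valid, nothing to correct

    total = sum(counts.values()) + pseudocount * len(whitelist)
    likelihoods = {}
    # Enumerate every sequence at Hamming distance 1 from the observed barcode.
    for pos, base in enumerate(seq):
        for alt in BASES:
            if alt == base:
                continue
            candidate = seq[:pos] + alt + seq[pos + 1:]
            if candidate in whitelist:
                prior = (counts.get(candidate, 0) + pseudocount) / total
                # Likelihood of a sequencing error at the differing base.
                likelihoods[candidate] = prior * phred_to_error_prob(quals[pos])

    if not likelihoods:
        return None

    norm = sum(likelihoods.values())
    best = max(likelihoods, key=likelihoods.get)
    posterior = likelihoods[best] / norm
    return best if posterior > THRESHOLD else None


if __name__ == "__main__":
    whitelist = {"AAAA", "AAAT", "CCCC"}
    counts = count_whitelisted(["AAAA"] * 98 + ["AAAT"], whitelist)
    print(correct_barcode("AAAC", [30, 30, 30, 30], whitelist, counts))
```

In this toy run, "AAAC" has two whitelist neighbors, "AAAA" and "AAAT", that differ at the same base. Because "AAAA" is far more abundant, its posterior (99/101, about 0.98) clears the 0.975 cutoff. If the two neighbors were equally abundant, neither would exceed the cutoff and the barcode would be left uncorrected.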
The corrected barcodes are used for all downstream analysis and output files. In the output BAM file, the original uncorrected barcode is stored in the CR tag, and the corrected barcode sequence is stored in the CB tag. Reads that cannot be assigned a corrected barcode have no CB tag.
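To see the two tags on a single read, the sketch below parses the optional fields of one SAM-format text line (for example, a line printed by `samtools view`). The helper name `barcode_tags` and the example read are hypothetical. The `-1` suffix on the CB value in the example reflects the GEM well suffix that Cell Ranger typically appends to corrected barcodes; treat that detail as an assumption here.

```python
def barcode_tags(sam_line):
    """Return (raw_barcode, corrected_barcode) from one SAM alignment line.

    Either value is None when the corresponding tag is absent, so a read
    whose barcode could not be corrected yields (raw_barcode, None).
    """
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 SAM columns are mandatory; optional TAG:TYPE:VALUE
    # fields start at column 12.
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags.get("CR"), tags.get("CB")


if __name__ == "__main__":
    # Hypothetical read: 11 mandatory columns followed by the barcode tags.
    mandatory = ["read1", "0", "chr1", "100", "255", "4M", "*", "0", "0", "ACGT", "FFFF"]
    line = "\t".join(mandatory + ["CR:Z:AAAC", "CB:Z:AAAA-1"])
    raw, corrected = barcode_tags(line)
    print(raw, corrected)
```

Filtering on the second returned value being `None` is one simple way to find reads whose barcodes were not corrected.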
Note: the content here can also be found on the Gene Expression Algorithms page.