Question: How is the PCR duplication rate calculated in Long Ranger? I used Picard and found a much higher rate.
Answer: Long Ranger identifies PCR duplicates with its own algorithm, not the one used by Picard, so the two tools can report different rates. The Long Ranger implementation can be found within your installation:
longranger-x.x.x/longranger-cs/x.x.x/mro/stages/reads/mark_duplicates/__init__.py
The algorithm follows these steps.
1. Group all the read pairs by the following key: (alignment position of R1, alignment position of R2, strand of R1, corrected 10x barcode sequence).
2. Within each of these groups, select the first read to be the 'non-duplicate' read of the group. Assign all the other reads the 'duplicate' BAM flag (1024, i.e. 0x400).
3. (Only for non-MiSeq instruments) Reconstruct the x,y coordinates of each read from the information contained in the read name. Find any pairs of reads that are within 25,000 pixels of each other. For each group of reads that are within 25,000 pixels of each other, count 1 read as an 'original' molecule, and count the rest as 'Exclusion Amp duplicates'. Subtract the count of 'Exclusion Amp duplicates' from the total count of duplicates, and count the remainder as PCR duplicates. Note that the top-level stats reported by Long Ranger and Loupe only record the PCR duplicates. Current 4k and HiSeq X data is observed to have a very low rate of 'Exclusion Amp duplicates'.
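The three steps above can be sketched in Python as follows. This is an illustration only, not the code shipped in mark_duplicates/__init__.py. The ReadPair class, the function names, and the greedy way nearby reads are clustered are all hypothetical. The sketch also assumes Illumina-style read names of the form instrument:run:flowcell:lane:tile:x:y, and it compares raw x,y values without accounting for lane or tile, which a real implementation would likely need to handle. For MiSeq data, step 3 (count_duplicates) would simply be skipped.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import hypot

DUPLICATE_FLAG = 1024        # BAM flag 0x400 (read is a duplicate)
MAX_PIXEL_DISTANCE = 25_000  # distance threshold quoted in step 3


@dataclass
class ReadPair:
    """Hypothetical container for the fields the grouping key needs."""
    name: str          # read name, assumed instrument:run:flowcell:lane:tile:x:y
    r1_pos: int        # alignment position of R1
    r2_pos: int        # alignment position of R2
    r1_reverse: bool   # strand of R1
    barcode: str       # corrected 10x barcode sequence
    flag: int = 0


def parse_xy(read_name):
    """Extract x,y from the read name (assumed layout, see above)."""
    fields = read_name.split()[0].split(":")
    return int(fields[5]), int(fields[6])


def mark_duplicates(pairs):
    """Steps 1 and 2: group by key, flag all but the first pair per group."""
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair.r1_pos, pair.r2_pos, pair.r1_reverse, pair.barcode)
        groups[key].append(pair)
    for members in groups.values():
        for dup in members[1:]:
            dup.flag |= DUPLICATE_FLAG
    return groups


def count_duplicates(groups):
    """Step 3: split the duplicate count into Exclusion Amp and PCR parts."""
    total = 0
    exclusion_amp = 0
    for members in groups.values():
        total += len(members) - 1
        originals = []  # coordinates of reads counted as 'original' molecules
        for pair in members:
            x, y = parse_xy(pair.name)
            near_original = any(
                hypot(x - ox, y - oy) <= MAX_PIXEL_DISTANCE
                for ox, oy in originals
            )
            if near_original:
                exclusion_amp += 1   # close to an existing original
            else:
                originals.append((x, y))
    return {
        "total": total,
        "exclusion_amp": exclusion_amp,
        "pcr": total - exclusion_amp,
    }
```

With three pairs that share a key, where the second lies a few hundred pixels from the first and the third lies far away, this sketch flags the second and third as duplicates, counts one 'Exclusion Amp duplicate', and leaves one PCR duplicate.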