Question: How is "Sequencing Saturation" calculated?
Answer: The web_summary.html output from cellranger count
includes a metric called "Sequencing Saturation". This metric quantifies the fraction of reads originating from an already-observed UMI. More specifically, this is the fraction of confidently mapped, valid cell-barcode, valid UMI reads that are non-unique (match an existing cell-barcode, UMI, gene combination).
The formula for calculating this metric is as follows:
Sequencing Saturation = 1 - (n_deduped_reads / n_reads)
where
n_deduped_reads = Number of unique (valid cell-barcode, valid UMI, gene) combinations among confidently mapped reads.
n_reads = Total number of confidently mapped, valid cell-barcode, valid UMI reads.
Note that the numerator of the fraction is n_deduped_reads, not the non-unique reads that are mentioned in the definition. n_deduped_reads is a degree of uniqueness, not a degree of duplication/saturation. Therefore we take the complement of (n_deduped_reads / n_reads) to measure saturation.
Below we provide two example calculations using public data.
[1] Example calculation for gene expression sample
Uses dataset at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3, where the web summary report gives a sequencing saturation value of 0.7085123.
The sequencing saturation calculation below matches the 0.7085123 sequencing saturation given in the web summary report.
unique_confidently_mapped_reads = 10,196,940
duplicate_reads = 24,785,461
x = 1 - (unique_confidently_mapped_reads/(unique_confidently_mapped_reads + duplicate_reads))
x = 1 - (10,196,940/(10,196,940 + 24,785,461))
x = 1 - (10,196,940/34,982,401)
x = 1 - 0.29148771
x = 0.70851229
The unique_confidently_mapped_reads
are the reads with xf tag value 25, as described on https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam. The bits are 1, 8, 16. You can obtain theunique_confidently_mapped_reads
by parsing the BAM like so.
samtools view pbmc_1k_v3_possorted_genome_bam.bam | grep 'xf:i:25' | wc -l
The duplicate_reads
are those marked with the sam flag 0x400 as explained on https://broadinstitute.github.io/picard/explain-flags.html and can be obtained by running samtools flagstat
.
samtools flagstat pbmc_1k_v3_possorted_genome_bam.bam
76920923 + 0 in total (QC-passed reads + QC-failed reads)
10319036 + 0 secondary
0 + 0 supplementary
24785461 + 0 duplicates
73840063 + 0 mapped (95.99% : N/A)
...
All of these samflag 0x400 reads have an xf tag value of 17, which consist of bits 1 and 16. This also means these reads do not have the xf bit of 8, which mark representative reads from a group of duplicates. The converse isn't true though. The xf17 consist mostly of samflag 0x400 duplicate reads but also of samflag nonduplicate reads.
[2] Example calculation for gene expression plus feature barcode sample
Uses dataset at https://support.10xgenomics.com/single-cell-gene-expression/datasets/4.0.0/SC3_v3_NextGem_DI_PBMC_CSP_1K, where the web summary report gives a sequencing saturation value of 69.8%.
For this context the formula above must be modified to:
x = 1 - (unique_confidently_mapped_reads/(unique_confidently_mapped_reads + (duplicate_reads - unmapped_duplicates)))
where unmapped_duplicates
is the count of the antibody feature barcode duplicates.
As described in the last table on https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam, all of the feature barcode reads are unmapped. Of these unmapped, some of the reads also have the duplicate bit set such that the SAM flag value is 1028. None of these reads will have xf=25, as this value appears to be reserved for gene expression data. So we need only subtract out the contribution of antibody feature barcode reads from the duplicate counts. Doing so gives
0.6983225298606541
which is 69.8% rounded.
You can count the unmapped_duplicates
with the following one-liner.
samtools view SC3_v3_NextGem_DI_PBMC_CSP_1K_possorted_genome_bam.bam | grep 'fb:Z:' | cut -f2 | grep '1028' | wc -l
Related resources
Article last updated January 26, 2023
Disclaimer: This article is provided for instructional purposes only. 10x Genomics does not support or guarantee the listed tools or code.