Question: How can I get the corresponding read counts for the observed UMIs?
Answer: There are two Cell Ranger output files with read-level information:
1. BAM: Indexed BAM file containing position-sorted reads aligned to the genome and transcriptome.
2. Molecule Info: HDF5 file containing per-molecule information for all molecules that contain a valid cell-barcode and valid UMI.
Obtaining the read-level information from each file will require some custom coding.
For example, from the molecule_info.h5 file, you can get read-level support for each valid barcode, valid UMI, and gene combination. If you are comfortable working with H5 files in R or Python, you can aggregate the confidently-mapped reads per cell, gene, and UMI .
The Bioconductor library, DropletUtils has a function called read10xMoleInfo
that we can use to read in the h5 file. Another R library that enables reading HDF5 formatted h5 files is rhdf5, with the h5ls
function.
Installation in R, verified with R version 3.6.1 (2019-07-05):
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DropletUtils")
library("DropletUtils")
Read the h5 file into the R environment, importing it as a data frame. This h5 file can be found in the dataset: 5k PBMCs from a healthy donor with cell surface proteins (Next GEM).
mol.info.file <- "5k_pbmc_protein_v3_nextgem_molecule_info.h5"
molinfo <- read10xMolInfo(mol.info.file)
molinfo
The result of this will be a data frame that looks like this:
$data
DataFrame with 61910072 rows and 5 columns
cell umi gem_group gene reads
<character> <integer> <integer> <integer> <integer>
1 AAACCCAAGAAACACT 14447931 1 30393 1
2 AAACCCAAGAAACTGT 4330830 1 33499 3
3 AAACCCAAGAACAAGG 3584363 1 8233 2
4 AAACCCAAGAACCCGG 10064873 1 30480 1
5 AAACCCAAGAACCGCA 14499623 1 11895 1
... ... ... ... ... ...
61910068 TTTGTTGTCTTTCTTC 12915344 1 1016 2
61910069 TTTGTTGTCTTTCTTC 16461395 1 10874 1
61910070 TTTGTTGTCTTTGATC 7724732 1 14376 2
61910071 TTTGTTGTCTTTGATC 4027663 1 17849 1
61910072 TTTGTTGTCTTTGCGC 14951664 1 4638 1
$genes
[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000239906" "ENSG00000241599" "ENSG00000236601"
Once you load in the molecule_info file to the data frame, you have access to the confidently-mapped reads count data for each valid barcode, GEM group, gene, and valid UMI.
Disclaimer: This article and code-snippet are provided for instructional purposes only. 10x Genomics does not support or guarantee the code.