Question: How to find multi-mapped reads from the possorted_genome_bam.bam or the sample_alignments.bam file?
Answer:
Cell Ranger applies specific criteria to decide which reads are considered for UMI counting (described here), and one of them is that it filters out reads that map to more than one locus. It can be useful to investigate how many reads are multi-mapping to the genome (described here). For example, a dataset may show high mapping to the genome but low confident mapping to the transcriptome, and the question is then whether the reads mapping to the genome are multi-mapped reads that are being filtered out.
Multi-mapped reads are included in the possorted_genome_bam.bam (generated by the cellranger count pipeline) or the sample_alignments.bam (generated by the cellranger multi pipeline). In this article we provide guidance for extracting multi-mapped reads from Cell Ranger BAM files.
Step 1: Install Samtools
The examples below use samtools. If Cell Ranger is already installed on the system, the following commands activate the Cell Ranger environment, which provides samtools.
cd /path/cellranger-7.1.0
source sourceme.bash
If the location of the cellranger installation is not known, consider installing samtools directly (download link).
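Before moving on, it can help to confirm that samtools is actually reachable from the current shell. The following is a minimal sketch; the function name check_samtools is hypothetical and not part of Cell Ranger or samtools.

```shell
# Hypothetical helper: report the samtools version if it is on PATH,
# otherwise print a message to stderr and return a non-zero status.
check_samtools() {
    if command -v samtools >/dev/null 2>&1; then
        # The first line of `samtools --version` carries the version string.
        samtools --version | head -n 1
    else
        echo "samtools not found on PATH" >&2
        return 1
    fi
}

# Usage:
# check_samtools
```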
Step 2: Find multi-mapped reads, that is, reads that map to more than one locus according to the NH tag (which gives the number of loci the read can map to) and that also have a MAPQ score not equal to 255.
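samtools view -h possorted_genome_bam.bam | grep -E "^@|NH:i:([2-9]|[1-9][0-9]+)" | awk 'BEGIN{FS="\t"} $5!=255' > multi_mapped_reads.sam
In the above, -h preserves the header lines in the output. The grep pattern keeps the header lines and the records whose NH value is 2 or greater, and awk then drops records with a MAPQ of 255.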
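As an alternative to matching the NH tag as text, the NH value can be parsed as a number, which avoids accidental matches elsewhere in the record. The following is a minimal sketch of that idea, not the pipeline's own method; the function name filter_multimapped is hypothetical, and it assumes SAM text on standard input with optional tags starting at column 12.

```shell
# Hypothetical helper: keep header lines plus records with NH > 1 and MAPQ != 255.
# Reads SAM text on stdin and writes SAM text to stdout.
filter_multimapped() {
    awk 'BEGIN { FS = "\t" }
         /^@/ { print; next }          # pass header lines through unchanged
         {
             nh = 0
             # Optional tags start at column 12 of a SAM record.
             for (i = 12; i <= NF; i++) {
                 if ($i ~ /^NH:i:/) { nh = substr($i, 6) + 0; break }
             }
             # Column 5 is MAPQ.
             if (nh > 1 && $5 != 255) print
         }'
}

# Usage (assumes samtools is on PATH and the BAM file is in the current directory):
# samtools view -h possorted_genome_bam.bam | filter_multimapped > multi_mapped_reads.sam
```

Records without an NH tag are treated as not multi-mapped in this sketch and are dropped.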
Step 3: Generate a multi-mapped BAM file
samtools view -S -b multi_mapped_reads.sam -o multi_mapped_reads.bam
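In the above, the -S option treats the input file as a SAM file (recent samtools versions detect the input format automatically and accept the option only for compatibility), the -b option writes BAM-formatted output, and -o sets the name of the output file.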
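Once the BAM file exists, a natural follow-up is to index it and compare its record count with that of the full BAM file. The sketch below assumes samtools is on PATH; the function name percent_multimapped is hypothetical. Note that samtools view -c counts alignment records, so with Cell Ranger versions that write every alignment of a multi-mapping read, the figure overstates the number of reads.

```shell
# Hypothetical helper: percentage of records that are multi-mapped.
# $1 = multi-mapped record count, $2 = total record count
percent_multimapped() {
    awk -v m="$1" -v t="$2" 'BEGIN {
        if (t > 0) printf "%.2f\n", 100 * m / t
        else print "NA"
    }'
}

# Usage (commented out so that the block has no side effects):
# samtools index multi_mapped_reads.bam
# multi=$(samtools view -c multi_mapped_reads.bam)
# total=$(samtools view -c possorted_genome_bam.bam)
# percent_multimapped "$multi" "$total"
```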
Please note that multi-mapping is not exactly the same as "reads that are assigned to multiple genes". The latter can be deduced from the GX or GN tags in the output BAM file, which are generated by Cell Ranger after alignment with STAR. A read assigned to a single gene has one gene ID in GX and one gene name in GN, while a read assigned to multiple genes lists multiple gene IDs and names.
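To see how many records carry more than one gene in the GX tag, the tag can be inspected directly. The following is a minimal sketch under the assumption that multiple gene IDs in GX are separated by semicolons; the function name count_multigene_records is hypothetical.

```shell
# Hypothetical helper: count SAM records whose GX tag lists more than one gene ID.
# Assumes gene IDs within GX are separated by semicolons.
count_multigene_records() {
    awk 'BEGIN { FS = "\t" }
         /^@/ { next }                 # skip header lines
         {
             for (i = 12; i <= NF; i++) {
                 if ($i ~ /^GX:Z:/) {
                     if (index($i, ";") > 0) n++
                     break
                 }
             }
         }
         END { print n + 0 }'
}

# Usage (assumes samtools is on PATH):
# samtools view possorted_genome_bam.bam | count_multigene_records
```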
Note: Older versions of cellranger (such as v3.1.0) output all the alignments of a multi-mapping read to the BAM file. Newer versions of cellranger retain just one record per multi-mapping read.
Products: Single Cell Gene Expression