Question: How do I identify the unmapped reads in my Cell Ranger or Long Ranger output?
Answer: You can identify the unmapped reads using the flags in column 2 of the BAM file. These flags are described in the SAM/BAM specification here. The flag value that marks an unmapped read is 4. However, if you took only the BAM entries with exactly 4 in column 2, you would miss some unmapped reads, because BAM flags are additive: the value in column 2 is the sum of every flag that applies to the read. For example, if a read is paired (flag = 1) and unmapped (flag = 4), the flag in the BAM file would be 5 (1 + 4 = 5). The online tool here is useful for interpreting SAM flags.
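To illustrate why an exact match on 4 undercounts, here is a minimal sketch in shell. The flag values in the list are made-up examples, not output from a real BAM file. The loop compares an exact test against a bitwise test for the 4 bit.

```shell
# Toy FLAG values as they might appear in column 2 (illustrative only).
flags="4 5 77 99 141"

exact=0    # reads whose flag is exactly 4
bitset=0   # reads whose flag includes the 4 (unmapped) bit

for f in $flags; do
  if [ "$f" -eq 4 ]; then
    exact=$((exact + 1))
  fi
  # A bitwise AND with 4 is non-zero whenever the unmapped bit is part of the sum.
  if [ $((f & 4)) -ne 0 ]; then
    bitset=$((bitset + 1))
  fi
done

echo "exact match: $exact, bit test: $bitset"
```

With this toy list, the exact test finds one read while the bit test finds four (4, 5, 77, and 141), which is the gap described above.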
To identify all of the unmapped reads in a BAM file, you can use samtools, which comes bundled with our pipelines (it is on your path after you run source sourceme.bash). The -f argument to samtools view takes a flag value and returns, in SAM format, every entry that has that flag set. The command below uses samtools to identify all unmapped reads, pipes those SAM entries to the Unix utility cut, where the -f1 argument selects column 1 (the read ID), and finally writes the IDs to the file unmapped_reads.txt.
samtools view -f 4 phased_possorted_bam.bam | cut -f1 > unmapped_reads.txt
samtools is aware of the additive nature of the BAM flags. For example, samtools view -f 4 on a BAM file from Long Ranger returned an entry with a BAM flag of 77. This means the read is paired (flag = 1), the read is unmapped (flag = 4), the other read in the pair is unmapped (flag = 8), and this read is the first in the pair (flag = 64): 1 + 4 + 8 + 64 = 77.
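The same breakdown can be done for any flag value. The sketch below defines a hypothetical helper, decode_flag, which is not part of samtools or of our pipelines. It tests each bit defined in the SAM specification and prints a name for every bit that is set. The names are informal labels chosen for this example.

```shell
# decode_flag: hypothetical helper that lists the SAM flag bits contained in a value.
decode_flag() {
  flag=$1
  out=""
  # Each entry is bit:label, using the bit values from the SAM specification.
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
              16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
              256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
    bit=${pair%%:*}
    name=${pair#*:}
    if [ $((flag & bit)) -ne 0 ]; then
      out="${out:+$out,}$name"   # append with a comma separator
    fi
  done
  echo "$out"
}

decode_flag 77   # prints: paired,unmapped,mate_unmapped,first_in_pair
```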
When the command finishes, you have a file called unmapped_reads.txt containing the IDs of the unmapped reads from your experiment. You can use this list to identify the original reads from the input FASTQ files.
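One possible way to pull those reads back out of a FASTQ file is sketched below with awk. It is not a pipeline feature. It assumes an uncompressed FASTQ with exactly four lines per record, and it assumes the read ID in the BAM equals the FASTQ header up to the first whitespace, without the leading @. The function name extract_reads and the file names in the usage line are hypothetical.

```shell
# extract_reads: hypothetical helper.
#   $1 = file of read IDs, one per line (e.g. unmapped_reads.txt)
#   $2 = uncompressed FASTQ file with 4 lines per record
extract_reads() {
  awk '
    # First file: remember every read ID.
    FILENAME == ARGV[1] { want[$1] = 1; next }
    # FASTQ header line: strip the leading "@" and check the ID.
    FNR % 4 == 1 { id = substr($1, 2); keep = (id in want) }
    # Print all four lines of a record whose header matched.
    keep
  ' "$1" "$2"
}

# Example usage (file names are placeholders):
# extract_reads unmapped_reads.txt reads.fastq > unmapped_reads.fastq
```

If your FASTQ files are gzipped, decompress them first. Because both reads of an unmapped pair share one ID, the ID list can contain duplicates, which this sketch tolerates.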