Question: Are my sample’s antisense and intronic read levels normal or of concern?
Answer: Researchers frequently ask whether the extent of antisense and intronic reads in their samples is normal or of concern. The fraction of antisense and intronic reads varies from sample to sample and can reach high levels, particularly for nuclei samples. The Technical Note here showcases various cell and nuclei types and their observed antisense and intronic read fractions (Table 1 on page 5). Researchers should compare the mapping metrics from the web summary report to those of the cell type that most closely resembles their sample to see whether the metrics fall in the same range.
- If yes, great. Because cellranger omits antisense reads from counts, you can proceed with downstream data analyses.
- If not, and if the samples do not show the expected biology, then consider sample prep optimization. Contact email@example.com for a consultation. For example, if the intent was to isolate cells, then antisense reads may suggest unintended nuclei isolation. The consult would discuss maintaining cell viability and gentler isolation.
The TechNote also proposes molecular mechanisms by which antisense reads could arise for 3' and 5’ single-cell samples. The following sections describe two contexts that give rise to high antisense and intronic reads and guidance to researchers who want to dig deeper into their data, e.g. quantify the per-transcript level counts of antisense reads.
Nuclei samples increase antisense and intronic reads
In general, nuclei samples are expected to have increased antisense and intronic reads. For example, we have the following breakdown of deduplicated reads for E18 mouse brain nuclei libraries prepared with 3’ v3.1 chemistry (from Table 1 of the Tech Note).
- 31.6% sense exonic
- 32.2% sense intronic
- 4.6% antisense exonic
- 31.4% antisense intronic
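This results in 63.8% of reads that cellranger will count (31.6% sense exonic + 32.2% sense intronic), assuming inclusion of intronic reads. Antisense intronic reads, at 31.4%, amount to about half that figure and are nearly as abundant as sense intronic reads. These counts likely reflect the biology of the developing embryonic brain at day 18.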
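The arithmetic behind these figures can be reproduced with a few lines of shell. This is only a sketch using the four percentages listed above; the variable names are illustrative and not part of any cellranger output.

```shell
# Percent of deduplicated reads, E18 mouse brain nuclei, 3' v3.1 (values from the list above).
sense_exonic=31.6
sense_intronic=32.2
antisense_exonic=4.6
antisense_intronic=31.4

# Reads counted when intronic reads are included: sense exonic + sense intronic.
counted=$(awk -v a="$sense_exonic" -v b="$sense_intronic" 'BEGIN { printf "%.1f", a + b }')
# All antisense reads, which are left out of the counts.
antisense=$(awk -v a="$antisense_exonic" -v b="$antisense_intronic" 'BEGIN { printf "%.1f", a + b }')
# Antisense intronic reads relative to the counted reads.
ratio=$(awk -v x="$antisense_intronic" -v c="$counted" 'BEGIN { printf "%.2f", x / c }')

echo "counted=${counted}% antisense=${antisense}% antisense_intronic/counted=${ratio}"
```

Substituting the percentages for the cell type closest to your own sample gives a quick comparison against the web summary metrics.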
Both 3’ and 5’ single-cell sample preparations select for sequences containing poly-A via a poly-T containing primer. mRNA pre-processing occurs in the nucleus as the cell transcribes the pre-mRNA. Steps include addition of the 5’ m7G cap, splicing and 3’ polyadenylation. Cells then export the polyadenylated mRNA from the nucleus to the cytosol. Thus, the proportion of polyadenylated transcripts to non-polyadenylated species in the nucleus can be small compared to the mostly polyadenylated transcripts we expect in the cytosol.
Another consideration is the steady-state level of transcripts. This author’s¹ recollection from Northern blotting in a eukaryotic organism is that, for the gene of interest, pre-mRNA made up <10% of transcripts, with mRNA making up the rest.
So, not only does the nucleus present a smaller number of poly-A sites for poly-T primers to bind, it also starts with a fraction of the transcripts the cytosol presents. If cells or nuclei contain very little mRNA to begin with, then other types of nucleic acid entities can dominate, e.g. rRNA, tRNA and their pre-versions, and sample prep amplifies artifacts that arise from their capture, from internal priming and from primer-primer interactions.
If nuclei isolation proves limiting, then consider spatial transcriptomic approaches, e.g. Visium and Xenium, that allow for fresh frozen and FFPE tissue slices that keep cells in their spatial contexts and do not require harsh dissociation and isolation procedures.
Follow guidelines for number of cDNA amplification cycles
One piece of advice to researchers is to limit cDNA amplification cycles. In general, going above the recommended number of cDNA amplification cycles can create artifacts. This could be due in some cases to limiting primers. A 'Tip' in the User Guides, next to the table showing amplification cycles, also states this. If the maximum allowed number of cycles is insufficient, then the recourse is to start with more cells. Do not add more amplification cycles.
For RevD of the 3’v3.1 User Guide, we see the following.
For nuclei, the article "What are the best practices for working with nuclei samples for 3' single-cell gene expression?" gives additional guidance for 3’ single-cell samples: increase the cDNA amplification by 1-2 cycles to increase the cDNA yield. Again, do not add more amplification cycles than recommended.
5’ v2 and 3’ v2 cross-library contamination or swapped chemistry can show elevated antisense reads
Cross-sample contamination and sample swaps are not uncommon. Because each product chemistry uses a different barcode whitelist with minimal overlap, the web summary report will show low ‘Valid Barcodes’ in these cases. Furthermore, if the samples differ in their species, mapping metrics will reflect the irregularity. Cellranger only counts reads with valid barcodes and thereby filters cross-chemistry library contamination. You can safely proceed with analyzing cellranger data.
One exception is 5’v2 and 3’v2 libraries from the same species. These two chemistries share a barcode whitelist that differs from the 3’v3.1 whitelist. Because 5’ and 3’ products count reads in opposite strand orientations, the web summary will show elevated antisense reads when libraries are cross-contaminated or when the chemistry version is wrong, whether incorrectly auto-detected or incorrectly set with the --chemistry parameter.
If the cause is true cross-sample contamination, we recommend tracking down the source and implementing preventative measures to minimize such incidents. Good places to start are the index adaptor plate at adaptor ligation and any sample multiplexed in the same sequencer lane, whether introduced directly (a mistake in multiplexing) or via unwitting contamination (again, of the index adaptor or the library). Does the contaminating sample look like a past sample from someone who shares the index adaptor plate?
How to quantify the extent of antisense reads?
To survey adaptor artifacts, the ‘Overrepresented sequences’ section of a FastQC report is a good place to start. For 3’ single-cell, overrepresentation of TSO at the beginning of read2 indicates short cDNA inserts, under-fragmentation of cDNA and/or degraded transcripts. The last possibility is likely when TSO + poly-A reads are abundant, e.g. 5’-AAGCAGTGGTATCAACGCAGAGTACATGGGAAAAAAAAAAAAAAAAAAAA-3’ and variant readouts. For 5’ single-cell, the equivalent indicator is an enrichment of poly-A at the beginning of read2. If FastQC shows overrepresented sequences other than these, then a BLAST search of the sequences can yield insight into their identity.
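As a complement to FastQC, the share of read2 sequences that begin with the TSO can be tallied directly. The following is a minimal sketch, not a 10x-provided tool: the function name tso_fraction is hypothetical, the TSO is assumed to be the portion of the example read above that precedes the poly-A stretch, and only exact matches at the very start of the read are counted.

```shell
# TSO assumed from the example read above (sequence before the poly-A stretch).
TSO="AAGCAGTGGTATCAACGCAGAGTACATGGG"

# Read an uncompressed FASTQ (4 lines per record) on stdin and report
# how many reads begin with an exact copy of the TSO.
tso_fraction() {
  awk -v tso="$TSO" '
    NR % 4 == 2 {                       # sequence line of each record
      total++
      if (index($0, tso) == 1) hits++   # TSO at position 1 of the read
    }
    END {
      if (total == 0) { print "no reads"; exit 1 }
      printf "%d of %d reads (%.1f%%) start with the TSO\n", hits, total, 100 * hits / total
    }'
}

# Hypothetical usage on a randomly subsampled R2 file (file name is a placeholder):
# gzip -dc sub2.fastq.gz | tso_fraction
```

Exact matching keeps the sketch simple but will miss reads with sequencing errors in the TSO, so treat the result as a lower bound.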
This author¹ has observed such poly-A reads piling up at a few loci in the genome. One quick way to survey such pile-ups is to run the igvtools count module of IGV (https://software.broadinstitute.org/software/igv/) on the sorted BAM from cellranger, load the resulting TDF track and look for unusually tall peaks across the contigs.
- In such surveys, this author¹ has come across extremely and disproportionately high peaks at rDNA loci as well as at loci of nucleus-localizing lncRNA transcripts. For human samples, another simple check for rDNA contamination is to run samtools idxstats on the cellranger BAM and look for a disproportionately high number of reads on chr21.
- Note 3’ R2 reads are in the sense orientation and 5’ R2 reads are antisense to transcripts. Right-click on the reads track in IGV and select ‘Color alignments by’ > ’Read strand’ to differentially color the leftward versus rightward reads.
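To make a disproportionate contig easier to spot in samtools idxstats output, the mapped reads can be normalized by contig length. This is a sketch under stated assumptions: contig_read_share is a hypothetical helper name, and it relies only on the standard four-column, tab-separated idxstats layout (contig, length, mapped, unmapped).

```shell
# Read `samtools idxstats` output on stdin and print, per contig:
# name, mapped reads, percent of all mapped reads, mapped reads per Mb.
# Rows are sorted by reads per Mb, highest first.
contig_read_share() {
  awk -F '\t' '
    $1 != "*" && $2 > 0 { mapped[$1] = $3; len[$1] = $2; total += $3 }
    END {
      if (total == 0) exit 1
      for (c in mapped)
        printf "%s\t%d\t%.2f\t%.1f\n", c, mapped[c], 100 * mapped[c] / total, mapped[c] / (len[c] / 1e6)
    }' | sort -t "$(printf '\t')" -k4,4nr
}

# Hypothetical usage on the cellranger BAM:
# samtools idxstats possorted_genome_bam.bam | contig_read_share | head
```

A contig whose reads-per-Mb value is far above the others is a candidate for a closer look in IGV.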
For researchers who wish to dig deeper, one approach is to quantify the extent of antisense reads across transcripts. This can give more resolution than the single percentage value the web summary report captures.
- If data is 5' v2, you can generate a count matrix of antisense reads simply by forcing --chemistry=SC3Pv2 and vice versa.
- For 3' v3, consider two workarounds. One, swap out the barcode whitelist with that of 5' v2's and force --chemistry=SC5Pv2 to enable counting antisense reads. You can find the barcode whitelists for different chemistries at https://kb.10xgenomics.com/hc/en-us/articles/115004506263. Two, reverse the strand orientation in the 2020-A reference GTF (+ to - and vice versa), recreate the reference with the cellranger mkref module and use this antisense reference with cellranger.
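For the second workaround, reversing the strand amounts to swapping + and - in the seventh column of the GTF. The following is a minimal sketch; flip_gtf_strand is a hypothetical helper name and the file names in the usage comments are placeholders.

```shell
# Swap + and - in the strand column (7th tab-separated field) of a GTF on stdin.
# Comment lines (starting with #) and features with strand "." pass through unchanged.
flip_gtf_strand() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/      { print; next }
       $7 == "+" { $7 = "-"; print; next }
       $7 == "-" { $7 = "+"; print; next }
       { print }'
}

# Hypothetical usage (file and genome names are placeholders):
# flip_gtf_strand < genes.gtf > genes.antisense.gtf
# cellranger mkref --genome=antisense_ref --fasta=genome.fa --genes=genes.antisense.gtf
```

Give the antisense reference a clearly distinct name so that its counts are not mistaken for regular gene expression data.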
Cellranger counts of antisense reads do not necessarily correlate with their sense counterparts. Do not use the counts from antisense reads for anything other than diagnosing the underlying issue. These approaches, like the cellranger count run alone, will not quantify rRNA reads, as rDNA loci generate multi-mapping reads. To survey multi-mapping, intergenic and unmapped reads, consider examining the BAM alignment result directly instead of the count matrix data. Furthermore, a paired-end alignment of the R1 and R2 reads can reveal unintended library inserts that amplified but did not undergo capture via the 10x barcode + UMI capture oligo; these show up as properly paired reads in the samtools flagstat tally.
For faster survey iteration, this author¹ recommends working with reads subsampled using seqtk (https://github.com/lh3/seqtk), e.g. 20K reads. It is important not to take reads in order from the top of the FASTQs, nor consecutively, as such reads would come from the same flowcell tiles and could manifest edge effects or effects from bubbles. Instead, subsample randomly across the tiles, which seqtk's sample module enables. Subsampling each FASTQ, e.g. R1 and R2, with the same random seed keeps the reads paired.
seqtk sample -s100 read2.fastq.gz 20000 > sub2.fastq
seqtk sample -s100 read1.fastq.gz 20000 > sub1.fastq
Compress each file with gzip to generate the .fastq.gz equivalents. For compatibility with cellranger and cellranger-arc, be sure to follow the file naming conventions shown in the article "ERROR: There were no reads to process."
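Because pairing depends on using the same seed for both files, it can be worth confirming that the two subsampled FASTQs list the same read names in the same order before compressing them. This is a sketch only; fastq_names and check_paired are hypothetical helper names, and the check assumes uncompressed four-line FASTQ records.

```shell
# Print the read name of each record: the first token of the header line,
# without a trailing /1 or /2 if present.
fastq_names() {
  awk 'NR % 4 == 1 { name = $1; sub(/\/[12]$/, "", name); print name }' "$1"
}

# Compare the read names of two uncompressed FASTQ files, in order.
check_paired() {
  if [ "$(fastq_names "$1")" = "$(fastq_names "$2")" ]; then
    echo "paired: read names match"
  else
    echo "NOT paired: read names differ"
    return 1
  fi
}

# Hypothetical usage on the files produced above:
# check_paired sub1.fastq sub2.fastq
```

If the names differ, rerun seqtk sample with the same -s seed and the same read count on both input files.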
1: This author is Soo Hee Lee of the 10x Genomics Applied Bioinformatics team
Date modified: September 29, 2023
Products: 3’ and 5’ single-cell gene expression products, including multiome, processed with cellranger and cellranger-arc software