Question:
I have a CRISPR dataset where the Web Summary shows a low fraction of guide reads and a high fraction of protospacers not recognized. How can I resolve this issue?
Answer:
A scenario where over 90% of protospacers are unrecognized typically indicates a mismatch between the pattern or sequence field in the feature reference CSV file and the Read 2 CRISPR FASTQ file.
To troubleshoot:
- Verify the feature reference CSV file: Ensure the pattern or sequence field matches the Read 2 CRISPR FASTQ.
- For 5’ CRISPR experiments: The sequence in the CSV file may need to be reverse-complemented.
  - Example: If your CSV contains "CTTTCACTATGTGCACCAGG", but your data actually requires the reverse complement, update it to "CCTGGTGCACATAGTGAAAG".
  - After this change, the pipeline should detect the guide RNAs correctly.
- If uncertain about the ‘pattern’ field, consider using the un-tethered approach (see example screenshot).
  - This approach allows Cell Ranger v4 and later to dynamically detect guide RNA sequences.
  - Note: If there are a large number of guides, runtime may increase significantly.
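The reverse-complement step above can be sketched in the shell as follows. This is a minimal illustration, not part of Cell Ranger: `revcomp` is a hypothetical helper name, and it relies only on the standard `rev` and `tr` utilities.

```shell
# Reverse-complement a DNA sequence: reverse the string, then swap A<->T and C<->G.
# `revcomp` is a hypothetical helper name used only for this illustration.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Example from this article:
revcomp CTTTCACTATGTGCACCAGG   # prints CCTGGTGCACATAGTGAAAG
```

The printed sequence can then be pasted into the sequence field of the feature reference CSV in place of the original entry.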
Manual Inspection of Read 2 CRISPR FASTQ Sequences
If the protospacer sequences used in the experiment are unknown:
- Contact the personnel who performed the experiment to obtain the correct sequences.
- Manually inspect the Read 2 FASTQ file:
  - Compute the frequencies of the protospacer sequences.
  - Identify the top protospacer sequences in the dataset.
  - Compare these sequences with those in the feature reference CSV file.
- Example mismatch case (screenshot below):
  Feature reference CSV: "TTCCAGCATAGCTCTTAAAC"
  Read 2 FASTQ detected: "TGCTATTTCTAGCTCTAAAA"
For additional guidance, refer to the official documentation: Cell Ranger Feature Reference CSV - CRISPR
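The comparison step can also be scripted. The sketch below is one possible approach under stated assumptions: the observed sequences are in a two-column file ("SEQUENCE COUNT", sorted by count, such as the output.txt produced by the Unix command in the next section), and the feature reference CSV has a header row with a column named `sequence` and no quoted commas. `check_top_seqs` and the status labels are hypothetical names chosen for this illustration.

```shell
# Hypothetical helper: for the top N observed sequences, report whether each one
# appears in the `sequence` column of the feature reference CSV as-is ("match"),
# as its reverse complement ("revcomp_match"), or not at all ("no_match").
# Usage: check_top_seqs <counts_file> <feature_reference.csv> [top_n]
check_top_seqs() {
  counts_file=$1
  feature_csv=$2
  top_n=${3:-10}
  head -n "$top_n" "$counts_file" |
    awk '
      BEGIN { comp["A"] = "T"; comp["T"] = "A"; comp["C"] = "G"; comp["G"] = "C" }
      function revcomp(s,    i, c, out) {
        out = ""
        for (i = length(s); i >= 1; i--) {
          c = substr(s, i, 1)
          out = out ((c in comp) ? comp[c] : c)
        }
        return out
      }
      # First input: the CSV. Locate the "sequence" column from the header row.
      mode == "csv" && FNR == 1 {
        for (i = 1; i <= NF; i++) if ($i == "sequence") col = i
        next
      }
      mode == "csv" { if (col) known[toupper($col)] = 1; next }
      # Second input: the top N "SEQUENCE COUNT" lines read from stdin.
      mode == "counts" {
        s = toupper($1)
        rc = revcomp(s)
        if (s in known) status = "match"
        else if (rc in known) status = "revcomp_match"
        else status = "no_match"
        print $1, $2, status
      }
    ' mode=csv FS=',' "$feature_csv" mode=counts FS=' ' -
}

# Example (file names are placeholders):
# check_top_seqs output.txt feature_reference.csv 10
```

If most of the top sequences come back as `revcomp_match`, that would be consistent with the reverse-complement situation described for 5’ CRISPR; if they come back as `no_match`, the CSV entries likely differ from what was sequenced.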
Using Unix Scripts for Troubleshooting
For 3’ CRISPR, the following Unix command can extract and count 20-base protospacer sequences before the constant sequence pattern (pattern used here is from our CRISPR public datasets):
zcat filename.fastq.gz | grep -oP ".{20}(?=GTTTAAGAGCTAAGCTGGAA)" | awk '{count[$0]++} END {for (seq in count) {print seq, count[seq]}}' | sort -k2,2nr > output.txt
The screenshot below shows an example of this output, with the protospacer sequences in the first column and their frequencies in the second column.
For 5’ CRISPR, extract sequences after the pattern (useful for 5’ CRISPR cases with reverse-complemented sequence issues):
zcat filename.fastq.gz | grep -oP "(?<=TTCCAGCATAGCTCTTAAAC).{20}" | awk '{count[$0]++} END {for (seq in count) {print seq, count[seq]}}' | sort -k2,2nr > subset_seqfreq.txt
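The two commands above rely on `grep -P`, which is not available in every grep implementation. As a hedged alternative, the sketch below does a similar count with `awk` and `gzip -dc` only. `count_protospacers` is a hypothetical helper name. Unlike the grep commands, it looks only at the sequence line of each FASTQ record and only at the first occurrence of the constant sequence in each read, so counts may differ slightly.

```shell
# Hypothetical helper: count the bases found immediately before or after a
# constant sequence, using only the sequence line (line 2 of 4) of each record.
# Usage: count_protospacers <fastq.gz> <constant_seq> <before|after> [length]
count_protospacers() {
  fastq=$1
  constant=$2
  side=$3
  len=${4:-20}
  gzip -dc "$fastq" |
    awk -v k="$constant" -v side="$side" -v n="$len" '
      NR % 4 == 2 {
        p = index($0, k)                 # first occurrence of the constant sequence
        if (p == 0) next                 # constant sequence not found in this read
        if (side == "before") {
          if (p <= n) next               # fewer than n bases upstream
          s = substr($0, p - n, n)
        } else {
          s = substr($0, p + length(k), n)
          if (length(s) < n) next        # fewer than n bases downstream
        }
        count[s]++
      }
      END { for (s in count) print s, count[s] }
    ' | sort -k2,2nr
}

# Examples using the patterns from this article (file names are placeholders):
# count_protospacers filename.fastq.gz GTTTAAGAGCTAAGCTGGAA before > output.txt
# count_protospacers filename.fastq.gz TTCCAGCATAGCTCTTAAAC after > subset_seqfreq.txt
```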
Next Steps if the Issue Persists
If:
- The manual inspection shows no sequence matches, and
- The un-tethered approach still results in >50% unrecognized sequences,
Then there may be library preparation or sequencing issues. In such cases:
Contact support@10xgenomics.com and provide:
- The feature reference CSV file
- A FASTQ subset containing the first 100k reads of Read 2
- FastQC HTML reports
- BioA traces
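One way to prepare the FASTQ subset is sketched below. It assumes a gzipped FASTQ with the standard four lines per record, so the first 100k reads correspond to the first 400,000 lines. `subset_fastq` and the file names are hypothetical.

```shell
# Hypothetical helper: write the first N reads of a gzipped FASTQ to a new file.
# Each FASTQ record spans 4 lines, so N reads correspond to 4 * N lines.
# Usage: subset_fastq <input.fastq.gz> <output.fastq.gz> [n_reads]
subset_fastq() {
  in_fastq=$1
  out_fastq=$2
  n_reads=${3:-100000}
  gzip -dc "$in_fastq" | head -n "$((n_reads * 4))" | gzip -c > "$out_fastq"
}

# Example (file names are placeholders):
# subset_fastq sample_R2_001.fastq.gz sample_R2_first100k.fastq.gz
```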
Product: Universal 3' Gene Expression, Universal 5' Gene Expression
Last Updated: Feb, 2025