Question: I am trying to build a custom reference using cellranger mkref and am encountering errors. How can I resolve this?
Answer: Errors from the cellranger mkref command are usually due to formatting problems in the input GTF file. The error message from cellranger mkref gives more information about the problem, and the exact error depends on the source of the input files.
In general, we recommend formatting the FASTA and GTF annotation files according to the criteria here, which helps in generating a successful reference. Ensembl references are recommended because they need no additional formatting to be compatible with Cell Ranger. Genomes from other sources such as NCBI, UCSC, or RefSeq need additional formatting to make them compatible with the cellranger mkref pipeline.
Shown below are a few common errors noticed when building a custom reference using genomes from NCBI, UCSC, or RefSeq:
Start position of feature = 100000 > End position of feature = 80000
Invalid gene annotation input: in GTF
records for gene_id 'ENSG00000163737' are not contiguous in the file
Error while parsing GTF file
Property 'transcript_id' not found in GTF line 9:
In all of the above cases, the cause is either duplicate or missing features, or poorly formatted entries. To troubleshoot such issues, the following steps can be implemented using custom scripts:
- Recommended to retain only
- Check the annotation file for redundant entries and verify the order of genes.
- Replace or remove any gene_id that has an empty value. gene_id values must be made unique (e.g., unknown_transcript_1 fields).
- Finally, make sure that all annotation records for a single gene are found together in order, one after the other (this step will need custom scripts).
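As an illustration of the kind of custom script these steps call for, here is a minimal Python sketch. It is not part of Cell Ranger: the function names (parse_attributes, check_gtf, regroup_by_gene) are hypothetical, and the checks only approximate the conditions behind the error messages listed above. In particular, requiring transcript_id on every non-gene record is an assumption about what cellranger mkref expects, and regrouping alone may not be enough for every file.

```python
import re

# Matches quoted attributes in GTF column 9, e.g. gene_id "ENSG00000163737";
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(field):
    """Return GTF column 9 as a dict of key -> value (first occurrence wins)."""
    attrs = {}
    for key, value in ATTR_RE.findall(field):
        attrs.setdefault(key, value)
    return attrs


def check_gtf(lines):
    """Return (line_number, message) pairs for common formatting problems."""
    problems = []
    finished = set()  # gene_ids whose run of records has already ended
    current = None    # gene_id of the run currently being read
    for n, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            problems.append((n, "expected 9 tab-separated columns"))
            continue
        feature, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if start > end:
            problems.append((n, f"start {start} > end {end}"))
        gene_id = attrs.get("gene_id", "")
        if not gene_id:
            problems.append((n, "missing or empty gene_id"))
            continue
        # Assumption: every record other than a gene record needs transcript_id.
        if feature != "gene" and not attrs.get("transcript_id"):
            problems.append((n, "missing or empty transcript_id"))
        if gene_id != current:
            if gene_id in finished:
                problems.append(
                    (n, f"records for gene_id '{gene_id}' are not contiguous"))
            if current is not None:
                finished.add(current)
            current = gene_id
    return problems


def regroup_by_gene(lines):
    """Reorder records so that all records sharing a gene_id are adjacent.

    Genes keep the order of their first appearance and records within a gene
    keep their original relative order. Comment lines are emitted first.
    """
    header, groups = [], {}
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        if not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        attr_field = cols[8] if len(cols) == 9 else ""
        gene_id = parse_attributes(attr_field).get("gene_id", "")
        groups.setdefault(gene_id, []).append(line)
    return header + [rec for recs in groups.values() for rec in recs]


# Made-up records for demonstration only.
sample = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "A"; transcript_id "A.1";\n',
    'chr1\tsrc\texon\t500\t600\t.\t+\t.\tgene_id "B"; transcript_id "B.1";\n',
    'chr1\tsrc\texon\t300\t250\t.\t+\t.\tgene_id "A";\n',
]
for line_no, message in check_gtf(sample):
    print(line_no, message)
```

On a real file, the same functions can be applied to the lines of the opened GTF, and the regrouped lines written to a new file before rerunning cellranger mkref. Records with reversed coordinates or missing attributes still have to be corrected or removed separately; the sketch only reports them.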
The above formatting techniques will help to build a custom reference successfully using cellranger mkref.
- What criteria should I use for making a custom reference?
- How to troubleshoot Cell Ranger count Segmentation Fault (Duplicate contigs)?
Products: Single Cell Gene Expression, Single Cell Immune Profiling, Single Cell Multiome ATAC + Gene Expression