Question: I am trying to build a custom reference using cellranger mkref and am encountering errors. How can I resolve this?
Answer: Errors from the cellranger mkref command are usually due to formatting problems in the input GTF file. The error message from cellranger mkref gives more information about the problem, and the details depend on the source of the input files.
In general, we recommend formatting the FASTA and GTF annotation files according to the custom reference criteria (see Related articles below), which help in generating a successful reference. Ensembl references are recommended because they need no additional formatting to be compatible with Cell Ranger. Genomes from other sources such as NCBI, UCSC, or RefSeq need additional formatting to make them compatible with the cellranger mkref pipeline.
Shown below are a few common errors noticed when building a custom reference using genomes from NCBI, UCSC, or RefSeq:
- Start position of feature = 100000 > End position of feature = 80000
- Invalid gene annotation input: in GTF records for gene_id 'ENSG00000163737' are not contiguous in the file
- Error while parsing GTF file
  Property 'transcript_id' not found in GTF line 9:
In all of the above cases, the causes range from duplicate or missing features to poorly formatted entries. To troubleshoot such issues, the following steps can be implemented using custom scripts:
- Retain only the gene_id, transcript_id, and gene_name attributes.
- Check for redundancy and order the genes in the annotation file.
- Replace or remove gene_id values that are empty.
- Make transcript_id values that are duplicated across multiple gene_id values unique (e.g., unknown_transcript_1 fields).
- Finally, make sure that all annotation records for a single gene are found together in order, one after the other (this step will need custom scripts).
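The steps above could be scripted roughly as follows. This is a minimal sketch, not an official 10x Genomics tool: the function name clean_gtf_lines is hypothetical, records with an empty gene_id are removed rather than replaced, and a duplicated transcript_id is made unique by appending the gene_id, which is only one possible renaming scheme. Attribute values are assumed to be double-quoted.

```python
import re

# Matches attributes written as: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')
# Attributes to retain, in output order
KEEP = ("gene_id", "transcript_id", "gene_name")


def clean_gtf_lines(lines):
    """Return cleaned GTF lines with each gene's records grouped together."""
    header = []
    by_gene = {}   # gene_id -> list of (first 8 columns, attributes); insertion-ordered
    tx_owner = {}  # transcript_id -> first gene_id seen with that transcript_id
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # malformed record; simply skipped in this sketch
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id", "")
        if not gene_id:
            continue  # drop records whose gene_id is missing or empty
        tx = attrs.get("transcript_id", "")
        if tx:
            owner = tx_owner.setdefault(tx, gene_id)
            if owner != gene_id:
                # Same transcript_id under another gene: rename (assumed scheme)
                attrs["transcript_id"] = f"{tx}_{gene_id}"
        by_gene.setdefault(gene_id, []).append((fields[:8], attrs))

    out = list(header)
    # Emit genes in order of first appearance, all records of a gene together
    for records in by_gene.values():
        for fields, attrs in records:
            kept = " ".join(f'{k} "{attrs[k]}";' for k in KEEP if attrs.get(k))
            out.append("\t".join(fields + [kept]))
    return out
```

A typical use would be to read the original annotation with open(), pass the file object to clean_gtf_lines, and write the returned lines to a new file that is then given to cellranger mkref. The whole annotation is held in memory, which is usually acceptable for a GTF but worth keeping in mind for very large files.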
The above formatting techniques will help to build a custom reference successfully using the cellranger mkref pipeline.
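Before re-running cellranger mkref, it can help to scan the GTF for the kinds of problems behind the errors listed earlier. The sketch below is illustrative only: find_gtf_problems is a hypothetical helper, its messages are not Cell Ranger's own, and it assumes that every record other than feature type gene should carry a transcript_id and that columns 4 and 5 hold integers.

```python
import re

# Matches attributes written as: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def find_gtf_problems(lines):
    """Return a list of (line_number, message) for likely mkref-blocking issues."""
    problems = []
    finished = set()  # gene_ids whose block of records has already ended
    current = None    # gene_id of the block currently being read
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append((number, "expected 9 tab-separated columns"))
            continue
        # Columns 4 and 5 are the 1-based start and end positions
        if int(fields[3]) > int(fields[4]):
            problems.append((number, "start position is greater than end position"))
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id", "")
        if not gene_id:
            problems.append((number, "gene_id is missing or empty"))
            continue
        # Assumption: only 'gene' rows may legitimately lack a transcript_id
        if fields[2] != "gene" and not attrs.get("transcript_id"):
            problems.append((number, "transcript_id is missing or empty"))
        if gene_id != current:
            if gene_id in finished:
                problems.append(
                    (number, f"records for gene_id {gene_id!r} are not contiguous")
                )
            if current is not None:
                finished.add(current)
            current = gene_id
    return problems
```

Passing an open file object (for example, one opened on a hypothetical genes.gtf) returns an empty list when none of these checks fail. The reported line numbers count comment lines too, so they match positions in the file.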
Related articles:
- What criteria should I use for making a custom reference?
- How to troubleshoot Cell Ranger count Segmentation Fault (Duplicate contigs)?
Products: Single Cell Gene Expression, Single Cell Immune Profiling, Single Cell Multiome ATAC + Gene Expression