Question: The Ensembl FTP site offers several flavors of genome reference FASTA files for download, such as primary assembly, top-level, sm, and rm. Which one should I use to create a custom reference with cellranger mkref?
Answer: For custom reference generation, use the file whose name includes "dna" and "primary_assembly".
Eg: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
The "dna" in the file name indicates unmasked genomic DNA sequence. You might also find files marked "rm", in which repeat sequences are hard masked with Ns, and files marked "sm", in which repeat sequences are soft masked with lowercase letters (a, c, g, t). The rm and sm files are not used for custom reference creation.
The "primary assembly" file contains all the top-level sequence regions, excluding haplotypes and patches. If no primary assembly file is available for a species, that indicates the assembly has no haplotype or patch regions, and in such cases the "top-level" file is used instead.
Eg: Homo_sapiens.GRCh38.dna.toplevel.fa.gz
Haplotype and patch regions are excluded because reads aligning to them would otherwise be counted as low mapping quality reads in Cell Ranger, which would lower the overall mapping rate metrics.
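The selection rule above can be sketched in a few lines of Python. This is only an illustration, not part of Cell Ranger: the function name pick_reference_fasta is hypothetical, and the sketch assumes the file names follow the pattern of the examples above, with masked files marked "dna_rm" or "dna_sm" and the top-level file spelled either "toplevel" or "top-level". Adjust the patterns if your directory listing differs.

```python
def pick_reference_fasta(filenames):
    """Return the preferred unmasked genome FASTA name, or None if none matches.

    Hypothetical helper: prefers the primary assembly file and falls back
    to the top-level file only when no primary assembly file is listed.
    """
    # Preference order: primary assembly first, then top-level.
    # Both spellings of the top-level suffix are accepted (assumption).
    preferences = [
        ("primary_assembly",),
        ("toplevel", "top-level"),
    ]
    for kinds in preferences:
        for name in filenames:
            # ".dna." matches only unmasked files; "dna_rm" and "dna_sm"
            # (hard and soft masked) do not contain ".dna." and are skipped.
            if any(name.endswith(f".dna.{kind}.fa.gz") for kind in kinds):
                return name
    return None


files = [
    "Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz",
    "Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz",
    "Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
    "Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
]
print(pick_reference_fasta(files))
# prints: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
```

The chosen file is then decompressed and supplied to cellranger mkref as the genome FASTA.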
Products: Single Cell Gene Expression