Question: We have 10x gene expression data collected from COVID-19-positive patients. How can we make a custom reference with SARS-CoV-2 added to the human GRCh38 reference? Could you direct us to a comprehensive list of SARS-CoV-2 sequences?
Answer: You can customize your reference to include the SARS-CoV-2 genome as follows.
- Choosing the reference sequence is up to you, because strain evolution is ongoing and geography may matter depending on the research question. You will need to locate a genome sequence FASTA file. Many SARS-CoV-2 genome sequences are publicly available. For example, the underlying data used by the Nextstrain SARS-CoV-2 phylogenetic analysis website are available in FASTA format from https://www.gisaid.org/ (requires registration and login). NCBI also hosts SARS-CoV-2 genomes. If the regional specificity of the SARS-CoV-2 genome sequence does not matter for your analysis, any SARS-CoV-2 genome FASTA may be sufficient.
- After you find a SARS-CoV-2 genome that matches your experiment, download the genome FASTA and append it to the human reference genome FASTA. You will also need to make a GTF file for SARS-CoV-2. We recommend annotating the whole viral genome as a single "gene" with the feature type marked as "exon". Then append that GTF to the human genes.gtf annotation file. For more detailed step-by-step instructions on making a custom reference with an added gene, please see the Custom References Tutorial.
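The append steps above can be sketched in Python as follows. This is a minimal illustration, not the tutorial's own procedure. It assumes the viral FASTA holds a single record. The function names, the "custom" source field, and the gene_id, transcript_id, and gene_name values are placeholders chosen for this example. The sequence name in the first GTF column is taken from the FASTA header so that the two files agree.

```python
import shutil


def fasta_record_length(fasta_path):
    """Return (sequence_name, length) for a single-record FASTA file."""
    name = None
    length = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    raise ValueError("expected a single-record FASTA")
                # Sequence name = header text up to the first whitespace.
                name = line[1:].split()[0]
            else:
                length += len(line)
    if name is None:
        raise ValueError("no FASTA header found")
    return name, length


def single_exon_gtf_line(seqname, length, gene_id="SARS-CoV-2"):
    """Build one GTF line annotating the whole sequence as a single exon.

    The source field ("custom") and the attribute values are illustrative
    assumptions, not required values.
    """
    attributes = (
        f'gene_id "{gene_id}"; transcript_id "{gene_id}"; '
        f'gene_name "{gene_id}";'
    )
    fields = [seqname, "custom", "exon", "1", str(length),
              ".", "+", ".", attributes]
    return "\t".join(fields) + "\n"


def concatenate(sources, destination):
    """Stream each source file into destination, in order.

    A newline is added after any non-empty source that lacks a trailing
    one, so records from different files never end up on the same line.
    """
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as src:
                shutil.copyfileobj(src, out)
                if src.tell() > 0:
                    src.seek(-1, 2)  # move to the last byte of the file
                    if src.read(1) != b"\n":
                        out.write(b"\n")
```

With hypothetical file names, the workflow would be: call `fasta_record_length("sars_cov2.fa")`, write the result of `single_exon_gtf_line(name, length)` to `sars_cov2.gtf`, then run `concatenate(["genome.fa", "sars_cov2.fa"], "genome_plus_virus.fa")` and `concatenate(["genes.gtf", "sars_cov2.gtf"], "genes_plus_virus.gtf")`. The combined files would then go into the reference-building step described in the Custom References Tutorial.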