Question: What are the major steps for building a custom T2T reference in Cell Ranger ARC?
Answer: We have general resources for building a custom reference here: https://www.10xgenomics.com/support/software/cell-ranger-arc/latest/analysis/inputs/mkref
Below is a step-by-step tutorial on constructing a human T2T reference (the T2T-CHM13v2.0 assembly produced by the Telomere-to-Telomere Consortium) that can be used with cellranger-arc.
Step 1: Download FASTA and GTF
Source FTP: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/
Genome file:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
Annotation file:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz
Motifs file:
wget https://jaspar.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txt
Note:
- The downloaded FASTA and GTF files are compressed. To decompress them, use gunzip, gzip -d, or a similar utility. E.g.:
gunzip GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
gzip -d GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz
- To simplify filenames, rename the decompressed FASTA file to t2t_genome.fa and the decompressed GTF file to t2t.gtf. After decompression the files no longer carry the .gz extension. E.g.:
mv GCF_009914755.1_T2T-CHM13v2.0_genomic.fna t2t_genome.fa
mv GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf t2t.gtf
Step 2: Formatting FASTA file
- On each header line, keep only the sequence ID and remove the rest of the description
- Convert everything to upper case
awk '{if ($1 ~ /^>/) {print $1} else {print $0}}' t2t_genome.fa | tr '[:lower:]' '[:upper:]' > t2t_v1.fa
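Before moving on, it may be worth confirming that the formatting did what was intended. The sketch below is not part of Cell Ranger ARC; check_fasta is a hypothetical helper name introduced here for illustration. It counts header lines, header lines that still contain a description after the ID, and sequence lines that still contain lower-case bases, and returns a non-zero status if any problems remain.

```shell
#!/usr/bin/env bash
# check_fasta (hypothetical helper): sanity-check a formatted FASTA file.
# Headers should carry only the sequence ID; sequence lines should be upper case.
check_fasta() {
    local fasta="$1"
    awk '
        /^>/    { n_headers++; if (NF > 1) bad_headers++; next }  # header with extra fields
        /[a-z]/ { lower_lines++ }                                 # sequence line with lower case
        END {
            printf "headers=%d headers_with_description=%d lowercase_lines=%d\n",
                   n_headers, bad_headers, lower_lines
            exit ((bad_headers > 0 || lower_lines > 0) ? 1 : 0)
        }' "$fasta"
}
```

For example, check_fasta t2t_v1.fa prints a single summary line; both headers_with_description and lowercase_lines should be 0 for the output of Step 2.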
Step 3: Filter Annotations using mkgtf
#!/usr/bin/env bash
#$ -N "gtf"
#$ -V
#$ -pe threads 12
#$ -l mem_free=96G
#$ -cwd
#$ -o stdout.out
#$ -e stderr.out
#$ -S "/usr/bin/env bash"
export PATH=$PATH:/path/cellranger-arc/cellranger-arc-2.0.2
gtfpath=/path/cellranger-arc/custom/t2t/t2t.gtf
cellranger-arc mkgtf "$gtfpath" t2t_filt.gtf \
  --attribute=gene_biotype:protein_coding \
  --attribute=gene_biotype:lincRNA \
  --attribute=gene_biotype:antisense \
  --attribute=gene_biotype:IG_LV_gene \
  --attribute=gene_biotype:IG_V_gene \
  --attribute=gene_biotype:IG_V_pseudogene \
  --attribute=gene_biotype:IG_D_gene \
  --attribute=gene_biotype:IG_J_gene \
  --attribute=gene_biotype:IG_J_pseudogene \
  --attribute=gene_biotype:IG_C_gene \
  --attribute=gene_biotype:IG_C_pseudogene \
  --attribute=gene_biotype:TR_V_gene \
  --attribute=gene_biotype:TR_V_pseudogene \
  --attribute=gene_biotype:TR_D_gene \
  --attribute=gene_biotype:TR_J_gene \
  --attribute=gene_biotype:TR_J_pseudogene \
  --attribute=gene_biotype:TR_C_gene
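The --attribute filters only keep records whose gene_biotype value matches one of the listed strings exactly, and annotation providers do not all use the same biotype labels. A minimal sketch for listing the labels actually present in a GTF is shown below; count_biotypes is a hypothetical helper name, and the sketch assumes the GTF stores the biotype in a gene_biotype attribute on its gene records.

```shell
#!/usr/bin/env bash
# count_biotypes (hypothetical helper): tabulate gene_biotype values found on
# "gene" records of a GTF, most frequent first.
count_biotypes() {
    local gtf="$1"
    grep -v '^#' "$gtf" \
        | awk -F'\t' '$3 == "gene"' \
        | grep -o 'gene_biotype "[^"]*"' \
        | sort | uniq -c | sort -k1,1nr -k2,2
}
```

Running count_biotypes t2t.gtf before mkgtf, and count_biotypes t2t_filt.gtf after it, shows which labels exist and which ones survived the filter. Any label in the mkgtf command that never appears in the first listing has no effect.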
Step 4: Generate Reference using mkref
Config (saved as t2t.config):
{
organism: "human"
genome: ["t2t"]
input_fasta: ["/path/cellranger-arc/custom/t2t/t2t_v1.fa"]
input_gtf: ["/path/cellranger-arc/custom/t2t/t2t_filt.gtf"]
input_motifs: "/path/cellranger-arc/custom/t2t/JASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txt"
}
Reference generation:
cellranger-arc mkref --config=t2t.config --memgb=96
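Because mkref can run for a long time, a quick pre-flight check of the inputs may save a failed run. The sketch below lists contig names that appear in the first column of the GTF but not among the FASTA headers. gtf_contigs_missing_from_fasta is a hypothetical helper name introduced here; it is not a Cell Ranger ARC command.

```shell
#!/usr/bin/env bash
# gtf_contigs_missing_from_fasta (hypothetical helper): print contig names used
# in the GTF (column 1) that have no matching header in the FASTA file.
gtf_contigs_missing_from_fasta() {
    local fasta="$1" gtf="$2" tmpdir
    tmpdir=$(mktemp -d)
    # Sequence IDs: first whitespace-delimited token after ">" on header lines
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmpdir/fasta_contigs"
    # Contig names: first tab-delimited column of non-comment GTF lines
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$tmpdir/gtf_contigs"
    # comm -13 keeps only lines unique to the second file (the GTF contigs)
    comm -13 "$tmpdir/fasta_contigs" "$tmpdir/gtf_contigs"
    rm -rf "$tmpdir"
}
```

For example, gtf_contigs_missing_from_fasta t2t_v1.fa t2t_filt.gtf should print nothing when every annotated contig is present in the genome FASTA.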
(Optional) Step 5: Validate step 4 using public data
Dataset: PBMC 3K granulocytes sorted: https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-3-k-1-standard-2-0-0
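One way to run this validation is sketched below: write a libraries CSV that points at the downloaded FASTQ files, then run cellranger-arc count against the new reference. The FASTQ directories, the sample name, the run ID pbmc3k_t2t, and the reference path are all assumptions made for illustration, so adjust them to wherever the dataset was extracted and where mkref wrote its output folder. The core and memory values simply mirror the job script in Step 3.

```shell
#!/usr/bin/env bash
# Libraries CSV for cellranger-arc count. Paths and sample names below are
# placeholders (assumptions); replace them with the real FASTQ locations.
cat > libraries.csv <<'EOF'
fastqs,sample,library_type
/path/to/pbmc_granulocyte_sorted_3k/gex,pbmc_granulocyte_sorted_3k,Gene Expression
/path/to/pbmc_granulocyte_sorted_3k/atac,pbmc_granulocyte_sorted_3k,Chromatin Accessibility
EOF

# Run the pipeline only if cellranger-arc is on the PATH.
# The --reference path is an assumed location of the mkref output folder.
if command -v cellranger-arc >/dev/null 2>&1; then
    cellranger-arc count \
        --id=pbmc3k_t2t \
        --reference=/path/cellranger-arc/custom/t2t/t2t \
        --libraries=libraries.csv \
        --localcores=12 \
        --localmem=96
fi
```

A run that completes and produces a web summary suggests the reference is structurally valid; comparing its metrics with those published for the dataset gives a rough sense of whether the custom reference behaves sensibly.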
(Optional) Step 6: Adding enhancer and promoter region files to the reference
This step provides guidance for adding enhancer, promoter, or even blacklist information similar to that in our pre-built ATAC references. Because the original GTF file contains enhancer and promoter information, these regions can be extracted from the GTF file (input.gtf in the commands below) and written out as BED-formatted files.
The simple script below retrieves the annotated enhancer and promoter regions from the GTF file. GTF coordinates are 1-based, while BED start coordinates are 0-based, so 1 is subtracted from the start. The output is sorted by chromosome and then numerically by start position.
grep enhancer input.gtf | awk '{ print $1"\t"($4-1)"\t"$5"\t"$6"\t"$6"\t"$6}' | sed -e 's/";//g' -e 's/"//g' | sort -k1,1 -k2,2n | uniq > enhancer.bed
grep promoter input.gtf | awk '{ print $1"\t"($4-1)"\t"$5"\t"$6"\t"$6"\t"$6}' | sed -e 's/";//g' -e 's/"//g' | sort -k1,1 -k2,2n | uniq > promoter.bed
Note: The BED files from the pre-built ATAC reference will not work for ARC references because the chromosome names and coordinates differ between the two references. For this example, we therefore use the T2T annotations. After adding the files, you can validate the reference using your dataset.
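Before adding the BED files to a reference, a basic structural check can catch empty or malformed output, for instance when the GTF contains no matching records. The sketch below uses check_bed, a hypothetical helper name introduced here. It verifies that every line has at least three tab-separated columns, that start and end are non-negative integers, and that start is smaller than end.

```shell
#!/usr/bin/env bash
# check_bed (hypothetical helper): report malformed BED records on stderr and
# print a one-line summary; returns non-zero if any record is malformed.
check_bed() {
    local bed="$1"
    awk -F'\t' '
        (NF < 3) || ($2 !~ /^[0-9]+$/) || ($3 !~ /^[0-9]+$/) || ($2 + 0 >= $3 + 0) {
            bad++
            print "line " NR ": " $0 > "/dev/stderr"
        }
        END {
            printf "records=%d malformed=%d\n", NR, bad
            exit ((bad > 0) ? 1 : 0)
        }' "$bed"
}
```

For example, check_bed enhancer.bed and check_bed promoter.bed each print a summary line. A result of records=0 means the grep pattern matched nothing in the GTF.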
Disclaimer: 10x Genomics does not provide support for custom scripts or community-developed tools and does not guarantee their function or performance.
Products: Single Cell Multiome ATAC + Gene Expression
Last Updated: Aug 2024