Question: Is there a compact and self-contained representation of the gene-barcode matrix from Cell Ranger 1.x and 2.x that I can open in a text editor?
Answer: The MEX matrix format is compact, but it requires users to combine information from three separate files (matrix.mtx, genes.tsv, and barcodes.tsv). Fortunately, the contents of the gene-barcode matrix folder can be merged into a self-contained four-column CSV file.
First, go to the directory containing the gene-barcode matrix data (e.g. /outs/raw_gene_bc_matrices/GRCh38/), then copy and paste the entire code block below into a bash shell at once and press ENTER.
# Print line number along with contents of barcodes.tsv and genes.tsv
awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1}' barcodes.tsv | sort -t, -k 1b,1 > numbered_barcodes.csv
awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1,$2}' genes.tsv | sort -t, -k 1b,1 > numbered_genes.csv
# Skip the three header lines of matrix.mtx and sort the entries by gene index (column 1)
tail -n +4 matrix.mtx | awk -F " " 'BEGIN { OFS = "," }; {print $1,$2,$3}' | sort -t, -k 1b,1 > gene_sorted_matrix.csv
# Use join to replace the gene index with gene ID and name, re-sort by barcode index, then replace the barcode index with the barcode
join -t, -1 1 -2 1 numbered_genes.csv gene_sorted_matrix.csv | cut -d, -f 2,3,4,5 | sort -t, -k 3b,3 | join -t, -1 1 -2 3 numbered_barcodes.csv - | cut -d, -f 2,3,4,5 > final_matrix.csv
# Remove temp files
rm -f gene_sorted_matrix.csv numbered_barcodes.csv numbered_genes.csv
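As an alternative sketch, the same index-to-name lookup can be done in a single awk pass that holds the barcodes and genes in memory instead of sorting and joining. The function name mex_to_csv is hypothetical (it is not part of Cell Ranger), and the code assumes the usual Matrix Market layout: comment lines starting with %, followed by one size line, followed by the entries. Rows are written in the order they appear in matrix.mtx.

```shell
# Hypothetical helper: mex_to_csv DIR
# Prints barcode,gene_id,gene_name,count for every entry in DIR/matrix.mtx.
mex_to_csv() {
  dir=${1:-.}
  awk -v OFS=',' '
    # First file: barcodes.tsv, one barcode per line (line number = column index)
    FILENAME == ARGV[1] { split($0, f, "\t"); barcode[FNR] = f[1]; next }
    # Second file: genes.tsv, gene ID and gene name (line number = row index)
    FILENAME == ARGV[2] { split($0, f, "\t"); gene[FNR] = f[1] OFS f[2]; next }
    # Third file: matrix.mtx. Skip comment lines, then the single size line.
    /^%/ { next }
    !seen_size { seen_size = 1; next }
    # Remaining lines: gene_index barcode_index count
    { print barcode[$2], gene[$1], $3 }
  ' "$dir/barcodes.tsv" "$dir/genes.tsv" "$dir/matrix.mtx"
}
```

For example, running mex_to_csv . > final_matrix.csv from inside the matrix directory would write the same four columns described in this article. This approach trades memory (both lookup tables are kept in RAM) for avoiding the temporary files.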
The column definitions of the output final_matrix.csv are as follows:
- 10x Genomics cellular barcode
- Gene ID
- Gene name
- UMI count
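Given these four columns, a short awk sketch can summarize the file, for instance by totaling the UMI counts for each barcode. The function name umi_per_barcode is hypothetical. The count is read from the last field rather than the fourth, on the assumption that a gene name might itself contain a comma.

```shell
# Hypothetical helper: umi_per_barcode FILE
# Prints barcode,total_umi_count for each barcode in a four-column CSV.
umi_per_barcode() {
  awk -F, -v OFS=',' '
    { total[$1] += $NF }                       # $1 = barcode, $NF = UMI count
    END { for (b in total) print b, total[b] }
  ' "$1" | LC_ALL=C sort -t, -k1,1             # sort for a stable output order
}
```

Calling umi_per_barcode final_matrix.csv would print one line per barcode that has at least one nonzero entry.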
Here is a sample of what final_matrix.csv looks like:
AAACCTGAGAACTGTA-1,ENSG00000075415,SLC25A3,1
AAACCTGAGAACTGTA-1,ENSG00000112306,RPS12,1
AAACCTGAGAACTGTA-1,ENSG00000134419,RPS15A,1
AAACCTGAGAACTGTA-1,ENSG00000143119,CD53,1
AAACCTGAGAACTGTA-1,ENSG00000174748,RPL15,1
AAACCTGAGAACTGTA-1,ENSG00000198755,RPL10A,1
AAACCTGAGAACTGTA-1,ENSG00000213246,SUPT4H1,1
AAACCTGAGAACTGTA-1,ENSG00000250317,SMIM20,1
AAACCTGAGAACTGTA-1,ENSG00000251562,MALAT1,1
AAACCTGAGAACTGTA-1,ENSG00000254772,EEF1G,1
...
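One way to sanity-check the result is to compare the number of rows in the CSV with the number of entries declared on the size line of matrix.mtx (the first line that does not start with %, whose third field is the entry count in the Matrix Market format). The sketch below uses a hypothetical function name, check_entry_count, and is only a rough check: it confirms that no rows were dropped, not that every value is correct.

```shell
# Hypothetical helper: check_entry_count MATRIX_MTX FINAL_CSV
# Returns 0 if the CSV has as many rows as matrix.mtx declares entries.
check_entry_count() {
  # Third field of the first non-comment line is the declared entry count
  n_expected=$(awk '!/^%/ { print $3; exit }' "$1")
  # Number of rows in the CSV
  n_actual=$(awk 'END { print NR }' "$2")
  if [ "$n_expected" -eq "$n_actual" ]; then
    echo "OK: $n_actual entries"
  else
    echo "MISMATCH: matrix.mtx declares $n_expected entries, CSV has $n_actual rows"
    return 1
  fi
}
```

It could be run as check_entry_count matrix.mtx final_matrix.csv in the same directory as the original files.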
Disclaimer: This article and code-snippet are provided for instructional purposes only. 10x Genomics does not support or guarantee the code.