Question: Is there a compact and self-contained representation of the feature-barcode matrix from Cell Ranger 3+ that I can open in a text editor?
Answer: The MEX format stores the matrix sparsely, but reading it requires combining information from three files: barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz. Luckily, the contents of the feature-barcode matrix folder can be combined into a single self-contained, five-column CSV file.
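To make the relationship between the three files concrete, here is a minimal sketch on a toy dataset. Every barcode, feature, and count in it is invented for illustration, and the variable names (toy_dir, toy_rows) are hypothetical. It resolves the indices with a single awk pass rather than the sort-and-join approach used in this article, and it uses gzip -dc, which decompresses to standard output in the same way as zcat.

```shell
# Toy MEX folder; all contents are invented for illustration.
toy_dir=$(mktemp -d)
printf '%s\n' 'AAAC-1' 'TTTG-1' | gzip > "$toy_dir/barcodes.tsv.gz"
printf 'ID1\tGENE1\tGene Expression\nID2\tGENE2\tGene Expression\nID3\tGENE3\tGene Expression\n' \
  | gzip > "$toy_dir/features.tsv.gz"
printf '%s\n' '%%MatrixMarket matrix coordinate integer general' '%metadata' \
  '3 2 3' '1 1 5' '3 1 1' '2 2 7' | gzip > "$toy_dir/matrix.mtx.gz"

# Each data line of matrix.mtx reads "<feature row> <barcode column> <count>".
# The row and column numbers are 1-based line numbers in the other two files.
gzip -dc "$toy_dir/barcodes.tsv.gz" > "$toy_dir/barcodes.tsv"
gzip -dc "$toy_dir/features.tsv.gz" > "$toy_dir/features.tsv"
toy_rows=$(gzip -dc "$toy_dir/matrix.mtx.gz" | awk -v bc="$toy_dir/barcodes.tsv" -v ft="$toy_dir/features.tsv" '
  BEGIN {
    OFS = ","
    # Load barcodes and features into arrays indexed by line number.
    while ((getline line < bc) > 0) barcode[++nb] = line
    while ((getline line < ft) > 0) {
      split(line, f, "\t")
      feature[++nf] = f[1] "," f[2] "," f[3]
    }
  }
  /^%/ { next }                        # skip comment lines
  !seen_dims { seen_dims = 1; next }   # skip the dimensions line
  { print barcode[$2], feature[$1], $3 }
')
echo "$toy_rows"
```

In this toy case the matrix entry "2 2 7" becomes the row TTTG-1,ID2,GENE2,Gene Expression,7: the second feature, observed in the second barcode, with a count of 7.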
First, go to the directory containing the feature-barcode matrix data (e.g. /outs/filtered_feature_bc_matrix in the cellranger count output directory), then copy and paste the entire code block below at once into a bash shell and press ENTER.
# Print the line number along with the contents of barcodes.tsv.gz and features.tsv.gz
zcat barcodes.tsv.gz | awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1}' | sort -t, -k 1b,1 > numbered_barcodes.csv
zcat features.tsv.gz | awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1,$2,$3}' | sort -t, -k 1b,1 > numbered_features.csv
# Skip the three header lines of matrix.mtx.gz, convert the entries to CSV, and sort them
zcat matrix.mtx.gz | tail -n +4 | awk -F " " 'BEGIN { OFS = "," }; {print $1,$2,$3}' | sort -t, -k 1b,1 > feature_sorted_matrix.csv
zcat matrix.mtx.gz | tail -n +4 | awk -F " " 'BEGIN { OFS = "," }; {print $1,$2,$3}' | sort -t, -k 2b,2 > barcode_sorted_matrix.csv
# Use join to replace the line numbers with barcodes and features
join -t, -1 1 -2 1 numbered_features.csv feature_sorted_matrix.csv | cut -d, -f 2,3,4,5,6 | sort -t, -k 4b,4 | join -t, -1 1 -2 4 numbered_barcodes.csv - | cut -d, -f 2,3,4,5,6 > final_matrix.csv
# Remove temp files
rm -f barcode_sorted_matrix.csv feature_sorted_matrix.csv numbered_barcodes.csv numbered_features.csv
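One way to sanity-check the result is to compare the number of rows in the CSV with the number of entries declared in the matrix header. The sketch below is only an illustration: check_entry_count is a hypothetical helper name, the toy files it runs on are invented, and it assumes the first non-comment line of matrix.mtx.gz holds the feature count, barcode count, and entry count, in that order.

```shell
# Hypothetical helper: compare the entry count declared in the MEX header
# with the number of rows in the CSV.
check_entry_count() {
  local mtx_gz=$1 csv=$2
  local declared rows
  # The first line that does not start with "%" is the dimensions line:
  # "<features> <barcodes> <entries>".
  declared=$(gzip -dc "$mtx_gz" | awk '!/^%/ { print $3; exit }')
  rows=$(wc -l < "$csv" | tr -d ' ')
  if [ "$declared" -eq "$rows" ]; then
    echo "OK: $rows rows"
  else
    echo "MISMATCH: header declares $declared entries, CSV has $rows rows"
    return 1
  fi
}

# Toy files, invented for illustration only.
demo_dir=$(mktemp -d)
printf '%s\n' '%%MatrixMarket matrix coordinate integer general' '%metadata' \
  '3 2 4' '1 1 5' '3 1 1' '2 2 7' '3 2 2' | gzip > "$demo_dir/matrix.mtx.gz"
printf '%s\n' \
  'AAAC-1,ID1,GENE1,Gene Expression,5' \
  'AAAC-1,ID3,GENE3,Gene Expression,1' \
  'TTTG-1,ID2,GENE2,Gene Expression,7' \
  'TTTG-1,ID3,GENE3,Gene Expression,2' > "$demo_dir/final_matrix.csv"

check_result=$(check_entry_count "$demo_dir/matrix.mtx.gz" "$demo_dir/final_matrix.csv")
echo "$check_result"
```

On real data the same function could be called as check_entry_count matrix.mtx.gz final_matrix.csv from inside the matrix folder. A mismatch would suggest that some rows were dropped during a join, for example because of inconsistent sort order.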
The columns of the output file final_matrix.csv are defined as follows:
- 10x Genomics cellular barcode
- Feature ID
- Feature name
- Feature type
- UMI count
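Because each row carries its own barcode and count, simple per-barcode summaries can be computed with standard tools. The following is a minimal sketch on an invented toy file (demo_csv and umi_totals are hypothetical names) that sums the UMI counts per barcode. It reads the count from the last field rather than the fifth, on the assumption that a feature name containing a comma would otherwise shift the columns.

```shell
# Toy CSV in the five-column layout described above (values invented).
demo_csv=$(mktemp)
printf '%s\n' \
  'AAAC-1,ENSG01,GENE1,Gene Expression,1' \
  'AAAC-1,ENSG02,GENE2,Gene Expression,2' \
  'TTTG-1,ENSG01,GENE1,Gene Expression,4' > "$demo_csv"

# Sum the last column (UMI count) for each barcode (first column).
umi_totals=$(awk -F, '{ total[$1] += $NF }
  END { for (b in total) print b "," total[b] }' "$demo_csv" | sort)
echo "$umi_totals"
```

For the toy input this prints AAAC-1,3 and TTTG-1,4.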
Here is a sample of what final_matrix.csv looks like:
AAACCTGCACATTAGC-1,ENSG00000005075,POLR2J,Gene Expression,1
AAACCTGCACATTAGC-1,ENSG00000006015,C19orf60,Gene Expression,1
AAACCTGCACATTAGC-1,ENSG00000007944,MYLIP,Gene Expression,2
AAACCTGCACATTAGC-1,ENSG00000008128,CDK11A,Gene Expression,1
AAACCTGCACATTAGC-1,ENSG00000008952,SEC62,Gene Expression,2
AAACCTGCACATTAGC-1,ENSG00000008988,RPS20,Gene Expression,1
AAACCTGCACATTAGC-1,ENSG00000010626,LRRC23,Gene Expression,1
AAACCTGCACATTAGC-1,ENSG00000023041,ZDHHC6,Gene Expression,1
AAACCTGCACATTAGC-1,ENSG00000023892,DEF6,Gene Expression,1
AAACCTGCACATTAGC-1,ENSG00000026025,VIM,Gene Expression,1
...
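Rows for a single feature can be pulled out of a file in this layout by matching on the feature name column. The sketch below reuses two rows from the sample above in a temporary file. The names sample_csv, feature_name, and matching_rows are hypothetical, and the exact match on the third column assumes feature names contain no commas.

```shell
# Two rows copied from the sample above, written to a temporary file.
sample_csv=$(mktemp)
printf '%s\n' \
  'AAACCTGCACATTAGC-1,ENSG00000008988,RPS20,Gene Expression,1' \
  'AAACCTGCACATTAGC-1,ENSG00000026025,VIM,Gene Expression,1' > "$sample_csv"

# Keep only the rows whose feature name (third column) matches exactly.
feature_name=VIM
matching_rows=$(awk -F, -v name="$feature_name" '$3 == name' "$sample_csv")
echo "$matching_rows"
```

On a real final_matrix.csv the same awk command would list every barcode in which the chosen feature was detected, together with its UMI count.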
Disclaimer: This article and code snippet are provided for instructional purposes only. 10x Genomics does not support or guarantee the code.