Question: Is there a compact and self-contained representation of the feature-barcode matrix from Cell Ranger ARC 1.x that I can open in a text editor?
Answer: The MEX format stores the matrix sparsely, but reading it requires combining information from three separate files (barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz). The contents of the feature-barcode matrix folder can, however, be combined into a single self-contained five-column CSV file.
First, go to the directory containing the feature-barcode matrix data (e.g. ANALYSIS/outs/filtered_feature_bc_matrix), then copy and paste the entire code block at once into a bash shell and press ENTER.
# Print line number along with contents of barcodes.tsv.gz and features.tsv.gz
zcat barcodes.tsv.gz | awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1}' | sort -t, -k 1b,1 > numbered_barcodes.csv
zcat features.tsv.gz | awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1,$2,$3}' | sort -t, -k 1b,1 > numbered_features.csv
# Skip the three header lines of matrix.mtx.gz, convert to CSV, and sort by feature index
zcat matrix.mtx.gz | tail -n +4 | awk -F " " 'BEGIN { OFS = "," }; {print $1,$2,$3}' | sort -t, -k 1b,1 > feature_sorted_matrix.csv
# Barcode-sorted copy (not read by the join below; removed with the other temp files)
sort -t, -k 2b,2 feature_sorted_matrix.csv > barcode_sorted_matrix.csv
# Use join to replace the line numbers with feature and barcode information
join -t, -1 1 -2 1 numbered_features.csv feature_sorted_matrix.csv | cut -d, -f 2,3,4,5,6 | sort -t, -k 4b,4 | join -t, -1 1 -2 4 numbered_barcodes.csv - | cut -d, -f 2,3,4,5,6 > final_matrix.csv
# Remove temp files
rm -f barcode_sorted_matrix.csv feature_sorted_matrix.csv numbered_barcodes.csv numbered_features.csv
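As a quick sanity check on the result, the number of rows in final_matrix.csv can be compared with the entry count declared in the matrix file. The following is a minimal sketch, not part of the original code: check_entry_count is a hypothetical helper name, and it assumes the first line of matrix.mtx.gz that does not start with "%" is the Matrix Market size line (rows, columns, number of non-zero entries). It uses gzip -dc, which behaves like zcat on gzip files.

```shell
# Hypothetical helper (not part of the original article): compare the entry
# count declared in the Matrix Market size line with the rows in the CSV.
# $1 = gzipped matrix file, $2 = CSV produced by the commands above
check_entry_count() {
    cec_mtx=$1
    cec_csv=$2
    # The size line is the first line that does not start with '%';
    # its third field is the number of non-zero entries.
    cec_expected=$(gzip -dc "$cec_mtx" | awk '!/^%/ { print $3; exit }')
    # Count the rows in the CSV (tr strips padding some wc versions add).
    cec_actual=$(wc -l < "$cec_csv" | tr -d ' ')
    if [ "$cec_expected" = "$cec_actual" ]; then
        echo "OK: $cec_actual entries"
    else
        echo "MISMATCH: header declares $cec_expected entries, CSV has $cec_actual rows"
        return 1
    fi
}
```

After defining the function, it could be run in the same directory as: check_entry_count matrix.mtx.gz final_matrix.csv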
The columns of the output file final_matrix.csv are as follows:
- 10x cell barcode
- Feature ID
- Feature name
- Feature type ("Gene Expression" or "Peaks")
- Count (UMIs for Gene Expression features, cut sites for Peaks features)
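Because the feature type is stored in its own column, the two modalities can be separated with a simple filter. The sketch below is illustrative only: filter_by_feature_type is a hypothetical helper name, and it reads the feature type as the second-to-last comma-separated field, an assumption that keeps it working even if a feature name happens to contain a comma.

```shell
# Hypothetical helper (not part of the original article): print only the
# rows whose feature type matches the requested value.
# $1 = CSV file, $2 = feature type, e.g. "Peaks" or "Gene Expression"
filter_by_feature_type() {
    # The feature type is the second-to-last field; the count is the last.
    awk -F, -v ftype="$2" '$(NF-1) == ftype' "$1"
}
```

For example, filter_by_feature_type final_matrix.csv "Gene Expression" > gex_only.csv would write only the gene expression rows to a new file (gex_only.csv is an arbitrary name chosen here).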
Here is a sample of what final_matrix.csv looks like:
...
AAACAGCCAAATATCC-1,chrX:78944247-78947291,chrX:78944247-78947291,Peaks,2
AAACAGCCAAATATCC-1,chrX:79095941-79097077,chrX:79095941-79097077,Peaks,1
AAACAGCCAAATATCC-1,chrX:79097784-79098177,chrX:79097784-79098177,Peaks,1
AAACAGCCAAATATCC-1,chrX:79151058-79153900,chrX:79151058-79153900,Peaks,2
AAACAGCCAAATATCC-1,chrX:79161015-79162418,chrX:79161015-79162418,Peaks,2
AAACAGCCAAATATCC-1,chrX:79175316-79176527,chrX:79175316-79176527,Peaks,2
AAACAGCCAAATATCC-1,ENSG00000000457,SCYL3,Gene Expression,1
AAACAGCCAAATATCC-1,ENSG00000001631,KRIT1,Gene Expression,1
AAACAGCCAAATATCC-1,ENSG00000002586,CD99,Gene Expression,2
AAACAGCCAAATATCC-1,ENSG00000002834,LASP1,Gene Expression,1
...
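Since every row carries its own barcode and feature type, the file can be summarized directly with standard tools. As a minimal sketch (sum_counts_per_barcode is a hypothetical helper name, not something provided by Cell Ranger ARC), the following totals the counts per barcode and feature type:

```shell
# Hypothetical helper (not part of the original article): total the counts
# for each (barcode, feature type) pair.
# $1 = CSV file in the five-column layout described above
sum_counts_per_barcode() {
    awk -F, '
        # Key on barcode (first field) and feature type (second-to-last field);
        # the count is the last field.
        { total[$1 "," $(NF-1)] += $NF }
        END { for (key in total) print key "," total[key] }
    ' "$1" | LC_ALL=C sort
}
```

Applied to only the ten sample rows shown above, this would print AAACAGCCAAATATCC-1,Gene Expression,5 followed by AAACAGCCAAATATCC-1,Peaks,10; totals for a complete file will differ.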
Disclaimer: This article and code snippet are provided for instructional purposes only. 10x Genomics does not support or guarantee the code.