How is the MEX format used for the gene-barcode matrices?
Question: How is the MEX file format used for the gene-barcode matrices?
Answer: The Market Exchange (MEX) format is used to represent the gene-barcode matrix output by Cell Ranger. It is a sparse matrix format, used because the matrix of UMI counts for each barcode/gene pair is very large (~35K genes by hundreds of thousands of barcodes) and most entries are 0.
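To give a rough sense of why a sparse format helps, the arithmetic below compares the number of entries in a dense matrix with the number of triplets a sparse file would store. The barcode count and the share of non-zero entries are assumptions chosen only for illustration; they are not measured values from any dataset.

```python
# Hypothetical sizes, chosen only to illustrate the scale described above.
n_genes = 35_000          # "~35K genes" from the text
n_barcodes = 300_000      # assumed value for "hundreds of thousands of barcodes"
nonzero_percent = 2       # assumed share of non-zero entries, not a measured value

# A dense matrix stores one value per gene/barcode pair, zeros included.
dense_entries = n_genes * n_barcodes

# A sparse file stores one (gene index, barcode index, count) line per non-zero entry.
sparse_entries = dense_entries * nonzero_percent // 100

print(f"dense matrix entries:   {dense_entries:,}")
print(f"sparse triplets stored: {sparse_entries:,}")
```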
The sparse matrix output consists of three files: matrix.mtx, genes.tsv, and barcodes.tsv. Each file is described below:
matrix.mtx: The top three lines of this file are header lines. The third line gives three totals: the number of genes (rows in genes.tsv), the number of barcodes (rows in barcodes.tsv), and the number of non-zero entries (the data lines that follow in matrix.mtx).
The next lines (line number 4 onwards) have three columns:
- The first column is the "gene id" index.
- The second column is the "cell id" index.
- The third column is the total UMI count for that gene and cell combination.
The 'gene id' and 'cell id' indices correspond to line numbers in the genes.tsv and barcodes.tsv files, respectively. Indices in the MEX file are 1-based.
For example, if a line in the mtx file is 154 1 21, this indicates:
- The gene at line number 154 in genes.tsv.
- The cell barcode at line number 1 in barcodes.tsv.
- A UMI count of 21 for that gene and barcode combination.
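The lookup described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the way Cell Ranger itself reads the file. The sample matrix text, the gene names, the barcode names, and the helper `read_mex` are all made up for the example.

```python
import io

# Made-up sample in the layout described above; not real Cell Ranger output.
SAMPLE_MTX = """%%MatrixMarket matrix coordinate integer general
%
3 2 4
1 1 5
3 1 2
2 2 7
3 2 1
"""
GENES = ["GENE_A", "GENE_B", "GENE_C"]   # stands in for the rows of genes.tsv
BARCODES = ["BARCODE_1", "BARCODE_2"]    # stands in for the rows of barcodes.tsv


def read_mex(handle):
    """Return (n_genes, n_barcodes, n_entries, triplets) from an .mtx text stream."""
    size = None
    triplets = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and '%' header lines
        fields = line.split()
        if size is None:
            # First non-'%' line: gene count, barcode count, entry count.
            size = tuple(int(x) for x in fields)
            continue
        gene_idx, barcode_idx, count = (int(x) for x in fields)
        triplets.append((gene_idx, barcode_idx, count))
    n_genes, n_barcodes, n_entries = size
    return n_genes, n_barcodes, n_entries, triplets


n_genes, n_barcodes, n_entries, triplets = read_mex(io.StringIO(SAMPLE_MTX))

# Indices are 1-based in the file, so subtract 1 for Python's 0-based lists.
resolved = [(GENES[g - 1], BARCODES[b - 1], count) for g, b, count in triplets]
for gene, barcode, count in resolved:
    print(gene, barcode, count)
```

The only step that is easy to get wrong is the index shift: an index of 1 in matrix.mtx refers to the first line of genes.tsv or barcodes.tsv, which is element 0 of a Python list.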
genes.tsv: This file contains all annotated genes, one gene per row. The first column is the gene_id and the second column is the gene name.
barcodes.tsv: This file contains the barcodes represented in the mtx file. The folder filtered_gene_bc_matrices contains only the barcodes that were called as cells. The folder raw_gene_bc_matrices contains all valid barcodes (that is, all barcodes that came from a barcode inclusion list).
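As a sketch of how the two tab-separated files could be read, the snippet below parses them with Python's standard library. The sample contents and the helper names `read_genes`, `read_barcodes`, and `lookup` are hypothetical; real files would be opened from the filtered_gene_bc_matrices or raw_gene_bc_matrices folder instead of from in-memory strings.

```python
import csv
import io

# Made-up file contents in the layouts described above.
SAMPLE_GENES_TSV = "ID_0001\tGENE_A\nID_0002\tGENE_B\n"
SAMPLE_BARCODES_TSV = "BARCODE_1\nBARCODE_2\nBARCODE_3\n"


def read_genes(handle):
    """Return a list of (gene_id, gene_name) tuples, one per row of genes.tsv."""
    return [(row[0], row[1]) for row in csv.reader(handle, delimiter="\t") if row]


def read_barcodes(handle):
    """Return a list of barcode strings, one per non-empty row of barcodes.tsv."""
    return [line.strip() for line in handle if line.strip()]


def lookup(items, index_1_based):
    """Translate a 1-based index from matrix.mtx into the matching file row."""
    return items[index_1_based - 1]


genes = read_genes(io.StringIO(SAMPLE_GENES_TSV))
barcodes = read_barcodes(io.StringIO(SAMPLE_BARCODES_TSV))

print(lookup(genes, 2))      # second row of genes.tsv
print(lookup(barcodes, 3))   # third row of barcodes.tsv
```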
More information about these files is available in the Gene-Barcode Matrices documentation.