How should I upload cell multiplexing data to a public database (e.g., GEO, SRA, ENA)? How should I analyze cell multiplexing data downloaded from public repositories?
When analyzing cell multiplexing data, Cell Ranger generates several output files that are split per sample. One of the per-sample output files is a BAM file that contains the reads assigned to that sample. Additionally, a separate BAM file contains the reads from background barcodes.
To submit the data for a sample from a cell multiplexing experiment to a public repository, upload the BAM file for each relevant sample. When uploading per-sample data, clearly state in the sample description the number of cells detected by
cellranger multi for each sample (as shown in the web summary). Cell Ranger needs all raw reads (both per-sample reads and background reads) for the cell calling step. Because the per-sample reads are only a subset of the total reads, cell calling is not reproducible when run on the partial FASTQs derived from a per-sample BAM file. Stating the cell count for each sample lets users of the data reproduce your results as closely as possible.
If you are submitting all the samples in a cell multiplexing experiment, you are also encouraged to submit the BAM file with background reads. Providing the background BAM file together with the per-sample BAM files allows others to reproduce the analysis exactly, for the reason stated above.
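As a rough sketch, the BAM files to upload can be collected from a cellranger multi run directory with a small shell function. The function name list_upload_bams and the RUN_DIR variable are hypothetical, and the folder layout in the comments is an assumption about the multi output structure, so check it against your own outs/ folder.

```shell
# List the BAM files to upload from a cellranger multi run directory.
# Assumed layout (verify against your own run):
#   outs/per_sample_outs/<sample_id>/count/sample_alignments.bam  (per sample)
#   outs/multi/count/unassigned_alignments.bam                    (background)
list_upload_bams() {
  run_dir="$1"
  for bam in "$run_dir"/outs/per_sample_outs/*/count/sample_alignments.bam \
             "$run_dir"/outs/multi/count/unassigned_alignments.bam; do
    # Skip paths that do not exist (for example, an unmatched glob pattern).
    if [ -f "$bam" ]; then
      echo "$bam"
    fi
  done
  return 0
}

# RUN_DIR is a hypothetical variable; it defaults to the current directory.
list_upload_bams "${RUN_DIR:-.}"
```

The cell count for each sample still has to be copied from the web summary into the sample description by hand; this sketch only locates the files.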
If you downloaded cell multiplexing data (per-sample BAM files) from a public repository and would like to run Cell Ranger on them, you can convert them to FASTQ files using the 10x Genomics
bamtofastq tool. The tool will output one folder of FASTQs for each library. See the 10x Genomics bamtofastq documentation for more information.
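The conversion step can be sketched as follows. The file and folder names (sample1_alignments.bam, sample1_fastqs) and the thread count are hypothetical placeholders, and the run helper is only an illustrative wrapper that prints the command unless DRY_RUN=0 is set.

```shell
# Illustrative wrapper (not part of any 10x Genomics tool): print the command
# by default; execute it only when DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical names: replace with the downloaded BAM and a new output folder.
bam="sample1_alignments.bam"
fastq_dir="sample1_fastqs"

# bamtofastq takes the input BAM and an output directory, and writes one
# subfolder of FASTQs per library found in the BAM.
run bamtofastq --nthreads=8 "$bam" "$fastq_dir"
```

Running the snippet as is only prints the command for review; set DRY_RUN=0 once the paths are correct.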
Below are two options for reanalyzing the data from public repositories using Cell Ranger pipelines.
(1) There is NO BAM file with background reads associated with the public dataset. The per-sample BAM files contain only the reads that were assigned to a sample in a
cellranger multi run. Cell Ranger needs all raw reads, including both the per-sample reads and the background reads, to reproduce the original analysis exactly, so without the background BAM file an exact reproduction is not possible. To replicate the original analysis as closely as possible, download the per-sample BAM file and run bamtofastq to generate the FASTQs. You can ignore the FASTQs from the multiplexing library and use only the gene expression FASTQs. Then run the
cellranger count pipeline with the
--force-cells option, specifying the number of cells reported for this sample in the original analysis. If the
--force-cells option is not specified, Cell Ranger may detect fewer cells than the original analysis did, and some data may be lost. See the Gene Expression Algorithm Overview for details on cell calling in Cell Ranger.
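A minimal sketch of option (1), assuming the gene expression FASTQs have already been produced by bamtofastq. The paths, the run ID, and the cell count of 5000 are hypothetical placeholders, and the run helper is an illustrative wrapper that only prints the command unless DRY_RUN=0 is set. Recent Cell Ranger releases may require additional arguments (for example --create-bam), so check cellranger count --help for your version.

```shell
# Illustrative wrapper (not part of Cell Ranger): print the command by
# default; execute it only when DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical values: replace with your own paths and the reported cell count.
transcriptome="/path/to/refdata"         # reference transcriptome folder
gex_fastqs="sample1_fastqs/gex_library"  # gene expression folder from bamtofastq
cells_reported=5000                      # cell count from the sample description

# FASTQs written by bamtofastq use the sample name "bamtofastq".
run cellranger count \
  --id=sample1_reanalysis \
  --transcriptome="$transcriptome" \
  --fastqs="$gex_fastqs" \
  --sample=bamtofastq \
  --force-cells="$cells_reported"
```

Only the gene expression folder is passed to --fastqs; the multiplexing library FASTQs are left out, as described above.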
(2) There IS a BAM file with background reads associated with the public dataset. After converting the BAM files to FASTQ files, you can run
cellranger multi in Cell Ranger version 6 or later with the provided tag information to reproduce the original analysis.
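A sketch of what option (2) might look like, written as a shell snippet that creates a multi config CSV and prints the command. All paths, sample IDs, and the CMO assignments (CMO301, CMO302) are hypothetical placeholders; the real tag information must come from the dataset's description. Each FASTQ folder produced by bamtofastq would need to be listed in the [libraries] section; how folders from several BAM files should be listed is not shown here, so consult the cellranger multi documentation.

```shell
# Illustrative wrapper (not part of Cell Ranger): print the command by
# default; execute it only when DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Write a minimal multi config CSV. Every path and ID below is a placeholder.
config="multi_config.csv"
cat > "$config" <<'EOF'
[gene-expression]
reference,/path/to/refdata

[libraries]
fastq_id,fastqs,feature_types
bamtofastq,/path/to/gex_fastqs,Gene Expression
bamtofastq,/path/to/cmo_fastqs,Multiplexing Capture

[samples]
sample_id,cmo_ids
sample1,CMO301
sample2,CMO302
EOF

run cellranger multi --id=multiplex_reanalysis --csv="$config"
```

The [samples] section is where the provided tag information goes: one row per sample, with the tag or tags that were assigned to it in the original experiment.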