Question: We used your Chromium technology to generate and assemble a de novo reference genome for a non-model species. However, while analyzing our assembly we found an 8-12 kb gap in the reference, which falls precisely within one of our regions of interest. Is there a way to fill in this gap using our existing data?
Answer: The process detailed below is an experimental workflow for local reassembly that uses Long Ranger together with Supernova. It may or may not work in your case, because Supernova would normally have assembled the region already if the data supported it.
This workflow describes how to perform barcode-guided local de novo assembly of linked-read data. The set of barcodes mapped to a locus is used to select a group of reads that is highly enriched for that locus but subject to much less mapping bias than individually mapped short reads would be, particularly when the true sequence deviates substantially from the reference (for example, through insertions).
The Long Ranger 'lariat' alignments to the reference genome are used to determine the set of barcodes covering a locus. All reads from these barcodes are passed to the Supernova assembler. A BAM file that is sorted and indexed by the BX (linked-read barcode) tag is used to efficiently fetch the set of reads belonging to each barcode.
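The barcode-selection step described above can be sketched in shell. This is a minimal illustration rather than a 10x-provided tool: the helper name extract_bx is hypothetical, and it assumes that each alignment record carries its barcode in a BX:Z: optional field and that a plain list with one barcode per line is the desired result.

```shell
# Hypothetical helper: read SAM records on stdin and print each distinct
# BX (barcode) tag value once, one per line. Records without a BX tag
# are skipped. Optional fields start at column 12 of a SAM record.
extract_bx() {
  awk -F'\t' '{
    for (i = 12; i <= NF; i++)
      if ($i ~ /^BX:Z:/) print substr($i, 6)   # drop the "BX:Z:" prefix
  }' | sort -u
}

# Example usage (file name and region are placeholders):
#   samtools view phased_possorted_bam.bam chr7:5012870-7791232 | extract_bx > bx-list.txt
```

Parsing tab-separated fields rather than searching the whole line avoids picking up the string "BX:Z:" if it happens to occur inside a read name or another tag's value.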
In the case of trying to fill a gap in an existing Supernova assembly, you would align the reads to your Supernova assembly using longranger align. Starting from version 2.2, longranger align uses the Lariat barcode-aware aligner. The reads aligned near your region of interest (gap) can be used to identify/"fish out" the barcodes corresponding to reads that would be used for local re-assembly with Supernova. The bamtofastq utility is used to generate the FASTQ files with the subset of reads tagged with your barcodes of interest from your BAM file. This FASTQ subset would then be used for a local assembly in Supernova.
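To pick up reads aligned near the gap, it can help to query a window somewhat wider than the gap itself, so that barcodes whose molecules cover the flanking sequence are included. A minimal sketch follows; the function name padded_region and the 50 kb default flank are assumptions chosen for illustration, not values taken from Long Ranger or Supernova, and the flank should be adjusted to your own data.

```shell
# Hypothetical helper: print a samtools-style region string
# (contig:start-end) covering a gap plus a flank on each side.
# The start is clamped to 1; the end is not clamped to the contig length.
padded_region() {
  contig=$1
  gap_start=$2
  gap_end=$3
  flank=${4:-50000}   # assumed default flank size in bp
  start=$(( gap_start - flank ))
  if [ "$start" -lt 1 ]; then
    start=1
  fi
  end=$(( gap_end + flank ))
  printf '%s:%d-%d\n' "$contig" "$start" "$end"
}

# Example (contig name and coordinates are placeholders):
#   padded_region scaffold_12 2000000 2010000
```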
If there is an approximate reference genome for your non-model organism that spans your region of interest, you could also try to use that instead of the Supernova assembly to identify the reads to use for local reassembly.
The diagram above depicts the workflow starting from a Long Ranger BAM to local assemblies with Supernova.
Requirements:
- A Long Ranger BAM file, phased_possorted_bam.bam, referred to as <phased_possorted_bam> below
- This BAM file comes with an index file, phased_possorted_bam.bai, which is also required
- A locus of interest.
- bxindex, bamtofastq and samtools are all provided with Long Ranger, and should all be on your PATH.
export PATH=/path/to/longranger-2.2.2:$PATH
source /path/to/longranger-2.2.2/sourceme.bash
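Once the environment is set up, a quick check that the required tools resolve on your PATH can save a failed run later. The snippet below is a small sketch; check_tools is a hypothetical helper name and not part of Long Ranger.

```shell
# Hypothetical helper: report any named tool that cannot be found on PATH.
# Returns 0 if every tool is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing from PATH: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   check_tools bxindex bamtofastq samtools || exit 1
```
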
Output File Description | Example Output Filename | Command | Notes |
---|---|---|---|
BX list in chosen locus (PMS2 gene in this example) | pms2-bx-list-37366.txt | locus_bx_list chr7:5012870-7791232 <phased_possorted_bam> > pms2-bx-list-37366.txt | Creates a list of barcodes from the chosen locus using the original phased_possorted_bam.bam file (requires the corresponding .bai file) |
BX sorted BAM file | bx-37366.bam | samtools sort -o bx-37366.bam -t BX -@ 8 -m 4G <phased_possorted_bam> | Sorts the original phased_possorted_bam.bam file by BX (barcode) tag to create a BX-sorted BAM file |
BX index | bx-37366.bam.bxi | bxindex bx-37366.bam | Creates a .bxi index file for the BX-sorted BAM file |
FASTQs path from BX list | fastqs_37366_pms2 | bamtofastq --bx-list=pms2-bx-list-37366.txt bx-37366.bam fastqs_37366_pms2 | Creates a FASTQ file from the BX-sorted BAM file (and associated .bxi index file) containing only the barcodes in the barcode list |
Supernova assembly from FASTQs | asm37366 | supernova run --fastqs=fastqs_37366_pms2/37366_MissingLibrary_1_CAU5LANXX/ --id=asm37366 --desc=pms2-37366 --localcores=7 --nopreflight | Performs de novo assembly using only the reads in the FASTQ subset |
Supernova pseudohap FASTA | asm37366-pseudohap.fasta.gz | supernova mkoutput --asmdir=asm37366/outs/assembly/ --outprefix=asm37366-pseudohap --style=pseudohap | Generates FASTA sequences from the assembly |
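
Before running the assembly step, it may be worth confirming that the FASTQ subset is not empty. The following is a rough sanity check, assuming gzip-compressed FASTQ files with four lines per record; count_fastq_reads is a hypothetical helper name, and the glob in the usage comment is an assumption about how the output files are named.

```shell
# Hypothetical helper: print "<file><TAB><read count>" for each
# gzip-compressed FASTQ file given, assuming 4 lines per record.
count_fastq_reads() {
  for fq in "$@"; do
    n=$(gzip -dc "$fq" | awk 'END { print NR / 4 }')
    printf '%s\t%s\n' "$fq" "$n"
  done
}

# Example usage (the glob pattern is an assumption about file naming):
#   count_fastq_reads fastqs_37366_pms2/*/*.fastq.gz
```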