Question: I have just run some V(D)J data. Now I want to extract the sequences of the V/D/J/C gene segments for a productive chain. How can I do that?
Answer: The pipeline writes the fully assembled sequence of every contig listed in filtered_contig_annotations.csv to the filtered_contig.fasta file. However, it does not report the sequences of the individual gene segments within each contig.
You can get the start and end coordinates of each gene segment on the assembled contig from the all_contig_annotations.json file. This file contains per-contig annotations (similar to the all_contig_annotations.csv file) in JSON format, and it has many more annotations for each contig than its CSV counterpart.
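As a starting point, here is a minimal Python sketch for loading the file and picking out the contigs of interest. It assumes that all_contig_annotations.json is a JSON array of per-contig records and that each record carries the "contig_name" and "productive" fields seen in the example record; the function names are hypothetical helpers, not part of the pipeline.

```python
import json


def load_contig_records(path):
    """Read the annotations file (assumed to be a JSON array of per-contig records)."""
    with open(path) as handle:
        return json.load(handle)


def productive_contigs(records):
    """Keep only the records whose "productive" field is true."""
    return [rec for rec in records if rec.get("productive") is True]


def find_contig(records, contig_name):
    """Return the record with the given "contig_name", or None if it is absent."""
    for rec in records:
        if rec.get("contig_name") == contig_name:
            return rec
    return None
```

For example, `find_contig(load_contig_records("all_contig_annotations.json"), name)` would return the record for a single contig, provided the file sits in the current directory.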
For example, let's look at the contig AAACCTGCACACTGCG-1_contig_2 in this public dataset. Here is the record for this contig from the all_contig_annotations.json file. In each entry under "annotations", the contig_match_start and contig_match_end fields give the start and end positions of that gene along the contig:
{
"aa_sequence": "MAWTPLLLPLLTFCTVSEASYELTQPPSVSVSPGQTARITCSGDALPKNYAYWYQQKSGQAPVLVIYEDNKRPSEIPERFSGSSSGTVATLTISGAQVDDEADYYCYSTDSSYNHRVFGGGTKLTVLGQPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADSSPVKAGVETTTPSKQSNNKYAASSY",
"annotations": [
{
"annotation_length": 351,
"annotation_match_end": 347,
"annotation_match_start": 0,
"cigar": "57S347M248S",
"contig_match_end": 404,
"contig_match_start": 57,
"feature": {
"chain": "IGL",
"display_name": "IGLV3-10",
"feature_id": 655,
"gene_name": "IGLV3-10",
"region_type": "L-REGION+V-REGION"
},
"mismatches": [],
"score": 654
},
{
"annotation_length": 151,
"annotation_match_end": 151,
"annotation_match_start": 116,
"cigar": "404S35M213S",
"contig_match_end": 439,
"contig_match_start": 404,
"feature": {
"chain": "IGL",
"display_name": "IGLJ3",
"feature_id": 654,
"gene_name": "IGLJ3",
"region_type": "J-REGION"
},
"mismatches": [],
"score": 70
},
{
"annotation_length": 317,
"annotation_match_end": 211,
"annotation_match_start": 0,
"cigar": "439S211M2S",
"contig_match_end": 650,
"contig_match_start": 439,
"feature": {
"chain": "IGL",
"display_name": "IGLC2",
"feature_id": 324,
"gene_name": "IGLC2",
"region_type": "C-REGION"
},
"mismatches": [],
"score": 422
},
{
"annotation_length": 35,
"annotation_match_end": 35,
"annotation_match_start": 0,
"cigar": "22S35M595S",
"contig_match_end": 57,
"contig_match_start": 22,
"feature": {
"chain": "IGL",
"display_name": "IGLV3-10",
"feature_id": 346,
"gene_name": "IGLV3-10",
"region_type": "5'UTR"
},
"mismatches": [],
"score": 70
}
],
"barcode": "AAACCTGCACACTGCG-1",
"cdr3": "CYSTDSSYNHRVF",
"cdr3_seq": "TGTTACTCAACAGACAGCAGTTATAATCATAGGGTGTTC",
"cdr3_start": 372,
"cdr3_stop": 411,
"clonotype": null,
"contig_name": "AAACCTGCACACTGCG-1_contig_2",
"filtered": true,
"frame": null,
"high_confidence": true,
"info": {
"raw_clonotype_id": "clonotype29",
"raw_consensus_id": "clonotype29_consensus_1"
},
"is_cell": true,
"primer_annotations": [],
"productive": true,
"quals": "III]Y]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]",
"read_count": 4640,
"sequence": "CGGGCCTTTCTTATATGGGGCTGTGGGCTCAGGAGGCAGAGCTCTGGGAATCTCACCATGGCCTGGACCCCTCTCCTGCTCCCCCTCCTCACTTTCTGCACAGTCTCTGAGGCCTCCTATGAGCTGACACAGCCACCCTCGGTGTCAGTGTCCCCAGGACAAACGGCCAGGATCACCTGCTCTGGAGATGCATTGCCAAAAAACTATGCTTATTGGTACCAGCAGAAGTCAGGCCAGGCCCCTGTGCTGGTCATCTATGAGGACAACAAACGACCCTCCGAGATCCCTGAGAGATTCTCTGGCTCCAGCTCAGGGACAGTGGCCACCTTGACTATCAGTGGGGCCCAGGTGGACGATGAAGCTGACTATTACTGTTACTCAACAGACAGCAGTTATAATCATAGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAGGTCAGCCCAAGGCTGCCCCCTCGGTCACTCTGTTCCCGCCCTCCTCTGAGGAGCTTCAAGCCAACAAGGCCACACTGGTGTGTCTCATAAGTGACTTCTACCCGGGAGCCGTGACAGTGGCCTGGAAGGCAGATAGCAGCCCCGTCAAGGCGGGAGTGGAGACCACCACACCCTCCAAACAAAGCAACAACAAGTACGCGGCCAGCAGCTACC",
"start_codon_pos": 57,
"stop_codon_pos": null,
"umi_count": 33
},
From this record, we get the following information about the gene segments:
1) 5'UTR region, IGLV3-10, contig_match_start = 22, contig_match_end = 57
2) L-REGION+V-REGION, IGLV3-10, contig_match_start = 57, contig_match_end = 404
3) J-REGION, IGLJ3, contig_match_start = 404, contig_match_end = 439
4) C-REGION, IGLC2, contig_match_start = 439, contig_match_end = 650
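With these coordinates, the sequence of each segment can be cut out of the contig's "sequence" field. The sketch below assumes the coordinates are 0-based with an exclusive end, which is consistent with the record above: 404 - 57 = 347 matches the 347M in the V gene's cigar string, and position 57 is both the start of the L-REGION+V-REGION and the start_codon_pos. The function name is a hypothetical helper.

```python
def segment_sequences(record):
    """List each annotated segment with its gene, coordinates and nucleotide sequence."""
    contig_seq = record["sequence"]
    segments = []
    for ann in record["annotations"]:
        start = ann["contig_match_start"]
        end = ann["contig_match_end"]
        segments.append({
            "region_type": ann["feature"]["region_type"],
            "gene_name": ann["feature"]["gene_name"],
            "start": start,
            "end": end,
            # Python slices are 0-based and end-exclusive, like the coordinates (assumed).
            "sequence": contig_seq[start:end],
        })
    # Report segments in their order along the contig (5'UTR, V, J, C in the example).
    segments.sort(key=lambda seg: seg["start"])
    return segments
```

Applied to the record above, this yields slices of 35, 347, 35 and 211 nucleotides for the 5'UTR, L-REGION+V-REGION, J-REGION and C-REGION, and the L-REGION+V-REGION slice begins with the ATG start codon at position 57.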