Question: I have just run some VDJ data. Now I want to extract the sequences of the V/D/J/C gene segments for a productive chain. How can I do that?
Answer: The pipeline does not output separate sequences for each gene segment of a contig (transcript). However, it writes the full assembled sequence of every contig reported in filtered_contig_annotations.csv to the filtered_contig.fasta file. You can get the start and end coordinates of each gene segment on the assembled contig from the all_contig_annotations.json file. This file contains per-contig annotations just like the all_contig_annotations.csv file, but it is in JSON format and, more importantly, it has many more annotations for each contig than the corresponding CSV file.
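If you want the full contig sequences from the FASTA file, the following is a minimal sketch of a reader. It assumes that each header line begins with the contig name (for example ">AAACCTGCACACTGCG-1_contig_2") and that the file fits in memory. The helper names parse_fasta and load_contigs are hypothetical and are not part of the pipeline.

```python
def parse_fasta(lines):
    """Parse FASTA text (any iterable of lines) into {record_id: sequence}."""
    sequences = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                sequences[name] = "".join(chunks)
            # Assumption: the ID is the first whitespace-delimited token.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def load_contigs(fasta_path):
    """Read a FASTA file such as filtered_contig.fasta from disk."""
    with open(fasta_path) as handle:
        return parse_fasta(handle)
```

Calling load_contigs("filtered_contig.fasta") would then return a dictionary from which a single contig sequence can be looked up by name.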
For example, in one of our public datasets, consider contig AAACCTGCACACTGCG-1_contig_2. You can find the start and end coordinates of the V/D/J/C genes for this contig in the all_contig_annotations.json file. The record below is for contig AAACCTGCACACTGCG-1_contig_2. In each entry of "annotations", the contig_match_start and contig_match_end fields give the start and end positions of that gene along the contig.
{
"aa_sequence": "MAWTPLLLPLLTFCTVSEASYELTQPPSVSVSPGQTARITCSGDALPKNYAYWYQQKSGQAPVLVIYEDNKRPSEIPERFSGSSSGTVATLTISGAQVDDEADYYCYSTDSSYNHRVFGGGTKLTVLGQPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADSSPVKAGVETTTPSKQSNNKYAASSY",
"annotations": [
{
"annotation_length": 351,
"annotation_match_end": 347,
"annotation_match_start": 0,
"cigar": "57S347M248S",
"contig_match_end": 404,
"contig_match_start": 57,
"feature": {
"chain": "IGL",
"display_name": "IGLV3-10",
"feature_id": 655,
"gene_name": "IGLV3-10",
"region_type": "L-REGION+V-REGION"
},
"mismatches": [],
"score": 654
},
{
"annotation_length": 151,
"annotation_match_end": 151,
"annotation_match_start": 116,
"cigar": "404S35M213S",
"contig_match_end": 439,
"contig_match_start": 404,
"feature": {
"chain": "IGL",
"display_name": "IGLJ3",
"feature_id": 654,
"gene_name": "IGLJ3",
"region_type": "J-REGION"
},
"mismatches": [],
"score": 70
},
{
"annotation_length": 317,
"annotation_match_end": 211,
"annotation_match_start": 0,
"cigar": "439S211M2S",
"contig_match_end": 650,
"contig_match_start": 439,
"feature": {
"chain": "IGL",
"display_name": "IGLC2",
"feature_id": 324,
"gene_name": "IGLC2",
"region_type": "C-REGION"
},
"mismatches": [],
"score": 422
},
{
"annotation_length": 35,
"annotation_match_end": 35,
"annotation_match_start": 0,
"cigar": "22S35M595S",
"contig_match_end": 57,
"contig_match_start": 22,
"feature": {
"chain": "IGL",
"display_name": "IGLV3-10",
"feature_id": 346,
"gene_name": "IGLV3-10",
"region_type": "5'UTR"
},
"mismatches": [],
"score": 70
}
],
"barcode": "AAACCTGCACACTGCG-1",
"cdr3": "CYSTDSSYNHRVF",
"cdr3_seq": "TGTTACTCAACAGACAGCAGTTATAATCATAGGGTGTTC",
"cdr3_start": 372,
"cdr3_stop": 411,
"clonotype": null,
"contig_name": "AAACCTGCACACTGCG-1_contig_2",
"filtered": true,
"frame": null,
"high_confidence": true,
"info": {
"raw_clonotype_id": "clonotype29",
"raw_consensus_id": "clonotype29_consensus_1"
},
"is_cell": true,
"primer_annotations": [],
"productive": true,
"quals": "III]Y]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]",
"read_count": 4640,
"sequence": "CGGGCCTTTCTTATATGGGGCTGTGGGCTCAGGAGGCAGAGCTCTGGGAATCTCACCATGGCCTGGACCCCTCTCCTGCTCCCCCTCCTCACTTTCTGCACAGTCTCTGAGGCCTCCTATGAGCTGACACAGCCACCCTCGGTGTCAGTGTCCCCAGGACAAACGGCCAGGATCACCTGCTCTGGAGATGCATTGCCAAAAAACTATGCTTATTGGTACCAGCAGAAGTCAGGCCAGGCCCCTGTGCTGGTCATCTATGAGGACAACAAACGACCCTCCGAGATCCCTGAGAGATTCTCTGGCTCCAGCTCAGGGACAGTGGCCACCTTGACTATCAGTGGGGCCCAGGTGGACGATGAAGCTGACTATTACTGTTACTCAACAGACAGCAGTTATAATCATAGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAGGTCAGCCCAAGGCTGCCCCCTCGGTCACTCTGTTCCCGCCCTCCTCTGAGGAGCTTCAAGCCAACAAGGCCACACTGGTGTGTCTCATAAGTGACTTCTACCCGGGAGCCGTGACAGTGGCCTGGAAGGCAGATAGCAGCCCCGTCAAGGCGGGAGTGGAGACCACCACACCCTCCAAACAAAGCAACAACAAGTACGCGGCCAGCAGCTACC",
"start_codon_pos": 57,
"stop_codon_pos": null,
"umi_count": 33
},
The record above gives the following information about the gene segments:
1) 5'UTR region, IGLV3-10, contig_match_start = 22, contig_match_end = 57
2) L-REGION+V-REGION, IGLV3-10, contig_match_start = 57, contig_match_end = 404
3) J-REGION, IGLJ3, contig_match_start = 404, contig_match_end = 439
4) C-REGION, IGLC2, contig_match_start = 439, contig_match_end = 650
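The coordinates listed above can be turned into segment sequences by slicing the "sequence" field of the same record. In the example record, the V annotation ends at 404 exactly where the J annotation begins, and cdr3_start = 372 to cdr3_stop = 411 spans 39 bases, the length of cdr3_seq. This suggests 0-based, end-exclusive coordinates that map directly onto a Python slice. The sketch below relies on that reading and also assumes that all_contig_annotations.json is a JSON array of records shaped like the one shown. The function names are hypothetical.

```python
import json


def extract_segments(record):
    """Return (region_type, gene_name, subsequence) tuples for one contig record."""
    seq = record["sequence"]
    segments = []
    # Sort by start position so segments come out in 5' to 3' order.
    ordered = sorted(record["annotations"], key=lambda a: a["contig_match_start"])
    for ann in ordered:
        start = ann["contig_match_start"]
        end = ann["contig_match_end"]
        feature = ann["feature"]
        # Assumption: coordinates are 0-based and end-exclusive.
        segments.append((feature["region_type"], feature["gene_name"], seq[start:end]))
    return segments


def productive_segments(json_path):
    """Map contig_name to its segments for every productive contig in the file."""
    with open(json_path) as handle:
        records = json.load(handle)  # assumed to be a list of contig records
    return {
        rec["contig_name"]: extract_segments(rec)
        for rec in records
        if rec.get("productive") is True
    }
```

For the example record, extract_segments would return four entries in contig order: the 5'UTR (22 to 57), the L-REGION+V-REGION (57 to 404), the J-REGION (404 to 439) and the C-REGION (439 to 650). Any D-REGION annotation present in a record is handled the same way.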