Question: How do you decompress the 2-bit encoded barcode sequences in the molecule_info.h5 file?
Answer: The cell-barcode and UMI sequences are encoded with 2 bits per nucleotide, as described here.
Here is a sample Python function that converts an encoded number to a UMI sequence. It goes through the following steps:
- Convert the number to binary (base 2).
- Pad with leading zeros if necessary so the string always has 2 digits per nucleotide (2 x length digits in total).
- Split the binary string into chunks of 2 digits each.
- Convert each chunk to a nucleotide (00 = A, 01 = C, 10 = G, 11 = T).
#!/usr/bin/env python
'''
Parameter definitions:
    number: the integer stored for the UMI in molecule_info.h5
    length: length of the UMI sequence. This is typically 10 for 3'v2 or 5' chemistry, and 12 for 3'v3 chemistry
'''
def UMI_seq(number, length):
    # Binary string needs to be zero-padded to ensure consistent length
    format_string = "0" + str(2*length) + "b"
    binary_str = format(number, format_string)
    ### Given number = 322127, length = 10 ...
    ### binary_str = "01001110101001001111"

    # Split binary_str into chunks of 2
    nuc_list = [binary_str[i:i+2] for i in range(0, len(binary_str), 2)]
    ### with sample input above...
    ### nuc_list = ['01','00','11','10','10','10','01','00','11','11']

    # Translate each 2-bit chunk to its nucleotide
    seq = ""
    for chunk in nuc_list:
        if chunk == '00':
            seq += 'A'
        elif chunk == '01':
            seq += 'C'
        elif chunk == '10':
            seq += 'G'
        else:
            seq += 'T'
    ### with sample input above...
    ### seq = "CATGGGCATT"
    return seq

# example 10 bp UMI encoded as 322127 in molecule_info.h5
num = 322127
print(UMI_seq(num, 10))
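As a cross-check, the same decoding can be sketched with bit operations instead of string formatting, together with the reverse (sequence to number) conversion. The helper names decode_2bit and encode_2bit are hypothetical and are not part of any 10x Genomics tool. The sketch assumes the same mapping as the function above (A=0, C=1, G=2, T=3), with the first nucleotide stored in the most significant bits.

```python
# Hypothetical helpers for illustration only; not part of any 10x Genomics tool.
# Assumed mapping (same as UMI_seq above): A=0, C=1, G=2, T=3,
# with the first nucleotide held in the most significant bits.
NUCLEOTIDES = "ACGT"

def decode_2bit(number, length):
    """Decode an integer into a nucleotide string of the given length."""
    bases = []
    for _ in range(length):
        bases.append(NUCLEOTIDES[number & 0b11])  # read the lowest 2 bits
        number >>= 2                              # then discard them
    # The lowest bits hold the last nucleotide, so reverse the order
    return "".join(reversed(bases))

def encode_2bit(sequence):
    """Encode a nucleotide string (A/C/G/T only) as an integer."""
    number = 0
    for base in sequence:
        number = (number << 2) | NUCLEOTIDES.index(base)
    return number

print(decode_2bit(322127, 10))    # CATGGGCATT
print(encode_2bit("CATGGGCATT"))  # 322127
```

Encoding a decoded value and getting the original number back is a quick way to confirm that the chosen length matches the chemistry of the data.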
Disclaimer: This article and code snippet are provided for instructional purposes only. 10x Genomics does not support or guarantee the code.