Question: When and how to use OverrideCycles or --use-bases-mask options while demultiplexing?
Answer: Sometimes a library is sequenced with more cycles than the sequencing requirements for a 10x application recommend. For example, for dual index 3' gene expression libraries with v4 chemistry, the sequencing requirements specify read lengths of R1 = 28 bp, R2 = 90 bp, and I1 and I2 = 10 bp each. However, you may choose to sequence more cycles than recommended for any of the reads, such as R1 = 150, R2 = 150, and I1 and I2 = 10 bases each. In this case, you can ignore (mask) the extra cycles when demultiplexing. The instructions below show how to do this with each of the available demultiplexing tools.
BCL Convert
If you are using Illumina's newer demultiplexing software tool BCL Convert, you will need to know exactly how many cycles were run for each of your reads. First, locate the RunInfo.xml file and open it in a text editor to see the number of cycles sequenced for each read. For example:
<Read Number="1" NumCycles="150" IsIndexedRead="N" IsReverseComplement="N"/>
<Read Number="2" NumCycles="10" IsIndexedRead="Y" IsReverseComplement="N"/>
<Read Number="3" NumCycles="10" IsIndexedRead="Y" IsReverseComplement="Y"/>
<Read Number="4" NumCycles="150" IsIndexedRead="N" IsReverseComplement="N"/>
The NumCycles for the 4 reads in this configuration are R1 = 150 cycles, Index 1 (i7) = 10 cycles, Index 2 (i5) = 10 cycles, R2 = 150 cycles.
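If you prefer not to read the XML by eye, the cycle counts can be pulled out with a few lines of Python. This is a minimal sketch using only the standard library; the embedded XML is an illustrative fragment built from the Read entries above (the wrapping elements are an assumption about the file layout), and read_cycles is a hypothetical helper name, not part of any Illumina or 10x tool.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment built from the <Read> entries shown above.
# A real RunInfo.xml contains additional elements and attributes.
RUN_INFO_XML = """
<RunInfo>
  <Run>
    <Reads>
      <Read Number="1" NumCycles="150" IsIndexedRead="N" IsReverseComplement="N"/>
      <Read Number="2" NumCycles="10" IsIndexedRead="Y" IsReverseComplement="N"/>
      <Read Number="3" NumCycles="10" IsIndexedRead="Y" IsReverseComplement="Y"/>
      <Read Number="4" NumCycles="150" IsIndexedRead="N" IsReverseComplement="N"/>
    </Reads>
  </Run>
</RunInfo>
"""


def read_cycles(xml_text):
    """Return (read number, cycles, is_index) tuples sorted by read number."""
    root = ET.fromstring(xml_text.strip())
    reads = []
    # iter() finds <Read> elements at any depth, so the exact nesting
    # of the surrounding elements does not matter.
    for read in root.iter("Read"):
        reads.append((
            int(read.get("Number")),
            int(read.get("NumCycles")),
            read.get("IsIndexedRead") == "Y",
        ))
    return sorted(reads)


for number, cycles, is_index in read_cycles(RUN_INFO_XML):
    kind = "index" if is_index else "read"
    print(f"Read {number}: {cycles} cycles ({kind})")
```

To use it on a real run folder, read the file contents first (for example with open("RunInfo.xml").read()) and pass the text to read_cycles.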
In the BCL Convert [Settings] section of the sample sheet, there is an option for OverrideCycles. You will need to specify the read lengths to keep and then mask out or ignore the remaining cycles. For the read lengths in the example above, here is the setting to use:
[Settings]
OverrideCycles,Y28N122;I10;I10;Y90N60
In this notation, we expect there to be 4 reads, each separated by a semicolon (R1;I1;I2;R2). In R1, we keep 28 cycles with Y28 and ignore the remaining 122 cycles of the 150-cycle read with N122; the exact number of cycles to ignore must be specified. We keep all cycles of both index reads, Index 1 (i7) and Index 2 (i5), with I10. In R2, we keep 90 cycles with Y90 and ignore the remaining 60 cycles of the 150-cycle read with N60.
Please note that BCL Convert does not support wildcard entries with a * character, unlike bcl2fastq (explained below).
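Because no wildcard is available, the masked cycle counts have to be computed by subtraction. The sketch below shows that arithmetic for the example run; segment and override_cycles are hypothetical helper names, the default keep lengths of 28 and 90 come from the example above, and the segments are joined with semicolons, which is the delimiter given in Illumina's BCL Convert documentation.

```python
def segment(letter, total, keep=None):
    """Build one OverrideCycles segment, masking cycles beyond `keep` with N."""
    if keep is None or keep == total:
        return f"{letter}{total}"
    if keep > total:
        raise ValueError(f"cannot keep {keep} cycles from a {total}-cycle read")
    return f"{letter}{keep}N{total - keep}"


def override_cycles(r1_total, i1_total, i2_total, r2_total, r1_keep=28, r2_keep=90):
    """Return the OverrideCycles value for a dual-index run (R1;I1;I2;R2)."""
    parts = [
        segment("Y", r1_total, r1_keep),
        segment("I", i1_total),  # keep every index cycle
        segment("I", i2_total),
        segment("Y", r2_total, r2_keep),
    ]
    return ";".join(parts)


# Cycle counts taken from the RunInfo.xml example: 150, 10, 10, 150.
print("OverrideCycles," + override_cycles(150, 10, 10, 150))
# prints: OverrideCycles,Y28N122;I10;I10;Y90N60
```

If a read was sequenced to exactly the recommended length, the helper emits no N part for it, for example Y28 for a 28-cycle R1.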
For further information on recommended demultiplexing settings to generate FASTQ files, see the 10x Genomics support documentation page for "Generating FASTQs with BCL Convert (Illumina Software)". If you are using BaseSpace for demultiplexing, see the 10x Genomics article for more information on "How to setup a 10x Genomics sample sheet in Illumina's BaseSpace for demultiplexing".
Further information on OverrideCycles is available in Illumina's BCL Convert documentation.
In some cases, for example if you are using Illumina's BaseSpace for demultiplexing, you may see U, which means UMI, instead of Y. If a U is used, the sequence will be included in the read header (line 1 of the FASTQ) and the read quality information will be lost. Cell Ranger will only count the UMI if it is part of the FASTQ read sequence (line 2 of the FASTQ), so it is important that the sequence is also kept in the read. In Illumina's BaseSpace, the read length is specified with a U.
In this case, you will also need to include the additional setting TrimUMI,0, because the default behavior of BCL Convert is to trim the UMI from the read:
[Settings]
OverrideCycles,U28N122;I10;I10;Y90N60
TrimUMI,0
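Whichever letters are used (Y, U, I or N), each segment has to account for every cycle that was actually run for that read. A quick sanity check along these lines can catch arithmetic slips before a demultiplexing run is started. This is only a sketch: cycles_per_read and matches_run are hypothetical helper names, the parser is not BCL Convert's own, and it assumes the semicolon-delimited format given in Illumina's BCL Convert documentation.

```python
import re


def cycles_per_read(override_cycles):
    """Return the total cycles described by each semicolon-separated segment."""
    totals = []
    for segment in override_cycles.split(";"):
        tokens = re.findall(r"([YIUN])(\d+)", segment)
        # Rebuild the segment from its tokens to reject stray characters.
        rebuilt = "".join(letter + count for letter, count in tokens)
        if not tokens or rebuilt != segment:
            raise ValueError(f"unrecognized segment: {segment!r}")
        totals.append(sum(int(count) for _, count in tokens))
    return totals


def matches_run(override_cycles, run_cycles):
    """True if every segment covers exactly the cycles that were run."""
    return cycles_per_read(override_cycles) == list(run_cycles)


run_cycles = [150, 10, 10, 150]  # NumCycles values from RunInfo.xml
print(matches_run("U28N122;I10;I10;Y90N60", run_cycles))  # True
print(matches_run("U28N100;I10;I10;Y90N60", run_cycles))  # False: R1 covers only 128 cycles
```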
Reference: BCL Convert user guide
bcl2fastq or mkfastq
If you are using bcl2fastq or mkfastq, you can add this option to the command for the example above: --use-bases-mask=Y28n*,I10,I10,Y90n*
In the above notation, we expect there to be four reads, each separated by a comma, with the number indicating the desired length in bases. For the above example:
- Read 1 will have 28 bases, the first index will be 10 bp, the second index (if applicable) will be 10 bp, and Read 2 will have 90 bp in the final FASTQ output that is generated.
- 'Y' means yes (keep the cycle) and 'N' means no (ignore the cycle). N is not used for the index reads in this case because this is a dual-indexed library, so both index reads use I.
- 'n' marks bases to ignore, and '*' is a wildcard that extends to the end of the read. So 'Y28n*' means keep the first 28 bases and ignore everything after them until the end of the read.
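To make the wildcard concrete, the sketch below expands each '*' into the explicit count it stands for, given the cycles that were run. It illustrates the notation only and is not bcl2fastq's own parser; expand_mask is a hypothetical helper name, and treating a letter with no count as a single cycle is an assumption of this sketch.

```python
import re


def expand_mask(mask, run_cycles):
    """Replace '*' in a comma-separated bases mask with explicit cycle counts."""
    segments = mask.split(",")
    if len(segments) != len(run_cycles):
        raise ValueError("mask and run must describe the same number of reads")
    expanded = []
    for segment, total in zip(segments, run_cycles):
        # Each token is a letter followed by an optional count or '*'.
        tokens = re.findall(r"([YyNnIi])(\d+|\*)?", segment)
        if sum(1 for _, count in tokens if count == "*") > 1:
            raise ValueError(f"more than one wildcard in {segment!r}")
        # Cycles claimed by explicit counts (a bare letter counts as 1 here).
        fixed = sum(int(count or 1) for _, count in tokens if count != "*")
        remaining = total - fixed
        if remaining < 0:
            raise ValueError(f"{segment!r} needs more than {total} cycles")
        parts = []
        for letter, count in tokens:
            n = remaining if count == "*" else int(count or 1)
            if n > 0:
                parts.append(f"{letter}{n}")
        expanded.append("".join(parts))
    return ",".join(expanded)


print(expand_mask("Y28n*,I10,I10,Y90n*", [150, 10, 10, 150]))
# prints: Y28n122,I10,I10,Y90n60
```

The expanded counts (122 and 60) are the same numbers that have to be written out by hand for BCL Convert, which has no wildcard.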
Note: The above applies only to mkfastq and bcl2fastq.
Reference: bcl2fastq2 user guide