SureCell / ddSeq support #42

Open
AskPascal opened this issue Aug 9, 2018 · 7 comments

@AskPascal

Hej,

I just got some data generated with SureCell libraries on a ddSeq machine (i.e. the protocol by Illumina and Bio-Rad). I would like to test your pipeline for the analysis, but I'm not sure whether it can be used and, if so, how to fill in the config.yaml.
Barcodes are in Read 1, however, they are not at a fixed position, and the cell barcode is split into three parts by spacer sequences:

[Image: single-cell-rna-algorithm-tech-note-1070-2016-015]

Below is a small example from the first read fastq file of one of my samples.

Is it possible to process this data with dropSeqPipe?

Cheers


@D00457:259:HKWJNBCX2:1:1105:1128:2079 1:N:0:CCTAAGAC
CTCGGCGTTAGCCATCGCATTGCGGATTGTACCTCTGAGCTGAATCGCCTACGTCCCCGGAGACCNNT
+
<DDD0<CFHHHIIIIIIIIIIIIIIIHIHIGHHHIHHHGHFHHHIHHHIIIIHIIIIIEHHHIII##<
@D00457:259:HKWJNBCX2:1:1105:1168:2089 1:N:0:CCTAAGAC
AATGGAGTAGCCATCGCATTGCACCTTCTACCTCTGAGCTGAAGAAATAACGCCTACGAAGACTTNNT
+
<<<D01<<D1ECH?F0=CEE?<1DG@<1CGEH@HHHHIIHGEGCGEHFHIHGHHHHHIEHHHHEF##<
@D00457:259:HKWJNBCX2:1:1105:1122:2104 1:N:0:CCTAAGAC
ACCCAATAGCCATCGCATTGCCCGTAATACCTCTGAGCTGAATAAGCTACGAAACTGTGGACTTTNNT
+
0<DDDIHHIIEEHHGHIIEHIFDGHHHIIIHIIIH?GHHIIH1<FH1FGHIGHIIHIFHIHE@FH##<
@D00457:259:HKWJNBCX2:1:1105:1102:2126 1:N:0:CCTAAGAC
TTCGTAGAGGTAGCCATCGCATTGCTGAGACTACCTCTGAGCTGAACTCAATACGCTTCGAGCGANNT
+
0<<DBDHHHFCFHEGHIHIHIIIIHHIHGEHIHHIHIHIHI?1<1GHHIHIIIIIGIIGHHGHIH##<
@D00457:259:HKWJNBCX2:1:1105:1158:2127 1:N:0:CCTAAGAC
ACATAGATAGCCATCGCATTGCTAATAGTACCTCTGAGCTGAAGCGAATACGTCCCCCCTGACTTNNT
+
@@B@0<CEGHIIHHI=GEEHCGHEHHEEHHIHFHCHEHCHIHIIHIHIIHHHHI0EHHIII?@1<##<

@Hoohm
Owner

Hoohm commented Aug 9, 2018

Hello @pascal-git

As of now, it is definitely not compatible. The split barcode pattern is not the big issue here: you could pass the positions of the three parts as arguments and it should work, although I haven't tried it.

The main issue is the shift in base on the first read.

Right now the barcodes are picked by fixed base positions in R1, so they can't be shifted. One way to overcome this would be to first "deshift" R1, then run dropSeqPipe.
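A minimal Python sketch of the "deshift" idea, not part of dropSeqPipe. It assumes the first linker is `TAGCCATCGCATTGC` (taken from the sed pattern posted later in this thread) and that the first barcode block is the 6 bases directly in front of it. The name `deshift_read` is hypothetical, and only exact linker matches are handled.

```python
# Assumption: first linker sequence, taken from the sed pattern later in this thread.
LINKER1 = "TAGCCATCGCATTGC"
# Assumption: the first barcode block is the 6 bases directly before the linker.
BC_LEN = 6


def deshift_read(seq, qual, linker=LINKER1, bc_len=BC_LEN):
    """Drop the variable-length prefix so the first barcode block starts at index 0.

    Returns the (seq, qual) pair trimmed by the same amount, or None if the
    linker is not found or fewer than bc_len bases precede it.
    """
    pos = seq.find(linker)
    if pos < bc_len:  # also covers pos == -1 (linker not found)
        return None
    start = pos - bc_len
    return seq[start:], qual[start:]


# Usage with the first read of the example above (quality string shortened to a dummy)
seq = "CTCGGCGTTAGCCATCGCATTGCGGATTGTACCTCTGAGCTGAATCGCCTACGTCCCCGGAGACCNNT"
qual = "I" * len(seq)
print(deshift_read(seq, qual))
```

Reads for which this returns None would have to be dropped from both R1 and R2 to keep the pairs in sync.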

Although I don't have the time to try it out now, it might be a good idea to change the way I find the barcode and UMI and use an approach similar to umi-tools, which I recommend you try out.

This construct seems overly complicated though. Do you know what the advantages are over 10x, for example?

@AskPascal
Author

AskPascal commented Aug 10, 2018

Hej @Hoohm

No, I am also still wondering what would be better with this approach than the 10x way. Not having all barcodes at the same position might hedge against systematic biases in sequencing, maybe? Or it's just an intellectual property thing...

Thanks for clarifying what the problem would be in getting the data into dropSeqPipe and how to potentially solve it. umi-tools certainly looks interesting. However, I found another tool yesterday, umis, which even has example code for SureCell / ddSeq, and in my preliminary tests it looks promising. I might use it in combination with dropSeqPipe or standalone...

@Hoohm
Owner

Hoohm commented Aug 10, 2018

Hey @pascal-git
I've come up with a small script that should be able to handle funky barcode structures.

You can check it out here

I'm working on a new version of dropSeqPipe (see the develop branch), which is going to use cutadapt instead of trimmomatic. The main reason is to report adapter presence in both R1 and R2, instead of only in the trimmed R2 as it is now.

To do this, I'm also changing a lot in the filtering. I'm trimming R1 and R2 separately and repairing them after trimming. This cuts down running time and gives more insight into potential problems with the protocol.
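To illustrate what repairing after separate trimming involves, here is a rough Python sketch. It is not necessarily how dropSeqPipe does it. It assumes each read is a `(header, seq, qual)` tuple and that both mates share the header text up to the first space; `read_id` and `repair` are hypothetical names.

```python
def read_id(header):
    """Return the part of a FASTQ header shared by both mates.

    Assumption: the ID is the text before the first whitespace, optionally
    followed by an old-style /1 or /2 suffix, which is removed.
    """
    name = header.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def repair(r1_records, r2_records):
    """Yield (r1, r2) pairs for reads that survived trimming in both files.

    Pairs come out in R1 order. All R2 records are held in memory, so this is
    only meant for small inputs.
    """
    r2_by_id = {read_id(rec[0]): rec for rec in r2_records}
    for rec in r1_records:
        mate = r2_by_id.get(read_id(rec[0]))
        if mate is not None:
            yield rec, mate
```

Reads that lost their mate during trimming are simply dropped here; a real pipeline might write them to a separate file instead.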

Since I'm not depending on dropseq tools for this first part anymore, I'm capturing barcodes differently. This would make it easier for me to allow for fancy barcode structures.

So, keep checking, your protocol might be compatible in one month or so.

@AskPascal
Author

Hej @Hoohm

This is really interesting. I'll keep my eyes open for the new version then!

Hoohm self-assigned this Oct 9, 2018

@Hoohm
Owner

Hoohm commented Nov 28, 2018

As you can see, this is not implemented yet.

Sadly, I haven't found the time to work on it, since this is not a technology we use.

I hope someone can help out with integrating a universal cell barcode structure module.

@TomKellyGenetics

TomKellyGenetics commented Mar 19, 2020

I've written a sed solution to extract the barcodes and UMI from R1. This will return a Read1 with an 18bp barcode and 8bp UMI.

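Read1s=("Sample_S1_L001_R1_001.fastq" "Sample_S1_L002_R1_001.fastq")
Read2s=("Sample_S1_L001_R2_001.fastq" "Sample_S1_L002_R2_001.fastq")  # not modified here

# Remove phase blocks and linkers from SureCell R1 reads.
# Note: each input file is overwritten in place.
for File in "${Read1s[@]}"; do
    sed -E '
        # Sequence line: keep the three 6 bp barcode blocks and the 8 bp UMI
        /.*(.{6})TAGCCATCGCATTGC(.{6})TACCTCTGAGCTGAA(.{6})ACG(.{8})GAC/ {
        s/.*(.{6})TAGCCATCGCATTGC(.{6})TACCTCTGAGCTGAA(.{6})ACG(.{8})GAC.*/\1\2\3\4/g
        # Skip the "+" line and move on to the quality line
        n
        n
        # Quality line: cut by position only; the greedy .* aligns this to the
        # end of the line, so qualities can be offset if bases follow GAC
        s/.*(.{6}).{15}(.{6}).{15}(.{6}).{3}(.{8}).{3}/\1\2\3\4/g
        }' "$File" > .temp
    mv .temp "$File"
done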
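For comparison, a Python sketch of the same extraction. It reuses the linker sequences from the sed pattern above, and slices the quality string at the positions where the sequence matched, so bases and qualities stay aligned whatever comes before or after the construct. The names `SURECELL` and `extract_barcode_umi` are hypothetical, and unlike the sed version it takes the first match in the read and requires exact linker matches.

```python
import re

# Same construct as in the sed pattern: three 6-base barcode blocks separated
# by two 15-base linkers, then ACG, an 8-base UMI and GAC.
SURECELL = re.compile(
    r"(?P<bc1>.{6})TAGCCATCGCATTGC"
    r"(?P<bc2>.{6})TACCTCTGAGCTGAA"
    r"(?P<bc3>.{6})ACG"
    r"(?P<umi>.{8})GAC"
)
PARTS = ("bc1", "bc2", "bc3", "umi")


def extract_barcode_umi(seq, qual):
    """Return (barcode + UMI bases, matching qualities), or None if no match.

    The quality string is cut at the same coordinates as the sequence match.
    """
    m = SURECELL.search(seq)
    if m is None:
        return None
    new_seq = "".join(m.group(p) for p in PARTS)
    new_qual = "".join(qual[m.start(p):m.end(p)] for p in PARTS)
    return new_seq, new_qual
```

As with the sed version, the result is an 18 bp barcode followed by an 8 bp UMI. Reads without a match return None and would need to be handled together with their R2 mate.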

@Hoohm
Owner

Hoohm commented Mar 21, 2020

Thank you @TomKellyGenetics !

I'm gonna add this to the documentation :)
