
Filter fastq file (performance) #22

Open
Maarten-vd-Sande opened this issue Nov 29, 2020 · 1 comment
Maarten-vd-Sande commented Nov 29, 2020

I have a paired-end single-cell RNA-seq dataset. R1 contains the reads, and R2 contains the barcodes needed to identify which cell each read belongs to. If I quality-trim only R1 to keep high-quality reads, my R1 and R2 fall out of sync.

It seems like pyfastx can solve this problem for me, by keeping only the reads in R2 that still have a mate in R1:

import gzip

import pyfastx

# the trimmed R1 file and the matching original R2 file
# (paths here are placeholders)
reads = pyfastx.Fastq(r1_trimmed_path)
barcodes = pyfastx.Fastq(r2_path)

with gzip.open(output, 'wt') as f:
    for read in reads:
        barcode = barcodes[read.id]  # random access into R2 by id
        f.write(barcode.raw)

However, from a really sloppy benchmark, just getting barcode.raw takes around 0.0015 seconds per read, while the lookup of the read itself is fast (~1e-6 s). This would mean I have to wait two days to filter my fastq. Is there an easier/better/faster way of doing this?

lmdu (Owner) commented Dec 12, 2020

In your case, barcodes[read.id] performs random access: it fetches the read with the given id from the file. Each access first retrieves the read's location from the index file (a SQLite database), which can be very slow when processing large numbers of reads.
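To get a feel for why per-read index lookups dominate the runtime, here is a toy comparison (illustrative only; this is not pyfastx's actual index schema) of querying a SQLite table once per read versus a plain in-memory dict:

```python
import sqlite3
import time

# Toy stand-in for a read index: name -> file offset.
n = 100_000
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE idx (name TEXT PRIMARY KEY, offset INTEGER)')
con.executemany('INSERT INTO idx VALUES (?, ?)',
                ((f'read{i}', i * 100) for i in range(n)))

# One SQLite query per read, as random access does.
t0 = time.perf_counter()
cur = con.cursor()
for i in range(n):
    cur.execute('SELECT offset FROM idx WHERE name = ?', (f'read{i}',))
    cur.fetchone()
sqlite_s = time.perf_counter() - t0

# The same lookups against an in-memory dict.
lookup = {f'read{i}': i * 100 for i in range(n)}
t0 = time.perf_counter()
for i in range(n):
    lookup[f'read{i}']
dict_s = time.perf_counter() - t0

print(f'sqlite: {sqlite_s:.3f}s  dict: {dict_s:.3f}s')
```

Even with an in-memory database, per-query overhead is orders of magnitude above a dict lookup; with an on-disk index plus decompression per record (the ~1.5 ms/read you measured), the gap is larger still.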

I am a little confused about why you use read.id rather than read.name to extract reads. Using id implies that the reads in these two files are in the same order.

If you use name to extract reads from the other file, you can use multiple threads to speed this up.
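If the two files really do share read names and differ only in the records dropped by trimming, a single streaming pass avoids random access entirely: collect the names kept in trimmed R1, then filter R2 in one scan. A rough pure-Python sketch of that idea (file paths and helper names are illustrative, not pyfastx API):

```python
import gzip
from itertools import islice

def fastq_records(handle):
    """Yield (name, record_text) for each 4-line FASTQ record."""
    while True:
        lines = list(islice(handle, 4))
        if not lines:
            break
        # Read name is the header up to the first space, minus '@'.
        name = lines[0].split()[0].lstrip('@')
        yield name, ''.join(lines)

def filter_r2(r1_path, r2_path, out_path):
    # Pass 1: names surviving in the trimmed R1.
    with gzip.open(r1_path, 'rt') as r1:
        kept = {name for name, _ in fastq_records(r1)}
    # Pass 2: stream R2, writing only records whose mate survived.
    with gzip.open(r2_path, 'rt') as r2, gzip.open(out_path, 'wt') as out:
        for name, record in fastq_records(r2):
            if name in kept:
                out.write(record)
```

The kept-name set holds every surviving R1 name in memory, which can be sizeable for very large runs, but each record then costs only one hash lookup instead of a per-read index query.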
