
Filter fastq file (performance) #22

Open
Maarten-vd-Sande opened this issue Nov 29, 2020 · 1 comment
Maarten-vd-Sande commented Nov 29, 2020

I have a paired-end single-cell RNA-seq dataset. R1 contains the reads, and R2 contains the barcodes needed to identify which cell each read belongs to. If I quality-trim only R1 to keep high-quality reads, my R1 and R2 fall out of sync.

It seems like pyfastx can solve this problem for me, by keeping only the reads in R2 that still have a mate in R1:

import gzip

import pyfastx

# the trimmed R1 file and the matching original R2 file
# (paths here are placeholders)
reads = pyfastx.Fastq(r1_trimmed_path)
barcodes = pyfastx.Fastq(r2_path)

with gzip.open(output, 'wt') as f:
    for read in reads:
        barcode = barcodes[read.id]  # random access into R2 by id
        f.write(barcode.raw)

However, from a really sloppy benchmark, just getting barcode.raw takes around 0.0015 seconds per read, while the lookup of the read itself is fast (~1e-6 s). This would mean I have to wait two days to filter my fastq. Is there an easier/better/faster way of doing this?

lmdu (Owner) commented Dec 12, 2020

In your case, barcodes[read.id] performs random access: it fetches the read with the given id from the file. Each access first retrieves the read's location from the index file (a SQLite database), which can be very slow when processing large numbers of reads.
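To get a feel for why per-read index lookups dominate the runtime, here is a toy comparison (illustrative only; this is not pyfastx's actual index schema) of querying a SQLite table once per read versus a plain in-memory dict:

```python
import sqlite3
import time

# Toy stand-in for a read index: name -> file offset.
n = 100_000
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE idx (name TEXT PRIMARY KEY, offset INTEGER)')
con.executemany('INSERT INTO idx VALUES (?, ?)',
                ((f'read{i}', i * 100) for i in range(n)))

# One SQLite query per read, as random access does.
t0 = time.perf_counter()
cur = con.cursor()
for i in range(n):
    cur.execute('SELECT offset FROM idx WHERE name = ?', (f'read{i}',))
    cur.fetchone()
sqlite_s = time.perf_counter() - t0

# The same lookups against an in-memory dict.
lookup = {f'read{i}': i * 100 for i in range(n)}
t0 = time.perf_counter()
for i in range(n):
    lookup[f'read{i}']
dict_s = time.perf_counter() - t0

print(f'sqlite: {sqlite_s:.3f}s  dict: {dict_s:.3f}s')
```

Even with an in-memory database, per-query overhead is orders of magnitude above a dict lookup; with an on-disk index plus decompression per record (the ~1.5 ms/read you measured), the gap is larger still.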

I am a little confused about why you use read.id rather than read.name to extract reads. Using id implies that the reads in these two files are in the same order.

If you use name to extract reads from the other file, you can use multiple threads to speed this up.
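If the two files really do share read names and differ only in the records dropped by trimming, a single streaming pass avoids random access entirely: collect the names kept in trimmed R1, then filter R2 in one scan. A rough pure-Python sketch of that idea (file paths and helper names are illustrative, not pyfastx API):

```python
import gzip
from itertools import islice

def fastq_records(handle):
    """Yield (name, record_text) for each 4-line FASTQ record."""
    while True:
        lines = list(islice(handle, 4))
        if not lines:
            break
        # Read name is the header up to the first space, minus '@'.
        name = lines[0].split()[0].lstrip('@')
        yield name, ''.join(lines)

def filter_r2(r1_path, r2_path, out_path):
    # Pass 1: names surviving in the trimmed R1.
    with gzip.open(r1_path, 'rt') as r1:
        kept = {name for name, _ in fastq_records(r1)}
    # Pass 2: stream R2, writing only records whose mate survived.
    with gzip.open(r2_path, 'rt') as r2, gzip.open(out_path, 'wt') as out:
        for name, record in fastq_records(r2):
            if name in kept:
                out.write(record)
```

The kept-name set holds every surviving R1 name in memory, which can be sizeable for very large runs, but each record then costs only one hash lookup instead of a per-read index query.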
