
Chunk size, kmers or seqs? #12

Open
mr-eyes opened this issue Feb 14, 2021 · 4 comments

Comments

@mr-eyes
Member

mr-eyes commented Feb 14, 2021

Rethinking the chunk size: should we define it as the number of sequences or the number of kmers?

Chunk size as a number of sequences should work when sequence lengths are relatively small. For genomes, however, setting the chunk size to 10k sequences would consume a lot of memory per chunk (e.g., 10,000 sequences of ~1 Mbp each is already ~10 GB of raw sequence). On the other hand, it would work smoothly when processing transcripts, because their average length is small.

Chunk size as a number of kmers would work fine in both of the previous examples, and we could set it in fixed units of thousands or millions.

@drtamermansour what do you think?

@drtamermansour
Member

I like that

@mr-eyes
Member Author

mr-eyes commented Feb 15, 2021

Here is a proposed design.

The user will set a maximum memory, say, 1 GB. The chunk size will be auto-calculated as chunk_size ≈ max_memory_bytes / (kSize + 8), which means 1 GB can store ~25 million kmers (e.g., with kSize = 31, 1e9 / 39 ≈ 25.6 million).
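As a concrete illustration, here is a minimal sketch of that calculation (the function name and the budget value are hypothetical, not part of kmerDecoder):

#include <cstdint>
#include <iostream>

// Hypothetical helper: derive kmers-per-chunk from a memory budget,
// assuming each kmer costs kSize + 8 bytes as described above.
std::uint64_t chunk_size_from_memory(std::uint64_t max_memory_bytes, unsigned kSize) {
    return max_memory_bytes / (kSize + 8);
}

int main() {
    // 1 GB budget, kSize = 31 -> 1e9 / 39 ≈ 25.6 million kmers per chunk
    std::cout << chunk_size_from_memory(1'000'000'000ULL, 31) << "\n";
}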

Kmers are stored in the following data structure:

flat_hash_map<std::string, std::vector<kmer_row>> kmers;
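For reference, a self-contained sketch of that structure could look like this (the kmer_row fields are assumptions on my part, and std::unordered_map stands in for flat_hash_map so the snippet compiles on its own):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed layout of kmer_row; kmerDecoder's actual definition may differ.
struct kmer_row {
    std::string str;     // the kmer itself
    std::uint64_t hash;  // its hashed value
};

// Sequence name -> the kmers extracted from that sequence.
std::unordered_map<std::string, std::vector<kmer_row>> kmers;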

When implementing this design, we will need to attach the sequence name to every kmer, which increases memory usage. Why? Because a sequence's kmers will most likely be split across two chunks. Alternatively, we could add another data structure to hold this information without redundancy, as sketched below.
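One possible redundancy-free layout, purely as a sketch (none of these names are decided): store each sequence name once in a side table and tag its kmers with a compact integer id, so a sequence split across two chunks repeats only a 4-byte id rather than the full name:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct kmer_row { std::string str; std::uint64_t hash; };  // as assumed above

std::vector<std::string> seq_names;  // id -> sequence name, stored once
std::unordered_map<std::uint32_t, std::vector<kmer_row>> kmers_by_id;  // id -> that sequence's kmers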

This design will significantly change the kmerDecoder API, which means it would need to be updated everywhere kmerDecoder is used in kProcessor.

So, I think we are going to defer this for now.

@drtamermansour
Member

The memory-based approach is nice.
I think we can implement it without changing the design.
Here is a suggestion: kmerDecoder will create a temp file and read sequences from the input file as usual. Once it reaches the maximum number of kmers, it will write the remaining part of the sequence (after adjusting for the kmer overlap, of course) to the temp file under the same sequence name. For the next chunk, it will check the temp file and read from it before reading from the input stream, as in the sketch below. What do you think?
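To make the overlap adjustment concrete, here is a minimal sketch of just the spill-over step (all names are hypothetical): when the kmer budget is exhausted mid-sequence, the unprocessed tail is written to the temp file under the same header, and taking the substring from the first unprocessed kmer start automatically preserves the kSize - 1 bases of overlap with the last processed kmer:

#include <cstddef>
#include <fstream>
#include <string>

// next_kmer_start: index of the first kmer that did NOT fit in this chunk.
void spill_remainder(std::ofstream& temp_fasta,
                     const std::string& seq_name,
                     const std::string& seq,
                     std::size_t next_kmer_start) {
    temp_fasta << '>' << seq_name << '\n'
               << seq.substr(next_kmer_start) << '\n';
}

On the next chunk, the decoder would consume this temp file before returning to the main input stream, so downstream code never notices the split.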

@mr-eyes
Member Author

mr-eyes commented Feb 15, 2021

If the chunk size is set to be small and the sequences file is large, this would require many writes to the temp file. I will rethink an alternative design and post it here after implementing the aa-encoding #14 .
