
Chunk size, kmers or seqs? #12

Open
mr-eyes opened this issue Feb 14, 2021 · 4 comments

Comments

@mr-eyes
Member

mr-eyes commented Feb 14, 2021

Rethinking the chunk size: should we define it as the number of sequences or the number of kmers?

Chunk size as a number of sequences should work when sequence lengths are relatively small. For genomes, however, setting the chunk size to 10k sequences would consume a lot of memory per chunk (e.g., 10,000 sequences of ~1 Mbp each is already ~10 GB of raw sequence). On the other hand, it would work smoothly when processing transcripts, because their average length is small.

Chunk size as a number of kmers would work fine in both of the previous examples, and we could set it in fixed units of thousands or millions.

@drtamermansour what do you think?

@drtamermansour
Member

I like that

@mr-eyes
Member Author

mr-eyes commented Feb 15, 2021

Here is a proposed design.

The user will set a maximum memory, say, 1 GB. The chunk size will be auto-calculated as chunk_size ≈ max_memory_bytes / (kSize + 8), which means 1 GB can store ~25 million kmers (e.g., with kSize = 31, 1e9 / 39 ≈ 25.6 million).
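As a concrete illustration, here is a minimal sketch of that calculation (the function name and the budget value are hypothetical, not part of kmerDecoder):

#include <cstdint>
#include <iostream>

// Hypothetical helper: derive kmers-per-chunk from a memory budget,
// assuming each kmer costs kSize + 8 bytes as described above.
std::uint64_t chunk_size_from_memory(std::uint64_t max_memory_bytes, unsigned kSize) {
    return max_memory_bytes / (kSize + 8);
}

int main() {
    // 1 GB budget, kSize = 31 -> 1e9 / 39 ≈ 25.6 million kmers per chunk
    std::cout << chunk_size_from_memory(1'000'000'000ULL, 31) << "\n";
}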

Kmers are stored in the following data structure:

flat_hash_map<std::string, std::vector<kmer_row>> kmers;
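For reference, a self-contained sketch of that structure could look like this (the kmer_row fields are assumptions on my part, and std::unordered_map stands in for flat_hash_map so the snippet compiles on its own):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed layout of kmer_row; kmerDecoder's actual definition may differ.
struct kmer_row {
    std::string str;     // the kmer itself
    std::uint64_t hash;  // its hashed value
};

// Sequence name -> the kmers extracted from that sequence.
std::unordered_map<std::string, std::vector<kmer_row>> kmers;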

When implementing this design, we will need to attach the sequence name to every kmer, which increases memory usage. Why? Because a sequence's kmers will most likely be split across two chunks. Alternatively, we could add another data structure to hold this information without redundancy, as sketched below.
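One possible redundancy-free layout, purely as a sketch (none of these names are decided): store each sequence name once in a side table and tag its kmers with a compact integer id, so a sequence split across two chunks repeats only a 4-byte id rather than the full name:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct kmer_row { std::string str; std::uint64_t hash; };  // as assumed above

std::vector<std::string> seq_names;  // id -> sequence name, stored once
std::unordered_map<std::uint32_t, std::vector<kmer_row>> kmers_by_id;  // id -> that sequence's kmers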

This design will significantly change the kmerDecoder API, which means it would need to be updated everywhere kmerDecoder is used in kProcessor.

So, I think we are going to defer this for now.

@drtamermansour
Member

The memory-based approach is nice.
I think we can implement it without changing the design.
Here is a suggestion: kmerDecoder will create a temp file and read sequences from the input file as usual. Once it reaches the maximum number of kmers, it will write the remaining part of the sequence (after adjusting for the kmer overlap, of course) to the temp file under the same sequence name. For the next chunk, it will check the temp file and read from it before reading from the input stream, as in the sketch below. What do you think?
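To make the overlap adjustment concrete, here is a minimal sketch of just the spill-over step (all names are hypothetical): when the kmer budget is exhausted mid-sequence, the unprocessed tail is written to the temp file under the same header, and taking the substring from the first unprocessed kmer start automatically preserves the kSize - 1 bases of overlap with the last processed kmer:

#include <cstddef>
#include <fstream>
#include <string>

// next_kmer_start: index of the first kmer that did NOT fit in this chunk.
void spill_remainder(std::ofstream& temp_fasta,
                     const std::string& seq_name,
                     const std::string& seq,
                     std::size_t next_kmer_start) {
    temp_fasta << '>' << seq_name << '\n'
               << seq.substr(next_kmer_start) << '\n';
}

On the next chunk, the decoder would consume this temp file before returning to the main input stream, so downstream code never notices the split.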

@mr-eyes
Member Author

mr-eyes commented Feb 15, 2021

If the chunk size is set to be small and the sequences file is large, this would require many writes to the temp file. I will rethink an alternative design and post it here after implementing the aa-encoding #14 .
