Reading sequence records in chunks #2434

mhassankhan90 · 2020-07-05T15:17:07Z

Hi,

I am using SeqAn 2.3.2. I am following sequence I/O tutorial. The steps mentioned to read records in chunks is not very much clear. For example, I have a fastq file with 100 records. I am trying to read 25 records at a time using the code mentioned in the tutorial:
readRecords(ids, seqs, seqFileIn, 25);

This line works well for first 25 records i.e 0-24.
There is no code/ step mentioned to read next 25 record sy 25-49. If I use the above code again, it will read next 25 records but will add to the previously read records. How can I solve this?
Secondly, how can I read records from 50-75 only?

Thanks & Best Regards,

smehringer · 2020-07-06T09:50:33Z

Hi @mhassankhan90,

thanks for your interest in SeqAn. Before I answer your actual question, may I ask which features you plan to use from SeqAn?
Because all our efforts go into our most recent version SeqAn 3 (your question is directed at SeqAn 2) and we recommend to start with the newest version right away although it is not feature complete yet.

A solution for you problem in SeqAn 3 is or exmaple

seqan3::sequence_file_input fin{"my_file.fa"};

for (auto && record : fin | seqan3::views::drop(50) | seqan3::views::take(25))
{
    // loops over records 50-75
}

See the SeqAn 3 IO Tutorial for further details.

mhassankhan90 · 2020-07-06T14:49:59Z

Thanks for your reply. I am interested in using sequence I/O features of SeqAn especially the option to read records in chunks for further analysis. I'll shift to SeqAn v3 to use it. Please let me know one more thing, is there any feature/ function in SeqAn v2 or v3 to generate and count k-mers?

Thanks & Best Regards,

smehringer · 2020-07-07T05:31:59Z

Hi @mhassankhan90,

indeed we have the seqan3::views::kmer_hash view to generate kmers like this:

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/range/views/kmer_hash.hpp>
 
using seqan3::operator""_dna4; // s.t. we can use "ACGTAGC"_dna4 to create a dna4 vector
 
int main()
{
    std::vector<seqan3::dna4> text{"ACGTAGC"_dna4};
 
    auto hashes = text | seqan3::views::kmer_hash(seqan3::shape{seqan3::ungapped{3}}); // an ungapped shape
    seqan3::debug_stream << hashes << '\n'; // prints [6,27,44,50,9]
}

But we currently have no special functionality to count kmers, you would need to for example use the std::unordered_map for that.
There are some plans to implement a faster version of std::unordered_map but I don't think we will manage that in the very near future.

May I ask what your use case is, e.g. what you are doing with the kmer counts?

Best Regards.

mhassankhan90 · 2020-07-08T11:56:24Z

I'll be using kmer and kmer count to generate de bruijn graphs.

rrahn · 2020-07-08T11:56:52Z

Hi @mhassankhan90,

I'd like add some thoughts on this as well. 😃

readRecords(ids, seqs, seqFileIn, 25);
This line works well for first 25 records i.e 0-24.
There is no code/ step mentioned to read next 25 record sy 25-49. If I use the above code again, it will read next 25 records but will add to the previously read records. How can I solve this?
Secondly, how can I read records from 50-75 only?

In SeqAn2 you can simply run the above function in a while loop:

while(!atEnd(seqFileIn))
    readRecords(ids, seqs, seqFileIn, 25);

As far as I know, this will simply append the next records to the ids and seqs without clearing the buffers before.
If you need afterwards only infixes of the records you can use the infix functionality which also works on string sets.
So you should be able to do something like:

auto ids_slice = seqan::infix(ids, 25, 50); 
auto seqs_slice = seqan::infix(seqs, 25, 50);

Please let me know one more thing, is there any feature/ function in SeqAn v2 or v3 to generate and count k-mers?

In SeqAn2 you can use a a qgram index to efficiently count k-mers: https://seqan.readthedocs.io/en/master/Tutorial/DataStructures/Indices/QgramIndex.html.
In SeqAn3, you can do this in a different way with the current index structures: http://docs.seqan.de/seqan/3-master-user/tutorial_index_search.html. If you are only interested in the counts, you could configure the search output to get only the index_cursor see here. On them you can call count and get the number of occurrences for every hit. To iterate over the k-mers of a sequence I believe you can use the ranges::views::sliding, which you automatically get when you use seqan3 at the moment. The direct hashing which you would get with the kmer_hash view wouldn't help you much with this I guess. Of course, I am only guessing here what you want to achieve.

But as @smehringer said, we have no dedicated k-mer based indices in Seqan3 yet, so Seqan2 might be more efficient with respect to k-mer counting. This strongly depends on your use case. For example if you are interested in gapped k-mers you are probably better off with SeqAn2, also note, that you can search your k-mers with errors very efficiently in SeqAn3. This however, is not semantically the same as searching with gapped k-mers, where fixed positions in the shape can be erroneous.

mhassankhan90 · 2020-07-09T15:32:50Z

Hi @rrahn thank you for the brief response. You are absolutely right about reading records in chunks using loop will append next records to existing ids and seqs without clearing buffers.

I have tried to use the substring function to generate sub sequences, but it seems like stringset isn't supported with substring. So I tried to use assign function of SeqAn to convert types but it works only with one record at a time i.e. seqs[i]. It doesn't convert whole chunk or maybe I am not familiar with the process.

Anyhow, I am still learning SeqAn and trying new stuff. Thanks for your support.

Regards,

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reading sequence records in chunks #2434

Reading sequence records in chunks #2434

mhassankhan90 commented Jul 5, 2020

smehringer commented Jul 6, 2020

mhassankhan90 commented Jul 6, 2020

smehringer commented Jul 7, 2020

mhassankhan90 commented Jul 8, 2020

rrahn commented Jul 8, 2020 •

edited

mhassankhan90 commented Jul 9, 2020

Reading sequence records in chunks #2434

Reading sequence records in chunks #2434

Comments

mhassankhan90 commented Jul 5, 2020

smehringer commented Jul 6, 2020

mhassankhan90 commented Jul 6, 2020

smehringer commented Jul 7, 2020

mhassankhan90 commented Jul 8, 2020

rrahn commented Jul 8, 2020 • edited

mhassankhan90 commented Jul 9, 2020

rrahn commented Jul 8, 2020 •

edited