Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading sequence records in chunks #2434

Open
mhassankhan90 opened this issue Jul 5, 2020 · 6 comments
Open

Reading sequence records in chunks #2434

mhassankhan90 opened this issue Jul 5, 2020 · 6 comments

Comments

@mhassankhan90
Copy link

Hi,

I am using SeqAn 2.3.2. I am following sequence I/O tutorial. The steps mentioned to read records in chunks is not very much clear. For example, I have a fastq file with 100 records. I am trying to read 25 records at a time using the code mentioned in the tutorial:
readRecords(ids, seqs, seqFileIn, 25);

This line works well for first 25 records i.e 0-24.
There is no code/ step mentioned to read next 25 record sy 25-49. If I use the above code again, it will read next 25 records but will add to the previously read records. How can I solve this?
Secondly, how can I read records from 50-75 only?

Thanks & Best Regards,

@smehringer
Copy link
Member

Hi @mhassankhan90,

thanks for your interest in SeqAn. Before I answer your actual question, may I ask which features you plan to use from SeqAn?
Because all our efforts go into our most recent version SeqAn 3 (your question is directed at SeqAn 2) and we recommend to start with the newest version right away although it is not feature complete yet.

A solution for you problem in SeqAn 3 is or exmaple

seqan3::sequence_file_input fin{"my_file.fa"};

for (auto && record : fin | seqan3::views::drop(50) | seqan3::views::take(25))
{
    // loops over records 50-75
}

See the SeqAn 3 IO Tutorial for further details.

@mhassankhan90
Copy link
Author

Thanks for your reply. I am interested in using sequence I/O features of SeqAn especially the option to read records in chunks for further analysis. I'll shift to SeqAn v3 to use it. Please let me know one more thing, is there any feature/ function in SeqAn v2 or v3 to generate and count k-mers?

Thanks & Best Regards,

@smehringer
Copy link
Member

Hi @mhassankhan90,

indeed we have the seqan3::views::kmer_hash view to generate kmers like this:

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/range/views/kmer_hash.hpp>
 
using seqan3::operator""_dna4; // s.t. we can use "ACGTAGC"_dna4 to create a dna4 vector
 
int main()
{
    std::vector<seqan3::dna4> text{"ACGTAGC"_dna4};
 
    auto hashes = text | seqan3::views::kmer_hash(seqan3::shape{seqan3::ungapped{3}}); // an ungapped shape
    seqan3::debug_stream << hashes << '\n'; // prints [6,27,44,50,9]
}

But we currently have no special functionality to count kmers, you would need to for example use the std::unordered_map for that.
There are some plans to implement a faster version of std::unordered_map but I don't think we will manage that in the very near future.

May I ask what your use case is, e.g. what you are doing with the kmer counts?

Best Regards.

@mhassankhan90
Copy link
Author

I'll be using kmer and kmer count to generate de bruijn graphs.

@rrahn
Copy link
Contributor

rrahn commented Jul 8, 2020

Hi @mhassankhan90,

I'd like add some thoughts on this as well. 😃

readRecords(ids, seqs, seqFileIn, 25);
This line works well for first 25 records i.e 0-24.
There is no code/ step mentioned to read next 25 record sy 25-49. If I use the above code again, it will read next 25 records but will add to the previously read records. How can I solve this?
Secondly, how can I read records from 50-75 only?

In SeqAn2 you can simply run the above function in a while loop:

while(!atEnd(seqFileIn))
    readRecords(ids, seqs, seqFileIn, 25);

As far as I know, this will simply append the next records to the ids and seqs without clearing the buffers before.
If you need afterwards only infixes of the records you can use the infix functionality which also works on string sets.
So you should be able to do something like:

auto ids_slice = seqan::infix(ids, 25, 50); 
auto seqs_slice = seqan::infix(seqs, 25, 50); 

Please let me know one more thing, is there any feature/ function in SeqAn v2 or v3 to generate and count k-mers?

In SeqAn2 you can use a a qgram index to efficiently count k-mers: https://seqan.readthedocs.io/en/master/Tutorial/DataStructures/Indices/QgramIndex.html.
In SeqAn3, you can do this in a different way with the current index structures: http://docs.seqan.de/seqan/3-master-user/tutorial_index_search.html. If you are only interested in the counts, you could configure the search output to get only the index_cursor see here. On them you can call count and get the number of occurrences for every hit. To iterate over the k-mers of a sequence I believe you can use the ranges::views::sliding, which you automatically get when you use seqan3 at the moment. The direct hashing which you would get with the kmer_hash view wouldn't help you much with this I guess. Of course, I am only guessing here what you want to achieve.

But as @smehringer said, we have no dedicated k-mer based indices in Seqan3 yet, so Seqan2 might be more efficient with respect to k-mer counting. This strongly depends on your use case. For example if you are interested in gapped k-mers you are probably better off with SeqAn2, also note, that you can search your k-mers with errors very efficiently in SeqAn3. This however, is not semantically the same as searching with gapped k-mers, where fixed positions in the shape can be erroneous.

@mhassankhan90
Copy link
Author

Hi @rrahn thank you for the brief response. You are absolutely right about reading records in chunks using loop will append next records to existing ids and seqs without clearing buffers.

I have tried to use the substring function to generate sub sequences, but it seems like stringset isn't supported with substring. So I tried to use assign function of SeqAn to convert types but it works only with one record at a time i.e. seqs[i]. It doesn't convert whole chunk or maybe I am not familiar with the process.

Anyhow, I am still learning SeqAn and trying new stuff. Thanks for your support.

Regards,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants