-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reading sequence records in chunks #2434
Comments
Hi @mhassankhan90, thanks for your interest in SeqAn. Before I answer your actual question, may I ask which features you plan to use from SeqAn? A solution for you problem in SeqAn 3 is or exmaple seqan3::sequence_file_input fin{"my_file.fa"};
for (auto && record : fin | seqan3::views::drop(50) | seqan3::views::take(25))
{
// loops over records 50-75
} See the SeqAn 3 IO Tutorial for further details. |
Thanks for your reply. I am interested in using sequence I/O features of SeqAn especially the option to read records in chunks for further analysis. I'll shift to SeqAn v3 to use it. Please let me know one more thing, is there any feature/ function in SeqAn v2 or v3 to generate and count k-mers? Thanks & Best Regards, |
Hi @mhassankhan90, indeed we have the seqan3::views::kmer_hash view to generate kmers like this: #include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/range/views/kmer_hash.hpp>
using seqan3::operator""_dna4; // s.t. we can use "ACGTAGC"_dna4 to create a dna4 vector
int main()
{
std::vector<seqan3::dna4> text{"ACGTAGC"_dna4};
auto hashes = text | seqan3::views::kmer_hash(seqan3::shape{seqan3::ungapped{3}}); // an ungapped shape
seqan3::debug_stream << hashes << '\n'; // prints [6,27,44,50,9]
} But we currently have no special functionality to count kmers, you would need to for example use the May I ask what your use case is, e.g. what you are doing with the kmer counts? Best Regards. |
I'll be using kmer and kmer count to generate de bruijn graphs. |
Hi @mhassankhan90, I'd like add some thoughts on this as well. 😃
In SeqAn2 you can simply run the above function in a while loop: while(!atEnd(seqFileIn))
readRecords(ids, seqs, seqFileIn, 25); As far as I know, this will simply append the next records to the ids and seqs without clearing the buffers before. auto ids_slice = seqan::infix(ids, 25, 50);
auto seqs_slice = seqan::infix(seqs, 25, 50);
In SeqAn2 you can use a a qgram index to efficiently count k-mers: https://seqan.readthedocs.io/en/master/Tutorial/DataStructures/Indices/QgramIndex.html. But as @smehringer said, we have no dedicated k-mer based indices in Seqan3 yet, so Seqan2 might be more efficient with respect to k-mer counting. This strongly depends on your use case. For example if you are interested in gapped k-mers you are probably better off with SeqAn2, also note, that you can search your k-mers with errors very efficiently in SeqAn3. This however, is not semantically the same as searching with gapped k-mers, where fixed positions in the shape can be erroneous. |
Hi @rrahn thank you for the brief response. You are absolutely right about reading records in chunks using loop will append next records to existing ids and seqs without clearing buffers. I have tried to use the substring function to generate sub sequences, but it seems like stringset isn't supported with substring. So I tried to use assign function of SeqAn to convert types but it works only with one record at a time i.e. seqs[i]. It doesn't convert whole chunk or maybe I am not familiar with the process. Anyhow, I am still learning SeqAn and trying new stuff. Thanks for your support. Regards, |
Hi,
I am using SeqAn 2.3.2. I am following sequence I/O tutorial. The steps mentioned to read records in chunks is not very much clear. For example, I have a fastq file with 100 records. I am trying to read 25 records at a time using the code mentioned in the tutorial:
readRecords(ids, seqs, seqFileIn, 25);
This line works well for first 25 records i.e 0-24.
There is no code/ step mentioned to read next 25 record sy 25-49. If I use the above code again, it will read next 25 records but will add to the previously read records. How can I solve this?
Secondly, how can I read records from 50-75 only?
Thanks & Best Regards,
The text was updated successfully, but these errors were encountered: