Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IO] Is there a neat way to pass only selected fields to SAM output? #2436

Open
tloka opened this issue Mar 9, 2021 · 4 comments
Open

[IO] Is there a neat way to pass only selected fields to SAM output? #2436

tloka opened this issue Mar 9, 2021 · 4 comments
Assignees
Labels
documentation documentation is missing to close this issue question a user question how to do certain things ready to tackle Fully specified, discussed and understood. Can be tackled.

Comments

@tloka
Copy link
Contributor

tloka commented Mar 9, 2021

Platform

  • SeqAn version: latest
  • Operating system: N/A
  • Compiler: gcc

Question

Hey all,

I am struggling a bit with properly writing unmapped reads to my output alignment files (which is a common feature I would like to have integrated into my software).

I understand that I can write records using emplace_back(...) and specifying all fields. I also know that it is in principle possible to provide less arguments to this function as long as they are not required. So, if seqan3::field::id was the first field specified for the sam output, I could call emplace_back("read1") which would set the ID only and leave all other fields unset. However, there is no way to do the same if the id was the last element in the field list, so if you for example have (minimal example):

seqan3::sam_file_output<
   seqan3::fields<
      seqan3::field::seq,
      seqan3::field::qual,
      seqan3::field::cigar,
      seqan3::field::mapq,
      seqan3::field::ref_id,
      seqan3::field::ref_offset,
      seqan3::field::id>,
   seqan3::type_list<seqan3::format_sam, seqan3::format_bam>,
   std::vector<std::string>> sam_out(/*Init parameters; not important here*/);

it is no longer possible to create an entry having seqan3::field::id set only. Even when setting all other fields manually to default values, SeqAn would throw an error when trying to set seqan3::field::ref_id to "" or "*" as it is not contained in the list of references.

This behavior currently makes it complicated for me to have mixed files of mapped and unmapped reads, where in the latter case I can't provide a valid ref_id. Only workaround for me is currently to change the order of fields, but I am afraid this could in future lead to incompatibility if I need to mix even more combinations of (un-)reported fields.

Thus, my question is if I have overseen a (neat) way to fill the same seqan3::sam_file_output object with entries using different selections of fields?

If no, it would be great if there was a possibility to have some function that allows to speficy the fields to set, e.g. something like this:

sam_out.emplace_back_selected<seqan3::fields<seqan3::field::id>>("Read 1");

which would be valid as long as seqan3::field::id is part of the sam file output object.

Second option, there would be something like an unset value (or similar) to each of the fields that can be used to set fields that are not available to default value and at the same time are not caught by validation steps as, for example, with the reference ID:

sam_out.emplace_back(seqan3::field::ref_id::unset, "Read 1");
@tloka tloka added the question a user question how to do certain things label Mar 9, 2021
@tloka tloka changed the title Is it possible to write mixed files of mapped and unmapped reads? Is there a neat way to provide only selected fields to SAM output? Mar 9, 2021
@tloka tloka changed the title Is there a neat way to provide only selected fields to SAM output? Is there a neat way to pass only selected fields to SAM output? Mar 9, 2021
@marehr
Copy link
Member

marehr commented Mar 11, 2021

Hi tobias,

so if I understand you correctly, you want to switch between record-types (one mapped, one unmapped) in the output?

I don't know if I caught your use case, but do you want to do:

seqan3::sam_file_output<...> out{"./some_path"};

// write out a full record as you already do (field order shouldn't matter)
out.emplace_back(...);

// we just provide the id, everything else is default
using unmapped_read_record = seqan3::sam_record<seqan3::type_list<std::string>, seqan3::fields<seqan3::field::id>>;
unmapped_read_record unmapped{"READ 1"};

out.push_back(unmapped);

(UNTESTED, I still need to compile it :))

We are currently working on it to make the customisation via those fields obsolete.
I'm currently pitching the following idea:

seqan3::sam_file_output out{"./some_path"}; // no template arguments anymore, as they are only needed for emplacing the data

// note: you don't specify ANY fields anymore.
seqan3::sam_record record{.id = "read1"}; // other fields have sensible default types and values
record.id = "read1"; // or later as member assignment
recod.id(); // to access the data. 

out.push_back(std::move(record)); // only set field read1

@tloka
Copy link
Contributor Author

tloka commented Mar 12, 2021

Hey Marcel,

thanks for your response!
I think the first solution is actually catching my case, I wasn't aware that I can push records with a different set of specified fields. This should solve it for now, I will try it out and let you know. However, the template types of sam_record could get pretty long in some cases (as I need to specify a type_list and field_list), which is not a big issue, but a bit unpleasant.

Thus, the new planned solution sounds way more comfortable and would definitively be easier to use. I actually thought about asking whether a removal of the template arguments would be possible, but I wouldn't dare as I expect it to be a bunch of (structural) changes. I am really looking forward for this to be finished!

@tloka
Copy link
Contributor Author

tloka commented Mar 16, 2021

Hey,

thanks again for your answer, it compiled and worked as you proposed.
However, especially when working with views / ranges to sequences, it is not trivial to declare the correct types in the type list. I actually only managed to make it work using decltype. Thus, i am happy for now that there is a working solution for arbitrary combinations of fields, but still I am looking forward to the new version with no need to declare any types!

Closed this thread, as the provided answer works.

@tloka tloka closed this as completed Mar 16, 2021
@marehr marehr reopened this Mar 16, 2021
@marehr
Copy link
Member

marehr commented Mar 16, 2021

(Reopened, as I think we should add documentation for this use case.)

@smehringer smehringer added ready to tackle Fully specified, discussed and understood. Can be tackled. documentation documentation is missing to close this issue labels Oct 20, 2022
@eseiler eseiler changed the title Is there a neat way to pass only selected fields to SAM output? [IO] Is there a neat way to pass only selected fields to SAM output? Oct 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation documentation is missing to close this issue question a user question how to do certain things ready to tackle Fully specified, discussed and understood. Can be tackled.
Projects
None yet
Development

No branches or pull requests

3 participants