Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sequence more abstract, so it can be anything, not just array of chars. #90

Open
Martinsos opened this issue Aug 21, 2017 · 8 comments

Comments

@Martinsos
Copy link
Owner

Multiple people where asking about support for multybyte characters (unicode).
One way to provide that and even more is by making a sequence not an array of chars, but instead an array of objects that satisfy the condition that they have equality operator defined over them.

What would the impact on speed be in this case? I think it would not be big impact, since they are anyway used only to calculate Peq and after that Peq is used.

Would it make it harder to use edlib for usual cases? Would it become to general, hard to use for strings? How could we make sure it is still easy to use while offering flexilibity?

Finally, this might be easier to implement if I decide before that to go with just C++ interface, so I should think about that first.

@Martinsos
Copy link
Owner Author

So far there have been 3 issues asking for multibyte support, so I assigned important label to this feature as it seems to be important to users.

@Martinsos Martinsos added this to To do in Tasks Feb 3, 2018
jbaiter added a commit to jbaiter/edlib that referenced this issue Mar 10, 2019
This is a workaround until Martinsos#90
is implemented.
If either query or target contain non-ascii values, they are mapped into
an ASCII alphabet and the resulting byte sequences are used for doing
the alignment.
jbaiter added a commit to jbaiter/edlib that referenced this issue Mar 10, 2019
This is a workaround until Martinsos#90
is implemented.
If either query or target contain non-ascii values, they are mapped into
an ASCII alphabet and the resulting byte sequences are used for doing
the alignment.
Martinsos pushed a commit that referenced this issue May 4, 2019
This is a workaround until #90
is implemented.
If either query or target contain non-ascii values, they are mapped into
an ASCII alphabet and the resulting byte sequences are used for doing
the alignment. This works only if whole alphabet does not have more than 256 characters.
@Martinsos
Copy link
Owner Author

With @jbaiter 's addition to Python version of Edlib this issue is less pressing, but still, it should be the next one to do.

@Martinsos
Copy link
Owner Author

This is also linked to this: #141 (Unicode support in python edlib).

@Martinsos
Copy link
Owner Author

Martinsos commented Sep 26, 2020

@masri2019 has been working on this for some time now with a little bit of my guidance, so I will document here what has been done and what is yet to be done to call this feature complete!

  • Replacing edlib.h and edlib.cpp with edlib.hpp and edlib.tpp. Additional equalities does not work yet, tests and aligner are not updated, and C interface does not exist anymore. Python binding is also no updated. DONE WITH Supporting Generic Sequences #148 .
  • Updating tests and aligner. DONE WITH Modifying tests and aligner-app #150 .
  • Update additional equalities to work again. DONE WITH Enabling additional equalities for generic sequences (using sets)  #154 . CPP codebase is now fully working. Performance seems to keep up.
  • Update documentation for C/CPP (README.md).
  • Update python binding so it works with new CPP implementation, and also update its documentation. Check https://www.benjack.io/2018/02/02/python-cpp-revisited.html, might be helpful.
  • Consider if we should add C wrappers or not. If yes, we add cedlib.h and cedlib.cpp files, which will be small and short. Possibly we add a test or two, since it is merely using the CPP interface and most of the stuff is checked in compile time. We don't do performance testing, since it is not needed. We also add docs for it. If we don't do this now, we can always do it later. Check Edlib and C (and C++) #80 also, I was pondering more about this there.
  • Do final polishing, check that CI is passing, possibly run some final performance checks, and release new version (both cpp/c and python), with version bumped to 2.0.0 due to the new interface.

We are using "big" feature branch gen-seqs where we are collecting these changes, and will merge them back into master once it is done.

Additional ideas/considerations:

  • Document requirements on templates (Element has to have == operator)?
  • Make sure CMAKE is in good shape (best to do this when rebasing on master?).
  • In some internal functions, consider using name Element instead of AlphabetIdx, if they don't need to know it is AlphabetIdx.
  • Consider making API more C++ish (vectors, strings, ...). We could overload edlibAlign method to take different types of parameters (cpp string, vector, ...). We could also make returned types and other structures that we use nicer. This all should probably be tackled as a separate issue due to the amount of changes.
  • Ensure there are no 'using namespace' in header files.
  • Try including edlib.hpp two times, from two different files, and see if we get double definitions error! Make sure we have automatic test for this and that it is run by CI. @masri2019 already implement the test (https://github.com/Martinsos/edlib/tree/gen-seqs/test/testMultiDefinition), however I am not sure how to best make it part of CI, that is what needs further consideration.

@Martinsos
Copy link
Owner Author

Hey @masri2019, how are you doing? We made great progress with this one and then stopped -> are you still interested in possibly continuing with it, how are you with time?

@mobinasri
Copy link

mobinasri commented Sep 3, 2021 via email

@Martinsos
Copy link
Owner Author

@masri2019 that is awesome :)!! I will also do my best to help you, I believe the two us can finish it together, if needed I can involve myself more, I should also be able to carve out some time.

Yes, the next step is README based on the checklist I created above (which I am now really happy I made because I would have no idea where we stopped otherwise :D). And then python bindings. I am sure we can get both of those done.

Next will be discussion about C wrapper, that might be a bit harder, but ok that is also doable. And then final polishing!

All together sounds like we (you) did the hardest part already, so really looking forward to this. Although, you know how they say: last 20% takes 80% of the time. But let's hope in this case percentages will be gentle to us.

@Martinsos
Copy link
Owner Author

@masri2019 I am guessing it might be a bit hard getting back into it after so much time, so I would advise you do what you can and if you get stuck somewhere no worries, make a draft PR and I can jump in, we will figure it out together. I also forget a lot of things but I am sure we will remember it relatively quickly, since we were writing pretty nice code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Tasks
In Progress
Development

No branches or pull requests

2 participants