Unicode normalization #1128
Comments
@Sharsie, maybe you can help me with some knowledge. I know nothing about normalization. Sure, I am going to look things up, but maybe you can help me with some easily digestible information, especially about the forms C, D, KC, KD, etc. QString has a function for it, but the docs are pretty quiet about what it actually does.
@ManuelSchneid3r It pretty much just follows the Unicode standard for normalization. This article describes the process in detail. In short

So basically (very basically... and inaccurately), all normalization forms first take the input character and decompose (or split) it into its canonical representation in Unicode. In this case, both D and C (and their K variants) do the same thing, because Ohm decomposes into a single Unicode character, Omega.

Then you have stuff like German

Then the K form comes into play. This is useful for stuff like Roman numerals... say

In summary:

Although I would still vote against it in general, if I were to give an example, I think Ohm vs. Omega is a nice one.
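To make the four forms a bit more tangible, here is a minimal sketch using Python's standard `unicodedata` module (not QString; the helper names `normalize_all` and `codepoints` are made up for illustration). The sample characters are assumptions picked to match the cases mentioned above: the Ohm sign, a precomposed u with diaeresis as a stand-in for the German case, and the Roman numeral eight.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Return the text normalized under each of the four Unicode forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def codepoints(text):
    """Render a string as a list of code points, e.g. 'U+2126'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    samples = [
        ("ohm sign", "\u2126"),             # canonical singleton: becomes U+03A9 in every form
        ("u with diaeresis", "\u00fc"),     # D forms split it into 'u' + U+0308, C forms recompose
        ("roman numeral eight", "\u2167"),  # only the K forms turn it into the letters V I I I
    ]
    for label, sample in samples:
        print(label)
        for form, result in normalize_all(sample).items():
            print(f"  {form}: {codepoints(result)}")
```

The C and D forms only apply canonical mappings, so the Roman numeral survives them untouched, while the K forms also apply compatibility mappings and replace it with plain letters.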
Option 1 is fine, as long as you are aware that something like this is happening under the hood, or even exists. I would answer with a different implementation though: Optionally, multiple normalizations could take place.
You are searching through

and even if you type

To achieve that, you will have to run the comparison using multiple normalizations separately.

So I hope I didn't make a mistake and that I explained the rationale for why I am against this :)
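Running the comparison under several normalizations separately could look roughly like the sketch below. It is written in Python with the standard `unicodedata` module, not against albert's actual code; the function name `matching_forms` and the choice of NFC and NFKC as the forms to try are assumptions for illustration.

```python
import unicodedata

# Forms to try, from the least to the most aggressive (an assumed selection).
FORMS = ("NFC", "NFKC")


def matching_forms(query, candidate, forms=FORMS):
    """Return the normalization forms under which query is a substring of candidate."""
    matched = []
    for form in forms:
        # Normalize both sides with the same form before comparing.
        normalized_query = unicodedata.normalize(form, query)
        normalized_candidate = unicodedata.normalize(form, candidate)
        if normalized_query in normalized_candidate:
            matched.append(form)
    return matched
```

Typing the letters `VIII` against an entry that contains the single Roman numeral character U+2167 only matches under NFKC, while the Ohm sign matches Omega under both forms. A caller could use the returned list to rank hits found under the less aggressive form higher.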
To comment on #707: I didn't fully read through it, but Unicode normalization will not help there anyway. If I got that right, the goal is fuzzy matching, so that something like "I don't" matches "I dont" or even "Idont". Normalization will probably not help there in any way... even if you do something like stripping

If there is an answer, then it's probably offering some kind of tooling from the albert core. So from an input such as
But honestly, it's hard to say what the developer wants from it. Unicode is just too broad... is

However, fuzzy matching... that's another story. If you were to implement tooling for fuzzy comparison, where one could send in an input string and a dictionary (key/value) of strings, you would search through the dictionary, normalize both the input and the dictionary values, compare them fuzzily, and then return the keys that match, optionally with some form of score value... that would be amazing. I'm not sure how difficult such an implementation would be, though. I have worked with tools like Elasticsearch in the past, and although these tools are amazing at what they do, that always opens the realm of increasingly complex crap. So unless this can be implemented in a simple manner, bundling a complex search engine in albert is probably not the right way to go :)
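A very rough idea of how small such tooling could be, sketched in Python with only the standard library. Everything here is an assumption rather than anything from albert: the names `fold` and `fuzzy_lookup`, the 0.6 threshold, the choice to strip combining marks, and `difflib.SequenceMatcher` as the similarity measure.

```python
import difflib
import unicodedata


def fold(text):
    """NFKD-normalize, drop combining marks, and casefold (an assumed folding policy)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def fuzzy_lookup(query, entries, threshold=0.6):
    """Return (key, score) pairs whose folded value is similar to the folded query.

    Scores are SequenceMatcher ratios in [0, 1]; results are sorted best first.
    """
    folded_query = fold(query)
    scored = []
    for key, value in entries.items():
        score = difflib.SequenceMatcher(None, folded_query, fold(value)).ratio()
        if score >= threshold:
            scored.append((key, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

With a dictionary such as `{"a": "I don't", "b": "Unrelated text"}`, a query of `"I dont"` returns only key `"a"`. This compares every entry on every call, so it is only meant to show the shape of the API, not something that would scale to a large index.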
See #707 (comment)