Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case insensitive prefix search, data is case sensitive. #7

Open
markg85 opened this issue Dec 27, 2017 · 6 comments
Open

Case insensitive prefix search, data is case sensitive. #7

markg85 opened this issue Dec 27, 2017 · 6 comments

Comments

@markg85
Copy link

markg85 commented Dec 27, 2017

Hi,

I admit, i haven't tried it. But looking at the code it seems to be doing case-sensitive searching.
Imagine i put a bunch of files paths in the trie with structures like:

/home/a/Desktop
/home/a/Downloads
/home/a/Pictures

I'm guessing a prefix search of:

tsl::htrie_set<char> set = {"/home/a/Desktop", "/home/a/Downloads", "/home/a/Pictures"};
set.equal_prefix_range("/home/a/d");

note the lowercase d in the equal_prefix_range call, it will probably not match "/home/a/Desktop" and "/home/a/Downloads" while that would be desirable. Just lowercase everything would fix this, but then the data does not represent the actual paths anymore (both pats could exists in a linux/unix world as it is case sensitive). Using a map and have a lowercase -> uppercase mapping could potentially be a solution as well, but that also doubles the data usage (at the very least) thus kinda defeating the point of using a nice HAT trie in the first place (in terms of memory efficiency).

I do realize that there is a bit of trouble in making the equal_prefix_range case-insensitive (ideally optional, case-sensitive and case-insensitive). You don't know the type i'm putting in as string. It might be a std::string, might be a std::string_view, a byte array, a QString.. You just don't know therefore can't expect a call on a character (like toLowercase() for example) to exist.

So i have a bit of a request. Could you add a function that allows me to set how to compare a prefix?

I'm thinking of a:

tsl::htrie_set::set_character_compare(...);

Then in the compare functions you use whatever is set by the user.

Again, i could be completely wrong and it's already possible, but it doesn't look like that from reading the code ;)

I'm curious about your opinion!

Best regards,
Mark

@Tessil
Copy link
Owner

Tessil commented Dec 27, 2017

Hi,

Yes currently only case-sensitive lookups are supported.

I would like to add something like the Traits template parameter in std::string, but it's difficult.

The main problem is that when I go down the trie, I go down byte by byte and not character by character. It's the same for a simple ASCII alphabet but different for multi-bytes UTF-8 characters. So even if I provide a way for the user to use a custom comparator (though in my implementation I don't directly compare two bytes, I convert a char to an unsigned int which will be used as position in the array of children), the user would only be able to compare two bytes and not two characters.

Also if I allow the trie to be case sensitive while providing case insensitive lookups, it means that lookups will have to go in different branches of the trie (possible but more complicate to implement).

I would like to add this feature but I don't really know what would be the best way to proceed to keep an intuitive interface that supports UTF-8 strings.

@markg85
Copy link
Author

markg85 commented Dec 27, 2017

Hi,

It sounds like the current way (casting char to int ad using that as array index) is a way that has to go if you want to consider case insensitive as well (or rather, any for of compare). What would that do for performance?

@Tessil
Copy link
Owner

Tessil commented Dec 27, 2017

I tried with and boost::flat_map<char, node*> a couple months ago and the performances weren't really good (but I don't have the number any more unfortunately)

But it's possible to work that out, instead of converting the char to int you try both to_upper(char) -> int and to_lower(char) -> int paths. The main problem is really multi-bytes UTF-8 strings.

@markg85
Copy link
Author

markg85 commented Dec 27, 2017

Yeah, trying both would work. But that will cripple your performance of this nice trie container :) (or cut it in half at most i suppose)

@ararunaufc
Copy link

Yeah, trying both would work. But that will cripple your performance of this nice trie container :) (or cut it in half at most i suppose)

This should not be a problem, as long as it is implemented in a separate function.

@ararunaufc
Copy link

I tried with and boost::flat_map<char, node*> a couple months ago and the performances weren't really good (but I don't have the number any more unfortunately)

But it's possible to work that out, instead of converting the char to int you try both to_upper(char) -> int and to_lower(char) -> int paths. The main problem is really multi-bytes UTF-8 strings.

UTF-8 is a mess to work with. For example, a single accented letter like "á" can be represented in at least two ways: by a single codepoint (U+00E1) or by two codepoints (U+0061 and U+0301).

It will be HARD!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants