Case insensitive prefix search, data is case sensitive. #7

markg85 · 2017-12-27T01:09:15Z

Hi,

I admit, i haven't tried it. But looking at the code it seems to be doing case-sensitive searching.
Imagine i put a bunch of files paths in the trie with structures like:

/home/a/Desktop
/home/a/Downloads
/home/a/Pictures

I'm guessing a prefix search of:

tsl::htrie_set<char> set = {"/home/a/Desktop", "/home/a/Downloads", "/home/a/Pictures"};
set.equal_prefix_range("/home/a/d");

note the lowercase d in the equal_prefix_range call, it will probably not match "/home/a/Desktop" and "/home/a/Downloads" while that would be desirable. Just lowercase everything would fix this, but then the data does not represent the actual paths anymore (both pats could exists in a linux/unix world as it is case sensitive). Using a map and have a lowercase -> uppercase mapping could potentially be a solution as well, but that also doubles the data usage (at the very least) thus kinda defeating the point of using a nice HAT trie in the first place (in terms of memory efficiency).

I do realize that there is a bit of trouble in making the equal_prefix_range case-insensitive (ideally optional, case-sensitive and case-insensitive). You don't know the type i'm putting in as string. It might be a std::string, might be a std::string_view, a byte array, a QString.. You just don't know therefore can't expect a call on a character (like toLowercase() for example) to exist.

So i have a bit of a request. Could you add a function that allows me to set how to compare a prefix?

I'm thinking of a:

tsl::htrie_set::set_character_compare(...);

Then in the compare functions you use whatever is set by the user.

Again, i could be completely wrong and it's already possible, but it doesn't look like that from reading the code ;)

I'm curious about your opinion!

Best regards,
Mark

The text was updated successfully, but these errors were encountered:

Tessil · 2017-12-27T17:52:54Z

Hi,

Yes currently only case-sensitive lookups are supported.

I would like to add something like the Traits template parameter in std::string, but it's difficult.

The main problem is that when I go down the trie, I go down byte by byte and not character by character. It's the same for a simple ASCII alphabet but different for multi-bytes UTF-8 characters. So even if I provide a way for the user to use a custom comparator (though in my implementation I don't directly compare two bytes, I convert a char to an unsigned int which will be used as position in the array of children), the user would only be able to compare two bytes and not two characters.

Also if I allow the trie to be case sensitive while providing case insensitive lookups, it means that lookups will have to go in different branches of the trie (possible but more complicate to implement).

I would like to add this feature but I don't really know what would be the best way to proceed to keep an intuitive interface that supports UTF-8 strings.

markg85 · 2017-12-27T18:06:53Z

Hi,

It sounds like the current way (casting char to int ad using that as array index) is a way that has to go if you want to consider case insensitive as well (or rather, any for of compare). What would that do for performance?

Tessil · 2017-12-27T18:29:45Z

I tried with and boost::flat_map<char, node*> a couple months ago and the performances weren't really good (but I don't have the number any more unfortunately)

But it's possible to work that out, instead of converting the char to int you try both to_upper(char) -> int and to_lower(char) -> int paths. The main problem is really multi-bytes UTF-8 strings.

markg85 · 2017-12-27T18:40:29Z

Yeah, trying both would work. But that will cripple your performance of this nice trie container :) (or cut it in half at most i suppose)

ararunaufc · 2018-11-07T05:39:15Z

Yeah, trying both would work. But that will cripple your performance of this nice trie container :) (or cut it in half at most i suppose)

This should not be a problem, as long as it is implemented in a separate function.

ararunaufc · 2018-11-07T05:43:30Z

I tried with and boost::flat_map<char, node*> a couple months ago and the performances weren't really good (but I don't have the number any more unfortunately)

But it's possible to work that out, instead of converting the char to int you try both to_upper(char) -> int and to_lower(char) -> int paths. The main problem is really multi-bytes UTF-8 strings.

UTF-8 is a mess to work with. For example, a single accented letter like "á" can be represented in at least two ways: by a single codepoint (U+00E1) or by two codepoints (U+0061 and U+0301).

It will be HARD!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Case insensitive prefix search, data is case sensitive. #7

Case insensitive prefix search, data is case sensitive. #7

markg85 commented Dec 27, 2017

Tessil commented Dec 27, 2017 •

edited

markg85 commented Dec 27, 2017

Tessil commented Dec 27, 2017

markg85 commented Dec 27, 2017

ararunaufc commented Nov 7, 2018

ararunaufc commented Nov 7, 2018

Case insensitive prefix search, data is case sensitive. #7

Case insensitive prefix search, data is case sensitive. #7

Comments

markg85 commented Dec 27, 2017

Tessil commented Dec 27, 2017 • edited

markg85 commented Dec 27, 2017

Tessil commented Dec 27, 2017

markg85 commented Dec 27, 2017

ararunaufc commented Nov 7, 2018

ararunaufc commented Nov 7, 2018

Tessil commented Dec 27, 2017 •

edited