Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: efficient set intersect of multiple HAT-tries #25

Open
bigerl opened this issue Feb 21, 2020 · 3 comments
Open

Question: efficient set intersect of multiple HAT-tries #25

bigerl opened this issue Feb 21, 2020 · 3 comments

Comments

@bigerl
Copy link

bigerl commented Feb 21, 2020

This is more a question, than a real issue.

I am wondering, what is the most efficient way to intersect the keys of two or more hat-tries.

My naive approach would be to iterate over the keys of the HAT-trie with the smallest size and probe the others for the key. e.g.:

std::set<std::string> intersect(std::vector<tsl::htrie_set<char>> operands) {
    // Argument shouldn't be copied or changed. Only for demonstration purposes.
    sort(operands.begin(), operands.end(), 
        [&](const auto & a, const auto & b) { 
            return a.size() < b.size(); 
    });
    auto iterated_operand = operands[0];
    operands.erase(0);
    std::set<std::string> result{};
    for(auto it = iterated_operand.begin(); it != iterated_operand.end(); ++it) {
        bool skip = false;
        for(const auto &probe_operand : operands)
            if (not probe_operand.count(it.key())){
                skip = true;
                break;
            }
        if (not skip) 
            result.insert(it.key());
    }
    return result;
}
@Tessil
Copy link
Owner

Tessil commented Feb 23, 2020

Hi,

It would be possible to optimize this by skipping whole prefix branches. If you currently are on a branch where every key has the prefix 'foo' and the another trie has no key with this prefix, you can then skip the whole branch immediately.

Unfortunately the current interface doesn't provide enough access to the internal of the iterator to do so. I have to check how to properly provide a more flexible iterator

Thibaut

@bigerl
Copy link
Author

bigerl commented Mar 3, 2020

The approach sounds interesting. With the prefix-skipping it would be very fast for a bunch of cases. Actually, it allows to calculate the empty cut of two sets without common prefixes in O(1).
Do you have any plans to implement such an extended iterator or optimized set intersection in the near future?

@Tessil
Copy link
Owner

Tessil commented Mar 5, 2020

I plan to extend the API of the iterator but it may take time before I start to work on it, 2 to 3 months, as I'm currently on holiday and I'll be a bit busy when I get back home.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants