Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(feature request) Insert at prefix position #8

Open
markg85 opened this issue Dec 27, 2017 · 10 comments
Open

(feature request) Insert at prefix position #8

markg85 opened this issue Dec 27, 2017 · 10 comments

Comments

@markg85
Copy link

markg85 commented Dec 27, 2017

Hi,

Imagine the case of inserting a whole filesystem structure in the hat-trie. You have folders with a bunch of files, more folders and more files. At the moment in the current code each entry is looking up the appropriate point to insert the entry. For example a folder structure like:

/home/a/Downloads
/home/a/Documents

the insertion logic has to iterate the tree to find the correct spot. That is fine for the first entry, but could be optimized for the second entry and specially for all entries after the second.

I'd like to propose a insert function with a "fixed prefix" that can then take a list (vector or map, depending on the type of container you used for the hat-trie) to insert a bunch of elements at a given prefix. For instance something like this:

tsl::htrie_set::insert("/home/a/", {"Downloads", "Documents"})

This in only really beneficial for filesystem like structures, not for wordbook purposes. But it would still be a neat addition imho.

Best regards,
Mark

Tessil added a commit that referenced this issue Jan 2, 2018
@Tessil
Copy link
Owner

Tessil commented Jan 2, 2018

Hi,

I created a branch to implement the feature.

You can use it this way:

tsl::htrie_map<char, int> map;
map.insert_with_prefix("/home/user/", {{"test", 1}, {"test2", 2},
                                  {"test3", 1}, {"test4", 2},
                                  {"test5", 1}, {"test6", 2},
                                  {"test7", 1}, {"test8", 2},
                                  {"test9", 1}, {"test10", 2},
                                  {"test11", 1}, {"test12", 2},
});

for(auto it = map.begin(); it != map.end(); ++it) {
    std::cout << it.key() << " " << it.value() << std::endl;
}

Note that it's still a work in progress, I didn't test it much yet.

I still hesitate to add it as I wonder if it's really worth it. I have to see a bit regarding how much of a performance boost it provides as the HAT-trie is already quite fast on ordered insertions.

@markg85
Copy link
Author

markg85 commented Jan 2, 2018

Hi,

Thank you very much!
I'm quite surprised by the amount of code it took to add this, i was expecting a lot less lines to be honest :)

I was trying to compile it, but i keep getting a compile error... I must be doing something wrong.
This doesn't compile:

    tsl::htrie_map<char, int> map;
    std::map<char, int> entries;
    entries.emplace("one", 1);
    entries.emplace("two", 1);
    entries.emplace("three", 1);
    map.insert_with_prefix("aa/bb/cc", entries.begin(), entries.end());

note the above without the "insert_with_prefix" function compiles thus provided here as illustration example, but the the std::map type is very wrong as that will never give the correct result if it even runs. It should be a char array which cannot be set in there directly (by the standard).
A structure like this should work though:

    tsl::htrie_map<char, int> map;
    std::map<std::string, int> entries;
    entries.emplace("one", 1);
    entries.emplace("two", 1);
    entries.emplace("three", 1);
    map.insert_with_prefix("aa/bb/cc", entries.begin(), entries.end());

See the std::string in there. That could also be a std::string_view.
But that probably requires some more tinkering on the HAT-TRIE side ;)

@Tessil
Copy link
Owner

Tessil commented Jan 2, 2018

I'm quite surprised by the amount of code it took to add this, i was expecting a lot less lines to be honest :)

The problem is if part of the end of the prefix end-up in a hash node. To avoid having to concatenate this end of the prefix with all the values in the iterators (which is expensive) to insert them in a hash node, we have to burst all the hash nodes in the way of the prefix so that each letter of the prefix will be in a trie node.

Otherwise I made the correction. My solution was a bit too naive and only supported std::pair<std::string, T>, not std::pair<const std::string, T> (which is the type of the value in std::map).

@markg85
Copy link
Author

markg85 commented Jan 3, 2018

I pulled your changes, but it's not compiling for me (GCC 7.2 on Linux, x64)

    tsl::htrie_map<char, int> map;
    std::map<std::string, int> entries;
    entries.emplace("one", 1);
    entries.emplace("two", 1);
    entries.emplace("three", 1);
    map.insert_with_prefix("aa/bb/cc", entries.begin(), entries.end());

The errors i get:

/home/mark/GitProjects/kdirchainrebuild/tsl/htrie_map.h: In instantiation of ‘void tsl::htrie_map<CharT, T, Hash, KeySizeT>::insert_with_prefix(const std::basic_string_view<CharT>&, InputIt, InputIt) [with InputIt = std::_Rb_tree_iterator<std::pair<const std::__cxx11::basic_string<char>, int> >; CharT = char; T = int; Hash = tsl::ah::str_hash<char>; KeySizeT = short unsigned int]’:
/home/mark/GitProjects/kdirchainrebuild/fstrie.cpp:70:70:   required from here
/home/mark/GitProjects/kdirchainrebuild/tsl/htrie_map.h:214:9: error: no matching function for call to ‘tsl::detail_htrie_hash::htrie_hash<char, int, tsl::ah::str_hash<char>, short unsigned int>::insert_with_prefix(const std::basic_string_view<char>&, std::basic_string_view<char>::size_type, std::_Rb_tree_iterator<std::pair<const std::__cxx11::basic_string<char>, int> >&, std::_Rb_tree_iterator<std::pair<const std::__cxx11::basic_string<char>, int> >&)’
         m_ht.insert_with_prefix(prefix, prefix.size(), first, last);
         ^~~~
In file included from /home/mark/GitProjects/kdirchainrebuild/tsl/htrie_map.h:32:0,
                 from /home/mark/GitProjects/kdirchainrebuild/fstrie.h:1,
                 from /home/mark/GitProjects/kdirchainrebuild/fstrie.cpp:1:
/home/mark/GitProjects/kdirchainrebuild/tsl/htrie_hash.h:937:10: note: candidate: template<class InputIt> void tsl::detail_htrie_hash::htrie_hash<CharT, T, Hash, KeySizeT>::insert_with_prefix(const CharT*, tsl::detail_htrie_hash::htrie_hash<CharT, T, Hash, KeySizeT>::size_type, InputIt, InputIt) [with InputIt = InputIt; CharT = char; T = int; Hash = tsl::ah::str_hash<char>; KeySizeT = short unsigned int]
     void insert_with_prefix(const CharT* prefix, size_type prefix_size, InputIt first, InputIt last) {
          ^~~~~~~~~~~~~~~~~~
/home/mark/GitProjects/kdirchainrebuild/tsl/htrie_hash.h:937:10: note:   template argument deduction/substitution failed:
In file included from /home/mark/GitProjects/kdirchainrebuild/fstrie.h:1:0,
                 from /home/mark/GitProjects/kdirchainrebuild/fstrie.cpp:1:
/home/mark/GitProjects/kdirchainrebuild/tsl/htrie_map.h:214:9: note:   cannot convert ‘prefix’ (type ‘const std::basic_string_view<char>’) to type ‘const char*’
         m_ht.insert_with_prefix(prefix, prefix.size(), first, last);
         ^~~~

@Tessil
Copy link
Owner

Tessil commented Jan 3, 2018

Sorry forgot something, could you try again with the latest commit? I'll see to test it a bit more when I have more time.

@markg85
Copy link
Author

markg85 commented Jan 3, 2018

Ha :)

Now it works.

Now for some performance numbers (as that is what i was trying to do in the first place). Note that is completely naive! I could optimize it a lot more. Also, this is just for insertion compared to insertion with prefix!

My case: index a single folder of 500.000 entries (dump entries, named from 1.txt till 500000.txt)
I'm using KDE Frameworks KIO for this. How this works is that KIO emits batches of about 200 entries till it's done. That might be important for the insert_with_prefix as that is called for roughly 500000 / 200 = ~2500 times

Anyhow, take that for granted and as 0 line benchmark.

Indexing files (no hat-trie, no storage): ~365ms
Indexing + hat-trie insert: ~2714ms
Indexing + hat-trie insert with prefix: ~1497ms

That is impressive! Nearly twice as fast.

The code (incase you're interested).
insert

    std::string currentPrefix = m_url.toStdString();

    for (const KIO::UDSEntry &entry: list)
    {
      m_trie.insert(currentPrefix + "/" + entry.stringValue(KIO::UDSEntry::UDS_NAME).toStdString(), entry);
    }

insert_with_prefix

    std::map<std::string, KIO::UDSEntry> entries;
    std::string currentPrefix = m_url.toStdString();

    for (const KIO::UDSEntry &entry: list)
    {
      entries.emplace(entry.stringValue(KIO::UDSEntry::UDS_NAME).toStdString(), entry);
    }

    m_trie.insert_with_prefix(currentPrefix, entries.begin(), entries.end());

Merely changing std::map to std::unordered_map gives a benchmark result of ~1362ms! (well over twice as fast as a normal insert)

I'm verifying if this works by counting the number of entries

    const auto [begin, end] = trie->prefix("file:///home/mark/massive_files/");
    qDebug() << "Entry count:" << std::distance(begin, end);

If that is 500000 then i call it OK :) And it is for the above tests.

Now this is still tremendously wasteful as i get the file entries as QList object, putting it in a std::map<std::string, KIO::UDSEntry> which only serves to be inserted in hat-trie... A more ideal scenario would be to have some kind of magical proxy iterator that would iterate over my list and send back the data on the fly as needed. No, this is not std::transform. That does it partly but puts it in a resulting container. Which is not what i want (unless inserting it directly in hat-trie, but that might be too far fetched).

@Tessil
Copy link
Owner

Tessil commented Jan 7, 2018

Hi,

I did some quick tests on my side using the Wikipedia title dataset used in the README.md using "/home/tessil/" as pefix.

Results:

  • insert: 5208 ms
  • insert_with_prefix: 4480 ms

So a ~14% improvement.

Code:

#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <chrono>
#include <tsl/htrie_map.h>
#include <map>



int main(int argc, char** argv) {
    const std::string prefix = "/home/tessil/";
    const std::string file_path = argv[1];
    
    std::ifstream file(file_path);
    if(!file.is_open()) {
        throw std::runtime_error("Couldn't read " + file_path + ".");
    }
    
    
    {    
        std::map<std::string, int> lines;
        for(auto it = std::istream_iterator<std::string>(file); it != std::istream_iterator<std::string>(); ++it) {
            lines.insert({*it, 1});
        }
        
        const auto chrono_start = std::chrono::high_resolution_clock::now();
        
        tsl::htrie_map<char, int> map;
        map.insert_with_prefix(prefix, lines.begin(), lines.end());
        
        const auto chrono_end = std::chrono::high_resolution_clock::now();
        
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(chrono_end - chrono_start).count() << 
                    " ms to insert_with_prefix " << map.size() << " elements." << std::endl;
    }
    
    file.clear();
    file.seekg(0);
    
    {    
        std::map<std::string, int> lines;
        for(auto it = std::istream_iterator<std::string>(file); it != std::istream_iterator<std::string>(); ++it) {
            lines.insert({prefix + *it, 1});
        }
        
        const auto chrono_start = std::chrono::high_resolution_clock::now();
        
        tsl::htrie_map<char, int> map;
        map.insert(lines.begin(), lines.end());
        
        const auto chrono_end = std::chrono::high_resolution_clock::now();
        
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(chrono_end - chrono_start).count() << 
                    " ms to insert " << map.size() << " elements." << std::endl;
    }    
}

As you said, your code could be improved by feeding the paths directly to the HAT-trie, but you could also avoid the creation of a std::string (expensive if long enough) at each iteration by passing a pointer and the length of the string to insert_with_prefix_ks.

Note also that the code only provides the basic exception safety guarantee for now. I have to see to add the strong guarantee before merging the code to master.

@markg85
Copy link
Author

markg85 commented Jan 7, 2018

That's tricky on my part. My strings are QString objects, i have to make std::string objects to put them in your container.

In fact, if you would support other character types (like QChar) then i don't have to do a conversion there. On the other hand, there are many conversion in my code.. The KIO side that reads the directory is - at it's lowest level - just a readdir with char arrays. From there on it gets converted to QString and stays that way (much more happens, but lets keep it simple).

The option that could work is providing a std::string_view implementation that can read from a QString. It would serve both your tree and keep my code as is.

As for your code, i'm surprised the prefix version is that much faster with you. If i run your benchmark the win is much less (still a win though). And if i implement batching (just adding it to the tree in batches of 100.000 lines) then everything slows down (i'm guessing due to the large std::map container which i clean up after every 100.000 lines, it's even worse with std::unordered_map). So i just don't include those numbers here.

Timings using your benchmark
3228 ms to insert_with_prefix 13664913 elements.
3558 ms to insert 13664913 elements.

@Tessil
Copy link
Owner

Tessil commented Jan 7, 2018

EDIT: Yes, I could try to support other char types. I have plans to support wchar_t, char16_t and char32_t so I could also support QChar which use UTF-16, but I still have to see how I can do that in a efficient way (currently I use an array of 256 pointers at each trie node, not feasible with larger char types).

In the meantime, there should be a way to get a raw buffer of chars from the QString when inserting an element and the use a QStringView when retrieving a string from the trie, but I don't know the best way. You can probably use static_cast<char*>(QString::data()), it should respect the strict aliasing rule as we are converting to char*.

Regarding the performance you only get a ~10% improvement (compared to my ~14%, both with GCC 7.2 and Clang 5.0). Our environment must play quite a bit (my CPU is not really the best out there).

@markg85
Copy link
Author

markg85 commented Jan 7, 2018

I see what you mean, yes that could work.
For the other char types, i leave that in your capable hands :)

Anyhow, i think we have now - somewhat - concluded that this feature request can have benefits. In the case of indexing an already hierarchical data structure (like a filesystem tree to this trie), it can provide quite big performance improvements. For non hierarchical data structures (say a dictionary) it probably won't provide any benefit or a very small one at best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants