Generalize unigram and bigram ingest methods #134

organisciak · 2017-04-05T17:59:06Z

create_unigram_book_counts and create_bigram_book_counts are redundant. Refactoring them into one method may make sense, so that updates made to one don't need to be copy-pasted into the other. Ultimately the functions are the same; only the arguments and naming differ.
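As a rough illustration of the refactoring, assuming the two functions differ only in how many word columns they handle and in table naming (their bodies are not shown in this issue), a single parameterized function might look something like the sketch below. The names `create_ngram_book_counts` and `ngram_columns`, and the table names, are hypothetical and not taken from the codebase.

```python
def ngram_columns(n):
    """Return the word-id column names for an n-gram table: word1, word2, ..."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return ["word{}".format(i) for i in range(1, n + 1)]


def create_ngram_book_counts(n):
    """Hypothetical single entry point replacing the unigram and bigram variants.

    Returns the (assumed) table name and the word columns for n-grams of order n.
    """
    # Assumed naming scheme; the real table names may differ.
    names = {1: "unigrams", 2: "bigrams"}
    table_name = names.get(n, "{}grams".format(n))
    return table_name, ngram_columns(n)


# The old entry points could then become thin wrappers during a transition.
def create_unigram_book_counts():
    return create_ngram_book_counts(1)


def create_bigram_book_counts():
    return create_ngram_book_counts(2)
```

With this shape, a fix made in `create_ngram_book_counts` would apply to both unigrams and bigrams without copy-pasting.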

bmschmidt · 2017-04-05T19:55:49Z

It would be good for this method to include two variables that specify the number of bytes used to store the wordids and bookids.

Just sketching it out, something like this.

    def create_wordcount_table(ngrams, wordid_bytes=3, bookid_bytes=3):
        """
        wordid_bytes: 3 or 4. The number of bytes used to store wordids; 3 reduces file sizes by 25%
                   and may speed up queries, but limits the vocabulary to 16 million words.
        bookid_bytes: 3 or 4. The number of bytes used to store bookids; 3 reduces file sizes by 25%
                   and may speed up queries, but limits the library to 16 million documents.
        """
        # Map a byte width to the matching MySQL unsigned integer type.
        vartypes = {3: "MEDIUMINT UNSIGNED", 4: "INT UNSIGNED"}
        table_string = "TABLE word1 {}, bookid {}, count MEDIUMINT UNSIGNED".format(
            vartypes[wordid_bytes], vartypes[bookid_bytes])

I know of one group that has hacked at the code to allow bookid to be an INT UNSIGNED rather than a MEDIUMINT UNSIGNED, which is necessary when ingesting more than 16 million volumes. A little work needs to be done in other places before this support is complete, but it would be nice to lay the groundwork here.

A two-byte unsigned int goes up to 65,535 and a one-byte unsigned int up to 255. I can imagine a few cases where these might be useful, such as using a Bookworm to store named entities rather than actual words. But space is unlikely to be as big a deal in those cases as in the base one, so 3 and 4 are the only widths that need to be supported.
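To make the limits concrete, here is a small sketch of the arithmetic behind each byte width, along with one possible way to pick the narrower of the two supported types for a given maximum id. The helper names `max_unsigned` and `smallest_type_for` are made up for illustration and are not part of the existing code; the type names are the standard MySQL unsigned integer types.

```python
# MySQL unsigned integer types keyed by storage size in bytes.
MYSQL_UNSIGNED_TYPES = {
    1: "TINYINT UNSIGNED",
    2: "SMALLINT UNSIGNED",
    3: "MEDIUMINT UNSIGNED",
    4: "INT UNSIGNED",
}


def max_unsigned(n_bytes):
    """Largest value an unsigned integer of n_bytes bytes can hold."""
    return 2 ** (8 * n_bytes) - 1


def smallest_type_for(max_id, allowed=(3, 4)):
    """Pick the narrowest allowed MySQL type that can store ids up to max_id.

    Only 3 and 4 bytes are allowed by default, matching the widths
    proposed in this thread.
    """
    for n_bytes in sorted(allowed):
        if max_id <= max_unsigned(n_bytes):
            return MYSQL_UNSIGNED_TYPES[n_bytes]
    raise ValueError("max_id {} does not fit in any allowed width".format(max_id))


for width in sorted(MYSQL_UNSIGNED_TYPES):
    print(width, MYSQL_UNSIGNED_TYPES[width], max_unsigned(width))
```

Under these assumptions, a library of 20 million volumes would need `INT UNSIGNED` for bookid, while anything up to 16,777,215 fits in `MEDIUMINT UNSIGNED`.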

organisciak added the Feature request label Apr 5, 2017

organisciak self-assigned this Apr 5, 2017

organisciak mentioned this issue Apr 11, 2017

HTRC improvements #136

Merged

bmschmidt mentioned this issue Nov 16, 2018

Restoring fast feature counting #89

Closed
