
Generalize unigram and bigram ingest methods #134

Open
organisciak opened this issue Apr 5, 2017 · 1 comment

@organisciak
Member

create_unigram_book_counts and create_bigram_book_counts are redundant. Refactoring may make sense so that updates made to one don't need to be copy-pasted into the other. Ultimately the functions are the same; only the arguments and naming differ.
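
One possible shape for that refactor, as a rough sketch only: a single function parameterized by n-gram length, with the two existing entry points reduced to thin wrappers. Everything here is hypothetical. The helper name `ngram_table_spec`, the placeholder table names, and the `word1`, `word2` column scheme are assumptions for illustration, not the actual bodies of create_unigram_book_counts or create_bigram_book_counts.

```python
def ngram_table_spec(n):
    """Hypothetical helper: describe the count table for n-grams of length n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumed naming scheme; the real table names may differ.
    table = "ngram{}_bookid_counts".format(n)
    # One word column per position in the n-gram: word1, word2, ...
    word_columns = ["word{}".format(i) for i in range(1, n + 1)]
    return {"table": table, "columns": word_columns + ["bookid", "count"]}


def unigram_table_spec():
    # Thin wrapper: all shared logic lives in ngram_table_spec.
    return ngram_table_spec(1)


def bigram_table_spec():
    # Thin wrapper: differs from the unigram case only in n.
    return ngram_table_spec(2)
```

With this layout a change to the shared logic is made once, and the unigram and bigram variants pick it up automatically.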

@bmschmidt
Member

It would be good for this method to include two variables that specify the number of bytes used to store the wordids and bookids.

Just sketching it out, something like this.

    def create_wordcount_table(ngrams, wordid_bytes=3, bookid_bytes=3):
        """
        wordid_bytes: 3 or 4. The number of bytes to store wordids; 3 reduces file sizes by 25%
                   and may speed up queries, but limits the vocabulary to 16 million words.
        bookid_bytes: 3 or 4. The number of bytes to store bookids; 3 reduces file sizes by 25%
                   and may speed up queries, but limits the library to 16 million documents.
        """
        # Map a byte width to the MySQL unsigned integer type of that size.
        vartypes = {3: "MEDIUMINT UNSIGNED", 4: "INT UNSIGNED"}
        table_string = "TABLE word1 {}, bookid {}, count MEDIUMINT UNSIGNED".format(
            vartypes[wordid_bytes], vartypes[bookid_bytes])

I know of one group that has hacked at the code to allow bookid to be an INT UNSIGNED rather than MEDIUMINT UNSIGNED, which is necessary if ingesting more than 16 million volumes. There is a little work that needs to be done in other places before this support is total, but it would be nice to lay the groundwork here.

A two-byte int goes up to 65,535 and a one-byte int to 255. I can imagine a few cases where these might be useful if you're using a Bookworm to store named entities rather than actual words. But space is unlikely to be as big a deal in those cases as in the base one, so 3 and 4 are the only widths necessary to support.
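
To make the limits concrete, here is a small sketch of the arithmetic: an unsigned integer of n bytes holds values up to 2^(8n) - 1. The helper names `max_unsigned` and `smallest_type_for` are made up for illustration and are not part of the existing code. The type names are the standard MySQL unsigned integer types.

```python
# Standard MySQL unsigned integer types, keyed by storage size in bytes.
MYSQL_UNSIGNED_TYPES = {
    1: "TINYINT UNSIGNED",
    2: "SMALLINT UNSIGNED",
    3: "MEDIUMINT UNSIGNED",
    4: "INT UNSIGNED",
}


def max_unsigned(n_bytes):
    """Largest value an unsigned integer of n_bytes bytes can hold."""
    return 2 ** (8 * n_bytes) - 1


def smallest_type_for(max_id, allowed=(3, 4)):
    """Hypothetical helper: pick the narrowest allowed type that fits max_id."""
    for n_bytes in sorted(allowed):
        if max_id <= max_unsigned(n_bytes):
            return MYSQL_UNSIGNED_TYPES[n_bytes]
    raise ValueError("max_id {} does not fit in any allowed width".format(max_id))


if __name__ == "__main__":
    for n_bytes in sorted(MYSQL_UNSIGNED_TYPES):
        print(n_bytes, MYSQL_UNSIGNED_TYPES[n_bytes], max_unsigned(n_bytes))
```

The default `allowed=(3, 4)` reflects the point above that only the 3- and 4-byte widths need support; the narrower types are listed just to show where the 255 and 65,535 ceilings come from.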
