(Optionally) don't index bookids #137

Open
bmschmidt opened this issue Apr 13, 2017 · 5 comments
@bmschmidt
Member

bmschmidt commented Apr 13, 2017

@organisciak

The bookid indices take a long time to build on sources like Hathi. They could simply be dropped to cut index creation time; as far as I know, that requires only eliminating this line of code:

https://github.com/Bookworm-project/BookwormDB/blob/master/bookwormDB/CreateDatabase.py#L295

The only complication I can see is that the creation of the 'nwords' table works from that index, I believe; so 'nwords' might have to be created from the flat files instead. That's not a problem, but it is a little more work.
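For reference, a hypothetical sketch of the CREATE TABLE statement that line lives in; the column types and layout below are illustrative assumptions, not copied from CreateDatabase.py:

# Hypothetical sketch of the master_bookcounts creation around
# CreateDatabase.py#L295; column types are assumptions for illustration.
create_bookcounts = """
CREATE TABLE master_bookcounts (
    bookid MEDIUMINT UNSIGNED NOT NULL,
    wordid MEDIUMINT UNSIGNED NOT NULL,
    count  MEDIUMINT UNSIGNED NOT NULL,
    INDEX (bookid, wordid, count)  -- the bookid reverse index this issue proposes skipping
);
"""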

@organisciak
Member

I started this work in https://github.com/Bookworm-project/BookwormDB/tree/small_index

As it currently works, you add --no-reverse-index to bookworm prep database_wordcounts. The name may be a little confusing, because the change is made at table creation rather than at a separate indexing step.
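For example, on the small_index branch the invocation would look like:

bookworm prep database_wordcounts --no-reverse-index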

@organisciak
Member

About the nwords table: it can be calculated from the source files, multithreaded through dask. Using the h5 files I've been working with:

import dask.dataframe as dd

# Sum the per-word counts within each book across all of the h5 files;
# 'id' is the book identifier in the 'unigrams' store.
bookcounts = dd.read_hdf('./*.h5', 'unigrams').reset_index().groupby('id')[['count']].sum()
bookcounts.to_csv('nwords.tsv', sep='\t')

A similar process can be done with the tab-separated files that are used for LOAD DATA INFILE, though somewhat slower because of the IO bottleneck.
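A minimal sketch of that variant, assuming the flat files are headerless TSVs laid out as bookid, wordid, count (the paths and column names here are placeholders, not the project's actual layout):

import dask.dataframe as dd

# Placeholder paths and column names: assumes headerless
# bookid<TAB>wordid<TAB>count rows in each flat file.
unigrams = dd.read_csv('./unigrams/*.txt', sep='\t',
                       names=['bookid', 'wordid', 'count'])
nwords = unigrams.groupby('bookid')[['count']].sum()
nwords.to_csv('nwords.tsv', sep='\t', single_file=True)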

A simple way forward would entail the following actions:

  1. if the --no-reverse-index option was used, do the above summing on everything in the unigrams text folder. Save this to a new file in .bookworm (nwords.txt, perhaps?) and touch a file in 'targets'
  2. when variableSet.createNwordsFile is run and an nwords file has been created, just load that file in

However, there are a few sticking points that may complicate this.

First, with bookworm prep database_wordcounts --no-delete and --no-index, I've tried to support a partial-ingest use case: you can slurp some files from the unigrams folder, then replace the files and slurp up the new ones. This might help with file size limits or concurrent workflows for some people. The nwords workflow above breaks partial ingest, because it expects all the input unigram files to exist at once.

Secondly, you can't rely on the --no-reverse-index flag to tell you when you are supposed to build the nwords table, because the table could have been created earlier.

So, to account for these, it might make sense to modify the above steps in three ways. First, initiate the flat-file nwords creation based on whether INDEX(bookid,wordid,count) is or is not in the schema, rather than looking for the flag. Second, sum up bookcounts per unigram input file, as each file is being slurped up with LOAD DATA INFILE. Finally, alongside the index-sorting step, merge and sum those intermediate files into a single file.
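A rough sketch of the second and third steps, using pandas; the paths, column names, and file layout are hypothetical placeholders:

import glob
import pandas as pd

# As each unigram flat file is slurped with LOAD DATA INFILE, also write an
# intermediate per-book sum for it. (Hypothetical layout: headerless
# bookid<TAB>wordid<TAB>count rows.)
def write_partial_nwords(unigram_path, out_path):
    chunk = pd.read_csv(unigram_path, sep='\t',
                        names=['bookid', 'wordid', 'count'])
    chunk.groupby('bookid')[['count']].sum().to_csv(out_path, sep='\t')

# Later, alongside the index-sorting step, merge and re-sum the intermediate
# files into the single nwords file to be loaded.
def merge_partial_nwords(partials_glob, out_path):
    parts = [pd.read_csv(p, sep='\t') for p in glob.glob(partials_glob)]
    pd.concat(parts).groupby('bookid')[['count']].sum().to_csv(out_path, sep='\t')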

@bmschmidt
Member Author

I'll trust your judgment about what's easiest.

I should say that there's nothing especially desirable about creating the nwords table through a SQL query; in fact, I suspect it's inefficient compared to doing the word counts at the moment (say) that the unigram files are being created. (The index is needed to group by book in SQL, but the books already arrive grouped in the flat files.) So I'd be happy to see the whole step moved to flat files.

Rather than creating a single nwords.txt file, it might make sense to create a folder at .bookworm/texts/nwords/ filled with files that can be slurped into a single table using LOAD DATA LOCAL INFILE, just as for the full word counts. This would (if I understand right) allow piecewise creation; it would also make parallelizing the creation trivial.
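A minimal sketch of that slurping step, assuming a MySQLdb/pymysql-style cursor with local_infile enabled; the table schema and folder layout here are hypothetical:

import glob

def load_nwords(cursor, folder='.bookworm/texts/nwords/'):
    # Hypothetical schema; the real table definition may differ.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS nwords (
            bookid MEDIUMINT UNSIGNED NOT NULL,
            nwords INT UNSIGNED NOT NULL
        )""")
    # Slurp every per-chunk file, mirroring how the full word counts are loaded.
    for path in sorted(glob.glob(folder + '*')):
        cursor.execute(
            "LOAD DATA LOCAL INFILE %s INTO TABLE nwords "
            "FIELDS TERMINATED BY '\\t'", (path,))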

The nwords table is small enough (16 million rows) that there shouldn't be a major performance hit from simply dropping it entirely and recreating it when needed.

@bmschmidt
Member Author

One other point: currently nwords is defined as 'number of tokens inside the whitelist of known tokens.' It would be equally reasonable for it to be 'number of tokens, on and off the whitelist'; it just needs to be documented and consistent across builds. (Not sure whether your h5 files could produce the latter.)

@organisciak
Member

The h5 files are simply more densely stored versions of what you want in master_bookcounts, so they contain whitelist-only data.
