
Use random instead of sequential bookids. #103

Open
bmschmidt opened this issue Jun 24, 2016 · 2 comments
@bmschmidt
Member

A niche case for some extensions is to use a bookworm word count index as an input to a batched machine learning process. I sometimes want to train a neural network on word counts, for example.

A SELECT * FROM master_bookcounts query will return this data sorted by bookid. Most large text corpora come in some pre-specified order; Chronicling America, for example, is arranged by newspaper, and then by year within each newspaper. That ordering makes learning less effective, since each batch should be as representative of the whole as possible. A gradient descent algorithm might lurch quite far off in the wrong direction, or even (gasp) find and settle into a bad local minimum.

If bookids are randomly assigned rather than assigned sequentially, there's a workaround: the table's primary key allows SELECT * FROM master_bookcounts ORDER BY bookid to be delivered nearly as fast as a table scan, but now in a random order.

Since bookids are arbitrary and internal, I don't see a downside here. It's possible, though, that index construction might take longer if the tables aren't sorted to begin with. So that's something to watch out for.

This might create problems for adding new bookids down the road, though. To avoid that, the random ids should be drawn from the full range [1, 16777215], which is the maximum bookid supported under the standard scheme, not just from [1, catalog_length].
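For concreteness, here's a minimal sketch of the intended consumption pattern. It assumes a pymysql connection and the usual (bookid, wordid, count) columns in master_bookcounts; the connection details and column names are illustrative, not part of any existing bookworm API.

```python
# Sketch only: stream minibatches straight off the word-count index.
# If bookids were assigned randomly at build time, bookid order is
# already a shuffled order, delivered at roughly table-scan speed.
import pymysql

def minibatches(conn, batch_size=10000):
    # Server-side cursor so the whole table is never pulled into memory.
    with conn.cursor(pymysql.cursors.SSCursor) as cur:
        cur.execute(
            "SELECT bookid, wordid, count FROM master_bookcounts ORDER BY bookid"
        )
        rows = cur.fetchmany(batch_size)
        while rows:
            yield rows
            rows = cur.fetchmany(batch_size)
```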

@organisciak
Member

Seems unnecessarily complex for a very specific use. I've been training based on seeded hashes; can something like that be implemented for random but predictable ordering? With your approach, would you check for id collisions, or just make the id space large enough that collisions can reasonably be expected not to happen?
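One way to read the seeded-hash idea, as a sketch only: the identifiers and seed below are made up, and nothing here exists in bookworm itself.

```python
# Sketch only: deterministic pseudo-random ordering from a seeded hash,
# computed on the client side instead of changing bookid assignment.
import hashlib

def hash_order(doc_ids, seed="0"):
    # Sorting by the hash of (seed + id) gives a stable but effectively
    # shuffled order; changing the seed gives a different shuffle.
    return sorted(
        doc_ids,
        key=lambda d: hashlib.sha1((seed + str(d)).encode()).hexdigest(),
    )

print(hash_order(["doc1", "doc2", "doc3"], seed="42"))
```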


@bmschmidt
Member Author

The numbers here are small enough that there doesn't need to be hashing for assignment.

Essentially, right now, the books are assigned ids by popping them off the list [1, 2, 3, ..., 16777215].
I was thinking of just materializing that full list in memory and doing a Fisher-Yates shuffle or something on it before running the id assignment. That would guarantee no collisions.

It should be possible to run that shuffle with a seed. But I'm not sure what the benefits would be of that.
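Roughly, as a sketch (Python's random.shuffle is a Fisher-Yates shuffle under the hood; the function name and seed here are illustrative, not existing code):

```python
# Sketch only: shuffle the full id space once, then pop ids off as
# documents are catalogued. Seeding makes the assignment reproducible.
import random

MAX_BOOKID = 16777215  # 2**24 - 1, the top of the standard bookid range

def make_id_pool(seed=None):
    ids = list(range(1, MAX_BOOKID + 1))
    random.Random(seed).shuffle(ids)  # Fisher-Yates, no collisions possible
    return ids

pool = make_id_pool(seed=2016)
next_bookid = pool.pop()  # replaces the old sequential counter
```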

If using a four-byte integer (as they do in Chicago, I've heard), the assignment would have to check for collisions against the already-assigned documents. That's OK, because ID assignment has never been a thread-safe (or thread-demanding) process.
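A sketch of that collision-checking variant, again with illustrative names only:

```python
# Sketch only: with a four-byte id space, draw at random and retry on
# collision rather than materializing and shuffling the whole range.
import random

MAX_ID = 2**32 - 1
assigned = set()  # in practice, the ids already present in the catalog

def random_bookid():
    while True:
        candidate = random.randint(1, MAX_ID)
        if candidate not in assigned:
            assigned.add(candidate)
            return candidate
```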
