
Use random instead of sequential bookids. #103

Open
bmschmidt opened this issue Jun 24, 2016 · 2 comments
@bmschmidt
Member

A niche case for some extensions is to use a bookworm word count index as an input to a batched machine learning process. I sometimes want to train a neural network on word counts, for example.

A SELECT * FROM master_bookcounts query will return this data sorted by bookid. Most large text corpora come in some pre-specified order; Chronicling America, for example, is arranged by newspaper, and then by year within each newspaper. That ordering makes learning less effective, since each batch should be as representative of the whole as possible. A gradient descent algorithm might lurch quite far off in the wrong direction, or even (gasp) find and settle into a bad local minimum.

If bookids are randomly assigned rather than assigned sequentially, there's a workaround: the table's primary key allows SELECT * FROM master_bookcounts ORDER BY bookid to be delivered nearly as fast as a table scan, but now in a random order.

Since bookids are arbitrary and internal, I don't see a downside here. It's possible, though, that index construction might take longer if the tables aren't sorted to begin with. So that's something to watch out for.

This might create problems for adding new bookids down the road, though. To avoid that, the random ids should be drawn from the full range [1, 16777215], which is the maximum bookid supported under the standard scheme, not just from [1, catalog_length].
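For concreteness, here's a minimal sketch of the intended consumption pattern. It assumes a pymysql connection and the usual (bookid, wordid, count) columns in master_bookcounts; the connection details and column names are illustrative, not part of any existing bookworm API.

```python
# Sketch only: stream minibatches straight off the word-count index.
# If bookids were assigned randomly at build time, bookid order is
# already a shuffled order, delivered at roughly table-scan speed.
import pymysql

def minibatches(conn, batch_size=10000):
    # Server-side cursor so the whole table is never pulled into memory.
    with conn.cursor(pymysql.cursors.SSCursor) as cur:
        cur.execute(
            "SELECT bookid, wordid, count FROM master_bookcounts ORDER BY bookid"
        )
        rows = cur.fetchmany(batch_size)
        while rows:
            yield rows
            rows = cur.fetchmany(batch_size)
```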

@organisciak
Member

Seems unnecessarily complex for a very specific use. I've been training based on seeded hashes; can something like that be implemented for random but predictable ordering? With your approach, would you check for id collisions, or just make the id space large enough that collisions can reasonably be expected not to happen?
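One way to read the seeded-hash idea, as a sketch only: the identifiers and seed below are made up, and nothing here exists in bookworm itself.

```python
# Sketch only: deterministic pseudo-random ordering from a seeded hash,
# computed on the client side instead of changing bookid assignment.
import hashlib

def hash_order(doc_ids, seed="0"):
    # Sorting by the hash of (seed + id) gives a stable but effectively
    # shuffled order; changing the seed gives a different shuffle.
    return sorted(
        doc_ids,
        key=lambda d: hashlib.sha1((seed + str(d)).encode()).hexdigest(),
    )

print(hash_order(["doc1", "doc2", "doc3"], seed="42"))
```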


@bmschmidt
Member Author

The numbers here are small enough that there doesn't need to be hashing for assignment.

Essentially, right now, the books are assigned ids by popping them off the list [1, 2, 3, ..., 16777215].
I was thinking of just materializing that full list in memory and doing a Fisher-Yates shuffle or something on it before running the id assignment. That would guarantee no collisions.

It should be possible to run that shuffle with a seed. But I'm not sure what the benefits would be of that.
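Roughly, as a sketch (Python's random.shuffle is a Fisher-Yates shuffle under the hood; the function name and seed here are illustrative, not existing code):

```python
# Sketch only: shuffle the full id space once, then pop ids off as
# documents are catalogued. Seeding makes the assignment reproducible.
import random

MAX_BOOKID = 16777215  # 2**24 - 1, the top of the standard bookid range

def make_id_pool(seed=None):
    ids = list(range(1, MAX_BOOKID + 1))
    random.Random(seed).shuffle(ids)  # Fisher-Yates, no collisions possible
    return ids

pool = make_id_pool(seed=2016)
next_bookid = pool.pop()  # replaces the old sequential counter
```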

If using a four-byte integer (as they do in Chicago, I've heard), the assignment would have to check for collisions against the already-assigned documents. That's OK, because ID assignment has never been a thread-safe (or thread-demanding) process.
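A sketch of that collision-checking variant, again with illustrative names only:

```python
# Sketch only: with a four-byte id space, draw at random and retry on
# collision rather than materializing and shuffling the whole range.
import random

MAX_ID = 2**32 - 1
assigned = set()  # in practice, the ids already present in the catalog

def random_bookid():
    while True:
        candidate = random.randint(1, MAX_ID)
        if candidate not in assigned:
            assigned.add(candidate)
            return candidate
```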
