Categorical binning improvement #90

Open
gbordyugov opened this issue Apr 3, 2017 · 7 comments
Comments

@gbordyugov
Contributor

gbordyugov commented Apr 3, 2017

To bin a list x into N bins, one could simply go for the bin index given by

binIndex = hash(x[i]) % N

@jbao

jbao commented Apr 7, 2017

@gbordyugov Sounds interesting, can you provide a reproducible example?

@gbordyugov
Contributor Author

objectsToBin = ['those', 'strings', 'should', 'be', 'binned', 'in', 'three', 'bins']

nBins = 3

bins = [hash(o) % nBins for o in objectsToBin]
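
For a reproducible variant of the snippet above, one option is to replace the built-in `hash` with a checksum that does not depend on the interpreter session. The sketch below uses `zlib.crc32` from the standard library; the helper name `stable_bin` and the choice of CRC32 are assumptions made for illustration, not something proposed in this thread.

```python
import zlib


def stable_bin(value, n_bins):
    """Map a value to a bin index in [0, n_bins) via CRC32 of its text form."""
    # Assumption: values are compared by their str() form, encoded as UTF-8.
    return zlib.crc32(str(value).encode("utf-8")) % n_bins


objects_to_bin = ['those', 'strings', 'should', 'be', 'binned', 'in', 'three', 'bins']
n_bins = 3

# Same idea as hash(o) % nBins, but the result is identical on every run.
bins = [stable_bin(o, n_bins) for o in objects_to_bin]
print(bins)
```

Equal inputs always get the same bin index, which is the property the hashing idea relies on. Distinct inputs may still share a bin.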

@jbao

jbao commented Apr 10, 2017

OK, but how does this relate to categorical binning, where the use case is usually not random assignment, e.g. grouping ['a','a','b','b','b'] into 2 groups?

@gbordyugov
Contributor Author

Hashing is not random: hash('a') always returns the same int, if I'm not mistaken.
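
One caveat worth checking here: since Python 3.3, hashes of `str` objects are salted per interpreter process unless the `PYTHONHASHSEED` environment variable is set. `hash('a')` is therefore stable within one session but can differ between sessions. The sketch below launches fresh interpreters with a chosen seed to make this visible; the helper name `hash_bin_in_fresh_process` is hypothetical.

```python
import os
import subprocess
import sys


def hash_bin_in_fresh_process(text, n_bins, seed):
    """Return hash(text) % n_bins as computed by a new interpreter
    started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "import sys; print(hash(sys.argv[1]) % int(sys.argv[2]))"
    out = subprocess.run(
        [sys.executable, "-c", code, text, str(n_bins)],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)


# A fixed seed gives the same bins on every run; different seeds may not agree.
for seed in (0, 1, 2):
    print(seed, [hash_bin_in_fresh_process(s, 2, seed) for s in ("a", "b")])
```

If the bins change from seed to seed, that would be consistent with different machines or sessions reporting different results for the same input.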

@jbao

jbao commented Apr 10, 2017

That's what I thought too, but in my example, it returns [0,0,0,0,0], or am I missing something here?

@gbordyugov
Contributor Author

gbordyugov commented Apr 11, 2017

In [1]: hash('a') % 2
Out[2]: 1

In [3]: hash('b') % 2
Out[4]: 0

Collisions between full hash values are, of course, possible but extremely rare. Whether hash('a') % 2 and hash('b') % 2 come out as the same number, however, really seems to depend on the particular Python installation.
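
Note that with only N bins, two distinct labels end up in the same bin with probability of roughly 1/N under a well-mixing hash, so a result such as [0,0,0,0,0] for two labels and two bins is not unusual. If distinct categories must be spread over the bins, a different technique avoids this: number the categories in order of first appearance and take that number modulo N. This is a swapped-in alternative to hashing, sketched under the assumption that the full list is available up front; the function name is hypothetical.

```python
def bins_by_first_appearance(values, n_bins):
    """Assign bins by enumerating distinct values in order of first appearance."""
    index_of = {}  # distinct value -> running index 0, 1, 2, ...
    result = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(index_of)
        # Equal values always share a bin; the first n_bins distinct
        # values are guaranteed to land in different bins.
        result.append(index_of[v] % n_bins)
    return result


print(bins_by_first_appearance(['a', 'a', 'b', 'b', 'b'], 2))  # [0, 0, 1, 1, 1]
```

Unlike hashing, the result depends on the order of the input, so the same label can get a different bin in a different dataset.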

@jbao

jbao commented Apr 11, 2017

Yes, I still don't quite get it (I was not able to reproduce the results), so I will have to research further.
