Use PooledVector for "categorical" columns like "bus type code" (buses.ide)? #9

Open
nickrobinson251 opened this issue Sep 22, 2021 · 0 comments
Labels
idea needs some investigation before we decide

nickrobinson251 commented Sep 22, 2021

Bus type code is either 1, 2, 3 or 4. Currently this is parsed into a Vector{Int64}, but it could potentially be more efficient in a couple of ways:

  1. the integers could be parsed into an Int8 rather than Int64 (technically I suppose we only need 2 bits, but Int8 is probably the smallest type it is actually practical to give users).
  2. rather than being stored in an N-length Vector (N values of type T=Int64), it could be stored as a PooledVector (N UInt8 references plus a 4-element UInt8 => T mapping to the values).

And both options could be combined (e.g. pool and have T=Int8).
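
To make the storage comparison concrete, here is a rough sketch in plain Julia. The sample codes are made up, and `pool`/`refs` are a hand-rolled stand-in for the pooled layout described above, not the actual PooledArrays.jl internals (which may differ, e.g. in the default reference type).

```julia
# Hypothetical bus type codes; values are always 1, 2, 3 or 4.
codes64 = Int64[1, 2, 1, 3, 4, 1, 2, 1]

# Option 1: narrow the element type from Int64 to Int8.
codes8 = Int8.(codes64)

# Option 2 (hand-rolled sketch): N UInt8 references plus a small pool
# holding each distinct value once.
pool = sort(unique(codes8))                            # 4-element Vector{Int8}
refs = UInt8[findfirst(==(c), pool) for c in codes8]   # index into `pool` per entry

bytes_int64  = sizeof(codes64)              # 8 bytes per entry
bytes_int8   = sizeof(codes8)               # 1 byte per entry
bytes_pooled = sizeof(refs) + sizeof(pool)  # 1 byte per entry, plus the pool

# The original column can be recovered by indexing the pool with the references.
recovered = pool[refs]
```

Under these assumptions, pooling on top of Int8 does not shrink the column any further, since each reference already takes as much space as an Int8 value.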

Option 2 doesn't sound worth it on storage efficiency alone, but it could provide practical performance improvements to users depending on how the pooled columns (e.g. "bus type code") are going to be used: certain operations (joins, mapping, ...) can be very efficient on PooledVectors, as they can work with the 4 pooled values rather than all N entries.
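
As an illustration of why mapping can be cheap on a pooled column, here is a minimal sketch, again using a hand-rolled `pool`/`refs` pair rather than the PooledArrays.jl API. The `label` function and the sample codes are hypothetical and only there to have something to map.

```julia
# Hypothetical per-code transformation; stands in for any function of the code.
label(code) = "type-$code"

codes = Int8[1, 2, 1, 3, 4, 1, 2, 1]

# Hand-rolled pooled layout: distinct values plus one reference per entry.
pool = sort(unique(codes))
refs = UInt8[findfirst(==(c), pool) for c in codes]

# Apply the function to the 4 pooled values only, then reuse the references
# to expand the result back to N entries.
pool_labels = map(label, pool)
labels = pool_labels[refs]

# Same result as calling the function on every entry.
same = labels == map(label, codes)
```

The function runs once per distinct value instead of once per row, which is the kind of saving that would need to be weighed against how these columns are actually used.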

So we should probably do 1 (i.e. change to Int8s) and then investigate how these columns will be used to decide about 2 (pooling).

@nickrobinson251 nickrobinson251 added the idea needs some investigation before we decide label Oct 26, 2021