New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add get_random_subset
poc utility function
#1928
Conversation
2a248f5
to
43d4441
Compare
@R-Palazzo just to respond to some of your considerations:
I think it might make more sense to treat NaNs similarly to how we do with
I'm actually not sure we want a fixed seed here, since it might be helpful to re-run the function to get a different subsample. Maybe instead we could control the randomness in a similar way to how we control it when we sample from synthesizers?
In my opinion, I think we can leave the index as-is. We don't use the index, and users might want to be able to compare the subsampled tables back to their original data.
Nice, I think this works well :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we try to align this more with the strategy used in the database subsampling? They seem pretty different at the moment
sdv/multi_table/utils.py
Outdated
return ancestors | ||
|
||
|
||
def _get_disconnected_roots_from_table(relationship, table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be relationships
sdv/multi_table/utils.py
Outdated
Parent table to subsample. | ||
parent_primary_key (str): | ||
Name of the primary key of the parent table. | ||
pk_referenced_before_parent (set): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: could we move parent to the front? ie. parent_pks_referenced_before
.
sdv/multi_table/utils.py
Outdated
from sdv.multi_table import HMASynthesizer | ||
from sdv.multi_table.hma import MAX_NUMBER_OF_COLUMNS | ||
|
||
MODELABLE_SDTYPE = ['categorical', 'numerical', 'datetime', 'boolean'] | ||
RANDOM_STATE = 42 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we should control randomness for this
a1e0bcb
to
b93e71d
Compare
b93e71d
to
f45a39a
Compare
CU-86azvqpqe
Resolve #1877
A few considerations regarding this PR:
1 -
NaNs handling
: Currently, I don't drop NaNforeign keys
2 -
Randomness
: For reproducibility, I set a seed, is it fine?3 -
Index
: Should we reset the index of the tables at the end after all the dropping is done? This is also a question fordrop_unknown_references
.4 -
Verbosity
: To be consistent with the otherPOC methods
, I added a verbose parameter tosimplify_schema()
5 -
Disconnected schema
: Subsampling disconnected schema should work withget_random_subset
. I wrote down a test where I mocked the metadata validation since disconnected schemas are not supported there.Thanks for your review and your thoughts on this ;)