Add `get_random_subset` poc utility function #1928

R-Palazzo · 2024-04-18T12:52:06Z

CU-86azvqpqe
Resolve #1877

A few considerations regarding this PR:
1 - NaNs handling: Currently, I don't drop NaN foreign keys
2 - Randomness: For reproducibility, I set a seed, is it fine?
3 - Index: Should we reset the index of the tables at the end after all the dropping is done? This is also a question for drop_unknown_references.
4 - Verbosity: To be consistent with the other POC methods, I added a verbose parameter to simplify_schema()
5 - Disconnected schema: Subsampling a disconnected schema should work with get_random_subset. I wrote a test in which I mocked the metadata validation, since disconnected schemas are not supported there.

Thanks for your review and your thoughts on this ;)

sdv-team · 2024-04-18T12:52:08Z

Task linked: CU-86azvqpqe SDV - Add get_random_subset poc utility function #1877

sdv/multi_table/utils.py

sdv/utils/poc.py

frances-h · 2024-04-22T18:46:17Z

@R-Palazzo just to respond to some of your considerations:

> 1 - NaNs handling: Currently, I don't drop NaN foreign keys

I think it might make more sense to treat NaNs the same way we do in drop_unknown_references and add a flag to indicate whether we should drop them. As an added consideration, if we don't drop them, maybe we should update the functionality to drop a proportional number of NaN foreign keys to keep the table balanced?
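As a rough illustration of the flag idea, here is a minimal pandas sketch. The helper name `drop_unknown_fk_rows` and the parameter name `drop_missing_values` are assumptions made for this example, and the code is not SDV's actual implementation. It does not attempt the proportional dropping of NaN foreign keys mentioned above.

```python
import pandas as pd


def drop_unknown_fk_rows(child, parent, fk, pk, drop_missing_values=False):
    """Keep child rows whose foreign key matches an existing parent key.

    Hypothetical helper for illustration only, not part of SDV.
    """
    known = child[fk].isin(parent[pk])
    missing = child[fk].isna()
    # NaN never matches a parent key, so `known` is False for missing values.
    # Keep those rows unless the caller explicitly asks to drop them.
    keep = known if drop_missing_values else (known | missing)
    return child[keep]


parent = pd.DataFrame({'id': [1, 2]})
child = pd.DataFrame({'parent_id': [1, 3, None, 2]})

kept_with_nans = drop_unknown_fk_rows(child, parent, 'parent_id', 'id')
kept_without_nans = drop_unknown_fk_rows(
    child, parent, 'parent_id', 'id', drop_missing_values=True
)
```

With the flag off, the row with the missing foreign key survives and only the row pointing at the unknown parent `3` is removed. With the flag on, both are removed.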

> 2 - Randomness: For reproducibility, I set a seed, is it fine?

I'm actually not sure we want a fixed seed here, since it might be helpful to re-run the function to get a different subsample. Maybe instead we could control the randomness in a similar way to how we control it when we sample from synthesizers?
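One generic way to get both behaviours is an optional seed. The sketch below uses plain pandas; the helper `sample_rows` and its `random_state` parameter are hypothetical. It is not how SDV synthesizers control randomness, which this thread does not describe.

```python
import pandas as pd


def sample_rows(table, num_rows, random_state=None):
    """Sample rows from a table (hypothetical helper, not part of SDV)."""
    # random_state=None lets pandas draw a fresh sample on each call;
    # passing an int makes the draw repeatable.
    return table.sample(n=num_rows, random_state=random_state)


table = pd.DataFrame({'id': range(100)})

first = sample_rows(table, 10, random_state=0)
second = sample_rows(table, 10, random_state=0)  # same rows as `first`
fresh = sample_rows(table, 10)  # may differ from run to run
```

Leaving the seed unset by default means that re-running the function can return a different subsample, while callers who need reproducibility can still pin it.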

> 3 - Index: Should we reset the index of the tables at the end after all the dropping is done? This is also a question for drop_unknown_references.

I think we can leave the index as-is. We don't use the index, and users might want to be able to compare the subsampled tables back to their original data.
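A small pandas sketch of why the untouched index is useful for comparing against the original data. The table and column names are made up for illustration.

```python
import pandas as pd

original = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'value': ['a', 'b', 'c', 'd'],
})

# Dropping rows with a boolean mask keeps the original index labels.
subset = original[original['id'].isin([11, 13])]

# The preserved labels locate the same rows in the source table.
matching_rows = original.loc[subset.index]

# Resetting the index would discard that link back to the original rows.
renumbered = subset.reset_index(drop=True)
```

Here `subset` keeps the labels of the rows it was taken from, whereas `renumbered` starts again from zero and can no longer be aligned with `original` by index alone.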

> 4 - Verbosity: To be consistent with the other POC methods, I added a verbose parameter to simplify_schema()

Nice, I think this works well :)

amontanez24

Should we try to align this more with the strategy used in the database subsampling? They seem pretty different at the moment

amontanez24 · 2024-04-23T03:45:52Z

sdv/multi_table/utils.py

+    return ancestors
+
+
+def _get_disconnected_roots_from_table(relationship, table):


The parameter should be named `relationships`.

amontanez24 · 2024-04-23T05:00:00Z

sdv/multi_table/utils.py

+            Parent table to subsample.
+        parent_primary_key (str):
+            Name of the primary key of the parent table.
+        pk_referenced_before_parent (set):


Minor: could we move `parent` to the front, i.e. `parent_pks_referenced_before`?

amontanez24 · 2024-04-23T05:03:55Z

sdv/multi_table/utils.py

 from sdv.multi_table import HMASynthesizer
 from sdv.multi_table.hma import MAX_NUMBER_OF_COLUMNS

 MODELABLE_SDTYPE = ['categorical', 'numerical', 'datetime', 'boolean']
+RANDOM_STATE = 42


Not sure if we should control randomness for this

tests/unit/multi_table/test_utils.py

sdv/multi_table/utils.py

R-Palazzo requested review from amontanez24 and frances-h April 18, 2024 12:52

R-Palazzo requested a review from a team as a code owner April 18, 2024 12:52

R-Palazzo removed the request for review from a team April 18, 2024 12:53

gsheni reviewed Apr 18, 2024


gsheni reviewed Apr 22, 2024


frances-h reviewed Apr 22, 2024


R-Palazzo force-pushed the issue-1877-random-subset branch from 2a248f5 to 43d4441 April 22, 2024 15:42

R-Palazzo requested review from frances-h and gsheni April 22, 2024 15:44

amontanez24 reviewed Apr 23, 2024


R-Palazzo requested a review from amontanez24 April 23, 2024 13:23

R-Palazzo force-pushed the issue-1877-random-subset branch 2 times, most recently from a1e0bcb to b93e71d April 26, 2024 17:19

R-Palazzo added 12 commits April 29, 2024 11:55

drop_rows + utils methods (f96656d)
unit tests 1 (d851e3f)
definition (b613dee)
unit tests (5bf2ecb)
integration tests (cfb687a)
lint (734fcf7)
docstring (ec4f06e)
typo (9121784)
address comments (2210bf6)
remove random seed + address comments (0c25972)
add nan foreign key logic (bb53652)
add warning null foreign keys (f45a39a)

R-Palazzo force-pushed the issue-1877-random-subset branch from b93e71d to f45a39a April 29, 2024 11:01

clean pyproject (e9330ca)

amontanez24 reviewed Apr 29, 2024


address comments (687e543)

R-Palazzo requested a review from amontanez24 April 29, 2024 17:25

amontanez24 approved these changes Apr 29, 2024


gsheni approved these changes Apr 29, 2024


frances-h approved these changes Apr 29, 2024


R-Palazzo merged commit 060bae9 into main Apr 30, 2024
37 checks passed

R-Palazzo deleted the issue-1877-random-subset branch April 30, 2024 07:26
