Add argument to Profiler for samples ratio #1094

Open
carlsonp opened this issue Feb 12, 2024 · 1 comment
Assignees
Labels
New Feature A feature addition not currently in the library

Comments

@carlsonp
Contributor

Today, there seem to be two settings for adjusting the sample size: samples_per_update and min_true_samples. If I want to profile the whole file, I can load it with Pandas first and get the number of rows. For example:

import pandas as pd
from dataprofiler import Profiler
pandas_df = pd.read_parquet("myfile.parquet")
profile = Profiler(pandas_df, samples_per_update=pandas_df.shape[0])

I was just thinking it would be nice to add an additional flag like samples_ratio, a value between 0 and 1 denoting the fraction of the data that you want to sample. This would mean you wouldn't have to essentially load the data in twice; you could just say you want a given fraction loaded in as samples and it would go from there.
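To make the request concrete, here is a minimal sketch of the arithmetic such a flag would need to do internally. The helper name ratio_to_samples is hypothetical and is not part of DataProfiler; samples_ratio is the proposed argument from this issue, not an existing one. Rounding up is an assumption too; a real implementation might round differently.

```python
import math


def ratio_to_samples(n_rows: int, samples_ratio: float) -> int:
    """Convert a sampling ratio into an absolute row count.

    Hypothetical helper for illustration only; not a DataProfiler function.
    """
    if not 0 < samples_ratio <= 1:
        raise ValueError("samples_ratio must be in the interval (0, 1]")
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    # Round up so that a non-empty dataset always yields at least one sample,
    # and cap at n_rows so the result never exceeds the data size.
    return min(n_rows, math.ceil(n_rows * samples_ratio))
```

The returned count is the kind of value that is passed as samples_per_update today, which is why the row count currently has to be known before the profiler is built.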

@carlsonp carlsonp added the New Feature A feature addition not currently in the library label Feb 12, 2024
@taylorfturner
Contributor

Hey @carlsonp! Thanks for opening the issue and the idea presented.

This makes a ton of sense and I think fits perfectly as a feature into the DataReaders class (as documented here).

There are two features, although not percentages, that exist for CSV and Parquet:

I think something like percentage sampling would be a nice addition to the readers: read in the data sampled as desired, then pass the pre-sampled data to the profiler.
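As a rough illustration of sampling at read time, the sketch below keeps each CSV row with probability samples_ratio while streaming the file, so the full dataset is never held in memory and the row count does not need to be known up front. It uses only the Python standard library; read_csv_sampled and its parameters are hypothetical names, not the DataReaders API, and an actual reader feature could well be implemented differently.

```python
import csv
import io
import random


def read_csv_sampled(file_obj, samples_ratio, seed=None):
    """Stream a CSV and keep each data row with probability samples_ratio.

    Hypothetical helper for illustration only; not a DataProfiler function.
    Returns (header, rows), where header is None for an empty file.
    """
    if not 0 < samples_ratio <= 1:
        raise ValueError("samples_ratio must be in the interval (0, 1]")
    rng = random.Random(seed)  # seeded generator for reproducible samples
    reader = csv.reader(file_obj)
    header = next(reader, None)  # the header row is always kept
    # Independent per-row coin flip: the sample size is only approximately
    # samples_ratio * n_rows, but the file is read in a single pass.
    rows = [row for row in reader if rng.random() < samples_ratio]
    return header, rows


# Example with an in-memory file standing in for a real CSV on disk.
raw = io.StringIO("id,value\n1,a\n2,b\n3,c\n4,d\n")
header, rows = read_csv_sampled(raw, samples_ratio=0.5, seed=0)
```

The sampled rows could then be turned into a DataFrame and handed to the profiler. The trade-off of per-row sampling is that the resulting sample size varies from run to run unless a seed is fixed.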


5 participants