Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuse the functionality used in both _merge_histogram and the newly created _assimilate_histogram #838

Open
ksneab7 opened this issue May 23, 2023 · 2 comments
Assignees
Labels
contribution_day Medium Priority Significant improvement or bug / feature reducing overall performance Refactor Code that is being modified to improve the library

Comments

@ksneab7
Copy link
Contributor

ksneab7 commented May 23, 2023

Is your feature request related to a problem? Please describe.
In an effort to adhere to the goal of achieving a clear paradigm of one, easy to understand, path for each of the following tasks for profiling:
Updating, Getting, and Merging
This issue focuses on clearing up the path to defining how to merge a profile (or parts of a profile) with a singular function path to achieving this goal.

The problem this issue addresses is the use of both _merge_histogram and the newly created _assimilate_histogram as well as other merging processes within the dataprofiler that repeat functionality/have overlapping goals for input and output.

An example of a fix for achieving this paradigm is as follows:
We have implemented a much better way to put information from two histograms together with the creation of _assimilate_histogram and we should be able to use that function throughout the code while also achieving the previously desired functionality of _merge_histograms. We can see the old way of doing this in numerical_column_stats.py on line 1286. This recreates the histogram data which is more memory intensive than doing it the way we do in _assimilate_histogram.

Describe the outcome you'd like:
I would like a singular path to merging profiles and their information that achieves the success of all currently existing functions usage.

Additional context:
For detail behind _assimilate_histogram the PR:
#815
Implements the more memory optimized solution

@ksneab7 ksneab7 added the New Feature A feature addition not currently in the library label May 23, 2023
@JGSweets JGSweets added Refactor Code that is being modified to improve the library and removed New Feature A feature addition not currently in the library labels May 23, 2023
@JGSweets
Copy link
Collaborator

Summary of the new paradigm for histograms.

merge: (Built histogram + built histogram)
update: (new data -> get hist on new data-> built histogram) + (existing built histogram)

@JGSweets
Copy link
Collaborator

All calculations should have a get, update, and merge.

Where get -> calcs from raw data.
merge -> takes two existing calcs and merges them
update -> takes in new data to add; get + merge

@JGSweets JGSweets added the Medium Priority Significant improvement or bug / feature reducing overall performance label May 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contribution_day Medium Priority Significant improvement or bug / feature reducing overall performance Refactor Code that is being modified to improve the library
Projects
None yet
Development

No branches or pull requests

4 participants