Implement Column Statistics / Data Profiling for Numeric Columns #44

Open
phpisciuneri opened this issue May 20, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@phpisciuneri
Contributor

As discussed in our original Spark Summit presentation (see the 22-minute mark).

Listening to myself is awful btw.

The idea is inspired by the nice visualization provided by Facets Overview, while leveraging Spark to handle large distributed data sets.

@phpisciuneri phpisciuneri added the enhancement New feature or request label May 20, 2020
@phpisciuneri
Contributor Author

Initial considerations involve calculating the statistics as efficiently as possible. Some different approaches off the top of my head include:

  • calculating statistics using UDAFs. For exact calculation of the histogram and standard deviation, this would appear to require at least two passes over the data.
  • leveraging existing Hive/SQL functions
  • exploring/using approximate methods for histograms, std dev, etc. on large data
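
To make the pass-count question concrete, here is a rough sketch in plain Python (not Spark, and not this project's code) of what a UDAF-style aggregator could look like. `NumericStats` and `equal_width_histogram` are hypothetical names invented for illustration. The sketch assumes Welford's running update plus the pairwise merge formula from Chan et al., which would let count, mean, standard deviation, min, and max come out of a single pass over partitioned data. An exact equal-width histogram still seems to need a second pass, because the bin edges depend on the min and max found in the first.

```python
import math
from dataclasses import dataclass


@dataclass
class NumericStats:
    """Hypothetical mergeable summary of one numeric column."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> "NumericStats":
        """Fold in one value (Welford's update); analogous to a UDAF update step."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        return self

    def merge(self, other: "NumericStats") -> "NumericStats":
        """Combine two partial summaries; analogous to a UDAF merge step."""
        n = self.count + other.count
        if n == 0:
            return self
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / n
        self.mean += delta * other.count / n
        self.count = n
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)
        return self

    def stddev(self) -> float:
        """Sample standard deviation; NaN when fewer than two values were seen."""
        if self.count < 2:
            return math.nan
        return math.sqrt(self.m2 / (self.count - 1))


def equal_width_histogram(values, lo, hi, bins):
    """Second pass: count values into `bins` equal-width buckets over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in values:
        i = int((x - lo) / width) if width > 0 else 0
        counts[min(i, bins - 1)] += 1  # the maximum lands in the last bucket
    return counts


# Pass 1: summarize each partition independently, then merge the partials.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], []]
total = NumericStats()
for part in partitions:
    partial = NumericStats()
    for value in part:
        partial.update(value)
    total.merge(partial)

# Pass 2: the bin edges come from the min/max found in pass 1.
all_values = [v for part in partitions for v in part]
hist = equal_width_histogram(all_values, total.minimum, total.maximum, bins=4)
```

If the bin edges were fixed up front, or an approximate histogram were acceptable, the second pass could presumably be folded into the first.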


2 participants