Performance impact when trying to generate profiling report for more than 200 columns #534

Open
eapframework opened this issue Feb 16, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@eapframework

eapframework commented Feb 16, 2024

I am seeing performance problems when generating a profiling report for a dataset with more than 200 columns and 5 million records. I apply almost all of the available metrics: datatype, entropy, minimum, maximum, sum, standard deviation, mean, maxlength, minlength, histogram, completeness, distinctness, uniquevalueratio, uniqueness, countdistinct, and correlation. The goal is a report similar to ydata-profiling (https://github.com/ydataai/ydata-profiling).

The job has been running for over 3 hours despite attempts to optimize the Spark configuration. The logs show that each metric is calculated sequentially, and this sequential computation appears to be what causes the prolonged runtime. Is it possible to parallelize this operation to improve efficiency?

@eapframework eapframework added the enhancement New feature or request label Feb 16, 2024
@rdsharma26
Contributor

Thanks for the feedback, @eapframework.
We will investigate this issue and get back to you with an update.

@eapframework
Author

Hi @rdsharma26, I did some more testing. From analyzing the Spark execution tasks, I believe the performance issue arises because metrics such as CountDistinct and Histogram are calculated one column at a time, in sequence. The more columns the DataFrame has, the longer the job runs. Parallelizing these calculations would improve efficiency.
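
As a rough illustration of the idea (not the library's implementation, and not Spark code): the per-column computations are independent of each other, so they could in principle be submitted concurrently rather than one after another. The sketch below uses plain Python with a thread pool and in-memory lists as a stand-in for a DataFrame. The names `count_distinct`, `histogram`, `profile_column`, and `profile_columns` are hypothetical and chosen only for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_distinct(values):
    """Number of distinct non-null values in one column."""
    return len({v for v in values if v is not None})


def histogram(values):
    """Frequency of each value in one column (nulls included)."""
    return dict(Counter(values))


def profile_column(name, values):
    # Each column is profiled independently of every other column,
    # which is what makes it safe to run these calls concurrently.
    return name, {
        "countdistinct": count_distinct(values),
        "histogram": histogram(values),
    }


def profile_columns(columns, max_workers=4):
    """Profile all columns concurrently instead of one after another.

    `columns` maps column name -> list of values (a stand-in for a DataFrame).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(profile_column, n, v) for n, v in columns.items()]
        # Collect results in submission order so the report keeps column order.
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    data = {
        "country": ["US", "DE", "US", None],
        "age": [31, 45, 31, 27],
    }
    report = profile_columns(data)
    print(report["country"]["countdistinct"])
    print(report["age"]["histogram"])
```

In pure Python the thread pool only shows the structure, since the interpreter lock limits CPU parallelism. In Spark, jobs submitted from separate driver threads can run simultaneously within one application, so a similar pattern might let the cluster work on several columns at once. Whether that actually shortens this particular job is an assumption that would need to be measured.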
