Slow test execution and metric calculation #1034

c0t0ber · 2024-03-19T14:11:19Z

On real data with 100k rows and a large number of columns, metrics, and tests (around 1000), the calculation can take up to 20 minutes. The machine's resources are also underutilized: even on a powerful processor, usage does not exceed 20% of a single core. As a result, with many tests and metrics and a high RPS of new data, Evidently may not be able to process them in time.

nick-konovalchuk · 2024-03-22T09:06:27Z

I looked at the internals of ClassificationPreset and DataDriftPreset. There is a lot of data copying, which is far from ideal. pandas is also used here, while in some cases faster alternatives could be used.
But imo the most inefficient part is embedding the actual data into the HTML reports.

nick-konovalchuk · 2024-03-22T09:59:04Z

I wonder whether metric calculation could at least be split across several processes.
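To make the idea concrete, here is a minimal sketch of per-column parallelism using only the standard library. It is not Evidently's API: `column_summary` and `compute_all` are hypothetical names, and the "metric" is just a mean and standard deviation standing in for real metrics. It assumes each metric depends on one column only, so the work can be split without shared state.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from statistics import mean, pstdev


def column_summary(item):
    """Hypothetical stand-in for one metric: summarize a single column."""
    name, values = item
    return name, {"mean": mean(values), "std": pstdev(values)}


def compute_all(columns, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run column_summary for every column in a pool of workers.

    columns maps a column name to a list of numbers. The executor class is a
    parameter so the same code can be tried with threads or processes.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # chunksize batches the tasks sent to each worker process;
        # thread pools accept the argument but ignore it.
        return dict(pool.map(column_summary, columns.items(), chunksize=8))
```

With the default process pool, `compute_all(data)` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Whether this helps in practice depends on how much time is spent pickling columns versus computing on them, which would need to be measured.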

c0t0ber · 2024-03-22T10:12:49Z

@nick-konovalchuk

The best solution, I believe, would be to use parallel execution, but we need to explore the feasibility of its application. Trying to optimize individual sections is of little use because in my case, we are calculating ~2000 different tests and metrics.

I don't see any problems with generating HTML since you're only using HTML when necessary.

c0t0ber · 2024-03-22T10:17:13Z

Also, using polars with lazy evaluation instead of pandas could be a good solution if we are talking about optimizing the calculations.

nick-konovalchuk · 2024-03-23T18:49:24Z

@c0t0ber
Personally I've never used polars, but I think I remember it using all the cores of a CPU. In such a setting multiprocessing would be harmful, since the processes would compete for the same cores.

nick-konovalchuk · 2024-03-23T18:56:32Z

@c0t0ber
I don't see a problem with generating HTML. I wish they also had an option to generate actual plotly objects, which I could display using streamlit, for instance. I can still display HTML in streamlit.
The problem is embedding redundant data in the HTML. Do you really need all the data points to draw a histogram, given that you can't change the bin size after the report is generated? Because as far as I understand, they embed ALL data points for ClassificationProbDistribution and charts.
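As a rough illustration of the size difference, the sketch below bins synthetic probabilities up front and compares the JSON payload of the raw points with that of the bin edges and counts. This is not how Evidently builds its reports; `histogram` is a hypothetical helper and the data is randomly generated.

```python
import json
import random


def histogram(values, bins=10):
    """Return (edges, counts) for equal-width bins over values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp the index so that v == hi lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


random.seed(0)
probs = [random.random() for _ in range(100_000)]  # synthetic predicted probabilities
edges, counts = histogram(probs, bins=20)

# What would be embedded in the report in each case.
raw_payload = json.dumps({"x": probs})
binned_payload = json.dumps({"edges": edges, "counts": counts})
print(len(raw_payload), len(binned_payload))
```

The binned payload grows with the number of bins rather than the number of rows, at the cost of fixing the bin size when the report is generated.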

nick-konovalchuk · 2024-03-23T18:57:55Z

Also, I'm less sure about this, but the data points in the HTML may be duplicated across several metrics/tests.

nick-konovalchuk · 2024-03-23T19:02:23Z

Idk if this is correct and/or possible, but the following would be cool:

1. Generate actual Plotly objects.
2. Extract HTML from them when the report is rendered. I think this HTML won't have redundant data embedded.
