We should build a polars instrumentation #71

samuelcolvin · 2024-05-01T16:38:41Z

it would give you query/operation times, and the query plan, see this tweet 🐦 .

Richie says:

I don't have the full context, but the query plan can be serialized to json or visualized via .explain.

The text was updated successfully, but these errors were encountered:

adriangb · 2024-05-01T17:34:34Z

So we probably hook into .collect() for LazyFrame? Really it would be nice for polars to provide a global register of hooks to call when it does work.

samuelcolvin · 2024-05-01T17:48:40Z

cc @ritchie46 in case you have any thoughts?

adriangb · 2024-05-01T17:50:14Z

To be clear that integration doesn't have to be logfire specific in any way. It could just be something like:

class PolarsTrace:
    def on_start(self, **payload) -> None: ...
    def on_end(self, **payload) -> None: ...

class PolarsTracer:
    def start_span(self) -> PolarsTrace: ...

I guess the question is what goes in payload, I'd say anything thats "free" to get (presumably a plan, execution time, etc.).

ritchie46 · 2024-05-02T14:02:18Z

I am not entirely sure I understand what a trace is exactly yet.

Things that might be interesting. .profile gives you the result and a DataFrame containing the timings per operations (optimizer, join, grouping etc).

pl.LazyFrame({"foo": [1, 2, 3]}).serialize() gives you JSON repr of the logical plan, which can be reconstructed later (this is cheap as longs as there are no real in-memory tables in the plan).

Other than that there is a hook where you can get a hold of the IR post optimization (pola-rs/polars#15972). Though this is very much leaking internals. We add it so that we can hook cudf in, but I am not sure it would be wise to build upon this.

adriangb · 2024-05-02T14:23:18Z

I am not entirely sure I understand what a trace is exactly yet.

A trace is really similar to a log statement. The only differences is that it has a start and end (and hence a duration) and a context (where in the execution of the program it started and where it ended).

The LazyFrame.profile() information seems like the kind of thing we'd want. Here's a hacky version:

from typing import Any
import polars as pl

class Tracer:
    def on_collect(self, input: pl.LazyFrame, output: pl.DataFrame, profile: pl.DataFrame, plan: dict[str, Any]) -> None:
        # this would be sent to a remote server not just printed
        print(profile)

def _collect_patch(df: pl.LazyFrame, *args: Any, **kwargs: Any) -> pl.DataFrame:
    plan = df.serialize()
    res, stats = df.profile(*args, **kwargs)
    for tracer in pl.tracers:
        tracer.on_collect(df, res, stats, plan)
    return res

pl.tracers = [Tracer()]
pl.LazyFrame.collect = _collect_patch


df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).lazy()

print(df.sort('a').collect())

The idea would be for polars to provide a hook so that we don't need to monkey patch like this.

I assume a LazyFrame computes a plan on each call chain (e.g. .sort() creates a new LazyFrame with a new plan) so a hook for that that only calls with the .serialize() result would make sense as well.

Is it possible to get similar information for each step of execution to a DataFrame?

samuelcolvin added the Feature Request label May 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

We should build a polars instrumentation #71

We should build a polars instrumentation #71

samuelcolvin commented May 1, 2024

adriangb commented May 1, 2024

samuelcolvin commented May 1, 2024

adriangb commented May 1, 2024

ritchie46 commented May 2, 2024

adriangb commented May 2, 2024

We should build a polars instrumentation #71

We should build a polars instrumentation #71

Comments

samuelcolvin commented May 1, 2024

adriangb commented May 1, 2024

samuelcolvin commented May 1, 2024

adriangb commented May 1, 2024

ritchie46 commented May 2, 2024

adriangb commented May 2, 2024