Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

We should build a polars instrumentation #71

Open
samuelcolvin opened this issue May 1, 2024 · 5 comments
Open

We should build a polars instrumentation #71

samuelcolvin opened this issue May 1, 2024 · 5 comments

Comments

@samuelcolvin
Copy link
Member

it would give you query/operation times, and the query plan, see this tweet 馃惁 .

Richie says:

I don't have the full context, but the query plan can be serialized to json or visualized via .explain.

@adriangb
Copy link
Member

adriangb commented May 1, 2024

So we probably hook into .collect() for LazyFrame? Really it would be nice for polars to provide a global register of hooks to call when it does work.

@samuelcolvin
Copy link
Member Author

cc @ritchie46 in case you have any thoughts?

@adriangb
Copy link
Member

adriangb commented May 1, 2024

To be clear that integration doesn't have to be logfire specific in any way. It could just be something like:

class PolarsTrace:
    def on_start(self, **payload) -> None: ...
    def on_end(self, **payload) -> None: ...

class PolarsTracer:
    def start_span(self) -> PolarsTrace: ...

I guess the question is what goes in payload, I'd say anything thats "free" to get (presumably a plan, execution time, etc.).

@ritchie46
Copy link

I am not entirely sure I understand what a trace is exactly yet.

Things that might be interesting. .profile gives you the result and a DataFrame containing the timings per operations (optimizer, join, grouping etc).

pl.LazyFrame({"foo": [1, 2, 3]}).serialize() gives you JSON repr of the logical plan, which can be reconstructed later (this is cheap as longs as there are no real in-memory tables in the plan).

Other than that there is a hook where you can get a hold of the IR post optimization (pola-rs/polars#15972). Though this is very much leaking internals. We add it so that we can hook cudf in, but I am not sure it would be wise to build upon this.

@adriangb
Copy link
Member

adriangb commented May 2, 2024

I am not entirely sure I understand what a trace is exactly yet.

A trace is really similar to a log statement. The only differences is that it has a start and end (and hence a duration) and a context (where in the execution of the program it started and where it ended).

The LazyFrame.profile() information seems like the kind of thing we'd want. Here's a hacky version:

from typing import Any
import polars as pl

class Tracer:
    def on_collect(self, input: pl.LazyFrame, output: pl.DataFrame, profile: pl.DataFrame, plan: dict[str, Any]) -> None:
        # this would be sent to a remote server not just printed
        print(profile)

def _collect_patch(df: pl.LazyFrame, *args: Any, **kwargs: Any) -> pl.DataFrame:
    plan = df.serialize()
    res, stats = df.profile(*args, **kwargs)
    for tracer in pl.tracers:
        tracer.on_collect(df, res, stats, plan)
    return res

pl.tracers = [Tracer()]
pl.LazyFrame.collect = _collect_patch


df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).lazy()

print(df.sort('a').collect())

The idea would be for polars to provide a hook so that we don't need to monkey patch like this.

I assume a LazyFrame computes a plan on each call chain (e.g. .sort() creates a new LazyFrame with a new plan) so a hook for that that only calls with the .serialize() result would make sense as well.

Is it possible to get similar information for each step of execution to a DataFrame?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants