[WIP][CHIP] Post Process Script for logs in run.py or as an argument #699

Open
3 tasks
cristianfr opened this issue Feb 28, 2024 · 0 comments


Problem Statement

Chronon jobs can run for a long time, process multiple partitions, scan large amounts of data defined in the configuration, and sometimes run into skew where a single task takes too long.
Much of this information could be parsed from the log, and some of it could be extracted from the application metrics. It could even be passed to a GPT-like assistant that explains how much data the job processed and, on failure, where the bottleneck was. On success, it could be useful to provide a summary of the tables backfilled, the number of partitions, how many of them were filled via bootstrap, and the time spent per joinPart.

The logs are too long for a person to reasonably parse. However, a post-process script can create a feedback system with:

  • Visibility: cluster state (for example, how much time was spent in ACCEPTED), total time processing data, cost, exceptions encountered, total partitions filled, checks for skewed keys, etc.
  • PostExecute: calls to governance APIs that may assign ownership, suggestions for changes to the memory settings, Slack alerts that the job finished, SLA reporting, etc.
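
The Visibility items above could be derived by a small log parser. The sketch below is only an illustration: the three log line formats (an epoch-seconds timestamp followed by `state=...`, `filled partition ...`, or `Exception: ...`) and the names `JobSummary` and `summarize` are assumptions made up for this example, not the actual Chronon, Spark, or YARN log format.

```python
import re
from dataclasses import dataclass, field

# Hypothetical line formats; real Chronon/Spark/YARN log lines will differ.
STATE_RE = re.compile(r"^(?P<ts>\d+) state=(?P<state>\w+)$")
PARTITION_RE = re.compile(r"^(?P<ts>\d+) filled partition (?P<ds>\d{4}-\d{2}-\d{2})$")
EXCEPTION_RE = re.compile(r"^(?P<ts>\d+) Exception: (?P<msg>.+)$")


@dataclass
class JobSummary:
    seconds_in_state: dict = field(default_factory=dict)
    partitions_filled: list = field(default_factory=list)
    exceptions: list = field(default_factory=list)


def summarize(lines):
    """Collect time per cluster state, filled partitions, and exceptions."""
    summary = JobSummary()
    prev_ts, prev_state = None, None
    for line in lines:
        line = line.strip()
        m = STATE_RE.match(line)
        if m:
            ts = int(m["ts"])
            if prev_state is not None:
                # Charge the elapsed time to the state we are leaving.
                elapsed = ts - prev_ts
                summary.seconds_in_state[prev_state] = (
                    summary.seconds_in_state.get(prev_state, 0) + elapsed
                )
            prev_ts, prev_state = ts, m["state"]
            continue
        m = PARTITION_RE.match(line)
        if m:
            summary.partitions_filled.append(m["ds"])
            continue
        m = EXCEPTION_RE.match(line)
        if m:
            summary.exceptions.append(m["msg"])
    return summary
```

A summary like this could then be printed, posted to Slack, or handed to an assistant; time spent in the final state is not counted because no later transition closes it.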

Requirements

  • The ability to run ad hoc APIs based on the job execution, on either success or failure.
  • No API changes.

Verification

The qualitative bar to pass here is providing a path for users to learn more about what their jobs are doing and how to self-serve, reducing the expertise required to tune a Chronon job.

Approach

  • A post-execute.sh script, with run.py modified to call it after the job finishes.
  • Alternatively, we could do this in the Driver; however, a Driver (or other) OOM tends not to fail gracefully.
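
A minimal sketch of the first option, assuming run.py launches the job as a subprocess. The function name `run_with_post_execute` and the `JOB_EXIT_CODE` and `JOB_LOG_PATH` environment variables are hypothetical, chosen only for illustration; they are not part of the existing run.py. Because the hook runs in the launcher process rather than in the Driver, it still runs when the Driver dies from an OOM.

```python
import os
import subprocess


def run_with_post_execute(job_cmd, post_cmd, log_path):
    """Run job_cmd, capture its output in log_path, then run post_cmd.

    post_cmd runs whether the job succeeded or failed. It receives the
    outcome through the hypothetical JOB_EXIT_CODE and JOB_LOG_PATH
    environment variables. The job's exit code is returned unchanged.
    """
    with open(log_path, "w") as log:
        job = subprocess.run(job_cmd, stdout=log, stderr=subprocess.STDOUT)
    env = dict(
        os.environ,
        JOB_EXIT_CODE=str(job.returncode),
        JOB_LOG_PATH=log_path,
    )
    try:
        # check=False: a failing hook must not change the job's outcome.
        subprocess.run(post_cmd, env=env, check=False)
    except OSError as exc:
        # A missing or non-executable hook should not mask the job result.
        print(f"post-execute hook could not be started: {exc}")
    return job.returncode
```

With this shape, something like `run_with_post_execute(spark_cmd, ["bash", "post-execute.sh"], "job.log")` would cover both the success and failure paths without touching any user-facing API; passing the hook path as a run.py argument would be an equally small change.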

Planning

  • Discussion on approach
  • Implementation of entrypoint
  • Base script covering the task summary and possible failure.