catalyst-cooperative/pudl-output-differ

PUDL Output Differ

This standalone tool compares the contents of two directories. The goal is to give a clear and concise report of the differences between them.

File-type-specific evaluations, primarily designed for databases, are executed as well.

Installation

This program uses poetry to manage its dependencies, so you should install that first.

Once poetry is installed, you can set up the environment and all dependencies with:

poetry install

It seems that the current set of dependencies requires Python 3.10, both for installing poetry and for installing the environment.

pyenv is a good tool for managing multiple Python installations on a single system, so it may help here.

Usage

The diff tool has many options, but the standard operation is to provide two directories (which may be remote) that are assumed to contain the outputs of PUDL ETL pipelines. The tool scans the files, SQLite databases, and tables, and generates a Markdown/HTML report of the differences it finds.

For example, assume that /home/bob/pudl-data/output-dev and /home/bob/pudl-data/output-feature-xyz contain outputs generated by the dev branch and by the feature-xyz branch we're working on. We can then run the analysis by navigating into this project's git repository and running:
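To make the idea of scanning files and SQLite tables concrete, here is a minimal sketch using only the Python standard library. It is not the differ's actual implementation: the function names (list_files, list_tables, compare_dirs) and the report layout are hypothetical, and treating files ending in .sqlite as databases is an assumption made for illustration. It also handles local directories only.

```python
import sqlite3
from pathlib import Path


def list_files(root: Path) -> set[str]:
    """Return the relative paths of all regular files under root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def list_tables(db_path: Path) -> set[str]:
    """Return the names of the tables in a SQLite database."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        return {name for (name,) in rows}
    finally:
        con.close()


def compare_dirs(left: Path, right: Path) -> dict[str, list[str]]:
    """Hypothetical helper: report files and tables present on one side only."""
    left_files, right_files = list_files(left), list_files(right)
    report: dict[str, list[str]] = {
        "only_in_left": sorted(left_files - right_files),
        "only_in_right": sorted(right_files - left_files),
        "table_changes": [],
    }
    # Assumption: databases are recognized by the .sqlite suffix.
    for rel in sorted(left_files & right_files):
        if rel.endswith(".sqlite"):
            left_tables = list_tables(left / rel)
            right_tables = list_tables(right / rel)
            for table in sorted(left_tables - right_tables):
                report["table_changes"].append(f"{rel}: table {table} removed")
            for table in sorted(right_tables - left_tables):
                report["table_changes"].append(f"{rel}: table {table} added")
    return report
```

A real comparison would also look at schemas and row contents; this sketch stops at names to stay short.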

poetry run diff --html-report feature-xyz-report.html \
  /home/bob/pudl-data/output-dev \
  /home/bob/pudl-data/output-feature-xyz

The above runs the comparison on the files and writes an HTML rendering of the comparison to feature-xyz-report.html. It also writes the raw Markdown report to feature-xyz-report.markdown.

The generated HTML report relies on github-markdown-light.css, which is part of this repository. If you generate reports into your git checkout directory and open them in a browser, they should render properly.

A few notable parameters:

  • --max-workers controls how many concurrent threads are used for the comparison. More threads lead to faster completion, but they increase memory pressure and might cause SQLite concurrency/locking issues.
  • --otel-trace-backend http://localhost:4317 sends traces from the execution to this backend for later analysis. This is useful if you're running a tracing service such as jaeger-all-in-one.

If you run a local Prometheus instance, you can monitor CPU, memory usage, and other runtime metrics by invoking the differ with --prometheus-port 9101. By default, metrics are published on port 9101.
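The trade-off behind a worker limit can be illustrated with a bounded thread pool. This is a generic sketch, not the differ's own code: compare_item and run_comparisons are hypothetical names, and whether the tool uses concurrent.futures internally is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def compare_item(name: str) -> str:
    """Placeholder for a real per-file or per-table comparison."""
    return f"{name}: compared"


def run_comparisons(names: list[str], max_workers: int) -> list[str]:
    """Run comparisons on at most max_workers threads at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, whichever task finishes first.
        return list(pool.map(compare_item, names))
```

With a small max_workers, fewer comparisons hold data in memory at once; a larger value finishes sooner but keeps more work in flight simultaneously.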
