
Parquet output #1229

Open
the80srobot opened this issue Nov 8, 2023 · 9 comments

@the80srobot
Contributor

the80srobot commented Nov 8, 2023

I would like to add parquet output support to Santa, however there are some trade-offs that might not be acceptable to you. I'd like to have the discussion and the pros/cons in one place (this issue).

Goals

  • Decide if Santa would like a patch series adding parquet support
  • If yes, decide on the approach

Why parquet?

Parquet is the de facto standard columnar interchange format: most data platforms can ingest it natively, it supports fairly rich schemas, and it has decent performance.

Most Santa deployments are probably converting logs to a columnar format on the backend to get more efficient storage and compute. Those same advantages already apply on the host itself: a parquet file is going to be smaller than the equivalent fsspool folder, and will therefore spend less compute and bandwidth on network IO. (This trade-off was already known to the designers of protocol buffers, hence variadic integer encoding.)

In summary: parquet support makes it easier to adopt Santa for people who already use the format on their backend, which is a good reason on its own. Additionally, it may turn out to save CPU time and bandwidth for existing users of protobuf + fsspool.

Why not parquet?

Briefly, code quality and dependency size. The main implementation of parquet is in the Apache Arrow library, which is complex and has a lot of dependencies, including thrift and boost. The codebase itself is large with no obvious modularization or layering - even though different parts have different coding styles and build options (e.g. exceptions vs no exceptions), they are heavily interdependent in all directions. The external dependencies are mostly transparently fetched from upstreams, or expected to be installed on the system, which largely breaks reproducible builds and adds a supply chain problem. All of this makes it difficult to add Arrow as a dependency.

Building Santa with parquet support would likely add ~20 MiB to binary size and require at least the following extra dependencies:

  • Apache Arrow
  • Thrift
  • Boost
  • Snappy
  • Zlib
  • Xz
  • Zstd

Implementation sketch

We can add the dependencies to WORKSPACE as http_archive, and check in a BUILD file into external_patches. Here's what that looks like for thrift:

http_archive(
    name = "thrift",
    build_file = "//external_patches/thrift:BUILD",
    sha256 = "5da60088e60984f4f0801deeea628d193c33cec621e78c8a43a5d8c4055f7ad9",
    strip_prefix = "thrift-0.13.0",
    urls = [
        "https://github.com/apache/thrift/archive/v0.13.0.tar.gz",
    ],
)

The BUILD file can shell out to make or cmake, or just use cc_library. It looks like the latter just works for most libraries.
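As a rough illustration of the cc_library approach, here is a minimal sketch of what an external_patches/thrift/BUILD file might look like. The glob patterns, excludes, and include paths are assumptions for illustration, not the actual file layout of the thrift source tree:

```python
# Hypothetical external_patches/thrift/BUILD, referenced by the
# http_archive above via build_file. Paths are illustrative only.
cc_library(
    name = "thrift",
    srcs = glob(["lib/cpp/src/thrift/**/*.cpp"]),
    hdrs = glob(["lib/cpp/src/thrift/**/*.h"]),
    includes = ["lib/cpp/src"],
    visibility = ["//visibility:public"],
)
```

Libraries that insist on a configure step would instead need the BUILD file to invoke a genrule or shell out to make/cmake.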

A serializer would need to keep a reasonable number of recent messages in memory, already converted to column chunks. A single parquet file can contain one or more row groups; to begin with, we could tune this to target 1-10 row groups per file, depending on how busy the machine is.

File output can use the existing fsspool implementation, just swapping protobuf files for parquet files.

We can avoid building all of the dependencies by bundling them under an @org_apache_arrow target and only depending on that from the new serializer.

Alternatives

Parquet is a standardized, stable format, and implementations other than the official one exist.

It may be worthwhile to implement a minimal version of parquet without all the dependencies. Such a project exists for Go, and it could serve as a blueprint. (In fact, we'd only need writer support and don't need the higher-level schema support, so we could end up with an even smaller codebase.)

I'm not sure who has time to do this, though.

@russellhancox
Collaborator

I have no objections to adding Parquet support, it looks useful and interesting. I do object to adding several large dependencies and dramatically increasing binary size to add it. If a smaller implementation (in C/C++) can be found (or developed), I don't see any other roadblocks to adding support.

@the80srobot
Contributor Author

the80srobot commented Nov 11, 2023

Yeah, I'd forgotten that Santa needs to sign all shipped binaries, so a compile-time switch is not a good option.

I'm still trying to shrink the build, and I filed an issue with Arrow asking for help. I'll call this option (1).

Option (2) is to build libparquet as a dylib, bundle it, and then load it on demand. This way, santad doesn't get any bigger unless you enable parquet support at runtime.

Option (3) is to use one of the Rust crates (parquet or parquet2) instead of the C++ implementation. At a cursory glance, those look smaller and more reasonable.

Options 2 & 3 aren't mutually exclusive - I actually quite like the idea of shipping as a dylib, because it leaves things as they are for people who don't care about parquet.

Option (4) is building a new miniparquet writer-only library.

@the80srobot
Contributor Author

the80srobot commented Nov 11, 2023

Actually, @russellhancox how would you feel about having a dependency on a Rust crate? parquet2 compiles to a 2-MB dylib in all of 10 seconds, which is a 10x improvement over Arrow in both build time and artefact size. Before I go and figure out how to use the cxx bridge in this direction, I want to make sure there's no reason you know of to avoid Rust.

@russellhancox
Collaborator

As long as it builds with bazel and bridges to C++ without performance issues, I see no problem.

@pmarkowsky
Contributor

pmarkowsky commented Nov 14, 2023 via email

@pmarkowsky
Contributor

Related PoC for building rust with bazel and bridging to C/C++ WIP #1240

@the80srobot
Contributor Author

The draft PR #1240 now has some bona fide Rust code to write a couple of columns into a parquet file.

Still TODO:

  1. String support
  2. Some kind of nice wrapper to translate from row types to table types
  3. Wrap (2) in a C++ API
  4. An end-to-end test validating that the parquet output can be read by something like Pandas

Additionally, I'm not super happy with the fact that each row group flush requires allocating memory, but the crate's design prioritizes async safety over allocation efficiency, and it's kind of hard to work around that. I don't think the outcome is terrible - most likely, it's about as allocation-efficient as proto2 with arenas, but it should be profiled.

@the80srobot
Contributor Author

OK, I've done all of the above.

  1. Strings are supported
  2. The Table type provides a handy API to write cell by cell
  3. The API works from C++
  4. There is a C++ binary that produces a valid parquet file. (Though checking its validity requires something like Pandas, and is left as an exercise for the reader for the moment.)

@the80srobot
Contributor Author

the80srobot commented Dec 20, 2023

Thanks for the chat on Monday. To summarize:

  • Pete and Tom need to take a look at the Rust dependencies
  • The mid-term goal is to split the parquet writer logic (most of the Rust code) into a separate library for Pedro to also depend on
  • The place to integrate this to Santa is likely at the Logger level, not Writer/Serializer. To that end, I'll next split the Logger class to create a pure abstract base class. (Send the sketch to Matt.)
