
Parquet output #1229

Open
the80srobot opened this issue Nov 8, 2023 · 9 comments

@the80srobot
Contributor

the80srobot commented Nov 8, 2023

I would like to add parquet output support to Santa, however there are some trade-offs that might not be acceptable to you. I'd like to have the discussion and the pros/cons in one place (this issue).

Goals

  • Decide if Santa would like a patch series adding parquet support
  • If yes, decide on the approach

Why parquet?

Parquet is the de facto standard columnar interchange format: most data platforms can ingest it natively, it supports fairly rich schemas, and it has decent performance.

Most Santa deployments are probably converting logs to a columnar format on the backend to get more efficient storage and compute. Those same advantages already apply on the host itself: a parquet file is going to be smaller than the equivalent fsspool folder, and will therefore spend less compute and bandwidth on network IO. (This trade-off was already known to the designers of protocol buffers, hence variadic integer encoding.)

In summary: parquet support makes it easier to adopt Santa for people who already use the format on their backend, which is a good reason on its own. Additionally, it may turn out to save CPU time and bandwidth for existing users of protobuf + fsspool.

Why not parquet?

Briefly, code quality and dependency size. The main implementation of parquet is in the Apache Arrow library, which is complex and has a lot of dependencies, including thrift and boost. The codebase itself is large with no obvious modularization or layering - even though different parts have different coding styles and build options (e.g. exceptions vs no exceptions), they are heavily interdependent in all directions. The external dependencies are mostly transparently fetched from upstreams, or expected to be installed on the system, which largely breaks reproducible builds and adds a supply chain problem. All of this makes it difficult to add Arrow as a dependency.

Building Santa with parquet support would likely add ~20 MiB to binary size and require at least the following extra dependencies:

  • Apache Arrow
  • Thrift
  • Boost
  • Snappy
  • Zlib
  • Xz
  • Zstd

Implementation sketch

We can add the dependencies to WORKSPACE as http_archive, and check in a BUILD file into external_patches. Here's what that looks like for thrift:

http_archive(
    name = "thrift",
    build_file = "//external_patches/thrift:BUILD",
    sha256 = "5da60088e60984f4f0801deeea628d193c33cec621e78c8a43a5d8c4055f7ad9",
    strip_prefix = "thrift-0.13.0",
    urls = [
        "https://github.com/apache/thrift/archive/v0.13.0.tar.gz",
    ],
)

The BUILD file can shell out to make or cmake, or just use cc_library. It looks like the latter just works for most libraries.
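As a rough illustration of the cc_library approach, here is a minimal sketch of what an external_patches/thrift/BUILD file might look like. The glob patterns, excludes, and include paths are assumptions for illustration, not the actual file layout of the thrift source tree:

```python
# Hypothetical external_patches/thrift/BUILD, referenced by the
# http_archive above via build_file. Paths are illustrative only.
cc_library(
    name = "thrift",
    srcs = glob(["lib/cpp/src/thrift/**/*.cpp"]),
    hdrs = glob(["lib/cpp/src/thrift/**/*.h"]),
    includes = ["lib/cpp/src"],
    visibility = ["//visibility:public"],
)
```

Libraries that insist on a configure step would instead need the BUILD file to invoke a genrule or shell out to make/cmake.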

A serializer would need to keep a reasonable number of recent messages in memory, already converted to column chunks. A single parquet file can contain one or more row groups; to begin with, we could tune this to target 1-10 row groups per file, depending on how busy the machine is.

File output can use the existing fsspool implementation, just swapping protobuf files for parquet files.

We can avoid building all of the dependencies by bundling them under an @org_apache_arrow target and only depending on that from the new serializer.

Alternatives

Parquet is a standardized, stable format, and implementations other than the official one exist.

It may be worthwhile to implement a minimal version of parquet without all the dependencies. Such a project exists for Go, and it could serve as a blueprint. (In fact, we'd only need writer support and don't need the higher-level schema support, so we could end up with an even smaller codebase.)

I'm not sure who has time to do this, though.

@russellhancox
Collaborator

I have no objections to adding Parquet support, it looks useful and interesting. I do object to adding several large dependencies and dramatically increasing binary size to add it. If a smaller implementation (in C/C++) can be found (or developed), I don't see any other roadblocks to adding support.

@the80srobot
Contributor Author

the80srobot commented Nov 11, 2023

Yeah, I'd forgotten that Santa needs to sign all shipped binaries, so a compile-time switch is not a good option.

I'm still trying to shrink the build, and I filed an issue with Arrow asking for help. I'll call this option (1).

Option (2) is to build libparquet as a dylib, bundle it, and then load it on demand. This way, santad doesn't get any bigger unless you enable parquet support at runtime.

Option (3) is to use one of the Rust crates (parquet or parquet2) instead of the C++ implementation. At a cursory glance, those look smaller and more reasonable.

Options 2 & 3 aren't mutually exclusive - I actually quite like the idea of shipping as a dylib, because it leaves things as they are for people who don't care about parquet.

Option (4) is building a new miniparquet writer-only library.

@the80srobot
Contributor Author

the80srobot commented Nov 11, 2023

Actually, @russellhancox how would you feel about having a dependency on a Rust crate? parquet2 compiles to a 2-MB dylib in all of 10 seconds, which is a 10x improvement over Arrow in both build time and artefact size. Before I go and figure out how to use the cxx bridge in this direction, I want to make sure there's no reason you know of to avoid Rust.

@russellhancox
Collaborator

As long as it builds with bazel and bridges to C++ without performance issues, I see no problem.

@pmarkowsky
Contributor

pmarkowsky commented Nov 14, 2023 via email

@pmarkowsky
Contributor

Related PoC for building rust with bazel and bridging to C/C++ WIP #1240

@the80srobot
Contributor Author

The draft PR #1240 now has some bona fide Rust code to write a couple of columns into a parquet file.

Still TODO:

  1. String support
  2. Some kind of nice wrapper to translate from row types to table types
  3. Wrap (2) in a C++ API
  4. An end-to-end test validating that the parquet output can be read by something like Pandas

Additionally, I'm not super happy with the fact that each row group flush requires allocating memory, but the crate's design prioritizes async safety over allocation efficiency, and it's kind of hard to work around that. I don't think the outcome is terrible - most likely, it's about as allocation-efficient as proto2 with arenas, but it should be profiled.

@the80srobot
Contributor Author

OK, I've done all of the above.

  1. Strings are supported
  2. The Table type provides a handy API to write cell by cell
  3. The API works from C++
  4. There is a C++ binary that produces a valid parquet file. (Though checking its validity requires something like Pandas, and is left as an exercise for the reader for the moment.)

@the80srobot
Contributor Author

the80srobot commented Dec 20, 2023

Thanks for the chat on Monday. To summarize:

  • Pete and Tom need to take a look at the Rust dependencies
  • The mid-term goal is to split the parquet writer logic (most of the Rust code) into a separate library for Pedro to also depend on
  • The place to integrate this to Santa is likely at the Logger level, not Writer/Serializer. To that end, I'll next split the Logger class to create a pure abstract base class. (Send the sketch to Matt.)
