
Add Parquet output #3682

Open
shani-gold opened this issue Nov 13, 2023 · 5 comments · May be fixed by #3685

Comments

@shani-gold

Parquet is a columnar storage file format that is commonly used in the context of big data processing frameworks, such as Apache Spark and Apache Hive. The format is designed to be highly efficient for both storage and processing, especially in scenarios involving large-scale data analytics. Here are some reasons why Parquet is often used:

Columnar Storage: Parquet stores data in a columnar format, which means that values from the same column are stored together. This storage layout is more efficient for analytics queries that often involve reading specific columns rather than entire rows.

Compression: Parquet supports various compression algorithms, enabling efficient use of storage space. It reduces the amount of disk space needed to store large datasets, making it cost-effective.

Predicate Pushdown: Some query engines, like Apache Spark, can take advantage of predicate pushdown with Parquet. This means that certain filter conditions can be pushed down to the storage layer, minimizing the amount of data that needs to be read during query execution.

Schema Evolution: Parquet supports schema evolution, allowing you to evolve your data schema over time without requiring modifications to existing data or affecting backward compatibility.

Compatibility with Big Data Ecosystem: Parquet is widely used in the big data ecosystem, and many big data processing frameworks have built-in support for reading and writing Parquet files. This makes it easier to integrate Parquet with existing data processing workflows.

Performance: Due to its columnar storage and other optimizations, Parquet can offer improved performance for analytics queries, especially when dealing with large datasets.

When working with large-scale data analytics, Parquet can be a suitable choice for storing and processing your data efficiently. It provides benefits in terms of storage space, query performance, and compatibility with popular big data tools and frameworks. However, the choice of file format depends on your specific use case and the tools you are using in your data processing pipeline.
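To ground this in the project's own language, here is a minimal sketch of a typed Parquet round trip in Go, assuming the third-party github.com/parquet-go/parquet-go library; the library choice and the Row fields are illustrative, not a proposal for Tracee's schema:

```go
// Minimal sketch: write and read a typed Parquet file in Go.
// Assumes the third-party github.com/parquet-go/parquet-go library;
// the Row struct below is purely illustrative.
package main

import (
	"fmt"
	"log"

	"github.com/parquet-go/parquet-go"
)

// Row maps struct fields to Parquet columns via struct tags.
type Row struct {
	Timestamp int64  `parquet:"timestamp"`
	EventName string `parquet:"event_name"`
	ProcessID int32  `parquet:"process_id"`
}

func main() {
	rows := []Row{
		{Timestamp: 1699900000, EventName: "openat", ProcessID: 1234},
		{Timestamp: 1699900001, EventName: "execve", ProcessID: 1235},
	}

	// WriteFile encodes the rows column by column (columnar layout)
	// and compresses each column chunk independently.
	if err := parquet.WriteFile("events.parquet", rows); err != nil {
		log.Fatal(err)
	}

	// ReadFile decodes the file back into the same typed rows.
	got, err := parquet.ReadFile[Row]("events.parquet")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(got)
}
```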

@rafaeldtinoco
Contributor

Hello @shani-gold,

This isn't currently planned by us in the short term, but it wouldn't be complicated to implement. Is this something you're willing to work on?

If you check pkg/printer/printer.go you will find multiple printer flavors, like json, gob, and table. You could implement a parquet printer there, perhaps by following the json printer (and possibly converting JSON to Parquet?). It's likely that the Parquet data schema would have to follow our JSON schema so it wouldn't break very often. A rough sketch of what that could look like follows below.
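Here is a rough, hypothetical sketch of such a printer flavor. The Init/Print/Close shape is an assumption modeled on the printer flavors mentioned above (the real eventPrinter interface in pkg/printer/printer.go may differ), the flattened row only covers a few trace.Event fields, and it leans on the third-party github.com/parquet-go/parquet-go library:

```go
// Hypothetical sketch of a parquet printer flavor for Tracee, modeled
// on the json printer. The Init/Print/Close shape is an assumption;
// the real eventPrinter interface in pkg/printer/printer.go may differ.
package printer

import (
	"io"

	"github.com/parquet-go/parquet-go"

	"github.com/aquasecurity/tracee/types/trace"
)

// parquetEventRow is an illustrative, flattened projection of a Tracee
// event; a real schema would mirror the full JSON event schema.
type parquetEventRow struct {
	Timestamp   int    `parquet:"timestamp"`
	ProcessID   int    `parquet:"process_id"`
	HostName    string `parquet:"host_name"`
	EventName   string `parquet:"event_name"`
	ReturnValue int    `parquet:"return_value"`
}

type parquetEventPrinter struct {
	out    io.Writer
	writer *parquet.GenericWriter[parquetEventRow]
}

func (p *parquetEventPrinter) Init() error {
	p.writer = parquet.NewGenericWriter[parquetEventRow](p.out)
	return nil
}

// Print buffers one event into the current Parquet row group.
func (p *parquetEventPrinter) Print(event trace.Event) {
	_, _ = p.writer.Write([]parquetEventRow{{
		Timestamp:   event.Timestamp,
		ProcessID:   event.ProcessID,
		HostName:    event.HostName,
		EventName:   event.EventName,
		ReturnValue: event.ReturnValue,
	}})
}

// Close flushes buffered rows and writes the Parquet footer; until
// then the output is not a readable Parquet file.
func (p *parquetEventPrinter) Close() {
	_ = p.writer.Close()
}
```

One design note: unlike the json printer, which can emit each event independently, Parquet keeps its schema and row-group metadata in a file footer, so the printer must flush on shutdown or the output file is unreadable.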

Also, be aware that we're currently changing the "Tracee" event structure (#2870), which would mean the data schema would have to mimic the new one.

Hope it helps for now, and we can keep this open if you're not doing it and someone else is willing to (or our priority changes).

Thanks!

shani-gold pushed a commit to shani-gold/tracee that referenced this issue Nov 13, 2023
shani-gold linked a pull request Nov 13, 2023 that will close this issue
@shani-gold
Author

Hi Rafael, I already implemented it :)
#3685

@rafaeldtinoco
Contributor

> Hi Rafael, I already implemented it :) #3685

Oh, that was easy on my side =D. Ok, I'll try to review soon! Thanks for the work! Excited to check it.

@rafaeldtinoco
Contributor

@shani-gold would you mind sharing the use case and project? Just out of curiosity. Are you doing OLAP processing on events? Is that why you wanted to provide such a feature?

@shani-gold
Author

It's for a profiler.

yanivagman linked a pull request May 9, 2024 that will close this issue