
Add Parquet output #3682

Open
shani-gold opened this issue Nov 13, 2023 · 5 comments · May be fixed by #3685

Comments

@shani-gold

Parquet is a columnar storage file format that is commonly used in the context of big data processing frameworks, such as Apache Spark and Apache Hive. The format is designed to be highly efficient for both storage and processing, especially in scenarios involving large-scale data analytics. Here are some reasons why Parquet is often used:

Columnar Storage: Parquet stores data in a columnar format, which means that values from the same column are stored together. This storage layout is more efficient for analytics queries that often involve reading specific columns rather than entire rows.

Compression: Parquet supports various compression algorithms, enabling efficient use of storage space. It reduces the amount of disk space needed to store large datasets, making it cost-effective.

Predicate Pushdown: Some query engines, like Apache Spark, can take advantage of predicate pushdown with Parquet. This means that certain filter conditions can be pushed down to the storage layer, minimizing the amount of data that needs to be read during query execution.

Schema Evolution: Parquet supports schema evolution, allowing you to evolve your data schema over time without requiring modifications to existing data or affecting backward compatibility.

Compatibility with Big Data Ecosystem: Parquet is widely used in the big data ecosystem, and many big data processing frameworks have built-in support for reading and writing Parquet files. This makes it easier to integrate Parquet with existing data processing workflows.

Performance: Due to its columnar storage and other optimizations, Parquet can offer improved performance for analytics queries, especially when dealing with large datasets.

When working with large-scale data analytics, Parquet can be a suitable choice for storing and processing your data efficiently. It provides benefits in terms of storage space, query performance, and compatibility with popular big data tools and frameworks. However, the choice of file format depends on your specific use case and the tools you are using in your data processing pipeline.
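To ground this in the project's own language, here is a minimal sketch of a typed Parquet round trip in Go, assuming the third-party github.com/parquet-go/parquet-go library; the library choice and the Row fields are illustrative, not a proposal for Tracee's schema:

```go
// Minimal sketch: write and read a typed Parquet file in Go.
// Assumes the third-party github.com/parquet-go/parquet-go library;
// the Row struct below is purely illustrative.
package main

import (
	"fmt"
	"log"

	"github.com/parquet-go/parquet-go"
)

// Row maps struct fields to Parquet columns via struct tags.
type Row struct {
	Timestamp int64  `parquet:"timestamp"`
	EventName string `parquet:"event_name"`
	ProcessID int32  `parquet:"process_id"`
}

func main() {
	rows := []Row{
		{Timestamp: 1699900000, EventName: "openat", ProcessID: 1234},
		{Timestamp: 1699900001, EventName: "execve", ProcessID: 1235},
	}

	// WriteFile encodes the rows column by column (columnar layout)
	// and compresses each column chunk independently.
	if err := parquet.WriteFile("events.parquet", rows); err != nil {
		log.Fatal(err)
	}

	// ReadFile decodes the file back into the same typed rows.
	got, err := parquet.ReadFile[Row]("events.parquet")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(got)
}
```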

@rafaeldtinoco
Contributor

Hello @shani-gold,

This isn't currently planned by us in the short term, but it wouldn't be complicated to implement. Is this something you're willing to work on?

If you check pkg/printer/printer.go you will find multiple printer flavors, like json, gob, and table. You could implement a parquet printer there, perhaps by following the json printer (and possibly converting JSON to Parquet?). It's likely that the Parquet data schema would have to follow our JSON schema so it wouldn't break very often. A rough sketch of what that could look like follows below.
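Here is a rough, hypothetical sketch of such a printer flavor. The Init/Print/Close shape is an assumption modeled on the printer flavors mentioned above (the real eventPrinter interface in pkg/printer/printer.go may differ), the flattened row only covers a few trace.Event fields, and it leans on the third-party github.com/parquet-go/parquet-go library:

```go
// Hypothetical sketch of a parquet printer flavor for Tracee, modeled
// on the json printer. The Init/Print/Close shape is an assumption;
// the real eventPrinter interface in pkg/printer/printer.go may differ.
package printer

import (
	"io"

	"github.com/parquet-go/parquet-go"

	"github.com/aquasecurity/tracee/types/trace"
)

// parquetEventRow is an illustrative, flattened projection of a Tracee
// event; a real schema would mirror the full JSON event schema.
type parquetEventRow struct {
	Timestamp   int    `parquet:"timestamp"`
	ProcessID   int    `parquet:"process_id"`
	HostName    string `parquet:"host_name"`
	EventName   string `parquet:"event_name"`
	ReturnValue int    `parquet:"return_value"`
}

type parquetEventPrinter struct {
	out    io.Writer
	writer *parquet.GenericWriter[parquetEventRow]
}

func (p *parquetEventPrinter) Init() error {
	p.writer = parquet.NewGenericWriter[parquetEventRow](p.out)
	return nil
}

// Print buffers one event into the current Parquet row group.
func (p *parquetEventPrinter) Print(event trace.Event) {
	_, _ = p.writer.Write([]parquetEventRow{{
		Timestamp:   event.Timestamp,
		ProcessID:   event.ProcessID,
		HostName:    event.HostName,
		EventName:   event.EventName,
		ReturnValue: event.ReturnValue,
	}})
}

// Close flushes buffered rows and writes the Parquet footer; until
// then the output is not a readable Parquet file.
func (p *parquetEventPrinter) Close() {
	_ = p.writer.Close()
}
```

One design note: unlike the json printer, which can emit each event independently, Parquet keeps its schema and row-group metadata in a file footer, so the printer must flush on shutdown or the output file is unreadable.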

Also, be aware that we're currently changing the "Tracee" event structure (#2870), which would mean the data schema would have to mimic the new one.

Hope it helps for now, and we can keep this open if you're not doing it and someone else is willing to (or our priority changes).

Thanks!

shani-gold pushed a commit to shani-gold/tracee that referenced this issue Nov 13, 2023
shani-gold linked a pull request Nov 13, 2023 that will close this issue
@shani-gold
Author

Hi Rafael, I already implemented it :)
#3685

@rafaeldtinoco
Contributor

> Hi Rafael, I already implemented it :) #3685

Oh, that was easy on my side =D. Ok, I'll try to review soon! Thanks for the work! Excited to check it.

@rafaeldtinoco
Contributor

@shani-gold would you mind sharing the use case and project? Just out of curiosity. Are you doing OLAP processing on events? Is that why you wanted to provide such a feature?

@shani-gold
Author

It's for a profiler.

yanivagman linked a pull request May 9, 2024 that will close this issue