[Question] get a listing of parquet files? #341
Comments
Hi!
Thanks @mjakubowski84, that makes sense! It seems like you have already provided all the building blocks to do this, so I will be able to make progress. Thank you so much 🙏
@calvinlfer I have a feeling that, as you are reading partitions, you might be interested in the latest release: https://github.com/mjakubowski84/parquet4s/releases/tag/v2.17.0. The internals and API of …
This is great! I am reading partitions and this will help a lot, thank you so much 🙏
Hello, it’s me again!
I was wondering if it’s possible to get a listing of all parquet files (and their partition info) in a partitioned read?
To add some context: I have a use case at $work where I take unstructured data from S3, stored as partitioned parquet files, parse it into structured data, and sink it into Kafka topics. The input data itself doesn't lend itself well to detecting duplicates that can arise in failure scenarios (machines going down while producing this data to Kafka). So I'm trying to use a combination of the partition information + the parquet file name + the line number within the file (via zipWithIndex), and attach all of that to each record, so that downstream consumers can recognize this scheme, detect whether duplicate data is present, and do something about it.
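To illustrate the scheme above, here is a minimal sketch of what such a dedup key could look like. `RecordKey` and its fields are hypothetical names invented for this example; nothing here is parquet4s API.

```scala
// Hypothetical composite key: partition values + source file + line index.
// Downstream consumers can compare these strings to spot duplicates.
final case class RecordKey(
  partitions: Map[String, String], // e.g. Map("date" -> "2024-01-01")
  fileName: String,                // the parquet file the record came from
  lineIndex: Long                  // position within the file (from zipWithIndex)
) {
  // A stable, order-independent string form (partitions sorted by key).
  def asString: String = {
    val parts = partitions.toSeq.sorted.map { case (k, v) => s"$k=$v" }
    (parts :+ fileName :+ lineIndex.toString).mkString("/")
  }
}
```

The key is attached to each outgoing Kafka record; because the partition map is sorted before serialization, two records from the same file and line always produce the same string.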
If I had this capability, I could take each file, turn it into an fs2 stream, attach the relevant context (partition info, file info, and line info) to each record, and produce that into Kafka.
Is this something we can support? I would love to hear your thoughts, and whether there's a better way to solve this.
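For the listing part, here is a sketch of one way to do it outside the library, assuming a local (or locally mounted) dataset laid out with Hive-style `key=value` partition directories. This uses only the JDK's `java.nio.file`; it is not parquet4s API, and for S3 you would swap in the Hadoop `FileSystem` listing instead.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object ListPartitionedFiles {
  // Collect the key=value segments between the dataset root and the file.
  def partitionsOf(root: Path, file: Path): Map[String, String] =
    root.relativize(file).iterator().asScala
      .map(_.toString)
      .collect { case seg if seg.contains("=") =>
        val Array(k, v) = seg.split("=", 2)
        k -> v
      }
      .toMap

  // Walk the tree and pair every parquet file with its partition info.
  def listParquetFiles(root: Path): List[(Path, Map[String, String])] =
    Files.walk(root).iterator().asScala
      .filter(p => Files.isRegularFile(p) && p.toString.endsWith(".parquet"))
      .map(p => p -> partitionsOf(root, p))
      .toList
}
```

Each `(path, partitions)` pair could then seed one fs2 stream per file, with `zipWithIndex` supplying the line numbers.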