
Create schema-rich format for enrich good stream #4579

Open
devorbit opened this issue Oct 28, 2023 · 2 comments

Comments

@devorbit

devorbit commented Oct 28, 2023

Use a schema-rich format for the Enrich stage good stream output

Hi Team,

I know there are multiple threads and discussions around this topic, but most of the ones I could find are outdated or have gone cold.
I am personally a big fan of the Snowplow platform and recently started building it for my organization from scratch on a GKE Kubernetes cluster.

It has been a challenging journey so far, but we are making progress. At this stage, however, I anticipate long-term caveats and challenges around the TSV output of the Enrich good stream.

We have both batch and stream processing pipelines downstream, serving offline and online teams across the organization.

The TSV format is a key concern for us because of the complexity and bottlenecks it cascades downstream.

I am looking for some clarity on the roadmap for this area. Any suggestions would be appreciated.

Thank you
Jay

@stanch
Contributor

stanch commented Oct 30, 2023

Hi @devorbit,

Could you clarify what you mean by “schema-rich format”?

The TSV output is not meant to be consumed directly. For loading into warehouses, we have dedicated loaders. For real-time apps, we recommend one of our Analytics SDKs that parse the TSV for you. You can also use Snowbridge with TSV to JSON conversion.
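
For example, with the Python Analytics SDK, parsing a single enriched TSV line looks roughly like this (untested sketch; the file name is just a placeholder):

```python
# Untested sketch: parsing one enriched TSV line with the Python Analytics SDK.
# pip install snowplow_analytics_sdk
import json

import snowplow_analytics_sdk.event_transformer as et

with open("enriched_event.tsv") as f:   # placeholder: one enriched event per line
    tsv_line = f.readline().rstrip("\n")

# transform() maps the tab-separated enriched event to a JSON-friendly dict,
# including the unstruct_event and contexts fields.
event = et.transform(tsv_line)

print(json.dumps(event, indent=2))
```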

@devorbit
Author

Hi @stanch, thank you for your thoughts on this. Let me explain my case here.

I am using Kafka and GCP for my Snowplow setup, and here are some of the caveats I have to deal with up front.

  1. No Kafka-to-BigQuery Loader:
    The loaders currently don't include a Kafka source for the BigQuery destination, so I set up Snowbridge to replicate from Kafka to Pub/Sub. This works, but with additional cost and caveats.

  2. No support for loading from Kafka to GCS as Iceberg (BigLake):
    The second issue we are facing is loading TSV events from Kafka/Pub/Sub to GCS as Iceberg tables. BigQuery only supports Iceberg through BigLake, but Snowplow does not offer an Iceberg sink as of now.

  3. Limitations of the Analytics SDK:
    The SDK examples look geared towards batch pipelines in Spark rather than streaming. For streaming use cases we need some sort of event and context schema for deserialisation and processing (see the sketch after this list).
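
To illustrate point 3, this is roughly how I imagine consuming the good stream today; an untested sketch where the topic name and broker address are placeholders:

```python
# Untested sketch: deserialising the enriched good stream straight from Kafka
# with the Analytics SDK instead of a hand-written schema.
# Topic name and broker address are placeholders.
from kafka import KafkaConsumer  # pip install kafka-python
import snowplow_analytics_sdk.event_transformer as et

consumer = KafkaConsumer(
    "enriched-good",                                  # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: v.decode("utf-8"),
)

for record in consumer:
    event = et.transform(record.value)  # dict keyed by the canonical field names
    # hand off to the streaming job / sink here
    print(event.get("event_id"), event.get("event_name"))
```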

So by "schema-rich format" I meant, to some degree, a direction for simplifying data processing downstream: for example, a library that creates schemas by stitching the event and its contexts together in some way.
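
Something like the following is what I have in mind, purely as an illustration; the context name (contexts_com_acme_user_1) and the selected fields are made up:

```python
# Purely illustrative: "stitch" the parsed event and one of its contexts into a
# single flat record with a fixed, known schema.
import snowplow_analytics_sdk.event_transformer as et

def stitch(tsv_line: str) -> dict:
    event = et.transform(tsv_line)
    record = {
        "event_id": event.get("event_id"),
        "collector_tstamp": event.get("collector_tstamp"),
        "event_name": event.get("event_name"),
    }
    # Contexts arrive as self-describing JSON; lift the fields of one
    # (hypothetical) context entity to the top level so downstream jobs
    # see flat columns.
    for ctx in event.get("contexts_com_acme_user_1", []):
        record.update({f"user_{key}": value for key, value in ctx.items()})
    return record
```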

I could be wrong or may not have a very good understanding of the Snowplow SDKs. Please suggest a workaround if that is the case.
