
Create schema-rich format for enrich good stream #4579

Open
devorbit opened this issue Oct 28, 2023 · 2 comments

Comments

@devorbit

devorbit commented Oct 28, 2023

Use a schema-rich format for the Enrich stage good stream output

Hi Team,

I know there are multiple threads and discussions around this topic, but most of the ones I could find are outdated or have gone cold.
I am personally a big fan of the Snowplow platform and recently started building it for my organization from scratch on a GKE Kubernetes cluster.

It has been a challenging journey so far, but we are making progress. At this stage, however, I anticipate long-term caveats and challenges around the TSV output of the Enrich good stream.

We have both batch and stream processing pipelines downstream, serving offline and online teams across the organization.

The TSV format is a key concern for us because of the complexity and bottlenecks it cascades downstream.

I am looking for some clarity on the roadmap for this area. Any suggestions would be appreciated.

Thank you
Jay

@stanch
Contributor

stanch commented Oct 30, 2023

Hi @devorbit,

Could you clarify what you mean by “schema-rich format”?

The TSV output is not meant to be consumed directly. For loading into warehouses, we have dedicated loaders. For real-time apps, we recommend one of our Analytics SDKs that parse the TSV for you. You can also use Snowbridge with TSV to JSON conversion.
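
For example, with the Python Analytics SDK, parsing a single enriched TSV line looks roughly like this (untested sketch; the file name is just a placeholder):

```python
# Untested sketch: parsing one enriched TSV line with the Python Analytics SDK.
# pip install snowplow_analytics_sdk
import json

import snowplow_analytics_sdk.event_transformer as et

with open("enriched_event.tsv") as f:   # placeholder: one enriched event per line
    tsv_line = f.readline().rstrip("\n")

# transform() maps the tab-separated enriched event to a JSON-friendly dict,
# including the unstruct_event and contexts fields.
event = et.transform(tsv_line)

print(json.dumps(event, indent=2))
```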

@devorbit
Author

Hi @stanch, thank you for your thoughts on this. Let me explain my case here.

I am using Kafka and GCP for my Snowplow setup, and here are some of the caveats I have to deal with up front.

  1. No Kafka-to-BigQuery Loader:
    The loaders currently don't include a Kafka source for the BigQuery destination, so I set up Snowbridge to replicate from Kafka to Pub/Sub. This works, but with additional cost and caveats.

  2. No support for loading from Kafka to GCS as Iceberg (BigLake):
    The second issue we are facing is loading TSV events from Kafka/Pub/Sub to GCS as Iceberg tables. BigQuery only supports Iceberg through BigLake, but Snowplow does not offer an Iceberg sink as of now.

  3. Limitations of the Analytics SDK:
    The SDK examples look geared towards batch pipelines in Spark rather than streaming. For streaming use cases we need some sort of event and context schema for deserialisation and processing (see the sketch after this list).
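
To illustrate point 3, this is roughly how I imagine consuming the good stream today; an untested sketch where the topic name and broker address are placeholders:

```python
# Untested sketch: deserialising the enriched good stream straight from Kafka
# with the Analytics SDK instead of a hand-written schema.
# Topic name and broker address are placeholders.
from kafka import KafkaConsumer  # pip install kafka-python
import snowplow_analytics_sdk.event_transformer as et

consumer = KafkaConsumer(
    "enriched-good",                                  # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: v.decode("utf-8"),
)

for record in consumer:
    event = et.transform(record.value)  # dict keyed by the canonical field names
    # hand off to the streaming job / sink here
    print(event.get("event_id"), event.get("event_name"))
```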

So by "schema-rich format" I meant, to some degree, a direction for simplifying data processing downstream: for example, a library that creates schemas by stitching the event and its contexts together in some way.
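
Something like the following is what I have in mind, purely as an illustration; the context name (contexts_com_acme_user_1) and the selected fields are made up:

```python
# Purely illustrative: "stitch" the parsed event and one of its contexts into a
# single flat record with a fixed, known schema.
import snowplow_analytics_sdk.event_transformer as et

def stitch(tsv_line: str) -> dict:
    event = et.transform(tsv_line)
    record = {
        "event_id": event.get("event_id"),
        "collector_tstamp": event.get("collector_tstamp"),
        "event_name": event.get("event_name"),
    }
    # Contexts arrive as self-describing JSON; lift the fields of one
    # (hypothetical) context entity to the top level so downstream jobs
    # see flat columns.
    for ctx in event.get("contexts_com_acme_user_1", []):
        record.update({f"user_{key}": value for key, value in ctx.items()})
    return record
```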

I could be wrong or may not have a very good understanding of the Snowplow SDKs. Please suggest a workaround if that is the case.
