jwalsh/beam-summit-nyc-2023

Beam Summit 2023

Tue, 13 Jun

  • 09:00 AM - 09:15 AM: Welcome by Danielle Syse
  • 09:15 AM - 09:45 AM: How to Fail with Real-time Analytics by Matthew Housley
  • 09:45 AM - 10:30 AM: Beam ML past, present and future by Kerry Donny-Clark & Reza Rokni
  • 10:30 AM - 11:00 AM: Break
  • 11:00 AM - 11:25 AM: Beam at Talend - the long road from incubator project to cloud-based Pipeline Designer tool by Alexey Romanenko
  • 11:00 AM - 11:50 AM: How to write an IO for Beam by John Casey
  • 11:00 AM - 11:50 AM: Multi-language pipelines: a unique Beam feature that will make your team more efficient by Chamikara Jayalath
  • 11:30 AM - 11:55 AM: Scaling Public Internet Data Collection With Apache Beam by Lior Dadosh
  • 12:00 PM - 12:50 PM: A Beginners Guide to Avro and Beam Schemas Without Smashing Your Keyboard by Devon Peticolas
  • 12:00 PM - 12:50 PM: Beam IO: CDAP And SparkReceiver IO Connectors Overview by Alex Kosolapov & Elizaveta Lomteva
  • 12:00 PM - 12:30 PM: Managed Stream Processing through Apache Beam at LinkedIn by Xinyu Liu, Bingfeng Xia & +1 More Speakers
  • 01:00 PM - 02:00 PM: Lunch
  • 02:00 PM - 02:25 PM: Easy Cross-Language With SchemaTransforms: Use Your Favorite Java Transform In Python SDK by Ahmed Abualsaud
  • 02:00 PM - 02:25 PM: From Dataflow Templates to Beam: Chartboost’s Journey by Austin Bennett & Ferran Fernandez
  • 02:30 PM - 02:55 PM: Cross-language JdbcIO enabled by Beam portable schemas by Yi Hu
  • 02:30 PM - 02:55 PM: Mapping Data to FHIR with Apache Beam by Alex Fragotsis
  • 02:30 PM - 02:55 PM: Meeting Security Requirements For Apache Beam Pipelines On Google Cloud by Lorenzo Caggioni
  • 03:00 PM - 03:25 PM: Introduction to Clustering in Apache Beam by Jasper Van den Bossche
  • 03:00 PM - 03:25 PM: Oops I actually wrote a Portable Beam Runner in Go by Robert Burke
  • 03:00 PM - 03:25 PM: Simplifying Speech-to-Text Processing with Apache Beam and Redis by Pramod Rao & Prateek Sheel
  • 03:30 PM - 03:55 PM: Developing (experimental) Rust SDKs and a Beam engine for IoT devices by Sho Nakatani
  • 03:30 PM - 03:55 PM: Hot Key Detection and Handling in Apache Beam Pipelines by Shafiqa Iqbal & Ikenna Okolo
  • 03:30 PM - 03:55 PM: Scaling Up The OpenTelemetry Collector With Beam Go by Alex Van Boxel
  • 04:00 PM - 04:15 PM: Break
  • 04:15 PM - 04:40 PM: Managing dependencies of Python pipelines by Valentyn Tymofieiev
  • 04:15 PM - 04:40 PM: Troubleshooting Slow Running Beam Pipelines by Mehak Gupta
  • 04:15 PM - 04:40 PM: Unbreakable & Supercharged Beam Apps with Scala + ZIO by Sahil Khandwala & Aris Vlasakakis
  • 04:45 PM - 05:35 PM: Beam loves Kotlin: full pipeline with Kotlin and Midgard library by Mazlum Tosun
  • 04:45 PM - 05:45 PM: Community Discussion: Future of Beam by Alex Van Boxel
  • 04:45 PM - 05:10 PM: Resolving out of memory issues in Beam Pipelines by Zeeshan Khan
  • 05:15 PM - 05:40 PM: Benchmarking Beam pipelines on Dataflow by Pranav Bhandari

Wed, 14 Jun

  • 09:00 AM - 10:00 AM: Founders’ Panel by Federico Patota, Reuven Lax & +2 More Speakers
  • 10:00 AM - 10:30 AM: Break
  • 10:30 AM - 10:55 AM: Apache Beam and Ensemble Modeling: A Winning Combination for Machine Learning by Shubham Krishna
  • 10:30 AM - 10:55 AM: Dealing with order in streams using Apache Beam by Israel Herraiz
  • 10:30 AM - 10:55 AM: Running Apache Beam on Kubernetes: A Case Study by Sascha Kerbler
  • 11:00 AM - 11:25 AM: Building Fully Managed Service for Beam Jobs with Flink on Kubernetes by Talat Uyarer & Rishabh Kedia
  • 11:00 AM - 11:25 AM: Getting started with Apache Beam Quest by Svetak Sundhar
  • 11:00 AM - 11:50 AM: Per Entity Training Pipelines in Apache Beam by Jasper Van den Bossche
  • 11:30 AM - 11:55 AM: Running Beam Multi Language Pipeline on Flink Cluster on Kubernetes by Lydian Lee
  • 11:30 AM - 11:55 AM: Too big to fail - a Beam Pattern for enriching a Stream using State and Timers by Tobias Kaymak & Israel Herraiz
  • 12:00 PM - 12:25 PM: Deduplicating and analysing time-series data with Apache Beam and QuestDB by Javier Ramirez
  • 12:00 PM - 12:50 PM: How many ways can you skin a cat, if the cat is a problem that needs an ML model to solve? by Kerry Donny-Clark
  • 12:00 PM - 12:25 PM: Machine Learning Platform Tooling with Apache Beam on Kubernetes by Charles Adetiloye
  • 12:30 PM - 12:55 PM: Design considerations to operate a stateful streaming pipeline as a service by Bhupinder Sindhwani & Israel Herraiz
  • 12:30 PM - 01:00 PM: Using Large Language Models in Data Engineering Tasks by Sean Jensen-Grey & Vince Gonzalez
  • 01:00 PM - 02:00 PM: Lunch
  • 02:00 PM - 02:25 PM: Large scale data processing Using Apache Beam and TFX libraries by Olusayo Olumayode Akinlaja
  • 02:00 PM - 02:25 PM: Parallelizing Skewed Hbase Regions using Splittable Dofn by Prathap Reddy
  • 02:00 PM - 02:25 PM: Write your own model handler for RunInference! by Ritesh Ghorse
  • 02:30 PM - 02:55 PM: Case study: Using statefulDofns to process late arriving data by Amruta Deshmukh
  • 02:30 PM - 02:55 PM: How to balance power and control when using Dataflow with an OLTP SQL Database by Florian Bastin & Leo Babonnaud
  • 02:30 PM - 02:55 PM: Power Realtime Machine Learning Feature Engineering with Managed Beam at LinkedIn by Yanan Hao & David Shao
  • 03:00 PM - 03:50 PM: CI/CD for Dataflow with Flex Templates and Cloud Build by Mazlum Tosun
  • 03:00 PM - 03:50 PM: Dataflow Streaming - What’s new and what’s coming by Tom Stepp & Iñigo San Jose Visiers
  • 03:00 PM - 03:25 PM: Optimizing Machine Learning Workloads on Dataflow by Alex Chan
  • 03:30 PM - 03:55 PM: ML model updates with side inputs in Dataflow streaming pipelines by Anand Inguva
  • 04:00 PM - 04:15 PM: Break
  • 04:15 PM - 05:15 PM: Beam Lightning Talks by Pablo Estrada
  • 04:15 PM - 04:40 PM: Loading Geospatial data to Google BigQuery by Sean Jensen-Grey & Dong Sun
  • 04:15 PM - 04:40 PM: Use Apache Beam to build Machine Learning Feature System at Affirm by Hao Xu
  • 04:45 PM - 05:10 PM: Accelerating Machine Learning Predictions with NVIDIA TensorRT and Apache Beam by Shubham Krishna
  • 04:45 PM - 05:10 PM: Streamlining Data Engineering and Visualization with Apache Beam and Power BI: A Real-World Case Study by Deexith Reddy
  • 05:30 PM - 08:00 PM: AI Camp: Generative AI meetup

Thu, 15 Jun

  • 09:00 AM - 10:30 AM: Workshop: Application Modernization with Kafka and Beam by Sami Ahmed
  • 09:00 AM - 10:30 AM: Workshop: Catch them if you can - Observability and monitoring by Wei Hsia
  • 09:00 AM - 10:30 AM: Workshop: Step by step development of a streaming pipeline in Python by Anthony L

Actions

[#A] Beam Examples

Python, Kafka, k8s, Flink, Beam

Local environment:

  • AWS
  • GCP
  • Kafka
  • Kubernetes
  • Apache Beam
  • Python

Focus on the following workflow:

  • Setup
  • Onboarding
  • Local development
  • Test support
  • Infrastructure
  • Operational support
  • Debugging
  • Cloud deployment (Terraform)

apache-beam data-processing stream-processing machine-learning python java cloud runners big-data real-time batch-processing side-inputs schemas multilingual

Presentation: Image Prompt Generation through GPT-4

Digitally rendered caricature of a cat with a sassy attitude and a furball for a sidekick

Dataflow with GeoBeam

Session Notes

[#C] How to Fail with Real-time Analytics by Matthew Housley

[#B] Beam ML past, present and future by Kerry Donny-Clark & Reza Rokni

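import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | fileio.MatchFiles("images/*.png")    # file pattern is an assumption
     | beam.Map(lambda match: match.path))  # placeholder per-file step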

Generative AI

TensorFlow Hub

Generative Art

[#C] Multi-language pipelines: a unique Beam feature that will make your team more efficient by Chamikara Jayalath

[#B] Troubleshooting Slow Running Beam Pipelines by Mehak Gupta

digraph G {
    logs -> cause 
    logs -> quotas
    ...
}

  • GCP dashboarding
  • Log types: worker-startup, worker, docker, kubelet, shuffler
  • These are a subset; others include system, vm-health, vm-monitor, …
  • Job Metrics > throughput
  • Metrics > CPU utilization
  • Data freshness
  • Batch > Execution Details > straggler detection (reshuffle)
  • GC thrashing
  • Remedies: change machine types, decrease parallelism, or use Dataflow Prime
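
The straggler idea in the notes above can be illustrated with a minimal sketch: flag work items whose duration is far above the median. This is not Dataflow's actual detection logic; the function name, the `factor` threshold, and the sample durations are all hypothetical.

```python
from statistics import median


def find_stragglers(durations, factor=3.0):
    """Return names of work items slower than `factor` times the median.

    `durations` maps a work-item name to its duration in seconds.
    The threshold rule is an illustrative assumption, not Dataflow's.
    """
    if not durations:
        return []
    threshold = factor * median(durations.values())
    return sorted(name for name, secs in durations.items() if secs > threshold)


# Hypothetical per-work-item durations in seconds.
durations = {"item-a": 10.0, "item-b": 12.0, "item-c": 11.0, "item-d": 95.0}
stragglers = find_stragglers(durations)
print(stragglers)
```

A median-based threshold is used here because a single very slow item barely moves the median, whereas it would inflate a mean.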

[#C] Community Discussion: Future of Beam by Alex Van Boxel

  • Airflow
  • Multi-dimensional watermarks
  • DAGs and Beam
  • Time as a concept
  • Using graph databases and iteration with loop unrolling
  • Graph traversal

[#C] Founders’ Panel by Federico Patota, Reuven Lax & +2 More Speakers

[#C] Apache Beam and Ensemble Modeling: A Winning Combination for Machine Learning by Shubham Krishna

with pipeline as p:
  ...

Example: Create image caption and ranks with sequential pattern (BLIP (Salesforce), CLIP (validation))

DAG:

digraph G {
    url -> blip -> captions -> clip;
    url -> { read preprocess inference };
    input -> inference -> prediction;
}
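
The sequential pattern in the DAG above (one model's output feeding the next) can be sketched in plain Python. Both model functions are hypothetical stubs standing in for BLIP and CLIP; the scoring rule (word count) and the example URL are placeholders, not anything shown in the talk.

```python
def generate_captions(url):
    """Stand-in for a captioning model such as BLIP (hypothetical stub)."""
    return [f"a photo from {url}", f"an image at {url} showing a cat"]


def score_caption(url, caption):
    """Stand-in for an image-text scorer such as CLIP (hypothetical stub).

    The score here is just the caption's word count.
    """
    return float(len(caption.split()))


def caption_and_rank(url):
    """Sequential pattern: the second stage consumes the first stage's output."""
    captions = generate_captions(url)
    scored = [(caption, score_caption(url, caption)) for caption in captions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = caption_and_rank("https://example.com/cat.png")
for caption, score in ranked:
    print(score, caption)
```

In a Beam pipeline each stage would typically become its own transform, so the stubs could be replaced by real model calls without changing the overall shape.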

[#C] Running Apache Beam on Kubernetes: A Case Study by Sascha Kerbler

[#C] How many ways can you skin a cat, if the cat is a problem that needs an ML model to solve? by Kerry Donny-Clark

[#A] Using Large Language Models in Data Engineering Tasks by Sean Jensen-Grey & Vince Gonzalez

  • Treat LLM use as a workflow consideration
  • A significant share of people (about 1/3) use them daily
  • Attention Is All You Need https://arxiv.org/abs/1706.03762
  • Context window for the requests to the LLM
  • ChatLLM + RLHF
  • Use of the corrective effects
  • Failure to consider the requirements for prompt generation
  • “You’re holding it wrong”
  • How To Ask Questions The Smart Way http://www.catb.org/~esr/faqs/smart-questions.html
  • Source, Question, Destination
  • ETL
  • Dimensionality transform to answer as lower-level space
  • Shots
  • Example: quadratic equation solving (one shot)
  • Example: chain of thought for
  • Example: total papers and Pandas dataframe + passing previous code
  • Example: test data generation
  • Hallucinations about things that should exist
  • On Bullshit: https://press.princeton.edu/books/hardcover/9780691122946/on-bullshit
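
The "Source, Question, Destination" and "Shots" notes above can be read as the parts of a prompt: the data to work from, the task, the desired output form, and zero or more worked examples. The sketch below assembles such a prompt as a string. That reading, the function name, and the field labels are assumptions; no LLM API is called.

```python
def build_prompt(source, question, destination, shots=()):
    """Assemble a prompt from source data, a task, and a target format.

    `shots` is a sequence of (input, output) example pairs: no pairs gives
    a zero-shot prompt, one pair a one-shot prompt, and so on.
    """
    parts = [
        f"Source:\n{source}",
        f"Task: {question}",
        f"Answer format: {destination}",
    ]
    for example_in, example_out in shots:
        parts.append(f"Example input: {example_in}\nExample output: {example_out}")
    return "\n\n".join(parts)


# One-shot prompt for the quadratic-equation example from the notes.
prompt = build_prompt(
    source="x^2 - 5x + 6 = 0",
    question="Solve the quadratic equation.",
    destination="a comma-separated list of roots",
    shots=[("x^2 - 3x + 2 = 0", "1, 2")],
)
print(prompt)
```

Keeping the parts separate makes it easy to vary the number of shots while holding the source and task fixed.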

[#A] Loading Geospatial data to Google BigQuery by Sean Jensen-Grey & Dong Sun

Streamlining Data Engineering and Visualization with Apache Beam and Power BI: A Real-World Case Study by Deexith Reddy

https://github.com/mohaseeb/beam-nuggets

[#B] AI Camp: Generative AI meetup
