Data streaming is the process of transmitting a continuous flow of data (also known as streams), typically into stream processing software that derives insights from it.
Streaming data is generated continuously by thousands of data sources, which typically send their records simultaneously and in small sizes (on the order of kilobytes).
Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.
- Sensors in transportation vehicles, industrial equipment, and farm machinery send data to a streaming application. The application monitors performance, detects potential defects in advance, and places a spare part order automatically, preventing equipment downtime.
- A financial institution tracks changes in the stock market in real time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.
- A real-estate website tracks a subset of data from consumers' mobile devices and recommends properties to visit in real time based on their geo-location.
- A solar power company has to maintain power throughput for its customers, or pay penalties. It implemented a streaming data application that monitors all of the panels in the field and schedules service in real time, thereby minimizing the periods of low throughput from each panel and the associated penalty payouts.
- A media publisher streams billions of clickstream records from its online properties, aggregates and enriches the data with demographic information about users, and optimizes content placement on its site, delivering more relevant content and a better experience to its audience.
- An online gaming company collects streaming data about player-game interactions and feeds the data into its gaming platform. It then analyzes the data in real time and offers incentives and dynamic experiences to engage its players.
This example data streaming pipeline processes logs from online systems. Such logs can be used for:
- Improving security against threats
- System troubleshooting
- Performance optimization
- IP intelligence
- Targeted marketing advertisements based on traffic
- source/: Contains the source files
  - batch_pipeline.py: Code for batch processing of logs
  - create_tables: Code to create table in GCP BigQuery
  - generateStreamData.py: Code to generate fake log data using the [Faker](https://faker.readthedocs.io/en/master/) Python library
  - publishLogsToPubSub.py: Imports data from generateStreamData.py and publishes to GCP's Cloud Pub/Sub
  - stream_pipeline.py: Code for data streaming pipeline
  - scheduled_queries.sql: Paste these queries in GCP BigQuery and create scheduled queries to generate tables that have location of IP address
- utils/: Contains config files, access keys and additional scripts
  - example.cfg: Sample config file. Add your GCP credentials and IDs
  - .json: Add your GCP service account credential file here
- imgs/: Contains project architecture and some sample images
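
To illustrate the generate-and-publish flow that generateStreamData.py and publishLogsToPubSub.py are described as handling, here is a minimal sketch. It is not the repository's code: it uses only the standard library as a stand-in for Faker, and the log format (an Apache-style access line), the function names, and the `project_id`/`topic_id` parameters are all assumptions made for illustration. The publish step uses the `google-cloud-pubsub` client and is imported lazily so the generator can be tried without GCP credentials.

```python
import random
from datetime import datetime, timezone

# Hypothetical value pools; the real generator may use different fields.
METHODS = ["GET", "POST", "PUT", "DELETE"]
PATHS = ["/", "/login", "/cart", "/search"]
STATUSES = [200, 301, 404, 500]


def generate_log_line(rng: random.Random, now: datetime) -> str:
    """Build one fake access-log line (assumed Apache-style format)."""
    ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
    timestamp = now.strftime("%d/%b/%Y:%H:%M:%S %z")
    method = rng.choice(METHODS)
    path = rng.choice(PATHS)
    status = rng.choice(STATUSES)
    size = rng.randint(100, 5000)
    return f'{ip} - - [{timestamp}] "{method} {path} HTTP/1.1" {status} {size}'


def encode_for_pubsub(line: str) -> bytes:
    """Pub/Sub message payloads must be bytes, so encode the line as UTF-8."""
    return line.encode("utf-8")


def publish_lines(project_id: str, topic_id: str, lines) -> list:
    """Publish each line to a Pub/Sub topic and return the message IDs.

    Requires the google-cloud-pubsub package and valid GCP credentials.
    """
    from google.cloud import pubsub_v1  # lazy import: only needed when publishing

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    futures = [
        publisher.publish(topic_path, data=encode_for_pubsub(line))
        for line in lines
    ]
    # Each future resolves to the server-assigned message ID.
    return [future.result() for future in futures]
```

Passing a seeded `random.Random` and an explicit timestamp keeps the generator reproducible, which can help when comparing batch and streaming runs over the same input.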
- Python (3.6 and above): Source code
- GCP
  - Cloud Storage: Storing code and running it using Cloud Console
  - Cloud Pub/Sub: Cloud messaging service for ingesting and delivering the log messages
  - BigQuery: Storing transformed data
  - Cloud Dataflow (with Apache Beam): Tool to create pipelines for streaming or batch processing. It is particularly useful for parallel processing and is well suited to Extract, Transform, and Load (ETL) tasks
- Google Data Studio: Visualization and dashboards
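
As a rough illustration of the "transform" step in an ETL pipeline like the one described for Dataflow, the sketch below parses one raw log line into a flat dictionary of the kind that could be written as a BigQuery row. The log format (an Apache-style access line), the field names, and the function name `parse_log_line` are assumptions, not the repository's actual schema. In an Apache Beam pipeline, a function like this could be applied to each element with `beam.Map`.

```python
import re
from typing import Optional

# Assumed Apache-style access-log layout; adjust to the real log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Turn one raw log line into a row-like dict, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # malformed lines can be filtered out downstream
    row = match.groupdict()
    # Convert numeric fields so they can map to integer columns.
    row["status"] = int(row["status"])
    row["size"] = int(row["size"])
    return row


if __name__ == "__main__":
    sample = '203.0.113.7 - - [01/Jan/2024:00:00:00 +0000] "GET /login HTTP/1.1" 404 512'
    print(parse_log_line(sample))
```

Returning `None` for lines that do not match, instead of raising, lets a streaming job keep running when an occasional malformed record arrives.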