
Solution Accelerators

This repository contains end-to-end walkthroughs of solutions that address specific technical challenges by accelerating the development and deployment of software components and related documentation. In the first wave, the Solution Accelerator for Security Analytics is released.

Security Analytics

This repository contains an end-to-end walkthrough that uses Google Cloud services to demonstrate a Solution Accelerator for Security Analytics. The Solution Accelerator covers ingesting real-time/streaming and batch data into BigQuery, developing models with Vertex AI and BigQuery ML (BQML) to detect and report security anomalies, and visualizing the details with Looker Studio. A subset of public datasets and generated data is used to simulate the flow.

Table of Contents

  • Introduction
  • High Level Architecture
  • Tech Stack
  • Hands-on
  • Versioning

Introduction

The Solution Accelerator creates the core Google Cloud infrastructure for a data ingestion pipeline that supports batch and real-time data, applies transformations using out-of-the-box templates, and lets the data flow through an MLOps pipeline covering data collection, processing, modeling, anomaly detection, and visualization. These key elements can be applied in the security analytics domain to use cases such as:

  • Analyzing network traffic to identify patterns that indicate a potential attack
  • Detecting insider threats or malicious activity
  • Incident response and forensics (using logs)
  • Managing third- and fourth-party vendor risk
  • User behavior analysis
  • Data exfiltration detection (potentially in conjunction with VPC Service Controls)

High Level Architecture

The labelled numbers (1-6) correspond to the sprints, each of which covers a stage of the data journey in detail.

[HighLevelFlow: high-level architecture diagram]

| Sprint | Description | Cost | Duration |
| --- | --- | --- | --- |
| 1. Realtime Ingestion | Google Cloud Pub/Sub is used to stream data in real time to BigQuery using the JSON log format | $0 | 10 mins |
| 2. Enrichment | Dataflow is used with Pub/Sub to stream data to BigQuery | $0 | 10 mins |
| 3. Feature Store | Dataflow is used to store data from Google Cloud Storage into Vertex AI Feature Store | $10/hr | 45 mins |
| 4. Anomaly detection | Anomaly detection is demonstrated using Feature Store, AutoML, and Vertex AI Model Registry | $10/hr [1] | 45 mins [2] |
| 5. BigQueryML | Data stored in BigQuery is leveraged to develop a BigQuery ML model for anomaly detection | $16 [1] | 20 mins |
| 6. Visualization | An anomaly detection dashboard is developed using Looker Studio that shows the various data paths and trigger patterns; custom dashboards can be developed depending on the use case | $16 [1] | 10 mins |
| Clean up | Tear-down of resources from previous sprints | $0 | 20 mins |

[1] Cost includes the previous sprints

[2] Alternatively, a custom model can be developed, trained, and deployed, which takes about 4 hours

The data journey above, from log ingestion through log enrichment to inference, can be walked through using different datasets.

Tech Stack

  • Python 3.7
  • Terraform / HCL (HashiCorp Configuration Language)
  • Shell scripting
  • Google Cloud services

Hands-on

Bootstrap

This is the first step; it creates the foundational infrastructure needed for the remaining sprints. Click here for instructions.

Note: Do not skip this step. It lays down the foundational scripts needed to automate infrastructure provisioning from Sprint 1 onwards.

Realtime Ingestion

This sprint reads data from a file to simulate a real-time feed, ingests it into a Cloud Pub/Sub topic, and stores it in a BigQuery table. Pub/Sub-to-BigQuery ingestion is done via a Pub/Sub BigQuery subscription. Click here for instructions.
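As a rough sketch of the publishing side, the snippet below replays newline-delimited JSON records from a local file into a Pub/Sub topic using the google-cloud-pubsub client. The project ID, topic ID, and file name are illustrative placeholders, not values from this repository.

```python
# Minimal sketch: replay newline-delimited JSON log records from a local file
# into a Pub/Sub topic to simulate a real-time feed.
import json
import time

from google.cloud import pubsub_v1

PROJECT_ID = "your-project-id"   # assumption: replace with your project
TOPIC_ID = "security-logs"       # assumption: replace with your topic

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

with open("sample_logs.jsonl") as f:   # hypothetical sample file
    for line in f:
        record = json.loads(line)
        # Publish the raw JSON bytes; a BigQuery subscription on this topic
        # can write each payload straight into a table.
        future = publisher.publish(topic_path, json.dumps(record).encode("utf-8"))
        future.result()          # block until the message is acknowledged
        time.sleep(0.1)          # throttle to approximate a live stream
```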

Data Enrichment

This sprint also reads data from a file to simulate a real-time feed and ingests it into a Cloud Pub/Sub topic, but here Pub/Sub-to-BigQuery ingestion is done via Dataflow, which also performs the data enrichment. Click here for instructions.
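The enrichment path can be pictured with a minimal Apache Beam pipeline like the one below, which reads JSON messages from a Pub/Sub subscription, adds a derived field, and streams the rows into BigQuery. The subscription, table, and enrichment rule are assumptions for illustration, not the repository's actual Dataflow job.

```python
# Minimal Apache Beam sketch of the enrichment path:
# Pub/Sub -> enrich -> BigQuery (streaming).
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def enrich(message: bytes) -> dict:
    """Parse a JSON log message and add a derived field."""
    row = json.loads(message.decode("utf-8"))
    # Illustrative enrichment: flag traffic on privileged ports.
    row["is_privileged_port"] = int(row.get("dst_port", 0)) < 1024
    return row


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/your-project-id/subscriptions/security-logs-sub")
        | "Enrich" >> beam.Map(enrich)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "your-project-id:security.enriched_logs",
            # Assumes the table already exists with a matching schema.
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```

Run locally with the DirectRunner for testing, or pass the DataflowRunner options to execute it as a managed Dataflow job.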

Feature Store

This sprint builds a feature engineering platform for Security Analytics. The milestone involves building an enrichment pipeline that reads data from GCS into a Dataflow job that writes to Vertex AI Feature Store. Click here for instructions.
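For orientation, here is a minimal sketch of batch-ingesting precomputed features from GCS into Vertex AI Feature Store with the google-cloud-aiplatform SDK. The featurestore ID, entity type, feature IDs, and GCS path are hypothetical, and the repository's actual pipeline uses a Dataflow job rather than this direct SDK call.

```python
# Minimal sketch: load engineered features from a CSV in GCS into
# Vertex AI Feature Store. All resource names below are assumptions.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

featurestore = aiplatform.Featurestore("security_featurestore")  # assumed ID
entity_type = featurestore.get_entity_type("network_event")      # assumed ID

# Batch-ingest precomputed features; CSV column names must match the
# feature IDs registered on the entity type.
entity_type.ingest_from_gcs(
    feature_ids=["bytes_sent", "bytes_received", "is_privileged_port"],
    feature_time="event_timestamp",
    gcs_source_uris=["gs://your-bucket/features/network_events.csv"],
    gcs_source_type="csv",
    entity_id_field="event_id",
)
```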

Anomaly Detection

This sprint demonstrates anomaly detection using Feature Store, AutoML, and the Vertex AI Model Registry. Click here for instructions.
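A condensed sketch of the training step with the Vertex AI SDK might look like the following; the BigQuery source table and the labeled is_anomaly target column are hypothetical. Models trained this way are placed in the Vertex AI Model Registry automatically.

```python
# Minimal sketch: train an AutoML tabular classifier for anomaly detection.
# Dataset, table, and column names are illustrative assumptions.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

dataset = aiplatform.TabularDataset.create(
    display_name="security-events",
    bq_source="bq://your-project-id.security.training_data",
)

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="anomaly-classifier",
    optimization_prediction_type="classification",
)

# The trained model lands in the Vertex AI Model Registry.
model = job.run(
    dataset=dataset,
    target_column="is_anomaly",          # assumed label column
    model_display_name="anomaly-classifier-v1",
    budget_milli_node_hours=1000,        # 1 node hour training budget
)
```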

BigQuery ML

This sprint uses the streaming and batch datasets to train a K-Means model for clustering. Anomaly detection is demonstrated and the results are stored in a BigQuery table; all anomalies are alerted via Pub/Sub. Click here for instructions.
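As a sketch of this flow, the snippet below trains a K-Means model with BigQuery ML and then flags outliers with ML.DETECT_ANOMALIES. The dataset, table, feature columns, cluster count, and contamination value are illustrative assumptions, not the repository's actual configuration.

```python
# Minimal sketch of the BigQuery ML path: train K-Means, then detect anomalies.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # assumed project

# Train a K-Means clustering model over assumed feature columns.
client.query("""
    CREATE OR REPLACE MODEL `security.kmeans_model`
    OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
    SELECT bytes_sent, bytes_received, dst_port
    FROM `security.enriched_logs`
""").result()

# contamination is the assumed fraction of rows expected to be anomalous.
rows = client.query("""
    SELECT *
    FROM ML.DETECT_ANOMALIES(
        MODEL `security.kmeans_model`,
        STRUCT(0.02 AS contamination),
        TABLE `security.enriched_logs`)
    WHERE is_anomaly
""").result()

for row in rows:
    print(dict(row))  # e.g., publish each anomaly to a Pub/Sub alert topic
```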

Visualization

This sprint demonstrates a dashboard, developed using Looker Studio, that shows the various data paths and trigger patterns of anomaly detection. Custom dashboards can be developed depending on the use case. Click here for instructions.

Cleanup

For resources created and managed by Terraform, execute terraform destroy in reverse order of the sprints. For resources created and managed outside of Terraform (created by the pipelines and by predictions/models), execute the relevant scripts from the utils directory. Click here for instructions.
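As an illustration of the reverse-order tear-down, a small driver script might look like the following; the sprint directory names are assumptions about the repository layout, so adapt them to the actual structure.

```python
# Minimal sketch: destroy Terraform-managed resources in reverse sprint order.
# Directory names below are assumed, not taken from this repository.
import subprocess

SPRINT_DIRS = [
    "bootstrap",
    "sprint1-realtime-ingestion",
    "sprint2-enrichment",
    "sprint3-feature-store",
    "sprint4-anomaly-detection",
    "sprint5-bigquery-ml",
    "sprint6-visualization",
]

# Destroy later sprints first so their dependencies on resources created
# by earlier sprints are released before those resources are removed.
for directory in reversed(SPRINT_DIRS):
    subprocess.run(
        ["terraform", "destroy", "-auto-approve"],
        cwd=directory,
        check=True,
    )
```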

Versioning

Initial version: August 2023

Code of Conduct

View

Contributing

View

License

View
