course_6_Reliable_Cloud_Infrastructure_Design_for_Resiliency_Scalability_and_Disaster_Recovery.md

Design for Resiliency, Scalability, Disaster Recovery & Security

  • Resiliency, Scalability, Disaster Recovery: Implement technologies and processes that assure business continuity in the event of a disaster.
  • Security: Implement policies that minimize security risks, such as auditing, separation of duties and least privilege.
  • Capacity Planning & Cost Optimization: Identify ways to optimize resources and minimize cost.
  • Deployment, Monitoring and Alerting, and Incident Response

Content

Detailed Content

Design for Resiliency, Scalability, and Disaster Recovery

Implement technologies and processes that assure business continuity in the event of a disaster.

video

Overview

This module deals with resiliency: the ability of a system to stay available and to bounce back from problems. But what is a resilient or available design? Is resiliency something you can add? Is it a feature you can turn on? Not really. Resiliency is the quality of a design that accounts for and handles failure. One principle of design is that sometimes a quality you want in the system isn't something you can act on directly. To get the quality you want, you have to look 180 degrees in the opposite direction. In this case, to get availability, you have to look at and deal with the potential causes and sources of failure.

This module is designed for:

  • resiliency,
  • scalability, and
  • disaster recovery.

resiliency_definition.png

We'll cover failure in general: failure due to loss and failure due to overload. How do we cope with failure? What about business continuity: how do we continue if there is a failure? And disaster recovery: if something major happens, how do we recover completely? Finally, we'll talk about scalable and resilient design, which is meant to keep these bad things from happening, or at least to deal with them. Then the photo service will go out of service, and we will have to redesign our logging system.

Failure Due to Loss

video

Failure is mandatory

For a period of time in the 1960s, a research team was trying to develop a perfect conductor. The theory was that if they could grow a perfect crystal and metal wire, there would be no signal loss and, therefore, no errors in communication. And they succeeded in creating the perfect wire. However, when they tested it, sometimes there were still errors. Do you know why? It's because we live in a quantum mechanical universe, and the location of electrons moving down a conductor is probabilistic. So, occasionally, an electron will appear outside of the wire and get lost. Errors are going to happen. Loss is going to happen.

So the challenge isn't to avoid it, but to accept it and deal with it.

In this lesson, you'll learn about designing systems to handle failure caused by loss of resources.

resiliency_failure_is_mandatory.png

Single Point of failure

resiliency_failure_single_point_of_failure.png

Design to avoid Single Point of failure: N+2 (a spare spare)

resiliency_failure_design_for_single_point_of_failure.png
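
The N+2 idea can be sketched as simple arithmetic: work out how many units (N) are just enough for peak load, then add two spares so the service survives one unit being out for maintenance while another fails unexpectedly. The function name `units_needed` and the example numbers are illustrative assumptions, not figures from the course.

```python
import math


def units_needed(peak_load: float, unit_capacity: float, spares: int = 2) -> int:
    """Return N + spares, where N units are just enough to serve peak_load."""
    if unit_capacity <= 0:
        raise ValueError("unit_capacity must be positive")
    n = math.ceil(peak_load / unit_capacity)  # N: minimum units for peak load
    return n + spares                         # N+2 by default


# Hypothetical example: 950 requests/s at peak, 200 requests/s per VM.
print(units_needed(peak_load=950, unit_capacity=200))  # N = 5, so 7 units
```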

Correlated failures

resiliency_failure_correlated_failures.png

Design to avoid correlated failures

resiliency_failure_design_to_avoid_correlated_failures.png

Failure Due to Overload

video

When a system is overloaded, at some point it crosses over into nonlinear behavior. It can crash, thrash, stop responding, or break adjacent resources that the service depends on. There's a very specific relationship between failure due to loss and failure due to overload. Imagine, for example, that a resource is lost and you queue up the work for that resource. All the work that comes in gets backlogged. When the resource is restored, there's a tremendous backlog of work to get through before any new work can be handled, and if there's enough of it, nothing new ever gets done. So the system goes from being totally unavailable due to resource loss to totally unavailable due to overload. In this lesson you'll learn about the common causes of overload failure and how to plan for them and deal with them. When it comes to overload, prevention is really the best solution, so design is critical.
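
The link between loss and overload can be made concrete with a little arithmetic. The sketch below assumes a constant arrival rate and a fixed processing capacity, both hypothetical; it estimates how long a restored resource needs to clear the backlog that built up during an outage while new work keeps arriving.

```python
from typing import Optional


def backlog_drain_seconds(outage_s: float, arrival_rate: float,
                          capacity: float) -> Optional[float]:
    """Seconds needed to clear the backlog after an outage.

    Returns None when capacity does not exceed the arrival rate, because
    the backlog then never shrinks and new work is never reached.
    """
    backlog = arrival_rate * outage_s   # work queued while the resource was lost
    spare = capacity - arrival_rate     # capacity left over for the backlog
    if spare <= 0:
        return None
    return backlog / spare


# Hypothetical numbers: 10-minute outage, 90 req/s arriving, 100 req/s capacity.
print(backlog_drain_seconds(600, 90, 100))   # 5400.0 seconds, i.e. 90 minutes
print(backlog_drain_seconds(600, 100, 100))  # None: the backlog never drains
```

A 10-minute loss turning into a 90-minute recovery shows why spare capacity matters: with little headroom, the system stays effectively unavailable long after the resource comes back.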

resiliency_failure_due_to_overload.png

Failover design for reliability

resiliency_failure_due_to_overload_failover_design_for_reliability.png

Cascading failures

resiliency_failure_due_to_overload_cascading_failures.png

Design to avoid Cascading failures

resiliency_failure_due_to_overload_design_to_avoid_cascading_failures.png

Example:

resiliency_failure_due_to_overload_design_to_avoid_cascading_failures_example.png

resiliency_failure_due_to_overload_design_to_avoid_cascading_failures_mitigate_incast_failure.png

Queries-of-Death overload failure

resiliency_failure_due_to_overload_QueriesOfDeath_failures.png

Positive feedback cycle overload failure

resiliency_failure_due_to_overload_positive_feedback_overload_failures.png

Detect overload early: Early warning systems (canaries)

resiliency_failure_due_to_overload_detect_early_canaries.png

Coping with Failure

video

The time to prepare for an emergency is before it happens. Moreover, the way to prepare for failure is to embrace it and accept that, sooner or later, it's inevitable. By making the processes and behaviors you want part of normal routine operations, you avoid the surprise.

For example, if you know that a zone outage is possible, consider establishing rotating outages as part of routine operations. Here's another idea: if you know that users expect 99.95 percent availability and your service is operating at 99.99 percent availability, consider using that 0.04 percent gap to exercise your resiliency and recovery designs. Finally, don't underestimate the importance of meetings. Circumstances are going to change. If you surround the technical processes with the right human processes, the team will catch issues before they become emergencies.
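
To put the availability gap in concrete terms, the sketch below converts availability percentages into minutes of allowed downtime. The 30-day period is an assumption chosen for illustration; the two percentages come from the paragraph above.

```python
def downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    return (100 - availability_pct) / 100 * period_days * 24 * 60


expected = downtime_minutes(99.95)   # what users expect: about 21.6 minutes
actual = downtime_minutes(99.99)     # what the service delivers: about 4.3 minutes
budget = expected - actual           # about 17.3 minutes available for exercises

print(round(expected, 2), round(actual, 2), round(budget, 2))
```

Under these assumptions, the 0.04 percent gap is roughly 17 minutes a month that could be spent on planned failure exercises without breaking user expectations.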

resiliency_coping_with_failure.png

Forest fires or Controlled burns

resiliency_coping_with_failure_forest_fire_or_controlled_burn.png

Prepare the team

resiliency_coping_with_failure_prepare_the_team.png

Incorporate failure into SLOs

resiliency_coping_with_failure_include_in_SLOs.png

Monthly meetings to build processes

resiliency_coping_with_failure_meeting_monthly_meetings.png

Strategies for dealing with failure

resiliency_coping_with_failure_strategies.png

Business Continuity & Disaster Recovery

video

The overall strategy for business continuity can be summed up in this motto:

No surprises.

Whatever's happening or could happen, you want to find out about it early and you want to give yourself plenty of recovery options in the design of your system. Now that's a balance.

How much resource and energy do you want to spend on insurance? You might have great recovery systems built into your design, but how do you know they're working, and how do you know that something hasn't changed and those systems haven't quietly stopped working? You need to understand what level of testing and exercise of the recovery systems gives you confidence. That will help you decide what to include in the design elements and also help shape the human processes that operate and maintain the system.

resiliency_business_continuity_disaster_recovery.png

Cloud DNS: 100% availability

resiliency_business_continuity_disaster_recovery_CloudDNS.png

Data Integrity

resiliency_business_continuity_disaster_recovery_data_integrity.png

Reliable recovery with Lazy Deletion

resiliency_business_continuity_disaster_recovery_lazy_or_soft_deletion.png

backup, archive, RESTORE!

resiliency_business_continuity_disaster_recovery_backup_archive_restore.png

Tiered backup for resiliency

resiliency_business_continuity_disaster_tiered_backup_services.png

Cloud Storage features for backup and DR

resiliency_business_continuity_disaster_tiered_backup_services_cloud_Storage.png

Prepare the team for disasters

resiliency_business_continuity_disaster_practice_document_prepare_the_team.png

Scalable & Resilient Design

video

Vertical scaling makes components bigger, but it leaves them as a single point of failure. For example, swapping out a single VM for a VM with larger capacity still means that the VM can fail and potentially crash your service. Horizontal scaling makes services bigger through multiplicity. It leaves the unit capacity the same but increases the pool of units. For example, instead of growing a larger VM, you could move to a design with a pool of multiple smaller VMs. That makes the design not only scalable but also resilient, because if one VM is lost, the others can pick up the workload until a replacement is added. There are several steps you can take to make your design resilient, and they're covered in this lesson:

  1. Health checks to monitor instances

resiliency_design_health_checks.png

  2. Automatically replace instances

resiliency_design_auto_replace_instances.png

  3. Resilient Storage: Cloud Storage, Cloud SQL

resiliency_design_resilient_storage.png

  4. Resilient Network

resiliency_design_resilient_network.png
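
The first two steps, health checks and automatic replacement, can be sketched as a small reconciliation loop over a pool of instances. This is an in-memory illustration only: `Instance`, `new_instance`, and `reconcile` are hypothetical names, and a real managed instance group performs this work for you.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass
class Instance:
    name: str
    healthy: bool = True


_ids = count(1)


def new_instance() -> Instance:
    """Create a fresh (healthy) instance with the next sequential name."""
    return Instance(name=f"vm-{next(_ids)}")


def reconcile(pool: List[Instance], target_size: int,
              is_healthy: Callable[[Instance], bool]) -> List[Instance]:
    """Drop instances that fail the health check, then refill to target_size."""
    survivors = [vm for vm in pool if is_healthy(vm)]
    while len(survivors) < target_size:
        survivors.append(new_instance())
    return survivors


pool = [new_instance() for _ in range(3)]   # vm-1, vm-2, vm-3
pool[1].healthy = False                     # simulate the loss of vm-2
pool = reconcile(pool, target_size=3, is_healthy=lambda vm: vm.healthy)
print([vm.name for vm in pool])             # ['vm-1', 'vm-3', 'vm-4']
```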

Design pattern: General design for scalable & resilient apps

resiliency_design_design_pattern_general_design_for_scalable_resilient_apps.png

  • Handles loss of instance

general_design_for_scalable_resilient_apps_loss_of_instance.png

  • Handles loss of zone

general_design_for_scalable_resilient_apps_loss_of_zone.png

  • Handles loss of database

general_design_for_scalable_resilient_apps_loss_of_database.png

  • Handles full disaster recovery

general_design_for_scalable_resilient_apps_full_disaster_recovery.png

Microservices design for scalable & resilient streaming

resiliency_design_design_pattern_microservices_design.png

12-factor system & application design in GCP

resiliency_design_design_pattern_12-factor_design.png

Processes for simple, iterative, aligned development

resiliency_design_processes.png

resiliency_design_processes_iterate.png

This module covered several related design goals, including:

  • reliability,
  • scalability,
  • and disaster recovery.

The first subject was availability and reliability. A key concept is that planning for failure and dealing with failure in your design leads to improved reliability. Failure can occur due to the loss of a resource or it can occur due to overload. You must be careful when making adjustments to a system, that you don't accidentally create the potential for an overload failure when you're trying to improve resiliency to a loss failure. The second subject was disaster recovery. You learned that planning for disaster and preparing for recovery is key. The third was scalable and resilient design. That brings together many of the design principles you've seen in the previous modules, and shows how they all fit together into a general resilient solution.

Application: Out of Service!

video

Business problem

It's a Design problem, we lost an entire zone!!!

application_photo_Service_crashed.png

The popular and growing photo service suddenly crashed:

  • How do you handle a major outage?
  • What could the cause be? What changes can you make to the design so this problem doesn't take down the entire service again in the future?

application_photo_Service_crashed_problem.png

Have a plan for dealing with a major outage

application_photo_Service_crashed_solution_have_process_in_place.png

Systematic logical troubleshooting

Service loss due to zone outage

application_photo_Service_crashed_troubleshooting.png

Collaboration & communication: Report, Document, build policy

So you want to move these servers to multiple zones.

application_photo_Service_crashed_solution_multiple_zones.png

Break down business logic on the photo service

What about our Service Level Objectives (SLOs) and Indicators (SLIs)

SLOs didn't change again. We just added more redundancy, making the service span multiple zones.

application_photo_Service_crashed_SLOs.png

Design Challenge #4: Redesign for Time

application_photo_Service_crashed_again_SinglePointOfFAilure.png

Need to scale frontend server

application_photo_Service_crashed_again_solution_scale_upload_servers_Across_zones.png

Prevent overload: Make load testing real

application_photo_Service_crashed_again_prevent_overload_with_real_testing.png

Scaling requires breaking state out of the upload server

application_photo_Service_crashed_again_scale_upload_server_stateless.png

What does that look like?

Need to make Frontend servers stateless

application_photo_Service_crashed_again_scale_upload_server_stateless_how.png

Update on SLOs and SLIs

You see what it is that you can measure and then quantify it.

application_photo_Service_crashed_again_new_SLOs.png

Design Challenge: Log aggregation delayed troubleshooting

video

When the photo service crashed, it became evident that troubleshooting was taking too long. The reason for the delay was tracked to log aggregation. The aggregated logs needed for troubleshooting were delayed. Worse, the more problems that are occurring in the system, the more log entries are generated and the longer it takes for the aggregated logs to become available for troubleshooting. Can you redesign the log system to eliminate the bottlenecks? Watch the lesson that describes the problem, then come up with your own solution. When you're ready, continue the lesson to see a sample solution.

design_challenge_redesign_for_time.png

design_challenge_redesign_for_time_problem.png

If you remember the 12-factor design, it says to treat log events as streams, so that should be a clue. The business issue is service resiliency: it's taking too long to troubleshoot service issues. Batch processing of the logs simply does not support live service troubleshooting. It causes delays, so we can't meet our service level objectives or identify and respond to incidents in time. So here's the design challenge: replace the cron batch processing with stream processing. And here's a hint: consider a microservices design.

design_challenge_redesign_for_time_challenge.png

Troubleshooting & solution

design_challenge_redesign_for_time_troubleshooting.png

design_challenge_redesign_for_time_solution.png

Instead of having a cron job, we have now converted it to stream processing with Cloud Pub/Sub.
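
The difference between batch and stream processing can be sketched without any cloud service. In the example below, an in-memory `queue.Queue` stands in for a Pub/Sub topic (it is not the Cloud Pub/Sub client API), and the event shape with a `severity` field is an assumption. The point is that each log entry reaches the consumer as soon as it is published, rather than waiting for the next scheduled batch.

```python
import queue
import threading
from typing import Dict, Iterable


def run_stream(events: Iterable[dict]) -> Dict[str, int]:
    """Publish events to an in-memory topic and aggregate them as they arrive."""
    topic: queue.Queue = queue.Queue()
    counts: Dict[str, int] = {}
    done = object()  # sentinel that marks the end of the stream

    def subscriber() -> None:
        while True:
            event = topic.get()
            if event is done:
                break
            severity = event["severity"]
            counts[severity] = counts.get(severity, 0) + 1

    worker = threading.Thread(target=subscriber)
    worker.start()
    for event in events:
        topic.put(event)  # each entry is available to the consumer immediately
    topic.put(done)
    worker.join()
    return counts


print(run_stream([{"severity": "ERROR"}, {"severity": "INFO"}, {"severity": "ERROR"}]))
# {'ERROR': 2, 'INFO': 1}
```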

Design for Security

video

Implement policies that minimize security risks, such as auditing, separation of duties and least privilege.

Overview

There are 3 kinds of security services built into the Google Cloud Platform:

  1. services that are transparent and automatic, such as encryption of data that occurs automatically when data is transported and when it's at rest.
  2. services that have defaults but that offer methods for customization, such as using your own encryption keys rather than those provided.
  3. services that can be used as part of your security design, but only contribute to security if you choose to use them in your design.

security_overview.png

Cloud Security

video

When you migrate an application to the cloud or develop an application in the cloud, there are security benefits that the application inherits simply because of the host environment. Google already has to protect its applications, and many of the benefits of that security effort are built into the infrastructure itself. You need to know about them so you don't accidentally spend effort duplicating them. You also need to know where those services end and your security design begins so that you don't accidentally leave gaps in your security strategy.

Google's strategy for cloud security: "Pervasive defense in depth"

security_Google_strategy.png

Cloud Networking Security: Defense in Depth

"So the least that we can actually expose to the Internet, all the better"

security_Google_defense_in_depth.png

Network Access Control & Firewalls

video

The first level of security where your design can have a significant impact is network-level access control. For example, if you remove the external IP from an instance in a bastion host design, you eliminate one target that could be attacked. Locking down network access to only what's required is one way to reduce the potential attack surface.
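
The "only what's required" principle amounts to default-deny with an explicit allowlist. The sketch below illustrates that logic with Python's standard `ipaddress` module; it is not a GCP firewall API, and the source range (a reserved documentation range) and the port set are hypothetical.

```python
import ipaddress

# Hypothetical allowlist: one trusted source range, SSH only.
ALLOWED_SOURCES = [ipaddress.ip_network("203.0.113.0/24")]
ALLOWED_PORTS = {22}


def is_allowed(source_ip: str, port: int) -> bool:
    """Default deny: permit traffic only from listed ranges to listed ports."""
    addr = ipaddress.ip_address(source_ip)
    return port in ALLOWED_PORTS and any(addr in net for net in ALLOWED_SOURCES)


print(is_allowed("203.0.113.7", 22))   # True: trusted range, allowed port
print(is_allowed("198.51.100.1", 22))  # False: source not in the allowlist
print(is_allowed("203.0.113.7", 80))   # False: port not opened
```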

Firewall configuration: 1st line of defense for access

security_firewalls.png

Design for securely accessing VMs

security_secure_VMs.png

API access control with Cloud Endpoints

security_API_control_with_Cloud_Endpoint.png

Protections Against Denial of Service

video

Some of the protections against denial-of-service attacks are built into the cloud infrastructure. The network, for example, uses software-defined networking (SDN). Since there are no physical routers and no physical load balancers, there are no actual hardware interfaces that could be overloaded. There are also services that adapt to demand in intelligent ways, and you can use these in your design to afford further protection against overload attacks.

security_DDOS.png

Edge protections against DDoS

security_protection_vs_DDOS.png

security_protection_vs_DDOS_with_network.png

security_infrastructure_protection_vs_DDOS.png

Resource Sharing & Isolation... a compromise

video

Google Cloud Platform provides a rich array of network topology features that offer different blends of separation, isolation, communication, and sharing. The least secure design is one where everything is in a single failure domain and all the parts communicate with and depend directly on one another. There are many ways of separating those parts and providing more private or more tolerant communication channels between them, creating multiple failure domains and therefore better isolation. In submarine design, the parts of the submarine are divided into compartments that can be sealed off from one another. This helps with reliability, because one compartment can flood and be sealed off from the rest. But it also helps with security, because if an attacker gains entry to one compartment, it can be sealed off to limit the damage.

security_resource_sharing_def.png

VPC isolation through public IPs

security_resource_sharing_VPC_isolations_through_public_IPs.png

IP address isolation using VPN tunneling

security_resource_sharing_IP_isolations_through_VPN.png

Cross-project VPC network peering

security_resource_cross_project.png

Cross-organization VPC network peering

security_resource_cross_organization_sharing.png

Shared VPC

security_resource_VPC_sharing.png

Isolation through multiple network interfaces

security_resource_multi_NIC.png

Access GCP services over internal IP

security_resource_GCP_over_internal_IP.png

Data Encryption & Key Management

video

You probably already know that Google automatically encrypts data in motion and data at rest, but that description glosses over some of the details that will help you make design decisions. For example, you can use Google's built-in key management, or you can provide your own keys. Also, for particularly sensitive data, you can add your own encryption methods in addition to those provided.

security_encryption_all_data_in_motion_at_rest_encrypted.png

Server-side encryption

security_encryption.png

Customer-managed encryption keys (CMEK)

security_custom_encryption_keys.png

Customer-supplied encryption keys (CSEK)

security_custom_supplied_encryption_keys.png
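
As a rough illustration of what supplying your own key involves: a customer-supplied key is a 256-bit AES key that you generate and hold yourself, typically passed base64-encoded together with a base64-encoded SHA-256 hash of the key. The helper name `new_csek` and the dictionary field names are made up for this sketch; check the current Cloud Storage and Compute Engine documentation for the exact request format.

```python
import base64
import hashlib
import os
from typing import Dict


def new_csek() -> Dict[str, str]:
    """Generate a random 256-bit key plus the encodings usually sent with it."""
    key = os.urandom(32)  # 32 bytes = 256 bits of key material
    return {
        "key_b64": base64.b64encode(key).decode("ascii"),
        "sha256_b64": base64.b64encode(hashlib.sha256(key).digest()).decode("ascii"),
    }


material = new_csek()
print(len(base64.b64decode(material["key_b64"])))  # 32
```

Because Google does not store a customer-supplied key, losing it means losing access to the data it protects, so key storage and rotation become part of your own design.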

Persistent Disk Encryption with CSEK

security_persistent_disk_CSEK.png

More control over encryption

security_encryption_more_control.png

Design for Security: Identity Access & Auditing

video

Authorization, that is, access to resources, is controlled by Google's Identity and Access Management (IAM) system. You're already using this service to control authorized access. By using the auditing tools available, you can also check for unwanted actions such as attempts at unauthorized access. That can tell you where an attacker's interest is focused, so you can add security measures in those areas.

Identity Access Management

security_IAM.png

Service Accounts

security_Service_Account.png

GCP security auditing with Forseti-security (Open Source)

security_GCP_Security_Auditing_wth_Forseti.png

Cloud Audit Logging

security_Cloud_Audit_Logging.png

External audits & GCP Standards Compliance

security_Standard_Compliance.png

This module covered security from several perspectives, including identity and access management, data encryption and key management, resource sharing and isolation for compartmentalization, protections against denial-of-service attacks, network access control, and the automatic protections built into the platform and services. You learned that some security is inherited from the environment, some is configured with default options that you might want to change, and some security features are optional and can be included in your design if that makes sense for your security strategy.

Application: Photo Service - Intentional Attack

video

There's evidence that hackers are trying to compromise the private information of system users, and maybe try to bring down the service.

That brings up 2 important issues:

  1. How does the system keep users' data private?
  2. How does the system protect against a denial-of-service attack?

Identify the protections already in place that are provided by the platform by default, then consider additional design changes that could provide additional protections.

application_photo_Service_intentional_attack.png

Business problem

application_photo_Service_intentional_attack_business_problem.png

application_photo_Service_intentional_attack_business_problem_architecture.png

Break down business logic on the photo service

Lock down the frontend

application_photo_Service_intentional_attack_business_problem_solution.png

application_photo_Service_intentional_attack_business_process_if_DDoS.png

application_photo_Service_intentional_attack_business_actions_vs_DDoS.png

  1. use Cloud CDN to cache our thumbnails across the world
  2. use Cloud DNS
  3. Implement auto-scaling (instance group)

Can we protect the backend?

application_photo_Service_intentional_attack_backend.png

application_photo_Service_intentional_attack_backend_lockdown_VPC.png

application_photo_Service_intentional_attack_backend_lockdown_Private_network.png

Security checklist

application_photo_Service_intentional_attack_security_checklist.png

Design Challenge #5: Defense in Depth

video

The security measures you're considering implementing in the photo service will make the system much more secure. But wait a moment, the user information and event information may be contained in the logs. If the log system isn't secure, none of the rest of the system is secure. Watch the lesson that describes the problem, then come up with your own solution. When you're ready, continue the lesson to see a sample solution, and remember that the sample solution is not the best possible solution. It's just an example.

design_challenge_security_log_files.png

Problem

design_challenge_security_log_files_problem.png

Possible solution

design_challenge_security_log_files_solution.png

Capacity Planning & Cost Optimization

video

Identify ways to optimize resources and minimize cost.

Overview

Both forecasting future demand on a system and planning the resources for a system depend on non-abstract large-scale design, sometimes called dimensioning.

When you optimize for one factor by changing a resource, there may be other consequences. For example, if you change the VM size to optimize CPU capacity, it's possible that network throughput, memory, and disk capacity could change as a consequence. So you really need to think through all the dimensions that are affected by your design and perform the calculations to ensure there's sufficient capacity for your purposes.

A common mistake is to optimize away resiliency. Remember that overcapacity is sometimes included by design to handle bursty periods, growth, or intentional attacks. Failing to recognize the purpose of excess capacity, and then reducing it to save money, can create opportunities for cascade failures.

dimensioning_capacity_planning_pricing.png

Capacity Planning

video

Capacity planning is an ongoing cyclical process.

There are various common measures such as:

  • VM instance capacity,
  • disk performance,
  • network throughput,
  • and workload estimations.

Ultimately, you need to be able to answer the question, is there sufficient resource with reasonable certainty? One of the key design principles in this course is to allow other factors to influence your design first, and then come back and dimension the design later. You might have to change some of the design for capacity or for better pricing, but at least, you'll make these adjustments knowing what benefits you're trading for reduced cost or better capacity management.

Capacity planning cycle:

dimensioning_cycle.png

1. Forecast

dimensioning_cycle_forecast.png

Forecast estimation

dimensioning_cycle_forecast_estimation.png

Instance overhead estimation

How to NOT overestimate:

dimensioning_cycle_forecast_instance_overhead_estimation.png

Persistent disks estimation

dimensioning_cycle_forecast_presistence_disks_estimation.png

Network capacity estimation

dimensioning_cycle_forecast_network_estimation.png

Workload estimation

dimensioning_cycle_forecast_workload_estimation.png

Perfkit Benchmarker (Open source tool by Google)

dimensioning_cycle_forecast_workload_estimation_PerfkitBEnchmarker.png

Allocate

dimensioning_cycle_allocate.png

Example with rough estimation:

dimensioning_cycle_allocate_example.png

Opportunity for optimization before allocating more resources?

dimensioning_cycle_allocate_opportunity_for_optimization.png

Approve

dimensioning_cycle_approve.png

Deploy

First: Test, test, test...

dimensioning_cycle_deploy.png

dimensioning_balanced_approach_to_dimensioning.png

Pricing

video

Pricing is commonly used in:

  • cost optimization,
  • reducing cost,
  • and also for budgeting.

One feature of Google Cloud Platform is that bulk-use discounting is built in and automatic for many services. In this lesson, you'll learn how design choices can influence price.

For example, you may have distributed an element of your solution over multiple regions to improve reliability. However, that distribution design might result in additional network charges for egress traffic. Is the reliability worth the additional network charges? Pricing estimation that follows capacity planning can help you decide.

Optimize VMs cost

pricing_optimize_VM_cost.png

Optimize Disks cost

pricing_optimize_disks_cost.png

Optimize Network cost

pricing_optimize_network_cost.png

VM to VM in the same zone:

pricing_optimize_network_cost_same_zone_VM-to-VM.png

In this module, you learned about capacity planning, including the planning cycle, and you learned about pricing. The two of them together, capacity and pricing, provide another perspective on design options. You can modify the design for cost optimization or to limit resource usage. One important point is to apply dimensioning to your design after you've considered other functional aspects of the design.

Application: Photo Service - Cost & Capacity

video

Capacity planning for the coming year is complete.

As a final step, you'll look at the VM options and perform non-abstract cost optimization analysis.

Given the growing capacity requirements, which option is the most cost-effective:

  • moving to a bigger-capacity VM,
  • or sticking with the current VM size?

Can we offer the same service with less money?

application_photo_Servicecost_optimization.png

Business problem

application_photo_Service_budget.png

Reviewing the current architecture:

application_photo_Service_review_archtecture.png

application_photo_Service_cost_optimization_problem.png

We want to make a recommendation: should we move to VMs with more CPU cores?

We first need to check cost effectiveness.

application_photo_Service_check_cost_effectiveness.png

Design Challenge #6: Dimensioning

video

The photo application service design is now set to auto scale and grow for the projected doubling of demand in the coming year.

However, that means the log information will also double. The current storage service is Bigtable.

  • Will the additional demand, in both data and traffic, put stress on Bigtable?
  • Will the system need an additional Bigtable node to handle the demand in the coming year?

Watch the lesson that describes the problem then come up with your own solution, and when you're ready, continue the lesson to see a sample solution.

Current layout of our log service:

design_challenge_capacity_planning_growth_logs.png

Growth status for Bigtable

design_challenge_capacity_planning_growth_logs_BigTable.png

What can a Bigtable node handle, in queries per second (QPS) and in throughput (MB/s)?

Our current use of Bigtable:

design_challenge_capacity_planning_BigTable_current_use.png

  • the size of the log payload for each of our workloads (web, app, data): ~552 B
  • estimation of log entries per day: ~300 million entries per day

Our system handles ~154.2 GB/day, i.e. ~55TB/year.
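
The daily and yearly volumes follow directly from the two estimates above. The sketch below redoes the arithmetic; it assumes the quoted "GB" and "TB" are binary units (GiB and TiB), since that is the reading under which the figures reproduce.

```python
ENTRY_BYTES = 552              # average log payload per entry (from the text)
ENTRIES_PER_DAY = 300_000_000  # estimated log entries per day (from the text)
GIB = 2**30
TIB = 2**40

bytes_per_day = ENTRY_BYTES * ENTRIES_PER_DAY
gib_per_day = bytes_per_day / GIB
tib_per_year = bytes_per_day * 365 / TIB

print(round(gib_per_day, 1), round(tib_per_year, 1))  # 154.2 55.0
```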

What would our service look like if the usage doubles in the coming year?

design_challenge_capacity_planning_BigTable_challenge.png

design_challenge_capacity_planning_BigTable_estimate_growing_capacity.png

A Bigtable node can handle up to 10,000 QPS and 10 MB/s of throughput. So, in terms of queries and throughput, the doubled usage of our app can be handled by a single Bigtable node, but we need to consider our storage capacity reaching 110 TB by the end of the second year.

Doing the math, 22 Bigtable nodes using SSD drives won't be sufficient for the doubling forecast for the coming year.
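
The node count can be checked by sizing against each limit and taking the largest result. The QPS and throughput limits below are the ones quoted in the text; the per-node SSD storage limit of 2.5 TB is an assumption inferred from the "22 nodes for about 55 TB" figure, so verify it against current Bigtable quotas before relying on it.

```python
import math

NODE_QPS = 10_000     # per-node query limit quoted in the text
NODE_MB_PER_S = 10    # per-node throughput limit quoted in the text
NODE_SSD_TB = 2.5     # ASSUMPTION: per-node SSD storage limit, inferred from the text


def nodes_required(qps: float, mb_per_s: float, storage_tb: float) -> int:
    """Nodes needed to satisfy the tightest of the three per-node limits."""
    return max(
        math.ceil(qps / NODE_QPS),
        math.ceil(mb_per_s / NODE_MB_PER_S),
        math.ceil(storage_tb / NODE_SSD_TB),
    )


entries_per_s = 300_000_000 / 86_400   # about 3,472 entries per second
mb_per_s = entries_per_s * 552 / 1e6   # about 1.9 MB/s

today = nodes_required(entries_per_s, mb_per_s, storage_tb=55)
doubled = nodes_required(2 * entries_per_s, 2 * mb_per_s, storage_tb=110)
print(today, doubled)  # 22 44
```

Under these assumptions storage, not query rate or throughput, is the dimension that drives the node count.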

design_challenge_capacity_planning_BigTable_estimate_growing_pricing.png

Deployment, Monitoring and Alerting, and Incident Response

video

This module discusses deploying, operating and maintaining your design.

One of the things that's been visited repeatedly during this course, is that for a system to stabilize after implementation, it needs to be surrounded by properly prepared and designed behaviors.

What people do while operating and maintaining the system matters. The SLOs and SLIs you've been evolving through the design process, provide an objective method to manage the solution, to keep it running and on track. However, these same measures and the discipline of iteratively reviewing them, will also help determine when the circumstances have changed, when the assumptions of the original design are no longer true or accurate, and it's time to revisit the design and evolve the system.

This module focuses on the behavioral part of your design.

design_behavior_design_operations.png

Learning Objectives

  • Implement processes that minimize downtime, such as monitoring and alarming, unit and integration testing, production resilience testing, and incident post-mortem analysis.
  • Launch a cloud service from a collection of templates.
  • Configure basic black box monitoring of an application.
  • Create an uptime check to recognize a loss of service.
  • Establish an alerting policy to trigger incident response procedures.
  • Create and configure a dashboard with dynamically updated charts.
  • Test the monitoring and alerting regimen by applying a load to the service.
  • Test the monitoring and alerting regimen by simulating a service outage.

Deployment

video

In this lesson, you'll learn some tips about how to deploy your solution. The advice seems like common sense:

  • make a checklist,
  • automate processes,
  • use an infrastructure orchestration framework.

But don't underestimate the importance of these activities. They're at the core of deploying a stable solution.

  1. Plan your checklist of dependencies for deployment

design_behavior_deployment_plan.png

  2. Launch automation with resilience in mind

design_behavior_deployment_implement_automation.png

Tool of choice: Deployment Manager

  • Configuration
  • Resources
  • Templates

design_behavior_deployment_implement_automation_tool.png
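
To make the configuration/resources/templates split more concrete, here is a minimal sketch of a Python template. Deployment Manager calls a `generate_config(context)` (or `GenerateConfig(context)`) entry point and reads the returned `resources` list. The resource name suffix, the default machine type, the image path and the `FakeContext` helper are illustrative assumptions, not taken from the course, and the property values have not been verified against a live project.

```python
"""Sketch of a Deployment Manager Python template (illustrative values)."""


def generate_config(context):
    """Build the resource list for a single VM.

    `context.properties` holds the values set in the configuration file;
    `context.env['name']` is the name given to this template's resource.
    """
    zone = context.properties['zone']
    # ASSUMPTION: default machine type chosen only for illustration.
    machine_type = context.properties.get('machineType', 'f1-micro')

    instance = {
        'name': context.env['name'] + '-vm',  # hypothetical naming scheme
        'type': 'compute.v1.instance',
        'properties': {
            'zone': zone,
            'machineType': 'zones/{}/machineTypes/{}'.format(zone, machine_type),
            'disks': [{
                'boot': True,
                'autoDelete': True,
                'initializeParams': {
                    # ASSUMPTION: illustrative image family path.
                    'sourceImage': 'projects/debian-cloud/global/images/family/debian-11',
                },
            }],
            'networkInterfaces': [{'network': 'global/networks/default'}],
        },
    }
    return {'resources': [instance]}


class FakeContext:
    """Local stand-in for the context object, for experimentation only."""

    def __init__(self, name, properties):
        self.env = {'name': name}
        self.properties = properties


if __name__ == "__main__":
    config = generate_config(FakeContext('logbook', {'zone': 'us-central1-f'}))
    print(config['resources'][0]['name'])
```

Keeping the values that change per deployment in the configuration, and the structure in the template, is what makes the same template reusable across environments.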

Monitoring & Alerting

video

This lesson covers:

  • the concepts of monitoring and alerting,
  • how these concepts apply to the Stackdriver service, including the kinds of monitoring that can be configured,
  • how to set up an alert and notification,
  • how to create a dashboard with charts to help visualize the running system.

SRE pyramid: Monitoring is measuring

(https://landing.google.com/sre/books/)

design_behavior_monitoring_is_measuring.png

Push-based and Pull-based metrics

design_behavior_monitoring_push-based_pull-based_metrics.png

Black box monitoring (affecting user experience)

design_behavior_monitoring_blackbox.png

White box monitoring (monitoring services)

design_behavior_monitoring_whitebox.png

Choose the output of monitoring systems carefully:

  • Alerts: a human must take action immediately
  • Tickets: a human must take action, but the situation isn't yet urgent
  • Logging: diagnostic information only

design_behavior_monitoring_output_carefuly.png
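
The three outputs listed above can be read as a simple decision rule. The sketch below is only an illustration of that rule; the `Output` enum and the `classify` function are hypothetical names, not part of any monitoring product.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must take action immediately
    TICKET = "ticket"  # a human must take action, but it isn't yet urgent
    LOG = "log"        # diagnostic information only


def classify(needs_human_action: bool, urgent: bool) -> Output:
    """Map a monitoring signal to one of the three outputs."""
    if not needs_human_action:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET


if __name__ == "__main__":
    print(classify(needs_human_action=True, urgent=False))
```

Anything that does not require a person to act stays in the logs, which keeps alerts reserved for situations that truly need an immediate response.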

12-factor administration & operation in GCP: Stackdriver

design_behavior_monitoring_on_GCP.png

design_behavior_Stackdriver_unified_tool.png

design_behavior_Stackdriver_built_for_AWS.png

Designate a specific Stackdriver account to make use of its services!

Some features of Stackdriver

Uptime (health) check details

design_behavior_Stackdriver_uptime_check.png

Create alerts (conditions, notifications, documentation)

design_behavior_Stackdriver_create_alerts.png
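
An alerting condition typically combines a threshold with a duration, so that a brief spike does not page anyone. The following sketch shows that idea in isolation. It does not use the Stackdriver API; the `Condition` class, the `violated` function and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Condition:
    threshold: float  # metric value above which the condition is violated
    duration_s: int   # how long the violation must last before alerting


def violated(condition: Condition, samples: List[Tuple[int, float]]) -> bool:
    """Return True if the metric has stayed above the threshold for at
    least `duration_s`, measured over the streak ending at the latest sample.

    `samples` is a list of (timestamp_seconds, value) sorted by time.
    """
    if not samples or samples[-1][1] <= condition.threshold:
        return False
    latest_ts = samples[-1][0]
    streak_start = latest_ts
    # Walk backwards to find where the current streak of violations began.
    for ts, value in reversed(samples):
        if value <= condition.threshold:
            break
        streak_start = ts
    return latest_ts - streak_start >= condition.duration_s


if __name__ == "__main__":
    cpu_high = Condition(threshold=0.8, duration_s=60)
    print(violated(cpu_high, [(0, 0.5), (30, 0.9), (60, 0.95), (90, 0.9)]))
```

The notification channels and the documentation attached to the alert are separate concerns: the condition decides when to fire, the documentation tells the responder what to do.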

Dashboards

design_behavior_Stackdriver_create_dashboards.png

design_behavior_Stackdriver_create_dashboards_dynamic.png

Logging Agents can be installed to capture all types of logs from other tiers too.

design_behavior_Stackdriver_install_log_agents_also_for_tiers_products.png

Incident Response

video

Incident response is the human behavior that results in system stability when things don't go as planned.

The Site Reliability Engineering or SRE model is introduced (see references section). You've actually been learning best practices throughout this course that relate directly to the layers of the SRE model. By developing your design with reliability in mind, you've established processes for operating, maintaining, and recovering the system in the event that things start to go sideways. In this lesson you'll review the items that were discussed in detail earlier in the class to prepare for successful incident response. Then we'll discuss a few final steps, such as developing playbooks to implement the response strategy.

incident_response_user_trust.png

Structured incident response

incident_response_structure.png

It includes:

  • Monitoring dashboards
  • Alerting regimen
  • Plans & Tools for responding to issues

incident_response_structure_details.png

incident_response_structure_SRE_pyramid.png

List of SRE processes

list_processes.png

12-factor guidelines on administration and management tasks

incident_response_12-factor_Admin_Mgmt_guidelines.png

Build a playbook based on alerts

incident_response_alerts_and_processes.png

Create "easy buttons" for quick fixes

incident_response_use_microservices_and_APIs.png

Balance interrupt-driven work and Incident Response

Controlled burns vs. firefighting, again.

incident_response_balance_project_driven_incident_response.png

This module covered deploying, operating, and maintaining your design. Much of the groundwork needed for successful deployment, monitoring, and incident response was established earlier in the course in the context of the design process. This module points out how to integrate all those elements to promote the behaviors that will lead to a stable service.

Application: Stabilization & Operation

video

The photo service has evolved into a sophisticated, scalable, reliable, and secure system.

The goal is to stabilize the system and make it maintainable and operable.

  • What elements of the service should be monitored?
  • What kinds of alerts and notifications would you set up?

application_photo_Service_stabilization_operation.png

application_photo_Service_current_architecture.png

application_photo_Service_what_to_monitor.png

Design Challenge #7: Monitoring & Alerting

video

Monitoring and alerting are what you add to the logging system to stabilize the solution.

application_photo_Service_logging_current_architecture.png

Business Challenge

What do you think is important to monitor, and under what conditions should alerts and notifications be sent?

application_photo_Service_logging_business_challenge.png

So in this case, make a list of three monitoring and alerting items that you would put into place to support stabilizing this particular solution.

What monitoring and alerting to set up for the logs

application_photo_Service_logging_solution_logs_watchdogs.png

  • monitor that the latest data in BigTable isn't older than a few minutes > white box
  • monitor the queue mechanisms in Pub/Sub for all 3 feeds (web, app, data) > black box
  • monitor network latency and network uptime for all 3 feeds > black box
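
The first item in the list above, a data freshness check, can be sketched as follows. This is a hedged illustration only: the five-minute limit is an assumption standing in for "a few minutes", and `is_stale` is a hypothetical helper. In practice the latest timestamp would come from the most recent row written to the log store.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: "a few minutes" is taken to mean five minutes here.
MAX_AGE = timedelta(minutes=5)


def is_stale(latest_record_time: datetime, now: datetime,
             max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the newest record is older than the allowed age."""
    return now - latest_record_time > max_age


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    latest = now - timedelta(minutes=7)
    # A stale result would feed an alerting condition rather than just a log line.
    print(is_stale(latest, now))
```

A stale result suggests that something upstream of the store (the feeds, the queue, or the ingestion step) has stopped delivering, which makes it a useful single signal for the whole pipeline.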

Google's reference architectures online

application_photo_Service_logging_solution_logs_Google_tutorials.png

Lab: Deployment Manager - Full Production

In this final lab, you'll clone a public repository of Deployment Manager templates. The public repo is a library of templates that are provided for a variety of purposes.

They provide a flexible base of templates that you can build on to create your own deployment solutions. There are several tutorials available in the online documentation that use the templates in the repo. This lab is based on one of the advanced tutorials. It employs many of the best practices and design principles you've learned in this class. It creates a scalable, resilient full production service around a simple logbook application.

The lab goes beyond the tutorial by adding monitoring and testing. You'll use Stackdriver to configure monitoring and alert notifications, and to set up graphical dashboards.

  • You'll use Apache Bench to generate load traffic to test the system and trigger auto scaling.
  • You'll also simulate a service outage to test notifications and resiliency features.

During the previous labs in this course, you learned a lot about the basic use of Deployment Manager. In this final lab, you'll clone a public repo of example Deployment Manager templates that you can use as a reference for developing advanced deployments. The previous labs all used YAML templates and Jinja2 templates. This final lab uses Python templates. You'll deploy a full production application that implements many of the principles that were discussed and applied during the class.

lab_full_production.png

lab_full_production_architecture.png

Resources/Articles