
Reliable Cloud Infrastructure: Design and Process

Content

Detailed Content

Course Intro

While researching the materials for this class, we learned that expert cloud architects often use complex assemblies in their solutions. The assembly changes many aspects of the design at once, and it's very hard to learn design skills from these complex examples.

A good analogy is an expert chef.

The chef is preparing a dish and adds a bowl of something. You ask what it is, and the chef responds, "Well, that's just a half dozen herbs and spices I like to use." You can't duplicate the recipe if you don't know what the ingredients are and their proportions. So in this class, we've broken down the process of the experts into clear, understandable steps. We've tried to show all the ingredients.

The class uses a tiered design model to organize the learning. And along the way, you'll have the opportunity to apply what you learned to example designs.

Speaking of examples, the example solutions in the class exercises are intended to get you thinking about design. They are adequate solutions, not perfect ones. You'll probably think of ways to improve on them and come up with your own, better designs. That's exactly what we want you to do: to learn to develop your own designs.

design_and_process.png

Objectives

video

  • GCP design skills
  • Process knowledge and skills
  • Prepare you to continue learning by exposing you to concepts and models (different SRE concepts & best practices)
  • Introduce Deployment Manager as a Cloud Architect tool (template & design)

Learned in the labs:

course_6_design_and_process_list_labs.png

Modules covered

  1. Defining the service (What is that exact process?)
  2. Business logic layer design (This can also be done on paper before deploying resources)
  3. Data layer design (identify the data layer, help identify which services we may utilize)
  4. Presentation layer design (identify the presentation layer, help identify which services we may utilize)
  5. Design for resiliency, scalability and Disaster recovery (DR) (Build it stronger)
  6. Design for security (Build it stronger)
  7. Capacity planning and cost optimization (How much will this design cost? Help design an acceptable budget)
  8. Deployment, monitoring and alerting, and incident response (Key if you want an automated deployment, and to understand how we respond to incidents)

4 parts of this class

video

  1. Introduction to architecting concepts and principles.

    • Defining the Service
  2. The application of those principles to a real world design.

    You'll begin with an application on a single virtual machine, and through the modules, you'll evolve the design into a sophisticated scalable, resilient, stable, secure solution.

  3. A related application problem.

    You'll be presented with the problem and challenge to come up with your own design, to think through the design challenge on your own, after, you can compare your design to a standard solution.

  4. Hands on labs with Google Deployment Manager.

    Deployment Manager is a template-based infrastructure tool. It enables you to produce identical environments for development, test, and production, and it's useful for documenting and testing variations in your designs. So you'll be learning Deployment Manager as a cloud architect tool.

Defining the Service

A step-by-step process so you can learn how to think about cloud design.

Layered and iterative approach to design used in this training

Each module expands on the previous design, exploring it from a new perspective, evolving it into a scalable, reliable, stable, secure design

After introducing a layered model, the module discusses three concepts that are critical to design and to cloud architecture. These are:

  • state,
  • measurement,
  • and requirements.

Design_process_rough_design.png

  1. Design the service... roughly (we will iterate on it) and involve the team going through that design process.

  2. then continue with a more structured design approach... iterate!

    • is it handling different scenarios?
    • do we have the proper budget?
    • ...
  3. Define measurables

    • define measurable indicators that will allow us to determine
    • whether we are succeeding at the goals we've set out
    • what exactly are our service level objectives?
    • and how do we create service level indicators to measure them?
  4. Then we move to our three-tier architecture:

    • Presentation layer (Networking)
    • Business logic layer (Compute)
    • Data layer (Storage)

    Design_process_3-tier_architecture.png

  5. Resiliency, scalability and disaster recovery

    Design_process_understand_disaster_resiliency.png

  6. Security

    Design_process_security.png

  7. Budget, Capacity planning and cost optimization

    Design_process_budget.png

  8. Deployment, monitoring, alerting and incident response

    Design_process_deployment_types.png

Recency bias

(remember to assess new options instead of just re-using the old)

Design_process_Recency_assess.png

Design_process_summary.png

Defining the Service: State and Solution

video

State refers to any action in the system that depends on the memory of a preceding action. And the data that must be remembered is called state information.

For example, compare two sets of directions:

  1. The first one says, walk south for one block and take the first right.
  2. The second one says, turn right one block after you've passed the market.

The second instruction requires you to remember that you've passed the market, so it's stateful. Whereas the first one does not require you to remember anything, so it's stateless. Whether a system is stateful or stateless has a lot of influence on design. And in particular, you have to decide where to store state information, how to retrieve it, and what to do if it's lost.
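The directions analogy maps directly onto code. Here's a minimal Python sketch (the names are invented for illustration, not from the course) contrasting a handler that must remember preceding requests with one that receives everything it needs:

```python
# Hypothetical sketch (names invented for illustration): the same "counter"
# service written stateful and stateless.

class StatefulCounter:
    """Stateful: the server must remember each preceding request."""
    def __init__(self):
        self.count = 0  # state information lives in this server's memory

    def handle(self):
        self.count += 1
        return self.count

def handle_stateless(count_so_far: int) -> int:
    """Stateless: everything needed arrives with the request, so any
    replica behind a load balancer can serve it."""
    return count_so_far + 1

c = StatefulCounter()
assert c.handle() == 1 and c.handle() == 2  # only THIS server knows the count
assert handle_stateless(41) == 42           # any server gives the same answer
```

If the stateful server is lost, the count is lost with it; the stateless version pushes that decision (where to store the count, how to retrieve it) onto the caller or a backend.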

Design_process_statefull_stateless.png

Analogy with kitchen:

  • you could build everything yourself (the protein, the vegetables, the sauce, ...). That gives you more control by centralizing everything and allows you to foresee a problem before it gets to the end state. The problem is scaling.
  • in a stateless approach, it's easier to handle scaling and distribution.

Design_process_statefull_stateless_cook_analogy.png

So what are your states? What kind of information should be stored? Where can we store them? How can we manage them?

The best state is NO state.

Design_process_where_storing_information.png

How to deal with state?

Design_process_deal_with_states.png

Avoid hotspots. Instead, push state to a backend.

Design_process_states_Avoid_hotpsots.png

Design_process_states_Avoid_hotpsots_push_to_backend.png

You could use a cache, but that might become overloaded; to overcome that, distribute the state across several backends with load balancing.
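One hedged sketch of how state can be spread across backends: hash the session ID to pick a backend store, so the same session always reaches the same place while load spreads across the pool (the backend names are hypothetical):

```python
import hashlib

# Hedged sketch (backend names are hypothetical): route each session's state
# to one of several backend stores by hashing the session ID, instead of
# keeping all state in a single cache that can become overloaded.
BACKENDS = ["backend-0", "backend-1", "backend-2"]

def backend_for(session_id: str) -> str:
    # A stable hash means the same session always maps to the same backend.
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

assert backend_for("user-123") == backend_for("user-123")  # deterministic
assert backend_for("user-123") in BACKENDS
```

A load balancer or client library applying a scheme like this keeps each session's state in one place without any single node holding all of it.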

Design_process_states_distribute_backend_instead_of_cache.png

Google's common design

This design comes out of the design process and fits many solutions:

Frontend:

  • Cloud DNS
  • HTTPS load balancer in front of preemptible machines, with state stored in the backend.

Backend:

  • the backend is also stateless, also made of preemptible machines, pushing data to a "static stateful cluster". This stateful cluster is sharded and replicated.

Design_process_states_design_fitting_many_solutions.png

Defining the Service: Measurement

video

Measurement is the key to creating a stable and manageable system.

The Cloud Architect doesn't just design and implement a solution. They are usually required to stick around for some time after the implementation to stabilize the system and make sure it can be operated and maintained. If you design from the beginning, identifying indicators to be measured and objectives to compare them against, you'll have the information needed for the stabilization processes that surround the system.

How should you measure your service, and what terms do we use to describe those measurements?

As the designer of your service, it's up to you to define how your service should look and feel to your users. It's equally important to establish metrics that tell your users what they can expect from your service in terms of reliability and performance. We'll call this the service level.

Design_process_Service_level.png

To define a service level, you need to identify what the users care about.

  1. Service Level Indicators (SLI): Our first step is to identify the Service Level Indicators or SLIs. For example, the availability of a web page or how fast the system responds when a user clicks a button.
  2. Service Level Objectives (SLO): Next, we determine the threshold of pain for each of these SLIs, meaning the point at which users will decide that the service is no longer working properly. We'll call this the Service Level Objective or SLO.
  3. Service Level Agreements (SLA): Finally, some services are so critical that a formal business agreement will exist between the provider and the users to guarantee a specific behavior and to compensate the users if the service fails to meet this expectation.

Design_process_Service_level_agreements.png

Chances are that you've seen terms like Service Level Agreement or SLA. Perhaps you've even wondered how these figures are determined or, more importantly, how they're monitored. Service levels define how a given service or product should behave. A number of factors go into service levels, ranging from availability specifications, such as 99.99 percent uptime, to performance specifications, such as a maximum response time of one second. Service levels help set the user's expectations and help keep those expectations in line with the design of the service.

Service Level Indicators (SLI)

SLIs should be directly observable and measurable by the users. SLIs should not be internal system metrics such as server CPU utilization or the number of server failures. This is done for two reasons:

  1. the users can't directly measure those internal factors because they are outside of the system.
  2. utilization isn't really an effective measure of the usability of a system. Instead, utilization is a metric best used by autoscalers to maintain a consistent user experience

Design_process_Example_SLI.png

Example SLI:

  • users know the service functions well because they do not receive errors
  • they know it performs well because the latency between services is very quick
  • they have no idea of the load of CPUs, and they don't care!

Design_process_Example_SLI_example.png

To provide an accurate measurement of the latency a user would experience, we need to monitor additional systems. In this case, we need to know the true latency of our application, so we must account for the latency of the load balancer and of the data server. Note that we don't have visibility into the user's Internet connection. So, our SLI will be the end-to-end latency from the front-end web load balancer, through our service, and back to the data server.
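As a rough illustration, an end-to-end latency SLI is typically collected as a percentile over many requests rather than a single measurement. A small Python sketch, where the simulated request stands in for a real instrumented call path:

```python
import random
import statistics
import time

# Illustrative sketch: an SLI is measured where the user experiences it,
# e.g. end-to-end request latency collected as a percentile over many
# requests. simulate_request() stands in for a real instrumented call path.
def simulate_request() -> float:
    start = time.monotonic()
    time.sleep(random.uniform(0.001, 0.005))  # pretend: LB + app + data server
    return time.monotonic() - start

samples = [simulate_request() for _ in range(50)]
p95 = statistics.quantiles(samples, n=20)[-1]  # 95th-percentile latency SLI
print(f"p95 latency: {p95 * 1000:.1f} ms")
```

Reporting a high percentile rather than an average matters: the average hides the slow tail that users actually complain about.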

Design_process_Example_SLI_additional_metrics.png

Service Level Objectives (SLO)

Once we've identified what the users care about, we need to quantify those thresholds of pain. This is represented as the Service Level Objective or SLO. The SLO is a threshold value for an SLI, and it should be set at the lowest or poorest level of service where the users will still consider the service to be in good working order. That is to say, the SLO represents the point at which a user would consider opening a support ticket because the service failed to meet their expectations.

Design_process_Example_SLO.png

Determining an SLO:

Design_process_Determining_SLO.png

When designing your SLOs, it's important to put reality above utopia. While we'd love to have a service that is up 100 percent of the time, that also means the service can never be updated. Google has adopted an error budget methodology. Instead of assuming that every service will run perfectly, services are given error budgets for their SLOs that operate a bit like spending money. Each service starts with a set amount of errors that are considered normal and expected. Any errors in excess of this budget are considered outages. So, the goal is to perform all system maintenance and updates within this budget. As the days go by each month, the error budget is replenished, giving the development and operations teams the breathing room needed to implement changes and apply updates.
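The error-budget arithmetic is straightforward: the budget is the fraction of time the SLO allows you to be out. A quick sketch of the calculation:

```python
# Sketch of the error-budget arithmetic described above: an availability SLO
# over a 30-day month leaves a fixed budget of allowed downtime.
def error_budget_minutes(slo: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

assert round(error_budget_minutes(0.999), 1) == 43.2   # 99.9%  -> ~43 min/month
assert round(error_budget_minutes(0.9999), 2) == 4.32  # 99.99% -> ~4 min/month
```

Each extra nine cuts the budget by a factor of ten, which is why "just add another nine" is an expensive request.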

Design_process_Reality_in_service_levels.png

It's important to ensure that all SLOs are based on the user experience rather than an internal system metric. For example, if a service becomes unavailable at 03:00 AM but no one notices, perhaps the SLO shouldn't be based on a 24-hour day.

Design_process_Determining_SLO_alerts_on_UX.png

Service Level Agreements (SLA)

Going one step further, we have SLAs. As previously mentioned, some services are so critical that the loss of the service could represent loss of money for the customer, or even loss of life in extreme situations such as autonomous cars or air traffic control. These contracts define a level of service that is lower or poorer than what is defined in the SLO. The key difference here is that the SLO is a soft target used by the owner of the service to set expectations, while the SLA is a business contract that grants a user compensation if the service falls below a certain threshold. Therefore, development and operations teams strive to maintain the SLO, which inherently meets the SLA figures.

Design_process_Example_SLI_SLO_SLA.png

An SLA is not the minimum point at which a service is considered usable. Some services are so mission-critical that the service provider must provide a written guarantee that the service will perform above a certain specification. That specification is the SLA, and breaking the SLA grants compensation from the provider to the users. Keep in mind, not all services have SLAs, but all services should have an SLO. It's also possible that providers might not publicly disclose their SLOs for fear that users might incorrectly associate the SLO with an SLA.

Design_process_Example_SLA.png

User Persona

If you don't know your users, how can you establish what they find important and acceptable? The answer comes from a marketing concept used when developing user interfaces: the user persona. A user persona is an abstract representation of how a certain set of users will use your service. For instance, the power user, the casual user, the road warrior, the impatient user. User personas can help you determine the SLIs and SLOs for your service by understanding how each group will actually use it.

Design_process_User_persona.png

Defining the Service: Gathering Requirements

video

Requirements are items that are important to the business: business goals. Requirements are related to measurement. For example, one requirement might be that the service is responsive to users. That requirement is embodied in an objective, such as no more than half a second of latency in responding to customer requests. And the objective is supported by a measurable indicator, such as the round-trip time between the user request and delivery of a result. This lesson covers gathering requirements, and things to consider and discuss that will help identify requirements.
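The requirement → objective → indicator chain above can be sketched as a simple check (the half-second figure comes from the example; the names are illustrative):

```python
# The requirement -> objective -> indicator chain from the paragraph above,
# sketched as a check (names are illustrative):
OBJECTIVE_SECONDS = 0.5  # objective: no more than half a second of latency

def meets_objective(round_trip_seconds: float) -> bool:
    # the indicator is the measured round-trip time per request
    return round_trip_seconds <= OBJECTIVE_SECONDS

assert meets_objective(0.32)       # within the objective
assert not meets_objective(0.75)   # would count against the service level
```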

Just ask questions: who, what...
  • Who's the main character?
  • What do they do?
  • What do they look like?
  • Are they old?
  • Do they have problems?
  • ...

Design_process_questions_for_gathering_requirements.png

Specify constraints: time, data, users

Design_process_questions_for_time_data.png

Scaling requirements: backend, frontend

Design_process_questions_scaling_requirements.png

Size requirements

Design_process_questions_size_requirements.png

This can limit your choice of tools and the type of implementation you will have to build.

Application: Introducing an example Photo Application service

video

This lesson introduces the Thumbnail photo service, an example application whose design will evolve throughout the class. This lesson focuses on applying the principles learned in the previous lessons in this module. In this case, we're going to introduce a very basic photo service.

app_photo_service.png

Gather requirements

  • users
  • speed
  • resources
  • scale
  • size
  • availability

app_photo_service_gather_requirements.png

Business logic

app_photo_service_gather_busines_logic.png

From logic, define Services Levels (SLIs, SLOs)

app_photo_service_SLO.png

Process: test before & during launch

app_photo_service_tests.png

Pre-production tests:

  • unit tests
  • integration tests
  • system tests
  • stress test

Production tests:

  • Rollout (staged) with users
  • acceptance, A/B testing
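As a small illustration of the pre-production stages, a unit test exercises one function in isolation before integration, system, and stress tests run against the assembled service. The function under test here is a hypothetical helper, not actual course code:

```python
# Illustrative pre-production unit test; thumbnail_name is a hypothetical
# helper from the photo service, not actual course code.
def thumbnail_name(filename: str) -> str:
    stem, _, ext = filename.rpartition(".")
    return f"{stem}_thumb.{ext}"

def test_appends_suffix():
    # unit test: the smallest stage of the pre-production pipeline
    assert thumbnail_name("photo.jpg") == "photo_thumb.jpg"

test_appends_suffix()
```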

Lab: use Deployment Manager to create a VM

  • video

  • understand YAML templates syntax

  • deploy a single appserver with Deployment Manager
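Deployment Manager templates can be written in Python as well as YAML/Jinja2: a Python template module exposes a `GenerateConfig(context)` function that returns the resources to create. A minimal sketch, where the zone, machine type, and image are illustrative choices, not the lab's actual values:

```python
# Sketch of a Deployment Manager template in its Python template format
# (templates can also be YAML/Jinja2). Deployment Manager imports the module
# and calls GenerateConfig(context); zone, machine type, and image below are
# illustrative choices, not the lab's actual values.
def GenerateConfig(context):
    return {
        "resources": [{
            "name": "appserver",
            "type": "compute.v1.instance",
            "properties": {
                "zone": "us-central1-a",
                "machineType": "zones/us-central1-a/machineTypes/f1-micro",
                "disks": [{
                    "boot": True,
                    "autoDelete": True,
                    "initializeParams": {
                        "sourceImage":
                            "projects/debian-cloud/global/images/family/debian-11",
                    },
                }],
                "networkInterfaces": [{"network": "global/networks/default"}],
            },
        }],
    }
```

Because the template is plain code returning a dict, the same file can be checked into version control and reused to stamp out identical dev, test, and production environments.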

Business-logic Layer design

Business-logic Layer: Design Overview

video

What is business logic?

In computer science, it's the code that implements business rules that determines what happens to data. In other words, processing. I think of business logic in this way.

Consider the transaction of buying an airline ticket. I could buy the ticket at the counter. I could buy the ticket through a travel agent. I could buy the ticket from a kiosk machine at the airport. I could buy the ticket online using a web browser, or I could buy the ticket using an app on my phone. All of these are different front ends or interfaces, but the transaction itself, of purchasing the ticket, remains the same no matter what interface is used. And that, to me, is the business logic. By definition, business logic operates on data. And that means executing code, processing. So for this reason, the business logic layer is also the part of the design process where you'll consider cloud processing options and determine what service the business logic code will use when it needs to run.

Microservices are a specific kind of Service Oriented Architecture, or SOA, that leverages small, stateless processing for improved scalability and resiliency.

Microservices design is a popular approach to applications, and this lesson explores how microservices work and the support the Google Cloud Platform provides for microservices.

In computer software, business logic or domain logic is part of the program that encodes the real-world business rules that determine how data can be created, stored, and changed.

In this module, we're going to cover:

  • microservice architecture,
  • how GCP supports 12-factor,
  • mapping computing needs to platform products,
  • compute system provisioning.

We're also going to explore our photo service. We're going to have our first issue: the photo service is slow, so we're going to have to identify the cause and understand exactly how to redesign for it. We'll also have a new design challenge, which has to do with log aggregation, a result of our photo service. This will be followed up by another lab: the GCP Deployment Manager Package and Deploy lab.

So now we're going to package:

  • learn how to package our applications
  • and deploy those packages, versus simply deploying applications from source.

Business-logic Layer: Microservice Architecture

video

Microservices are a specific kind of service-oriented architecture, or SOA, that leverages small, stateless processing for improved scalability and resiliency.

Microservices design is a popular approach to applications, and this lesson explores how microservices work and the support the Google Cloud Platform provides for them.

Design_process_business_logic_microservices.png

Benefits of microservices

Design_process_business_logic_microservices_benefits_drawbacks.png

How do microservices complicate business logic?

Design_process_business_logic_microservices_how_much_more_complex.png

When does using microservices make sense?

Design_process_business_logic_microservices_use_when_make_sense.png

Google's offering for microservices

Cloud Functions

Design_process_business_logic_microservices_cloud_functions.png
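For orientation, an HTTP-triggered Cloud Function in the Python runtime is just an entry-point function that receives a request object (Flask-style) and returns a response. A minimal hedged sketch; the function name and message are made up:

```python
# Minimal sketch of an HTTP-triggered Cloud Function in the Python runtime:
# the entry point receives a Flask request object and returns a response.
# The function name and message are made up for illustration.
def thumbnail_status(request):
    args = getattr(request, "args", None) or {}
    name = args.get("name", "world")
    return f"Hello, {name}!"
```

This is the microservices idea in miniature: small, stateless, invoked on demand, with scaling handled entirely by the platform.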

Cloud Functions example: translate text in an image

Design_process_business_logic_microservices_cloud_functions_example.png

Cloud Functions example: implementation on GAE & limitations

Design_process_business_logic_microservices_cloud_functions_example_in_GAE.png

GCP 12-factor support

https://12factor.net/

Design_process_12-factor.png

GCP development tools matching 12-factor methodology

Design_process_12-factor_GCP_tools.png

Design_process_12-factor_GCP_tools_2.png

Storing states in your environment

Design_process_12-factor_GCP_tools_storing_states.png

Mapping computing needs to platform products

Design_process_12-factor_where_to_get_CPUs.png

Think of App Engine first, then Kubernetes if you need more control over scaling, or Compute Engine as a last resort.

Design_process_12-factor_think_about_appengine_first.png

Design_process_12-factor_think_about_appengine_code_first.png

Design_process_12-factor_think_about_k8s_second.png

Design_process_12-factor_think_about_ComputeEngine_third.png

Compute System Provisioning

Design_process_compute_system_provisioning.png

Prefer & plan for horizontal scaling

Design_process_compute_system_provisioning_how_to_grow.png

Pros/Cons of horizontal scaling

Design_process_compute_system_provisioning_prefer_horizontal_scaling.png

Horizontal scaling design

Design_process_compute_system_provisioning_horizontal_scaling_design.png

Horizontal scaling tradeoffs: latency, capacity, scalability, cost

Design_process_compute_system_provisioning_horizontal_scaling_tradeoffs.png

Design first, dimension later

Design_process_compute_system_provisioning_design_first_dimension_later.png

Application: The photo service is slow!!!

video

Business problem

  • the service has been slowing down
  • what's the cause of this slow service?
  • What can be done about it?
  • What opportunities does this offer for improving the design?

app_photo_service_slow.png

app_photo_service_problem.png

Systematic logical troubleshooting

app_photo_service_problem_troubleshooting.png

Collaboration & communication

app_photo_service_problem_troubleshooting_collaboration.png

Break down business logic on the photo service

app_photo_service_slow_breakdown.png

Identify the attributes of the different services?

app_photo_service_slow_breakdown_identify_attributes_Services.png

Segregate services for better performance and scalability

app_photo_service_slow_segregates_Services.png

What about our Service Level Objectives (SLOs) and Indicators (SLIs)

app_photo_service_slow_new_SLO_SLI.png

Design challenge #1: Log aggregation

This lesson introduces a related independent design problem, the log files.

On the original single-virtual-machine solution, all of the log files were stored on the instance. As the design changes, new complications are introduced into the design of a log system. To meet troubleshooting requirements, the separate logs will need to be aggregated onto a single system. Watch the lesson that describes the problem and then come up with your own solution. When you're ready, continue the lesson to see a sample solution. Remember that the sample solution is not the best possible solution; it's just an example. Your design might be better.

challenge_log_files.png

challenge_log_files_segregated.png

What that means is, we have two different log entries now: one for the web and one for the application. They do share a common session ID field, so we need a process that will join them together for troubleshooting.

We have logs on two servers, and we're going to aggregate them into a single log. So, the solution here is to design a logging server. This server will accept logs, perhaps using syslog, fluentd, or something else like Kafka. In this case, let's say we create a Python script and a daily cron job that takes all these log files and aggregates them. Its purpose is to ingest that data: open the log files and pull them together. Then we need to transform them: even though they're all in one file, we transform them and join them on that session ID. Then we output that data, and now we have aggregated log files. The daily cron job then repeats itself, so every 24 hours we'll have combined logs.
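The ingest-transform-output step described above could be sketched like this in Python (the entry shapes and field names are assumptions, not the course's actual log schema):

```python
from collections import defaultdict

# Sketch of the ingest-transform-output step described above. The entry
# shapes and field names are assumptions, not the course's actual log schema.
def join_logs(web_entries, app_entries):
    merged = defaultdict(dict)
    for entry in web_entries:                      # ingest web server log
        merged[entry["session_id"]]["web"] = entry
    for entry in app_entries:                      # ingest application log
        merged[entry["session_id"]]["app"] = entry
    return dict(merged)                            # output: joined on session ID

web = [{"session_id": "s1", "path": "/upload"}]
app = [{"session_id": "s1", "thumb_ms": 87}]
combined = join_logs(web, app)
assert combined["s1"]["app"]["thumb_ms"] == 87
```

A cron job wrapping a function like this would read the two servers' files, join on the shared session ID, and write the combined log out once a day.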

challenge_log_files_segregated_implement_logging_server.png

New business logic:

challenge_log_files_segregated_implement_logging_server_new_business_logic.png

challenge_log_files_slow_new_business_logic_output_numbers_for design.png

Lab: use Deployment Manager: Package and Deploy

In this lab, you'll learn:

  • how to customize an instance using the deployment manager template.
  • how to install library packages and software required of an application.
  • how to package and install a Python application, the Echo application, that will be installed at boot time and will run when the instance starts up.

The application development team is supposed to provide an early version of the photo application for you to start building the infrastructure, but they're late. Instead, they provide a similar Python application, called Echo for you to use until the real app is ready. Note that you'll not need to do any programming for this lab. The application is already written. You'll just be creating and configuring the templates. Your challenge in this lab is to use deployment manager templates to create the app server. You'll need to package and install the pre-written Python application. You'll need to install the required application frameworks that the application depends on. And you'll need to handle environment metadata to customize and initialize the server during the boot process.

Qwiklabs – Deployment Manager Package and Deploy

In this lab you will...

  • Develop a service using a pre-written Python application called "Echo" and example Deployment Manager templates written in YAML and Jinja2.
  • Create a deployment package suitable for Deployment Manager using the Python package manager, pip.
  • Stage your package in a Cloud Storage bucket.
  • Manually test the application to ensure that it is working properly.
  • Use Deployment Manager to deploy the Echo service.
  • Test the new service.

Data Layer Design

video

The data layer covers the storage and retrieval of data, including the mechanisms of storage and retrieval, such as databases and file systems, and the access methods that use those services, such as SQL and APIs. However, what's not covered in this module is the transport of data into, around, and out of the system. The networking transport of data is covered in the presentation layer module.

The data layer includes the data persistence mechanisms, such as the database services and storage services, and the data access layer, which encapsulates the persistence mechanisms and exposes the data.

Data_layer_design_data_layer.png

Classifying and Characterizing Data

video

There are many ways to characterize data such as structured or unstructured, persistent or ephemeral. This lesson focuses on characterizing data based on what the user cares about. Most often, the user doesn't care about the underlying technology, but about whether the data is accessible when they need it. And whether the data they retrieve is the same data they stored and not modified or altered.

dummyData_layer_design_data_integrity.png

What users really care about is data integrity.

They can't really distinguish the difference between:

  • data loss,
  • data corruption,
  • extended unavailability.

If their data's not there, it might as well be gone.

However, as an engineer, persistence and access are very separate. A loss of data access is very important to users, but it doesn't mean that the data is gone. Persistence, with proactive detection and rapid recovery, is really what our goal is. If data is somehow lost, we want to be able to get it back, recovering it using version control, backups, or some kind of failover mechanism.

Data_layer_design_data_transaction_property.png

So when you're thinking about your data, you need to think about what data transaction properties are going to be required. And, of course, you have your menu, and you can only really pick two, and this is the CAP Theorem. So here you have to choose between:

  • consistency,
  • availability,
  • and partition tolerance.

BASE

If you want availability and partition tolerance, you can go over to BASE, which stands for Basically Available, Soft state, Eventual consistency. What that means is, you may make an update and it may not be visible immediately, or specific changes may not be available everywhere at once. But with that model, you can write data and scale your data out very, very quickly, and eventually everything will be consistent.

ACID

Now if you want pure consistency, then we get into what we call ACID transactions, which stands for Atomicity, Consistency, Isolation, and Durability. That means if you write some data to your data storage system, it is guaranteed to be there. You will not get an acknowledgement until it has been written, and it is completely fault-tolerant, replicated, et cetera, ensuring failover capabilities.

What are the data consistency requirements?

Data_layer_design_data_consistency_requirements.png

What are you trying to optimize? Domain/App specific

  • uptime
  • latency
  • scale
  • velocity
  • privacy

Data_layer_design_data_consistency_requirements_what_to_optimize.png

Data Ingest and Data Migration

video

Data migration is the idea that data already exists somewhere and that it will be transported into the cloud. Once the base of existing data has been migrated, updates and new data will be added to the cloud version and the original is no longer needed.

Data ingestion is the idea that data will continue to originate outside of the cloud and will be periodically loaded into the cloud.

This lesson covers the most common methods for getting data into the cloud.

GCS migration tools: console, gsutil, JSON API

Data_layer_design_Data_migration_tools_on_GCP.png

About data migration. We need to bring data into our system and we potentially need to migrate it from different locations.

So there's a few ways that we do this:

  • One is directly from the console for GCS, you can just drag and drop files.
  • There's gsutil, the Google Storage utility command. It also has the Amazon S3 SDK built in, so it allows us to transfer data from Amazon buckets to ours. It also gives us capabilities for identity and access management, signed URLs, and access control lists. There's a whole bunch of capabilities just from this command line.
  • And then, of course, our JSON API. Here you can make the API calls yourself, you can compress the data, and you can also do partial uploads, which is really useful, especially because most uploads to GCS buckets would otherwise be entire uploads. So in this case, if you control the API, you might be able to build in a little more efficiency.

GCS migration tools for large transfers: Cloud Storage Transfer Service

Data_layer_design_Data_migration_tools_on_GCP_large_trasnfers.png

There's also a service designed for very large transfers, called the Cloud Storage Transfer Service. It's basically a web-based interface that uses a lot of the APIs in the back end to transfer data between, say, Amazon and Google, and it does so very efficiently. It's typically recommended for over a terabyte of data because it can do a lot of management, like deleting the source data, making sure the two sides are synchronized, and overwriting changes, and it also optimizes the number of streams. It looks at the source material, sees how many objects there are, and creates an optimized number of streams to ensure the data moves over as fast as possible.

GCS migration tools for large transfers without network: Google Transfer Appliance

Data_layer_design_Data_migration_tools_on_GCP_trasnfers_appliance.png

Of course, size always matters, and if you need to transfer hundreds of terabytes of data, perhaps you don't want to do that over the network. Even though ingress is free, it may not be time efficient. So we now have the Transfer Appliance, very similar to Amazon's Snowball capability. It's an appliance that ships on site; you can either rack it or use it as a standalone appliance, with up to a petabyte of storage. You copy all the data over your local network, then ship the appliance to Google for data ingest.

GCS migration tools scale-up table

Data_layer_design_Data_migration_tools_summary_table_scaleup.png

Here's a table to give you an idea: as the amount of data you need to move grows, you also need to scale up the amount of bandwidth you have available. Even at a fully saturated 10 Gbps link, some transfers take over 30 hours, and in practice you probably cannot use that full amount of bandwidth yourself, which is where you'll start to see serious delays in a larger data transfer. As a rule of thumb, we recommend using the Google Transfer Appliance from roughly 10 to 100 terabytes and beyond, just to make your life a little easier.
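The numbers in a scale-up table like this follow directly from data size divided by link speed. A quick back-of-the-envelope calculator (ideal case, ignoring protocol overhead, which makes real transfers slower):

```python
def transfer_hours(data_bytes: float, bandwidth_bps: float) -> float:
    """Ideal transfer time in hours: bytes converted to bits,
    divided by link speed in bits per second."""
    return (data_bytes * 8) / bandwidth_bps / 3600

TB = 10 ** 12
# 100 TB over a fully saturated 10 Gbps link:
hours = transfer_hours(100 * TB, 10 * 10 ** 9)  # ~22 hours, ideal case
```

Real transfers rarely saturate the link, which is why the appliance becomes attractive in the tens-of-terabytes range.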

Data ingestion tools

Data_layer_design_Data_ingestion_tools.png

So other ways that we ingest data into your service.

  • First, obviously, it can be done over the network, which is globally available and highly available, with 100 Gbps of bandwidth at many of our peering locations and low latency.
  • You can post directly to cloud storage using HTTP, RESTful APIs.
  • You can put a front-end application on App Engine. That way, we can build in authentication and integrate that way, or even use Cloud Endpoints.

So we can expose a simple API that ingests data into your application, even through Cloud Functions, or set up a Compute Engine instance and store the data on local persistent disk. There are a number of different ways to bring data in, each with various pros and cons.

Identification of Storage Needs and Mapping to Storage Systems

video

Google Cloud Platform offers a wide variety of storage and database services.

This lesson provides a general method to narrow down the list of candidate storage services. The diagrams and charts in this lesson can help you ask meaningful questions about the data that will help you match the data to storage services.

Choose a storage solution between: DISK, MOBILE or CLOUD solutions

Data_layer_design_GCP_solution_Disk-or-Mobile-or-Cloud.png

Choose a storage solution for unstructured data on Cloud Storage: regional, multi-regional, nearline, coldline

Data_layer_design_Coud_Storage_regional_coldline_nearline.png

Choose a storage solution for analytics: BigQuery

Data_layer_design_Why_choose_BigQuery.png

Choose a storage solution for SQL (scalable or not): Cloud SQL or Spanner

Data_layer_design_Why_choose_if_SQL.png

Data_layer_design_Why_choose_if_SQL_scalable.png

Choose a storage solution for NoSQL: Cloud Datastore

Data_layer_design_Why_choose_if_NoSQL.png

Choose a storage solution: decision tree & summary table

Data_layer_design_GCP_solution_decision_tree.png

Data_layer_design_GCP_solution_summary_table.png
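The decision tree above can be sketched as a simple chooser function. This is a hypothetical simplification of the chart, not an official mapping; real choices involve more dimensions such as latency, cost, and consistency.

```python
def pick_storage(structured: bool, analytics: bool = False,
                 relational: bool = False, horizontal_scale: bool = False) -> str:
    """Rough storage chooser following the decision tree in the slides."""
    if not structured:
        return "Cloud Storage"            # unstructured blobs
    if analytics:
        return "BigQuery"                 # SQL analytics / data warehouse
    if relational:
        # Spanner for horizontally scalable relational, otherwise Cloud SQL
        return "Spanner" if horizontal_scale else "Cloud SQL"
    return "Cloud Datastore"              # NoSQL documents
```

For example, structured relational data that must scale horizontally lands on Spanner, while unstructured photo blobs land on Cloud Storage.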

Application: The photo service experiences Intermittent Outages

video

There's something wrong with the photo application.

Occasionally, the service simply fails to produce the smaller preview image. This used to occur only sporadically, but in recent weeks the frequency appears to be increasing, and it is becoming a problem recognized by users. The issue appears to be random.

  • What's happening?
  • and how can you change the design to fix this problem?

Business problem

app_photo_service_intermittent_outages.png

Systematic logical troubleshooting

app_photo_service_intermittent_outages_analysis.png

  • systematic & logical troubleshooting
  • and answering the "five why's"

The team has determined that the root problem is that the persistent disk on the application server cannot keep up with the scale of demands being placed on it. You can scale up CPU, but sometimes your underlying disk I/O performance cannot keep up. Also, something most people don't think about: disk I/O also needs some CPU, so that could be a problem. In this case, though, we're focusing strictly on the disk: there isn't enough virtual machine disk I/O throughput to handle this kind of scale.

Break down business logic on the photo service

app_photo_service_refreshed_business_logic.png

app_photo_service_business_logic_with_bottleneck.png

With 1 million pictures per day, we need to move from our local file system to Cloud Storage for scalability.

New Service Level Objectives (SLOs) and Indicators (SLIs)

we're going to introduce some new service level objectives.

In this case, we want to reduce the error rate. We want to keep failures to produce a thumbnail below one hundredth of one percent, which is about 100 errors per million images.

Now, how are we going to measure this? Our service level indicators will look at the logs and identify any entries that indicate errors. At 1 million images per day, this means our error budget is going to be 3,000 errors per month.

One of the things we do with our SREs is allocate them an error budget. As long as they stay within that budget, they can work on future things. But if they're unable to remain within their error budget during the month, they have to focus on automation, tools, and whatever it takes to ensure they can meet the error budget month after month, before they can work on new things, or simply take a break. That's really the goal: you don't want your SREs working full time. They should only be working about half the time, and when they are, it should be either responding to incidents or developing for future uses. Take the job of a firefighter: they're always on call, always on edge, but they're not physically working all the time. If they were, they wouldn't be able to offer that same kind of response.

app_photo_service_outages_new_service_levels.png
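The budget arithmetic above is straightforward: at 1 million images per day and an allowed error rate of 100 per million (0.01%), the monthly error budget works out as follows. This is just the arithmetic from the lesson, packaged as a helper.

```python
def monthly_error_budget(images_per_day: int, error_rate: float,
                         days: int = 30) -> int:
    """Errors allowed per month while still meeting the SLO."""
    return round(images_per_day * days * error_rate)

budget = monthly_error_budget(1_000_000, 0.0001)  # -> 3000 errors/month
```

Once actual errors (counted from the logs, the SLI) exceed this number in a month, the SRE policy described above kicks in.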

Design challenge #2: Complication

video

In the application lesson, the random intermittent failure turned out to be a capacity issue. The services simply couldn't keep up with the demand and you're able to redesign a solution to add a storage service component in addition to the web server component and the thumbnail processing app server component.

This leads to another problem in the log aggregation system: The log aggregator needs to account for another source of log entries and the log aggregation servers running out of disk space. Take a closer look at this problem.

Your challenge is to modify the log aggregation design to avoid or overcome this issue. Watch the lesson that describes the problem, then come up with your own solution. When you're ready, continue the lesson to see a sample solution. Remember that the sample solution is not the best possible solution; it's just an example, and your design might be better.

app_photo_service_capacity_problem_for_log_aggregation.png

Business logic

app_photo_service_capacity_problem_for_log_aggregation_new_business_logic.png

  • Why not GCS for aggregating this log files?

app_photo_service_capacity_problem_for_log_aggregation_properties.png

  • Why not Bigtable for aggregating this log files?

app_photo_service_capacity_problem_for_log_aggregation_Bigtable.png

Presentation Layer Design

video

Overview

The presentation layer has to do with the flow of data through the system between the user, the business logic, and the storage service. In other words, this is the network:

  • the network edge configuration,
  • the network configuration for data transfer within the service,
  • and network integration with other environments.

Network Configuration

video

One key point of this lesson is that location within the cloud network makes a difference in latency.

In general, there's a direct tradeoff: the more distributed the elements in your design, the more tolerant the system will be of outages. However, there are real performance limitations, because the round-trip time is much slower between distant elements, and there are expense considerations for egress traffic. The critical technology for implementing intelligent traffic management in the design is load balancing. Fortunately, the platform offers a very robust set of load balancing services that are optimized for different circumstances.

So, network configuration for data transfer within the service is going to focus on:

  • location,
  • load balancing,
  • and caching.

Impact of Location

Presentation_layer_Network_location.png

The location of resources within the cloud network is very significant. Keep in mind that a million nanoseconds is one millisecond. Sending a 2 KB packet over a 1 Gbps network takes on the order of 20 microseconds. Now, what about a round trip within the same data center? Google's data centers are pretty fast: you can get away with half a millisecond, sometimes even faster, depending on the design. And what if you want to send something across the world, say from California to the Netherlands and back? That could be upwards of 150 milliseconds. Even that is considerably fast, thanks to Google's global fiber network, but these are design considerations you need to take into account, especially from the perspective of the user or your specific application. So remember: no more than six to seven round trips per second are possible between Europe and the US, but approximately 2,000 per second can be achieved within a data center. These generalizations are important for understanding when it's time to push data closer to the user, or to locate your application in multiple data centers.
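Those round-trip generalizations come straight from dividing one second by the round-trip time, as a quick sketch shows:

```python
def max_round_trips_per_second(rtt_seconds: float) -> float:
    """Upper bound on sequential round trips per second for a given RTT."""
    return 1.0 / rtt_seconds

transatlantic = max_round_trips_per_second(0.150)    # ~6.7 per second
intra_dc = max_round_trips_per_second(0.0005)        # 2000 per second
```

A chatty protocol that needs many sequential round trips therefore performs dramatically worse across continents than within a single data center, which is the argument for moving data (or the application) closer to the user.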

Impact of Load Balancing

So the technology that allows you to control the network location of resources used by your service is called load balancing. Load balancing in our world is a little different. Basically, it gets user traffic to the application servers with capacity in the region closest to that customer. It can also handle internal distribution of traffic across multiple servers inside your application infrastructure, and it can trigger autoscaling. We also provide global load balancers, which give you a single external IP address that can send traffic to whichever geographical location is closest to each individual user. We offer many different kinds of load balancing services, each optimized for a different use case.

Presentation_layer_Network_load_balancing.png

Global Load Balancers

Presentation_layer_Network_global_load_balancers.png

Choosing your type of load balancer

Presentation_layer_Network_choosing_your_load_balancers.png

Integration with other Environments

video

Network edge configuration for users and clients

Presentation_layer_Network_global_IP_adresses.png

Presentation_layer_Network_Cloud_CDN.png

https://peering.google.com/#/infrastructure

Presentation_layer_Network_Edge_infrastructure.png

Presentation_layer_Network_Edge_infrastructure_datacenters.png

Presentation_layer_Network_Edge_infrastructure_PoPs.png

Presentation_layer_Network_Edge_infrastructure_edge_nodes.png

In order to use the CDN, you have to turn on HTTP(S) load balancing or another network load balancing service. You can also push and publish content directly from Google Cloud Storage to the CDN. And if you use it from the network layer, we can automatically start to cache any data that goes through the HTTP(S) load balancer, which is a huge benefit.

Network connections to other networks (e.g. data center, another cloud)
Dedicated interconnects

Presentation_layer_Network_interaction_with_other_cloud_providers.png

The big thing in the middle is that you need some kind of dedicated interconnect. We have what's called Partner Interconnect, or even Dedicated Interconnect, which can connect us to other providers. In this case, we can have private bandwidth between both cloud providers, and with the proper routing between the two, we can route those networks between the two different locations.

VPN configurations

Presentation_layer_Network_interaction_with_other_cloud_providers_VPNs.png

There's a couple of ways to do that by taking advantage of VPNs:

  • You can aggregate multiple VPN gateways to aggregate the bandwidth to the other provider.
  • You can also do this in reverse. We usually allow about 1.5 Gbps per VPN tunnel, so you can aggregate tunnels together or create a fully meshed network. That way you have failover in case the remote VPN gateway fails, you have backup links, and so on: standard networking redundancy built in.

All of this is available through Cloud Router, setting either static or dynamic routes. If you're making network changes at the other cloud provider, such as spinning up new custom subnets, you want to make sure that Google Cloud Platform also knows about them, so take advantage of the Cloud Router service to make sure those route updates are propagated.
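Since each tunnel carries roughly 1.5 Gbps, the aggregate VPN bandwidth is just the tunnel count times the per-tunnel throughput. This is a back-of-the-envelope sketch; real throughput varies with MTU, encryption overhead, and the remote gateway.

```python
def aggregate_vpn_gbps(tunnels: int, per_tunnel_gbps: float = 1.5) -> float:
    """Approximate combined bandwidth of parallel VPN tunnels."""
    return tunnels * per_tunnel_gbps

# Four tunnels aggregated between the two gateways:
combined = aggregate_vpn_gbps(4)  # -> 6.0 Gbps
```

This is why meshing multiple gateways is the standard way to both raise bandwidth and gain failover at the same time.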

VPN Performances

Presentation_layer_Network_interaction_with_other_cloud_providers_VPNs_performances.png

Application: Photo service... Periodic Slowdowns

video

Business problem

The photo service is working fine; however, recently there have been periodic slowdowns. Users are experiencing delays: it's taking an increasingly long time for the system to respond with a thumbnail image. Then, after some time, it appears to speed up again.

app_photo_service_periodic_slowdown.png

What could possibly be the cause of the slowdown?

what can you do to change the design to overcome this problem?

Systematic logical troubleshooting

app_photo_service_periodic_slowdown_troubleshooting_miscommunication.png

Collaboration & communication: Report, Document, build policy

app_photo_service_periodic_slowdown_build_process_to_learn_from_mistakes.png

You have to document it. While it might be viewed as a boring process, and most people complain about having to do it, you always want to create reports. But under what conditions? For starters: any time an SLO is breached, any incident that requires an emergency on-call response, and any time there's a need for follow-up communications, for example if legal needs to explain an outage without revealing its impact or anything competitive. There needs to be a clear policy that defines the writing of these reports. Timelines must be specified for how soon after an incident a draft is published, as well as for completing the report. Writing these reports creates a record of what happened and what was done to fix it. This will be a helpful source of reference if a similar event happens again, so doing these postmortems should be a mandatory part of your recovery processes.

app_photo_service_periodic_slowdown_report_document_write_policy.png

Break down business logic on the photo service

app_photo_service_periodic_slowdown_business_logic_refresher.png

app_photo_service_periodic_slowdown_business_logic_refresher_issue.png

It's taking minutes to generate thumbnails, so the system is definitely slow. A systematic and logical troubleshooting process has been followed, and the "five whys" have been asked and answered. The conclusion is that the issue is definitely tied to the capacity of the system to generate thumbnails. It's also been established that it isn't the front-end web server causing the delays, but the back-end thumbnail-generating service, which is failing to keep up with demand. When we say capacity, we mean the capacity to actually perform the service in a timely manner. What happens is that the thumbnail server is running out of CPU. We could watch CPU utilization, but CPU utilization isn't linear, and during busy times it can go to 100%. This is sure to impact the end-to-end response time for the user.

  • How do you monitor it?
  • How do you alert on this? You need to be able to look at specific windows.
  • Depending on how long the average thumbnail should take to respond, instead of looking over a five-minute window you may need to look over one minute.
  • But keep in mind, CPU utilization is not meant to be used as a service level indicator.

Solution: Scale out the backend processing of thumbnails.

app_photo_service_periodic_slowdown_business_logic_solution.png

Here's our decision: if we need to handle more thumbnail processing, it has to become more scalable. However, we didn't choose to simply throw more CPU and network at the single server, because that would still be a single point of failure. Instead, we decided to add a load balancer and scale out the number of thumbnail servers. The great thing is that this is now like a microservice in itself: because storage has been isolated to Google Cloud Storage, the same code can be distributed, and it doesn't need to keep track of a queue or anything else. The upload server basically pulls whatever is on the data storage server, and requests are load balanced as they come in. Technically, this is probably an internal load balancer, but we'll get into that a little later.

In this case, to deal with our greater-than-80-percent CPU utilization, we want to distribute traffic requests from our business logic across the application servers in a cluster.
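The sizing decision an autoscaler makes when a CPU target is exceeded can be sketched with one formula. This is a simplified, hypothetical model of target-utilization scaling, not the actual GCE autoscaler algorithm:

```python
import math

def desired_instances(current: int, observed_util: float,
                      target_util: float) -> int:
    """Instances needed so average utilization drops back to the target."""
    return max(current, math.ceil(current * observed_util / target_util))

# 3 thumbnail servers running at 95% CPU with an 80% target:
needed = desired_instances(3, 0.95, 0.80)  # -> 4 instances
```

Scaling out by one instance spreads the same load over four servers, bringing average utilization back under the 80% target.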

What about our Service Level Objectives (SLOs) and Indicators (SLIs)

Even though we've added a cluster of servers, we haven't changed anything that our users can measure. The performance is still a measure of the end-to-end latency and the accuracy of the service is still based on the error logs.

app_photo_service_periodic_slowdown_business_logic_SLIs_SLOs_unchanged.png

Design challenge #3: Growth

video

In the application lesson, you overcame the periodic slowdowns by increasing the capacity of the thumbnail application component, replacing the single server with a group of load-balanced, autoscaling servers.

That means instead of a single app server log, you now have logs coming from each server in the pool. How can you evolve the design of the log aggregation system to accommodate the new scalable service? Watch the lesson that describes the problem, then come up with your own solution. When you're ready, continue the lesson to see a sample solution. Remember that the sample solution is not the best possible solution, it's just an example and your design might be better.

app_photo_service_periodic_slowdown_design_challenge_logs_of_a cluster.png

app_photo_service_periodic_slowdown_design_challenge_logs_of_a cluster_problem.png

Let's define the problem. In our case, autoscaling of the application servers has produced logs that are outgrowing the processing capacity of the log aggregation server, so we want to design a solution. Always remember, there are multiple designs you can take; your solution may be better, and that's okay. But let's at least work through the options and decide what to take advantage of next.

Do we move to Dataflow? Well, maybe. But let's think about what that might require; these are dialogues I've had with students in class. The aggregator is a Python script, and Dataflow does support Python, but perhaps what we want to do requires the Java version of the SDK. That would mean the programmer who threw this design together would have to learn a new language. You could say that's expected, but these are potential limitations. So when you look at a new service, ask what the requirements are to make it happen: on paper it may sound as if you can just swap it out, but rewriting the code may involve a learning curve.

In this case, our engineer decides it would be a lot easier if we simply designed the existing system so it could scale. So one simple, iterative solution is to put an internal load balancer in front, so the logging server can run on multiple instances in an autoscaling instance group. It can take logs as they're ingested and put them directly into Bigtable. Now we can run our queries, get our session IDs, and not really worry about the aggregator failing. This scales rather nicely, because a daily cron job will take all the local data storage and process it in.

app_photo_service_periodic_slowdown_design_challenge_logs_of_a cluster_solution_load_balancer_on_logs_servers.png

lab: Autoscaling

The photo application that was promised by the development team is late. You decide that you don't want to wait any longer to develop the infrastructure. When the photo app does become available, you don't want to add to the schedule delay, so you decide to make the infrastructure auto-scaling and reliable by building on a previous deployment. This way, when the photo app is available you can just drop it in to the deployment framework that you've built. The deployment manager templates from the previous lab will be provided to you in a TAR file, so you'll just be creating the additional files.

lab_deployment_autoscaling_app.png

Resources/Articles