Preparing for the Professional Cloud Architect Examination

~ 1 day / Expert

https://www.coursera.org/learn/preparing-cloud-professional-cloud-architect-exam/

Content

Audience

  • Cloud professionals who intend to take the Professional Cloud Architect certification exam
  • Must have attended the Architecting with GCP: Infrastructure course or equivalent on-demand courses
  • Knowledge and experience with GCP, equivalent to GCP Architecting Infrastructure
  • Knowledge of cloud solutions, equivalent to GCP Design and Process
  • Industry experience with cloud computing

Course Outline

The course includes presentations, demonstrations, and hands-on labs.

* Module 1: Understanding the Professional Cloud Architect Certification
    * Position the Professional Cloud Architect certification among the offerings
    * Distinguish between Associate and Professional
    * Provide guidance between Professional Cloud Architect and Associate Cloud Engineer
    * Describe how the exam is administered and the exam rules
    * Provide general advice about taking the exam


* Module 2: Sample Case Studies
    * MountKirk Games
    * Dress4Win
    * TerramEarth

* Module 3: Designing and Implementing
    * Review the layered model from Design and Process
    * Provide exam tips focused on business and technical design
    * Designing a solution infrastructure that meets business requirements
    * Designing a solution infrastructure that meets technical requirements
    * Design network, storage, and compute resources
    * Creating a migration plan
    * Envisioning future solution improvements
    * Resources for learning more about designing and planning
    * Configuring network topologies
    * Configuring individual storage systems
    * Configuring compute systems
    * Resources for learning more about managing and provisioning
    * Designing for security
    * Designing for legal compliance
    * Resources for learning more about security and compliance

* Module 4: Optimizing and Operating
    * Analyzing and defining technical processes
    * Analyzing and defining business processes
    * Resources for learning more about analyzing and optimizing processes
    * Designing for security
    * Designing for legal compliance
    * Resources for learning more about security and compliance
    * Advising development/operation teams to ensure successful deployment of the solution
    * Resources for learning more about managing implementation
    * Easy buttons
    * Playbooks
    * Developing a resilient culture
    * Resources for learning more about ensuring reliability

* Module 5: Next Steps
    * Present Qwiklabs Challenge Quest for the Professional CA
    * Identify Instructor Led Training courses and what they cover that will be helpful based on skills that might be on the exam
    * Connect candidates to individual Qwiklabs, and to Coursera individual courses and specializations.
    * Review/feedback of course

Introduction

video

Details of the exam/certification are available in English or in French:

Examination guide

Use the Exam Guide outline to help identify what to study.

Certification exam guide

Section 1: Designing and planning a cloud solution architecture

1.1 Designing a solution infrastructure that meets business requirements. Considerations include:

  • Business use case and product strategy
  • Cost optimization
  • Supporting the application design
  • Integration
  • Movement of data
  • Trade-offs
  • Build, buy, or modify
  • Success measurements (e.g., key performance indicators (KPI), return on investment (ROI), metrics)
  • Compliance and observability

1.2 Designing a solution infrastructure that meets technical requirements. Considerations include:

  • High availability and failover design
  • Elasticity of cloud resources
  • Scalability to meet growth requirements

1.3 Designing network, storage, and compute resources. Considerations include:

  • Integration with on-premises/multi-cloud environments
  • Cloud-native networking (VPC, peering, firewalls, container networking)
  • Identification of the data processing pipeline
  • Matching data characteristics to storage systems
  • Data flow diagrams
  • Storage system structure (e.g., object, file, RDBMS, NoSQL, NewSQL)
  • Mapping compute needs to platform products

1.4 Creating a migration plan (i.e., documents and architectural diagrams). Considerations include:

  • Integrating the solution with existing systems
  • Migrating systems and data to support the solution
  • Licensing mapping
  • Network and management planning
  • Testing and proof of concept

1.5 Envisioning future solution improvements. Considerations include:

  • Cloud and technology improvements
  • Evolution of business needs
  • Evangelism and advocacy

Section 2: Managing and provisioning a solution infrastructure

2.1 Configuring network topologies. Considerations include:

  • Extending to on-premises environments (hybrid networking)
  • Extending to a multi-cloud environment, which may include GCP-to-GCP communication
  • Security
  • Data protection

2.2 Configuring individual storage systems. Considerations include:

  • Data storage allocation
  • Data processing/compute provisioning
  • Security and access management
  • Network configuration for data transfer and latency
  • Data retention and data life cycle management
  • Data growth management

2.3 Configuring compute systems. Considerations include:

  • Compute system provisioning
  • Compute volatility configuration (preemptible vs. standard)
  • Network configuration for compute nodes
  • Infrastructure provisioning technology configuration (e.g., Chef/Puppet/Ansible/Terraform)
  • Container orchestration (e.g., Kubernetes)

Section 3: Designing for security and compliance

3.1 Designing for security. Considerations include:

  • Identity and access management (IAM)
  • Resource hierarchy (organizations, folders, projects)
  • Data security (key management, encryption)
  • Penetration testing
  • Separation of duties
  • Security controls
  • Managing customer-supplied encryption keys with Cloud KMS

3.2 Designing for legal compliance. Considerations include:

  • Legislation (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children's Online Privacy Protection Act (COPPA), etc.)
  • Audits (including logs)
  • Certification (e.g., ITIL framework)

Section 4: Analyzing and optimizing technical and business processes

4.1 Analyzing and defining technical processes. Considerations include:

  • Software development life cycle (SDLC) plan
  • Continuous integration/continuous deployment
  • Troubleshooting/post-mortem analysis culture
  • Testing and validation
  • IT enterprise process (e.g., ITIL)
  • Business continuity and disaster recovery

4.2 Analyzing and defining business processes. Considerations include:

  • Stakeholder management (e.g., influencing and facilitation)
  • Change management
  • Team assessment/skills readiness
  • Decision-making process
  • Customer success management
  • Cost optimization/resource optimization (Capex/Opex)

4.3 Developing procedures to test the resilience of a solution in production (e.g., DiRT and Simian Army)

Section 5: Managing implementation

5.1 Advising development/operations teams to ensure successful deployment of the solution. Considerations include:

  • Application development
  • API best practices
  • Testing frameworks (load/unit/integration)
  • Data and system migration tooling

5.2 Interacting with Google Cloud using the GCP SDK (gcloud, gsutil, and bq). Considerations include:

  • Local installation
  • Google Cloud Shell

Section 6: Ensuring solution and operations reliability

6.1 Monitoring the solution, logging events, and managing alerts

6.2 Managing deployments and releases

6.3 Supporting operational troubleshooting

6.4 Evaluating quality control measures

Sample case studies

Some of the questions on the Cloud Architect certification exam may refer to a case study that describes a fictitious business and a solution concept. These case studies are intended to provide additional context to help you choose the right answer(s). Review some sample case studies that may be used on the exam.

  • Mountkirk Games
  • Dress4Win
  • TerramEarth

prepare_exam_operation_secure.png

  • Design and implement (make something function; more about products)
  • Optimize and operate (make something secure and cost-effective)
  • Manage implementation and ensure reliability (make something continue reliably and adapt over time; more about procedures)

Have you heard of the four P's? That's product, people, policy, and process.

Meaning of the Professional Cloud Architect Certification

(video)

Different roles & certifications

prepare_exam_different_roles_and_certifications.png

I just want to caution you that the Associate Cloud Engineer exam is not simply an easier Cloud Architect exam. All of these certifications are based on real-world practical job skills required and used by practitioners in the industry.

A cloud engineer uses the same technology as the cloud architect. However, their job focus is different, and so the skills are different. For example:

  • A cloud architect might consider how to design a Kubernetes cluster to meet customer requirements.
  • A cloud engineer might run jobs on the cluster and be more focused on monitoring the cluster and measuring and maintaining its performance.
  • A cloud architect designs the solution and implements it.
  • A cloud engineer operates a solution, monitors it, maintains it, and evolves it as business circumstances change.

So, which certification or certifications you might want depends on your job role, the job you have or the job you want to have.

Difference between "associate" vs "professional" level of certifications

= the difference is in "designing" and "business requirements"

Professional:

  • designing
  • planning
  • PoC
  • Identifying the business needs

Associate:

  • implementing
  • operating & technical requirements

prepare_exam_difference_associate_vs_professional.png

Tips for methods of study

prepare_exam_tips_for_methods_of_study.png

Tips for the day of the exam

prepare_exam_tips_for_exam_day.png

Product and technology knowledge

You need to know the basic information about each product that might be covered on the exam.

You need to know...

  • What it does, why it exists.
  • What is special about its design, for what purpose or purposes was it optimized?
  • When do you use it, and what are the limits or bounds when it is time to consider an alternative?
  • What are the key features of this product or technology?
  • Is there an Open Source alternative? If so, what are the key benefits of the cloud-based service over the Open Source software?

Which products and technologies

Training and Certification meet at the JTA -- the Job Task Analysis -- the skills required of the job.

The scope of the exam matches the learning track and specialization in training. So a great place to derive a list of the technologies and products that might be on the exam is to look at all the products and technologies that are covered in the related training. The training might not cover everything. But it is a good place to start.

Study methods

Training is great. Digging into the online documentation can be very instructive and covers more detail than can be covered in a class, so documentation tends to have more equal coverage of features, whereas training has to prioritize its time. Getting hands on experience can help you understand a product or technology much better than reading and is the kind of experience a professional in the job would have. So labs can be a great way to prepare.

Build your own case study summaries

prepare_exam_case_studies_summaries.png

Mountkirk Games Case Study: a game app

Mountkirk Games Case Study [video]

Key business points

prepare_exam_key_business_points_game_app.png

Technical evaluation

prepare_exam_technical_evaluation_game_app.png

Sample solution

prepare_examsamle_solution_game_app.png

Dress4Win Case Study: a social network app built around the wardrobe

Dress4Win Case Study [video]

Key business points

prepare_exam_key_business_points_soc-net_app.png

prepare_exam_tech_approach_soc-net_app.png

Technical evaluation

prepare_exam_technical_evaluation_soc-net_app.png

prepare_exam_technical_evaluation_2_soc-net_app.png

Sample solution

prepare_exam_sample_solution_soc-net_app.png

prepare_exam_sample_solution_2_soc-net_app.png

TerramEarth Case Study: IoT sensors for agriculture & mining

TerramEarth Case Study [video]

Key business points

prepare_exam_key_business_points_IoT_app.png

prepare_exam_key_business_points_2_IoT_app.png

prepare_exam_key_business_points_4_IoT_app.png

Technical evaluation

prepare_exam_key_business_points_3_IoT_app.png

prepare_exam_technical_evaluation_IoT_app.png

prepare_exam_technical_evaluation_2_IoT_app.png

Sample solution

prepare_exam_sample_solution_IoT_app.png

Touchstone concepts

A touchstone concept is a complex or key idea -- something that you would learn in a class AFTER you have learned all the basic dependent concepts. They are used in this course because they are a very efficient way for you to learn where you have confidence and where more preparation might be needed.

This approach is based on the Deeper Learning method of adult learning.

Example

People seem to be able to relate well to this example.

Touchstone: "Unlike other vendor clouds, a subnet spans zones, enabling VMs with adjacent IPs to exist in separate zones, making design for availability easier to accomplish since the VMs can share tagged firewall rules."

To understand the above statement, the basic dependent knowledge that must already be understood includes, Regions, Zones, Subnets, IP Addresses, and Firewall Rules.

These basic concepts are not taught or reviewed in this course. They are taught in the training courses in this specialization and in the corresponding learning track in instructor led training.

Advice: Evaluate the dependent basic concepts

Assess your confidence with each touchstone concept as it is presented. Don't expect to be taught the basic concept. If you don't understand the touchstone at all, or if you don't feel confident in your knowledge of it, or if you feel there are specific elements of it that you don't understand or are not confident about -- take note!

This is an area where more preparation can be of benefit for you.

Also -- note where you are confident, know the material, and the dependent concepts on which the touchstone is based. These areas require less preparation for you. So noting what you know well can help make your preparation activities more efficient.

Designing and Implementing

video

This module covers designing and implementing infrastructure solutions. Design can get complicated. Do you have an approach to design? It's easy to confuse elements if you don't use an organized method. Do you have favorite design elements? For example, do you find most of your designs start with VMs? You'll want to overcome these biases by understanding the infrastructure services available and when to select them.

Today you'll be learning about and preparing for the Professional Cloud Architect exam.

A lot of that has to do with design. Before you can design a solution, you need to understand the building blocks, the underlying services, and technologies that make up solutions in Google Cloud. Here's a tip, use a layered model like this one. It'll help you organize your thinking about each exam question, so that you'll more easily recognize and focus on what's important. Professional Cloud Architects often use layered models to organize or separate solution designs. It makes it much easier to deal with the complexity and to make sure there are no dropouts in the design. This model comes from our design and process class.

Design_methods.png

Designing a solution infrastructure that meets business requirements

This class follows the exam guide. So whenever you see a slide like this, the blue column contains items directly from the exam guide, and the white column contains tips and advice directly relevant to each outlined item. You can read through these yourself.

I'm going to highlight and discuss one or two of these per slide. When we speak about business requirements, we're asking the question, "What are the customer's needs and expectations?" Questions on the exam are realistic, so on a job, these discussions would likely be with a business stakeholder, and you'd need to be prepared to answer these questions and their concerns.

You'll notice that the first and last items in the list have to do with determining the criteria for success and deciding how to measure that. It's very important to be explicit about exactly what you're trying to achieve. These items are often stated qualitatively at the beginning and are measurable and quantitative at the end.

touchstones.png

solutions_depends_on_context.png

Solutions often depend on context and involve trade-offs: good vs. fast vs. cost

build_buy_modify.png

Practice Case Study analysis #1

Case Study #1

case_study_01-design_plan.png

Context = the need to gain speed and ease of use through cloud solutions

Identify technical watchpoints

case_study_01-watchpoints.png

video

case_study_01-technical_solution.png

case_study_01-design_plan_requirements.png

Designing a solution infrastructure that meets technical requirements

Design_solution_meeting_requirements.png

Design_solution_meeting_requirements_what_to_measure.png

Design_solution_meeting_requirements_time_value_deadline_requirements.png

Design_solution_chains_of_microservices.png

Common design patterns

Common design patterns: https://cloud.google.com/apis/design/design_patterns

Design_solution_common_design_patterns.png

Narrow down technology

Narrow down technology to what could work, then what would work best given a particular context:

Design_solution_narrow_down_to_what_could_work_then_whats_best.png

Identifying bottlenecks

Identifying bottlenecks is especially useful for questions involving building out from existing solutions. For example, the current system can support X number of users, and the goal is to support Y number of users. What's the bottleneck in the current design? Is it bandwidth, gigabytes, queries per second? Where will the application hit its limits? This is often the factor that determines which solution is best in the circumstance.
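
To make this concrete, here is a minimal back-of-the-envelope sketch in Python; the resource names, capacities, and loads are made-up numbers for illustration, not values from any case study.

```python
# Back-of-the-envelope bottleneck check (illustrative numbers only).
# Given current load and a target user count, estimate which resource
# saturates first -- that resource is the bottleneck to design around.

current_users = 50_000
target_users = 200_000
scale = target_users / current_users  # 4x growth

# Hypothetical per-resource capacity vs. current load.
resources = {
    "frontend_qps":       {"capacity": 40_000, "current_load": 8_000},
    "db_queries_per_sec": {"capacity": 5_000,  "current_load": 2_500},
    "egress_gbps":        {"capacity": 10,     "current_load": 1.5},
}

for name, r in resources.items():
    projected = r["current_load"] * scale
    headroom = r["capacity"] - projected
    status = "BOTTLENECK" if headroom < 0 else "ok"
    print(f"{name:20s} projected={projected:10.1f} capacity={r['capacity']:10.1f} {status}")
```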

Design_solution_identifying_bottlenecks.png

Read/Build dataflow diagrams

Design_solution_read_buikd_dataflow_diagrams.png

ACID (consistency) vs BASE (availability)

Design_solution_assets_vs_base.png

The significance of atomicity, consistency, isolation, and durability:

  • Atomicity: Transactions are often composed of multiple statements. Atomicity guarantees that each transaction is treated as a single "unit", which either succeeds completely or fails completely: if any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged (a runnable sketch follows this list).
  • Consistency: Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.
  • Isolation: Transactions are often executed concurrently (e.g., multiple transactions reading and writing to a table at the same time). Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.
  • Durability: Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash).
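
A minimal runnable sketch of atomicity, using Python's built-in sqlite3 module (the account table and amounts are purely illustrative):

```python
# Both statements in the transaction commit together or not at all.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # abort the whole unit of work
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass  # the whole transaction was rolled back

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0} -- the partial debit did not persist
```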

Creating a Migration Plan

video

Design_solution_migration_plan.png

Design_solution_migration_plan_think_practically.png

Design_solution_envision_furture_requirements.png

Preparing for Managing

video

Managing and provisioning solution infrastructure. If you think about it, managing and provisioning are both about capacity and demand, and about choosing the right infrastructure components to support and adapt to the demand. For external connections, know your options: Internet, VPN, Cloud Router, and the various flavors of direct interconnect.

**Example:**

Google Cloud networking is not like other vendor networks: not like traditional IP networks, and not like other SDN networks. That's networking in the cloud, and you need to know how you might handle migrating an existing data center network into a GCP network.

Managing_various_interconnects.png


Subnetworks can extend across zones in the same region. One VM and an alternate VM can be on the same subnet but in different zones. A single firewall rule can apply to both VMs even though they're in different zones. This makes it much easier to design and implement resilient or high-availability solutions.

Managing_various_subnets_extends_Across_zones.png

Know your options:

networking_interconnect_options.png

Security

security.png

Case Study #2

dummycase_study_02-minimize_impact_productivity.png

Identify Technical Watchpoints

case_study_02-watchpoints.png

Implementation meeting technical requirements

case_study_02-technical_solution.png

case_study_02.png

Configuring individual Storage Systems

video

Know your different Storage solutions:

know_your_storage_options.png

compare_storage_options.png

compare_storage_options_2.png

compare_storage_options_3.png

compare_storage_decision_tree.png

compare_firebase_datastore.png

Preparing for Data Processing

Data transfer

video

Data_transfer.png

Lazy deletion design

lazy_deletion_design.png

speed_transfer_data_online.png

The left side of the table is closer to physical transfer speeds, and the right side of the table is closer to online speeds. Therefore, it's much faster to accumulate data online, work with it, and transfer it online than to collect the data physically and then transfer it.
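
As a rough worked example (ignoring protocol overhead and assuming a fully dedicated link), transfer time scales linearly with data size and inversely with link speed:

```python
# Rough transfer-time math (illustrative): how long to move a data set
# over a given network link?

def transfer_days(data_tb: float, link_mbps: float) -> float:
    bits = data_tb * 1e12 * 8           # terabytes -> bits (decimal units)
    seconds = bits / (link_mbps * 1e6)  # bits / (bits per second)
    return seconds / 86_400

for tb, mbps in [(1, 100), (10, 100), (100, 1_000), (1_000, 10_000)]:
    print(f"{tb:>6} TB over {mbps:>6} Mbps ~ {transfer_days(tb, mbps):7.1f} days")
```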

Cloud Storage

video

Cloud_storage_for_archive.png

4 storage classes:

Cloud_storage_for_archive_4_classes.png

Cloud_storage_for_archive_disks_options.png

Cloud_storage_how_it_works.png

Cloud_storage_simulates_a_filesystem.png

BigTable

BigTable in a nutshell :):

  • How does Bigtable work? (Colossus: Google File System GFS ~HDFS)
    • manipulate "tablets"
    • manipulate the data
    • manipulate the metadata
  • Bigtable is a learning system, detecting "hotspots" where activity is highest, splitting a tablet into several, and automatically rebalancing the compute resources.
  • The best use case is with big data -- above 300 GB.
  • The Cloud Bigtable design idea is "simplify for speed":
    • data in tables, only one index, the "Row Key" > "speed through simplification". Forget SQL; build up from a minimal set of operations.
  • Optimize the design of your row key according to the application (see the sketch after this list).
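
A hypothetical sketch of row key design in plain Python: the device ID prefix and reversed timestamp shown here are illustrative choices, not a prescribed schema.

```python
# Hypothetical Bigtable row key for a per-device time series. Field promotion
# (device id first) keeps reads for one device contiguous; a reversed timestamp
# puts the newest rows first and avoids a write hotspot at the "latest
# timestamp" end of the keyspace.
import time

MAX_TS = 10_000_000_000  # arbitrary far-future epoch second used for reversal

def row_key(device_id: str, event_ts: float) -> bytes:
    reversed_ts = MAX_TS - int(event_ts)
    return f"{device_id}#{reversed_ts:010d}".encode()

print(row_key("vehicle-4217", time.time()))
# e.g. b'vehicle-4217#82XXXXXXXX' -- newest events sort first within the prefix
```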

Data processing to Machine Learning

video

  • BigQuery provides a front-end for analysis and a back-end that can read from several sources, including BigQuery tables but also CSV files in Cloud Storage (see the sketch after this list).
  • Cloud Dataproc is a managed service for Hadoop clusters, useful for processing data and returning it to Cloud Storage or BigQuery.
  1. The first step is migration from data center processing to cloud data processing:
    • BigQuery replaces many tools and custom applications in the data center,
    • while Cloud Dataproc replaces Hadoop.
    • Cloud Bigtable is a drop-in replacement for HBase.
  2. Machine learning is available from Cloud Dataproc using APIs, such as natural language processing (NLP). When ready, the business can move from a cluster-based managed service to a serverless service and access the full benefits of machine learning. Machine learning provides value through tagging of unstructured data, which makes it useful for specific purposes. Machine learning can also be used to recognize items and for prediction. Machine learning is more of a focus of the data engineering track than the cloud architect track, but it's still part of the infrastructure of a cloud architect, and it might be used for finding solutions.
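
As a minimal illustration of BigQuery as an analysis front-end, here is a sketch using the google-cloud-bigquery Python client (it assumes the library is installed, default credentials and a project are configured, and it queries a public sample dataset):

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up project/credentials from the environment

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():  # runs the job and waits for it
    print(row.name, row.total)
```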

Data_processing_to_ML.png

ML_value.png

Cloud AI on GCP :D:

  • cloud AutoML:
    • pre-built AI models (Vision, NLP, Translation, ....)
    • add custom models
    • create new models
  • ML with BigQuery

Preparing for Compute

video

Configuring Compute Systems

One thing to consider in the design is whether you can create an application that's tolerant of some amount of lost data or state, where part of the system can simply drop some data, or can store data externally and recover from drops. If that part is isolated, you can consider using preemptible VMs for it to lower cost.

Development environments and disaster recovery are often good applications for creating infrastructure through automation technologies such as Deployment Manager or Terraform. In the development environment case, you can generate a clone of the production infrastructure solution for use by the development team. The test team needs an environment? Deploy another copy. Quality control needs an environment? Another copy. Auditing and compliance need to test backup and recovery? Create more deployments on demand.
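
Here is a minimal sketch of that idea as a Deployment Manager Python template; the resource names, machine type, zone, and image are illustrative assumptions, and Terraform would work just as well.

```python
# clone_env.py -- Deployment Manager Python template that stamps out N
# identical VMs per environment (dev, test, qa, ...). Values are illustrative.

def GenerateConfig(context):
    env = context.properties.get('environment', 'dev')
    count = context.properties.get('instanceCount', 2)
    zone = context.properties.get('zone', 'us-central1-a')

    resources = []
    for i in range(count):
        resources.append({
            'name': f'{env}-web-{i}',
            'type': 'compute.v1.instance',
            'properties': {
                'zone': zone,
                'machineType': f'zones/{zone}/machineTypes/e2-medium',
                'disks': [{
                    'deviceName': 'boot',
                    'type': 'PERSISTENT',
                    'boot': True,
                    'autoDelete': True,
                    'initializeParams': {
                        'sourceImage':
                            'projects/debian-cloud/global/images/family/debian-11',
                    },
                }],
                'networkInterfaces': [{'network': 'global/networks/default'}],
            },
        })
    return {'resources': resources}
```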

Configure_VM_systems.png

Selecting Compute Options

Compute.png

  • Compute Engine (Processing flexibility)
  • Kubernetes Engine (platform independence)
  • App Engine standard/flex (Code first)
  • Cloud Functions (Microservices)

Options in a table to know forward and backward:

Compute_options.png

Choosing a load balancer for Compute Engine

Compute_load_balancing.png

There are no load balancers per se (no load-balancing servers); load balancing is managed by the (software-defined) network.

There's no such thing as a load balancer in Google Cloud. Allow me to explain: in Google Cloud there is no load balancer device, because the function of distributing traffic is handled by the software-defined network. So there are several kinds of load balancing, but these are just features that are part of the network, not physical devices. Load balancing services are distinguished by the kind of traffic they direct: whether they're intended to balance traffic from one server to another inside Google Cloud, or whether they're intended to direct data arriving from the Internet. Also, load balancing can be global or applied to a specific region. Make sure you understand the basics of how geo-distributed load balancing works.

Choosing instance groups for Compute Engine

choosing_instance_group_for_compute_engine.png

  • Unmanaged instance groups collect different kinds of instances. Usually this is done for managing lift-and-shift of existing designs, and it's not recommended because it does not make the best use of the features available in the cloud.
  • Managed instance groups are all the same kind of instance, meaning that the type can be defined by an instance template and auto-scaling is available.

Zonal managed instance groups keep all the instances in the same zone, which is useful to provide a consistent network location when the instances must communicate with similar latency and avoid zone-to-zone transfer.

Regional managed instance groups distribute the instances across multiple zones within the region, increasing reliability. Instance groups should be managed instance groups to make effective use of the cloud.

Microservices, Containers, Data Processing, and IoT

video

Microservices

microservices.png

Microservices are not a panacea; they don't fit all cases. You can implement a microservices solution in App Engine, in Cloud Functions (for example with Node.js), or on Kubernetes. The platforms have overlapping coverage. Do you know when you might choose one platform over another for a microservices solution? Coordinating a transaction across stateless microservices is tricky: you have to store the state externally, then retrieve and use it in each function. Microservices architectures are commonly implemented in Cloud Functions or in App Engine.

Containers

containers.png

kubernetes.png

What you want to do is blend the approaches where it makes sense to the business. This is another case where what the client wants is what's most important to the design and on an exam it means being sensitive to and looking out for those trade-offs.

balance_resiliency_cost.png

Managed Services vs Serverless services

services_ManagedServices_ServerlessServices.png

IoT

Look for it at cloud.google.com/solutions

diagram_IoT_core_cloud_functons.png

The core assembly here is

  • IoT Core: Google Cloud IoT Core provides a fully managed service for device registration, authentication, authorization, metadata, and configuration.
  • Cloud Functions: IoT events and data can be sent to the cloud at a high rate and need to be processed quickly. Cloud Functions allow you to write custom logic that can be applied to each event as it arrives. This can be used to trigger alerts, filter invalid data, or invoke other APIs. Cloud Functions can operate on each published event individually (see the sketch after this list).
  • Cloud Pub/Sub: Google Cloud Pub/Sub provides a globally durable message ingestion service. Cloud Pub/Sub can act like a shock absorber and rate leveler for incoming data. It scales to handle data spikes that can occur when swarms of devices respond to events in the physical world, and it buffers these spikes to help isolate them from the applications monitoring the data.
  • Cloud Dataflow: Google Cloud Dataflow provides the open Apache Beam programming model as a managed service for processing data in multiple ways, including batch, extract-transform-load (ETL) patterns, and continuous streaming patterns. Cloud Dataflow performs well with high-volume data processing.
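
A minimal sketch of the ingest path in Python, assuming the google-cloud-pubsub client library and a Pub/Sub-triggered background Cloud Function; the project, topic, and field names are illustrative.

```python
import base64
import json

from google.cloud import pubsub_v1

# Gateway-side publish: Pub/Sub absorbs bursts of telemetry from devices.
def publish_reading(project_id: str, topic_id: str, reading: dict) -> None:
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, json.dumps(reading).encode("utf-8"))
    future.result()  # block until the message is accepted

# Pub/Sub-triggered background Cloud Function: filter invalid data per event.
def handle_reading(event, context):
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if payload.get("pressure_kpa", 0) <= 0:
        print(f"dropping invalid reading from {payload.get('device_id')}")
        return
    # ...trigger alerts, call other APIs, or forward to Dataflow/BigQuery here.
    print(f"accepted reading from {payload.get('device_id')}")
```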

Cloud Functions

Cloud Functions:

dummy.png

Containers & GKE

video

Google Kubernetes Engine:

containers_GKE.png

Between 2017 and 2018, the number of organizations using containers to develop and deploy their services doubled, and the trend shows no signs of slowing. For this reason, container knowledge and skill with Kubernetes are increasing in importance for the job of a cloud architect. Of course, if you need more of these skills for the job, you will also need them to prepare for the exam.

Docker is software that builds containers. Users supply application code and instructions called a Dockerfile, and Docker follows the instructions and assembles the code and dependencies into the container. A container can be run much as an application can run; however, it is a self-contained environment that can run on many platforms. Google Cloud offers a service called Cloud Build, which functions similarly to Docker: it accepts code and configuration and builds containers. Cloud Build offers many features and services that are geared towards professional development. It is designed to fit into a continuous integration/continuous deployment workflow, and to scale and handle many application developers working on, and continuously updating, a live global service.

If you had 100 developers sharing source files, you would need a system for managing them, for tracking them, versioning them, and enforcing a check-in, review, and approval process. Cloud Source Repositories is a cloud-based solution. If you were deploying hundreds of containers, you would not be keeping them to yourself; one of the reasons to use containers is to share them with others. So you need a way to manage and share them. This is the purpose of Container Registry, which has various integrations with continuous integration/continuous deployment services.

containers_Dockerfile.png

A Docker container is an image built in layers. Each layer is created by an instruction in the Dockerfile. All the layers except for the top one are read-only. The thin read-write layer at the top is where you can make changes to a running container; for example, if you needed to change a file, those changes would be written here. The layered design inside a container isolates functions. This is what makes the container stable and portable. Here are a few of the common Docker commands: the docker build command creates the container image, and the docker run command runs the container. There are other Docker commands that can help you list images, check the status of a running container, work with logs, or stop a running container.

containers_Docker_commands.png

You can run a container in Docker itself, as you saw with the docker run command. You can also run containers using Compute Engine. Compute Engine gives you the alternative to start up a virtual machine from a container, rather than from an OS image boot disk. You also have this option when creating an instance template, which means you can create managed instance groups from containers. App Engine supports containers as custom runtimes. The main difference between the App Engine standard environment and the App Engine flexible environment is that flexible hosts applications in Docker containers: it creates Docker containers and persists them in Container Registry. A container orchestrator is a full service for managing, running, and monitoring containers. Both the App Engine flexible environment and Google Kubernetes Engine are container orchestrators. Kubernetes is open standard software, so you can run a Kubernetes cluster in your data center. Google Kubernetes Engine provides Kubernetes as a managed service.

kubernetes_nodes_pods_cluster.png

A Kubernetes cluster is composed of nodes, which are a unit of hardware resources. Nodes in GKE are implemented as VMs in Compute Engine. Each node has pods. Pods are resource management units: a pod is how Kubernetes controls and manages resources needed by applications and how it executes code. Pods also give the system fine-grained control over scaling. Each pod hosts, manages, and runs one or more containers. The containers in a pod share networking and storage, so typically there is one container per pod, unless the containers hold closely related applications; for example, a second container might contain a logging system for the application in the first container. A pod can be moved from one node to another without reconfiguring or rebuilding anything. This design enables advanced controls and operations that give systems built on Kubernetes unique qualities.

kubernetes_nodes_master_node.png

Each cluster has a master node that determines what happens on the cluster. There are usually at least three of them for availability, and they can be located across zones. A Kubernetes job makes changes to the cluster. For example, a pod YAML file provides the information to start up and run a pod on a node. If for some reason a pod stops running or a node is lost, the pod will not automatically be replaced. The deployment YAML tells Kubernetes how many pods you want running, so the Kubernetes deployment is what keeps a number of pods running. The deployment YAML also defines a replica set, which specifies how many copies of a container you want running. The Kubernetes scheduler determines on which node and in which pod the replica containers are to be run.
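
To get a feel for how a deployment's desired replica count can be inspected and changed programmatically, here is a sketch using the official kubernetes Python client; it assumes a reachable cluster, a kubeconfig, and an existing Deployment named "frontend", all of which are illustrative assumptions.

```python
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside a pod
apps = client.AppsV1Api()

# Inspect the desired vs. available replica counts that the Deployment maintains.
dep = apps.read_namespaced_deployment(name="frontend", namespace="default")
print(dep.spec.replicas, dep.status.available_replicas)

# Scale the Deployment: the controller creates or removes pods to converge on
# the new replica count, and the scheduler places them on nodes.
dep.spec.replicas = 5
apps.patch_namespaced_deployment(name="frontend", namespace="default", body=dep)
```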

kubernetes_AB_testing.png

One of the advanced things that Kubernetes deployments allow you to do is roll out software to some pods and not others. So you can actually keep version A in production on most of the pods and try out version B with a sample group on other pods. This is called A/B testing, and it is great because you can test the new software in a real production environment without risking the integrity of the entire service. Another thing you can do with deployments is a rolling update. Basically, you load up the new software in a replacement pod, switch the load to the new pod, and turn down the old one. This allows you to perform a controlled and gradual rollout of the new software across the service. If something goes wrong, you can detect the problem and roll back to the previous software. Really, if you are going to run an enterprise production service, you will need these kinds of operations. That is one major reason to adopt Kubernetes. There are a number of subjects that were not covered in this brief overview, for example: how containers running in the same pod can share resources, how containers running in different pods can communicate, and how networking is handled between a node's IP and the applications. These subjects and more are covered in the course Getting Started with Google Kubernetes Engine, or you can find more information in the online documentation.

BigQuery

BigQuery:

Practice Exam #2

video

Practice: Network features

practice_Networking.png

Which network feature could help a company meet its goals to expand service to Asia while reducing latency?

  • A. HTTP/TCP load balancing
  • B. Network TCP/UDP load balancing
  • C. Cloud Router
  • D. Cloud Content Delivery Network (CDN)

The answer is D, Cloud Content Delivery Network (CDN). CDN will enable a company to expand its online presence with a single IP address and global reach, leveraging Google's global network. While global load balancing is part of CDN, it won't help reduce latency to customers in Asia, whereas the cache service in CDN will do that. Network load balancing is designed to load balance from one GCP service to another to scale back end services, which is called east-west communications. North-south communications, where one part is in the GCP network and the other part is external, requires a different kind of load balancing. Cloud Router uses BGP to discover changes in network topology in a remote network, so it doesn't address latency.

Practice Storage

How can you minimize the cost of storing security video files that are processed repeatedly for 30 days?

  • A. Regional class, then move to Coldline after 30 days
  • B. Nearline class, then move to Coldline after 30 days
  • C. Regional class, then move to Nearline after 30 days
  • D. Multi-Regional class, then move to Coldline after 30 days

The answer is A, Regional class, then move to Coldline after 30 days. The question here is answered by understanding the purpose of each of the storage classes and, in general, how they're priced. One thing to remember is that Coldline is really not intended to be read more than once a year; it's cheap to write data to it, but much more expensive to read it back, compared to the other classes of storage. So the correct answer is A: local usage in a regional bucket for initial use during the month, then Coldline because it's unlikely to be read after that. This is often the case when data is used during the month and archived for compliance and record keeping after. The other options will not be cost effective.
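
A minimal sketch of how such a lifecycle rule could be expressed with the google-cloud-storage Python client; the bucket name is illustrative, and the library, credentials, and existing bucket are assumed.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("security-video-archive")

# Object Lifecycle Management: change storage class to Coldline after 30 days,
# matching answer A -- frequent reads in the first month, archival afterwards.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=30)
bucket.patch()

for rule in bucket.lifecycle_rules:
    print(rule)
```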

practice_Storage.png

practice_Storage_classes.png

  • Lab notes: PCA Prep—Google Cloud Essential Skills

Problem solving is the key skill of the job.

During the exam you will be reading a question or a case and the problem it is describing should start to inform and define a solution in your mind. The faster and more clearly you can understand the requirements and identify elements that might be part of the solution, the better the chances are that you will understand the question correctly and be able to identify the correct answer. This is what we mean by the best preparation for the exam is to be prepared for the job.

Practice problem solving

There are sample/practice exam-type questions throughout this course and a simulated exam quiz at the end. Don't just read through these or watch them for information. They are not there to teach you the right answer to a particular question. They are there to give you the opportunity to practice using your problem-solving skills, which you will need on the job and for the exam.

Practice evaluating your own confidence about information, an answer, or a solution.

Another skill that will help you on the job and on the exam is being able to evaluate your own confidence about your knowledge. People often assume that they either know something or they don't know it, they either recall it or they don't. But in fact, recall is much less binary than that. What we don't often get the chance to practice is evaluating how well we know something or how certain we are of answers or solutions.

This is really important. Because if you know that you are not certain of something, then you can use that as a guide to help you decide what to study, how to prioritize your study time, and how much effort to apply.

This means that you learn more from missing a question and answering it incorrectly in this course than from answering it correctly. When you miss a question, it is an indicator to take note that this is something you might want to study to better prepare for the exam.

How much confidence do you want? I will often dig into the documentation and not only prove to myself exactly why the correct answer is correct. But I will also continue studying until I know with absolute certainty exactly why each incorrect answer is wrong. I want to be able to state the reason it is not correct. Because that is how I know that I have studied enough.

Preparing for Optimizing and Operating

This module covers the sections of the exam guide outline on optimizing and operating. You can optimize a solution for many things, such as:

  • reliability,
  • efficiency,
  • low cost,
  • high performance,
  • or security.

Security and compliance

Security is a broad term. It includes privacy, authentication and authorization, and identity and access management. It could include intrusion detection, attack mitigation, resilience and recovery. So security really appears across the documentation and not in just one place.

Compliance is about meeting some external guideline or standard.

Practice Case Study Analysis #3

video

Business problem

The following is a case study that involves a financial services company. This vertical often involves private information and transactions, so the security requirements are high. Also, these kinds of companies often need a plan for audits to meet compliance requirements for certifications. This customer had a common FinServ requirement: they did not want any data to traverse the public Internet, for obvious reasons. So they had a security strategy that included a technical requirement to use private APIs to access Google Cloud resources; they saw this as a fundamental need of their security strategy. Additionally, they wanted to know about the cloud provider's security standard certifications and what the provider did to stay current, because they were concerned that the provider might lose a certification that they were relying on for business. In short, a large company wanted to improve their security posture, a common FinServ requirement.

practice_case_Study_3_requirements.png

  • Security. Business requirement: data cannot traverse the public Internet. Technical requirement: must have private API access to GCP services, as a good security practice and to minimize data exfiltration.
  • Compliance. Business requirement: the cloud provider must earn the trust of the business. How does Google Cloud maintain the latest standards around security, availability, process integrity, privacy, and confidentiality?

The first thing we did was make sure all access to GCP was through secure methods, including SSL, VPN, Interconnect, and private API access. We decided to use a new feature that was in alpha at the time, called VPC Service Controls. This enables a security perimeter: for example, BigQuery could be placed inside a security perimeter and then could only be accessed at a private endpoint. And then there were standards and compliance such as ISO and SOC; we provided these to the customer, and they needed to sign agreements to be covered by Google's guarantees about these standards.

Identify Technical Watchpoints

We mapped those technical requirements to Google Cloud products and services.

  • Security: ensure all traffic to GCP is through secure methods, such as SSL/TLS, VPN, Interconnect, and private APIs and endpoints.
  • Compliance: Google Cloud has standards, regulations, and certifications that would meet their compliance requirements and help earn their trust in our platform.

practice_case_Study_3_solutions.png

Identify Technical solution/implementation

And this is how we implemented that technical requirement:

  • VPC Service Controls to secure the GCP APIs.

We restricted access to the user's GCP resources based on the Google Cloud Virtual Network or IP range, and we restricted the set of Google APIs and GCP resources accessible from the user's Google Cloud Virtual Network.

Standards, regulations, and certifications: products regularly undergo independent verification of their security, privacy, and compliance controls, earning certifications such as ISO 27001, 27017, and 27018 and SOC 1, 2, and 3. An interesting point about both security and compliance is that it's a shared responsibility model. So although we provided secure access and layered protection, the customer needed to use IAM to manage access for its employees and implement secure practices in its procedures. Also, the standards compliance covers the cloud resources but not the customer's application, so they may need to take extra steps to ensure that the overall solution is compliant.

practice_case_Study_3_implementation.png

practice_case_Study_3.png

Preparing Designing for Security and Compliance

Preparing Designing for Security

video

  1. One key to securing access is to request and establish groups that represent roles.
  2. Then apply the permissions to the groups, and allow the people in the organization who manage identity to assign membership to the groups.

This creates a clean interface between permission management on the cloud side, and group membership on the personnel IT side.

Another key to security is to craft security permissions. The standard roles are defined for the most common use cases, but you might want to derive more granular and restricted roles by customizing them. Service accounts are a great way to separate system components and establish secure communications between components. A bastion host is a way to leverage a service account: for risky and uncommon actions, make the user or admin start up and log into a bastion host. From there they can borrow the service account assigned to the host to perform restricted functions.

One benefit is that the login process generates logs for accountability.

security_best_practices.png

A policy is set on a resource, and each policy contains a set of roles and role members.

Resources inherit policies from parents. So a policy can be set on a resource, for example a service, and another policy can be set on a parent, such as the project that contains that service. The final policy is the union of the parent policy and the resource policy.

What happens when these two policies are in conflict?

What if the policy on the resource only gives access to a single Cloud Storage bucket, and restricts access to all other buckets? However, at the project level, a rule exists that grants access to all buckets in the project. Which rule wins? The more restrictive rule on the resource, or the more general rule on the project?

If the parent policy is less restrictive, it overrides a more restrictive resource policy.

So, in this case, the project policy wins. Folders map well to organization structure. It's a way to isolate organizations or users or products while still having them share billing and corporate resources.
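
A pure-Python illustration of that union rule (the roles and members are made up):

```python
# The effective policy for a resource is the union of its own bindings and
# every ancestor's bindings, so a broader parent grant cannot be taken away
# lower down in the hierarchy.
project_policy = {
    "roles/storage.objectViewer": {"group:analysts@example.com"},  # all buckets
}
bucket_policy = {
    "roles/storage.objectViewer": {"user:auditor@example.com"},    # one bucket only
}

def effective(*policies):
    merged = {}
    for policy in policies:
        for role, members in policy.items():
            merged.setdefault(role, set()).update(members)
    return merged

print(effective(project_policy, bucket_policy))
# The analysts group keeps access via the project-level grant even though the
# bucket's own policy never mentions it -- the less restrictive parent "wins".
```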

Commit a security checklist to memory. Sometimes just running down a list will rapidly identify a solution.

security_policies.png

security_IAM_list.png

Network & Security reading

Network & Security reading:

  • why do you bother locking doors?
  • Shared VPC: keep others out by locking you in!

security_shareVPC_example.png

  • VPC peering keeps communications private and on topic

security_VPCpeering_example.png

  • Should you use Shared VPC or VPC peering?

security_VPCpeering_or_sharedVPC.png

  • remove external IPs using Private Google Access

security_remove_externalIP.png

  • Cloud NAT provides internet access to private instances

security_cloudNAT.png

  • Cloud Armor works with HTTP(S) load balancing

security_cloudArmor.png

Designing for Legal Compliance

types_legal_compliance.png

What are the two most common compliance areas? Privacy regulations such as HIPAA and GDPR, and commercial business standards such as PCI DSS. The Google network has layers of protection: each layer protects and complements the next internal layer. The main thing to know is that Google handles security up to a point; after that, the security is up to you. So you need to know where your responsibilities begin.

Google_security_up_to_a_point.png

  • secure VPC
  • Cloud Interconnect
  • 3rd-party Virtual Appliances
  • Google Cloud load balancing
  • Google Network
  • 3rd-party DDoS defense

MAP_Google_security.png

Here are some key concepts:

  • Cloud Armor,
  • Cloud Load Balancing,
  • Cloud Firewall Rules,
  • Service Accounts,
  • separation into front-end and back-end,
  • isolation of resources using separate service accounts between devices.

Because of pervasive availability of firewall rules, you don't have to install a router in the network at a particular location to get firewall protection. That means you can layer the firewalls as shown in this example.

Because of pervasive support for Service Accounts you can lock down connections between components.

When faced with a security question on an exam or in practice, determine which of the specific technologies or services is being discussed: authentication or encryption, for example. Then determine exactly what the goals are for sufficient security. Is it deterrence? Is it meeting a standard for compliance? Is the goal to eliminate a particular risk or vulnerability? This will help you define the scope of a solution, whether it's on an exam or in a real-world application.

dummy.png

GCP provides several encryption options. Customer-managed encryption keys (CMEK) use Cloud KMS. When you use Cloud Dataproc, cluster and job data is stored on persistent disks associated with the Compute Engine VMs in your cluster and in a Cloud Storage bucket. The persistent disk and bucket data is encrypted using a Google-generated data encryption key (DEK) and a key encryption key (KEK). The CMEK feature allows you to create, use, and revoke the key encryption key (the KEK); Google still controls the data encryption key (the DEK). Default encryption at rest uses the Key Management Service (KMS) to generate KEKs and DEKs. The Key Management Service allows you to generate AES-256 keys, and you can use these keys off-cloud. The service also handles key rotation, and when a key is destroyed there is a 24-hour delay before final deletion.
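
The DEK/KEK relationship is easier to see in code. The following stand-alone sketch uses the third-party cryptography package purely to illustrate envelope encryption; with CMEK the KEK lives in Cloud KMS rather than in your process, and Google generates the DEK.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kek = AESGCM.generate_key(bit_length=256)   # key encryption key (held by KMS)
dek = AESGCM.generate_key(bit_length=256)   # data encryption key (per object)

# Encrypt the data with the DEK, then wrap (encrypt) the DEK with the KEK.
data_nonce, wrap_nonce = os.urandom(12), os.urandom(12)
ciphertext = AESGCM(dek).encrypt(data_nonce, b"sensitive record", None)
wrapped_dek = AESGCM(kek).encrypt(wrap_nonce, dek, None)

# To read the data: unwrap the DEK with the KEK, then decrypt the data.
recovered_dek = AESGCM(kek).decrypt(wrap_nonce, wrapped_dek, None)
print(AESGCM(recovered_dek).decrypt(data_nonce, ciphertext, None))
```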

MAP_Google_security_encryption.png

Practice Exam #3

video

Which Cloud IAM roles should be granted to security auditors requiring visibility across all projects?

  • A. Org viewer, project owner.
  • B. Org viewer, project viewer.
  • C. Org admin, project browser.
  • Or D, Project owner, network admin.

And the answer is B, Org viewer, project viewer. This solution gives read-only access across the entire company. The other options allow changes that should not be permitted.

Dress4Win security has decided to standardize on AES256 for storage device encryption. Which strategy should be used with Compute Engine instances?

  • A. Select SSDs rather than HDDs to ensure AES256 encryption.
  • B. Use the Linux dm-crypt tool for whole-disk encryption.
  • C. Use Customer-Supplied Encryption Keys (CSEK).
  • D. Use OpenSSL for AES256 file encryption.

The answer is A, select SSDs rather than HDDs to ensure AES256 encryption. Selection of disk type determines the default method for whole-disk encryption: HDDs use AES128 and SSDs use AES256. In addition to the storage-system-level encryption described above, in most cases data is also encrypted at the storage device level, with at least AES128 for hard disk drives (HDD) and AES256 for newer solid state drives (SSD), using a separate device-level key which is different from the key used to encrypt the data at the storage level. As older devices are replaced, AES256 alone will be used for device-level encryption.

Practice Case Study Analysis #4

video

Case Study #4

This section covers analyzing and optimizing technical and business processes in the exam guide outline. Let's start with a case that will illustrate business requirements.

analyze_business_requirements.png

Identify technical watchpoints

practice_identify_business_requirements.png

Designing a solution infrastructure that meets technical requirements

practice_identify_requirements_solution.png

practice_case_4.png

Preparing for Analyzing and defining technical processes

video

analyze_tecnical_processes.png

analyze_tecnical_processes_test_environment.png

analyze_tecnical_processes_pricing.png

analyze_tecnical_processes_pricing_discounts.png

analyze_tecnical_processes_pricing_disk_costs.png

analyze_tecnical_processes_pricing_network.png

Network & Performance

Network & Performance

network_performance.png

Analyzing and defining business processes

video

analyze_business.png

Analyzing and defining business processes is covered in our design and process courses.

Let's expand on a couple of these issues. In the change management outline item, there's a tip that says quality is a process, not a product. As a working cloud architect, you'll almost never have a job where you design and implement the technical solution and then you're done. Instead, you'll be required to stay on the project for a period after implementation and launch, to make sure that the solution continues to run and stabilizes. Anticipating that, you'll want to develop process checks and operational knobs to ensure that the solution can be monitored and adjusted during the stabilization period.

  • instance overhead estimation
  • persistent disk estimation
  • network capacity estimation
  • estimate workload (pipeline, batching, ...)

Developing (testing) procedures to test resilience

video

In this section, we'll discuss developing testing procedures. You can't test everything, so you need to consider what items can act as indicators. How do you prove that the solution is working properly? How do you know if the solution is highly available or scalable?

failover_Design.png

scale_out_decision_process.png

Now, here's a tip. Consider using Stackdriver custom metrics for auto-scaling. The reason is that CPU utilization is rarely a good measure of customer experience. A custom metric can enable auto-scaling on a more meaningful value.

For example, a game service might scale with the number of players, which might be more directly related to application performance than something like CPU utilization.
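
One way this could be wired up, sketched below with placeholder names and an assumed custom metric (custom.googleapis.com/game/active_players) that the game servers already export to Stackdriver Monitoring, is to point a managed instance group's autoscaler at that metric rather than at CPU:

```
# Scale the managed instance group to keep roughly 100 active players per
# instance, between 2 and 50 replicas.
gcloud compute instance-groups managed set-autoscaling game-frontend-mig \
    --zone=us-central1-a \
    --min-num-replicas=2 \
    --max-num-replicas=50 \
    --custom-metric-utilization metric=custom.googleapis.com/game/active_players,utilization-target=100,utilization-target-type=GAUGE
```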

Practice Exam #4

video

Which of Dress4Win's requirements will Stackdriver dashboards, metrics, and reporting satisfy?

  • A. Improve security by defining and adhering to a set of security and identity and access management (IAM) best practices for cloud.
  • B. Encrypt data on the wire and at rest.
  • C. Analyze and optimize architecture for performance in the cloud.
  • D. Support multiple VPN connections between the production data center and the cloud environment.

C is the correct answer: analyze and optimize architecture for performance in the cloud. Stackdriver metrics will help analyze and optimize performance in the cloud, because Stackdriver can be used to gather standard metrics, and custom metrics if needed, to capture the specific behavior of the applications being migrated. Stackdriver does not necessarily improve security.

How can a company connect cloud applications to an Oracle database in its data center to meet its business requirement of up to 10 Gbps of transaction traffic with an SLA?

  • A. Implement a high-throughput Cloud VPN connection.
  • B. Cloud Router with VPN.
  • C. Dedicated Interconnect.
  • D. Partner Interconnect.

The correct answer is D, Partner Interconnect. Partner Interconnect is good for up to 10 Gbps and provides an SLA. To differentiate the options, consider their support for speed and volume of data. Cloud VPN is useful for low-volume connections. Partner Interconnect is useful for data rates up to 10 Gbps. Dedicated Interconnect is useful for data rates from 10 Gbps to 80 Gbps. The Cloud VPN SLA covers the availability of the VPN service, not the availability of the public internet. Business internet service level agreements (SLAs) from ISPs are commonly between 99 percent and 99.5 percent for a dedicated line. Therefore, even though the Cloud VPN service is available 99.9 percent of the time, the communication it relies on will be down between one-half percent and one percent of the time. That doesn't meet the availability requirements.
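
As a rough sketch of what establishing Partner Interconnect involves (the attachment name, region, router, and availability domain below are placeholders, and an existing Cloud Router is assumed):

```
# Create a Partner Interconnect VLAN attachment against a Cloud Router.
gcloud compute interconnects attachments partner create onprem-attachment \
    --region=us-central1 \
    --router=onprem-router \
    --edge-availability-domain=availability-domain-1

# The attachment includes a pairing key that you hand to the connectivity
# partner so they can complete provisioning on their side.
gcloud compute interconnects attachments describe onprem-attachment \
    --region=us-central1
```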

Practice Case Study analysis #5

video

This case involves Finserv, which is how people in the industry refer to financial services. There are two industries where security comes up as a priority in nearly every transaction. Can you guess them? Financial services is one because the transactions involved are both private and involve the exchange of value. The other is the healthcare industry, where a lot of the information is what they call PII or personally identifiable information. Let's see how security is handled in this financial services example.

Case Study #5

case_study_04-fin-serv_industry.png

A lot of the customers we see are in the enterprise space, so their needs are very similar. This example comes from a financial services company, and we often see similar requirements among Finserv companies. This Finserv customer had an interesting business requirement: encryption in transit and at rest for all developer operations.

Follow Google Best Practices:

  • All keys must be managed by the company; they wanted to own the keys. The real trick here is that the structure and solution had to be put into production all at once. It couldn't be built into production in parts; it had to be fully working when it went into production. That caused us to think about which parts were inherent and which parts we could automate.

Identify technical watchpoints

case_study_04-fin-serv_tech_analysis.png

So, we ended up using a Jenkins Pipeline and Deployment Manager Templates for parts of this automation.

We mapped that to technical requirements like this:

  • Use Google authentication.
  • No public IP access unless through a bastion host.
  • No Operations team access to production environment, that means NoOps. Everything is automated.
  • Minimize downloaded keys; keys are accounted for via business logic in the application. All the Google APIs are encrypted in transit and authenticated, so that requirement was inherent and automatic.
  • The production team needed operations access but without handing them the keys. So, what we did was implement all operations in deployment pipelines using Jenkins and Deployment Manager. The business logic was implemented in Python in the Deployment Manager templates.

Designing a solution infrastructure that meets technical requirements

case_study_04-fin-serv_solution.png

This is how we implemented those technical requirements. All Google APIs are encrypted in transit and authenticated. The Operations team's access to production runs entirely through deployment pipelines, via Jenkins and Deployment Manager, with the business logic in Python Deployment Manager templates. The Cloud SDK was not installed on local machines; Cloud Shell ensures that no keys are downloaded. Service account keys, when needed for off-GCP clients, are managed via the deployment pipelines. There are two kinds of operations actions: on-GCP actions and off-GCP actions. For on-GCP actions, we didn't install the Cloud SDK on local machines; instead, we set teams up to use Cloud Shell, which ensured that no keys were downloaded. For off-GCP actions, the service account keys were managed via the deployment pipelines. Any time there was a need for off-GCP access, the clients were managed via the deployment pipeline, which means there is full audit control: records of those keys, who had access to them, and when and where they were used.
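
To make this concrete, a deployment step in such a pipeline might boil down to a single Cloud SDK call that Jenkins runs under a service account, so no human ever handles keys directly. This is a hypothetical sketch; the deployment name, template file, and properties are illustrative, not the customer's actual configuration.

```
# Run inside a pipeline stage (e.g., Jenkins) with the pipeline's service
# account; the business logic lives in the Python template.
gcloud deployment-manager deployments create finserv-prod-env \
    --template=prod_environment.py \
    --properties=environment:prod,region:us-central1
```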

case_study_05.png

Preparing for Advising

Most of the items in the exam outline have been covered already in another context.

Advising development/operation teams

The first rule of testing is that you can't test everything, so you need to make some decisions. Unit testing focuses on individual functional units, for example, exercising an API. In some development environments, it's common for the original software developer to provide a testing application that exercises the API and validates that it's working as expected. Integration testing has to do with putting parts together and testing them as an assembly. Sometimes the individual parts can pass unit tests because each is working as designed, but when the units are assembled they may not be compatible. You can also discover timing issues, called race conditions, during integration testing.

advising.png

One good piece of advice is to create a launch checklist.

In this example, there are:

  • dependencies
  • capacities
  • single points of failure
  • security and access
  • and a phased roll out plan.

With all those items to be checked, and some of them being very complex, it's easy to see the value of using an organized approach to ensuring that everything's ready. General advice about release management? Well, automate everything you can. Also, instead of creating a process with a resource bottleneck, which can slow down releases, consider implementing a self-service approach: let the lead developers or the product managers perform the release using the tools. Reliability and consistency are the keys to making releases work well. Also, implement access control over critical release features and processes. For example, a team lead or a tech lead might be a member of the release group and have special access.

launch_checklist.png

When we think about capacity planning for launch, it's common to create a moon shot event where everything has to come together perfectly at a single moment for the launch to succeed. Consider instead using a phased approach, by launching first to a smaller market. The service can generate feedback and even warn of issues that might not scale in subsequent phases.

phased_launch_approach.png

A classic example of this was when the first Pokemon Go game was launched. It was launched first in Japan. The game was so popular that it had scaling issues because the demand was much greater than the anticipated demand for which the service was designed. Fortunately, launches in Europe, the US and other locations were separated by a few days. Staging the launch gave the team the time needed to understand the scaling issue and redesign and reimplement the service before its second launch.

I'm pretty sure you know this already. There are three ways to interact with Google Cloud. The first is the GCP Console. The second is Cloud Shell, a tiny virtual machine that's started up inside the Console; one thing that makes Cloud Shell useful is that it's authorized in the project, and it has the Cloud SDK tools, including gcloud, gsutil, and bq, installed. The third is to install the Cloud SDK outside of Google Cloud, on a local computer or VM.

Practice Exam #5

video

How should you implement back-out/rollback for a website with hundreds of VMs? The site has frequent critical updates.

  • A. Create a Nearline copy of static data in Cloud Storage.
  • B. Create a snapshot of each VM prior to the update in case of failure.
  • C. Use managed instance groups with the rolling-action start-update command when starting a rolling update.
  • D. Only deploy changes using Deployment Manager templates.

The correct answer is C: use managed instance groups with the "rolling-action start-update" command when starting a rolling update. This allows Compute Engine to handle the updates and keeps the VMs easy to manage; a load-balanced website with hundreds of VMs is most likely already using a managed instance group. Now here's a tip: did you know about this command? This is an example of the level of detail you should be familiar with. If you had studied managed instance group features, you would have at least seen the "rolling-action start-update" option and recognized it in the question. A Nearline copy would have been unreliable because once the copy has been overwritten, you can't roll it back. Creating VM snapshots could work, but it's not an efficient way to back up large amounts of data; also, the bigger the data, the longer the backup takes, which could impact production. Using Deployment Manager templates runs the risk of version conflicts. So, managed instance group features are the most efficient choice for this situation.
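
For reference, the command looks roughly like this; the group name, zone, template, and surge settings are placeholders:

```
# Roll the managed instance group to a new instance template, letting
# Compute Engine replace VMs gradually. max-surge / max-unavailable
# control how aggressively the update proceeds.
gcloud compute instance-groups managed rolling-action start-update web-mig \
    --zone=us-central1-a \
    --version=template=web-template-v2 \
    --max-surge=3 \
    --max-unavailable=0
```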

A car reservation system has long-running transactions, which one of the following deployment methods should be avoided?

  • Execute canary releases.
  • Perform A/B testing prior to release.
  • Introduce a blue-green deployment model.
  • Introduce a pipeline deployment model.

The answer is to introduce a blue-green deployment model; that is the method to avoid here. In a blue-green deployment, the load balancer is switched from the current, known-good environment to the new environment, and switching it back is a fast way to roll back if there's a problem during a release. However, long-running transactions would be disrupted by that switch. This question requires you to know a little about A/B testing, a little about blue-green deployments, and a little about canary releases. They're covered lightly in our courses, but it would be advisable to study them separately since they are not Google-specific methods. The second link discusses long-running connections and how to support them.
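
For context, in a Kubernetes-based blue-green setup the switch is often just a change to a Service's label selector, which is why it is nearly instantaneous and also why in-flight, long-running transactions on the old environment get cut off. The service name and labels below are hypothetical:

```
# Both environments (version=blue and version=green) are running; the
# Service currently selects version=blue. Repointing the selector moves
# all new traffic at once -- and abandons connections to the old pods.
kubectl patch service reservations \
  -p '{"spec":{"selector":{"app":"reservations","version":"green"}}}'
```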

Preparing for Reliability

Ensuring solution and operations reliability

video

Ensuring solution and operations reliability. How do you ensure that a solution is reliable? Part of it occurs in the design. Making sure that common changes like increased traffic are handled in elastic ways. However, part of it is also in planning to monitor the service and to notice and respond to unplanned events. Some of those activities require human intelligence. For this reason, operations reliability spans both the technical and the procedural domains.

SRE_google_approach_to_devops.png

Site Reliability Engineering, or SRE, is Google's approach to DevOps. It's a very comprehensive system that involves changing the culture around how maintenance occurs. One central idea is the division of aspects of operations into separate layers for clarity. Here's a tip: you ought to know something about each of these layers, and most importantly, you should be able to distinguish between them. For example, monitoring is not incident response. They're related; do you know what feature relates them? It's alerts. A Stackdriver alert is triggered by monitoring and begins incident response, which is composed mainly of procedures.

evaluate_quality.png

Qualities are often where our goals start, but figuring out how to measure them quantitatively enables data-driven operations. It can be difficult to figure out exactly what to measure because sometimes what's easily measured is not a good indicator of customer interests.

types_monitoring.png

Speaking of alerts, at Google we have the concept of alerting for the right reason. Often, alerts are designed to signify some metric passing some limit. But the question is whether that metric or trigger is something the customer cares about or not. We need to alert on some technical measures, but if there's something that is directly causing the customer frustration and upset, that should also be an alert, or perhaps replace a more technical alert.

Make sure you know the difference between blackbox monitoring and whitebox monitoring; they are frequently misunderstood. In the cloud architect context, the difference has to do with the assumptions you can make when designing your monitoring framework. In blackbox monitoring, you're not supposed to know or think about the inner workings of the application. All you can see is the user interface or the API interface, so the only assumptions you're allowed to make have to do with those interactions. Blackbox monitoring is very good for validating user experience; you end up monitoring things like latency between request and response.

In whitebox monitoring, the application is assumed to be known to you: its inner workings are transparent, so you can use that special knowledge when defining the tests. A good example would be if you knew that under certain conditions a critical resource will get oversubscribed and you've designed the system for resiliency. In that case, you might flood the interface to trigger that state, as if the service were under attack, to see if the resiliency worked as expected. That's whitebox monitoring, where the tests can be focused on inner workings and not just the UI. In practice, of course, you need both kinds.
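
To make the blackbox side concrete, a trivial probe only touches the public endpoint and records what a user would see; the URL below is a placeholder:

```
# Blackbox check: measure end-to-end latency and HTTP status from outside,
# with no knowledge of the application's internals.
while true; do
  curl -s -o /dev/null \
       -w "status=%{http_code} latency_s=%{time_total}\n" \
       https://example.com/healthz
  sleep 60
done
```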

example_metrics.png

Here's an example, CPU utilization may or may not indicate user satisfaction. Round-trip delay or frequency of request errors might be a better measure of the user's experience. What metrics are you using? Can you define metrics that relate directly to user experience and service objectives? What are the watermarks or alert levels at which human processes are engaged? How are you setting those values? When do they need to be revisited and updated? How do you know they're related to important events?

stackdriver_benefits.png

Know how to use Trace and Debug. The slide above lists examples of other tools that Stackdriver replaces. Note that it's not just the collection of alternative tools that's the issue, but how you use them together. The individual tools are not integrated or designed to work together, so a lot of manual procedures, translation, and massaging of data are required to use them together. With Stackdriver, the integration is by design, so that work disappears. Stackdriver is also multi-cloud, able to manage projects across GCP and AWS.

Another useful idea is that:

people don't plan to fail, they fail to plan.

Another way of saying this is, the only time we have to prepare for emergencies is before they happen. Once the emergency is occurring, it's too late to prepare. You can design a great technical solution, but if it doesn't include human processes, then it might not be adaptive and resilient. Easy buttons are tools and processes that automate common actions. A playbook is a list of what to do when. So, here's a general rule; for every alert you should have a play in the playbook.

diff_dashboard_response.png

What are the differences between a dashboard, an alert, and incident response? A dashboard is a display for monitoring a system; it's commonly tailored to the application. An alert occurs when a condition is met, such as a metric crossing above a particular value for a given duration. The alert is the notification; it could be just a warning, or it could be a notification of an incident that needs to be handled immediately. Incident response consists of the steps you would take when a problem occurs. This might be written up in a playbook.

Find a lab, such as a Qwiklabs lab, that uses Logging, Trace, and Debug to identify and solve an application problem. This will give you a sense of the value and of how these components work together. There's a lab like this in the Architecting with GCP: Infrastructure class.

blamelessness.png

Google's approach focuses on transparency, on involving the customer in the solution and blamelessness. Assigning blame establishes root cause with a person or an organization instead of getting to the real technical or procedural issue so that it can be fixed. If blame has been assigned, there's a high likelihood that the process has been prematurely suspended without really addressing the problem.

human_processes.png

What are the people supposed to do? What decisions or actions are they supposed to make or take? Are these documented? As mentioned, the metrics are not sufficient without the meeting to review the metrics, to evaluate them, and to make decisions and take actions. In those cases where timing is critical, you'll want a playbook and easy buttons supporting automation to increase the speed and consistency of incident response. Here's another tip: when something goes wrong with a cloud resource, give yourself or your team a limited period of time to solve it. For example, if a VM starts behaving incorrectly, see if it's something that's easily fixed. If not, set the VM aside and replace it, and perform your diagnostics and debugging after the instance has been replaced.
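
With a managed instance group, "setting the VM aside" can be done by abandoning the instance, which removes it from the group without deleting it, so it stays available for diagnosis. The group, instance, and size values below are placeholders:

```
# Remove the misbehaving VM from the group without deleting it, so it can
# be inspected later. Abandoning shrinks the group's target size by one.
gcloud compute instance-groups managed abandon-instances web-mig \
    --zone=us-central1-a \
    --instances=web-mig-x7kq

# Restore serving capacity (or let an autoscaler do this automatically).
gcloud compute instance-groups managed resize web-mig \
    --zone=us-central1-a \
    --size=10
```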

Workflow Orchestration

Workflow Orchestration reading:

  • automate infrastructure or workflow

automate_infrastructure.png

  • Create data infrastructure when the workflow requires it

when_needed_infrastructure.png

  • Cloud Composer: extensible workflow orchestration

composer.png
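
For orientation, standing up a Composer environment and handing it a workflow can be sketched with two commands; the environment name, location, and DAG file are placeholders:

```
# Create a Cloud Composer (managed Apache Airflow) environment.
gcloud composer environments create example-env \
    --location=us-central1

# Upload a DAG definition; Composer picks it up from the environment's
# Cloud Storage dags/ folder and starts scheduling it.
gcloud composer environments storage dags import \
    --environment=example-env \
    --location=us-central1 \
    --source=my_workflow_dag.py
```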

Monitoring, Alerting, and Uptime

Monitoring, Alerting, and Uptime reading:

monitoring.png

alerting.png

uptime_checks.png

Practice Case Study analysis #6

video

Case Study #6

case_study_6_system_update.png

A customer had this interesting business requirement: the back office system needs to support frequent updates, and it needs to be available, especially between 6:00 AM and 6:00 PM. A failure in one part of the back office system shouldn't bring down the entire system. The customer wants to re-architect the system so that they don't have to bring down the entire system when doing an update.

Identify technical watchpoints

So, we mapped that to technical requirements like this: microservices, breaking apart the back office system into independent services; a standard way for teams to publish logs and metrics for their services; and a standard way for services to be rolled out. This use case was a natural fit for microservices. The customer knew that when they told development groups that they would be developing their own microservices, they needed standards for reliability and scalability, and they wanted common ways to monitor the applications.

case_study_6_tech_analysis_microservices.png

Designing a solution infrastructure that meets technical requirements

case_study_6_tech_analysis_microservices_implementation.png

This is how we implemented those technical requirements: Google Kubernetes Engine, with microservices deployed into a shared cluster; surging rolling deployments using the Kubernetes Deployment resource; and Stackdriver custom metrics, with a wrapper library around the Stackdriver client libraries that let teams expose both common and custom metrics. Metrics were exposed in the Prometheus format, scraped from the services' APIs, and sent to Stackdriver, where they were surfaced through dashboards. Because the teams used custom metrics in Stackdriver, they were able to monitor and scale their microservices based on those metrics.
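
As an illustration of the "surging" rolling deployment piece, the deployment name, container name, image, and surge values below are hypothetical, not the customer's actual configuration:

```
# Configure a surging rolling update: allow 2 extra pods above the desired
# count during the rollout, with no pods taken out of service early.
kubectl patch deployment billing-service \
  -p '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":2,"maxUnavailable":0}}}}'

# Roll out a new image version and watch the rollout complete.
kubectl set image deployment/billing-service billing-service=gcr.io/PROJECT_ID/billing-service:v2
kubectl rollout status deployment/billing-service
```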

case_study_6.png

Practice Exam #6

video

A microservice has intermittent problems that produce bursts of log entries. How can you trap it for live debugging?

  • A. Log into the machine running the microservice and wait for the log messages.
  • B. Look for the error in the Stackdriver Error Reporting dashboard.
  • C. Configure the microservice to send traces to Stackdriver Trace.
  • D. Set a log metric in Stackdriver Logging and alert on it past a threshold.

D is the correct answer: set a log metric in Stackdriver Logging, and alert on it past a threshold. A Stackdriver log-based metric can identify a burst of log lines. You can set an alert, then connect to the machine while the problem is happening. What's the tip from this? You should be familiar with basic Stackdriver features and operations. With distributed and scalable services, you need to debug from logs and centralized monitoring services. Error Reporting shows individual occurrences, whereas a log is a history, so you can see issues in the context of time. Likewise, Trace is a sampling system, so it might miss intermittent issues like this. Again, the tip here is to understand the differences and appropriate uses for the tools in Stackdriver.
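
For example, the log-based metric piece could be created from the command line; the metric name and filter below are hypothetical, and the alerting policy that watches the metric would then be defined in Stackdriver Monitoring:

```
# Create a log-based metric that counts error-level log entries from the
# microservice; an alerting policy on this metric past a threshold notifies
# you while the burst is still happening.
gcloud logging metrics create checkout_error_burst \
    --description="Bursts of error logs from the checkout microservice" \
    --log-filter='resource.type="gce_instance" AND severity>=ERROR AND logName:"checkout"'
```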

Last week a region had a 1% failure rate in web tier VMs. How should you respond?

  • A. Monitor the application for a 5% failure rate.
  • B. Duplicate the application on-premises to compensate for failures in the cloud.
  • C. Perform a root cause analysis, reviewing cloud provider and deployment details to prevent similar future failures.
  • D. Halt all development until the application issue can be found and fixed.

C is the correct answer.

Challenge Labs #2

A Challenge Lab has minimal instructions. It explains a circumstance and the expected results; you have to figure out how to implement them. This is a timed lab: it will expire after one hour and 10 minutes, and it can be completed in about 50 minutes.

subject: PCA Prep -- Update and Scale Out a Containerized Application on a Kubernetes Cluster

This lab is similar to a Challenge Lab in the "Challenge: GCP Architecture" Quest: Managing Deployments Using Kubernetes Engine

For this Challenge Lab, you must complete a series of tasks within a limited time period. Instead of following step-by-step instructions, you'll be given a scenario and task - you figure out how to complete it on your own! An automated scoring system (shown on this page) will provide feedback on whether you have completed your tasks correctly.

Skills tested:

  • Update a Docker application and push a new version to a container repository.
  • Deploy the updated application version to a Kubernetes cluster.
  • Scale out the application so that it is running 2 replicas.

lab notes
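
The commands below sketch the general shape of a solution to this kind of challenge; the image, cluster, and deployment names are placeholders, not the lab's actual values:

```
# Build a new image version and push it to Container Registry.
docker build -t gcr.io/PROJECT_ID/example-app:v2 .
docker push gcr.io/PROJECT_ID/example-app:v2

# Point the existing Kubernetes deployment at the new image, then scale
# it out to two replicas.
gcloud container clusters get-credentials example-cluster --zone=us-central1-a
kubectl set image deployment/example-app example-app=gcr.io/PROJECT_ID/example-app:v2
kubectl scale deployment/example-app --replicas=2
```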

Challenge Labs #3

This challenge lab is more difficult than the previous one. The lab will expire after one hour and 30 minutes. The lab can be completed in about one hour and 15 minutes.

subject: PCA Prep -- Deploy a Compute Instance with a Remote Startup Script

This lab is similar to a Challenge Lab in the "Challenge: GCP Architecture" Quest.

In this Challenge Lab you must complete a series of tasks within a limited time period. Instead of following step-by-step instructions, you'll be given a scenario and task - you figure out how to complete it on your own! An automated scoring system (shown on this page) will provide feedback on whether you have completed your tasks correctly.

To score 100% you must complete all tasks within the time period!

When you take a Challenge Lab, you will not be taught GCP concepts. You'll need to use your advanced Google Compute Engine (GCE) skills to assess how to build the solution to the challenge presented. This lab is only recommended for students who have GCE skills. Are you up for the challenge?

Skills tested:

  • Create a storage bucket for startup scripts.
  • Create a virtual machine that runs a startup script from cloud storage.
  • Configure HTTP access for the virtual machine.
  • Deploy an application on an instance.

lab notes
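
Likewise, a rough sketch of the commands this kind of challenge calls for, with placeholder names rather than the lab's actual values:

```
# Stage the startup script in a bucket.
gsutil mb gs://example-startup-bucket
gsutil cp install-web.sh gs://example-startup-bucket/

# Create a VM that pulls and runs the script at boot, tagged for HTTP.
gcloud compute instances create example-web-vm \
    --zone=us-central1-a \
    --tags=http-server \
    --metadata=startup-script-url=gs://example-startup-bucket/install-web.sh

# Allow HTTP traffic to instances carrying the tag.
gcloud compute firewall-rules create allow-http \
    --allow=tcp:80 \
    --target-tags=http-server
```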

Review

Don't expect to be taught the touchstone concepts here. Their purpose in this course is to help you evaluate your preparedness. Seek training in the technical training courses, documentation, labs, and so forth.

Practice the case evaluation method on cases and on sample questions

  1. Business Requirements
  2. Technical Requirements
  3. Technical Watchpoints (requirements or facts that indicate elements of a solution)
  4. Proposed Solution

This is not just a test-taking skill. This is a skill used in practice by consultants on the job. It is how they think about their customer engagements and talk about it with other professionals.

What questions would you want to ask a client?

You can make reasonable assumptions about a case. But if you seem to be missing information, especially technical information, that absence may itself be useful: it can constrain the degrees of freedom and limit the number of potentially correct answers, or it might indicate that an answer is incorrect. In other words, use what you don't know, or what is missing in the case, to help you evaluate the intent of the question.

Make sure you know what is being asked.

If you find yourself speculating and trying to add information to the case beyond reasonable assumptions, then you might be drifting off of the intent of the question.

Hands-on

It is a good idea to review and run through basic labs so that the hands-on details are fresh in mind. You might want to review steps of labs you performed before. Or you might want to do some of them again.

There are resources available on the Qwiklabs Google Cloud Catalog.

Instructions for GRADED & UNGRADED Practice Exam Quiz

There are two versions of the Practice Exam Quiz. The first version is the UNGRADED Practice Exam Quiz, which provides information about the answers you select and feedback to help you understand what you might need to study. You can try the ungraded version multiple times until you get everything right.

A good way to study is to look up every correct and incorrect answer and make sure you not only know which answer is correct, but also why it is correct.

The second version is the GRADED Practice Exam Quiz. This version is more like the actual exam because it offers limited feedback -- just a total score at the end. When you are ready, proceed to the GRADED Practice Exam Quiz, which will give you credit towards completing this course. You may only attempt the GRADED Practice Exam Quiz three times in 8 hours.

Review of tips

  • TIP 1: Create your own custom preparation plan using the resources in this course.
  • TIP 2: Use the Exam Guide outline to help identify what to study.
  • TIP 3: Product and technology knowledge.
  • TIP 4: This course has touchstone concepts for self-evaluation, not technical training. Seek training if needed.
  • TIP 5: Problem solving is the key skill.
  • TIP 6: Practice evaluating your confidence in your answers.
  • TIP 7: Practice case evaluation and creating proposed solutions.
  • TIP 8: Use what you know and what you don't know to identify correct and incorrect answers.
  • TIP 9: Review or rehearse labs to refresh your experience.
  • TIP 10: Prepare!

References: