Advanced Cloud Architecture Labs

Master cutting-edge cloud technologies with hands-on labs covering chaos engineering, serverless architectures, FinOps, and cloud-native observability at scale.

Advanced Cloud Labs - Module 8

Explore advanced cloud concepts and emerging technologies with expert-level hands-on scenarios.

Lab 22: Chaos Engineering with Gremlin & Litmus
Multi-Cloud Chaos / Expert
Scenario: Building Resilient Systems Through Chaos
StreamingPlatform Inc. operates a global video streaming service handling 50 million concurrent users. Implement chaos engineering practices to ensure system resilience. Configure Gremlin for AWS infrastructure chaos experiments, deploy Litmus Chaos for Kubernetes workloads, implement automated game days with failure injection, establish SLO-based chaos experiments, and create observability dashboards to measure impact. The system must maintain 99.99% availability even during chaos experiments.

Learning Objectives:

  • Chaos Principles: Implement controlled failure injection
  • Experiment Design: Create hypothesis-driven tests
  • Automation: Build automated chaos pipelines
  • Observability: Measure system behavior under stress

📋 Step-by-Step Instructions:

  1. Install Chaos Engineering Tools & Configure Experiment

    Explanation: Install Gremlin and Litmus Chaos frameworks, then configure your first chaos experiment using the GUI panel.

    Part A - Terminal Installation:

    1. Install Litmus Chaos operator in your Kubernetes cluster

    2. Verify installation by checking pod status

    💡 Tip: Make sure you have cluster-admin permissions before installing

    Command to run: kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator.yaml

    Part B - GUI Configuration:

    3. In the Experiment Configuration Panel below, configure the following:

    Experiment Name: network-latency-test

    Target Type: Select "Kubernetes Pods"

    Chaos Type: Select "Network Latency"

    Blast Radius: Enter 10 (affects 10% of pods)

    Duration: Enter 5 minutes

    Schedule: Select "Run Once"

    Required Checkboxes (all must be checked):

    ✓ Check "Auto-rollback on SLO violation"

    ✓ Check "Send Slack notifications"

    ✓ Check "Generate detailed report"

    Click the copiable values above to copy them, then paste into the GUI fields
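The blast-radius field above maps to simple selection logic. A minimal sketch (a hypothetical helper, not part of Litmus or Gremlin) of how 10% of a deployment's pods might be chosen as chaos targets:

```python
import math
import random

def select_blast_radius(pods, percent):
    """Pick a random subset of pods sized by blast-radius percentage.

    Rounds up so a small cluster still yields at least one target.
    """
    count = max(1, math.ceil(len(pods) * percent / 100))
    return random.sample(pods, count)

# 10% of a 40-pod deployment -> 4 target pods
targets = select_blast_radius([f"web-{i}" for i in range(40)], 10)
print(len(targets))  # 4
```

Keeping the blast radius small and random is what makes the experiment safe: most traffic still lands on healthy pods while you observe degradation.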

  2. Define SLOs & Run Experiment

    Explanation: Define Service Level Objectives to establish baseline metrics, then execute your configured experiment.

    Part A - Terminal (Define SLOs):

    1. Set baseline availability, latency, and error rate thresholds

    💡 Tip: Document your steady state criteria - this becomes your hypothesis for chaos experiments

    Command to run: chaos define-slo --availability 99.9 --latency-p99 100ms --error-rate 0.1

    Part B - GUI (Run Experiment):

    2. Click the "Create Experiment" button in the Experiment Configuration Panel above

    3. Verify that your configuration matches instructions from Step 1

    💡 Tip: The system will validate all fields before starting the experiment
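The SLO thresholds from Part A translate directly into the auto-rollback condition. A sketch (a hypothetical check, not the `chaos` CLI itself) of how the lab's thresholds would gate an experiment:

```python
def slo_violated(metrics, availability_min=99.9, latency_p99_max_ms=100, error_rate_max=0.1):
    """Return True when any threshold from `chaos define-slo` is breached."""
    return (
        metrics["availability"] < availability_min
        or metrics["latency_p99_ms"] > latency_p99_max_ms
        or metrics["error_rate"] > error_rate_max
    )

steady = {"availability": 99.99, "latency_p99_ms": 23, "error_rate": 0.01}
degraded = {"availability": 99.5, "latency_p99_ms": 240, "error_rate": 0.4}
print(slo_violated(steady))    # False -> experiment continues
print(slo_violated(degraded))  # True  -> auto-rollback triggers
```

Note that the steady-state metrics mirror the System Health panel below; your hypothesis is that they stay inside the thresholds during chaos.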

  3. Monitor Experiment Execution

    Explanation: Monitor the chaos experiment in real-time and observe system behavior.

    Part A - Terminal (Monitor):

    1. Watch the experiment progress and track injected failures

    2. Observe how your services respond to network latency

    💡 Tip: Keep monitoring dashboards open (Grafana/Datadog) to visualize impact

    Command to run: chaos experiment monitor --watch --experiment-id network-latency-test

    Part B - GUI (View Live Metrics):

    3. Observe the System Health Metrics panel updating in real-time

    4. Watch for SLO violations (availability drop, latency increase)

    💡 Tip: Healthy systems should maintain SLOs even with 10% of pods experiencing latency

  4. Generate & Analyze Report

    Explanation: After the experiment completes, generate a comprehensive report and identify improvements.

    Part A - Terminal (Generate Report):

    1. Export detailed experiment data in JSON format

    2. Include all metrics, failures, and recovery times

    💡 Tip: Save reports for trend analysis across multiple game days

    Command to run: chaos report generate --experiment-id latest --format json --output chaos-report.json

    Part B - GUI (Review Results):

    3. Review the final metrics in the System Health Metrics panel

    4. Identify services that violated SLOs or exhibited unexpected behavior

    5. Document findings for system improvements

    💡 Success Criteria: You should identify at least one area for resilience improvement
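Once `chaos-report.json` is exported, the analysis in Part B can be automated. A sketch assuming a simplified report shape (the actual Litmus/Gremlin schemas differ; the service names here are illustrative):

```python
import json

def find_slo_violations(report):
    """Return names of services whose recorded metrics breached the lab SLOs."""
    violators = []
    for svc in report["services"]:
        if svc["availability"] < 99.9 or svc["latency_p99_ms"] > 100:
            violators.append(svc["name"])
    return violators

report = json.loads("""{
  "experiment": "network-latency-test",
  "services": [
    {"name": "edge-api", "availability": 99.99, "latency_p99_ms": 31},
    {"name": "recommendations", "availability": 99.2, "latency_p99_ms": 410}
  ]
}""")
print(find_slo_violations(report))  # ['recommendations']
```

Each violating service is a candidate for the resilience improvement the success criteria ask for (timeouts, retries, circuit breakers, capacity).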

After Completing All Steps:

Once you've completed all terminal commands AND configured the GUI:

  1. Click "Run Game Day" to execute an automated chaos simulation
  2. Click "Chaos Report" to view your experiment results and resilience score
  3. Observe the System Health Metrics panel - values update dynamically during experiments
  4. Review your final Progress Score and verify all tasks are completed

💡 Success: The dashboards show real-time changes as you configure chaos experiments!

Chaos Engineering Control Center

Chaos Experiments Dashboard

System Status: Stable

Experiment Configuration

* All checkboxes are required for lab completion

Network Chaos

Ready
Inject latency, packet loss

Resource Chaos

Ready
CPU/Memory stress tests

Pod Chaos

Ready
Random pod termination

State Chaos

Ready
Database failure simulation

System Health Metrics

99.99%
Availability
23ms
P99 Latency
0.01%
Error Rate
45s
MTTR

Chaos Terminal

chaos@engineering:~$
Progress: 0/6 tasks completed
Score: 0/100
0%

Lab Completed!

Chaos engineering implemented successfully!

Lab 23: Enterprise Serverless Architecture
AWS Lambda / Expert
Scenario: Event-Driven Microservices Platform
DataAnalytics Corp needs to process 10 billion events daily with sub-second latency. Build a serverless architecture using AWS Lambda, API Gateway, DynamoDB, and EventBridge. Implement function composition patterns, asynchronous processing with SQS and SNS, distributed tracing with X-Ray, cost optimization with reserved concurrency, and multi-region active-active deployment. The system must scale automatically and maintain costs under $0.0001 per transaction.

Learning Objectives:

  • Serverless Patterns: Implement enterprise patterns
  • Event Architecture: Build event-driven systems
  • Performance: Optimize cold starts and latency
  • Cost Control: Implement FinOps practices

📋 Detailed CLI Instructions:

  1. Design Architecture & Configure Lambda

    Explanation: Initialize SAM project and configure your Lambda function using the GUI.

    Part A - Terminal (Initialize Project):

    1. Create new SAM project with Python 3.9 runtime

    💡 Tip: SAM simplifies Lambda deployment by managing CloudFormation templates

    Command to run:

    sam init --runtime python3.9 --name data-processor

    Part B - GUI (Configure Function):

    2. In the Function Configuration Panel below, configure:

    Function Name: data-processor

    Runtime: Select "Python 3.9"

    Memory: Select "512 MB"

    Timeout: Enter 30 seconds

    Execution Role: Select "lambda-execution-role"

    Architecture: Select "x86_64"

    Click copiable values to copy, then paste into GUI fields

  2. Implement Function Patterns

    Explanation: Build your Lambda functions with shared dependencies and orchestration logic.

    Steps:

    1. Create Lambda layer for shared utilities (boto3, requests libraries)

    2. Implement Step Functions state machine for workflow orchestration

    3. Configure Dead Letter Queue (DLQ) for failed invocations

    4. Set up error handling with exponential backoff

    💡 Tip: Lambda layers reduce deployment package size and enable code reuse across functions

    Command to run:

    sam build

    This builds all Lambda functions and prepares them for deployment
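The exponential backoff from step 4 is worth seeing concretely. A minimal sketch (a hypothetical helper, not AWS SDK code) of the retry schedule a function would walk through before a message falls to the DLQ:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0, jitter=False):
    """Exponential backoff schedule: base * 2^attempt, capped, with optional full jitter."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # Full jitter spreads retries out so failed consumers don't retry in lockstep
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays

print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

In production you would enable jitter; synchronized retries from many clients can re-overload a recovering dependency.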

  3. Configure Event Sources

    Explanation: Connect your Lambda functions to various AWS event sources for event-driven processing.

    Steps:

    1. Set up EventBridge rule to trigger Lambda on custom events

    2. Configure Kinesis stream as trigger for real-time data processing (batch size: 100)

    3. Implement SQS queue for asynchronous message processing

    4. Add S3 event notification trigger for file processing

    💡 Tip: Use SQS for workloads that can tolerate latency; use Kinesis for real-time streaming

    Command to run:

    sam deploy --guided

    Follow the prompts to configure your deployment settings
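For the Kinesis trigger above, a handler that reports partial batch failures avoids reprocessing the whole batch of 100 when one record is bad. A sketch assuming `ReportBatchItemFailures` is enabled on the event source mapping (`process` is hypothetical business logic):

```python
import base64
import json

def process(payload):
    """Hypothetical business logic; raises on unprocessable events."""
    if payload.get("bad"):
        raise ValueError("unprocessable event")

def handler(event, context):
    """Process a Kinesis batch; report only the failed records back for retry."""
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
        except Exception:
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}

# Simulated batch: record "2" carries an unprocessable payload
event = {"Records": [
    {"kinesis": {"sequenceNumber": "1",
                 "data": base64.b64encode(json.dumps({"ok": True}).encode()).decode()}},
    {"kinesis": {"sequenceNumber": "2",
                 "data": base64.b64encode(json.dumps({"bad": True}).encode()).decode()}},
]}
print(handler(event, None))  # {'batchItemFailures': [{'itemIdentifier': '2'}]}
```

Only sequence number "2" is retried; the healthy record is checkpointed past, which matters enormously at 10 billion events per day.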

  4. Optimize Performance

    Explanation: Reduce cold starts and improve response times through performance optimizations.

    Steps:

    1. Configure provisioned concurrency (100 concurrent executions) for critical functions

    2. Implement connection pooling for database connections (reuse across invocations)

    3. Optimize package size by removing unused dependencies

    4. Use Lambda Powertools for structured logging and tracing

    💡 Tip: Provisioned concurrency keeps functions warm but increases cost - use for latency-critical APIs only

    Command to run:

    aws lambda put-provisioned-concurrency-config --function-name data-processor --qualifier <alias-or-version> --provisioned-concurrent-executions 100

    This eliminates cold starts for your main function. Note: provisioned concurrency is attached to a published version or alias (not $LATEST), so the --qualifier flag is required.

  5. Implement Observability

    Explanation: Enable comprehensive monitoring and tracing to understand function behavior.

    Steps:

    1. Enable AWS X-Ray active tracing for all Lambda functions

    2. Configure CloudWatch Insights queries for log analysis

    3. Set up custom CloudWatch metrics (invocation count, duration, errors)

    4. Create CloudWatch dashboards with key performance indicators

    5. Configure SNS alerts for error rates > 1%

    💡 Tip: X-Ray shows end-to-end request flow across your distributed serverless architecture

    Command to run:

    sam logs --name data-processor --tail

    Monitor function logs in real-time
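CloudWatch Logs Insights queries are far easier when every log line is structured JSON, which is what Lambda Powertools' logger emits for you. A hand-rolled sketch of the idea (field names here are illustrative):

```python
import json
import time

def log_event(level, message, **fields):
    """Emit one JSON log line; Logs Insights can then filter on any field."""
    entry = {
        "level": level,
        "message": message,
        "timestamp": int(time.time() * 1000),
        **fields,
    }
    print(json.dumps(entry))  # one line per entry -> easy to query
    return entry

entry = log_event("INFO", "event processed",
                  function="data-processor", duration_ms=145)
```

A query like `filter function = "data-processor" | stats avg(duration_ms)` then works without any log parsing.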

  6. Deploy Multi-Region

    Explanation: Create active-active multi-region deployment for high availability and low latency.

    Steps:

    1. Deploy your SAM application to us-east-1 and eu-west-1 regions

    2. Configure Route 53 with geoproximity routing policy

    3. Set up DynamoDB Global Tables for cross-region data replication

    4. Implement S3 Cross-Region Replication for static assets

    5. Test failover by simulating regional outage

    💡 Tip: Multi-region adds cost but provides disaster recovery and better user experience globally

    Commands to run:

    sam deploy --region us-east-1
    sam deploy --region eu-west-1

    Deploy to multiple regions for global availability

  7. Configure Lambda GUI Settings (Required)

    Explanation: Configure the Function Configuration panel with required Lambda settings.

    GUI Configuration (Required):

    Function Name: data-processor

    Runtime: Select "Python 3.9"

    Memory: Select "512 MB"

    Timeout: Enter 30 seconds

    Execution Role: Select "lambda-execution-role"

    Architecture: Select "x86_64"

    Advanced Settings (Expand & Configure):

    ✓ Check "Enable X-Ray Tracing"

    Reserved Concurrency: Enter 100

    Provisioned Concurrency: Enter 10

    Dead Letter Queue: Select "SQS Queue"

    Click "Deploy Function" to deploy your Lambda configuration.

After Completing All Steps:

Once you've completed all terminal commands AND configured the GUI:

  1. Click "Load Test" to run a performance benchmark on your serverless architecture
  2. Click "Cost Analysis" to view detailed cost breakdown and optimization suggestions
  3. Observe the Lambda Functions Monitor - function cards update as you deploy
  4. Watch the Event Flow Monitor showing real-time request flow through your architecture
  5. Review your final Progress Score and verify all tasks are completed

💡 Success: The dashboards show real-time Lambda metrics as you configure and deploy functions!

Serverless Architecture Dashboard

Lambda Functions Monitor

Function Configuration

Advanced Settings (Required) *

* Required fields for lab completion

api-authorizer

Invocations: 125,432/min
Duration: 12ms avg
Errors: 0.01%
Cost: $0.0002/invoke

data-processor

Invocations: 89,123/min
Duration: 145ms avg
Errors: 0.02%
Cost: $0.0008/invoke

report-generator

Invocations: 0/min
Duration: 2.3s avg
Errors: 0%
Cost: $0.0015/invoke

Event Flow Monitor

API Gateway
1.2M req/min
Lambda
Active
DynamoDB
15K WCU

SAM CLI Terminal

serverless@aws:~$
Progress: 0/6 tasks completed
Score: 0/100
0%

Lab Completed!

Enterprise serverless architecture deployed!

Lab 24: Cloud FinOps & Cost Optimization
Multi-Cloud FinOps / Advanced
Scenario: Enterprise Cloud Cost Management
TechGiant Corp spends $5M monthly on cloud services across AWS, Azure, and GCP. Implement a comprehensive FinOps practice to reduce costs by 30% while maintaining performance. Deploy cost allocation tags and chargeback models, implement automated rightsizing and scheduling, configure reserved instances and savings plans, build cost anomaly detection, and create executive dashboards. Establish a Cloud Center of Excellence with clear accountability.

Learning Objectives:

  • Cost Visibility: Implement comprehensive tagging
  • Optimization: Automate cost reduction strategies
  • Governance: Establish FinOps practices
  • Reporting: Build executive dashboards

📋 Step-by-Step FinOps Instructions:

  1. Implement Cost Allocation & Tagging

    Explanation: Establish tagging standards to track cloud costs by department, project, and environment.

    Steps:

    1. Create tagging policy requiring: Environment, Project, Owner, CostCenter tags

    2. Deploy AWS Organizations SCPs to enforce tagging on resource creation

    3. Configure cost allocation tags in AWS Billing console

    4. Run tag compliance scan and remediate non-compliant resources

    💡 Tip: Without proper tagging, you can't accurately allocate costs to teams - tagging is the foundation of FinOps

    In the FinOps Terminal below, type: enforce-tagging-policy
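The compliance scan in step 4 reduces to checking each resource's tags against the required set. A minimal sketch with a hypothetical data shape (a real scan would page through the Resource Groups Tagging API):

```python
REQUIRED_TAGS = {"Environment", "Project", "Owner", "CostCenter"}

def non_compliant(resources):
    """Return ARNs of resources missing any required tag key."""
    return [r["arn"] for r in resources if REQUIRED_TAGS - set(r.get("tags", {}))]

resources = [
    {"arn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0abc",
     "tags": {"Environment": "prod", "Project": "web", "Owner": "sre", "CostCenter": "1001"}},
    {"arn": "arn:aws:s3:::legacy-bucket",
     "tags": {"Owner": "data"}},
]
print(non_compliant(resources))  # ['arn:aws:s3:::legacy-bucket']
```

Everything this returns is untraceable spend; remediating it is the prerequisite for every later step in this lab.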

  2. Deploy Cost Management Tools

    Explanation: Set up native and third-party tools to visualize and analyze cloud spending.

    Steps:

    1. Enable AWS Cost Explorer with granular data (hourly breakdowns)

    2. Configure Azure Cost Management + Billing with custom views

    3. Integrate CloudHealth or CloudCheckr for multi-cloud visibility

    4. Set up Cost & Usage Reports (CUR) to S3 with Athena queries

    💡 Tip: The Cost Explorer API charges $0.01 per request - use saved reports in the console instead of ad-hoc API queries

    In the FinOps Terminal, type: deploy-cost-tools

  3. Automate Rightsizing

    Explanation: Automatically identify and resize underutilized resources to reduce waste.

    Steps:

    1. Enable AWS Compute Optimizer (analyzes CloudWatch metrics)

    2. Create Lambda function to schedule start/stop of dev/test instances

    3. Implement auto-scaling policies with CPU/Memory targets (70% utilization)

    4. Set up weekly rightsizing review with stakeholders

    💡 Tip: Rightsizing typically saves 20-30% of compute costs with minimal effort

    In the FinOps Terminal, type: enable-rightsizing
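The core rightsizing decision is simple: if peak utilization stays well below the target, step down a size. A toy sketch with a hypothetical size ladder and a deliberately conservative rule (Compute Optimizer uses far richer signals, including memory and network):

```python
SIZE_LADDER = ["xlarge", "large", "medium", "small"]  # hypothetical, largest to smallest

def rightsize(current, cpu_peaks, target=0.70):
    """Recommend one size down when the worst observed CPU peak is under half the target."""
    idx = SIZE_LADDER.index(current)
    if max(cpu_peaks) < target / 2 and idx < len(SIZE_LADDER) - 1:
        return SIZE_LADDER[idx + 1]
    return current

print(rightsize("xlarge", [0.12, 0.18, 0.22]))  # 'large' -> downsize
print(rightsize("large", [0.65, 0.71, 0.58]))   # 'large' -> keep
```

Stepping down one size at a time, then re-measuring, is safer than jumping straight to the "optimal" size.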

  4. Configure Reserved Instances & Savings Plans

    Explanation: Commit to 1-year or 3-year terms to save up to 72% on steady-state workloads.

    Steps:

    1. Analyze RI recommendations in Cost Explorer (look for 60%+ utilization)

    2. Purchase Compute Savings Plans for flexibility across instance families

    3. Buy specific RIs for predictable workloads (databases, always-on services)

    4. Implement RI/SP tracking dashboard to monitor coverage and utilization

    💡 Tip: Start with 1-year terms; move to 3-year only for very stable workloads

    In the FinOps Terminal, type: purchase-savings-plans
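The purchase decision is break-even arithmetic: a commitment only pays off if expected utilization of the discounted rate beats paying on-demand for the hours you actually use. A sketch with hypothetical rates (check your real Cost Explorer recommendations, not these numbers):

```python
def ri_saves_money(on_demand_hourly, ri_hourly, utilization):
    """True when committed spend is less than on-demand spend for hours actually used.

    utilization is the fraction (0-1) of commitment hours you expect to consume.
    """
    return ri_hourly < on_demand_hourly * utilization

# Hypothetical rates: $0.10/hr on-demand vs $0.06/hr effective 1-year rate
print(ri_saves_money(0.10, 0.06, 0.90))  # True  -> 90% utilization, buy
print(ri_saves_money(0.10, 0.06, 0.50))  # False -> 50% utilization, skip
```

This is why the tip above says to look for 60%+ utilization before committing: below the break-even point, the "discount" costs you money.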

  5. Build Cost Anomaly Detection

    Explanation: Catch unexpected cost spikes before they become budget disasters.

    Steps:

    1. Configure AWS Cost Anomaly Detection with ML-based alerting

    2. Set alert thresholds ($500 for services, $5000 for total spend)

    3. Create SNS topic to notify FinOps team via email/Slack

    4. Build automated response: Lambda to snapshot resources on anomaly

    💡 Tip: Most runaway costs are from forgotten resources or misconfigurations

    In the FinOps Terminal, type: setup-anomaly-detection
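The alert thresholds from step 2 can be expressed directly as a check. A sketch of a fixed-threshold rule with illustrative numbers (AWS Cost Anomaly Detection is ML-based and handles seasonality, so treat this as the idea, not the product):

```python
SERVICE_THRESHOLD = 500    # USD above expected, per service
TOTAL_THRESHOLD = 5000     # USD above expected, total spend

def cost_anomalies(actual, expected):
    """Flag services (and the total) whose spend exceeds expectations by the threshold."""
    alerts = [svc for svc, spend in actual.items()
              if spend - expected.get(svc, 0) > SERVICE_THRESHOLD]
    if sum(actual.values()) - sum(expected.values()) > TOTAL_THRESHOLD:
        alerts.append("TOTAL")
    return alerts

expected = {"EC2": 70000, "RDS": 43000, "S3": 26000}
actual = {"EC2": 70200, "RDS": 49500, "S3": 26100}
print(cost_anomalies(actual, expected))  # ['RDS', 'TOTAL']
```

Each alert would fan out through the SNS topic to email/Slack, and optionally trigger the snapshot Lambda from step 4.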

  6. Create Executive Dashboards & Reporting

    Explanation: Build dashboards that communicate cost trends to leadership and enable chargeback.

    Steps:

    1. Create QuickSight dashboards showing: month-over-month costs, forecast, savings opportunities

    2. Implement chargeback reports per cost center/project

    3. Configure automated monthly reports emailed to stakeholders

    4. Build "showback" views that give teams cost visibility without actual chargeback

    💡 Tip: Use simple, executive-friendly visuals - show trends and actions, not raw data

    In the FinOps Terminal, type: create-dashboards
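Chargeback is ultimately a group-by on the CostCenter tag from Lab step 1. A sketch assuming simplified billing line items (real CUR data has many more columns):

```python
from collections import defaultdict

def chargeback(line_items):
    """Sum monthly cost per CostCenter tag; untagged spend lands in 'UNALLOCATED'."""
    totals = defaultdict(float)
    for item in line_items:
        center = item.get("tags", {}).get("CostCenter", "UNALLOCATED")
        totals[center] += item["cost"]
    return dict(totals)

items = [
    {"cost": 1200.0, "tags": {"CostCenter": "1001"}},
    {"cost": 800.0, "tags": {"CostCenter": "1002"}},
    {"cost": 350.0, "tags": {"CostCenter": "1001"}},
    {"cost": 90.0},  # missing tags -> surfaces as unallocated spend
]
print(chargeback(items))  # {'1001': 1550.0, '1002': 800.0, 'UNALLOCATED': 90.0}
```

A shrinking UNALLOCATED bucket is itself a useful executive metric: it shows tag-policy enforcement working.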

  7. Configure FinOps GUI Settings (Required)

    Explanation: Configure the Cost Optimization Settings panel to apply FinOps policies.

    GUI Configuration (Required):

    Cost Center: Select "Engineering"

    Cloud Provider: Select "AWS"

    Environment: Select "Production"

    Budget Alert Threshold: Enter 80 (80%)

    Rightsizing Aggressiveness: Select "Moderate"

    RI/SP Purchase Strategy: Select "Auto-purchase (1-year)"

    Advanced Settings (Expand & Configure):

    ✓ Check "Auto-shutdown idle resources"

    ✓ Check "Schedule dev/test instances"

    Click "Save Configuration" to apply your FinOps settings.

After Completing All Steps:

Once you've completed all terminal commands AND configured the GUI:

  1. Click "Optimize Costs" to run an automated cost optimization analysis
  2. Click "Executive Report" to generate a comprehensive FinOps report
  3. Observe the Cost Metrics Cards - values update dynamically as you complete tasks
  4. Watch the Cost by Service breakdown showing spending allocation
  5. Review your final Progress Score and verify all tasks are completed

💡 Success: The dashboards show real-time cost savings as you implement FinOps practices!

FinOps Command Center

Cloud Cost Management

Cost Optimization Settings

% of monthly budget
Advanced FinOps Settings (Required) *
$5.2M
Monthly Spend
↑ 12% vs last month
$1.8M
Potential Savings
35% optimization possible
68%
Resource Utilization
32% idle resources
$850K
RI Coverage Savings
42% coverage

Cost by Service

EC2 Instances $2.1M (40%)
RDS Databases $1.3M (25%)
S3 Storage $780K (15%)
Other Services $1.02M (20%)

FinOps Terminal

finops@cloud:~$
Progress: 0/6 tasks completed
Score: 0/100
0%

Lab Completed!

FinOps practices successfully implemented!