Advanced Cloud Architecture Labs

Master cutting-edge cloud technologies with hands-on labs covering chaos engineering, serverless architectures, FinOps, and cloud-native observability at scale.

Advanced Cloud Labs - Module 8

Explore advanced cloud concepts and emerging technologies with expert-level hands-on scenarios.

Lab 22: Chaos Engineering with Gremlin & Litmus
Multi-Cloud Chaos / Expert
Scenario: Building Resilient Systems Through Chaos
StreamingPlatform Inc. operates a global video streaming service handling 50 million concurrent users. Implement chaos engineering practices to ensure system resilience. Configure Gremlin for AWS infrastructure chaos experiments, deploy Litmus Chaos for Kubernetes workloads, implement automated game days with failure injection, establish SLO-based chaos experiments, and create observability dashboards to measure impact. The system must maintain 99.99% availability even during chaos experiments.

Learning Objectives:

  • Chaos Principles: Implement controlled failure injection
  • Experiment Design: Create hypothesis-driven tests
  • Automation: Build automated chaos pipelines
  • Observability: Measure system behavior under stress

📋 Step-by-Step Instructions:

  1. Install Chaos Engineering Tools & Configure Experiment

    Explanation: Install Gremlin and Litmus Chaos frameworks, then configure your first chaos experiment using the GUI panel.

    Part A - Terminal Installation:

    1. Install Litmus Chaos operator in your Kubernetes cluster

    2. Verify installation by checking pod status

    💡 Tip: Make sure you have cluster-admin permissions before installing

    Command to run: kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator.yaml

    Part B - GUI Configuration:

    3. In the Experiment Configuration Panel below, configure the following:

    Experiment Name: network-latency-test

    Target Type: Select "Kubernetes Pods"

    Chaos Type: Select "Network Latency"

    Blast Radius: Enter 10 (affects 10% of pods)

    Duration: Enter 5 minutes

    Schedule: Select "Run Once"

    Required Checkboxes (all must be checked):

    ✓ Check "Auto-rollback on SLO violation"

    ✓ Check "Send Slack notifications"

    ✓ Check "Generate detailed report"

    Click the copiable values above to copy them, then paste into the GUI fields
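The blast-radius field above maps to simple selection logic. A minimal sketch (a hypothetical helper, not part of Litmus or Gremlin) of how 10% of a deployment's pods might be chosen as chaos targets:

```python
import math
import random

def select_blast_radius(pods, percent):
    """Pick a random subset of pods sized by blast-radius percentage.

    Rounds up so a small cluster still yields at least one target.
    """
    count = max(1, math.ceil(len(pods) * percent / 100))
    return random.sample(pods, count)

# 10% of a 40-pod deployment -> 4 target pods
targets = select_blast_radius([f"web-{i}" for i in range(40)], 10)
print(len(targets))  # 4
```

Keeping the blast radius small and random is what makes the experiment safe: most traffic still lands on healthy pods while you observe degradation.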

  2. Define SLOs & Run Experiment

    Explanation: Define Service Level Objectives to establish baseline metrics, then execute your configured experiment.

    Part A - Terminal (Define SLOs):

    1. Set baseline availability, latency, and error rate thresholds

    💡 Tip: Document your steady state criteria - this becomes your hypothesis for chaos experiments

    Command to run: chaos define-slo --availability 99.9 --latency-p99 100ms --error-rate 0.1

    Part B - GUI (Run Experiment):

    2. Click the "Create Experiment" button in the Experiment Configuration Panel above

    3. Verify that your configuration matches instructions from Step 1

    💡 Tip: The system will validate all fields before starting the experiment
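The SLO thresholds from Part A translate directly into the auto-rollback condition. A sketch (a hypothetical check, not the `chaos` CLI itself) of how the lab's thresholds would gate an experiment:

```python
def slo_violated(metrics, availability_min=99.9, latency_p99_max_ms=100, error_rate_max=0.1):
    """Return True when any threshold from `chaos define-slo` is breached."""
    return (
        metrics["availability"] < availability_min
        or metrics["latency_p99_ms"] > latency_p99_max_ms
        or metrics["error_rate"] > error_rate_max
    )

steady = {"availability": 99.99, "latency_p99_ms": 23, "error_rate": 0.01}
degraded = {"availability": 99.5, "latency_p99_ms": 240, "error_rate": 0.4}
print(slo_violated(steady))    # False -> experiment continues
print(slo_violated(degraded))  # True  -> auto-rollback triggers
```

Note that the steady-state metrics mirror the System Health panel below; your hypothesis is that they stay inside the thresholds during chaos.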

  3. Monitor Experiment Execution

    Explanation: Monitor the chaos experiment in real-time and observe system behavior.

    Part A - Terminal (Monitor):

    1. Watch the experiment progress and track injected failures

    2. Observe how your services respond to network latency

    💡 Tip: Keep monitoring dashboards open (Grafana/Datadog) to visualize impact

    Command to run: chaos experiment monitor --watch --experiment-id network-latency-test

    Part B - GUI (View Live Metrics):

    3. Observe the System Health Metrics panel updating in real-time

    4. Watch for SLO violations (availability drop, latency increase)

    💡 Tip: Healthy systems should maintain SLOs even with 10% of pods experiencing latency

  4. Generate & Analyze Report

    Explanation: After the experiment completes, generate a comprehensive report and identify improvements.

    Part A - Terminal (Generate Report):

    1. Export detailed experiment data in JSON format

    2. Include all metrics, failures, and recovery times

    💡 Tip: Save reports for trend analysis across multiple game days

    Command to run: chaos report generate --experiment-id latest --format json --output chaos-report.json

    Part B - GUI (Review Results):

    3. Review the final metrics in the System Health Metrics panel

    4. Identify services that violated SLOs or exhibited unexpected behavior

    5. Document findings for system improvements

    💡 Success Criteria: You should identify at least one area for resilience improvement
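Once `chaos-report.json` is exported, the analysis in Part B can be automated. A sketch assuming a simplified report shape (the actual Litmus/Gremlin schemas differ; the service names here are illustrative):

```python
import json

def find_slo_violations(report):
    """Return names of services whose recorded metrics breached the lab SLOs."""
    violators = []
    for svc in report["services"]:
        if svc["availability"] < 99.9 or svc["latency_p99_ms"] > 100:
            violators.append(svc["name"])
    return violators

report = json.loads("""{
  "experiment": "network-latency-test",
  "services": [
    {"name": "edge-api", "availability": 99.99, "latency_p99_ms": 31},
    {"name": "recommendations", "availability": 99.2, "latency_p99_ms": 410}
  ]
}""")
print(find_slo_violations(report))  # ['recommendations']
```

Each violating service is a candidate for the resilience improvement the success criteria ask for (timeouts, retries, circuit breakers, capacity).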

After Completing All Steps:

Once you've completed all terminal commands AND configured the GUI:

  1. Click "Run Game Day" to execute an automated chaos simulation
  2. Click "Chaos Report" to view your experiment results and resilience score
  3. Observe the System Health Metrics panel - values update dynamically during experiments
  4. Review your final Progress Score and verify all tasks are completed

💡 Success: The dashboards show real-time changes as you configure chaos experiments!

Chaos Engineering Control Center

Chaos Experiments Dashboard

System Status: Stable

Experiment Configuration

* All checkboxes are required for lab completion

Network Chaos

Ready
Inject latency, packet loss

Resource Chaos

Ready
CPU/Memory stress tests

Pod Chaos

Ready
Random pod termination

State Chaos

Ready
Database failure simulation

System Health Metrics

99.99%
Availability
23ms
P99 Latency
0.01%
Error Rate
45s
MTTR

Chaos Terminal

chaos@engineering:~$
Progress: 0/6 tasks completed
Score: 0/100
0%

Lab Completed!

Chaos engineering implemented successfully!

Lab 23: Enterprise Serverless Architecture
AWS Lambda / Expert
Scenario: Event-Driven Microservices Platform
DataAnalytics Corp needs to process 10 billion events daily with sub-second latency. Build a serverless architecture using AWS Lambda, API Gateway, DynamoDB, and EventBridge. Implement function composition patterns, asynchronous processing with SQS and SNS, distributed tracing with X-Ray, cost optimization with reserved concurrency, and multi-region active-active deployment. The system must scale automatically and maintain costs under $0.0001 per transaction.

Learning Objectives:

  • Serverless Patterns: Implement enterprise patterns
  • Event Architecture: Build event-driven systems
  • Performance: Optimize cold starts and latency
  • Cost Control: Implement FinOps practices

📋 Detailed CLI Instructions:

  1. Design Architecture & Configure Lambda

    Explanation: Initialize SAM project and configure your Lambda function using the GUI.

    Part A - Terminal (Initialize Project):

    1. Create new SAM project with Python 3.9 runtime

    💡 Tip: SAM simplifies Lambda deployment by managing CloudFormation templates

    Command to run:

    sam init --runtime python3.9 --name data-processor

    Part B - GUI (Configure Function):

    2. In the Function Configuration Panel below, configure:

    Function Name: data-processor

    Runtime: Select "Python 3.9"

    Memory: Select "512 MB"

    Timeout: Enter 30 seconds

    Execution Role: Select "lambda-execution-role"

    Architecture: Select "x86_64"

    Click copiable values to copy, then paste into GUI fields

  2. Implement Function Patterns

    Explanation: Build your Lambda functions with shared dependencies and orchestration logic.

    Steps:

    1. Create Lambda layer for shared utilities (boto3, requests libraries)

    2. Implement Step Functions state machine for workflow orchestration

    3. Configure Dead Letter Queue (DLQ) for failed invocations

    4. Set up error handling with exponential backoff

    💡 Tip: Lambda layers reduce deployment package size and enable code reuse across functions

    Command to run:

    sam build

    This builds all Lambda functions and prepares them for deployment
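The exponential backoff from step 4 is worth seeing concretely. A minimal sketch (a hypothetical helper, not AWS SDK code) of the retry schedule a function would walk through before a message falls to the DLQ:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0, jitter=False):
    """Exponential backoff schedule: base * 2^attempt, capped, with optional full jitter."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # Full jitter spreads retries out so failed consumers don't retry in lockstep
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays

print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

In production you would enable jitter; synchronized retries from many clients can re-overload a recovering dependency.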

  3. Configure Event Sources

    Explanation: Connect your Lambda functions to various AWS event sources for event-driven processing.

    Steps:

    1. Set up EventBridge rule to trigger Lambda on custom events

    2. Configure Kinesis stream as trigger for real-time data processing (batch size: 100)

    3. Implement SQS queue for asynchronous message processing

    4. Add S3 event notification trigger for file processing

    💡 Tip: Use SQS for workloads that can tolerate latency; use Kinesis for real-time streaming

    Command to run:

    sam deploy --guided

    Follow the prompts to configure your deployment settings
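For the Kinesis trigger above, a handler that reports partial batch failures avoids reprocessing the whole batch of 100 when one record is bad. A sketch assuming `ReportBatchItemFailures` is enabled on the event source mapping (`process` is hypothetical business logic):

```python
import base64
import json

def process(payload):
    """Hypothetical business logic; raises on unprocessable events."""
    if payload.get("bad"):
        raise ValueError("unprocessable event")

def handler(event, context):
    """Process a Kinesis batch; report only the failed records back for retry."""
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
        except Exception:
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}

# Simulated batch: record "2" carries an unprocessable payload
event = {"Records": [
    {"kinesis": {"sequenceNumber": "1",
                 "data": base64.b64encode(json.dumps({"ok": True}).encode()).decode()}},
    {"kinesis": {"sequenceNumber": "2",
                 "data": base64.b64encode(json.dumps({"bad": True}).encode()).decode()}},
]}
print(handler(event, None))  # {'batchItemFailures': [{'itemIdentifier': '2'}]}
```

Only sequence number "2" is retried; the healthy record is checkpointed past, which matters enormously at 10 billion events per day.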

  4. Optimize Performance

    Explanation: Reduce cold starts and improve response times through performance optimizations.

    Steps:

    1. Configure provisioned concurrency (100 concurrent executions) for critical functions

    2. Implement connection pooling for database connections (reuse across invocations)

    3. Optimize package size by removing unused dependencies

    4. Use Lambda Powertools for structured logging and tracing

    💡 Tip: Provisioned concurrency keeps functions warm but increases cost - use for latency-critical APIs only

    Command to run:

    aws lambda put-provisioned-concurrency-config --function-name data-processor --qualifier <alias-or-version> --provisioned-concurrent-executions 100

    This eliminates cold starts for your main function. Note: provisioned concurrency is attached to a published version or alias (not $LATEST), so the --qualifier flag is required.

  5. Implement Observability

    Explanation: Enable comprehensive monitoring and tracing to understand function behavior.

    Steps:

    1. Enable AWS X-Ray active tracing for all Lambda functions

    2. Configure CloudWatch Insights queries for log analysis

    3. Set up custom CloudWatch metrics (invocation count, duration, errors)

    4. Create CloudWatch dashboards with key performance indicators

    5. Configure SNS alerts for error rates > 1%

    💡 Tip: X-Ray shows end-to-end request flow across your distributed serverless architecture

    Command to run:

    sam logs --name data-processor --tail

    Monitor function logs in real-time
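CloudWatch Logs Insights queries are far easier when every log line is structured JSON, which is what Lambda Powertools' logger emits for you. A hand-rolled sketch of the idea (field names here are illustrative):

```python
import json
import time

def log_event(level, message, **fields):
    """Emit one JSON log line; Logs Insights can then filter on any field."""
    entry = {
        "level": level,
        "message": message,
        "timestamp": int(time.time() * 1000),
        **fields,
    }
    print(json.dumps(entry))  # one line per entry -> easy to query
    return entry

entry = log_event("INFO", "event processed",
                  function="data-processor", duration_ms=145)
```

A query like `filter function = "data-processor" | stats avg(duration_ms)` then works without any log parsing.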

  6. Deploy Multi-Region

    Explanation: Create active-active multi-region deployment for high availability and low latency.

    Steps:

    1. Deploy your SAM application to us-east-1 and eu-west-1 regions

    2. Configure Route 53 with geoproximity routing policy

    3. Set up DynamoDB Global Tables for cross-region data replication

    4. Implement S3 Cross-Region Replication for static assets

    5. Test failover by simulating regional outage

    💡 Tip: Multi-region adds cost but provides disaster recovery and better user experience globally

    Commands to run:

    sam deploy --region us-east-1
    sam deploy --region eu-west-1

    Deploy to multiple regions for global availability

  7. Configure Lambda GUI Settings (Required)

    Explanation: Configure the Function Configuration panel with required Lambda settings.

    GUI Configuration (Required):

    Function Name: data-processor

    Runtime: Select "Python 3.9"

    Memory: Select "512 MB"

    Timeout: Enter 30 seconds

    Execution Role: Select "lambda-execution-role"

    Architecture: Select "x86_64"

    Advanced Settings (Expand & Configure):

    ✓ Check "Enable X-Ray Tracing"

    Reserved Concurrency: Enter 100

    Provisioned Concurrency: Enter 10

    Dead Letter Queue: Select "SQS Queue"

    Click "Deploy Function" to deploy your Lambda configuration.

After Completing All Steps:

Once you've completed all terminal commands AND configured the GUI:

  1. Click "Load Test" to run a performance benchmark on your serverless architecture
  2. Click "Cost Analysis" to view detailed cost breakdown and optimization suggestions
  3. Observe the Lambda Functions Monitor - function cards update as you deploy
  4. Watch the Event Flow Monitor showing real-time request flow through your architecture
  5. Review your final Progress Score and verify all tasks are completed

💡 Success: The dashboards show real-time Lambda metrics as you configure and deploy functions!

Serverless Architecture Dashboard

Lambda Functions Monitor

Function Configuration

Advanced Settings (Required) *

* Required fields for lab completion

api-authorizer

Invocations: 125,432/min
Duration: 12ms avg
Errors: 0.01%
Cost: $0.0002/invoke

data-processor

Invocations: 89,123/min
Duration: 145ms avg
Errors: 0.02%
Cost: $0.0008/invoke

report-generator

Invocations: 0/min
Duration: 2.3s avg
Errors: 0%
Cost: $0.0015/invoke

Event Flow Monitor

API Gateway
1.2M req/min
Lambda
Active
DynamoDB
15K WCU

SAM CLI Terminal

serverless@aws:~$
Progress: 0/6 tasks completed
Score: 0/100
0%

Lab Completed!

Enterprise serverless architecture deployed!

Lab 24: Cloud FinOps & Cost Optimization
Multi-Cloud FinOps / Advanced
Scenario: Enterprise Cloud Cost Management
TechGiant Corp spends $5M monthly on cloud services across AWS, Azure, and GCP. Implement a comprehensive FinOps practice to reduce costs by 30% while maintaining performance. Deploy cost allocation tags and chargeback models, implement automated rightsizing and scheduling, configure reserved instances and savings plans, build cost anomaly detection, and create executive dashboards. Establish a Cloud Center of Excellence with clear accountability.

Learning Objectives:

  • Cost Visibility: Implement comprehensive tagging
  • Optimization: Automate cost reduction strategies
  • Governance: Establish FinOps practices
  • Reporting: Build executive dashboards

📋 Step-by-Step FinOps Instructions:

  1. Implement Cost Allocation & Tagging

    Explanation: Establish tagging standards to track cloud costs by department, project, and environment.

    Steps:

    1. Create tagging policy requiring: Environment, Project, Owner, CostCenter tags

    2. Deploy AWS Organizations SCPs to enforce tagging on resource creation

    3. Configure cost allocation tags in AWS Billing console

    4. Run tag compliance scan and remediate non-compliant resources

    💡 Tip: Without proper tagging, you can't accurately allocate costs to teams - tagging is the foundation of FinOps

    In the FinOps Terminal below, type: enforce-tagging-policy
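The compliance scan in step 4 reduces to checking each resource's tags against the required set. A minimal sketch with a hypothetical data shape (a real scan would page through the Resource Groups Tagging API):

```python
REQUIRED_TAGS = {"Environment", "Project", "Owner", "CostCenter"}

def non_compliant(resources):
    """Return ARNs of resources missing any required tag key."""
    return [r["arn"] for r in resources if REQUIRED_TAGS - set(r.get("tags", {}))]

resources = [
    {"arn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0abc",
     "tags": {"Environment": "prod", "Project": "web", "Owner": "sre", "CostCenter": "1001"}},
    {"arn": "arn:aws:s3:::legacy-bucket",
     "tags": {"Owner": "data"}},
]
print(non_compliant(resources))  # ['arn:aws:s3:::legacy-bucket']
```

Everything this returns is untraceable spend; remediating it is the prerequisite for every later step in this lab.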

  2. Deploy Cost Management Tools

    Explanation: Set up native and third-party tools to visualize and analyze cloud spending.

    Steps:

    1. Enable AWS Cost Explorer with granular data (hourly breakdowns)

    2. Configure Azure Cost Management + Billing with custom views

    3. Integrate CloudHealth or CloudCheckr for multi-cloud visibility

    4. Set up Cost & Usage Reports (CUR) to S3 with Athena queries

    💡 Tip: The Cost Explorer API charges $0.01 per request - use saved reports in the console instead of ad-hoc API queries

    In the FinOps Terminal, type: deploy-cost-tools

  3. Automate Rightsizing

    Explanation: Automatically identify and resize underutilized resources to reduce waste.

    Steps:

    1. Enable AWS Compute Optimizer (analyzes CloudWatch metrics)

    2. Create Lambda function to schedule start/stop of dev/test instances

    3. Implement auto-scaling policies with CPU/Memory targets (70% utilization)

    4. Set up weekly rightsizing review with stakeholders

    💡 Tip: Rightsizing typically saves 20-30% of compute costs with minimal effort

    In the FinOps Terminal, type: enable-rightsizing
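The core rightsizing decision is simple: if peak utilization stays well below the target, step down a size. A toy sketch with a hypothetical size ladder and a deliberately conservative rule (Compute Optimizer uses far richer signals, including memory and network):

```python
SIZE_LADDER = ["xlarge", "large", "medium", "small"]  # hypothetical, largest to smallest

def rightsize(current, cpu_peaks, target=0.70):
    """Recommend one size down when the worst observed CPU peak is under half the target."""
    idx = SIZE_LADDER.index(current)
    if max(cpu_peaks) < target / 2 and idx < len(SIZE_LADDER) - 1:
        return SIZE_LADDER[idx + 1]
    return current

print(rightsize("xlarge", [0.12, 0.18, 0.22]))  # 'large' -> downsize
print(rightsize("large", [0.65, 0.71, 0.58]))   # 'large' -> keep
```

Stepping down one size at a time, then re-measuring, is safer than jumping straight to the "optimal" size.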

  4. Configure Reserved Instances & Savings Plans

    Explanation: Commit to 1-year or 3-year terms to save up to 72% on steady-state workloads.

    Steps:

    1. Analyze RI recommendations in Cost Explorer (look for 60%+ utilization)

    2. Purchase Compute Savings Plans for flexibility across instance families

    3. Buy specific RIs for predictable workloads (databases, always-on services)

    4. Implement RI/SP tracking dashboard to monitor coverage and utilization

    💡 Tip: Start with 1-year terms; move to 3-year only for very stable workloads

    In the FinOps Terminal, type: purchase-savings-plans
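The purchase decision is break-even arithmetic: a commitment only pays off if expected utilization of the discounted rate beats paying on-demand for the hours you actually use. A sketch with hypothetical rates (check your real Cost Explorer recommendations, not these numbers):

```python
def ri_saves_money(on_demand_hourly, ri_hourly, utilization):
    """True when committed spend is less than on-demand spend for hours actually used.

    utilization is the fraction (0-1) of commitment hours you expect to consume.
    """
    return ri_hourly < on_demand_hourly * utilization

# Hypothetical rates: $0.10/hr on-demand vs $0.06/hr effective 1-year rate
print(ri_saves_money(0.10, 0.06, 0.90))  # True  -> 90% utilization, buy
print(ri_saves_money(0.10, 0.06, 0.50))  # False -> 50% utilization, skip
```

This is why the tip above says to look for 60%+ utilization before committing: below the break-even point, the "discount" costs you money.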

  5. Build Cost Anomaly Detection

    Explanation: Catch unexpected cost spikes before they become budget disasters.

    Steps:

    1. Configure AWS Cost Anomaly Detection with ML-based alerting

    2. Set alert thresholds ($500 for services, $5000 for total spend)

    3. Create SNS topic to notify FinOps team via email/Slack

    4. Build automated response: Lambda to snapshot resources on anomaly

    💡 Tip: Most runaway costs are from forgotten resources or misconfigurations

    In the FinOps Terminal, type: setup-anomaly-detection
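The alert thresholds from step 2 can be expressed directly as a check. A sketch of a fixed-threshold rule with illustrative numbers (AWS Cost Anomaly Detection is ML-based and handles seasonality, so treat this as the idea, not the product):

```python
SERVICE_THRESHOLD = 500    # USD above expected, per service
TOTAL_THRESHOLD = 5000     # USD above expected, total spend

def cost_anomalies(actual, expected):
    """Flag services (and the total) whose spend exceeds expectations by the threshold."""
    alerts = [svc for svc, spend in actual.items()
              if spend - expected.get(svc, 0) > SERVICE_THRESHOLD]
    if sum(actual.values()) - sum(expected.values()) > TOTAL_THRESHOLD:
        alerts.append("TOTAL")
    return alerts

expected = {"EC2": 70000, "RDS": 43000, "S3": 26000}
actual = {"EC2": 70200, "RDS": 49500, "S3": 26100}
print(cost_anomalies(actual, expected))  # ['RDS', 'TOTAL']
```

Each alert would fan out through the SNS topic to email/Slack, and optionally trigger the snapshot Lambda from step 4.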

  6. Create Executive Dashboards & Reporting

    Explanation: Build dashboards that communicate cost trends to leadership and enable chargeback.

    Steps:

    1. Create QuickSight dashboards showing: month-over-month costs, forecast, savings opportunities

    2. Implement chargeback reports per cost center/project

    3. Configure automated monthly reports emailed to stakeholders

    4. Build "showback" views that give teams cost visibility without actual chargeback

    💡 Tip: Use simple, executive-friendly visuals - show trends and actions, not raw data

    In the FinOps Terminal, type: create-dashboards
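Chargeback is ultimately a group-by on the CostCenter tag from Lab step 1. A sketch assuming simplified billing line items (real CUR data has many more columns):

```python
from collections import defaultdict

def chargeback(line_items):
    """Sum monthly cost per CostCenter tag; untagged spend lands in 'UNALLOCATED'."""
    totals = defaultdict(float)
    for item in line_items:
        center = item.get("tags", {}).get("CostCenter", "UNALLOCATED")
        totals[center] += item["cost"]
    return dict(totals)

items = [
    {"cost": 1200.0, "tags": {"CostCenter": "1001"}},
    {"cost": 800.0, "tags": {"CostCenter": "1002"}},
    {"cost": 350.0, "tags": {"CostCenter": "1001"}},
    {"cost": 90.0},  # missing tags -> surfaces as unallocated spend
]
print(chargeback(items))  # {'1001': 1550.0, '1002': 800.0, 'UNALLOCATED': 90.0}
```

A shrinking UNALLOCATED bucket is itself a useful executive metric: it shows tag-policy enforcement working.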

  7. Configure FinOps GUI Settings (Required)

    Explanation: Configure the Cost Optimization Settings panel to apply FinOps policies.

    GUI Configuration (Required):

    Cost Center: Select "Engineering"

    Cloud Provider: Select "AWS"

    Environment: Select "Production"

    Budget Alert Threshold: Enter 80 (80%)

    Rightsizing Aggressiveness: Select "Moderate"

    RI/SP Purchase Strategy: Select "Auto-purchase (1-year)"

    Advanced Settings (Expand & Configure):

    ✓ Check "Auto-shutdown idle resources"

    ✓ Check "Schedule dev/test instances"

    Click "Save Configuration" to apply your FinOps settings.

After Completing All Steps:

Once you've completed all terminal commands AND configured the GUI:

  1. Click "Optimize Costs" to run an automated cost optimization analysis
  2. Click "Executive Report" to generate a comprehensive FinOps report
  3. Observe the Cost Metrics Cards - values update dynamically as you complete tasks
  4. Watch the Cost by Service breakdown showing spending allocation
  5. Review your final Progress Score and verify all tasks are completed

💡 Success: The dashboards show real-time cost savings as you implement FinOps practices!

FinOps Command Center

Cloud Cost Management

Cost Optimization Settings

% of monthly budget
Advanced FinOps Settings (Required) *
$5.2M
Monthly Spend
↑ 12% vs last month
$1.8M
Potential Savings
35% optimization possible
68%
Resource Utilization
32% idle resources
$850K
RI Coverage Savings
42% coverage

Cost by Service

EC2 Instances $2.1M (40%)
RDS Databases $1.3M (25%)
S3 Storage $780K (15%)
Other Services $1.02M (20%)

FinOps Terminal

finops@cloud:~$
Progress: 0/6 tasks completed
Score: 0/100
0%

Lab Completed!

FinOps practices successfully implemented!