Cloud Monitoring & Automation Labs - Module 5

Master enterprise monitoring, infrastructure as code, and multi-cloud observability with authentic console interfaces.

Hands-on experience with production-grade monitoring and automation tools.

Lab 12: Enterprise Monitoring with Prometheus & Grafana
Monitoring / Advanced
Scenario: Global E-Commerce Platform Monitoring
GlobalShop processes 50K orders/day across 12 regions. Configure Prometheus metrics collection with 15-second scrape intervals, build Grafana dashboards with environment variables, and set up SLO-based alerting for their microservices architecture.

Learning Objectives:

  • Prometheus Configuration: Set retention policies and scrape intervals
  • Service Discovery: Add multiple monitoring targets
  • Dashboard Design: Create multi-panel Grafana dashboards
  • Alert Management: Configure threshold-based alerts

📋 Step-by-Step Instructions

  1. Configure Prometheus Server Settings
    What you're doing: Configuring the core Prometheus server parameters that control data retention and scraping behavior.

    Instructions:
    1. Set Retention to 15 days (how long metrics are stored)
    2. Set Scrape Interval to 15 seconds (how often metrics are collected)
    3. Choose Storage Engine from dropdown
    4. Enable Compression if available
    5. Click Save Configuration
    💡 Tip: Longer retention (15+ days) increases storage requirements but enables better historical trend analysis. 15s scrape interval balances resource usage with metric granularity.
    📘 Real-world context: Production environments typically use 15-30 day retention for observability while keeping 365-day aggregated data for compliance.
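The settings from this step map onto standard Prometheus configuration. A minimal sketch: the scrape interval lives in the `global` block of `prometheus.yml`, while retention is set with a server flag at startup.

```yaml
# prometheus.yml -- global settings (sketch)
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting/recording rules are evaluated
```

Retention is passed on the command line rather than in the config file, e.g. `prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d`.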
  2. Register Service Discovery Targets
    What you're doing: Adding microservices to Prometheus monitoring by configuring scrape targets.

    Instructions:
    1. In the Target field, enter the first endpoint: api:9100
    2. Click Add Target
    3. Add the second endpoint: cart:9100
    4. Click Add Target again
    5. Click Test Connectivity to verify both targets are reachable
    ⚠️ Important: Both targets (api:9100 and cart:9100) must be added before connectivity test will pass. Port 9100 is the default node_exporter port.
    📘 Real-world context: In production, you'd use service discovery (Kubernetes, Consul, EC2) instead of static targets. Node exporters expose system-level metrics.
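The two static targets from this step would appear in `prometheus.yml` roughly like this (job name is illustrative):

```yaml
scrape_configs:
  - job_name: "node"          # node_exporter job covering both services
    static_configs:
      - targets: ["api:9100", "cart:9100"]
```

In production you would replace `static_configs` with a discovery mechanism such as `kubernetes_sd_configs`, `consul_sd_configs`, or `ec2_sd_configs`.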
  3. Build Multi-Environment Dashboard
    What you're doing: Creating a Grafana dashboard with environment variables for multi-tenant monitoring.

    Instructions:
    1. Set Panel Count to 2 or more (number of metric panels)
    2. In Variable: env, hold Ctrl (Windows) or Cmd (Mac) and select both dev and prod
    3. Configure Refresh Interval and Time Range
    4. Click Create Dashboard
    💡 Tip: Variables enable a single dashboard to display metrics filtered by environment, region, or service. This reduces dashboard sprawl.
    📘 Best practice: Use template variables for any dimension you filter by frequently (environment, region, cluster, namespace).
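For reference, a multi-value variable like `env` shows up in Grafana's dashboard JSON model roughly as follows (exact fields vary by Grafana version; this is a sketch, not a complete dashboard):

```json
{
  "templating": {
    "list": [
      {
        "name": "env",
        "label": "Environment",
        "type": "custom",
        "multi": true,
        "query": "dev,prod"
      }
    ]
  }
}
```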
  4. Configure SLO-Based Alert Rule
    What you're doing: Setting up threshold-based alerting for CPU utilization with specified evaluation window.

    Instructions:
    1. Select Metric: cpu_utilization from dropdown
    2. Enter Threshold: 80 (percentage)
    3. Set For Duration: 5 minutes (alert fires after 5min above threshold)
    4. Configure Severity level
    5. Click Add Alert Rule
    💡 Tip: The "For" duration prevents flapping alerts. CPU must stay above 80% for the full 5 minutes before the alert fires.
    📘 Real-world context: Production alert thresholds depend on baseline performance. Consider using anomaly detection for dynamic thresholds.
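Expressed as a Prometheus alerting rule, this step looks roughly like the sketch below (the metric name `cpu_utilization` is taken from the lab's dropdown; real exporters usually derive CPU percentage from `node_cpu_seconds_total`):

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighCPUUtilization
        expr: cpu_utilization > 80   # threshold from step 2
        for: 5m                       # must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% for 5 minutes on {{ $labels.instance }}"
```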
  5. Configure Notification Channel
    What you're doing: Integrating Slack for alert notifications via webhook.

    Instructions:
    1. Enter a Slack webhook URL in format: https://hooks.slack.com/services/T00/B00/XX
    2. Set Notification Priority
    3. Click Test Notification to verify integration
    💡 Tip: Test your notification channel before saving to ensure webhooks are configured correctly. Different severity levels can route to different channels.
    📘 Best practice: Use different Slack channels for critical vs warning alerts to reduce alert fatigue.
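Routing severities to different Slack channels is done in Alertmanager; a minimal sketch using the placeholder webhook format from this step (channel names are hypothetical):

```yaml
route:
  receiver: slack-warning           # default route
  routes:
    - match:
        severity: critical
      receiver: slack-critical

receivers:
  - name: slack-critical
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/XX"  # placeholder
        channel: "#alerts-critical"
  - name: slack-warning
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/XX"  # placeholder
        channel: "#alerts-warning"
```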
  6. Execute PromQL Query
    What you're doing: Testing PromQL query language to extract and analyze metrics data.

    Instructions:
    1. Enter a PromQL query (examples below)
    2. Click Execute Query

    Example queries:
    rate(http_requests_total[5m]) - Request rate
    up - Service availability
    avg(cpu_usage) by (instance) - Average CPU by instance
    💡 Tip: PromQL's rate() function calculates per-second average rate. Use it for counter metrics. The [5m] is the time window.
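What `rate()` computes can be illustrated in plain Python. This is a toy re-implementation of the idea (per-second increase of a counter over a time window), not Prometheus code:

```python
def rate(samples, window_s):
    """Approximate PromQL rate(): per-second increase of a counter over
    the trailing window, using the first and last samples inside it.
    samples: list of (timestamp_seconds, counter_value), oldest first."""
    end = samples[-1][0]
    inside = [(t, v) for t, v in samples if t >= end - window_s]
    (t0, v0), (t1, v1) = inside[0], inside[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)

# http_requests_total scraped every 15s, growing 2 requests/second
samples = [(t, t * 2) for t in range(0, 301, 15)]
print(rate(samples, 300))  # -> 2.0
```

Real `rate()` also handles counter resets (a restart dropping the counter back to zero), which this sketch omits.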

Lab 13: Multi-Cloud Infrastructure as Code with Terraform
Terraform / Expert
Scenario: Hybrid Cloud Infrastructure Management
TechCorp manages 300+ resources across AWS, Azure, and VMware. Configure Terraform Cloud with remote state backend (S3+DynamoDB), enable multi-provider support, create production workspace, and enforce policy checks before infrastructure changes.

Learning Objectives:

  • State Management: Configure S3 backend with DynamoDB locking
  • Multi-Provider: Enable AWS, Azure, and VMware providers
  • Workspaces: Create environment-specific workspaces
  • Policy Enforcement: Enable compliance checks before apply

📋 Step-by-Step Instructions

  1. Configure Remote State Backend
    What you're doing: Setting up centralized state storage with locking to enable team collaboration and prevent state conflicts.

    Instructions:
    1. Enter S3 Bucket Name: tf-state-prod (stores Terraform state files)
    2. Enter DynamoDB Table: tf-locks (provides state locking)
    3. Click Save Backend to configure remote state
    💡 Tip: DynamoDB locking prevents concurrent Terraform runs from corrupting state. Without locking, two people running "terraform apply" simultaneously could cause state corruption.
    📘 Real-world context: Remote state backends are critical for team environments. S3 provides durability (99.999999999%) and versioning for state recovery. Always enable versioning on state buckets.
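The backend from this step corresponds to a `backend "s3"` block in Terraform; a sketch using the lab's names (the `key` and `region` values are assumptions):

```hcl
terraform {
  backend "s3" {
    bucket         = "tf-state-prod"
    key            = "global/terraform.tfstate"  # path within the bucket (assumed)
    region         = "us-east-1"                 # assumed region
    dynamodb_table = "tf-locks"                  # enables state locking
    encrypt        = true                        # encrypt state at rest
  }
}
```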
  2. Enable Multi-Cloud Providers
    What you're doing: Configuring Terraform to manage resources across AWS, Azure, and VMware simultaneously.

    Instructions:
    1. Check the AWS checkbox
    2. Check the Azure checkbox
    3. Check the VMware checkbox
    4. Select AWS Region: us-east-1 (N. Virginia) from dropdown
    5. Select Azure Region: East US from dropdown
    6. Click Save Providers
    💡 Tip: Multi-cloud strategies provide redundancy and avoid vendor lock-in. Each provider requires authentication (AWS keys, Azure service principals, VMware credentials).
    📘 Best practice: Use provider aliases when managing multiple regions or accounts of the same provider. Example: provider "aws" { alias = "west" }
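In code, the three providers and an alias look roughly like this (the VMware provider is `vsphere`; the second AWS region is illustrative):

```hcl
terraform {
  required_providers {
    aws     = { source = "hashicorp/aws" }
    azurerm = { source = "hashicorp/azurerm" }
    vsphere = { source = "hashicorp/vsphere" }
  }
}

provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "west"        # second region of the same provider
  region = "us-west-2"
}

provider "azurerm" {
  features {}            # azurerm requires this (possibly empty) block
}
```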
  3. Create Production Workspace
    What you're doing: Creating an isolated environment for production infrastructure, separate from dev/staging.

    Instructions:
    1. Enter Workspace Name: production
    2. Select Execution Mode (Remote recommended for teams)
    3. Click Create Workspace
    4. Verify workspace appears as "current workspace" in header
    💡 Tip: Workspaces allow using the same Terraform code with different variable files. Each workspace maintains separate state, enabling environment isolation (dev, staging, prod).
    📘 Real-world context: Production workspaces should have stricter access controls, approval workflows, and potentially different providers/regions than lower environments.
  4. Define Infrastructure Modules
    What you're doing: Configuring reusable Terraform modules for VPC networking and compute resources with standardized tagging.

    Instructions:
    1. Enter VPC Module name (e.g., "vpc-module" or "network-module")
    2. Enter Compute Module name (e.g., "compute-module" or "ec2-module")
    3. Enter Resource Tags exactly as: env=prod,owner=platform
    4. Click Save Modules
    💡 Tip: Modules promote DRY (Don't Repeat Yourself) principle. Write infrastructure code once, reuse across projects. Tag format must be key=value pairs for proper resource tagging.
    📘 Best practice: Standardized tags enable cost allocation, automation, and resource governance. Common tags: environment, owner, project, cost-center, compliance-scope.
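The module names and tag pairs from this step translate into HCL roughly as follows (the `./modules/...` source paths are hypothetical local paths):

```hcl
locals {
  common_tags = {
    env   = "prod"
    owner = "platform"
  }
}

module "vpc" {
  source = "./modules/vpc-module"      # hypothetical path
  tags   = local.common_tags
}

module "compute" {
  source = "./modules/compute-module"  # hypothetical path
  tags   = local.common_tags
}
```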
  5. Generate and Review Terraform Plan
    What you're doing: Running terraform plan to preview infrastructure changes before applying them.

    Instructions:
    1. Click Run Plan button
    2. Wait for plan generation (simulates what will be created/modified/destroyed)
    3. Review the output showing resources to add (should show ~15 resources)
    4. Verify VPC and Azure resources appear in plan
    💡 Tip: ALWAYS review plans before applying! The plan shows: + (create), ~ (modify), - (destroy). Red flags: unexpected deletions or modifications to critical resources.
    📘 Real-world context: Save plans for audit trails: terraform plan -out=plan.tfplan. This ensures the applied changes match what was reviewed and approved.
  6. Apply with Policy Enforcement (Sentinel)
    What you're doing: Applying infrastructure changes with policy-as-code validation to enforce governance rules.

    Instructions:
    1. Check the Enable Policy Validation checkbox
    2. Click Apply Changes
    3. Wait for policy checks to pass (validates compliance, security, cost limits)
    4. Verify "Policy check passed ✓" message and "Apply complete" confirmation
    ⚠️ Important: Policy validation MUST be enabled. Sentinel policies prevent non-compliant infrastructure (e.g., unencrypted storage, over-sized instances, missing tags).
    📘 Best practice: Implement hard/soft enforcement levels. Soft-mandatory allows overrides with justification, hard-mandatory blocks non-compliant changes entirely.
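The enforcement levels described above are declared per policy in a policy set's `sentinel.hcl`; a minimal sketch (policy name and file path are hypothetical):

```hcl
policy "mandatory-tags" {
  source            = "./policies/mandatory-tags.sentinel"  # hypothetical policy file
  enforcement_level = "soft-mandatory"  # advisory | soft-mandatory | hard-mandatory
}
```

With `soft-mandatory`, a failed check can be overridden with justification; `hard-mandatory` blocks the apply outright.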

Lab 14: Unified Multi-Cloud Observability Platform
AWS/Azure / Intermediate
Scenario: Cross-Cloud Monitoring Hub
DataAnalytics Inc. processes 2PB data/month across AWS and Azure. Configure CloudWatch cross-account monitoring for AWS account 111111111111, set up Azure Log Analytics workspace, build unified dashboards querying both clouds, and enable ML-powered anomaly detection.

Learning Objectives:

  • CloudWatch: Configure cross-account AWS monitoring
  • Azure Monitor: Set up Log Analytics workspace
  • Unified Dashboard: Build multi-cloud observability views
  • ML Insights: Enable intelligent anomaly detection

📋 Step-by-Step Instructions

  1. Configure AWS CloudWatch Cross-Account Monitoring
    What you're doing: Setting up centralized CloudWatch monitoring for a specific AWS account to aggregate metrics across multiple accounts.

    Instructions:
    1. Enter AWS Account ID: 111111111111 (12-digit account identifier)
    2. Select AWS Region: us-east-1 from dropdown
    3. The configuration auto-validates when both fields are entered correctly
    💡 Tip: Cross-account monitoring requires IAM roles with CloudWatch read permissions. The monitoring account assumes a role in the target account to pull metrics.
    📘 Real-world context: Organizations with 50+ AWS accounts use cross-account monitoring to view all CloudWatch metrics in a single pane. This enables centralized alerting and reduces console switching.
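The IAM role mentioned in the tip needs a trust policy in the target account allowing the monitoring account to assume it; a sketch (account `999999999999` stands in for the hypothetical central monitoring account, and the role would additionally need a CloudWatch read-only permissions policy attached):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::999999999999:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```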
  2. Configure Azure Monitor Log Analytics Workspace
    What you're doing: Connecting to Azure Log Analytics for querying logs and metrics across Azure resources.

    Instructions:
    1. Enter Workspace Name: analytics-workspace
    2. Enter Resource Group: RG-Monitor (groups related Azure resources)
    3. Click Connect Sources to establish connection
    4. Verify connection success message
    💡 Tip: Log Analytics workspaces support KQL (Kusto Query Language) for powerful log analysis. They can ingest 10GB-50TB/day depending on tier.
    📘 Best practice: Use separate workspaces for production vs non-production to prevent accidental data mixing and enable different retention policies per environment.
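Once the workspace is connected, queries use KQL; a sketch against the classic `Perf` table (column names follow the Log Analytics agent schema and may differ with the newer Azure Monitor agent):

```kql
// Average CPU per computer over the last hour, in 5-minute bins
Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
```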
  3. Build Unified Multi-Cloud Dashboard
    What you're doing: Creating a single dashboard that displays metrics from both AWS CloudWatch and Azure Monitor simultaneously.

    Instructions:
    1. Set Panel Count to 2 or more (number of metric visualizations)
    2. In Variable: cloud, hold Ctrl (Windows) or Cmd (Mac) and select both aws and azure
    3. Click Create Dashboard
    4. Review the dashboard preview showing data sources
    💡 Tip: Multi-cloud dashboards enable side-by-side comparison of AWS vs Azure performance, costs, and availability. Use variables to filter by cloud provider dynamically.
    📘 Real-world context: Hybrid cloud architectures (AWS + Azure) require unified observability. Example: E-commerce site runs web tier on AWS, data tier on Azure - one dashboard shows full stack health.
  4. Execute Cross-Cloud Query
    What you're doing: Running a unified query that aggregates metrics from both AWS CloudWatch and Azure Log Analytics.

    Instructions:
    1. In the query field, enter any query (examples below)
    2. Select Query Language (KQL or SQL)
    3. Optionally enable Use Query Cache for faster repeated queries
    4. Click Run Query

    Example queries:
    SELECT * FROM metrics WHERE cloud IN ('aws','azure')
    AzureMetrics | where TimeGenerated > ago(1h)
    avg(cpu_percent) by cloud_provider
    💡 Tip: Cross-cloud queries normalize different metric formats (AWS uses CamelCase, Azure uses snake_case). Query cache reduces latency for frequently-run queries from seconds to milliseconds.
    📘 Best practice: Use query variables for time ranges and filters. This makes dashboards reusable across teams without hardcoding values.
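The metric-name normalization mentioned in the tip can be sketched in a few lines of Python. This is an illustration of the idea, not the platform's actual implementation:

```python
import re

def to_snake_case(name: str) -> str:
    """Normalize an AWS-style CamelCase metric name (e.g. CPUUtilization)
    to the snake_case form used on the Azure side of a unified schema."""
    # Split an acronym run from a following capitalized word (CPU|Utilization),
    # then split lowercase/digit-to-uppercase boundaries, then lowercase all.
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    return s.lower()

print(to_snake_case("CPUUtilization"))    # -> cpu_utilization
print(to_snake_case("NetworkPacketsIn"))  # -> network_packets_in
```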
  5. Enable ML-Powered Anomaly Detection
    What you're doing: Activating machine learning algorithms that automatically detect unusual patterns in metrics without manual threshold configuration.

    Instructions:
    1. Check the Enable Anomaly Detection checkbox
    2. Select Sensitivity dropdown and choose High
    3. Configuration auto-validates when both settings are enabled

    Sensitivity levels:
    High: Detects smaller deviations (more alerts, fewer missed issues)
    Normal: Balanced detection (recommended for most workloads)
    💡 Tip: ML anomaly detection learns normal patterns over 7-14 days. It's ideal for detecting unexpected spikes, drops, or pattern changes that static thresholds would miss.
    📘 Real-world context: Traditional alerts fail during Black Friday (traffic 10x normal but healthy). ML anomaly detection adapts to new baselines, alerting only on truly abnormal behavior.
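The core idea behind anomaly detection (deviation from a learned baseline rather than a fixed threshold) can be illustrated with a simple z-score check. A toy sketch, not a cloud API; a "High" sensitivity setting corresponds to a smaller deviation threshold:

```python
import statistics

def is_anomaly(history, value, sensitivity=3.0):
    """Flag `value` as anomalous if it deviates from the baseline mean
    by more than `sensitivity` standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > sensitivity

# Baseline CPU hovering around 40% with small noise
baseline = [38, 41, 40, 39, 42, 40, 41, 39, 40, 41]
print(is_anomaly(baseline, 43))  # ordinary fluctuation -> False
print(is_anomaly(baseline, 95))  # sudden spike -> True
```

Real ML detectors go further, modelling seasonality and trend so that a legitimately shifting baseline (like Black Friday traffic) does not trigger alerts.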
  6. Configure Cost Optimization Controls
    What you're doing: Implementing log sampling and retention policies to reduce observability costs while maintaining visibility.

    Instructions:
    1. Check Enable Log Sampling checkbox (reduces ingestion volume)
    2. Enter Retention: 14 days (how long logs are stored)
    3. Both settings auto-validate when configured

    Cost impact:
    Sampling: 50% sampling = 50% cost reduction
     Retention: 14 days vs 90 days ≈ 84% storage cost savings (steady-state stored volume scales with retained days: 1 − 14/90)
    💡 Tip: Sample non-critical logs (INFO level) but keep 100% of ERROR/CRITICAL logs. Use tiered storage: hot data (7 days), warm (30 days), cold archive (90+ days).
    📘 Real-world context: Observability costs can reach $50K-500K/month at scale. CloudWatch charges $0.50/GB ingested, $0.03/GB stored. 100TB/month ingestion = $50K/month. Sampling + retention policies cut this 60-80%.
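The cost impact above can be sketched with a simple model using the quoted list prices ($0.50/GB ingested, $0.03/GB-month stored); storage is approximated as the sampled volume held for the retention period:

```python
def monthly_observability_cost(ingest_gb, sampling_rate=1.0,
                               ingest_price=0.50, storage_price=0.03,
                               retention_days=14):
    """Rough monthly cost: ingestion on the sampled volume, plus storage
    for the fraction of a 30-day month that data is retained."""
    sampled = ingest_gb * sampling_rate
    ingest_cost = sampled * ingest_price
    storage_cost = sampled * (retention_days / 30) * storage_price
    return ingest_cost + storage_cost

full = monthly_observability_cost(100_000)                       # 100 TB, no sampling
optimized = monthly_observability_cost(100_000, sampling_rate=0.5)
print(f"${full:,.0f} -> ${optimized:,.0f}")  # -> $51,400 -> $25,700
```

As the numbers show, ingestion dominates at short retention, which is why sampling cuts the bill roughly in half here.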