Cloud Monitoring & Automation Labs - Module 5

Master enterprise monitoring, infrastructure as code, and multi-cloud observability with authentic console interfaces.

Hands-on experience with production-grade monitoring and automation tools.

Lab 12: Enterprise Monitoring with Prometheus & Grafana
Monitoring / Advanced
Scenario: Global E-Commerce Platform Monitoring
GlobalShop processes 50K orders/day across 12 regions. Configure Prometheus metrics collection with 15-second scrape intervals, build Grafana dashboards with environment variables, and set up SLO-based alerting for their microservices architecture.

Learning Objectives:

  • Prometheus Configuration: Set retention policies and scrape intervals
  • Service Discovery: Add multiple monitoring targets
  • Dashboard Design: Create multi-panel Grafana dashboards
  • Alert Management: Configure threshold-based alerts

📋 Step-by-Step Instructions

  1. Configure Prometheus Server Settings
    What you're doing: Configuring the core Prometheus server parameters that control data retention and scraping behavior.

    Instructions:
    1. Set Retention to 15 days (how long metrics are stored)
    2. Set Scrape Interval to 15 seconds (how often metrics are collected)
    3. Choose Storage Engine from dropdown
    4. Enable Compression if available
    5. Click Save Configuration
    💡 Tip: Longer retention (15+ days) increases storage requirements but enables better historical trend analysis. 15s scrape interval balances resource usage with metric granularity.
    📘 Real-world context: Production environments typically use 15-30 day retention for observability while keeping 365-day aggregated data for compliance.
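The settings from this step map onto standard Prometheus configuration. A minimal sketch: the scrape interval lives in the `global` block of `prometheus.yml`, while retention is set with a server flag at startup.

```yaml
# prometheus.yml -- global settings (sketch)
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting/recording rules are evaluated
```

Retention is passed on the command line rather than in the config file, e.g. `prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d`.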
  2. Register Service Discovery Targets
    What you're doing: Adding microservices to Prometheus monitoring by configuring scrape targets.

    Instructions:
    1. In the Target field, enter the first endpoint: api:9100
    2. Click Add Target
    3. Add the second endpoint: cart:9100
    4. Click Add Target again
    5. Click Test Connectivity to verify both targets are reachable
    ⚠️ Important: Both targets (api:9100 and cart:9100) must be added before connectivity test will pass. Port 9100 is the default node_exporter port.
    📘 Real-world context: In production, you'd use service discovery (Kubernetes, Consul, EC2) instead of static targets. Node exporters expose system-level metrics.
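The two static targets from this step would appear in `prometheus.yml` roughly like this (job name is illustrative):

```yaml
scrape_configs:
  - job_name: "node"          # node_exporter job covering both services
    static_configs:
      - targets: ["api:9100", "cart:9100"]
```

In production you would replace `static_configs` with a discovery mechanism such as `kubernetes_sd_configs`, `consul_sd_configs`, or `ec2_sd_configs`.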
  3. Build Multi-Environment Dashboard
    What you're doing: Creating a Grafana dashboard with environment variables for multi-tenant monitoring.

    Instructions:
    1. Set Panel Count to 2 or more (number of metric panels)
    2. In Variable: env, hold Ctrl (Windows) or Cmd (Mac) and select both dev and prod
    3. Configure Refresh Interval and Time Range
    4. Click Create Dashboard
    💡 Tip: Variables enable a single dashboard to display metrics filtered by environment, region, or service. This reduces dashboard sprawl.
    📘 Best practice: Use template variables for any dimension you filter by frequently (environment, region, cluster, namespace).
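For reference, a multi-value variable like `env` shows up in Grafana's dashboard JSON model roughly as follows (exact fields vary by Grafana version; this is a sketch, not a complete dashboard):

```json
{
  "templating": {
    "list": [
      {
        "name": "env",
        "label": "Environment",
        "type": "custom",
        "multi": true,
        "query": "dev,prod"
      }
    ]
  }
}
```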
  4. Configure SLO-Based Alert Rule
    What you're doing: Setting up threshold-based alerting for CPU utilization with specified evaluation window.

    Instructions:
    1. Select Metric: cpu_utilization from dropdown
    2. Enter Threshold: 80 (percentage)
    3. Set For Duration: 5 minutes (alert fires after 5min above threshold)
    4. Configure Severity level
    5. Click Add Alert Rule
    💡 Tip: The "For" duration prevents flapping alerts. CPU must stay above 80% for the full 5 minutes before the alert fires.
    📘 Real-world context: Production alert thresholds depend on baseline performance. Consider using anomaly detection for dynamic thresholds.
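Expressed as a Prometheus alerting rule, this step looks roughly like the sketch below (the metric name `cpu_utilization` is taken from the lab's dropdown; real exporters usually derive CPU percentage from `node_cpu_seconds_total`):

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighCPUUtilization
        expr: cpu_utilization > 80   # threshold from step 2
        for: 5m                       # must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% for 5 minutes on {{ $labels.instance }}"
```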
  5. Configure Notification Channel
    What you're doing: Integrating Slack for alert notifications via webhook.

    Instructions:
    1. Enter a Slack webhook URL in format: https://hooks.slack.com/services/T00/B00/XX
    2. Set Notification Priority
    3. Click Test Notification to verify integration
    💡 Tip: Test your notification channel before saving to ensure webhooks are configured correctly. Different severity levels can route to different channels.
    📘 Best practice: Use different Slack channels for critical vs warning alerts to reduce alert fatigue.
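Routing severities to different Slack channels is done in Alertmanager; a minimal sketch using the placeholder webhook format from this step (channel names are hypothetical):

```yaml
route:
  receiver: slack-warning           # default route
  routes:
    - match:
        severity: critical
      receiver: slack-critical

receivers:
  - name: slack-critical
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/XX"  # placeholder
        channel: "#alerts-critical"
  - name: slack-warning
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/XX"  # placeholder
        channel: "#alerts-warning"
```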
  6. Execute PromQL Query
    What you're doing: Testing PromQL query language to extract and analyze metrics data.

    Instructions:
    1. Enter a PromQL query (examples below)
    2. Click Execute Query

    Example queries:
    rate(http_requests_total[5m]) - Request rate
    up - Service availability
    avg(cpu_usage) by (instance) - Average CPU by instance
    💡 Tip: PromQL's rate() function calculates per-second average rate. Use it for counter metrics. The [5m] is the time window.
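What `rate()` computes can be illustrated in plain Python. This is a toy re-implementation of the idea (per-second increase of a counter over a time window), not Prometheus code:

```python
def rate(samples, window_s):
    """Approximate PromQL rate(): per-second increase of a counter over
    the trailing window, using the first and last samples inside it.
    samples: list of (timestamp_seconds, counter_value), oldest first."""
    end = samples[-1][0]
    inside = [(t, v) for t, v in samples if t >= end - window_s]
    (t0, v0), (t1, v1) = inside[0], inside[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)

# http_requests_total scraped every 15s, growing 2 requests/second
samples = [(t, t * 2) for t in range(0, 301, 15)]
print(rate(samples, 300))  # -> 2.0
```

Real `rate()` also handles counter resets (a restart dropping the counter back to zero), which this sketch omits.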

Lab 13: Multi-Cloud Infrastructure as Code with Terraform
Terraform / Expert
Scenario: Hybrid Cloud Infrastructure Management
TechCorp manages 300+ resources across AWS, Azure, and VMware. Configure Terraform Cloud with remote state backend (S3+DynamoDB), enable multi-provider support, create production workspace, and enforce policy checks before infrastructure changes.

Learning Objectives:

  • State Management: Configure S3 backend with DynamoDB locking
  • Multi-Provider: Enable AWS, Azure, and VMware providers
  • Workspaces: Create environment-specific workspaces
  • Policy Enforcement: Enable compliance checks before apply

📋 Step-by-Step Instructions

  1. Configure Remote State Backend
    What you're doing: Setting up centralized state storage with locking to enable team collaboration and prevent state conflicts.

    Instructions:
    1. Enter S3 Bucket Name: tf-state-prod (stores Terraform state files)
    2. Enter DynamoDB Table: tf-locks (provides state locking)
    3. Click Save Backend to configure remote state
    💡 Tip: DynamoDB locking prevents concurrent Terraform runs from corrupting state. Without locking, two people running "terraform apply" simultaneously could cause state corruption.
    📘 Real-world context: Remote state backends are critical for team environments. S3 provides durability (99.999999999%) and versioning for state recovery. Always enable versioning on state buckets.
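The backend from this step corresponds to a `backend "s3"` block in Terraform; a sketch using the lab's names (the `key` and `region` values are assumptions):

```hcl
terraform {
  backend "s3" {
    bucket         = "tf-state-prod"
    key            = "global/terraform.tfstate"  # path within the bucket (assumed)
    region         = "us-east-1"                 # assumed region
    dynamodb_table = "tf-locks"                  # enables state locking
    encrypt        = true                        # encrypt state at rest
  }
}
```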
  2. Enable Multi-Cloud Providers
    What you're doing: Configuring Terraform to manage resources across AWS, Azure, and VMware simultaneously.

    Instructions:
    1. Check the AWS checkbox
    2. Check the Azure checkbox
    3. Check the VMware checkbox
    4. Select AWS Region: us-east-1 (N. Virginia) from dropdown
    5. Select Azure Region: East US from dropdown
    6. Click Save Providers
    💡 Tip: Multi-cloud strategies provide redundancy and avoid vendor lock-in. Each provider requires authentication (AWS keys, Azure service principals, VMware credentials).
    📘 Best practice: Use provider aliases when managing multiple regions or accounts of the same provider. Example: provider "aws" { alias = "west" }
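In code, the three providers and an alias look roughly like this (the VMware provider is `vsphere`; the second AWS region is illustrative):

```hcl
terraform {
  required_providers {
    aws     = { source = "hashicorp/aws" }
    azurerm = { source = "hashicorp/azurerm" }
    vsphere = { source = "hashicorp/vsphere" }
  }
}

provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "west"        # second region of the same provider
  region = "us-west-2"
}

provider "azurerm" {
  features {}            # azurerm requires this (possibly empty) block
}
```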
  3. Create Production Workspace
    What you're doing: Creating an isolated environment for production infrastructure, separate from dev/staging.

    Instructions:
    1. Enter Workspace Name: production
    2. Select Execution Mode (Remote recommended for teams)
    3. Click Create Workspace
    4. Verify workspace appears as "current workspace" in header
    💡 Tip: Workspaces allow using the same Terraform code with different variable files. Each workspace maintains separate state, enabling environment isolation (dev, staging, prod).
    📘 Real-world context: Production workspaces should have stricter access controls, approval workflows, and potentially different providers/regions than lower environments.
  4. Define Infrastructure Modules
    What you're doing: Configuring reusable Terraform modules for VPC networking and compute resources with standardized tagging.

    Instructions:
    1. Enter VPC Module name (e.g., "vpc-module" or "network-module")
    2. Enter Compute Module name (e.g., "compute-module" or "ec2-module")
    3. Enter Resource Tags exactly as: env=prod,owner=platform
    4. Click Save Modules
    💡 Tip: Modules promote DRY (Don't Repeat Yourself) principle. Write infrastructure code once, reuse across projects. Tag format must be key=value pairs for proper resource tagging.
    📘 Best practice: Standardized tags enable cost allocation, automation, and resource governance. Common tags: environment, owner, project, cost-center, compliance-scope.
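The module names and tag pairs from this step translate into HCL roughly as follows (the `./modules/...` source paths are hypothetical local paths):

```hcl
locals {
  common_tags = {
    env   = "prod"
    owner = "platform"
  }
}

module "vpc" {
  source = "./modules/vpc-module"      # hypothetical path
  tags   = local.common_tags
}

module "compute" {
  source = "./modules/compute-module"  # hypothetical path
  tags   = local.common_tags
}
```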
  5. Generate and Review Terraform Plan
    What you're doing: Running terraform plan to preview infrastructure changes before applying them.

    Instructions:
    1. Click Run Plan button
    2. Wait for plan generation (simulates what will be created/modified/destroyed)
    3. Review the output showing resources to add (should show ~15 resources)
    4. Verify VPC and Azure resources appear in plan
    💡 Tip: ALWAYS review plans before applying! The plan shows: + (create), ~ (modify), - (destroy). Red flags: unexpected deletions or modifications to critical resources.
    📘 Real-world context: Save plans for audit trails: terraform plan -out=plan.tfplan. This ensures the applied changes match what was reviewed and approved.
  6. Apply with Policy Enforcement (Sentinel)
    What you're doing: Applying infrastructure changes with policy-as-code validation to enforce governance rules.

    Instructions:
    1. Check the Enable Policy Validation checkbox
    2. Click Apply Changes
    3. Wait for policy checks to pass (validates compliance, security, cost limits)
    4. Verify "Policy check passed ✓" message and "Apply complete" confirmation
    ⚠️ Important: Policy validation MUST be enabled. Sentinel policies prevent non-compliant infrastructure (e.g., unencrypted storage, over-sized instances, missing tags).
    📘 Best practice: Implement hard/soft enforcement levels. Soft-mandatory allows overrides with justification, hard-mandatory blocks non-compliant changes entirely.
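The enforcement levels described above are declared per policy in a policy set's `sentinel.hcl`; a minimal sketch (policy name and file path are hypothetical):

```hcl
policy "mandatory-tags" {
  source            = "./policies/mandatory-tags.sentinel"  # hypothetical policy file
  enforcement_level = "soft-mandatory"  # advisory | soft-mandatory | hard-mandatory
}
```

With `soft-mandatory`, a failed check can be overridden with justification; `hard-mandatory` blocks the apply outright.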

Lab 14: Unified Multi-Cloud Observability Platform
AWS/Azure / Intermediate
Scenario: Cross-Cloud Monitoring Hub
DataAnalytics Inc. processes 2PB data/month across AWS and Azure. Configure CloudWatch cross-account monitoring for AWS account 111111111111, set up Azure Log Analytics workspace, build unified dashboards querying both clouds, and enable ML-powered anomaly detection.

Learning Objectives:

  • CloudWatch: Configure cross-account AWS monitoring
  • Azure Monitor: Set up Log Analytics workspace
  • Unified Dashboard: Build multi-cloud observability views
  • ML Insights: Enable intelligent anomaly detection

📋 Step-by-Step Instructions

  1. Configure AWS CloudWatch Cross-Account Monitoring
    What you're doing: Setting up centralized CloudWatch monitoring for a specific AWS account to aggregate metrics across multiple accounts.

    Instructions:
    1. Enter AWS Account ID: 111111111111 (12-digit account identifier)
    2. Select AWS Region: us-east-1 from dropdown
    3. The configuration auto-validates when both fields are entered correctly
    💡 Tip: Cross-account monitoring requires IAM roles with CloudWatch read permissions. The monitoring account assumes a role in the target account to pull metrics.
    📘 Real-world context: Organizations with 50+ AWS accounts use cross-account monitoring to view all CloudWatch metrics in a single pane. This enables centralized alerting and reduces console switching.
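The IAM role mentioned in the tip needs a trust policy in the target account allowing the monitoring account to assume it; a sketch (account `999999999999` stands in for the hypothetical central monitoring account, and the role would additionally need a CloudWatch read-only permissions policy attached):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::999999999999:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```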
  2. Configure Azure Monitor Log Analytics Workspace
    What you're doing: Connecting to Azure Log Analytics for querying logs and metrics across Azure resources.

    Instructions:
    1. Enter Workspace Name: analytics-workspace
    2. Enter Resource Group: RG-Monitor (groups related Azure resources)
    3. Click Connect Sources to establish connection
    4. Verify connection success message
    💡 Tip: Log Analytics workspaces support KQL (Kusto Query Language) for powerful log analysis. They can ingest 10GB-50TB/day depending on tier.
    📘 Best practice: Use separate workspaces for production vs non-production to prevent accidental data mixing and enable different retention policies per environment.
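Once the workspace is connected, queries use KQL; a sketch against the classic `Perf` table (column names follow the Log Analytics agent schema and may differ with the newer Azure Monitor agent):

```kql
// Average CPU per computer over the last hour, in 5-minute bins
Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
```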
  3. Build Unified Multi-Cloud Dashboard
    What you're doing: Creating a single dashboard that displays metrics from both AWS CloudWatch and Azure Monitor simultaneously.

    Instructions:
    1. Set Panel Count to 2 or more (number of metric visualizations)
    2. In Variable: cloud, hold Ctrl (Windows) or Cmd (Mac) and select both aws and azure
    3. Click Create Dashboard
    4. Review the dashboard preview showing data sources
    💡 Tip: Multi-cloud dashboards enable side-by-side comparison of AWS vs Azure performance, costs, and availability. Use variables to filter by cloud provider dynamically.
    📘 Real-world context: Hybrid cloud architectures (AWS + Azure) require unified observability. Example: E-commerce site runs web tier on AWS, data tier on Azure - one dashboard shows full stack health.
  4. Execute Cross-Cloud Query
    What you're doing: Running a unified query that aggregates metrics from both AWS CloudWatch and Azure Log Analytics.

    Instructions:
    1. In the query field, enter any query (examples below)
    2. Select Query Language (KQL or SQL)
    3. Optionally enable Use Query Cache for faster repeated queries
    4. Click Run Query

    Example queries:
    SELECT * FROM metrics WHERE cloud IN ('aws','azure')
    AzureMetrics | where TimeGenerated > ago(1h)
    avg(cpu_percent) by cloud_provider
    💡 Tip: Cross-cloud queries normalize different metric formats (AWS uses CamelCase, Azure uses snake_case). Query cache reduces latency for frequently-run queries from seconds to milliseconds.
    📘 Best practice: Use query variables for time ranges and filters. This makes dashboards reusable across teams without hardcoding values.
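The metric-name normalization mentioned in the tip can be sketched in a few lines of Python. This is an illustration of the idea, not the platform's actual implementation:

```python
import re

def to_snake_case(name: str) -> str:
    """Normalize an AWS-style CamelCase metric name (e.g. CPUUtilization)
    to the snake_case form used on the Azure side of a unified schema."""
    # Split an acronym run from a following capitalized word (CPU|Utilization),
    # then split lowercase/digit-to-uppercase boundaries, then lowercase all.
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    return s.lower()

print(to_snake_case("CPUUtilization"))    # -> cpu_utilization
print(to_snake_case("NetworkPacketsIn"))  # -> network_packets_in
```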
  5. Enable ML-Powered Anomaly Detection
    What you're doing: Activating machine learning algorithms that automatically detect unusual patterns in metrics without manual threshold configuration.

    Instructions:
    1. Check the Enable Anomaly Detection checkbox
    2. Select Sensitivity dropdown and choose High
    3. Configuration auto-validates when both settings are enabled

    Sensitivity levels:
    High: Detects smaller deviations (more alerts, fewer missed issues)
    Normal: Balanced detection (recommended for most workloads)
    💡 Tip: ML anomaly detection learns normal patterns over 7-14 days. It's ideal for detecting unexpected spikes, drops, or pattern changes that static thresholds would miss.
    📘 Real-world context: Traditional alerts fail during Black Friday (traffic 10x normal but healthy). ML anomaly detection adapts to new baselines, alerting only on truly abnormal behavior.
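The core idea behind anomaly detection (deviation from a learned baseline rather than a fixed threshold) can be illustrated with a simple z-score check. A toy sketch, not a cloud API; a "High" sensitivity setting corresponds to a smaller deviation threshold:

```python
import statistics

def is_anomaly(history, value, sensitivity=3.0):
    """Flag `value` as anomalous if it deviates from the baseline mean
    by more than `sensitivity` standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > sensitivity

# Baseline CPU hovering around 40% with small noise
baseline = [38, 41, 40, 39, 42, 40, 41, 39, 40, 41]
print(is_anomaly(baseline, 43))  # ordinary fluctuation -> False
print(is_anomaly(baseline, 95))  # sudden spike -> True
```

Real ML detectors go further, modelling seasonality and trend so that a legitimately shifting baseline (like Black Friday traffic) does not trigger alerts.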
  6. Configure Cost Optimization Controls
    What you're doing: Implementing log sampling and retention policies to reduce observability costs while maintaining visibility.

    Instructions:
    1. Check Enable Log Sampling checkbox (reduces ingestion volume)
    2. Enter Retention: 14 days (how long logs are stored)
    3. Both settings auto-validate when configured

    Cost impact:
    Sampling: 50% sampling = 50% cost reduction
     Retention: 14 days vs 90 days ≈ 84% storage cost savings (steady-state stored volume scales with retained days: 1 − 14/90)
    💡 Tip: Sample non-critical logs (INFO level) but keep 100% of ERROR/CRITICAL logs. Use tiered storage: hot data (7 days), warm (30 days), cold archive (90+ days).
    📘 Real-world context: Observability costs can reach $50K-500K/month at scale. CloudWatch charges $0.50/GB ingested, $0.03/GB stored. 100TB/month ingestion = $50K/month. Sampling + retention policies cut this 60-80%.
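The cost impact above can be sketched with a simple model using the quoted list prices ($0.50/GB ingested, $0.03/GB-month stored); storage is approximated as the sampled volume held for the retention period:

```python
def monthly_observability_cost(ingest_gb, sampling_rate=1.0,
                               ingest_price=0.50, storage_price=0.03,
                               retention_days=14):
    """Rough monthly cost: ingestion on the sampled volume, plus storage
    for the fraction of a 30-day month that data is retained."""
    sampled = ingest_gb * sampling_rate
    ingest_cost = sampled * ingest_price
    storage_cost = sampled * (retention_days / 30) * storage_price
    return ingest_cost + storage_cost

full = monthly_observability_cost(100_000)                       # 100 TB, no sampling
optimized = monthly_observability_cost(100_000, sampling_rate=0.5)
print(f"${full:,.0f} -> ${optimized:,.0f}")  # -> $51,400 -> $25,700
```

As the numbers show, ingestion dominates at short retention, which is why sampling cuts the bill roughly in half here.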