AI & Machine Learning Labs

Master AI infrastructure design, observability, and API management through specialized tool interfaces.

GenAI Expert Labs - Module 10

Specialized interfaces: Network Designer, Log Console, and API Gateway.

Lab 28: AI Infrastructure Designer
Network Tool / Expert
Scenario: ML Pipeline Architecture
CloudScale AI needs to design a production ML inference infrastructure. Use the Network Topology Designer to configure load balancers, model servers, GPU clusters, vector databases, and caching layers. Design for high availability and low latency.

Learning Objectives:

  • Architecture: Design scalable ML infrastructure
  • Load Balancing: Configure traffic distribution
  • GPU Clusters: Optimize compute resources
  • Caching: Reduce inference latency
ML Infrastructure Designer (interactive canvas)

  • Component palette: Load Balancer, Model Server, GPU Cluster, Vector DB, Cache Layer, API Gateway, Monitoring
  • Starting diagram: Clients → Load Balancer → Server 1 / Server 2 → GPU Cluster
  • Configuration panels (14 settings): Load Balancer, Model Servers, GPU Cluster, Caching, Vector Database, Monitoring

Lab 29: AI Observability Console
Log Analysis / Expert
Scenario: Production Incident Analysis
The ModelOps team has detected anomalies in the production ML system. Use the Log Analysis Console to investigate errors, configure alert thresholds, set up log aggregation, and create dashboards. Identify the root cause and set up monitoring.

Learning Objectives:

  • Log Analysis: Parse and filter production logs
  • Alerting: Configure threshold-based alerts
  • Metrics: Define key performance indicators
  • Dashboards: Visualize system health
Log Analysis Console
Severity filters: All | Error | Warning | Info
2024-01-15 14:20:01 INFO Model server started on port 8080 - version: 2.4.1
2024-01-15 14:20:02 INFO Loading model weights from s3://models/llm-v3.bin
2024-01-15 14:20:05 INFO GPU cluster initialized: 4x NVIDIA A100 80GB
2024-01-15 14:20:08 INFO Vector database connected: Pinecone (us-east-1)
2024-01-15 14:20:10 INFO Redis cache initialized: 10GB allocated
2024-01-15 14:20:12 DEBUG Health check endpoint registered: /health
2024-01-15 14:20:15 INFO Prometheus metrics endpoint: /metrics
2024-01-15 14:20:18 INFO Load balancer health check passed
2024-01-15 14:20:20 INFO Service registered with Kubernetes: ml-inference-prod
2024-01-15 14:20:25 DEBUG Batch size configured: 128 requests
2024-01-15 14:21:01 INFO Incoming request: POST /v1/inference - client: api-gateway
2024-01-15 14:21:02 INFO Request processed: latency=145ms, tokens=512
2024-01-15 14:21:05 INFO Cache miss: generating embedding for new query
2024-01-15 14:21:08 DEBUG Vector similarity search: 10 results in 12ms
2024-01-15 14:21:10 INFO GPU utilization: 72% - optimal range
2024-01-15 14:21:15 WARN Request latency above p95: 320ms (threshold: 200ms)
2024-01-15 14:21:18 INFO Batch inference completed: 64 requests in 1.2s
2024-01-15 14:21:20 INFO Cache hit rate: 82.3% - excellent
2024-01-15 14:21:25 DEBUG Memory pressure: 4.2GB / 8GB used
2024-01-15 14:21:30 INFO Streaming response started: request_id=xyz789
2024-01-15 14:22:01 WARN GPU memory usage at 85% - approaching threshold
2024-01-15 14:22:05 INFO Auto-scaling evaluation: current=3, target=3
2024-01-15 14:22:08 DEBUG Connection pool status: 45/100 active
2024-01-15 14:22:10 INFO Request processed: latency=98ms, tokens=256
2024-01-15 14:22:15 INFO Model checkpoint saved to S3
2024-01-15 14:22:20 WARN Rate limit approaching: 450/500 requests/min
2024-01-15 14:22:25 INFO Embeddings cached: 1024 new vectors
2024-01-15 14:22:30 DEBUG Token generation rate: 45 tok/s
2024-01-15 14:22:35 INFO Health check passed: all services healthy
2024-01-15 14:22:40 INFO Request queue depth: 25 (capacity: 500)
2024-01-15 14:23:01 ERROR Model inference timeout after 30000ms - request_id: abc123
2024-01-15 14:23:02 WARN GPU memory usage at 95% - node: gpu-cluster-01
2024-01-15 14:23:03 INFO Auto-scaling triggered: adding 2 replicas
2024-01-15 14:23:05 ERROR Vector DB connection pool exhausted - max: 100
2024-01-15 14:23:06 INFO Cache hit rate: 78.5% - avg latency: 12ms
2024-01-15 14:23:08 WARN Request queue depth: 450 (threshold: 500)
2024-01-15 14:23:10 ERROR OOM killed process: model-server-3 (8GB limit)
2024-01-15 14:23:12 INFO Health check passed: all endpoints responding
2024-01-15 14:23:15 DEBUG Batch inference: 128 requests processed in 2.3s
2024-01-15 14:23:18 ERROR CUDA out of memory: tried to allocate 2GB
2024-01-15 14:23:20 WARN Fallback to CPU inference enabled
2024-01-15 14:23:22 INFO New replica started: model-server-4
2024-01-15 14:23:25 INFO New replica started: model-server-5
2024-01-15 14:23:28 DEBUG Load balancer updated: 5 backends
2024-01-15 14:23:30 INFO Traffic redistributed across replicas
2024-01-15 14:23:35 WARN Latency spike detected: p99=850ms
2024-01-15 14:23:38 INFO Circuit breaker: monitoring failures
2024-01-15 14:23:40 ERROR Request failed: 503 Service Unavailable
2024-01-15 14:23:42 ERROR Upstream timeout: model-server-2 not responding
2024-01-15 14:23:45 WARN Connection pool low: 95/100 in use
Analysis Config panel (10 settings): Error Threshold, Alert Severity, Log Retention, Aggregation Window, Latency SLO, Root Cause, Remediation, Dashboard Type, Notification Channel, Sampling Rate

Lab 30: AI API Gateway Manager
API Gateway / Expert
Scenario: Production API Configuration
AIaaS Platform is launching its inference API to enterprise customers. Use the API Gateway Manager to configure rate limiting, authentication, versioning, caching policies, and circuit breakers. Ensure the API is production-ready.

Learning Objectives:

  • Rate Limiting: Protect API from abuse
  • Authentication: Secure API access
  • Versioning: Manage API lifecycle
  • Resilience: Configure circuit breakers
API Gateway Console (live)

  • Endpoints (all Active): POST /v1/inference, POST /v1/embeddings, GET /v1/models, GET /v1/health, DELETE /v1/cache
  • Configuration panels (14 settings): Rate Limiting, Security & Config

Lab 28: Infrastructure Designer

Objective

You are a CloudScale AI infrastructure engineer designing a production ML inference pipeline. Configure all infrastructure components for high availability, low latency, and scalability.

Using the Designer

  • Drag Components: Drag items from the left palette onto the canvas or into dashed placeholders
  • Move Nodes: Click and drag existing nodes to reposition them
  • Connection Lines: Purple animated arrows show data flow between components
  • Select Node: Click any node to highlight it

Diagram Placement Guide (Required)

Complete the ML inference architecture by placing these 4 components in the correct positions. The diagram shows data flow from Clients → Load Balancer → Model Servers → GPU Cluster.

Architecture Overview:

A production ML pipeline needs: (1) an entry point for external traffic, (2) response caching to reduce latency, (3) a vector store for embeddings/RAG, and (4) observability for the entire system. Place these supporting components around the core inference path.

  • Top-Left: API Gateway

    Entry point for all client requests. Sits at the top-left as the first component clients interact with before reaching the Load Balancer. Handles authentication, rate limiting, and request routing.

  • Top-Right: Cache Layer

    Positioned at the top near the entry point to intercept requests early. Caches embeddings and frequent responses to reduce GPU load and latency. Cache hits bypass the expensive inference path.

  • Bottom-Left: Vector DB

    Placed at the bottom near GPU Cluster. Stores embeddings for RAG (Retrieval-Augmented Generation). Model servers query Vector DB for context before generating responses on GPUs.

  • Bottom-Right: Monitoring

    Observes the entire pipeline from the side. Collects metrics from all components (GPU utilization, latency, error rates). Essential for detecting OOM issues, latency spikes, and triggering auto-scaling.

Click "Validate Diagram" to verify your placements are correct before configuring the settings below.
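The four required placements can be captured as a tiny lookup table with a checker. This is a hypothetical sketch of what "Validate Diagram" checks, not the lab's actual code:

```python
# Target placements from the guide above: slot -> required component.
TARGET_PLACEMENTS = {
    "top-left": "API Gateway",
    "top-right": "Cache Layer",
    "bottom-left": "Vector DB",
    "bottom-right": "Monitoring",
}

def validate(placed):
    """Return the slots whose component is missing or wrong."""
    return [slot for slot, comp in TARGET_PLACEMENTS.items()
            if placed.get(slot) != comp]

# An empty canvas fails all four slots; a correct diagram passes.
errors = validate({})            # all four slots reported
ok = validate(dict(TARGET_PLACEMENTS))  # []
```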

Step-by-Step Instructions

  1. Step 1: Load Balancer Configuration
    • Set Algorithm: Use "least-conn" for ML workloads (balances by active connections)
    • Set Health Check: Use "10s" for responsive failure detection
  2. Step 2: Model Server Setup
    • Set Replicas: Use 3+ for high availability
    • Set Auto-scaling: Use "latency" or "queue" for ML workloads
  3. Step 3: GPU Cluster Configuration
    • Set GPU Type: A100 for high throughput, T4 for cost efficiency
    • Set GPU Count: 4+ GPUs for production inference
  4. Step 4: Caching Layer
    • Set Cache Type: Redis for distributed caching
    • Set TTL: 5 minutes for embedding caches, 1 hour for static responses
  5. Step 5: Vector Database
    • Set Database: Pinecone or Weaviate for managed, Milvus for self-hosted
    • Set Index Type: HNSW for best accuracy/speed tradeoff
  6. Step 6: Monitoring Setup
    • Set Metrics: Prometheus for open-source, Datadog for managed
    • Set Alerting: PagerDuty for production incidents
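The six steps above can be collected into one configuration sketch. The key names here are illustrative, not the lab's real setting identifiers:

```python
# Hypothetical config mirroring Steps 1-6; key names are illustrative.
ML_INFRA_CONFIG = {
    "load_balancer": {"algorithm": "least-conn", "health_check_s": 10},
    "model_servers": {"replicas": 3, "autoscale_on": "latency"},
    "gpu_cluster":   {"gpu_type": "A100", "gpu_count": 4},
    "cache":         {"backend": "redis", "ttl_s": 300},   # 5 min for embeddings
    "vector_db":     {"provider": "pinecone", "index": "hnsw"},
    "monitoring":    {"metrics": "prometheus", "alerting": "pagerduty"},
}

# Sanity checks drawn from the guide's recommendations.
assert ML_INFRA_CONFIG["model_servers"]["replicas"] >= 3   # HA needs 3+
assert ML_INFRA_CONFIG["gpu_cluster"]["gpu_count"] >= 4    # production inference
```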
Pro Tips
  • Least-connections LB is best for variable inference latencies
  • H100 GPUs offer up to roughly 3x the LLM inference throughput of A100s
  • HNSW index provides best recall with sub-ms latency
  • Always use distributed caching (Redis) for multi-replica setups
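To see why least-connections suits variable inference latencies, here is a minimal selector sketch: a slow request holds its connection longer, so a busy backend naturally attracts less new traffic. This is an illustration, not a real load balancer:

```python
def pick_backend(active_connections):
    """Least-connections: route to the backend with the fewest in-flight requests.

    `active_connections` maps backend name -> current open connection count.
    """
    return min(active_connections, key=active_connections.get)

# server-a is tied up by long-running inferences, so new traffic
# flows to the less-loaded server-b.
choice = pick_backend({"server-a": 7, "server-b": 2})  # -> "server-b"
```

Round-robin would keep sending every other request to server-a regardless of how long its inferences run, which is exactly the mistake called out below.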
Optimal Configuration (100% Score)
  • LB Algorithm: least-conn or weighted
  • Health Check: 5s or 10s
  • Replicas: 3 or 5
  • Auto-scaling: latency or queue
  • GPU: A100 or H100, 4+ GPUs
  • Cache: Redis, 5 min TTL
  • Vector DB: Any with HNSW index
  • Monitoring: Prometheus + PagerDuty
Common Mistakes
  • Using round-robin LB (doesn't account for inference latency variance)
  • Only 2 replicas (no fault tolerance if one goes down)
  • CPU-based auto-scaling (GPU workloads need different metrics)
  • Local caching in multi-replica setup (cache inconsistency)

Lab 29: Observability Console

Objective

You are a ModelOps engineer investigating a production incident. Analyze the logs, identify the root cause, configure alerting thresholds, and set up monitoring dashboards.

Step-by-Step Instructions

  1. Step 1: Analyze the Logs
    • Review the log entries in the console
    • Identify ERROR and WARN patterns
    • Note: GPU memory at 95%, OOM kill, DB connection exhausted
  2. Step 2: Configure Error Thresholds
    • Set Error Threshold: 1% for production SLA
    • Set Alert Severity: Critical for high-impact issues
  3. Step 3: Log Management
    • Set Retention: 30+ days for incident analysis
    • Set Aggregation Window: 5 minutes for alert smoothing
  4. Step 4: Define SLOs
    • Set Latency SLO: P99 < 200ms for inference APIs
    • This defines your service level objective
  5. Step 5: Root Cause Analysis
    • Based on logs: OOM + DB connection exhaustion = Memory exhaustion
    • Select appropriate Remediation: Increase memory or scale horizontally
  6. Step 6: Dashboard & Alerting
    • Set Dashboard: Grafana for metrics visualization
    • Set Notification: PagerDuty or Slack for alerts
    • Set Sampling: 100% for critical production logs
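Step 1's pattern hunt can be mechanized. A rough sketch, assuming the timestamp/level/message format shown in the console above, that counts severities and flags memory-related errors:

```python
import re

# A few of the incident-window lines from the console (format assumed).
LOG_LINES = [
    "2024-01-15 14:23:01 ERROR Model inference timeout after 30000ms - request_id: abc123",
    "2024-01-15 14:23:02 WARN GPU memory usage at 95% - node: gpu-cluster-01",
    "2024-01-15 14:23:05 ERROR Vector DB connection pool exhausted - max: 100",
    "2024-01-15 14:23:10 ERROR OOM killed process: model-server-3 (8GB limit)",
    "2024-01-15 14:23:18 ERROR CUDA out of memory: tried to allocate 2GB",
]

PATTERN = re.compile(r"^(\S+ \S+) (DEBUG|INFO|WARN|ERROR) (.*)$")

def summarize(lines):
    """Count lines per severity and collect memory-related ERROR messages."""
    counts = {"DEBUG": 0, "INFO": 0, "WARN": 0, "ERROR": 0}
    memory_signals = []
    for line in lines:
        m = PATTERN.match(line)
        if not m:
            continue
        _, level, msg = m.groups()
        counts[level] += 1
        if level == "ERROR" and ("OOM" in msg or "out of memory" in msg):
            memory_signals.append(msg)
    return counts, memory_signals

counts, memory = summarize(LOG_LINES)
# Two of the four errors point directly at memory exhaustion,
# which supports the root-cause selection in Step 5.
```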
Pro Tips
  • 5-minute aggregation reduces alert noise while catching issues
  • P99 latency is more important than average for user experience
  • OOM errors often indicate a need for horizontal scaling, not just more memory
  • Always use 100% sampling for ERROR level logs
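The P99-versus-average tip is easy to demonstrate: two slow stragglers barely move the mean but set the P99. The percentile function below is a deliberately simple nearest-rank estimator, not a production statistics routine:

```python
def p99(latencies_ms):
    """Nearest-rank P99 (simple illustrative estimator)."""
    ordered = sorted(latencies_ms)
    rank = max(0, int(len(ordered) * 0.99) - 1)
    return ordered[rank]

# 98 fast requests plus 2 slow stragglers (e.g. GPU contention).
samples = [100] * 98 + [850, 900]
mean = sum(samples) / len(samples)   # 115.5 ms: looks healthy
tail = p99(samples)                  # 850 ms: the user-visible pain
```

An SLO on the mean would stay green through this incident; the P99 < 200ms SLO from Step 4 fires immediately.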
Optimal Configuration (100% Score)
  • Error Threshold: 1% (production standard)
  • Alert Severity: Critical (P1)
  • Retention: 30 or 90 days
  • Aggregation: 5 minutes
  • Latency SLO: P99 < 200ms
  • Root Cause: Memory exhaustion (based on logs)
  • Remediation: Scale horizontally or increase memory
  • Dashboard: Grafana
  • Notification: PagerDuty or Slack
Log Analysis Hints
  • 14:23:01 ERROR: inference timeout - backend overloaded
  • 14:23:02 WARN: GPU memory at 95% - near OOM
  • 14:23:05 ERROR: Vector DB pool exhausted - connection leak or high load
  • 14:23:10 ERROR: OOM killed - confirms memory as the root cause

Lab 30: API Gateway Manager

Objective

You are launching the AIaaS Platform inference API to enterprise customers. Configure the API Gateway with rate limiting, authentication, caching, and resilience patterns for production readiness.

Step-by-Step Instructions

  1. Step 1: Rate Limiting
    • Set Requests/min: 100-500 for standard tier, 1000 for enterprise
    • Set Burst Limit: 20-50 to handle traffic spikes
    • Watch the rate limit bar update as you configure!
  2. Step 2: Security Configuration
    • Set Authentication: JWT or OAuth2 for enterprise, API Key for simple use
    • Set API Version: v1 (stable) for production
  3. Step 3: Performance Settings
    • Set Response Caching: 5 minutes for GET endpoints
    • Set Timeout: 30-60 seconds for inference APIs
  4. Step 4: Resilience Patterns
    • Set Circuit Breaker: 10 failures to trigger
    • Set Recovery Time: 60 seconds before retry
    • Set Retry Policy: 3 retries for transient failures
  5. Step 5: Additional Config
    • Set CORS: Specific Origins for web clients
    • Set Validation: Full validation for production
    • Set Logging: Info level for production
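Step 1's two knobs map naturally onto a token bucket: the refill rate enforces the requests/min limit, and the bucket capacity is the burst limit. A minimal sketch, not the gateway's real implementation:

```python
class TokenBucket:
    """Allow `rate_per_min` sustained requests with bursts up to `burst`."""

    def __init__(self, rate_per_min, burst):
        self.rate = rate_per_min / 60.0   # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)        # start full
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_min=300, burst=20)
# A 20-request spike at t=0 is absorbed; the 21st is rejected
# until the bucket refills at 5 tokens/second.
spike = [bucket.allow(0.0) for _ in range(21)]
```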
Pro Tips
  • JWT auth is more secure and supports scopes/claims
  • Circuit breakers prevent cascade failures in microservices
  • Cache only GET endpoints, never POST inference requests
  • 60s timeout is typical for LLM inference APIs
  • Click different endpoints to see the endpoint selection work!
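The circuit-breaker tip can be sketched with Step 4's numbers (10 failures to open, 60 seconds of recovery before probing again). A hypothetical minimal version:

```python
class CircuitBreaker:
    """Open after `threshold` consecutive failures; probe after `recovery_s`."""

    def __init__(self, threshold=10, recovery_s=60.0):
        self.threshold = threshold
        self.recovery_s = recovery_s
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self, now):
        if self.opened_at is None:
            return True
        # After the recovery window, allow probe traffic through (half-open).
        return now - self.opened_at >= self.recovery_s

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None          # close again on success
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now       # trip the breaker

breaker = CircuitBreaker()
for _ in range(10):                # ten upstream failures in a row...
    breaker.record(False, now=0.0)
# ...so the breaker is open: fail fast instead of piling onto the backend.
```

While open, the gateway returns errors immediately instead of queueing requests against a dead upstream, which is what prevents the cascade failures mentioned above.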
Optimal Configuration (100% Score)
  • Rate Limit: 100-500 req/min
  • Burst: 20 or 50 requests
  • Auth: JWT or OAuth2
  • Version: v1 (stable)
  • Cache: 5 minutes
  • Timeout: 30 or 60 seconds
  • Circuit Breaker: 10 failures
  • Recovery: 60 seconds
  • CORS: Specific Origins
  • Validation: Full validation
  • Logging: Info level
  • Retry: 3 retries
Common Mistakes
  • No rate limiting (API abuse risk)
  • API Key auth only (less secure than JWT/OAuth)
  • Caching POST requests (stale inference results)
  • No circuit breaker (cascade failures)
  • CORS Allow All (security vulnerability)