Master AI infrastructure design, observability, and API management through specialized tool interfaces.
Specialized interfaces: Network Designer, Log Console, and API Gateway.
You are a CloudScale AI infrastructure engineer designing a production ML inference pipeline. Configure all infrastructure components for high availability, low latency, and scalability.
Complete the ML inference architecture by placing these 4 components in the correct positions. The diagram shows data flow from Clients → Load Balancer → Model Servers → GPU Cluster.
Architecture Overview:
A production ML pipeline needs: (1) an entry point for external traffic, (2) response caching to reduce latency, (3) a vector store for embeddings/RAG, and (4) observability for the entire system. Place these supporting components around the core inference path.
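For the Load Balancer in the core path, a least-connections policy is a common choice for ML workloads, since inference request durations vary widely. A minimal sketch in Python; the server names and connection counts are invented for illustration:

```python
# Hypothetical sketch: least-connections server selection for the Load Balancer.
# Server names and in-flight connection counts are illustrative only.

def pick_server(active_connections: dict[str, int]) -> str:
    """Return the model server with the fewest active connections."""
    return min(active_connections, key=active_connections.get)

# Three model-server replicas with differing in-flight request counts.
servers = {"model-server-1": 7, "model-server-2": 2, "model-server-3": 5}
print(pick_server(servers))  # model-server-2
```

A round-robin policy would ignore the fact that one long-running generation request can tie up a replica; selecting by active connections routes new work away from busy servers.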
API Gateway
Entry point for all client requests. Sits at the top-left as the first component clients interact with before reaching the Load Balancer. Handles authentication, rate limiting, and request routing.
Cache Layer
Positioned at the top near the entry point to intercept requests early. Caches embeddings and frequent responses to reduce GPU load and latency. Cache hits bypass the expensive inference path.
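The hit/miss logic described above can be sketched as a single-process TTL cache. In production this role is typically played by Redis; the class, keys, and values here are illustrative stand-ins:

```python
import time

# Minimal sketch of a TTL response cache (a stand-in for Redis).
# Keys are illustrative prompt strings, not a real cache schema.

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # miss: fall through to the inference path
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value  # hit: bypass the expensive GPU path

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=300)  # 5-minute TTL, per the embedding-cache guidance
cache.set("What is RAG?", "Retrieval-Augmented Generation ...")
```

On a hit, the request returns without touching a model server; only misses and expired entries pay for inference.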
Vector DB
Placed at the bottom, near the GPU Cluster. Stores embeddings for RAG (Retrieval-Augmented Generation). Model servers query the Vector DB for context before generating responses on the GPUs.
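The retrieval step can be illustrated with a toy exact scan over cosine similarity. The 3-dimensional vectors and document IDs are made up; a real Vector DB (Pinecone, Weaviate, Milvus) would answer the same query through an approximate index such as HNSW rather than scanning every entry:

```python
import math

# Toy sketch of the RAG lookup: rank stored embeddings by cosine
# similarity to a query embedding. Vectors and doc IDs are fabricated.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, index, k=2):
    """Return the k document IDs most similar to the query embedding."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.0, 1.0, 0.1],
    "doc-c": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index))  # ['doc-a', 'doc-c']
```

The retrieved documents are then packed into the prompt as context before the GPUs generate a response.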
Monitoring
Observes the entire pipeline from the side. Collects metrics from all components (GPU utilization, latency, error rates). Essential for detecting OOM issues, latency spikes, and triggering auto-scaling.
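One check a monitoring system runs over collected latency samples is a percentile-against-SLO comparison. A sketch using nearest-rank P99; the sample latencies are fabricated, and a real system would read these from Prometheus histograms rather than a Python list:

```python
import math

# Sketch of a P99 latency check against an SLO. Sample values are
# fabricated; real metrics come from a monitoring backend.

def percentile(samples, p):
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [42, 51, 48, 180, 47, 55, 49, 46, 210, 50]
p99 = percentile(latencies_ms, 99)
print(f"P99 = {p99}ms, SLO breach: {p99 > 200}")  # P99 = 210ms, SLO breach: True
```

Note that the mean of these samples is well under 100 ms; tail percentiles are what expose the latency spikes an average would hide.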
Click "Validate Diagram" to verify your placements are correct before configuring the settings below.
Algorithm: Use "least-conn" for ML workloads (balances by active connections)
Health Check: Use "10s" for responsive failure detection
Replicas: Use 3+ for high availability
Auto-scaling: Use "latency" or "queue" for ML workloads
GPU Type: A100 for high throughput, T4 for cost efficiency
GPU Count: 4+ GPUs for production inference
Cache Type: Redis for distributed caching
TTL: 5 minutes for embedding caches, 1 hour for static responses
Database: Pinecone or Weaviate for managed, Milvus for self-hosted
Index Type: HNSW for the best accuracy/speed tradeoff
Metrics: Prometheus for open-source, Datadog for managed
Alerting: PagerDuty for production incidents

You are a ModelOps engineer investigating a production incident. Analyze the logs, identify the root cause, configure alerting thresholds, and set up monitoring dashboards.
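The first step of the investigation, scanning logs for a failure signature, can be sketched as below. The log lines and their format are invented for illustration; real GPU out-of-memory messages vary by framework and serving stack:

```python
# Hypothetical log excerpt and a minimal scan for out-of-memory signals.
# Timestamps, server names, and message formats are fabricated.

logs = [
    "2024-05-01T12:00:01 INFO  model-server-2 request served in 48ms",
    "2024-05-01T12:00:03 WARN  model-server-2 GPU memory at 97%",
    "2024-05-01T12:00:04 ERROR model-server-2 CUDA out of memory",
    "2024-05-01T12:00:05 ERROR model-server-2 request failed: OOM",
]

OOM_SIGNALS = ("out of memory", "oom")

def find_oom_events(lines):
    """Keep ERROR lines whose message suggests memory exhaustion."""
    return [line for line in lines
            if "ERROR" in line and any(s in line.lower() for s in OOM_SIGNALS)]

for event in find_oom_events(logs):
    print(event)
```

A WARN about memory pressure shortly before the first ERROR, as in this excerpt, is the kind of correlation that points at memory exhaustion as the root cause.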
Error Threshold: 1% for production SLA
Alert Severity: Critical for high-impact issues
Retention: 30+ days for incident analysis
Aggregation Window: 5 minutes for alert smoothing
Latency SLO: P99 < 200ms for inference APIs
Root Cause: Memory exhaustion
Remediation: Increase memory or scale horizontally
Dashboard: Grafana for metrics visualization
Notification: PagerDuty or Slack for alerts
Sampling: 100% for critical production logs

You are launching the AIaaS Platform inference API to enterprise customers. Configure the API Gateway with rate limiting, authentication, caching, and resilience patterns for production readiness.
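The alert rule behind a 1% error threshold with a 5-minute aggregation window reduces to a windowed ratio check. A sketch with fabricated request counts; a real deployment would express this as a Prometheus alerting rule rather than application code:

```python
# Sketch of a windowed error-rate alert: fire when the error rate over
# the aggregation window exceeds the SLA threshold. Counts are fabricated.

def should_alert(errors: int, total: int, threshold: float = 0.01) -> bool:
    """True when the windowed error rate exceeds the threshold."""
    if total == 0:
        return False  # no traffic in the window: nothing to alert on
    return errors / total > threshold

# 5-minute window: 12,000 requests, 240 errors -> 2% error rate.
print(should_alert(errors=240, total=12_000))  # True
```

Aggregating over the window smooths out single-second spikes, so the pager fires on sustained degradation rather than noise.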
Requests/min: 100-500 for the standard tier, 1000 for enterprise
Burst Limit: 20-50 to handle traffic spikes
Authentication: JWT or OAuth2 for enterprise, API Key for simple use
API Version: v1 (stable) for production
Response Caching: 5 minutes for GET endpoints
Timeout: 30-60 seconds for inference APIs
Circuit Breaker: 10 failures to trigger
Recovery Time: 60 seconds before retry
Retry Policy: 3 retries for transient failures
CORS: Specific origins for web clients
Validation: Full validation for production
Logging: Info level for production
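The circuit-breaker settings above (trigger at 10 failures, 60-second recovery) can be sketched as below. The class is a minimal illustration, not a real gateway's API; a clock function is injected so the behavior is testable:

```python
import time

# Minimal circuit-breaker sketch: open after a threshold of consecutive
# failures, then allow a trial request only after the recovery period.
# This mirrors the gateway settings; production gateways implement it internally.

class CircuitBreaker:
    def __init__(self, failure_threshold=10, recovery_seconds=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: permit a trial request once the recovery window has passed.
        return self.clock() - self.opened_at >= self.recovery_seconds

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open the circuit

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit again
```

The gateway wraps each upstream call: check `allow_request()` first, then report the outcome with `record_failure()` or `record_success()`. While the circuit is open, requests fail fast instead of queueing behind an unhealthy backend.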