AI & Machine Learning Labs

Master evaluation pipelines, governance policies, and A/B experimentation through specialized tool interfaces.

GenAI Expert Labs - Module 9

Specialized interfaces: CLI, IDE, and Analytics Dashboard.

Lab 25: Evaluation CLI Studio
CLI Tool / Expert
Scenario: AutoEval Command-Line Tool
QualityAI uses a terminal-based evaluation CLI to benchmark LLMs. Use the llm-eval command-line interface to configure dataset parameters, metric weights, and quality gates, and to run benchmarks. Master the CLI workflow used in real ML engineering environments.

Learning Objectives:

  • CLI Workflows: Execute evaluation commands
  • Configuration: Set dataset/metric parameters
  • Quality Gates: Configure pass/fail thresholds
  • Benchmarking: Compare against baselines
llm-eval — evaluation-pipeline — bash
qualityai@eval-server:~$ llm-eval --version
llm-eval v2.4.1 (ML Evaluation Framework)
qualityai@eval-server:~$ llm-eval init --config interactive
✓ Initializing evaluation pipeline configuration...

The interactive session prompts for 14 parameters in four groups:

  • Dataset Configuration: --dataset-size, --split-ratio, --validation
  • Metric Weights (sum = 1.0): --w-accuracy, --w-relevance, --w-groundedness, --w-coherence
  • Measured Scores (%): --score-accuracy, --score-relevance, --score-groundedness, --score-coherence
  • Quality Gates: --threshold, --baseline, --significance
qualityai@eval-server:~$ llm-eval run --execute

Lab 26: Policy Editor Studio
IDE / Expert
Scenario: AI Governance IDE
HealthTech Corp uses a specialized IDE to manage AI governance policies. Use the Policy Editor to configure data protection, content safety, audit logging, and compliance frameworks. Edit the governance.yaml policy file for HIPAA compliance.

Learning Objectives:

  • Policy Config: Edit YAML governance policies
  • Data Protection: Configure PII/PHI rules
  • Content Safety: Set up filters/escalation
  • Compliance: HIPAA/GDPR frameworks
The Policy Editor opens with an Explorer showing policies/ (governance.yaml, safety.yaml, audit.yaml), schemas/, and README.md, with governance.yaml open in the active tab:
# AI Governance Policy
version: "2.0"

data_protection:
  pii_strategy: null
  phi_handling: null
  retention_days: null

content_safety:
  medical_filter: null
  selfharm_filter: null
  rx_filter: null

audit:
  log_level: null
  storage: null
  access_tracking: null

human_review:
  hitl_trigger: null
  escalation: null
  sla_minutes: null

compliance:
  framework: null
  data_residency: null
The Properties panel on the right groups the 14 editable policies into five sections: Data Protection, Content Safety, Audit Settings, Human Review, and Compliance.

Lab 27: Experiment Analytics Console
Dashboard / Expert
Scenario: A/B Testing Platform
ProductLabs uses an analytics dashboard to manage A/B experiments. Use the Experiment Console to configure traffic allocation and success metrics, monitor guardrails, and analyze results. Make data-driven rollout decisions.

Learning Objectives:

  • Traffic Split: Control vs treatment allocation
  • Metrics: Define success metrics and MDE
  • Guardrails: Set latency/error limits
  • Analysis: Statistical significance
The console has Setup, Results, and History tabs, with live tiles for Total Users, Conversion, Power (target: 80%+), and experiment Status. The Setup tab exposes 13 settings:

  • Traffic Allocation: Control Split (%), Treatment Split (%), Ramp Schedule
  • Primary Metric: Control vs Treatment
  • Statistical Settings: MDE (%), Power
  • Observed Results: Users and Conv (%) for CONTROL (A) and TREATMENT (B)
  • Guardrails: Latency P99 (ms), Error Rate (%), Rollback Trigger

Lab 25: Evaluation CLI

Objective

You are a QualityAI ML Engineer configuring an LLM evaluation pipeline using the llm-eval CLI tool. Configure all 14 parameters to benchmark your model against production baselines.

Step-by-Step Instructions

  1. Step 1: Dataset Configuration
    • Select --dataset-size: Choose 5000 or 10000 for reliable results
    • Select --split-ratio: Use 80-20 (industry standard train/test split)
    • Select --validation: Choose "kfold" for robust cross-validation
  2. Step 2: Configure Metric Weights
    • Set all 4 weights (accuracy, relevance, groundedness, coherence)
    • IMPORTANT: Weights must sum to exactly 1.0
    • For balanced evaluation, use 0.25 for each metric
  3. Step 3: Enter Measured Scores
    • Input the scores from your evaluation run for each metric
    • Scores are percentages (75-95 typical range)
    • Weighted score = Σ(weight × score) for all metrics
  4. Step 4: Set Quality Gates
    • Set --threshold: 80% is standard pass threshold
    • Select --baseline: "previous" to compare against production
    • Set --significance: 0.05 for standard p-value threshold
  5. Step 5: Execute & Review
    • Click "Execute" to validate your configuration
    • Click "Preview" to see the evaluation dashboard
    • Review weighted score and quality gate status
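The scoring in Steps 2-4 can be sketched in a few lines (my reconstruction of the arithmetic described above, not the actual llm-eval source):

```python
# Sketch of the weighted-score and quality-gate logic: weights must sum to
# 1.0, weighted score = sum(weight * score), pass if score >= threshold.

def weighted_score(weights: dict, scores: dict, threshold: float = 80.0):
    """Return (score, passed). Raises if weights don't sum to 1.0."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {sum(weights.values())}")
    total = sum(weights[m] * scores[m] for m in weights)
    return total, total >= threshold

# Balanced weights (0.25 each) with scores in the typical 75-95% range:
weights = {"accuracy": 0.25, "relevance": 0.25, "groundedness": 0.25, "coherence": 0.25}
scores  = {"accuracy": 88.0, "relevance": 85.0, "groundedness": 90.0, "coherence": 83.0}

total, passed = weighted_score(weights, scores)
print(total, passed)  # 86.5 True
```

With the mistaken weights from "Common Mistakes" below (0.3 + 0.3 + 0.3 + 0.2 = 1.1), the same function raises a ValueError instead of returning a score.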
Pro Tips
  • Larger datasets (10000+) give more statistically reliable results
  • K-fold validation reduces variance compared to holdout
  • Use stricter significance (0.01) for high-stakes deployments
  • Always compare against previous production model as baseline
Optimal Configuration (100% Score)
  • Dataset: 5000 or 10000 samples
  • Split: 80-20
  • Validation: kfold
  • Weights: 0.25 each (sum = 1.0)
  • Threshold: 80 or 85
  • Baseline: previous
  • Significance: 0.05
Common Mistakes
  • Weights not summing to 1.0 (e.g., 0.3 + 0.3 + 0.3 + 0.2 = 1.1)
  • Using a dataset that is too small (1000 samples), which reduces statistical power
  • Forgetting to set all 4 measured scores
  • Not selecting a baseline for comparison

Lab 26: Policy Editor

Objective

You are a HealthTech Corp compliance engineer. Use the Policy Editor IDE to configure AI governance policies in the governance.yaml file. Ensure HIPAA compliance for healthcare AI applications.

Step-by-Step Instructions

  1. Step 1: Data Protection Settings
    • Set pii_strategy: Use "tokenize" for reversible protection
    • Set phi_handling: Use "never-store" for HIPAA compliance
    • Set retention_days: Use 2555 (7 years) for HIPAA requirement
  2. Step 2: Content Safety Filters
    • Set medical_filter: Use "block" to prevent misinformation
    • Set selfharm_filter: Use "block-escalate" for immediate action
    • Set rx_filter: Use "block-refer" to redirect to doctors
  3. Step 3: Audit Configuration
    • Set log_level: Use "comprehensive" for full audit trail
    • Set storage: Use "immutable" (WORM) for tamper-proof logs
    • Set access_tracking: Use "full" for complete access logs
  4. Step 4: Human Review Settings
    • Set hitl_trigger: Use "medical-advice" for clinical review
    • Set escalation: Use "clinical" or "tiered" for proper routing
    • Set sla_minutes: Use 15 or 60 for response time
  5. Step 5: Compliance Framework
    • Set framework: Use "hipaa" or "both" for HIPAA+GDPR
    • Set data_residency: Use "us" for HIPAA, "eu" for GDPR
  6. Step 6: Save & Validate
    • Watch the YAML code update as you change properties
    • Click "Save & Validate" to check your configuration
    • Click "Preview" to see governance dashboard
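The checks behind "Save & Validate" can be approximated as follows (assumed logic for illustration; the IDE's actual validator may differ):

```python
# Illustrative validator: every policy field must be set, and the HIPAA
# rules from this lab must hold (PHI never stored in plain text, 7-year
# retention). Not the IDE's real implementation.

HIPAA_MIN_RETENTION_DAYS = 2555  # 7 years

def validate(policy: dict) -> list:
    """Return a list of violation messages (empty list means valid)."""
    errors = []
    for section, fields in policy.items():
        for key, value in fields.items():
            if value is None:
                errors.append(f"{section}.{key} is not set")
    dp = policy.get("data_protection", {})
    if dp.get("phi_handling") not in ("never-store", "tokenized"):
        errors.append("phi_handling must be never-store (or tokenized) for HIPAA")
    if (dp.get("retention_days") or 0) < HIPAA_MIN_RETENTION_DAYS:
        errors.append("retention_days below the 7-year HIPAA minimum")
    return errors

policy = {
    "data_protection": {"pii_strategy": "tokenize", "phi_handling": "never-store",
                        "retention_days": 2555},
    "audit": {"log_level": "comprehensive", "storage": "immutable",
              "access_tracking": "full"},
}
print(validate(policy))  # prints [] (no violations)
```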
Pro Tips
  • Use the Properties panel on the right to edit values; the YAML updates live!
  • Click different files in the Explorer to see file structure (read-only)
  • PHI must NEVER be stored in plain text for HIPAA compliance
  • Immutable logs prevent tampering and satisfy audit requirements
Optimal HIPAA Configuration (100% Score)
  • pii_strategy: tokenize
  • phi_handling: never-store
  • retention_days: 2555 (7 years)
  • medical_filter: block
  • selfharm_filter: block-escalate
  • rx_filter: block-refer
  • log_level: comprehensive
  • storage: immutable
  • framework: hipaa or both
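Put together, a governance.yaml that satisfies the checklist above (the remaining fields, access_tracking, hitl_trigger, escalation, sla_minutes, and data_residency, are filled with values from Steps 3-5) looks like:

```yaml
# AI Governance Policy (one configuration scoring 100%)
version: "2.0"

data_protection:
  pii_strategy: tokenize
  phi_handling: never-store
  retention_days: 2555        # 7 years, HIPAA minimum

content_safety:
  medical_filter: block
  selfharm_filter: block-escalate
  rx_filter: block-refer

audit:
  log_level: comprehensive
  storage: immutable          # WORM, tamper-proof
  access_tracking: full

human_review:
  hitl_trigger: medical-advice
  escalation: clinical
  sla_minutes: 15

compliance:
  framework: hipaa
  data_residency: us
```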
HIPAA Compliance Requirements
  • PHI must use "never-store" or tokenized storage; encryption alone is insufficient
  • Minimum 7-year retention (2555 days) for audit logs
  • Comprehensive logging required for all PHI access
  • Self-harm content must trigger immediate escalation

Lab 27: Experiment Console

Objective

You are a ProductLabs data scientist setting up an A/B experiment. Configure traffic allocation, define success metrics, set guardrails, and analyze results to make a ship/no-ship decision.

Step-by-Step Instructions

  1. Step 1: Traffic Allocation
    • Set Control Split: Start high (80-95%) for safety
    • Set Treatment Split: Start low (5-20%) initially
    • IMPORTANT: Control + Treatment must equal 100%
    • Set Ramp Schedule: Use progressive (5%→25%→50%→100%)
  2. Step 2: Define Success Metric
    • Select Primary Metric: Choose your key success indicator
    • Set MDE: 3-5% is standard minimum detectable effect
    • Set Power: 80% minimum, 90% recommended
  3. Step 3: Enter Observed Results
    • Enter Control Users: Number of users in control group
    • Enter Control Conv %: Conversion rate for control
    • Enter Treatment Users: Number of users in treatment
    • Enter Treatment Conv %: Conversion rate for treatment
    • Watch metrics update as you enter values!
  4. Step 4: Configure Guardrails
    • Set Latency P99: 500ms is standard threshold
    • Set Error Rate: 0.5% or 1% maximum
    • Set Rollback Trigger: Use "any-guardrail" for safety
  5. Step 5: Launch & Analyze
    • Click "Launch" to validate your experiment setup
    • Click "Analysis" to see detailed results dashboard
    • Review p-value and recommendation
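The p-value behind the ship/no-ship recommendation is conventionally a two-proportion z-test; a minimal sketch (my assumption about the platform's method, which may differ):

```python
# Two-proportion z-test on conversion rates, applying the decision
# framework from this lab (SHIP IT / DON'T SHIP / CONTINUE).
from math import erf, sqrt

def ab_test(users_a, conv_a, users_b, conv_b):
    """conv_* are conversion rates in percent; returns (p_value, decision)."""
    p_a, p_b = conv_a / 100, conv_b / 100
    pooled = (p_a * users_a + p_b * users_b) / (users_a + users_b)
    se = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    if p_value >= 0.05:
        return p_value, "CONTINUE"  # not yet significant
    return p_value, "SHIP IT" if p_b > p_a else "DON'T SHIP"

# 10,000 users per variant, 10.0% control vs 11.2% treatment conversion:
p, decision = ab_test(10_000, 10.0, 10_000, 11.2)
print(round(p, 4), decision)
```

The same inputs with only 1,000 users per variant would not reach significance, which is why the Pro Tips call for 1000+ users per variant as a floor.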
Pro Tips
  • Start with conservative splits (90/10) and ramp up gradually
  • Need 1000+ users per variant for statistical significance
  • A lower MDE requires a larger sample size to detect
  • Always have auto-rollback enabled for production experiments
  • Watch the traffic bar update as you change splits!
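To see why a lower MDE needs more users, here is the textbook per-variant sample-size estimate for a two-proportion test (the 10% baseline conversion rate is an assumption for illustration, and MDE is taken in absolute percentage points):

```python
# Per-variant sample size for detecting an absolute lift of mde_pp over
# base_rate at the given alpha and power (standard two-proportion formula).
from math import ceil

def sample_size(base_rate, mde_pp, alpha=0.05, power=0.80):
    """base_rate, mde_pp as fractions (0.10 = 10%, 0.03 = 3 points)."""
    z_alpha = 1.96 if alpha == 0.05 else 2.576  # two-sided 5% / 1%
    z_beta = 0.84 if power == 0.80 else 1.28    # 80% / 90% power
    p1, p2 = base_rate, base_rate + mde_pp
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2)

# Users per variant at a 10% baseline, 3-point MDE, 80% power:
print(sample_size(0.10, 0.03))
```

At these settings the answer is in the low thousands per variant, consistent with the 1000+ heuristic above; halving the MDE roughly quadruples the requirement.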
Optimal Configuration (100% Score)
  • Traffic Split: 80/20 or 90/10 (must sum to 100)
  • Ramp: 5%→25%→50%→100% (progressive)
  • MDE: 3% or 5%
  • Power: 80% or 90%
  • Users: 1000+ per variant
  • Latency: 500ms or 1000ms
  • Error Rate: 0.5% or 1%
  • Rollback: any-guardrail
Decision Framework
  • SHIP IT: Treatment > Control AND p-value < 0.05
  • DON'T SHIP: Treatment < Control AND p-value < 0.05
  • CONTINUE: p-value > 0.05 (not yet significant)
  • ROLLBACK: If any guardrail is breached