AI & Machine Learning Labs

Master evaluation pipelines, governance policies, and A/B experimentation through specialized tool interfaces.

GenAI Expert Labs - Module 9

Specialized interfaces: CLI, IDE, and Analytics Dashboard.

Lab 25: Evaluation CLI Studio
CLI Tool / Expert
Scenario: AutoEval Command-Line Tool
QualityAI uses a terminal-based evaluation CLI to benchmark LLMs. Use the llm-eval command-line interface to configure dataset parameters, metric weights, and quality gates, and to run benchmarks. Master the CLI workflow used in real ML engineering environments.

Learning Objectives:

  • CLI Workflows: Execute evaluation commands
  • Configuration: Set dataset/metric parameters
  • Quality Gates: Configure pass/fail thresholds
  • Benchmarking: Compare against baselines
llm-eval — evaluation-pipeline — bash
qualityai@eval-server:~$ llm-eval --version
llm-eval v2.4.1 (ML Evaluation Framework)
qualityai@eval-server:~$ llm-eval init --config interactive
✓ Initializing evaluation pipeline configuration...

The interactive session prompts for 14 parameters in four groups:

  • Dataset Configuration: --dataset-size, --split-ratio, --validation
  • Metric Weights (sum = 1.0): --w-accuracy, --w-relevance, --w-groundedness, --w-coherence
  • Measured Scores (%): --score-accuracy, --score-relevance, --score-groundedness, --score-coherence
  • Quality Gates: --threshold, --baseline, --significance
qualityai@eval-server:~$ llm-eval run --execute

Lab 26: Policy Editor Studio
IDE / Expert
Scenario: AI Governance IDE
HealthTech Corp uses a specialized IDE to manage AI governance policies. Use the Policy Editor to configure data protection, content safety, audit logging, and compliance frameworks. Edit the governance.yaml policy file for HIPAA compliance.

Learning Objectives:

  • Policy Config: Edit YAML governance policies
  • Data Protection: Configure PII/PHI rules
  • Content Safety: Set up filters/escalation
  • Compliance: HIPAA/GDPR frameworks
The Policy Editor opens with an Explorer showing policies/ (governance.yaml, safety.yaml, audit.yaml), schemas/, and README.md, with governance.yaml open in the active tab:
# AI Governance Policy
version: "2.0"

data_protection:
  pii_strategy: null
  phi_handling: null
  retention_days: null

content_safety:
  medical_filter: null
  selfharm_filter: null
  rx_filter: null

audit:
  log_level: null
  storage: null
  access_tracking: null

human_review:
  hitl_trigger: null
  escalation: null
  sla_minutes: null

compliance:
  framework: null
  data_residency: null
The Properties panel on the right groups the 14 editable policies into five sections: Data Protection, Content Safety, Audit Settings, Human Review, and Compliance.

Lab 27: Experiment Analytics Console
Dashboard / Expert
Scenario: A/B Testing Platform
ProductLabs uses an analytics dashboard to manage A/B experiments. Use the Experiment Console to configure traffic allocation and success metrics, monitor guardrails, and analyze results. Make data-driven rollout decisions.

Learning Objectives:

  • Traffic Split: Control vs treatment allocation
  • Metrics: Define success metrics and MDE
  • Guardrails: Set latency/error limits
  • Analysis: Statistical significance
The console has Setup, Results, and History tabs, with live tiles for Total Users, Conversion, Power (target: 80%+), and experiment Status. The Setup tab exposes 13 settings:

  • Traffic Allocation: Control Split (%), Treatment Split (%), Ramp Schedule
  • Primary Metric: Control vs Treatment
  • Statistical Settings: MDE (%), Power
  • Observed Results: Users and Conv (%) for CONTROL (A) and TREATMENT (B)
  • Guardrails: Latency P99 (ms), Error Rate (%), Rollback Trigger

Lab 25: Evaluation CLI

Objective

You are a QualityAI ML Engineer configuring an LLM evaluation pipeline using the llm-eval CLI tool. Configure all 14 parameters to benchmark your model against production baselines.

Step-by-Step Instructions

  1. Step 1: Dataset Configuration
    • Select --dataset-size: Choose 5000 or 10000 for reliable results
    • Select --split-ratio: Use 80-20 (industry standard train/test split)
    • Select --validation: Choose "kfold" for robust cross-validation
  2. Step 2: Configure Metric Weights
    • Set all 4 weights (accuracy, relevance, groundedness, coherence)
    • IMPORTANT: Weights must sum to exactly 1.0
    • For balanced evaluation, use 0.25 for each metric
  3. Step 3: Enter Measured Scores
    • Input the scores from your evaluation run for each metric
    • Scores are percentages (75-95 typical range)
    • Weighted score = Σ(weight × score) for all metrics
  4. Step 4: Set Quality Gates
    • Set --threshold: 80% is standard pass threshold
    • Select --baseline: "previous" to compare against production
    • Set --significance: 0.05 for standard p-value threshold
  5. Step 5: Execute & Review
    • Click "Execute" to validate your configuration
    • Click "Preview" to see the evaluation dashboard
    • Review weighted score and quality gate status
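The scoring in Steps 2-4 can be sketched in a few lines (my reconstruction of the arithmetic described above, not the actual llm-eval source):

```python
# Sketch of the weighted-score and quality-gate logic: weights must sum to
# 1.0, weighted score = sum(weight * score), pass if score >= threshold.

def weighted_score(weights: dict, scores: dict, threshold: float = 80.0):
    """Return (score, passed). Raises if weights don't sum to 1.0."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {sum(weights.values())}")
    total = sum(weights[m] * scores[m] for m in weights)
    return total, total >= threshold

# Balanced weights (0.25 each) with scores in the typical 75-95% range:
weights = {"accuracy": 0.25, "relevance": 0.25, "groundedness": 0.25, "coherence": 0.25}
scores  = {"accuracy": 88.0, "relevance": 85.0, "groundedness": 90.0, "coherence": 83.0}

total, passed = weighted_score(weights, scores)
print(total, passed)  # 86.5 True
```

With the mistaken weights from "Common Mistakes" below (0.3 + 0.3 + 0.3 + 0.2 = 1.1), the same function raises a ValueError instead of returning a score.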
Pro Tips
  • Larger datasets (10000+) give more statistically reliable results
  • K-fold validation reduces variance compared to holdout
  • Use stricter significance (0.01) for high-stakes deployments
  • Always compare against previous production model as baseline
Optimal Configuration (100% Score)
  • Dataset: 5000 or 10000 samples
  • Split: 80-20
  • Validation: kfold
  • Weights: 0.25 each (sum = 1.0)
  • Threshold: 80 or 85
  • Baseline: previous
  • Significance: 0.05
Common Mistakes
  • Weights not summing to 1.0 (e.g., 0.3 + 0.3 + 0.3 + 0.2 = 1.1)
  • Using a dataset that is too small (1000 samples), which reduces statistical power
  • Forgetting to set all 4 measured scores
  • Not selecting a baseline for comparison

Lab 26: Policy Editor

Objective

You are a HealthTech Corp compliance engineer. Use the Policy Editor IDE to configure AI governance policies in the governance.yaml file. Ensure HIPAA compliance for healthcare AI applications.

Step-by-Step Instructions

  1. Step 1: Data Protection Settings
    • Set pii_strategy: Use "tokenize" for reversible protection
    • Set phi_handling: Use "never-store" for HIPAA compliance
    • Set retention_days: Use 2555 (7 years) for HIPAA requirement
  2. Step 2: Content Safety Filters
    • Set medical_filter: Use "block" to prevent misinformation
    • Set selfharm_filter: Use "block-escalate" for immediate action
    • Set rx_filter: Use "block-refer" to redirect to doctors
  3. Step 3: Audit Configuration
    • Set log_level: Use "comprehensive" for full audit trail
    • Set storage: Use "immutable" (WORM) for tamper-proof logs
    • Set access_tracking: Use "full" for complete access logs
  4. Step 4: Human Review Settings
    • Set hitl_trigger: Use "medical-advice" for clinical review
    • Set escalation: Use "clinical" or "tiered" for proper routing
    • Set sla_minutes: Use 15 or 60 for response time
  5. Step 5: Compliance Framework
    • Set framework: Use "hipaa" or "both" for HIPAA+GDPR
    • Set data_residency: Use "us" for HIPAA, "eu" for GDPR
  6. Step 6: Save & Validate
    • Watch the YAML code update as you change properties
    • Click "Save & Validate" to check your configuration
    • Click "Preview" to see governance dashboard
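The checks behind "Save & Validate" can be approximated as follows (assumed logic for illustration; the IDE's actual validator may differ):

```python
# Illustrative validator: every policy field must be set, and the HIPAA
# rules from this lab must hold (PHI never stored in plain text, 7-year
# retention). Not the IDE's real implementation.

HIPAA_MIN_RETENTION_DAYS = 2555  # 7 years

def validate(policy: dict) -> list:
    """Return a list of violation messages (empty list means valid)."""
    errors = []
    for section, fields in policy.items():
        for key, value in fields.items():
            if value is None:
                errors.append(f"{section}.{key} is not set")
    dp = policy.get("data_protection", {})
    if dp.get("phi_handling") not in ("never-store", "tokenized"):
        errors.append("phi_handling must be never-store (or tokenized) for HIPAA")
    if (dp.get("retention_days") or 0) < HIPAA_MIN_RETENTION_DAYS:
        errors.append("retention_days below the 7-year HIPAA minimum")
    return errors

policy = {
    "data_protection": {"pii_strategy": "tokenize", "phi_handling": "never-store",
                        "retention_days": 2555},
    "audit": {"log_level": "comprehensive", "storage": "immutable",
              "access_tracking": "full"},
}
print(validate(policy))  # prints [] (no violations)
```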
Pro Tips
  • Use the Properties panel on the right to edit values; the YAML updates live!
  • Click different files in the Explorer to see file structure (read-only)
  • PHI must NEVER be stored in plain text for HIPAA compliance
  • Immutable logs prevent tampering and satisfy audit requirements
Optimal HIPAA Configuration (100% Score)
  • pii_strategy: tokenize
  • phi_handling: never-store
  • retention_days: 2555 (7 years)
  • medical_filter: block
  • selfharm_filter: block-escalate
  • rx_filter: block-refer
  • log_level: comprehensive
  • storage: immutable
  • framework: hipaa or both
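Put together, a governance.yaml that satisfies the checklist above (the remaining fields, access_tracking, hitl_trigger, escalation, sla_minutes, and data_residency, are filled with values from Steps 3-5) looks like:

```yaml
# AI Governance Policy (one configuration scoring 100%)
version: "2.0"

data_protection:
  pii_strategy: tokenize
  phi_handling: never-store
  retention_days: 2555        # 7 years, HIPAA minimum

content_safety:
  medical_filter: block
  selfharm_filter: block-escalate
  rx_filter: block-refer

audit:
  log_level: comprehensive
  storage: immutable          # WORM, tamper-proof
  access_tracking: full

human_review:
  hitl_trigger: medical-advice
  escalation: clinical
  sla_minutes: 15

compliance:
  framework: hipaa
  data_residency: us
```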
HIPAA Compliance Requirements
  • PHI must use "never-store" or tokenized storage; encryption alone is insufficient
  • Minimum 7-year retention (2555 days) for audit logs
  • Comprehensive logging required for all PHI access
  • Self-harm content must trigger immediate escalation

Lab 27: Experiment Console

Objective

You are a ProductLabs data scientist setting up an A/B experiment. Configure traffic allocation, define success metrics, set guardrails, and analyze results to make a ship/no-ship decision.

Step-by-Step Instructions

  1. Step 1: Traffic Allocation
    • Set Control Split: Start high (80-95%) for safety
    • Set Treatment Split: Start low (5-20%) initially
    • IMPORTANT: Control + Treatment must equal 100%
    • Set Ramp Schedule: Use progressive (5%→25%→50%→100%)
  2. Step 2: Define Success Metric
    • Select Primary Metric: Choose your key success indicator
    • Set MDE: 3-5% is standard minimum detectable effect
    • Set Power: 80% minimum, 90% recommended
  3. Step 3: Enter Observed Results
    • Enter Control Users: Number of users in control group
    • Enter Control Conv %: Conversion rate for control
    • Enter Treatment Users: Number of users in treatment
    • Enter Treatment Conv %: Conversion rate for treatment
    • Watch metrics update as you enter values!
  4. Step 4: Configure Guardrails
    • Set Latency P99: 500ms is standard threshold
    • Set Error Rate: 0.5% or 1% maximum
    • Set Rollback Trigger: Use "any-guardrail" for safety
  5. Step 5: Launch & Analyze
    • Click "Launch" to validate your experiment setup
    • Click "Analysis" to see detailed results dashboard
    • Review p-value and recommendation
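The p-value behind the ship/no-ship recommendation is conventionally a two-proportion z-test; a minimal sketch (my assumption about the platform's method, which may differ):

```python
# Two-proportion z-test on conversion rates, applying the decision
# framework from this lab (SHIP IT / DON'T SHIP / CONTINUE).
from math import erf, sqrt

def ab_test(users_a, conv_a, users_b, conv_b):
    """conv_* are conversion rates in percent; returns (p_value, decision)."""
    p_a, p_b = conv_a / 100, conv_b / 100
    pooled = (p_a * users_a + p_b * users_b) / (users_a + users_b)
    se = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    if p_value >= 0.05:
        return p_value, "CONTINUE"  # not yet significant
    return p_value, "SHIP IT" if p_b > p_a else "DON'T SHIP"

# 10,000 users per variant, 10.0% control vs 11.2% treatment conversion:
p, decision = ab_test(10_000, 10.0, 10_000, 11.2)
print(round(p, 4), decision)
```

The same inputs with only 1,000 users per variant would not reach significance, which is why the Pro Tips call for 1000+ users per variant as a floor.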
Pro Tips
  • Start with conservative splits (90/10) and ramp up gradually
  • Need 1000+ users per variant for statistical significance
  • A lower MDE requires a larger sample size to detect
  • Always have auto-rollback enabled for production experiments
  • Watch the traffic bar update as you change splits!
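To see why a lower MDE needs more users, here is the textbook per-variant sample-size estimate for a two-proportion test (the 10% baseline conversion rate is an assumption for illustration, and MDE is taken in absolute percentage points):

```python
# Per-variant sample size for detecting an absolute lift of mde_pp over
# base_rate at the given alpha and power (standard two-proportion formula).
from math import ceil

def sample_size(base_rate, mde_pp, alpha=0.05, power=0.80):
    """base_rate, mde_pp as fractions (0.10 = 10%, 0.03 = 3 points)."""
    z_alpha = 1.96 if alpha == 0.05 else 2.576  # two-sided 5% / 1%
    z_beta = 0.84 if power == 0.80 else 1.28    # 80% / 90% power
    p1, p2 = base_rate, base_rate + mde_pp
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2)

# Users per variant at a 10% baseline, 3-point MDE, 80% power:
print(sample_size(0.10, 0.03))
```

At these settings the answer is in the low thousands per variant, consistent with the 1000+ heuristic above; halving the MDE roughly quadruples the requirement.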
Optimal Configuration (100% Score)
  • Traffic Split: 80/20 or 90/10 (must sum to 100)
  • Ramp: 5%→25%→50%→100% (progressive)
  • MDE: 3% or 5%
  • Power: 80% or 90%
  • Users: 1000+ per variant
  • Latency: 500ms or 1000ms
  • Error Rate: 0.5% or 1%
  • Rollback: any-guardrail
Decision Framework
  • SHIP IT: Treatment > Control AND p-value < 0.05
  • DON'T SHIP: Treatment < Control AND p-value < 0.05
  • CONTINUE: p-value > 0.05 (not yet significant)
  • ROLLBACK: If any guardrail is breached