Master evaluation pipelines, governance policies, and A/B experimentation through specialized tool interfaces.
Specialized interfaces: CLI, IDE, and Analytics Dashboard.
Use the llm-eval command-line interface to configure dataset parameters, metric weights, and quality gates, and to run benchmarks. Master the CLI workflow used in real ML engineering environments.
Configure the governance.yaml policy file for HIPAA compliance.
You are a QualityAI ML Engineer configuring an LLM evaluation pipeline using the llm-eval CLI tool. Configure all 14 parameters to benchmark your model against production baselines.
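As a rough illustration of the workflow described here, the sketch below assembles an llm-eval invocation from the recommended values in this scenario's hints and rejects anything outside them. It covers only the six flags the hints name, not all 14 parameters. The flag names come from the hints; the value formats (for example "80-20" and "80"), the absence of a subcommand, and the helper name build_command are assumptions made for illustration.

```python
import shlex

# Recommended values taken from the scenario hints. The exact value formats
# llm-eval accepts (e.g. "80-20" rather than "0.8") are assumptions.
RECOMMENDED = {
    "dataset-size": {"5000", "10000"},
    "split-ratio": {"80-20"},
    "validation": {"kfold"},
    "threshold": {"80"},
    "baseline": {"previous"},
    "significance": {"0.05"},
}


def build_command(settings: dict[str, str]) -> list[str]:
    """Return an argv list for llm-eval, rejecting values outside the hints."""
    argv = ["llm-eval"]
    for flag, allowed in RECOMMENDED.items():
        value = settings.get(flag)
        if value not in allowed:
            raise ValueError(
                f"--{flag}: expected one of {sorted(allowed)}, got {value!r}"
            )
        argv += [f"--{flag}", value]
    return argv


settings = {
    "dataset-size": "10000",
    "split-ratio": "80-20",
    "validation": "kfold",
    "threshold": "80",
    "baseline": "previous",
    "significance": "0.05",
}
command = build_command(settings)
print(shlex.join(command))
```

Building the command as a list and joining it only for display keeps quoting correct if a value ever contains spaces.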
--dataset-size: Choose 5000 or 10000 for reliable results
--split-ratio: Use 80-20 (industry standard train/test split)
--validation: Choose "kfold" for robust cross-validation
--threshold: 80% is standard pass threshold
--baseline: "previous" to compare against production
--significance: 0.05 for standard p-value threshold
You are a HealthTech Corp compliance engineer. Use the Policy Editor IDE to configure AI governance policies in the governance.yaml file. Ensure HIPAA compliance for healthcare AI applications.
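One way to sanity-check a governance policy before saving it is to compare each field against the values this scenario recommends. The sketch below does that in Python and renders the result as flat "key: value" lines. The field names and values come from the scenario hints; the flat layout, the cross-field residency rule, and the helper names check_policy and render_yaml are assumptions, since the real structure of governance.yaml is not shown.

```python
# Recommended values from the scenario hints. A flat key layout is assumed
# because the actual structure of governance.yaml is not shown.
RECOMMENDED = {
    "pii_strategy": {"tokenize"},
    "phi_handling": {"never-store"},
    "retention_days": {2555},
    "medical_filter": {"block"},
    "selfharm_filter": {"block-escalate"},
    "rx_filter": {"block-refer"},
    "log_level": {"comprehensive"},
    "storage": {"immutable"},
    "access_tracking": {"full"},
    "hitl_trigger": {"medical-advice"},
    "escalation": {"clinical", "tiered"},
    "sla_minutes": {15, 60},
    "framework": {"hipaa", "both"},
    "data_residency": {"us", "eu"},
}


def check_policy(policy: dict) -> list[str]:
    """Return a list of problems; an empty list means every hint is met."""
    problems = []
    for key, allowed in RECOMMENDED.items():
        if policy.get(key) not in allowed:
            problems.append(f"{key}: got {policy.get(key)!r}")
    # Assumed cross-field rule, based on the hint '"us" for HIPAA'.
    if policy.get("framework") == "hipaa" and policy.get("data_residency") != "us":
        problems.append("data_residency: HIPAA-only policies should use 'us'")
    return problems


def render_yaml(policy: dict) -> str:
    """Render the policy as flat 'key: value' lines (valid YAML scalars)."""
    return "\n".join(f"{key}: {value}" for key, value in policy.items())


policy = {
    "pii_strategy": "tokenize",
    "phi_handling": "never-store",
    "retention_days": 2555,
    "medical_filter": "block",
    "selfharm_filter": "block-escalate",
    "rx_filter": "block-refer",
    "log_level": "comprehensive",
    "storage": "immutable",
    "access_tracking": "full",
    "hitl_trigger": "medical-advice",
    "escalation": "clinical",
    "sla_minutes": 15,
    "framework": "hipaa",
    "data_residency": "us",
}
problems = check_policy(policy)
print(render_yaml(policy) if not problems else "\n".join(problems))
```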
pii_strategy: Use "tokenize" for reversible protection
phi_handling: Use "never-store" for HIPAA compliance
retention_days: Use 2555 (7 years) for HIPAA requirement
medical_filter: Use "block" to prevent misinformation
selfharm_filter: Use "block-escalate" for immediate action
rx_filter: Use "block-refer" to redirect to doctors
log_level: Use "comprehensive" for full audit trail
storage: Use "immutable" (WORM) for tamper-proof logs
access_tracking: Use "full" for complete access logs
hitl_trigger: Use "medical-advice" for clinical review
escalation: Use "clinical" or "tiered" for proper routing
sla_minutes: Use 15 or 60 for response time
framework: Use "hipaa" or "both" for HIPAA+GDPR
data_residency: Use "us" for HIPAA, "eu" for GDPR
You are a ProductLabs data scientist setting up an A/B experiment. Configure traffic allocation, define success metrics, set guardrails, and analyze results to make a ship/no-ship decision.
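Before allocating traffic, it helps to estimate how many users each arm needs for the chosen MDE and power. The sketch below uses the standard normal-approximation formula for comparing two proportions. It assumes the MDE is a relative lift over the baseline conversion rate, a two-sided significance level of 0.05, and a hypothetical 10% baseline; none of these are stated in the scenario.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline: float, relative_mde: float,
                  power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate users needed per arm to detect a relative lift.

    Normal-approximation formula for a two-sided test of two proportions:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # assumed: MDE is a relative lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 10% baseline conversion rate with a 5% relative MDE.
n_80 = users_per_arm(baseline=0.10, relative_mde=0.05, power=0.80)
n_90 = users_per_arm(baseline=0.10, relative_mde=0.05, power=0.90)
print(f"80% power: {n_80} users per arm")
print(f"90% power: {n_90} users per arm")
```

Raising power from 80% to 90%, or shrinking the MDE, increases the required sample, which is why a small initial treatment split can take a long time to reach a conclusive result.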
Control Split: Start high (80-95%) for safety
Treatment Split: Start low (5-20%) initially
Ramp Schedule: Use progressive (5%→25%→50%→100%)
Primary Metric: Choose your key success indicator
MDE: 3-5% is standard minimum detectable effect
Power: 80% minimum, 90% recommended
Control Users: Number of users in control group
Control Conv %: Conversion rate for control
Treatment Users: Number of users in treatment
Treatment Conv %: Conversion rate for treatment
Latency P99: 500ms is standard threshold
Error Rate: 0.5% or 1% maximum
Rollback Trigger: Use "any-guardrail" for safety
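The ship/no-ship decision can be sketched as a guardrail check followed by a significance test on the conversion rates. The example below uses a pooled two-proportion z-test and treats any breached guardrail as a rollback, mirroring the "any-guardrail" trigger. The thresholds (500 ms, 1%) come from the hints above; the 0.05 significance level, the function names, and the sample numbers are assumptions for illustration, not the dashboard's actual logic.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(control_users: int, control_conv: float,
                           treatment_users: int, treatment_conv: float) -> float:
    """Two-sided p-value from a pooled two-proportion z-test.

    Conversion rates are fractions (0.10 means 10%).
    """
    pooled = ((control_users * control_conv + treatment_users * treatment_conv)
              / (control_users + treatment_users))
    se = sqrt(pooled * (1 - pooled) * (1 / control_users + 1 / treatment_users))
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (treatment_conv - control_conv) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(control_users: int, control_conv: float,
           treatment_users: int, treatment_conv: float,
           latency_p99_ms: float, error_rate: float,
           alpha: float = 0.05, max_latency_ms: float = 500,
           max_error_rate: float = 0.01) -> str:
    """Return 'rollback', 'ship', or 'no-ship'."""
    # "any-guardrail": a single breached guardrail is enough to roll back.
    if latency_p99_ms > max_latency_ms or error_rate > max_error_rate:
        return "rollback"
    p_value = two_proportion_p_value(control_users, control_conv,
                                     treatment_users, treatment_conv)
    if p_value < alpha and treatment_conv > control_conv:
        return "ship"
    return "no-ship"


# Hypothetical results: 10,000 users per arm, 10.0% vs 11.5% conversion.
print(decide(10_000, 0.100, 10_000, 0.115, latency_p99_ms=420, error_rate=0.004))
```

Checking guardrails before significance means a statistically significant win cannot override a latency or error-rate regression.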