-
Step 1: Setup Feature Store
A feature store centralizes ML features so the same feature values and transformations are used for both training and inference, preventing training/serving skew.
Configuration:
• Store Name: Identifier for the store (e.g., "customer_features", "fraud_detection_store")
• Feature Group: Logical grouping of related features (e.g., "user_behavior", "transaction_patterns")
• Storage Backend: Choose your feature store platform (Feast/Tecton/Hopsworks)
• Serving Mode: Online (real-time inference), Offline (batch), or Both
• Versioning: ⚠️ MUST be enabled to track feature changes
💡 Tip: Use "Both" serving mode if you need real-time predictions AND batch training.
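The configuration above can be sketched as a small validated config object. This is a hypothetical illustration (the class and field names are not a real Feast/Tecton/Hopsworks API); it just shows the fields and the two rules the step calls out: versioning must be on, and serving mode is one of three values.

```python
from dataclasses import dataclass


@dataclass
class FeatureStoreConfig:
    # Hypothetical config object mirroring the bullets above; real feature
    # store platforms (Feast, Tecton, Hopsworks) have their own config formats.
    store_name: str
    feature_group: str
    backend: str          # "feast", "tecton", or "hopsworks"
    serving_mode: str     # "online", "offline", or "both"
    versioning: bool = True

    def __post_init__(self):
        # Enforce the two rules from the checklist above.
        if not self.versioning:
            raise ValueError("Versioning MUST be enabled to track feature changes")
        if self.serving_mode not in {"online", "offline", "both"}:
            raise ValueError(f"Unknown serving mode: {self.serving_mode}")


cfg = FeatureStoreConfig(
    store_name="customer_features",
    feature_group="user_behavior",
    backend="feast",
    serving_mode="both",   # real-time predictions AND batch training
)
print(cfg.store_name)
```

Validating the config at construction time catches a disabled-versioning mistake before any features are written.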
-
Step 2: Define Training Pipeline
Create an automated pipeline that handles data ingestion, preprocessing, training, and validation.
Configuration:
• Pipeline Name: Descriptive name (e.g., "churn_prediction", "fraud_classifier")
• Orchestrator: Choose workflow engine (Kubeflow/MLflow/Airflow)
• Training Framework: Scikit-learn (tabular), TensorFlow/PyTorch (deep learning)
• Hyperparameter Tuning: Grid Search, Random Search, or Bayesian optimization
💡 Tip: Bayesian optimization is most efficient for complex hyperparameter spaces.
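To make the tuning options concrete, here is a minimal random-search sketch in plain Python. The search space and toy objective are invented for illustration; in a real pipeline the objective would be a cross-validated score from your training framework (e.g., scikit-learn's RandomizedSearchCV does this for you).

```python
import random


def random_search(objective, space, n_iter=20, seed=0):
    # Random search: sample hyperparameter combinations, keep the best score.
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy search space and objective standing in for cross-validated accuracy.
space = {"max_depth": [3, 5, 7, 9], "n_estimators": [50, 100, 200]}
score = lambda p: -abs(p["max_depth"] - 7) - abs(p["n_estimators"] - 100) / 100

best, _ = random_search(score, space)
print(best)
```

Grid search would enumerate every combination instead of sampling; Bayesian optimization replaces the random sampler with a model of the objective, which is why it needs fewer trials in large spaces.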
-
Step 3: Configure Model Registry
A model registry stores trained models with versioning, metadata, and lifecycle management.
Configuration:
• Registry Tool: Where models are stored (MLflow/Neptune.ai/Weights & Biases)
• Model Version: Semantic versioning (e.g., "1.0.0")
• Stage: Staging (testing) → Production (live) → Archived (deprecated)
• Approval Workflow: ⚠️ MUST be enabled for production safety
💡 Tip: Always use Staging before Production. Never skip the approval step!
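The stage lifecycle and approval gate above can be sketched as a minimal in-memory registry. This is purely illustrative (real registries like MLflow or Neptune.ai implement this with their own APIs); the point is the invariant: a model cannot leave Staging without approval.

```python
class ModelRegistry:
    # Minimal illustrative registry enforcing the Staging -> Production -> Archived
    # lifecycle with an approval gate; not a real MLflow/Neptune.ai API.
    STAGES = ["Staging", "Production", "Archived"]

    def __init__(self):
        self._models = {}  # (name, version) -> {"stage": ..., "approved": ...}

    def register(self, name, version):
        # New models always enter at Staging, unapproved.
        self._models[(name, version)] = {"stage": "Staging", "approved": False}

    def approve(self, name, version):
        self._models[(name, version)]["approved"] = True

    def promote(self, name, version):
        entry = self._models[(name, version)]
        if entry["stage"] == "Staging" and not entry["approved"]:
            raise PermissionError("Approval required before promoting to Production")
        idx = self.STAGES.index(entry["stage"])
        entry["stage"] = self.STAGES[min(idx + 1, len(self.STAGES) - 1)]
        return entry["stage"]


registry = ModelRegistry()
registry.register("churn_prediction", "1.0.0")
registry.approve("churn_prediction", "1.0.0")
print(registry.promote("churn_prediction", "1.0.0"))  # Production
```

Encoding the approval check inside `promote` (rather than trusting callers to remember it) is what "approval workflow MUST be enabled" means in practice.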
-
Step 4: Deployment Strategy
Configure how models are released to production with safety mechanisms for rollback.
Configuration:
• Deployment Type: Canary (gradual 10%→100%), Blue-Green (instant switch), Shadow (parallel)
• Traffic Split: % of traffic to new model (0-100)
• Rollback Threshold: Error % that triggers automatic rollback (0-100)
• Serving Platform: Seldon Core, KServe, or SageMaker
💡 Tip: Start with 10% traffic split for canary. Set rollback threshold at 5% for safety.
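The canary-with-rollback mechanic above can be sketched in a few lines. The function name and step size are illustrative assumptions, not any serving platform's API; Seldon Core, KServe, and SageMaker each expose traffic splitting and rollback through their own configuration.

```python
def canary_step(traffic_pct, error_rate_pct, rollback_threshold_pct=5.0, step=10):
    # One evaluation cycle of a canary rollout: if the new model's error rate
    # exceeds the threshold, roll back to 0% traffic; otherwise ramp up.
    if error_rate_pct > rollback_threshold_pct:
        return 0  # automatic rollback
    return min(traffic_pct + step, 100)


# Start at the recommended 10% split and ramp while errors stay low.
traffic = 10
for observed_error_pct in [1.2, 0.8, 2.0]:
    traffic = canary_step(traffic, observed_error_pct)
print(traffic)  # 40
```

Blue-Green would be the degenerate case of jumping straight from 0% to 100%; Shadow sends the new model a copy of traffic without serving its responses at all.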
-
Step 5: Model Monitoring
Set up continuous monitoring to detect when model performance degrades in production.
Configuration:
• Monitoring Metrics: Check ALL: Data Drift, Concept Drift, Performance
• Alert Threshold: PSI/drift score that triggers alert (e.g., 0.15)
• Retraining Trigger: When to automatically retrain (On Drift/On Performance/Scheduled)
💡 Tip: A PSI alert threshold of 0.15 is a common rule of thumb (PSI above ~0.25 typically signals significant drift). Check ALL monitoring metrics!
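To show what a PSI-based drift alert actually computes, here is a compact Population Stability Index implementation over binned distributions. The bin count and the synthetic baseline/drifted samples are illustrative choices, and monitoring tools may bin differently (e.g., by quantiles).

```python
import numpy as np


def psi(expected, actual, bins=10, eps=1e-6):
    # Population Stability Index between a baseline (training) sample and a
    # production sample: sum over bins of (a% - e%) * ln(a% / e%).
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
drifted = rng.normal(0.5, 1.0, 10_000)  # mean shift simulating data drift

print(psi(baseline, baseline))          # ~0: identical distributions
print(psi(baseline, drifted) > 0.15)    # drift crosses the alert threshold
```

An alerting job would run this per feature against the training-time baseline and fire (or trigger retraining) whenever the score exceeds the configured threshold.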
-
Step 6: Governance & Compliance
Ensure models are explainable, fair, and compliant with regulations.
Configuration:
• Explainability Tool: How to interpret predictions (SHAP/LIME/ELI5)
• Bias Detection: Fairlearn, AIF360, or What-If Tool
• Compliance Framework: GDPR (EU), CCPA (California), HIPAA (Healthcare)
• Audit Trail: ⚠️ MUST be enabled for regulatory compliance
💡 Tip: SHAP is the most widely adopted explainability method. Always enable the audit trail!
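As a concrete taste of bias detection, here is a plain-Python demographic parity check: the gap in positive-prediction rates between groups. This is a simplified illustration of one fairness metric; libraries like Fairlearn and AIF360 provide this and many others with proper statistical support.

```python
def demographic_parity_diff(preds, groups):
    # Gap between the highest and lowest positive-prediction rate across
    # groups; 0.0 means all groups receive positive predictions equally often.
    rates = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        rates[g] = sum(preds[i] for i in idx) / len(idx)
    return max(rates.values()) - min(rates.values())


# Toy example: group A gets positive predictions 75% of the time, group B 25%.
preds = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_diff(preds, groups))  # 0.5
```

Logging this metric per release alongside SHAP explanations gives the audit trail concrete, reviewable evidence for compliance reviews.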