Data Analytics Advanced Labs

Master data mesh architecture, MLOps pipelines, and comprehensive data governance. Build decentralized data platforms with automated ML deployment and enterprise compliance.

Data Mesh & Governance - Module 6

Advanced architecture labs covering data mesh design, ML operations, and governance frameworks.

Lab 16: Data Mesh Architecture
Architecture / Advanced
Scenario: Decentralized Data Platform
GlobalCorp is transitioning from a centralized data warehouse to a federated data mesh architecture. You'll design domain-oriented data products, implement self-serve infrastructure, establish data contracts, and set up a federated governance model. Each domain team will own their data products with standardized quality, security, and discovery mechanisms.

Learning Objectives:

  • Domain Design: Create domain-bounded data products
  • Self-Serve Platform: Configure infrastructure as code
  • Data Contracts: Define schema and SLAs
  • Federated Governance: Implement policies and standards

📋 Step-by-Step Instructions

  1. Step 1: Define Data Domain
    Create a domain-bounded context with clear ownership. A domain represents a business area that owns its data end-to-end.
    Configuration:
    • Domain Name: Enter a unique identifier (e.g., "sales-analytics", "customer-360")
    • Owner Team: The team responsible (e.g., "Sales Engineering Team")
    • Business Capability: Select the business function this domain serves
    • Data Sources: List source systems separated by commas (e.g., "crm, erp, salesforce")
    💡 Tip: Domain names should be descriptive and follow the kebab-case naming convention.
  2. Step 2: Create Data Product
    Design a data product that consumers can use. Products should be self-describing, discoverable, and trustworthy.
    Configuration:
    • Product Name: Descriptive name (e.g., "customer_revenue_metrics")
    • Description: What this product provides and its use cases
    • Output Format: Choose how consumers access data (API/Table/Stream)
    • Update Frequency: How often data refreshes (Real-time/Hourly/Daily)
    💡 Tip: Real-time is for event-driven systems; Daily is typical for analytics.
  3. Step 3: Define Data Contract
    Establish formal agreements between data producers and consumers with quality and availability guarantees.
    Configuration:
    • Schema Version: Use semantic versioning (e.g., "1.0.0", "2.1.3")
    • Quality Threshold: Minimum data quality % (must be ≥80%)
    • SLA Uptime: Availability guarantee % (must be ≥90%)
    • Backward Compatibility: ✓ MUST be checked for production contracts
    💡 Tip: Industry standard is 95%+ quality and 99.9% uptime for critical data.
  4. Step 4: Configure Self-Serve Platform
    Set up infrastructure that enables domain teams to deploy data products independently without central IT bottlenecks.
    Configuration:
    • Compute Resources: Select processing engines (Spark, Airflow, dbt) - at least one required
    • Storage Type: Where data lives (S3/Warehouse/Lakehouse)
    • IaC Tool: Infrastructure automation (Terraform/Pulumi/CloudFormation)
    💡 Tip: Lakehouse combines the best of data lakes and warehouses. Terraform is the most widely adopted IaC tool.
  5. Step 5: Implement Governance Policies
    Define global policies enforced across all domains for security, privacy, and compliance consistency.
    Configuration:
    • Data Classification: Tag sensitivity levels (Public/Internal/Confidential) - select at least one
    • Retention Policy: Days to keep data (e.g., 365 for 1 year)
    • Access Control: RBAC (role-based), ABAC (attribute-based), or DAC (discretionary)
    💡 Tip: RBAC is simplest; ABAC offers fine-grained control for complex orgs.
  6. Step 6: Enable Data Discovery
    Make data products findable through a searchable catalog with metadata, lineage, and ownership info.
    Configuration:
    • Catalog Tool: Choose your metadata platform (Google Data Catalog/AWS Glue/Azure Purview)
    • Tags: Searchable keywords (e.g., "sales, revenue, monthly")
    • Data Lineage: ✓ MUST be enabled to track data flow
    • Auto Profiling: ✓ MUST be enabled for automatic stats collection
    💡 Tip: Good tags make data discoverable. Use business terms, not technical jargon.
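The contract rules from Step 3 can be sketched as a small validator. This is a minimal Python sketch: the `DataContract` fields and the `validate` helper are hypothetical names mirroring the lab form, not any specific platform's API.

```python
import re
from dataclasses import dataclass

# Hypothetical contract model mirroring the Step 3 form fields.
@dataclass
class DataContract:
    schema_version: str        # semantic version, e.g. "1.0.0"
    quality_threshold: float   # percent, must be >= 80
    sla_uptime: float          # percent, must be >= 90
    backward_compatible: bool  # required for production contracts

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")

def validate(contract: DataContract) -> list[str]:
    """Return a list of violations; an empty list means the contract is valid."""
    errors = []
    if not SEMVER.match(contract.schema_version):
        errors.append("schema_version must be semantic (MAJOR.MINOR.PATCH)")
    if contract.quality_threshold < 80:
        errors.append("quality_threshold must be >= 80%")
    if contract.sla_uptime < 90:
        errors.append("sla_uptime must be >= 90%")
    if not contract.backward_compatible:
        errors.append("backward compatibility is required for production")
    return errors

# A contract meeting the lab's recommended 95%+ quality and 99.9% uptime passes.
contract = DataContract("1.0.0", 95.0, 99.9, True)
assert validate(contract) == []
```

In practice a check like this would run in CI whenever a producer publishes a new contract version, blocking merges that weaken the guarantees consumers depend on.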

Lab 17: MLOps Pipeline
ML / Advanced
Scenario: Automated ML Deployment
DataAI Corp needs an end-to-end MLOps pipeline for automated model training, validation, and deployment. You'll build a CI/CD pipeline for ML models, implement feature stores, set up model monitoring, configure A/B testing infrastructure, and establish model governance. The system must handle model versioning, automated retraining, and performance tracking.

Learning Objectives:

  • Feature Engineering: Build and version feature stores
  • Model Training: Automate training pipelines
  • Deployment: Configure canary and blue-green deployments
  • Monitoring: Track drift, performance, and fairness metrics

📋 Step-by-Step Instructions

  1. Step 1: Setup Feature Store
    A feature store centralizes ML features for training and serving, ensuring consistency between training and inference.
    Configuration:
    • Store Name: Identifier for the store (e.g., "customer_features", "fraud_detection_store")
    • Feature Group: Logical grouping of related features (e.g., "user_behavior", "transaction_patterns")
    • Storage Backend: Choose your feature store platform (Feast/Tecton/Hopsworks)
    • Serving Mode: Online (real-time inference), Offline (batch), or Both
    • Versioning: ✓ MUST be enabled to track feature changes
    💡 Tip: Use "Both" serving mode if you need real-time predictions AND batch training.
  2. Step 2: Define Training Pipeline
    Create an automated pipeline that handles data ingestion, preprocessing, training, and validation.
    Configuration:
    • Pipeline Name: Descriptive name (e.g., "churn_prediction", "fraud_classifier")
    • Orchestrator: Choose workflow engine (Kubeflow/MLflow/Airflow)
    • Training Framework: Scikit-learn (tabular), TensorFlow/PyTorch (deep learning)
    • Hyperparameter Tuning: Grid Search, Random Search, or Bayesian optimization
    💡 Tip: Bayesian optimization is most efficient for complex hyperparameter spaces.
  3. Step 3: Configure Model Registry
    A model registry stores trained models with versioning, metadata, and lifecycle management.
    Configuration:
    • Registry Tool: Where models are stored (MLflow/Neptune.ai/Weights & Biases)
    • Model Version: Semantic versioning (e.g., "1.0.0")
    • Stage: Staging (testing) → Production (live) → Archived (deprecated)
    • Approval Workflow: ✓ MUST be enabled for production safety
    💡 Tip: Always use Staging before Production. Never skip the approval step!
  4. Step 4: Deployment Strategy
    Configure how models are released to production with safety mechanisms for rollback.
    Configuration:
    • Deployment Type: Canary (gradual 10%→100%), Blue-Green (instant switch), Shadow (parallel)
    • Traffic Split: % of traffic to new model (0-100)
    • Rollback Threshold: Error % that triggers automatic rollback (0-100)
    • Serving Platform: Seldon Core, KServe, or SageMaker
    💡 Tip: Start with a 10% traffic split for canary. Set the rollback threshold at 5% for safety.
  5. Step 5: Model Monitoring
    Set up continuous monitoring to detect when model performance degrades in production.
    Configuration:
    • Monitoring Metrics: Check ALL - Data Drift, Concept Drift, Performance
    • Alert Threshold: PSI/drift score that triggers an alert (e.g., 0.15)
    • Retraining Trigger: When to automatically retrain (On Drift/On Performance/Scheduled)
    💡 Tip: An alert threshold of 0.15 PSI is industry standard. Check ALL monitoring metrics!
  6. Step 6: Governance & Compliance
    Ensure models are explainable, fair, and compliant with regulations.
    Configuration:
    • Explainability Tool: How to interpret predictions (SHAP/LIME/ELI5)
    • Bias Detection: Fairlearn, AIF360, or What-If Tool
    • Compliance Framework: GDPR (EU), CCPA (California), HIPAA (Healthcare)
    • Audit Trail: ✓ MUST be enabled for regulatory compliance
    💡 Tip: SHAP is most widely accepted for explainability. Always enable the audit trail!
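The drift alert from Step 5 can be illustrated with a Population Stability Index (PSI) computation. A minimal sketch, assuming feature values have already been binned into proportions; the 0.15 threshold comes from the lab configuration, while the bin counts and sample distributions are illustrative.

```python
import math

def psi(expected, actual, eps=1e-4):
    """Population Stability Index between two binned distributions.

    expected, actual: sequences of bin proportions, each summing to ~1.
    eps guards against log(0) when a bin is empty.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

ALERT_THRESHOLD = 0.15  # the industry-standard threshold from Step 5

# Baseline (training) distribution vs. what production traffic looks like now.
baseline = [0.25, 0.25, 0.25, 0.25]
current = [0.40, 0.30, 0.20, 0.10]

score = psi(baseline, current)
if score > ALERT_THRESHOLD:  # this shifted distribution exceeds 0.15
    print(f"drift alert: PSI={score:.3f}")
```

A monitoring job would run this per feature on a schedule and route any alert to the retraining trigger configured in Step 5.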

Lab 18: Data Governance Framework
Governance / Advanced
Scenario: Enterprise Data Governance
ComplianceFirst Inc. requires a comprehensive data governance framework spanning data quality, lineage, privacy, and regulatory compliance. You'll design a governance operating model, implement data quality scorecards, establish lineage tracking, configure privacy controls, and set up compliance reporting. The framework must support GDPR, CCPA, and SOC 2 requirements with automated auditing.

Learning Objectives:

  • Operating Model: Define roles, responsibilities, and workflows
  • Data Quality: Implement quality rules and scorecards
  • Lineage: Track end-to-end data lineage
  • Compliance: Automate privacy and regulatory controls

📋 Step-by-Step Instructions

  1. Step 1: Operating Model
    Define the organizational structure for data governance with clear roles, responsibilities, and decision rights.
    Configuration:
    • Data Owner: Executive accountable for data (e.g., "VP of Data", "Chief Data Officer")
    • Data Stewards: Day-to-day data managers (comma-separated list)
    • Governance Committee: Decision-making body (e.g., "Data Governance Council")
    • Meeting Frequency: How often the committee meets (Weekly/Bi-weekly/Monthly)
    💡 Tip: Start with bi-weekly meetings. Include at least 2 stewards per major domain.
  2. Step 2: Data Quality Framework
    Implement the 4 dimensions of data quality (ACCT) with measurable rules and automated validation.
    Configuration:
    • Quality Dimensions: Check ALL 4 - Accuracy, Completeness, Consistency, Timeliness
    • Quality Threshold: Minimum acceptable quality % (must be ≥80%)
    • DQ Tool: Great Expectations (open-source), Deequ (Spark), Monte Carlo (enterprise)
    💡 Tip: Industry standard is a 95%+ threshold. Check ALL 4 dimensions!
  3. Step 3: Data Lineage
    Track data flow from source to destination to understand dependencies and impact of changes.
    Configuration:
    • Lineage Tool: Choose your platform (Manta/Alation/Collibra)
    • Capture Method: Automatic (scans SQL/ETL), API (integrates pipelines), or Manual
    • Impact Analysis: ✓ MUST be enabled for change management
    • Column-Level: ✓ MUST be enabled for detailed tracking
    💡 Tip: Use Automatic capture + Column-Level lineage for the best coverage!
  4. Step 4: Privacy Controls
    Configure protections for personally identifiable information (PII) and data subject rights.
    Configuration:
    • PII Categories: Select data types to protect (Email/SSN/Phone) - at least one
    • Consent Management: Explicit Opt-in (GDPR), Implicit, or Granular
    • Data Subject Rights: ✓ Access ✓ Deletion ✓ Portability - check ALL THREE
    💡 Tip: GDPR requires all 3 rights + Explicit consent. Check all checkboxes!
  5. Step 5: Compliance Automation
    Automate regulatory compliance checks and remediation workflows.
    Configuration:
    • Compliance Frameworks: Check ALL applicable - GDPR, CCPA, SOC 2
    • Scan Frequency: Continuous (recommended), Daily, or Weekly
    • Remediation SLA: Days to fix violations (must be ≥1 day)
    💡 Tip: Check all 3 frameworks for comprehensive compliance. Use a 30-day SLA as the baseline.
  6. Step 6: Audit & Reporting
    Configure audit trails and automated reporting for regulators and stakeholders.
    Configuration:
    • Audit Retention: Years to keep audit logs (≥1 year, recommend 7 for regulated industries)
    • Report Type: Executive (KPIs), Detailed (technical), Regulatory (auditors)
    • Report Schedule: Weekly, Monthly, or Quarterly delivery
    • Automated Delivery: ✓ MUST be enabled for compliance
    💡 Tip: Use 7-year retention for regulated industries. Always enable automated delivery!
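The quality scorecard from Step 2 can be sketched as a small aggregation over the four dimensions. A minimal Python sketch: the `scorecard` helper, its field names, and the sample scores are illustrative, not a specific DQ tool's API; the 95% threshold follows the tip in Step 2.

```python
# The four quality dimensions required in Step 2.
DIMENSIONS = ("accuracy", "completeness", "consistency", "timeliness")

def scorecard(scores: dict[str, float], threshold: float = 95.0) -> dict:
    """Average the four dimension scores and flag any dimension below threshold."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    failing = [d for d in DIMENSIONS if scores[d] < threshold]
    return {"overall": overall, "failing": failing, "passed": not failing}

result = scorecard({
    "accuracy": 98.0,
    "completeness": 96.5,
    "consistency": 94.0,  # below the 95% threshold, so this dimension fails
    "timeliness": 97.0,
})
# result["failing"] == ["consistency"], result["passed"] is False
```

A governance dashboard would compute one such scorecard per data product, feeding the failing dimensions into the remediation workflow from Step 5.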