Data Analytics Expert Labs

Master real-time stream processing, distributed data systems, and advanced analytics platforms. Build scalable data architectures with cutting-edge technologies.

Stream Processing & MLOps - Module 7

Expert-level labs covering stream processing, distributed systems, and analytics infrastructure.

Lab 19: Real-Time Stream Processing
Streaming / Expert
Scenario: High-Throughput Event Processing
TechStream Inc needs a real-time data processing pipeline to handle millions of events per second from IoT devices, user interactions, and system logs. You'll design a streaming architecture using Apache Kafka and Apache Flink, implement complex event processing with stateful computations, configure exactly-once semantics, and build real-time analytics dashboards with sub-second latency requirements.

Learning Objectives:

  • Event Ingestion: Configure Kafka topics with partitioning strategies
  • Stream Processing: Build Flink jobs with windowing and aggregations
  • State Management: Implement checkpointing and savepoints
  • Delivery Guarantees: Configure exactly-once processing semantics

πŸ“‹ Step-by-Step Instructions

  1. Step 1: Configure Kafka Topics
    Set up Kafka topics with appropriate partitioning for high-throughput event ingestion and parallel processing.
    Configuration:
    β€’ Topic Name: Descriptive identifier (e.g., "iot-sensor-events", "user-clickstream")
    β€’ Partitions: Number of partitions (minimum 3, recommended 10-50 for high volume)
    β€’ Replication Factor: Minimum 3 for production reliability
    β€’ Compression: Choose codec (Snappy/LZ4/ZSTD) - trades CPU cost against network and storage throughput
    πŸ’‘ Tip: More partitions = more parallelism but more overhead. Start with 10-20 for most use cases.
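The key-to-partition mapping behind this step can be sketched in Python. Kafka's default partitioner hashes the record key with murmur2; `md5` is used here only as a deterministic stdlib stand-in, and `partition_for` is a hypothetical helper, not a Kafka API:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition via hashing.

    Kafka's default partitioner uses murmur2; md5 is a stdlib
    stand-in so this sketch stays deterministic and runnable.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands on the same partition,
# which is what preserves per-key ordering in Kafka.
assert partition_for("sensor-42", 12) == partition_for("sensor-42", 12)
```

Because assignment is a pure function of the key, all events for one device stay on one partition and are consumed in order.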
  2. Step 2: Design Flink Stream Job
    Create a Flink streaming application with event deserialization, filtering, transformation, and enrichment logic.
    Configuration:
    β€’ Job Name: Application identifier (e.g., "realtime-analytics-pipeline")
    β€’ Parallelism: Number of parallel instances (1-100, match partition count)
    β€’ Time Characteristic: Event Time (for accuracy) or Processing Time (for speed)
    β€’ Watermark Strategy: Bounded out-of-orderness tolerance in milliseconds (or monotonically increasing timestamps for in-order streams)
    πŸ’‘ Tip: Use Event Time with watermarks for accurate time-based operations. Processing Time is simpler but less accurate.
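The watermarking logic above can be sketched with a minimal class (a simplification of Flink's bounded-out-of-orderness generator; the class name and methods are illustrative, not PyFlink API):

```python
class BoundedOutOfOrdernessWatermark:
    """Sketch of a bounded-out-of-orderness watermark generator.

    The watermark trails the highest event timestamp seen so far by a
    fixed tolerance, signalling that no earlier events are expected.
    """

    def __init__(self, max_out_of_orderness_ms: int):
        self.tolerance = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts_ms: int) -> None:
        # Track the maximum timestamp; late events never move it backwards.
        self.max_ts = max(self.max_ts, event_ts_ms)

    def current_watermark(self) -> float:
        # Everything at or before this timestamp is considered complete.
        return self.max_ts - self.tolerance
```

An event arriving out of order (e.g. timestamp 900 after 1000) does not regress the watermark, so windows close exactly once.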
  3. Step 3: Configure Windowing Operations
    Define time windows for aggregations like counts, sums, averages over sliding or tumbling time periods.
    Configuration:
    β€’ Window Type: Tumbling (non-overlapping), Sliding (overlapping), or Session (gap-based)
    β€’ Window Size: Duration in seconds (e.g., 60 for 1-minute windows)
    β€’ Allowed Lateness: Grace period for late events (seconds)
    β€’ Trigger Strategy: When to emit results (on time/on count/on event)
    πŸ’‘ Tip: Tumbling windows for discrete time buckets (hourly reports). Sliding for moving averages.
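Tumbling window assignment reduces to simple timestamp alignment. A minimal stdlib sketch (helper names are illustrative):

```python
from collections import defaultdict

def tumbling_window_start(ts_ms: int, size_ms: int) -> int:
    """Align a timestamp to the start of its non-overlapping window."""
    return ts_ms - (ts_ms % size_ms)

def count_per_window(event_timestamps, size_ms):
    """Count events per tumbling window (a per-window aggregation)."""
    counts = defaultdict(int)
    for ts in event_timestamps:
        counts[tumbling_window_start(ts, size_ms)] += 1
    return dict(counts)
```

With 60-second windows, events at 100 ms and 59 s share the first bucket while an event at 61 s opens the next, matching the "discrete time buckets" behavior described above.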
  4. Step 4: Implement State Management
    Configure stateful operations with checkpointing to enable fault tolerance and exactly-once processing.
    Configuration:
    β€’ State Backend: Choose storage (Memory/RocksDB/Filesystem)
    β€’ Checkpoint Interval: How often to checkpoint (ms, recommended 10000-60000)
    β€’ Checkpoint Timeout: Max duration before failure (milliseconds)
    β€’ Exactly-Once: βœ“ MUST be enabled for production consistency
    πŸ’‘ Tip: RocksDB for large state, Filesystem for medium. Checkpoint every 30-60 seconds for balance.
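Checkpoint-and-restore for keyed state can be sketched in a few lines. This is a conceptual stand-in, not a Flink state backend: a real backend (e.g. RocksDB) snapshots to durable storage and the job replays the source from the matching offsets after restore:

```python
import copy

class KeyedCounterState:
    """Sketch of keyed state with checkpoint/restore semantics."""

    def __init__(self):
        self.state = {}  # key -> running count

    def add(self, key: str, value: int) -> None:
        self.state[key] = self.state.get(key, 0) + value

    def checkpoint(self) -> dict:
        # Snapshot the full state; real backends do this incrementally
        # and asynchronously to durable storage.
        return copy.deepcopy(self.state)

    def restore(self, snapshot: dict) -> None:
        # On failure, discard in-flight changes and rewind to the
        # last completed checkpoint.
        self.state = copy.deepcopy(snapshot)
```

Combined with source offset replay, restoring the last checkpoint is what makes exactly-once results possible after a crash.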
  5. Step 5: Configure Sink Connectors
    Define output destinations for processed events with appropriate consistency and latency guarantees.
    Configuration:
    β€’ Sink Type: Choose destination (Kafka/Elasticsearch/Cassandra/PostgreSQL)
    β€’ Write Mode: Append-only, Upsert, or Retract for handling updates
    β€’ Batch Size: Events per write batch (100-1000 for efficiency)
    β€’ Flush Interval: Max time to wait before flushing (seconds)
    πŸ’‘ Tip: Elasticsearch for search/analytics, Kafka for further streaming, PostgreSQL for transactional data.
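The batch-size behavior of a sink can be sketched with a small buffer class (illustrative only; real connectors also flush on a timer and on checkpoint):

```python
class BatchingSink:
    """Sketch of a sink that buffers events and flushes by batch size."""

    def __init__(self, batch_size: int, write_fn):
        self.batch_size = batch_size
        self.write_fn = write_fn  # e.g. a bulk write to the destination
        self.buffer = []

    def emit(self, event) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Called on batch-size, on the flush interval, and on shutdown.
        if self.buffer:
            self.write_fn(list(self.buffer))
            self.buffer.clear()
```

Larger batches amortize per-write overhead at the destination; the flush interval bounds how stale the sink can get under low traffic.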
  6. Step 6: Setup Monitoring & Alerting
    Configure metrics collection and alerting for throughput, latency, backpressure, and failures.
    Configuration:
    β€’ Metrics Reporter: Prometheus, Datadog, or CloudWatch for metrics export
    β€’ Lag Alert Threshold: Consumer lag triggering alert (records, e.g., 100000)
    β€’ Latency Alert: Processing delay threshold (ms, e.g., 5000)
    β€’ Dashboard: βœ“ Grafana or Kibana for visualization
    πŸ’‘ Tip: Monitor consumer lag closely - it indicates if processing can't keep up with ingestion rate.
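Consumer lag and the alert rule above reduce to simple offset arithmetic (function names are illustrative):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> int:
    """Total lag across partitions: log-end offset minus committed offset."""
    return sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)

def should_alert(end_offsets: dict, committed: dict, threshold: int) -> bool:
    """Fire when total lag exceeds the configured threshold (e.g. 100000)."""
    return consumer_lag(end_offsets, committed) > threshold
```

Steadily growing lag means the Flink job is consuming slower than producers are writing, i.e. processing cannot keep up with ingestion.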

Lab 20: MLOps Pipeline
ML / Advanced
Scenario: Automated ML Deployment
DataAI Corp needs an end-to-end MLOps pipeline for automated model training, validation, and deployment. You'll build a CI/CD pipeline for ML models, implement feature stores, set up model monitoring, configure A/B testing infrastructure, and establish model governance. The system must handle model versioning, automated retraining, and performance tracking.

Learning Objectives:

  • Feature Engineering: Build and version feature stores
  • Model Training: Automate training pipelines
  • Deployment: Configure canary and blue-green deployments
  • Monitoring: Track drift, performance, and fairness metrics

πŸ“‹ Step-by-Step Instructions

  1. Step 1: Setup Feature Store
    A feature store centralizes ML features for training and serving, ensuring consistency between training and inference.
    Configuration:
    β€’ Store Name: Identifier for the store (e.g., "customer_features", "fraud_detection_store")
    β€’ Feature Group: Logical grouping of related features (e.g., "user_behavior", "transaction_patterns")
    β€’ Storage Backend: Choose your feature store platform (Feast/Tecton/Hopsworks)
    β€’ Serving Mode: Online (real-time inference), Offline (batch), or Both
    β€’ Versioning: βœ“ MUST be enabled to track feature changes
    πŸ’‘ Tip: Use "Both" serving mode if you need real-time predictions AND batch training.
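The online/offline split can be sketched with a tiny in-memory store (a conceptual stand-in, not the Feast/Tecton API; all names are hypothetical):

```python
class MiniFeatureStore:
    """Sketch of a feature store with a versioned offline log and an
    online view of the latest value per entity, so training and serving
    read consistent features."""

    def __init__(self):
        self.offline = []   # full history: (entity_id, feature, value, version)
        self.online = {}    # latest value per (entity_id, feature)
        self.version = 0

    def ingest(self, entity_id: str, feature: str, value) -> None:
        self.version += 1
        self.offline.append((entity_id, feature, value, self.version))
        self.online[(entity_id, feature)] = value

    def get_online(self, entity_id: str, feature: str):
        # Low-latency point lookup used at inference time.
        return self.online.get((entity_id, feature))
```

The offline log serves batch training (full history, point-in-time joins in real systems), while the online view serves real-time inference - which is why "Both" covers both use cases.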
  2. Step 2: Define Training Pipeline
    Create an automated pipeline that handles data ingestion, preprocessing, training, and validation.
    Configuration:
    β€’ Pipeline Name: Descriptive name (e.g., "churn_prediction", "fraud_classifier")
    β€’ Orchestrator: Choose workflow engine (Kubeflow/MLflow/Airflow)
    β€’ Training Framework: Scikit-learn (tabular), TensorFlow/PyTorch (deep learning)
    β€’ Hyperparameter Tuning: Grid Search, Random Search, or Bayesian optimization
    πŸ’‘ Tip: Bayesian optimization is most efficient for complex hyperparameter spaces.
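Of the tuning strategies listed, random search is the simplest to sketch with the stdlib (the function and its signature are illustrative, not a library API):

```python
import random

def random_search(param_space: dict, score_fn, n_trials: int, seed: int = 0):
    """Random hyperparameter search: sample n_trials configurations
    from the space and keep the highest-scoring one."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)  # e.g. cross-validated metric in practice
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Bayesian optimization improves on this by modeling `score_fn` and sampling where the model expects gains, which is why it needs fewer trials on expensive training runs.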
  3. Step 3: Configure Model Registry
    A model registry stores trained models with versioning, metadata, and lifecycle management.
    Configuration:
    β€’ Registry Tool: Where models are stored (MLflow/Neptune.ai/Weights & Biases)
    β€’ Model Version: Semantic versioning (e.g., "1.0.0")
    β€’ Stage: Staging (testing) β†’ Production (live) β†’ Archived (deprecated)
    β€’ Approval Workflow: βœ“ MUST be enabled for production safety
    πŸ’‘ Tip: Always use Staging before Production. Never skip the approval step!
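The stage lifecycle with an approval gate can be sketched as follows (a conceptual model, not the MLflow registry API):

```python
class ModelRegistry:
    """Sketch of a registry tracking model versions and lifecycle stages.

    Promotion to Production requires an explicit approval flag, matching
    the approval-workflow requirement above.
    """

    def __init__(self):
        self.models = {}  # (name, version) -> stage

    def register(self, name: str, version: str) -> None:
        # New versions always enter Staging first.
        self.models[(name, version)] = "Staging"

    def promote(self, name: str, version: str, approved: bool) -> str:
        if not approved:
            raise PermissionError("Production promotion requires approval")
        self.models[(name, version)] = "Production"
        return self.models[(name, version)]

    def archive(self, name: str, version: str) -> None:
        self.models[(name, version)] = "Archived"
```

Encoding the approval check in the registry itself (rather than in team convention) is what makes "never skip the approval step" enforceable.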
  4. Step 4: Deployment Strategy
    Configure how models are released to production with safety mechanisms for rollback.
    Configuration:
    β€’ Deployment Type: Canary (gradual 10%β†’100%), Blue-Green (instant switch), Shadow (parallel)
    β€’ Traffic Split: % of traffic to new model (0-100)
    β€’ Rollback Threshold: Error % that triggers automatic rollback (0-100)
    β€’ Serving Platform: Seldon Core, KServe, or SageMaker
    πŸ’‘ Tip: Start with 10% traffic split for canary. Set rollback threshold at 5% for safety.
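Canary routing and the rollback rule can be sketched with two small functions (names are illustrative; serving platforms implement this internally):

```python
import hashlib

def route_to_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically send a fixed % of traffic to the canary model
    by bucketing a stable hash of the request/user id into 0-99."""
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

def should_rollback(errors: int, requests: int, threshold_percent: float) -> bool:
    """Trigger automatic rollback when the canary error rate exceeds
    the configured threshold (e.g. 5%)."""
    if requests == 0:
        return False  # no canary traffic yet, nothing to judge
    return 100.0 * errors / requests > threshold_percent
</n```

Hash-based routing keeps each user pinned to one model version during the rollout, so per-user behavior stays consistent while the split ramps from 10% toward 100%.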
  5. Step 5: Model Monitoring
    Set up continuous monitoring to detect when model performance degrades in production.
    Configuration:
    β€’ Monitoring Metrics: Check ALL - Data Drift, Concept Drift, Performance
    β€’ Alert Threshold: PSI/drift score that triggers alert (e.g., 0.15)
    β€’ Retraining Trigger: When to automatically retrain (On Drift/On Performance/Scheduled)
    πŸ’‘ Tip: Alert threshold of 0.15 PSI is industry standard. Check ALL monitoring metrics!
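The PSI drift score referenced above has a compact definition and can be computed directly (inputs are per-bin fractions of the baseline and current distributions):

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index between two binned distributions
    (fractions per bin, each summing to 1). Scores above roughly
    0.15-0.25 are commonly treated as significant drift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, 1e-6)  # avoid log(0) for empty bins
        a = max(a, 1e-6)
        total += (a - e) * math.log(a / e)
    return total
```

Identical distributions score 0; a feature whose mass shifts from a 50/50 split to 90/10 scores well above the 0.15 alert threshold.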
  6. Step 6: Governance & Compliance
    Ensure models are explainable, fair, and compliant with regulations.
    Configuration:
    β€’ Explainability Tool: How to interpret predictions (SHAP/LIME/ELI5)
    β€’ Bias Detection: Fairlearn, AIF360, or What-If Tool
    β€’ Compliance Framework: GDPR (EU), CCPA (California), HIPAA (Healthcare)
    β€’ Audit Trail: βœ“ MUST be enabled for regulatory compliance
    πŸ’‘ Tip: SHAP is most widely accepted for explainability. Always enable audit trail!

Lab 21: Data Governance Framework
Governance / Advanced
Scenario: Enterprise Data Governance
ComplianceFirst Inc. requires a comprehensive data governance framework spanning data quality, lineage, privacy, and regulatory compliance. You'll design a governance operating model, implement data quality scorecards, establish lineage tracking, configure privacy controls, and set up compliance reporting. The framework must support GDPR, CCPA, and SOC 2 requirements with automated auditing.

Learning Objectives:

  • Operating Model: Define roles, responsibilities, and workflows
  • Data Quality: Implement quality rules and scorecards
  • Lineage: Track end-to-end data lineage
  • Compliance: Automate privacy and regulatory controls

πŸ“‹ Step-by-Step Instructions

  1. Step 1: Operating Model
    Define the organizational structure for data governance with clear roles, responsibilities, and decision rights.
    Configuration:
    β€’ Data Owner: Executive accountable for data (e.g., "VP of Data", "Chief Data Officer")
    β€’ Data Stewards: Day-to-day data managers (comma-separated list)
    β€’ Governance Committee: Decision-making body (e.g., "Data Governance Council")
    β€’ Meeting Frequency: How often the committee meets (Weekly/Bi-weekly/Monthly)
    πŸ’‘ Tip: Start with bi-weekly meetings. Include at least 2 stewards per major domain.
  2. Step 2: Data Quality Framework
    Implement the 4 dimensions of data quality (ACCT) with measurable rules and automated validation.
    Configuration:
    β€’ Quality Dimensions: Check ALL 4 - Accuracy, Completeness, Consistency, Timeliness
    β€’ Quality Threshold: Minimum acceptable quality % (must be β‰₯80%)
    β€’ DQ Tool: Great Expectations (open-source), Deequ (Spark), Monte Carlo (enterprise)
    πŸ’‘ Tip: Industry standard is 95%+ threshold. Check ALL 4 dimensions!
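Aggregating the four dimensions into a scorecard verdict can be sketched like this (function and field names are illustrative; tools such as Great Expectations produce richer results):

```python
def quality_scorecard(dimension_scores: dict, threshold: float = 95.0) -> dict:
    """Aggregate per-dimension quality scores (0-100) into an overall
    score and a pass/fail verdict against the threshold."""
    required = {"accuracy", "completeness", "consistency", "timeliness"}
    missing = required - set(dimension_scores)
    if missing:
        # Enforce the "check ALL 4 dimensions" rule from the step above.
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    overall = sum(dimension_scores.values()) / len(dimension_scores)
    return {"overall": overall, "passed": overall >= threshold}
```

A simple mean is used here; in practice teams often weight dimensions differently per dataset (e.g. timeliness matters more for streaming sources).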
  3. Step 3: Data Lineage
    Track data flow from source to destination to understand dependencies and impact of changes.
    Configuration:
    β€’ Lineage Tool: Choose your platform (Manta/Alation/Collibra)
    β€’ Capture Method: Automatic (scans SQL/ETL), API (integrates pipelines), or Manual
    β€’ Impact Analysis: βœ“ MUST be enabled for change management
    β€’ Column-Level: βœ“ MUST be enabled for detailed tracking
    πŸ’‘ Tip: Use Automatic capture + Column-Level lineage for best coverage!
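Impact analysis over a lineage graph is a downstream reachability search. A minimal sketch, with the graph as a plain adjacency map and hypothetical asset names:

```python
from collections import deque

def downstream_impact(edges: dict, changed: str) -> set:
    """Return every asset reachable downstream of a changed source.

    `edges` maps each asset to its direct downstream assets; a BFS
    collects the full blast radius of a schema or logic change.
    """
    impacted, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted
```

Running this before altering a source table tells stewards exactly which downstream marts and reports need review, which is the change-management value of enabling impact analysis.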
  4. Step 4: Privacy Controls
    Configure protections for personally identifiable information (PII) and data subject rights.
    Configuration:
    β€’ PII Categories: Select data types to protect (Email/SSN/Phone) - at least one
    β€’ Consent Management: Explicit Opt-in (GDPR), Implicit, or Granular
    β€’ Data Subject Rights: βœ“ Access βœ“ Deletion βœ“ Portability - check ALL THREE
    πŸ’‘ Tip: GDPR requires all 3 rights + Explicit consent. Check all checkboxes!
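Masking the three PII categories above can be sketched with regular expressions. These patterns are deliberately simplified assumptions (US-style SSN/phone formats); production detectors vary by locale and system:

```python
import re

# Simplified, hypothetical patterns for the three PII categories above.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace each detected PII value with a category placeholder."""
    for category, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{category.upper()}]", text)
    return text
```

Masking at ingestion limits PII spread, but deletion and portability rights still require tracking where the raw values live - which is why the lineage work in Step 3 feeds directly into privacy controls.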
  5. Step 5: Compliance Automation
    Automate regulatory compliance checks and remediation workflows.
    Configuration:
    β€’ Compliance Frameworks: Check ALL applicable - GDPR, CCPA, SOC 2
    β€’ Scan Frequency: Continuous (recommended), Daily, or Weekly
    β€’ Remediation SLA: Days to fix violations (must be β‰₯1 day)
    πŸ’‘ Tip: Check all 3 frameworks for comprehensive compliance. Use 30-day SLA as baseline.
  6. Step 6: Audit & Reporting
    Configure audit trails and automated reporting for regulators and stakeholders.
    Configuration:
    β€’ Audit Retention: Years to keep audit logs (β‰₯1 year, recommend 7 for regulated)
    β€’ Report Type: Executive (KPIs), Detailed (technical), Regulatory (auditors)
    β€’ Report Schedule: Weekly, Monthly, or Quarterly delivery
    β€’ Automated Delivery: βœ“ MUST be enabled for compliance
    πŸ’‘ Tip: Use 7-year retention for regulated industries. Always enable automated delivery!