ML Pipelines & Cloud - Module 4
Expert-level labs covering machine learning pipelines, data warehouse optimization, and cloud integration.
Lab 10: Machine Learning Data Pipeline
ML / Expert
Scenario: Predictive Analytics Pipeline
DataML Corp needs an end-to-end ML pipeline for customer churn prediction. You'll build a data pipeline that ingests raw customer data, performs feature engineering, trains multiple models, evaluates performance, and deploys the best model. Your pipeline must handle data preprocessing, feature scaling, model selection, and hyperparameter tuning.
Learning Objectives:
Data Preprocessing: Handle missing values, outliers, and encoding
Feature Engineering: Create and select meaningful features
Model Training: Train and compare multiple ML algorithms
Model Evaluation: Calculate metrics and validate performance
📋 Step-by-Step Instructions
Step 1: Data Ingestion
🎯 Goal: Load and validate customer dataset
📝 Data Quality Checks:
Before training, validate data quality: check for nulls, duplicates, data types, value ranges. Bad data = bad models. Use pandas profiling or custom validation functions.
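The "custom validation functions" mentioned above can be a simple pure-Python pass over the rows. A minimal sketch, assuming hypothetical column names (`customer_id`, `tenure`, `churn`) for the churn dataset:

```python
def validate_rows(rows, required=("customer_id", "tenure", "churn")):
    """Return a dict of data-quality issue counts for a list of row dicts."""
    issues = {"nulls": 0, "duplicates": 0, "out_of_range": 0}
    seen_ids = set()
    for row in rows:
        # Null / missing checks on required columns
        if any(row.get(col) in (None, "") for col in required):
            issues["nulls"] += 1
        # Duplicate primary-key check
        cid = row.get("customer_id")
        if cid in seen_ids:
            issues["duplicates"] += 1
        seen_ids.add(cid)
        # Range check: the churn target must be a binary label
        if row.get("churn") not in (0, 1, None, ""):
            issues["out_of_range"] += 1
    return issues
```

In practice you would run this (or a pandas profiling report) on the full dataset and fail the pipeline if any count exceeds a threshold.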
💡 Exam Tip: Always use cross-validation (k-fold) to prevent overfitting. Never evaluate on training data alone!
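To make the k-fold idea concrete, here is a toy index splitter in plain Python (in practice scikit-learn's `KFold` does this, plus optional shuffling):

```python
def kfold_indices(n, k=5):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation.

    Each sample lands in the validation fold exactly once, so every
    evaluation is on data the model did not train on.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```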
Step 5: Evaluate Performance
🎯 Goal: Calculate metrics and compare models
📝 Key Metrics:
• Accuracy: Overall correctness
• Precision: Positive prediction accuracy
• Recall: True positive rate
• F1-Score: Harmonic mean of precision and recall
• AUC-ROC: Model discrimination ability
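Apart from AUC-ROC, these metrics all fall out of the confusion-matrix counts. A self-contained sketch (scikit-learn's `metrics` module computes the same values, plus AUC-ROC, from raw predictions):

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the headline classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```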
💻 Action: Click "Evaluate Models"
Step 6: Deploy Best Model
🎯 Goal: Select and deploy top-performing model
💻 Deployment:
1. Review model comparison
2. Select best model by F1-score
3. Click "Deploy Model"
4. Generate predictions on test set
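Step 2 above is one line of code once the evaluation results are in hand; a sketch assuming a hypothetical `results` layout of model name to metrics dict:

```python
def best_by_f1(results):
    """Return the name of the model with the highest F1-score.

    `results` maps model name -> metrics dict, e.g. {"xgb": {"f1": 0.78}}.
    """
    return max(results, key=lambda name: results[name]["f1"])
```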
✅ Success: Model deployed! Monitor performance metrics in production and retrain when accuracy degrades.
ML Pipeline Builder - DataML Platform
Pipeline Steps
Data Ingestion: Pending
Preprocessing: Pending
Feature Engineering: Pending
Model Training: Pending
Evaluation: Pending
Deployment: Pending
Pipeline Actions
Pipeline Info
Dataset: customer_data.csv
Target: churn (binary classification)
Rows: 50,000
Features: 15
Lab 11: Data Warehouse Optimization
Performance / Expert
Scenario: Snowflake Performance Tuning
FastQuery Inc. has a Snowflake data warehouse with slow query performance. You'll analyze query execution plans, create clustering keys, implement materialized views, optimize join strategies, and configure warehouse auto-scaling. Your optimizations must reduce query times by at least 50% while maintaining cost efficiency.
Learning Objectives:
Query Optimization: Analyze and improve execution plans
Clustering Keys: Improve table scan performance
Materialized Views: Pre-compute aggregations
Warehouse Sizing: Right-size compute resources
📋 Step-by-Step Instructions
Step 1: Analyze Query Performance
🎯 Goal: Identify slow queries using query history
📝 Performance Analysis:
Use Snowflake's QUERY_HISTORY view to find queries with high execution time, high bytes scanned, or full table scans. Look for missing join conditions, filters on unclustered columns, or oversized result sets.
💻 Command: SELECT query_id, query_text, execution_time FROM snowflake.account_usage.query_history WHERE execution_time > 60000 ORDER BY execution_time DESC LIMIT 10; -- execution_time is in milliseconds, so 60000 = 1 minute
💡 Pro Tip: Focus on queries that both run frequently AND run long; those are the highest-impact optimizations.
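"Impact" here is roughly run count times average runtime. A toy ranking helper, assuming a hypothetical pre-aggregated `query_stats` list built from the query history:

```python
def rank_by_impact(query_stats):
    """Sort queries by total time consumed (runs * avg_ms), highest first.

    A query that runs 500 times at 4s each outranks a one-off 100s query.
    """
    return sorted(query_stats, key=lambda q: q["runs"] * q["avg_ms"], reverse=True)
```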
Step 2: Create Clustering Keys
🎯 Goal: Add clustering to improve table scan performance
📝 Clustering Keys:
Clustering organizes data physically by specified columns. Best for columns used in WHERE clauses, JOINs, and ORDER BY. Snowflake auto-maintains clusters as data changes.
💻 Command: ALTER TABLE sales_fact CLUSTER BY (date_id, region_id);
📖 Best Practice: Use 1-3 columns for clustering. Too many columns reduce effectiveness and increase maintenance cost.
Step 3: Implement Materialized View
🎯 Goal: Pre-compute expensive aggregations
📝 Materialized Views:
Store pre-computed query results. Perfect for frequently-run aggregations (daily sales, monthly KPIs). Snowflake auto-refreshes when base tables change. Much faster than recomputing every time.
💻 Command: CREATE MATERIALIZED VIEW daily_sales_mv AS SELECT date_id, SUM(amount) AS total_amount FROM sales_fact GROUP BY date_id;
Step 4: Optimize Join Strategy
🎯 Goal: Reduce join execution time
📝 Join Optimization:
• Filter before joining (reduce row count)
• Join smaller tables first
• Use proper join types (INNER vs LEFT)
• Ensure join keys have statistics
• Consider broadcasting small dimension tables
💻 Optimized Query: SELECT * FROM sales_fact f JOIN (SELECT * FROM dim_date WHERE year = 2023) d ON f.date_id = d.date_id;
💡 Exam Tip: Always filter dimensions BEFORE joining to facts. Reduces join cardinality significantly.
Step 5: Configure Warehouse Auto-Scaling
🎯 Goal: Enable auto-scaling for variable workloads
📝 Auto-Scaling:
Auto-scaling adds compute clusters when query queue builds up. Set MIN and MAX clusters. Scale-out handles concurrent queries, scale-up handles complex single queries. Balance performance vs cost.
💻 Command: ALTER WAREHOUSE analytics_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4 AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
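The scale-out behaviour can be pictured with a toy policy: add a cluster for every N concurrent queries, clamped between the MIN and MAX settings above. This is an illustration only; Snowflake's actual scaling policies (Standard vs Economy) are more nuanced:

```python
import math

def target_clusters(queued, running, min_clusters=1, max_clusters=4, per_cluster=8):
    """Toy queue-driven scale-out: one cluster per `per_cluster` concurrent
    queries, clamped to the configured [min, max] range."""
    active = queued + running
    needed = math.ceil(active / per_cluster) if active else min_clusters
    return max(min_clusters, min(max_clusters, needed))
```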
✅ Success: Target at least 50% reduction in query execution time and 70% reduction in bytes scanned!
Snowflake Console - FastQuery Warehouse
Lab 12: Cloud Data Integration
Cloud / Expert
Scenario: Multi-Cloud Data Pipeline
GlobalData Corp operates across AWS, Azure, and GCP. You'll build a unified data integration pipeline that ingests data from S3, processes it with Azure Data Factory, stores it in BigQuery, and syncs to Snowflake. Your pipeline must handle authentication, data transformation, error handling, and monitoring across all platforms.
Learning Objectives:
Cloud Storage: Connect to S3, Azure Blob, GCS
Data Transformation: Apply cloud-native ETL
Cross-Platform Sync: Move data between clouds
Security: Implement IAM roles and encryption
📋 Step-by-Step Instructions
Step 1: Connect to AWS S3
🎯 Goal: Establish connection to S3 data source
📝 S3 Authentication:
Use IAM roles (best practice) or access keys. Configure bucket policy to allow cross-account access if needed. Use S3 Select for efficient querying of large files.
💡 Security: Never hard-code credentials! Use environment variables or secret managers (AWS Secrets Manager, Azure Key Vault).
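A minimal sketch of the environment-variable approach, using only the standard library (the variable names follow the AWS convention; in production a secret manager would populate them):

```python
import os

def load_s3_credentials():
    """Fetch AWS credentials from environment variables; fail fast if absent."""
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if not key or not secret:
        raise RuntimeError(
            "AWS credentials not set; configure an IAM role "
            "or export them via your secret manager"
        )
    return {"access_key": key, "secret_key": secret}
```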
Step 2: Setup Azure Data Factory
🎯 Goal: Configure ADF pipeline for transformation
📝 ADF Pipelines:
Azure Data Factory orchestrates data movement and transformation. Create linked services (connections), datasets (data structures), and pipelines (workflows). Use mapping data flows for complex transformations.
Step 3: Connect to BigQuery
🎯 Goal: Configure BigQuery as the target data store
📝 BigQuery Connection:
Use service account with BigQuery Data Editor role. Enable BigQuery Data Transfer Service for scheduled loads. Consider partitioning and clustering on target tables for performance.
📖 Cost Optimization: Use BigQuery streaming inserts for real-time data and batch loads for cost-effective bulk ingestion.
Step 4: Transform & Load Data
🎯 Goal: Execute ETL pipeline across clouds
📝 Data Transformation:
Apply transformations: data type conversions, column renaming, filtering, aggregations. Validate data quality (nulls, duplicates, constraints). Handle errors with retry logic and dead letter queues.
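The retry-plus-dead-letter pattern mentioned above can be sketched in a few lines; `records` and `transform` are placeholders for your actual batch and transformation logic:

```python
def run_with_retry(records, transform, max_attempts=3):
    """Apply `transform` to each record; retry transient failures, then
    route records that still fail to a dead-letter list for inspection."""
    loaded, dead_letter = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                loaded.append(transform(record))
                break  # success: stop retrying this record
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letter.append({"record": record, "error": str(exc)})
    return loaded, dead_letter
```

In a real pipeline the dead-letter list would be a durable queue or table so failed records can be inspected and replayed.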
📝 Cross-Platform Sync:
Export from BigQuery to GCS, then COPY INTO Snowflake. Use Snowpipe for continuous loading. Configure external stages pointing to cloud storage. Consider change data capture (CDC) for incremental updates.
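A timestamp watermark is the simplest CDC strategy for those incremental updates; a sketch over hypothetical row dicts with an `updated_at` field:

```python
def incremental_batch(rows, last_watermark):
    """Return rows newer than the stored watermark plus the new watermark.

    Persist the returned watermark between runs so each sync only moves
    rows changed since the previous load.
    """
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark
```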