ML Pipelines & Cloud - Module 4
Expert-level labs covering machine learning pipelines, data warehouse optimization, and cloud integration.
Lab 10: Machine Learning Data Pipeline
ML / Expert
Scenario: Predictive Analytics Pipeline
DataML Corp needs an end-to-end ML pipeline for customer churn prediction. You'll build a data pipeline that ingests raw customer data, performs feature engineering, trains multiple models, evaluates performance, and deploys the best model. Your pipeline must handle data preprocessing, feature scaling, model selection, and hyperparameter tuning.
Learning Objectives:
Data Preprocessing: Handle missing values, outliers, and encoding
Feature Engineering: Create and select meaningful features
Model Training: Train and compare multiple ML algorithms
Model Evaluation: Calculate metrics and validate performance
📋 Step-by-Step Instructions
Step 1: Data Ingestion
🎯 Goal: Load and validate customer dataset
📝 Data Quality Checks:
Before training, validate data quality: check for nulls, duplicates, data types, value ranges. Bad data = bad models. Use pandas profiling or custom validation functions.
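The "custom validation functions" mentioned above can be a simple pure-Python pass over the rows. A minimal sketch, assuming hypothetical column names (`customer_id`, `tenure`, `churn`) for the churn dataset:

```python
def validate_rows(rows, required=("customer_id", "tenure", "churn")):
    """Return a dict of data-quality issue counts for a list of row dicts."""
    issues = {"nulls": 0, "duplicates": 0, "out_of_range": 0}
    seen_ids = set()
    for row in rows:
        # Null / missing checks on required columns
        if any(row.get(col) in (None, "") for col in required):
            issues["nulls"] += 1
        # Duplicate primary-key check
        cid = row.get("customer_id")
        if cid in seen_ids:
            issues["duplicates"] += 1
        seen_ids.add(cid)
        # Range check: the churn target must be a binary label
        if row.get("churn") not in (0, 1, None, ""):
            issues["out_of_range"] += 1
    return issues
```

In practice you would run this (or a pandas profiling report) on the full dataset and fail the pipeline if any count exceeds a threshold.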
💡 Exam Tip: Always use cross-validation (k-fold) to prevent overfitting. Never evaluate on training data alone!
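To make the k-fold idea concrete, here is a toy index splitter in plain Python (in practice scikit-learn's `KFold` does this, plus optional shuffling):

```python
def kfold_indices(n, k=5):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation.

    Each sample lands in the validation fold exactly once, so every
    evaluation is on data the model did not train on.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```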
Step 5: Evaluate Performance
🎯 Goal: Calculate metrics and compare models
📝 Key Metrics:
• Accuracy: Overall correctness
• Precision: Positive prediction accuracy
• Recall: True positive rate
• F1-Score: Harmonic mean of precision and recall
• AUC-ROC: Model discrimination ability
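Apart from AUC-ROC, these metrics all fall out of the confusion-matrix counts. A self-contained sketch (scikit-learn's `metrics` module computes the same values, plus AUC-ROC, from raw predictions):

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the headline classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```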
💻 Action: Click "Evaluate Models"
Step 6: Deploy Best Model
🎯 Goal: Select and deploy top-performing model
💻 Deployment:
1. Review model comparison
2. Select best model by F1-score
3. Click "Deploy Model"
4. Generate predictions on test set
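Step 2 above is one line of code once the evaluation results are in hand; a sketch assuming a hypothetical `results` layout of model name to metrics dict:

```python
def best_by_f1(results):
    """Return the name of the model with the highest F1-score.

    `results` maps model name -> metrics dict, e.g. {"xgb": {"f1": 0.78}}.
    """
    return max(results, key=lambda name: results[name]["f1"])
```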
✅ Success: Model deployed! Monitor performance metrics in production and retrain when accuracy degrades.
ML Pipeline Builder - DataML Platform
Pipeline Steps
Data Ingestion: Pending
Preprocessing: Pending
Feature Engineering: Pending
Model Training: Pending
Evaluation: Pending
Deployment: Pending
Pipeline Actions
Pipeline Info
Dataset: customer_data.csv
Target: churn (binary classification)
Rows: 50,000
Features: 15
Lab 11: Data Warehouse Optimization
Performance / Expert
Scenario: Snowflake Performance Tuning
FastQuery Inc. has a Snowflake data warehouse with slow query performance. You'll analyze query execution plans, create clustering keys, implement materialized views, optimize join strategies, and configure warehouse auto-scaling. Your optimizations must reduce query times by at least 50% while maintaining cost efficiency.
Learning Objectives:
Query Optimization: Analyze and improve execution plans
Clustering Keys: Improve table scan performance
Materialized Views: Pre-compute aggregations
Warehouse Sizing: Right-size compute resources
📋 Step-by-Step Instructions
Step 1: Analyze Query Performance
🎯 Goal: Identify slow queries using query history
📝 Performance Analysis:
Use Snowflake's QUERY_HISTORY view to find queries with high execution time, high bytes scanned, or full table scans. Look for missing join conditions, filters on unclustered columns, or oversized result sets.
💻 Command: SELECT query_id, query_text, execution_time FROM snowflake.account_usage.query_history WHERE execution_time > 60000 ORDER BY execution_time DESC LIMIT 10; -- execution_time is in milliseconds, so 60000 = 1 minute
💡 Pro Tip: Focus on queries that both run frequently AND run long; those are the highest-impact optimizations.
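"Impact" here is roughly run count times average runtime. A toy ranking helper, assuming a hypothetical pre-aggregated `query_stats` list built from the query history:

```python
def rank_by_impact(query_stats):
    """Sort queries by total time consumed (runs * avg_ms), highest first.

    A query that runs 500 times at 4s each outranks a one-off 100s query.
    """
    return sorted(query_stats, key=lambda q: q["runs"] * q["avg_ms"], reverse=True)
```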
Step 2: Create Clustering Keys
🎯 Goal: Add clustering to improve table scan performance
📝 Clustering Keys:
Clustering organizes data physically by specified columns. Best for columns used in WHERE clauses, JOINs, and ORDER BY. Snowflake auto-maintains clusters as data changes.
💻 Command: ALTER TABLE sales_fact CLUSTER BY (date_id, region_id);
📖 Best Practice: Use 1-3 columns for clustering. Too many columns reduce effectiveness and increase maintenance cost.
Step 3: Implement Materialized View
🎯 Goal: Pre-compute expensive aggregations
📝 Materialized Views:
Store pre-computed query results. Perfect for frequently-run aggregations (daily sales, monthly KPIs). Snowflake auto-refreshes when base tables change. Much faster than recomputing every time.
💻 Command: CREATE MATERIALIZED VIEW daily_sales_mv AS SELECT date_id, SUM(amount) AS total_amount FROM sales_fact GROUP BY date_id;
Step 4: Optimize Join Strategy
🎯 Goal: Reduce join execution time
📝 Join Optimization:
• Filter before joining (reduce row count)
• Join smaller tables first
• Use proper join types (INNER vs LEFT)
• Ensure join keys have statistics
• Consider broadcasting small dimension tables
💻 Optimized Query: SELECT * FROM sales_fact f JOIN (SELECT * FROM dim_date WHERE year = 2023) d ON f.date_id = d.date_id;
💡 Exam Tip: Always filter dimensions BEFORE joining to facts. Reduces join cardinality significantly.
Step 5: Configure Warehouse Auto-Scaling
🎯 Goal: Enable auto-scaling for variable workloads
📝 Auto-Scaling:
Auto-scaling adds compute clusters when query queue builds up. Set MIN and MAX clusters. Scale-out handles concurrent queries, scale-up handles complex single queries. Balance performance vs cost.
💻 Command: ALTER WAREHOUSE analytics_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4 AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
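The scale-out behaviour can be pictured with a toy policy: add a cluster for every N concurrent queries, clamped between the MIN and MAX settings above. This is an illustration only; Snowflake's actual scaling policies (Standard vs Economy) are more nuanced:

```python
import math

def target_clusters(queued, running, min_clusters=1, max_clusters=4, per_cluster=8):
    """Toy queue-driven scale-out: one cluster per `per_cluster` concurrent
    queries, clamped to the configured [min, max] range."""
    active = queued + running
    needed = math.ceil(active / per_cluster) if active else min_clusters
    return max(min_clusters, min(max_clusters, needed))
```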
✅ Success: Target at least 50% reduction in query execution time and 70% reduction in bytes scanned!
Snowflake Console - FastQuery Warehouse
Lab 12: Cloud Data Integration
Cloud / Expert
Scenario: Multi-Cloud Data Pipeline
GlobalData Corp operates across AWS, Azure, and GCP. You'll build a unified data integration pipeline that ingests data from S3, processes it with Azure Data Factory, stores it in BigQuery, and syncs to Snowflake. Your pipeline must handle authentication, data transformation, error handling, and monitoring across all platforms.
Learning Objectives:
Cloud Storage: Connect to S3, Azure Blob, GCS
Data Transformation: Apply cloud-native ETL
Cross-Platform Sync: Move data between clouds
Security: Implement IAM roles and encryption
📋 Step-by-Step Instructions
Step 1: Connect to AWS S3
🎯 Goal: Establish connection to S3 data source
📝 S3 Authentication:
Use IAM roles (best practice) or access keys. Configure bucket policy to allow cross-account access if needed. Use S3 Select for efficient querying of large files.
💡 Security: Never hard-code credentials! Use environment variables or secret managers (AWS Secrets Manager, Azure Key Vault).
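A minimal sketch of the environment-variable approach, using only the standard library (the variable names follow the AWS convention; in production a secret manager would populate them):

```python
import os

def load_s3_credentials():
    """Fetch AWS credentials from environment variables; fail fast if absent."""
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if not key or not secret:
        raise RuntimeError(
            "AWS credentials not set; configure an IAM role "
            "or export them via your secret manager"
        )
    return {"access_key": key, "secret_key": secret}
```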
Step 2: Setup Azure Data Factory
🎯 Goal: Configure ADF pipeline for transformation
📝 ADF Pipelines:
Azure Data Factory orchestrates data movement and transformation. Create linked services (connections), datasets (data structures), and pipelines (workflows). Use mapping data flows for complex transformations.
Step 3: Connect to BigQuery
🎯 Goal: Configure BigQuery as the target data store
📝 BigQuery Connection:
Use service account with BigQuery Data Editor role. Enable BigQuery Data Transfer Service for scheduled loads. Consider partitioning and clustering on target tables for performance.
📖 Cost Optimization: Use BigQuery streaming inserts for real-time data and batch loads for cost-effective bulk ingestion.
Step 4: Transform & Load Data
🎯 Goal: Execute ETL pipeline across clouds
📝 Data Transformation:
Apply transformations: data type conversions, column renaming, filtering, aggregations. Validate data quality (nulls, duplicates, constraints). Handle errors with retry logic and dead letter queues.
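The retry-plus-dead-letter pattern mentioned above can be sketched in a few lines; `records` and `transform` are placeholders for your actual batch and transformation logic:

```python
def run_with_retry(records, transform, max_attempts=3):
    """Apply `transform` to each record; retry transient failures, then
    route records that still fail to a dead-letter list for inspection."""
    loaded, dead_letter = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                loaded.append(transform(record))
                break  # success: stop retrying this record
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letter.append({"record": record, "error": str(exc)})
    return loaded, dead_letter
```

In a real pipeline the dead-letter list would be a durable queue or table so failed records can be inspected and replayed.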
📝 Cross-Platform Sync:
Export from BigQuery to GCS, then COPY INTO Snowflake. Use Snowpipe for continuous loading. Configure external stages pointing to cloud storage. Consider change data capture (CDC) for incremental updates.
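A timestamp watermark is the simplest CDC strategy for those incremental updates; a sketch over hypothetical row dicts with an `updated_at` field:

```python
def incremental_batch(rows, last_watermark):
    """Return rows newer than the stored watermark plus the new watermark.

    Persist the returned watermark between runs so each sync only moves
    rows changed since the previous load.
    """
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark
```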