Data Analytics Expert Labs

Master expert-level data engineering with machine learning pipelines, warehouse optimization, and cloud data integration. Build enterprise-grade solutions.

ML Pipelines & Cloud - Module 4

Expert-level labs covering machine learning pipelines, data warehouse optimization, and cloud integration.

Lab 10: Machine Learning Data Pipeline
ML / Expert
Scenario: Predictive Analytics Pipeline
DataML Corp needs an end-to-end ML pipeline for customer churn prediction. You'll build a data pipeline that ingests raw customer data, performs feature engineering, trains multiple models, evaluates performance, and deploys the best model. Your pipeline must handle data preprocessing, feature scaling, model selection, and hyperparameter tuning.

Learning Objectives:

  • Data Preprocessing: Handle missing values, outliers, and encoding
  • Feature Engineering: Create and select meaningful features
  • Model Training: Train and compare multiple ML algorithms
  • Model Evaluation: Calculate metrics and validate performance

📋 Step-by-Step Instructions

  1. Step 1: Data Ingestion
    🎯 Goal: Load and validate customer dataset

    📝 Data Quality Checks:
     Before training, validate data quality: check for nulls, duplicates, incorrect data types, and out-of-range values. Bad data produces bad models. Use ydata-profiling (formerly pandas-profiling) or custom validation functions.

    💻 Configuration:
    1. Source: customer_data.csv
    2. Rows: 50,000 records
    3. Features: 15 columns
    4. Click "Load Data"
    💡 Best Practice: Always explore data first - check distributions, correlations, missing patterns before any transformations.
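The quality checks above can be sketched with pandas; the column names and values below are hypothetical stand-ins, not the actual customer_data.csv schema.

```python
import pandas as pd

# Hypothetical mini-dataset illustrating the Step 1 checks:
# a missing tenure value, a duplicate customer_id, a negative revenue.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "tenure_months": [12, None, 5, 30],
    "monthly_revenue": [49.9, 79.9, 79.9, -5.0],
})

report = {
    "nulls": df.isna().sum().to_dict(),                    # missing values per column
    "duplicate_ids": int(df["customer_id"].duplicated().sum()),
    "bad_revenue": int((df["monthly_revenue"] < 0).sum()),  # value-range check
}
print(report)
```

Running validation like this before any transformation surfaces problems while they are still cheap to fix.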
  2. Step 2: Data Preprocessing
    🎯 Goal: Clean and prepare data for modeling

    📝 Preprocessing Steps:
    • Handle missing values (impute or drop)
    • Remove outliers (IQR method or Z-score)
    • Encode categorical variables (one-hot or label encoding)
    • Scale numerical features (StandardScaler or MinMaxScaler)


    💻 Required Actions:
    1. Strategy: impute_mean
    2. Encoding: one_hot
    3. Scaling: standard_scaler
    4. Click "Preprocess"
    📖 Note: StandardScaler works best for algorithms sensitive to feature scales (SVM, KNN, Neural Networks).
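The three choices above (impute_mean, one_hot, standard_scaler) can be sketched in plain pandas; in a real pipeline, scikit-learn's SimpleImputer, OneHotEncoder, and StandardScaler inside a Pipeline do the same jobs. Column names are assumed for illustration.

```python
import pandas as pd

# Hypothetical columns standing in for the churn dataset.
df = pd.DataFrame({
    "tenure_months": [12.0, None, 24.0, 36.0],
    "plan": ["basic", "pro", "basic", "pro"],
})

# 1. Impute missing numeric values with the column mean
df["tenure_months"] = df["tenure_months"].fillna(df["tenure_months"].mean())

# 2. One-hot encode categorical variables
df = pd.get_dummies(df, columns=["plan"])

# 3. Standard-scale numeric features: (x - mean) / std
col = df["tenure_months"]
df["tenure_months"] = (col - col.mean()) / col.std()
print(df)
```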
  3. Step 3: Feature Engineering
    🎯 Goal: Create derived features to improve model performance

    📝 Feature Types:
     Create interaction features (products or ratios of existing features), add polynomial features, bin continuous variables, and compute aggregations (customer lifetime value; recency/frequency/monetary).

    💻 Configuration:
    1. Add feature: tenure_revenue_ratio
    2. Add feature: usage_frequency_bucket
    3. Click "Generate Features"
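The two features above can be sketched in pandas; the source columns (tenure_months, monthly_revenue, usage_frequency) and bucket edges are assumptions, not the actual dataset schema.

```python
import pandas as pd

# Hypothetical source columns for the two derived features.
df = pd.DataFrame({
    "tenure_months": [6, 24, 48],
    "monthly_revenue": [30.0, 60.0, 120.0],
    "usage_frequency": [2, 15, 40],
})

# Ratio/interaction feature
df["tenure_revenue_ratio"] = df["tenure_months"] / df["monthly_revenue"]

# Bin a continuous variable into labeled buckets
df["usage_frequency_bucket"] = pd.cut(
    df["usage_frequency"],
    bins=[0, 10, 30, float("inf")],
    labels=["low", "medium", "high"],
)
print(df)
```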
  4. Step 4: Train ML Models
    🎯 Goal: Train multiple classification models

    📝 Model Selection:
     Try multiple algorithms: Logistic Regression (baseline), Random Forest (ensemble), XGBoost (gradient boosting), SVM (support vector machines). Compare performance to select the best.

    💻 Required Models:
    1. Logistic Regression
    2. Random Forest
    3. XGBoost
    4. Click "Train Models"
    💡 Exam Tip: Always use cross-validation (k-fold) to prevent overfitting. Never evaluate on training data alone!
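A minimal sketch of cross-validated model comparison with scikit-learn, using synthetic data in place of the preprocessed churn features; XGBoost plugs into the same loop via its scikit-learn-compatible XGBClassifier and is omitted here only to keep dependencies minimal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 15-feature churn dataset.
X, y = make_classification(n_samples=500, n_features=15, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validation: never evaluate on training data alone.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in models.items()
}
print(cv_scores)
```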
  5. Step 5: Evaluate Performance
    🎯 Goal: Calculate metrics and compare models

    📝 Key Metrics:
    • Accuracy: Overall correctness
    • Precision: Positive prediction accuracy
    • Recall: True positive rate
    • F1-Score: Harmonic mean of precision/recall
    • AUC-ROC: Model discrimination ability


    💻 Action:
    Click "Evaluate Models"
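The metrics above, computed by hand from a made-up confusion matrix so the formulas are explicit; sklearn.metrics provides the same via accuracy_score, precision_score, recall_score, f1_score, and roc_auc_score.

```python
# Hypothetical confusion-matrix counts for illustration.
tp, fp, fn, tn = 80, 20, 10, 90  # true/false positives and negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)                  # of predicted positives, how many were right
recall = tp / (tp + fn)                     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```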
  6. Step 6: Deploy Best Model
    🎯 Goal: Select and deploy top-performing model

    💻 Deployment:
    1. Review model comparison
    2. Select best model by F1-score
    3. Click "Deploy Model"
    4. Generate predictions on test set
    Success: Model deployed! Monitor performance metrics in production and retrain when accuracy degrades.
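Selecting the winner by F1 can be sketched as below; the scores are made-up stand-ins for the Step 5 comparison, and pickling stands in for a real model registry or serving endpoint.

```python
import pickle

# Hypothetical F1 scores from the model comparison.
f1_scores = {
    "logistic_regression": 0.71,
    "random_forest": 0.78,
    "xgboost": 0.81,
}

best_name = max(f1_scores, key=f1_scores.get)  # pick the top model by F1

# "Deploy" by serializing the chosen artifact; real deployments push to
# a model registry or serving endpoint instead.
artifact = pickle.dumps({"model_name": best_name, "f1": f1_scores[best_name]})
restored = pickle.loads(artifact)
print(restored)
```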

ML Pipeline Builder - DataML Platform

Pipeline
Metrics
Pipeline Steps
Data Ingestion Pending
Preprocessing Pending
Feature Engineering Pending
Model Training Pending
Evaluation Pending
Deployment Pending
Pipeline Actions
Pipeline Info
Dataset: customer_data.csv
Target: churn (binary classification)
Rows: 50,000
Features: 15
Progress: 0/6 tasks completed
Score: 0/100
0%


Lab 11: Data Warehouse Optimization
Performance / Expert
Scenario: Snowflake Performance Tuning
FastQuery Inc. has a Snowflake data warehouse with slow query performance. You'll analyze query execution plans, create clustering keys, implement materialized views, optimize join strategies, and configure warehouse auto-scaling. Your optimizations must reduce query times by at least 50% while maintaining cost efficiency.

Learning Objectives:

  • Query Optimization: Analyze and improve execution plans
  • Clustering Keys: Improve table scan performance
  • Materialized Views: Pre-compute aggregations
  • Warehouse Sizing: Right-size compute resources

📋 Step-by-Step Instructions

  1. Step 1: Analyze Query Performance
    🎯 Goal: Identify slow queries using query history

    📝 Performance Analysis:
     Use Snowflake's QUERY_HISTORY view (in the SNOWFLAKE.ACCOUNT_USAGE schema; TOTAL_ELAPSED_TIME is reported in milliseconds) to find queries with high execution time, high bytes scanned, or full table scans. Look for exploding joins, filters on unclustered columns, and oversized result sets.

    💻 Command:
     SELECT query_id, query_text, total_elapsed_time FROM snowflake.account_usage.query_history WHERE total_elapsed_time > 60000 ORDER BY total_elapsed_time DESC LIMIT 10;
     💡 Pro Tip: Focus on queries that run frequently AND take a long time - those are the high-impact optimizations.
  2. Step 2: Create Clustering Keys
    🎯 Goal: Add clustering to improve table scan performance

    📝 Clustering Keys:
     Clustering organizes data physically by the specified columns. Best for columns used in WHERE clauses, JOINs, and ORDER BY. With Automatic Clustering enabled, Snowflake re-clusters in the background as data changes.

    💻 Command:
    ALTER TABLE sales_fact CLUSTER BY (date_id, region_id);
    📖 Best Practice: Use 1-3 columns for clustering. Too many columns reduce effectiveness and increase maintenance cost.
  3. Step 3: Implement Materialized View
    🎯 Goal: Pre-compute expensive aggregations

    📝 Materialized Views:
    Store pre-computed query results. Perfect for frequently-run aggregations (daily sales, monthly KPIs). Snowflake auto-refreshes when base tables change. Much faster than recomputing every time.

    💻 Command:
     CREATE MATERIALIZED VIEW daily_sales_mv AS SELECT date_id, SUM(amount) AS total_amount FROM sales_fact GROUP BY date_id;
  4. Step 4: Optimize Join Strategy
    🎯 Goal: Reduce join execution time

    📝 Join Optimization:
    • Filter before joining (reduce row count)
    • Join smaller tables first
    • Use proper join types (INNER vs LEFT)
     • Match join key data types to avoid implicit casts
    • Consider broadcasting small dimension tables


    💻 Optimized Query:
    SELECT * FROM sales_fact f JOIN (SELECT * FROM dim_date WHERE year = 2023) d ON f.date_id = d.date_id;
    💡 Exam Tip: Always filter dimensions BEFORE joining to facts. Reduces join cardinality significantly.
  5. Step 5: Configure Warehouse Auto-Scaling
    🎯 Goal: Enable auto-scaling for variable workloads

    📝 Auto-Scaling:
    Auto-scaling adds compute clusters when query queue builds up. Set MIN and MAX clusters. Scale-out handles concurrent queries, scale-up handles complex single queries. Balance performance vs cost.

    💻 Command:
    ALTER WAREHOUSE analytics_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4 AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
  6. Step 6: Validate Performance Gains
    🎯 Goal: Measure optimization impact

    💻 Validation:
    1. Re-run slow queries from Step 1
    2. Compare execution time before/after
    3. Check bytes scanned reduction
    4. Verify clustering effectiveness
    Success: Target at least 50% reduction in query execution time and 70% reduction in bytes scanned!
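The 50%/70% targets reduce to simple arithmetic; the before/after numbers below are hypothetical measurements, not real query results.

```python
# Hypothetical before/after measurements from re-running the Step 1 queries.
before = {"execution_time_s": 120.0, "bytes_scanned_mb": 4000.0}
after = {"execution_time_s": 45.0, "bytes_scanned_mb": 900.0}

# Fractional reduction: 1 - (after / before)
time_reduction = 1 - after["execution_time_s"] / before["execution_time_s"]
scan_reduction = 1 - after["bytes_scanned_mb"] / before["bytes_scanned_mb"]

print(f"time reduced {time_reduction:.0%}, scan reduced {scan_reduction:.0%}")
assert time_reduction >= 0.50 and scan_reduction >= 0.70  # lab targets met
```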

Snowflake Console - FastQuery Warehouse

SQL Console
Results
Query Editor
0s
Execution Time
0 MB
Bytes Scanned
Progress: 0/6 tasks completed
Score: 0/100
0%


Lab 12: Cloud Data Integration
Cloud / Expert
Scenario: Multi-Cloud Data Pipeline
GlobalData Corp operates across AWS, Azure, and GCP. You'll build a unified data integration pipeline that ingests data from S3, processes it with Azure Data Factory, stores it in BigQuery, and syncs to Snowflake. Your pipeline must handle authentication, data transformation, error handling, and monitoring across all platforms.

Learning Objectives:

  • Cloud Storage: Connect to S3, Azure Blob, GCS
  • Data Transformation: Apply cloud-native ETL
  • Cross-Platform Sync: Move data between clouds
  • Security: Implement IAM roles and encryption

📋 Step-by-Step Instructions

  1. Step 1: Connect to AWS S3
    🎯 Goal: Establish connection to S3 data source

    📝 S3 Authentication:
    Use IAM roles (best practice) or access keys. Configure bucket policy to allow cross-account access if needed. Use S3 Select for efficient querying of large files.

    💻 Configuration:
    1. Service: AWS S3
    2. Bucket: globaldata-source
    3. Region: us-east-1
    4. Click "Connect"
    💡 Security: Never hard-code credentials! Use environment variables or secret managers (AWS Secrets Manager, Azure Key Vault).
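The no-hard-coded-credentials rule can be sketched as reading the key from the environment and failing fast when it is missing; AWS_ACCESS_KEY_ID is the standard variable the AWS SDKs look up, and the value set below is only a placeholder for the demo.

```python
import os

def get_aws_key() -> str:
    """Read the access key from the environment; never hard-code it."""
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    if key is None:
        raise RuntimeError(
            "AWS_ACCESS_KEY_ID not set; configure an IAM role or export "
            "the variable from a secrets manager"
        )
    return key

os.environ["AWS_ACCESS_KEY_ID"] = "AKIA-EXAMPLE"  # placeholder value for the demo
print(get_aws_key())
```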
  2. Step 2: Setup Azure Data Factory
    🎯 Goal: Configure ADF pipeline for transformation

    📝 ADF Pipelines:
    Azure Data Factory orchestrates data movement and transformation. Create linked services (connections), datasets (data structures), and pipelines (workflows). Use mapping data flows for complex transformations.

    💻 Configuration:
    1. Service: Azure Data Factory
    2. Pipeline: S3-to-BigQuery
    3. Click "Configure"
  3. Step 3: Connect to BigQuery
    🎯 Goal: Setup BigQuery as destination warehouse

    📝 BigQuery Connection:
    Use service account with BigQuery Data Editor role. Enable BigQuery Data Transfer Service for scheduled loads. Consider partitioning and clustering on target tables for performance.

    💻 Configuration:
    1. Service: GCP BigQuery
    2. Project: globaldata-analytics
    3. Dataset: integrated_data
    4. Click "Connect"
     📖 Cost Optimization: Use BigQuery streaming inserts for real-time data (billed per GB) and batch load jobs for bulk data (no ingestion charge).
  4. Step 4: Transform & Load Data
    🎯 Goal: Execute ETL pipeline across clouds

    📝 Data Transformation:
    Apply transformations: data type conversions, column renaming, filtering, aggregations. Validate data quality (nulls, duplicates, constraints). Handle errors with retry logic and dead letter queues.

    💻 Action:
    1. Source: S3 CSV files (10GB)
    2. Transform: Clean + Aggregate
    3. Destination: BigQuery table
    4. Click "Start Transfer"
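The retry-plus-dead-letter-queue pattern mentioned above can be sketched in plain Python; flaky_load is a hypothetical stand-in for the actual transfer call.

```python
import time

dead_letter_queue = []  # records that failed after all retries

def flaky_load(record, attempts={"n": 0}):
    """Hypothetical transfer call that fails its first two invocations.
    (The mutable default dict acts as a call counter for the demo.)"""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient network error")
    return f"loaded:{record}"

def load_with_retry(record, max_retries=3, backoff_s=0.01):
    for attempt in range(1, max_retries + 1):
        try:
            return flaky_load(record)
        except ConnectionError:
            if attempt == max_retries:
                dead_letter_queue.append(record)  # give up: park for review
                return None
            time.sleep(backoff_s * 2 ** attempt)  # exponential backoff

result = load_with_retry("row-42")
print(result)
```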
  5. Step 5: Sync to Snowflake
    🎯 Goal: Replicate BigQuery data to Snowflake

    📝 Cross-Platform Sync:
    Export from BigQuery to GCS, then COPY INTO Snowflake. Use Snowpipe for continuous loading. Configure external stages pointing to cloud storage. Consider change data capture (CDC) for incremental updates.

    💻 Configuration:
    1. Source: BigQuery
    2. Destination: Snowflake
    3. Mode: Incremental
    4. Click "Sync"
     💡 Performance: Use a compressed columnar file format that both platforms support (e.g., Parquet) for faster transfers.
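Watermark-based incremental sync, the simplest alternative to full CDC, can be sketched as below; the rows and timestamps are made up for illustration.

```python
# Hypothetical source rows with ISO-format change timestamps
# (ISO strings compare correctly in lexicographic order).
source_rows = [
    {"id": 1, "updated_at": "2023-01-01"},
    {"id": 2, "updated_at": "2023-02-01"},
    {"id": 3, "updated_at": "2023-03-01"},
]

def incremental_sync(rows, watermark):
    """Return rows changed after the watermark, plus the new watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

batch, wm = incremental_sync(source_rows, watermark="2023-01-15")
print([r["id"] for r in batch], wm)
```

Persisting the returned watermark between runs is what makes each sync pick up only the delta.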
  6. Step 6: Monitor & Validate
    🎯 Goal: Verify pipeline success and monitor ongoing transfers

    💻 Monitoring:
    1. Check transfer logs for errors
    2. Verify row counts match source
    3. Validate data quality rules
    4. Setup CloudWatch/Azure Monitor alerts
    Success: Data successfully integrated across AWS, Azure, GCP, and Snowflake! Set up scheduled runs for production.
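The row-count and quality checks above can be sketched as plain assertions; the counts and rules below are hypothetical.

```python
# Hypothetical post-transfer validation inputs.
source_count = 100_000
target_count = 100_000
sample = [{"id": 1, "amount": 12.5}, {"id": 2, "amount": 0.0}]

# Each check maps a rule name to a pass/fail boolean.
checks = {
    "row_counts_match": source_count == target_count,
    "no_null_ids": all(r["id"] is not None for r in sample),
    "amounts_non_negative": all(r["amount"] >= 0 for r in sample),
}
failed = [name for name, ok in checks.items() if not ok]
print("all checks passed" if not failed else f"failed: {failed}")
```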

Cloud Integration Hub - GlobalData Platform

Connections
Transfer
Results
Cloud Services
AWS S3
Source: globaldata-source bucket
Disconnected
Azure Data Factory
Transform: S3-to-BigQuery pipeline
Disconnected
GCP BigQuery
Destination: integrated_data dataset
Disconnected
Snowflake
Sync target: analytics warehouse
Disconnected
Connection Actions
Progress: 0/6 tasks completed
Score: 0/100
0%
