Data Analytics Foundation Labs

Master data analytics with hands-on labs covering SQL, Python, ETL pipelines, and data visualization. Build real-world data solutions through interactive scenarios.

Data Foundation Essentials - Module 1

Start your data analytics journey with fundamental concepts and hands-on implementation across major data platforms.

Lab 1: SQL Database Fundamentals
SQL / Beginner
Scenario: E-Commerce Database Analysis
RetailMax Inc. needs to analyze its e-commerce database to understand sales patterns, customer behavior, and inventory management. As a data analyst, you'll write SQL queries to extract insights from a production database containing customers, products, orders, order_items, and categories tables.

Learning Objectives:

  • SELECT Queries: Master data retrieval with filtering and sorting
  • JOIN Operations: Combine data from multiple tables effectively
  • Aggregations: Use GROUP BY, COUNT, SUM, AVG for analytics
  • Subqueries: Write complex nested queries for advanced analysis

📋 Step-by-Step Instructions

  1. Step 1: Explore the Database Schema
    🎯 Goal: Understand the database structure before querying

    📝 Why Schema Exploration?
    Before writing queries, you must understand what tables exist, their columns, and relationships. This prevents errors and helps you write efficient queries.

    💻 Command:
    SHOW TABLES;

    🔍 Expected Output:
    • customers
    • products
    • orders
    • order_items
    • categories
    💡 Pro Tip: Use DESCRIBE table_name; to see column details for any table.
  2. Step 2: Basic SELECT with Filtering
    🎯 Goal: Retrieve customer data with specific conditions

    💻 Query:
    SELECT customer_id, first_name, last_name, email
    FROM customers
    WHERE created_at >= '2024-01-01'
    ORDER BY last_name;

    🔍 What this does:
    • Selects specific columns (not SELECT *)
    • Filters customers created in 2024
    • Sorts alphabetically by last name
    📖 Best Practice: Always specify column names instead of SELECT * for better performance and clarity.
  3. Step 3: JOIN Tables for Order Analysis
    🎯 Goal: Combine customer and order data to analyze purchasing behavior

    💻 Query:
    SELECT c.first_name, c.last_name, o.order_id, o.total_amount, o.order_date
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.total_amount > 100
    ORDER BY o.total_amount DESC;

    🔍 JOIN Types Explained:
    • INNER JOIN: Only matching records
    • LEFT JOIN: All from left + matches
    • RIGHT JOIN: All from right + matches
    💡 Exam Tip: Always use table aliases (c, o) for cleaner code. JOIN conditions go in the ON clause, filters in WHERE.
  4. Step 4: Aggregations with GROUP BY
    🎯 Goal: Calculate sales metrics by category

    💻 Query:
    SELECT cat.category_name,
           COUNT(oi.order_item_id) AS items_sold,
           SUM(oi.quantity * oi.unit_price) AS total_revenue,
           AVG(oi.unit_price) AS avg_price
    FROM categories cat
    JOIN products p ON cat.category_id = p.category_id
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY cat.category_name
    HAVING total_revenue > 1000
    ORDER BY total_revenue DESC;

    🔍 Aggregate Functions:
    • COUNT(): Number of records
    • SUM(): Total of values
    • AVG(): Average of values
    • MAX/MIN(): Highest/lowest values
    🎓 Key Concept: HAVING filters after aggregation, WHERE filters before. Use HAVING with GROUP BY for filtered aggregates.
  5. Step 5: Subqueries for Advanced Analysis
    🎯 Goal: Find customers who spent above average

    💻 Query:
    SELECT c.customer_id, c.first_name, c.last_name,
           SUM(o.total_amount) AS lifetime_value
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name, c.last_name
    HAVING lifetime_value > (SELECT AVG(customer_total)
                             FROM (SELECT SUM(total_amount) AS customer_total
                                   FROM orders
                                   GROUP BY customer_id) AS totals)
    ORDER BY lifetime_value DESC
    LIMIT 10;

    🔍 Subquery Types:
    • Scalar: Returns single value (used here)
    • Row: Returns single row
    • Table: Returns result set
    Performance Note: Subqueries in HAVING can be slow. For production, consider CTEs (Common Table Expressions) instead.
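The CTE alternative mentioned above can be sketched in a few lines. This is a minimal, self-contained demo against an in-memory SQLite database; the table and sample data are illustrative stand-ins, not the RetailMax schema:

```python
import sqlite3

# Build a tiny in-memory orders table (illustrative data, not the lab database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, total_amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 50.0), (1, 200.0), (2, 30.0), (3, 400.0)],
)

# CTE version: compute each customer's lifetime value once, then filter on it,
# instead of re-aggregating inside a HAVING subquery
query = """
WITH customer_totals AS (
    SELECT customer_id, SUM(total_amount) AS lifetime_value
    FROM orders
    GROUP BY customer_id
)
SELECT customer_id, lifetime_value
FROM customer_totals
WHERE lifetime_value > (SELECT AVG(lifetime_value) FROM customer_totals)
ORDER BY lifetime_value DESC
"""
top_customers = conn.execute(query).fetchall()
print(top_customers)  # → [(3, 400.0), (1, 250.0)]
```

Because the CTE is named once and referenced twice, the intent ("compare each customer's total to the average total") is explicit, and the optimizer can reuse the aggregation.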
  6. Step 6: Validate Your Analysis
    🎯 Goal: Verify your queries return correct results

    ✅ Validation Queries:
    SELECT COUNT(*) as total_customers FROM customers;
    SELECT SUM(total_amount) as total_revenue FROM orders;

    🔍 Sanity Checks:
    • Row counts should match expectations
    • Totals should be positive and reasonable
    • JOINs should not create duplicates
    🎓 Learning Checkpoint: Can you explain the difference between WHERE, HAVING, and ON clauses? Each filters at different stages of query execution!

Lab 2: Python Data Analysis with Pandas
Python / Beginner
Scenario: Sales Data Processing
DataCorp needs to process and analyze their monthly sales data stored in CSV files. Using Python and Pandas, you'll clean the data, perform transformations, calculate metrics, and prepare reports for the executive team.

Learning Objectives:

  • Data Loading: Import CSV/Excel files into DataFrames
  • Data Cleaning: Handle missing values and duplicates
  • Transformations: Filter, group, and aggregate data
  • Analysis: Calculate statistics and generate insights

📋 Step-by-Step Instructions

  1. Step 1: Import Libraries and Load Data
    🎯 Goal: Set up your Python environment and load the sales dataset

    📝 Why Pandas?
    Pandas is the industry-standard library for data manipulation. It provides DataFrames - powerful tabular data structures that make cleaning, transforming, and analyzing data intuitive.

    💻 Commands:
    import pandas as pd
    import numpy as np
    df = pd.read_csv('sales_data.csv')

    🔍 What to expect:
    DataFrame with columns: customer_id, product_id, quantity, price, date
    💡 Pro Tip: Use pd.read_excel() for Excel files, or pd.read_json() for JSON data.
  2. Step 2: Explore Data Structure
    🎯 Goal: Understand your data before analysis

    📝 Why Explore First?
    Data exploration reveals data types, missing values, and distribution patterns. Skipping this step often leads to errors and incorrect analysis.

    💻 Commands:
    df.info()
    df.describe()
    df.isnull().sum()

    🔍 Key Metrics:
    • info(): Data types and non-null counts
    • describe(): Statistical summary (mean, std, min, max)
    • isnull(): Count of missing values per column
    📖 Best Practice: Always explore data shape with df.shape and first rows with df.head() before complex operations.
  3. Step 3: Clean Missing Values
    🎯 Goal: Handle missing data appropriately

    📝 Strategies for Missing Data:
    • fillna(value): Replace with constant or calculated value
    • dropna(): Remove rows with missing values
    • interpolate(): Fill based on surrounding values


    💻 Commands:
    df['price'] = df['price'].fillna(df['price'].median())
    df = df.dropna(subset=['customer_id'])
    ⚠️ Warning: Using mean for skewed data can be misleading. Median is more robust to outliers.
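The three strategies above can be compared side by side on a toy DataFrame (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, None, 4],
    "price": [10.0, None, 30.0, 100.0],
})

# fillna: replace the missing price with the median (robust to the 100.0 outlier)
df["price"] = df["price"].fillna(df["price"].median())

# dropna: remove rows with no customer_id, since they can't be attributed
df = df.dropna(subset=["customer_id"])

# interpolate: fill from surrounding values; best for ordered/time-series data
s = pd.Series([1.0, None, 3.0])
s_interp = s.interpolate()  # linear fill gives [1.0, 2.0, 3.0]

print(df)
```

Note the assignment style (`df["price"] = ...`) rather than `inplace=True`, which is the pattern modern pandas recommends.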
  4. Step 4: Create Calculated Columns
    🎯 Goal: Add derived columns for analysis

    📝 Feature Engineering:
    Creating new columns from existing data is called feature engineering. It's essential for meaningful analysis and machine learning.

    💻 Commands:
    df['total'] = df['quantity'] * df['price']
    df['month'] = pd.to_datetime(df['date']).dt.month

    🔍 Datetime Attributes:
    • .dt.year, .dt.month, .dt.day
    • .dt.weekday (0=Monday)
    • .dt.quarter
    💡 Pro Tip: Use pd.to_datetime() to convert string dates, then access .dt properties for extraction.
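A quick demo of the `.dt` accessor on a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-01-15", "2024-03-03", "2024-07-01"]})

# Convert strings to datetime, then extract parts via the .dt accessor
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.weekday   # 0 = Monday, 6 = Sunday
df["quarter"] = df["date"].dt.quarter

print(df[["month", "weekday", "quarter"]])
```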
  5. Step 5: Group and Aggregate
    🎯 Goal: Calculate summary statistics by category

    📝 Why GroupBy?
    GroupBy splits data into groups, applies a function to each group, and combines results. It's the Pandas equivalent of SQL's GROUP BY.

    💻 Command:
    monthly_sales = df.groupby('month').agg({'total': 'sum', 'quantity': 'sum', 'customer_id': 'count'})

    🔍 Aggregation Functions:
    • sum, mean, median, min, max
    • count, nunique (unique count)
    • first, last, std (standard deviation)
    Power Move: Use .agg() with a dictionary to apply different functions to different columns simultaneously.
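Here is the dictionary-based `.agg()` pattern on a small illustrative dataset, mixing `sum` and `nunique` across columns:

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "total": [100.0, 50.0, 200.0, 25.0, 75.0],
    "quantity": [2, 1, 4, 1, 3],
    "customer_id": [10, 11, 10, 12, 13],
})

# .agg with a dict applies a different function to each column in one pass
monthly = df.groupby("month").agg({
    "total": "sum",          # revenue per month
    "quantity": "sum",       # units per month
    "customer_id": "nunique" # distinct buyers per month
})

print(monthly)
```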
  6. Step 6: Export Results
    🎯 Goal: Save your analysis results

    💻 Commands:
    monthly_sales.to_csv('monthly_report.csv')
    print("Report generated!")

    🔍 Export Options:
    • to_csv(): Comma-separated values
    • to_excel(): Excel workbook
    • to_json(): JSON format
    • to_sql(): Direct to database
    🎓 Learning Checkpoint: Can you now load data, clean it, transform it, and export results? These are the core skills of any data analyst!


Lab 3: ETL Data Pipeline Design
ETL / Intermediate
Scenario: Data Warehouse Integration
TechAnalytics Inc. needs to build an ETL pipeline that extracts data from multiple sources (APIs, databases, files), transforms it for analysis, and loads it into their data warehouse. Design and implement the pipeline using industry best practices.

Learning Objectives:

  • Extract: Pull data from various source systems
  • Transform: Clean, normalize, and enrich data
  • Load: Insert data into target data warehouse
  • Orchestration: Schedule and monitor pipeline runs

📋 Step-by-Step Instructions

  1. Step 1: Define Data Sources
    🎯 Goal: Configure connections to all source systems

    📝 Why Define Sources First?
    ETL pipelines must know where data comes from before extraction. Source definitions include connection strings, authentication, and data formats. This is the foundation of your pipeline.

    💻 Configuration:
    sources

    🔍 Source Types:
    • REST APIs: JSON/XML over HTTP
    • Databases: PostgreSQL, MySQL, MongoDB
    • Files: CSV, Parquet, JSON on S3/GCS
    • Streams: Kafka, Kinesis
    💡 Pro Tip: Store credentials in environment variables or secret managers, never in code!
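One way to sketch a source registry in Python, with credentials pulled from the environment rather than hard-coded. The source names, keys, URL, and environment-variable names here are all illustrative assumptions, not part of the lab's actual configuration:

```python
import os

# Minimal source registry: one entry per upstream system.
# Secrets come from the environment (or a secret manager), never from code.
SOURCES = {
    "orders_api": {
        "type": "rest_api",
        "url": "https://api.example.com/v1/orders",
        "api_key": os.environ.get("ORDERS_API_KEY"),  # may be None locally
        "format": "json",
    },
    "crm_db": {
        "type": "postgres",
        "dsn": os.environ.get("CRM_DB_DSN"),
    },
    "sales_files": {
        "type": "s3",
        "bucket": "raw-sales-data",
        "prefix": "daily/",
        "format": "parquet",
    },
}

for name, cfg in SOURCES.items():
    print(name, cfg["type"])
```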
  2. Step 2: Create Extract Jobs
    🎯 Goal: Write extraction functions for each data source

    📝 Why Modular Extraction?
    Each source should have its own extract function. This allows independent testing, easier debugging, and parallel execution. Functions should return DataFrames for consistency.

    💻 Command:
    extract

    🔍 Extraction Patterns:
    • Full Extract: All data every run
    • Incremental: Only new/changed records
    • CDC: Change Data Capture for real-time
    📖 Best Practice: Always add retry logic with exponential backoff for API calls.
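The retry-with-exponential-backoff pattern from the best practice above can be sketched as a small wrapper. The `flaky_extract` function is a made-up stand-in that fails twice before succeeding:

```python
import time

def with_retries(fn, max_attempts=4, base_delay=1.0):
    """Call fn(); on failure, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller handle it
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Illustrative flaky "extract" that succeeds on the third call
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return [{"order_id": 1}, {"order_id": 2}]

records = with_retries(flaky_extract, base_delay=0.01)
print(len(records))  # → 2
```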
  3. Step 3: Implement Transformations
    🎯 Goal: Clean, normalize, and enrich the extracted data

    📝 Why Transform?
    Raw data is rarely analysis-ready. Transformations standardize formats, handle missing values, merge datasets, and create derived fields. This is where data quality is enforced.

    💻 Command:
    transform

    🔍 Common Transformations:
    • Data type casting and validation
    • Deduplication and null handling
    • Joins and aggregations
    • Business rule application
    ⚠️ Warning: Never modify source data. Transformations should create new columns or DataFrames.
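A transform step covering the four patterns above might look like this sketch. The column names and business rule are illustrative; note that the function copies the input, so the source data is never modified:

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the raw frame is left untouched."""
    df = raw.copy()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # type casting
    df = df.dropna(subset=["amount"])                            # null handling
    df = df.drop_duplicates(subset=["order_id"])                 # deduplication
    df["is_large"] = df["amount"] > 100                          # business rule
    return df

raw = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "amount": ["50", "50", "not-a-number", "250"],
})
clean = transform(raw)
print(clean)
```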
  4. Step 4: Configure Load Process
    🎯 Goal: Write transformed data to the target warehouse

    📝 Load Strategies:
    • Append: Add new rows (incremental loads)
    • Replace: Drop and recreate table
    • Upsert: Insert or update based on key
    • Merge: Complex SCD handling


    💻 Command:
    load

    🔍 Performance Tips:
    • Use chunked inserts for large datasets
    • Disable indexes during bulk loads
    • Use COPY instead of INSERT when possible
    Key Insight: chunksize=10000 balances memory usage with database round-trips.
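A chunked load can be demonstrated end to end with pandas and an in-memory SQLite database (the table name is illustrative); `chunksize=10000` makes pandas issue the inserts in 10,000-row batches:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"order_id": range(1, 25001), "amount": 9.99})

# Chunked append: 25,000 rows go in as three batches instead of one giant
# insert, balancing memory usage against database round-trips
df.to_sql("fact_orders", conn, if_exists="append", index=False, chunksize=10000)

n = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
print(n)  # → 25000
```

Against a real warehouse you would pass a SQLAlchemy engine instead of a raw sqlite3 connection, but the `chunksize` mechanics are the same.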
  5. Step 5: Add Error Handling
    🎯 Goal: Make your pipeline resilient to failures

    📝 Why Error Handling?
    Production pipelines WILL fail. Network issues, schema changes, data quality problems - all must be handled gracefully. Good error handling means quick recovery and minimal data loss.

    💻 Command:
    error

    🔍 Error Handling Patterns:
    • Try/except with specific exceptions
    • Logging with context (timestamps, record IDs)
    • Alerting via Slack, PagerDuty, email
    • Dead letter queues for failed records
    💡 Pro Tip: Log failed records to a separate table for later reprocessing instead of losing them.
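The try/except-plus-dead-letter pattern can be sketched as below. The record shape and the in-memory `failed_records` list (standing in for a real "failed records" table) are illustrative:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

failed_records = []  # stand-in for a failed-records table in the warehouse

def load_record(record):
    if record.get("amount") is None:
        raise ValueError("missing amount")
    return True

records = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}, {"id": 3, "amount": 5}]
loaded = 0
for rec in records:
    try:
        load_record(rec)
        loaded += 1
    except ValueError as exc:
        # Log with context, then park the record for reprocessing, don't lose it
        log.error("load failed for record id=%s: %s", rec["id"], exc)
        failed_records.append(rec)

print(loaded, len(failed_records))  # → 2 1
```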
  6. Step 6: Schedule Pipeline
    🎯 Goal: Automate pipeline execution on a schedule

    📝 Scheduling Options:
    • Cron: Simple time-based scheduling
    • Airflow: DAG-based orchestration with dependencies
    • Prefect/Dagster: Modern Python-native orchestrators
    • Cloud: AWS Step Functions, GCP Cloud Composer


    💻 Command:
    schedule

    🔍 Cron Syntax:
    • "0 2 * * *" = Daily at 2 AM
    • "0 */6 * * *" = Every 6 hours
    • "0 0 * * 0" = Weekly on Sunday
    🎓 Learning Checkpoint: You now understand the complete ETL lifecycle - Extract, Transform, Load, plus error handling and scheduling!
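To make the cron semantics concrete, here is a small sketch that computes when a "0 2 * * *" (daily at 2 AM) schedule would next fire; the helper name is my own, not part of any scheduler's API:

```python
from datetime import datetime, timedelta

def next_daily_run(now: datetime, hour: int = 2) -> datetime:
    """Next time a '0 2 * * *'-style daily schedule fires after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:       # today's slot already passed (or is right now)
        candidate += timedelta(days=1)
    return candidate

now = datetime(2024, 6, 1, 14, 30)
print(next_daily_run(now))  # → 2024-06-02 02:00:00
```

Real orchestrators (cron, Airflow, Prefect) do exactly this kind of next-fire-time arithmetic for you, for arbitrary cron expressions.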
