Master advanced data engineering concepts including data modeling, real-time streaming, and API development. Build scalable, production-grade data systems.
Data Architecture & APIs - Module 3
Advanced-level labs covering data modeling, streaming analytics, and API development for data services.
Lab 7: Data Modeling & Schema Design
Modeling / Advanced
Scenario: E-Commerce Data Warehouse Design
ShopHub Inc. is building a new data warehouse for their e-commerce platform. You'll design the dimensional model, define fact and dimension tables, establish relationships, and implement slowly changing dimensions (SCD). Your design must support business analytics for sales, inventory, and customer behavior.
Learning Objectives:
Dimensional Modeling: Design star and snowflake schemas
Fact Tables: Model measurable business events
Dimension Tables: Create descriptive context for analysis
Relationships: Define primary/foreign keys and cardinality
📋 Step-by-Step Instructions
Step 1: Create Fact Table
🎯 Goal: Design the central fact table for sales transactions
📝 What is a Fact Table?
Fact tables store quantitative business measurements (metrics). They contain foreign keys to dimensions and measure columns like quantity, amount, cost. Fact tables are typically the largest tables in a warehouse.
💻 Configuration:
1. Table name: fact_sales
2. Type: Fact
3. Grain: One row per order line item
4. Click "Create Table"
💡 Pro Tip: Always define grain first - what does one row represent? This prevents design mistakes.
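As a sketch, a fact table at this grain could be declared like this in SQLite (the exact column names beyond the grain and the three measures are illustrative assumptions, not the lab's required schema):

```python
import sqlite3

# In-memory database just for sketching the design
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (
    sales_key     INTEGER PRIMARY KEY,   -- surrogate key
    date_key      INTEGER NOT NULL,      -- FK -> dim_date
    customer_key  INTEGER NOT NULL,      -- FK -> dim_customer
    product_key   INTEGER NOT NULL,      -- FK -> dim_product
    quantity      INTEGER NOT NULL,      -- measures at order-line grain
    amount        REAL    NOT NULL,
    cost          REAL    NOT NULL
);
""")
cols = [row[1] for row in conn.execute("PRAGMA table_info(fact_sales)")]
print(cols)
```

Note that every non-measure column is a foreign key to a dimension — that is what makes it a fact table rather than an operational table.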
Step 2: Add Dimension - Date
🎯 Goal: Create date dimension for time-based analysis
📝 Why Date Dimensions?
Date dimensions enable time intelligence - analyzing by year, quarter, month, week, day. They also store attributes like holiday flags, fiscal periods, and business day indicators.
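A date dimension is usually generated rather than loaded. A minimal sketch in plain Python, assuming a `YYYYMMDD` smart key and a few illustrative attributes:

```python
from datetime import date, timedelta

def build_dim_date(start, end):
    """Generate one row per calendar day with common date attributes."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # smart key, e.g. 20240101
            "year": d.year,
            "quarter": (d.month - 1) // 3 + 1,
            "month": d.month,
            "day": d.day,
            "is_weekend": d.weekday() >= 5,  # Saturday=5, Sunday=6
        })
        d += timedelta(days=1)
    return rows

dim_date = build_dim_date(date(2024, 1, 1), date(2024, 1, 7))
print(len(dim_date), dim_date[0]["date_key"])  # 7 20240101
```

Holiday flags and fiscal periods would be added the same way, typically from a lookup maintained by the business.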
Step 3: Add Dimension - Customer
🎯 Goal: Track customer attribute changes over time with SCD Type 2
📝 SCD Type 2 Explained:
Type 2 tracks historical changes. When a customer's address changes, insert a new row with effective dates. This preserves history - you can report on where the customer lived at the time of purchase.
📖 SCD Types: Type 1 overwrites (no history), Type 2 inserts new row (full history), Type 3 adds column (limited history).
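The Type 2 expire-and-insert logic can be sketched in a few lines of Python (the `customer_id`/`city` columns and the in-memory row list are illustrative assumptions, not the lab's actual schema):

```python
from datetime import date

def scd2_upsert(dim_rows, natural_key, new_attrs, today):
    """Close the current row for natural_key (if changed) and insert a new one."""
    current = next((r for r in dim_rows
                    if r["customer_id"] == natural_key and r["end_date"] is None), None)
    if current and all(current[k] == v for k, v in new_attrs.items()):
        return dim_rows  # nothing changed, keep the current row open
    if current:
        current["end_date"] = today          # expire the old version
    dim_rows.append({"customer_id": natural_key, **new_attrs,
                     "start_date": today, "end_date": None})
    return dim_rows

rows = [{"customer_id": 42, "city": "Austin",
         "start_date": date(2023, 1, 1), "end_date": None}]
scd2_upsert(rows, 42, {"city": "Denver"}, date(2024, 6, 1))
print(len(rows), rows[0]["end_date"], rows[1]["city"])
```

The open row (`end_date IS NULL`) is always the current version; historical queries filter on the effective-date range instead.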
Step 4: Add Dimension - Product
🎯 Goal: Create product dimension with hierarchy
📝 Hierarchies in Dimensions:
Product dimensions often have hierarchies: Category > Subcategory > Product. This enables drill-down analysis from high-level to detail.
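Drill-down along such a hierarchy is just aggregation at a chosen level. A toy sketch with made-up product rows:

```python
from collections import defaultdict

# Illustrative rows with a Category > Subcategory > Product hierarchy
products = [
    {"product": "USB-C Cable", "subcategory": "Cables",   "category": "Electronics", "sales": 120},
    {"product": "HDMI Cable",  "subcategory": "Cables",   "category": "Electronics", "sales": 80},
    {"product": "Desk Lamp",   "subcategory": "Lighting", "category": "Home",        "sales": 50},
]

def rollup(rows, level):
    """Aggregate sales at the requested hierarchy level (drill up/down)."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[level]] += r["sales"]
    return dict(totals)

print(rollup(products, "category"))     # {'Electronics': 200, 'Home': 50}
print(rollup(products, "subcategory"))  # {'Cables': 200, 'Lighting': 50}
```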
Step 5: Define Relationships
🎯 Goal: Connect fact table to dimensions via foreign keys
📝 Star Schema Pattern:
The fact table sits in the center with dimension tables around it. Each dimension connects to the fact table via a foreign key, creating the "star" shape - a layout optimized for query performance.
💡 Performance Tip: Use surrogate keys (integers) instead of natural keys for better join performance.
Step 6: Validate Schema
🎯 Goal: Verify model completeness and correctness
✅ Validation Checklist:
• All fact measures have appropriate data types
• Every fact FK has a corresponding dimension
• Date dimension spans the full business date range
• SCD columns present where needed
• Grain clearly defined
💻 Action: Click "Validate Model" to check your design
✅ Success Criteria: Schema passes validation with 1 fact table, 3+ dimensions, and correct relationships.
Lab 8: Real-time Streaming Analytics
Streaming / Advanced
Scenario: IoT Sensor Data Processing
SmartCity Ltd. collects real-time data from 10,000+ IoT sensors (traffic cameras, air quality monitors, parking meters). You'll build a streaming pipeline using Kafka for ingestion and Spark Streaming for real-time analytics. Process events, apply windowed aggregations, and detect anomalies in near real-time.
Learning Objectives:
Kafka Topics: Create and manage message streams
Streaming Joins: Enrich events with reference data
Windowing: Aggregate data over time windows
State Management: Track stateful computations
📋 Step-by-Step Instructions
Step 1: Create Kafka Topic
🎯 Goal: Set up topic for sensor data ingestion
📝 What is Kafka?
Apache Kafka is a distributed event streaming platform. Topics are categories for organizing messages. Partitions enable parallel processing. Replication provides fault tolerance.
💡 Pro Tip: Set the partition count to at least the number of consumer instances in a group - consumers beyond the partition count sit idle.
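The routing guarantee behind partitioning - same key, same partition, at most one consumer per partition - can be illustrated with a toy in-memory model (CRC32 is used here for brevity; Kafka's default partitioner actually uses murmur2):

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Deterministically map a message key to a partition (simplified)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Route a stream of keyed events; sensor IDs are made up for the demo
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for sensor_id in ["sensor-1", "sensor-2", "sensor-1", "sensor-3"]:
    partitions[partition_for(sensor_id)].append(sensor_id)

# All events for sensor-1 land in one partition, preserving per-key ordering
print(partitions[partition_for("sensor-1")].count("sensor-1"))  # 2
```

Because ordering is only guaranteed within a partition, keying by sensor ID keeps each sensor's readings in order.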
Step 2: Start Stream Consumer
🎯 Goal: Initialize Spark streaming consumer
📝 Spark Streaming Basics:
Spark Streaming processes data in micro-batches (typically 1-10 seconds) and exposes much the same API as batch Spark, making code reusable. In the legacy API, DStreams represent continuous data streams; the newer Structured Streaming API treats a stream as an unbounded DataFrame.
Step 3: Transform Stream
🎯 Goal: Filter and transform incoming sensor events
📝 Stream Transformations:
• map(): Transform each record
• filter(): Keep records matching a condition
• flatMap(): Split one record into multiple
• reduce(): Aggregate within a batch
📖 Best Practice: Apply filters early in pipeline to reduce data volume downstream.
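These transformations can be sketched in plain Python (the sensor fields are illustrative); note how the filter runs before the map, shrinking the data early:

```python
# Incoming micro-batch of sensor events (made-up sample data)
events = [
    {"sensor": "air-7", "type": "air_quality", "pm25": 12.0},
    {"sensor": "cam-3", "type": "traffic",     "pm25": None},
    {"sensor": "air-9", "type": "air_quality", "pm25": 48.5},
]

# filter(): keep only air-quality readings before doing any further work
air = (e for e in events if e["type"] == "air_quality")

# map(): transform each surviving record into a (sensor, severity) tuple
flagged = [(e["sensor"], "high" if e["pm25"] > 35 else "normal") for e in air]
print(flagged)  # [('air-7', 'normal'), ('air-9', 'high')]
```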
Step 4: Configure Time Window
🎯 Goal: Aggregate sensor readings over 5-minute windows
📝 Windowing Types:
• Tumbling: Non-overlapping fixed intervals
• Sliding: Overlapping windows
• Session: Dynamic, based on inactivity gap
• Event-time vs. processing-time semantics
💡 Exam Tip: Choose window size based on latency requirements and data volume trade-offs.
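A tumbling window is the simplest of these. A pure-Python sketch of 5-minute tumbling averages over `(timestamp, value)` readings:

```python
from collections import defaultdict

def tumbling_avg(readings, window_secs=300):
    """Average readings over non-overlapping (tumbling) windows.

    readings: list of (epoch_seconds, value) pairs; default window = 5 min.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        window_start = ts - (ts % window_secs)  # align to window boundary
        buckets[window_start].append(value)
    return {w: sum(v) / len(v) for w, v in sorted(buckets.items())}

readings = [(0, 10.0), (120, 20.0), (310, 30.0)]  # first two share window [0, 300)
print(tumbling_avg(readings))  # {0: 15.0, 300: 30.0}
```

A sliding window would assign each reading to every window that covers it instead of exactly one.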
Step 5: Join with Reference Data
🎯 Goal: Enrich sensor events with location metadata
📝 Stream-Static Join:
Join streaming data with static lookup tables (sensor locations, device types). Static data cached in memory for fast lookups. Refresh periodically if reference data changes.
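The lookup side of a stream-static join behaves like a cached dictionary (in Spark it would be a cached DataFrame). A minimal sketch with hypothetical sensor metadata:

```python
# Static reference data, cached in memory (illustrative values)
sensor_locations = {
    "air-7": {"district": "Downtown", "lat": 40.71},
    "cam-3": {"district": "Harbor",   "lat": 40.68},
}

def enrich(event, lookup):
    """Left-join one streaming event against the static lookup table."""
    meta = lookup.get(event["sensor"], {})  # missing keys enrich with nothing
    return {**event, **meta}

event = {"sensor": "air-7", "pm25": 48.5}
print(enrich(event, sensor_locations))
```

Refreshing the `sensor_locations` mapping periodically corresponds to re-reading the static table when reference data changes.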
Step 6: Write to Output Sink
🎯 Goal: Output processed results to a time-series database
📝 Output Sinks:
• Databases: TimescaleDB, InfluxDB for metrics
• Data Warehouses: Snowflake, BigQuery for analytics
• Kafka: Chain to another streaming pipeline
• File Systems: Parquet on S3/HDFS
💻 Example: df.write.format("jdbc").option("url", "jdbc:postgresql://tsdb:5432/metrics").option("dbtable", "sensor_metrics").save() (TimescaleDB is PostgreSQL-compatible, so it is reached via the JDBC connector)
✅ Checkpoint: Enable checkpointing for fault tolerance - it persists offsets and state so the stream can recover after a failure.
Lab 9: API Development for Data Services
API / Advanced
Scenario: Data Analytics REST API
DataServe Inc. needs to expose their data warehouse through a RESTful API for external partners and applications. You'll design and implement endpoints for data retrieval, implement pagination and filtering, add authentication, and write endpoint handlers that query the underlying database efficiently.
Step 1: Create GET Endpoint
🎯 Goal: Build a GET endpoint that returns sales data
📝 REST Conventions:
• GET: Retrieve resources (read-only)
• Use nouns for resources: /api/sales, not /api/getSales
• Return proper HTTP status codes
• Support filtering via query parameters
💡 Pro Tip: Always validate query parameters to prevent SQL injection and invalid requests.
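One way to sketch that validation (the region whitelist and date format here are assumptions for illustration, not the lab's actual rules):

```python
import re

ALLOWED_REGIONS = {"north", "south", "east", "west"}  # hypothetical whitelist

def validate_params(params: dict):
    """Validate query parameters before they reach the database layer.

    Returns (cleaned_params, None) on success or (None, error_message).
    """
    region = params.get("region", "")
    if region not in ALLOWED_REGIONS:
        return None, f"invalid region: {region!r}"
    for key in ("start_date", "end_date"):
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", params.get(key, "")):
            return None, f"{key} must be YYYY-MM-DD"
    return {"region": region,
            "start_date": params["start_date"],
            "end_date": params["end_date"]}, None

ok, err = validate_params({"region": "north",
                           "start_date": "2024-01-01",
                           "end_date": "2024-03-31"})
print(ok, err)
```

Rejecting anything outside the whitelist means hostile input (e.g. SQL fragments) never reaches the query at all.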
Step 2: Implement Query Logic
🎯 Goal: Write SQL query for the endpoint handler
📝 Query Requirements:
Your query must filter by date range and return sales aggregated by region; the required query below shows the core aggregation (a join to the customer dimension follows the same pattern).
💻 Required Query: SELECT region, SUM(amount) as total FROM sales WHERE date BETWEEN ? AND ? GROUP BY region
📖 Security: Always use parameterized queries (?) - never concatenate user input into SQL!
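The same query, parameterized, run against an in-memory SQLite database (the table contents are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", "2024-01-15", 100.0),
    ("north", "2024-02-10", 250.0),
    ("south", "2024-01-20", 75.0),
])

# Placeholders (?) keep user input out of the SQL text entirely
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "WHERE date BETWEEN ? AND ? GROUP BY region ORDER BY region",
    ("2024-01-01", "2024-03-31"),
).fetchall()
print(rows)  # [('north', 350.0), ('south', 75.0)]
```

The driver binds the two dates as values, so a malicious string in either parameter can never change the query's structure.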
Step 3: Add Pagination
🎯 Goal: Implement limit/offset pagination
📝 Why Paginate?
Without pagination, returning millions of records crashes clients and overloads servers. Pagination splits results into manageable chunks (pages).
💡 Best Practice: Include pagination metadata in response: page, page_size, total_pages, total_records.
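A small helper can derive both the SQL limit/offset and the response metadata (field names follow the tip above):

```python
import math

def paginate(total_records: int, page: int, page_size: int):
    """Compute limit/offset plus the metadata clients need to navigate."""
    total_pages = math.ceil(total_records / page_size) if total_records else 0
    return {
        "limit": page_size,                 # feed into ... LIMIT ? OFFSET ?
        "offset": (page - 1) * page_size,
        "meta": {"page": page, "page_size": page_size,
                 "total_pages": total_pages, "total_records": total_records},
    }

p = paginate(total_records=1042, page=3, page_size=100)
print(p["offset"], p["meta"]["total_pages"])  # 200 11
```

For very deep pages, keyset (cursor) pagination avoids the cost of large offsets, but limit/offset is the simplest starting point.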
Step 4: Create POST Endpoint
🎯 Goal: Build endpoint for complex analytics queries
📝 POST vs GET:
Use POST for complex queries with large payloads (filters, aggregations). GET query strings have practical length limits (commonly 2-8 KB, depending on client and server), while POST requests can carry JSON bodies.
💻 Configuration:
1. Method: POST
2. Path: /api/analytics/query
3. Accept JSON body with filters
4. Click "Create Endpoint"
Step 5: Add Authentication
🎯 Goal: Secure endpoints with API key validation
📝 API Key Auth:
The client sends its key in the X-API-Key header and the server validates it against a key store. Simple but effective for service-to-service authentication; consider JWT for user authentication.
💻 Configuration:
1. Enable authentication: API Key
2. Header name: X-API-Key
3. Return 401 for invalid keys
4. Apply to all endpoints
⚠️ Security: Always use HTTPS in production - API keys in plain HTTP can be intercepted!
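A minimal sketch of the key check, assuming a hypothetical in-memory key store; `hmac.compare_digest` makes the comparison constant-time so an attacker cannot probe the key character by character:

```python
import hmac

# Hypothetical key store; in practice keys come from a database or secrets manager
VALID_KEYS = {"partner-a": "s3cr3t-key-123"}

def authenticate(headers: dict):
    """Return (status_code, client_id) for an incoming request's headers."""
    supplied = headers.get("X-API-Key", "")
    for client_id, key in VALID_KEYS.items():
        # compare_digest avoids leaking key contents via timing differences
        if hmac.compare_digest(supplied, key):
            return 200, client_id
    return 401, None

print(authenticate({"X-API-Key": "s3cr3t-key-123"}))  # (200, 'partner-a')
print(authenticate({}))                               # (401, None)
```

Note the 401 response carries no hint about *why* the key failed, which avoids giving attackers feedback.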
Step 6: Test API
🎯 Goal: Verify endpoints return correct responses
💻 Test Cases:
1. Valid request → 200 OK + data
2. Missing API key → 401 Unauthorized
3. Invalid filters → 400 Bad Request
4. Check pagination links
🔍 Response Format:
All responses should be JSON with consistent structure: { "data": [...], "meta": {...}, "error": null }
✅ Validation: Test with the API tester to verify your implementation!