Master advanced data engineering concepts including data modeling, real-time streaming, and API development. Build scalable, production-grade data systems.
Data Architecture & APIs - Module 3
Advanced-level labs covering data modeling, streaming analytics, and API development for data services.
Lab 7: Data Modeling & Schema Design
Modeling / Advanced
Scenario: E-Commerce Data Warehouse Design
ShopHub Inc. is building a new data warehouse for their e-commerce platform. You'll design the dimensional model, define fact and dimension tables, establish relationships, and implement slowly changing dimensions (SCD). Your design must support business analytics for sales, inventory, and customer behavior.
Learning Objectives:
Dimensional Modeling: Design star and snowflake schemas
Fact Tables: Model measurable business events
Dimension Tables: Create descriptive context for analysis
Relationships: Define primary/foreign keys and cardinality
📋 Step-by-Step Instructions
Step 1: Create Fact Table
🎯 Goal: Design the central fact table for sales transactions
📝 What is a Fact Table?
Fact tables store quantitative business measurements (metrics). They contain foreign keys to dimensions and measure columns like quantity, amount, cost. Fact tables are typically the largest tables in a warehouse.
💻 Configuration:
1. Table name: fact_sales
2. Type: Fact
3. Grain: One row per order line item
4. Click "Create Table"
💡 Pro Tip: Always define grain first - what does one row represent? This prevents design mistakes.
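As a sketch, a fact table at this grain could be declared like this in SQLite (the exact column names beyond the grain and the three measures are illustrative assumptions, not the lab's required schema):

```python
import sqlite3

# In-memory database just for sketching the design
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (
    sales_key     INTEGER PRIMARY KEY,   -- surrogate key
    date_key      INTEGER NOT NULL,      -- FK -> dim_date
    customer_key  INTEGER NOT NULL,      -- FK -> dim_customer
    product_key   INTEGER NOT NULL,      -- FK -> dim_product
    quantity      INTEGER NOT NULL,      -- measures at order-line grain
    amount        REAL    NOT NULL,
    cost          REAL    NOT NULL
);
""")
cols = [row[1] for row in conn.execute("PRAGMA table_info(fact_sales)")]
print(cols)
```

Note that every non-measure column is a foreign key to a dimension — that is what makes it a fact table rather than an operational table.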
Step 2: Add Dimension - Date
🎯 Goal: Create date dimension for time-based analysis
📝 Why Date Dimensions?
Date dimensions enable time intelligence - analyzing by year, quarter, month, week, day. They also store attributes like holiday flags, fiscal periods, and business day indicators.
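A date dimension is usually generated rather than loaded. A minimal sketch in plain Python, assuming a `YYYYMMDD` smart key and a few illustrative attributes:

```python
from datetime import date, timedelta

def build_dim_date(start, end):
    """Generate one row per calendar day with common date attributes."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # smart key, e.g. 20240101
            "year": d.year,
            "quarter": (d.month - 1) // 3 + 1,
            "month": d.month,
            "day": d.day,
            "is_weekend": d.weekday() >= 5,  # Saturday=5, Sunday=6
        })
        d += timedelta(days=1)
    return rows

dim_date = build_dim_date(date(2024, 1, 1), date(2024, 1, 7))
print(len(dim_date), dim_date[0]["date_key"])  # 7 20240101
```

Holiday flags and fiscal periods would be added the same way, typically from a lookup maintained by the business.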
Step 3: Add Dimension - Customer
🎯 Goal: Track customer attribute changes over time with SCD Type 2
📝 SCD Type 2 Explained:
Type 2 tracks historical changes. When a customer's address changes, insert a new row with effective dates. This preserves history - you can report on where the customer lived at the time of purchase.
📖 SCD Types: Type 1 overwrites (no history), Type 2 inserts new row (full history), Type 3 adds column (limited history).
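The Type 2 expire-and-insert logic can be sketched in a few lines of Python (the `customer_id`/`city` columns and the in-memory row list are illustrative assumptions, not the lab's actual schema):

```python
from datetime import date

def scd2_upsert(dim_rows, natural_key, new_attrs, today):
    """Close the current row for natural_key (if changed) and insert a new one."""
    current = next((r for r in dim_rows
                    if r["customer_id"] == natural_key and r["end_date"] is None), None)
    if current and all(current[k] == v for k, v in new_attrs.items()):
        return dim_rows  # nothing changed, keep the current row open
    if current:
        current["end_date"] = today          # expire the old version
    dim_rows.append({"customer_id": natural_key, **new_attrs,
                     "start_date": today, "end_date": None})
    return dim_rows

rows = [{"customer_id": 42, "city": "Austin",
         "start_date": date(2023, 1, 1), "end_date": None}]
scd2_upsert(rows, 42, {"city": "Denver"}, date(2024, 6, 1))
print(len(rows), rows[0]["end_date"], rows[1]["city"])
```

The open row (`end_date IS NULL`) is always the current version; historical queries filter on the effective-date range instead.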
Step 4: Add Dimension - Product
🎯 Goal: Create product dimension with hierarchy
📝 Hierarchies in Dimensions:
Product dimensions often have hierarchies: Category > Subcategory > Product. This enables drill-down analysis from high-level to detail.
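Drill-down along such a hierarchy is just aggregation at a chosen level. A toy sketch with made-up product rows:

```python
from collections import defaultdict

# Illustrative rows with a Category > Subcategory > Product hierarchy
products = [
    {"product": "USB-C Cable", "subcategory": "Cables",   "category": "Electronics", "sales": 120},
    {"product": "HDMI Cable",  "subcategory": "Cables",   "category": "Electronics", "sales": 80},
    {"product": "Desk Lamp",   "subcategory": "Lighting", "category": "Home",        "sales": 50},
]

def rollup(rows, level):
    """Aggregate sales at the requested hierarchy level (drill up/down)."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[level]] += r["sales"]
    return dict(totals)

print(rollup(products, "category"))     # {'Electronics': 200, 'Home': 50}
print(rollup(products, "subcategory"))  # {'Cables': 200, 'Lighting': 50}
```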
Step 5: Define Relationships
🎯 Goal: Connect fact table to dimensions via foreign keys
📝 Star Schema Pattern:
The fact table sits in the center with dimension tables around it. Each dimension connects to the fact table via a foreign key, creating the "star" shape - a layout optimized for query performance.
💡 Performance Tip: Use surrogate keys (integers) instead of natural keys for better join performance.
Step 6: Validate Schema
🎯 Goal: Verify model completeness and correctness
✅ Validation Checklist:
• All fact measures have appropriate data types
• Every fact FK has a corresponding dimension
• Date dimension spans the full business date range
• SCD columns present where needed
• Grain clearly defined
💻 Action: Click "Validate Model" to check your design
✅ Success Criteria: Schema passes validation with 1 fact table, 3+ dimensions, and correct relationships.
Lab 8: Real-time Streaming Analytics
Streaming / Advanced
Scenario: IoT Sensor Data Processing
SmartCity Ltd. collects real-time data from 10,000+ IoT sensors (traffic cameras, air quality monitors, parking meters). You'll build a streaming pipeline using Kafka for ingestion and Spark Streaming for real-time analytics. Process events, apply windowed aggregations, and detect anomalies in near real-time.
Learning Objectives:
Kafka Topics: Create and manage message streams
Streaming Joins: Enrich events with reference data
Windowing: Aggregate data over time windows
State Management: Track stateful computations
📋 Step-by-Step Instructions
Step 1: Create Kafka Topic
🎯 Goal: Set up topic for sensor data ingestion
📝 What is Kafka?
Apache Kafka is a distributed event streaming platform. Topics are categories for organizing messages. Partitions enable parallel processing. Replication provides fault tolerance.
💡 Pro Tip: Set the partition count to at least the number of consumer instances in a group - consumers beyond the partition count sit idle.
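The routing guarantee behind partitioning - same key, same partition, at most one consumer per partition - can be illustrated with a toy in-memory model (CRC32 is used here for brevity; Kafka's default partitioner actually uses murmur2):

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Deterministically map a message key to a partition (simplified)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Route a stream of keyed events; sensor IDs are made up for the demo
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for sensor_id in ["sensor-1", "sensor-2", "sensor-1", "sensor-3"]:
    partitions[partition_for(sensor_id)].append(sensor_id)

# All events for sensor-1 land in one partition, preserving per-key ordering
print(partitions[partition_for("sensor-1")].count("sensor-1"))  # 2
```

Because ordering is only guaranteed within a partition, keying by sensor ID keeps each sensor's readings in order.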
Step 2: Start Stream Consumer
🎯 Goal: Initialize Spark streaming consumer
📝 Spark Streaming Basics:
Spark Streaming processes data in micro-batches (typically 1-10 seconds) and exposes much the same API as batch Spark, making code reusable. In the legacy API, DStreams represent continuous data streams; the newer Structured Streaming API treats a stream as an unbounded DataFrame.
Step 3: Transform Stream
🎯 Goal: Filter and transform incoming sensor events
📝 Stream Transformations:
• map(): Transform each record
• filter(): Keep records matching a condition
• flatMap(): Split one record into multiple
• reduce(): Aggregate within a batch
📖 Best Practice: Apply filters early in pipeline to reduce data volume downstream.
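These transformations can be sketched in plain Python (the sensor fields are illustrative); note how the filter runs before the map, shrinking the data early:

```python
# Incoming micro-batch of sensor events (made-up sample data)
events = [
    {"sensor": "air-7", "type": "air_quality", "pm25": 12.0},
    {"sensor": "cam-3", "type": "traffic",     "pm25": None},
    {"sensor": "air-9", "type": "air_quality", "pm25": 48.5},
]

# filter(): keep only air-quality readings before doing any further work
air = (e for e in events if e["type"] == "air_quality")

# map(): transform each surviving record into a (sensor, severity) tuple
flagged = [(e["sensor"], "high" if e["pm25"] > 35 else "normal") for e in air]
print(flagged)  # [('air-7', 'normal'), ('air-9', 'high')]
```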
Step 4: Configure Time Window
🎯 Goal: Aggregate sensor readings over 5-minute windows
📝 Windowing Types:
• Tumbling: Non-overlapping fixed intervals
• Sliding: Overlapping windows
• Session: Dynamic, based on inactivity gap
• Event-time vs. processing-time semantics
💡 Exam Tip: Choose window size based on latency requirements and data volume trade-offs.
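A tumbling window is the simplest of these. A pure-Python sketch of 5-minute tumbling averages over `(timestamp, value)` readings:

```python
from collections import defaultdict

def tumbling_avg(readings, window_secs=300):
    """Average readings over non-overlapping (tumbling) windows.

    readings: list of (epoch_seconds, value) pairs; default window = 5 min.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        window_start = ts - (ts % window_secs)  # align to window boundary
        buckets[window_start].append(value)
    return {w: sum(v) / len(v) for w, v in sorted(buckets.items())}

readings = [(0, 10.0), (120, 20.0), (310, 30.0)]  # first two share window [0, 300)
print(tumbling_avg(readings))  # {0: 15.0, 300: 30.0}
```

A sliding window would assign each reading to every window that covers it instead of exactly one.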
Step 5: Join with Reference Data
🎯 Goal: Enrich sensor events with location metadata
📝 Stream-Static Join:
Join streaming data with static lookup tables (sensor locations, device types). Static data cached in memory for fast lookups. Refresh periodically if reference data changes.
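The lookup side of a stream-static join behaves like a cached dictionary (in Spark it would be a cached DataFrame). A minimal sketch with hypothetical sensor metadata:

```python
# Static reference data, cached in memory (illustrative values)
sensor_locations = {
    "air-7": {"district": "Downtown", "lat": 40.71},
    "cam-3": {"district": "Harbor",   "lat": 40.68},
}

def enrich(event, lookup):
    """Left-join one streaming event against the static lookup table."""
    meta = lookup.get(event["sensor"], {})  # missing keys enrich with nothing
    return {**event, **meta}

event = {"sensor": "air-7", "pm25": 48.5}
print(enrich(event, sensor_locations))
```

Refreshing the `sensor_locations` mapping periodically corresponds to re-reading the static table when reference data changes.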
Step 6: Write to Output Sink
🎯 Goal: Output processed results to a time-series database
📝 Output Sinks:
• Databases: TimescaleDB, InfluxDB for metrics
• Data Warehouses: Snowflake, BigQuery for analytics
• Kafka: Chain to another streaming pipeline
• File Systems: Parquet on S3/HDFS
💻 Example: df.write.format("jdbc").option("url", "jdbc:postgresql://tsdb:5432/metrics").option("dbtable", "sensor_metrics").save() (TimescaleDB is PostgreSQL-compatible, so it is reached via the JDBC connector)
✅ Checkpoint: Enable checkpointing for fault tolerance - it persists offsets and state so the stream can recover after a failure.
Lab 9: API Development for Data Services
API / Advanced
Scenario: Data Analytics REST API
DataServe Inc. needs to expose their data warehouse through a RESTful API for external partners and applications. You'll design and implement endpoints for data retrieval, implement pagination and filtering, add authentication, and write endpoint handlers that query the underlying database efficiently.
Step 1: Create GET Endpoint
🎯 Goal: Build a GET endpoint that returns sales data
📝 REST Conventions:
• GET: Retrieve resources (read-only)
• Use nouns for resources: /api/sales, not /api/getSales
• Return proper HTTP status codes
• Support filtering via query parameters
💡 Pro Tip: Always validate query parameters to prevent SQL injection and invalid requests.
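One way to sketch that validation (the region whitelist and date format here are assumptions for illustration, not the lab's actual rules):

```python
import re

ALLOWED_REGIONS = {"north", "south", "east", "west"}  # hypothetical whitelist

def validate_params(params: dict):
    """Validate query parameters before they reach the database layer.

    Returns (cleaned_params, None) on success or (None, error_message).
    """
    region = params.get("region", "")
    if region not in ALLOWED_REGIONS:
        return None, f"invalid region: {region!r}"
    for key in ("start_date", "end_date"):
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", params.get(key, "")):
            return None, f"{key} must be YYYY-MM-DD"
    return {"region": region,
            "start_date": params["start_date"],
            "end_date": params["end_date"]}, None

ok, err = validate_params({"region": "north",
                           "start_date": "2024-01-01",
                           "end_date": "2024-03-31"})
print(ok, err)
```

Rejecting anything outside the whitelist means hostile input (e.g. SQL fragments) never reaches the query at all.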
Step 2: Implement Query Logic
🎯 Goal: Write SQL query for the endpoint handler
📝 Query Requirements:
Your query must filter by date range and return sales aggregated by region; the required query below shows the core aggregation (a join to the customer dimension follows the same pattern).
💻 Required Query: SELECT region, SUM(amount) as total FROM sales WHERE date BETWEEN ? AND ? GROUP BY region
📖 Security: Always use parameterized queries (?) - never concatenate user input into SQL!
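The same query, parameterized, run against an in-memory SQLite database (the table contents are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", "2024-01-15", 100.0),
    ("north", "2024-02-10", 250.0),
    ("south", "2024-01-20", 75.0),
])

# Placeholders (?) keep user input out of the SQL text entirely
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "WHERE date BETWEEN ? AND ? GROUP BY region ORDER BY region",
    ("2024-01-01", "2024-03-31"),
).fetchall()
print(rows)  # [('north', 350.0), ('south', 75.0)]
```

The driver binds the two dates as values, so a malicious string in either parameter can never change the query's structure.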
Step 3: Add Pagination
🎯 Goal: Implement limit/offset pagination
📝 Why Paginate?
Without pagination, returning millions of records crashes clients and overloads servers. Pagination splits results into manageable chunks (pages).
💡 Best Practice: Include pagination metadata in response: page, page_size, total_pages, total_records.
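A small helper can derive both the SQL limit/offset and the response metadata (field names follow the tip above):

```python
import math

def paginate(total_records: int, page: int, page_size: int):
    """Compute limit/offset plus the metadata clients need to navigate."""
    total_pages = math.ceil(total_records / page_size) if total_records else 0
    return {
        "limit": page_size,                 # feed into ... LIMIT ? OFFSET ?
        "offset": (page - 1) * page_size,
        "meta": {"page": page, "page_size": page_size,
                 "total_pages": total_pages, "total_records": total_records},
    }

p = paginate(total_records=1042, page=3, page_size=100)
print(p["offset"], p["meta"]["total_pages"])  # 200 11
```

For very deep pages, keyset (cursor) pagination avoids the cost of large offsets, but limit/offset is the simplest starting point.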
Step 4: Create POST Endpoint
🎯 Goal: Build endpoint for complex analytics queries
📝 POST vs GET:
Use POST for complex queries with large payloads (filters, aggregations). GET query strings have practical length limits (commonly 2-8 KB, depending on client and server), while POST requests can carry JSON bodies.
💻 Configuration:
1. Method: POST
2. Path: /api/analytics/query
3. Accept JSON body with filters
4. Click "Create Endpoint"
Step 5: Add Authentication
🎯 Goal: Secure endpoints with API key validation
📝 API Key Auth:
The client sends its key in the X-API-Key header and the server validates it against a key store. Simple but effective for service-to-service authentication; consider JWT for user authentication.
💻 Configuration:
1. Enable authentication: API Key
2. Header name: X-API-Key
3. Return 401 for invalid keys
4. Apply to all endpoints
⚠️ Security: Always use HTTPS in production - API keys in plain HTTP can be intercepted!
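A minimal sketch of the key check, assuming a hypothetical in-memory key store; `hmac.compare_digest` makes the comparison constant-time so an attacker cannot probe the key character by character:

```python
import hmac

# Hypothetical key store; in practice keys come from a database or secrets manager
VALID_KEYS = {"partner-a": "s3cr3t-key-123"}

def authenticate(headers: dict):
    """Return (status_code, client_id) for an incoming request's headers."""
    supplied = headers.get("X-API-Key", "")
    for client_id, key in VALID_KEYS.items():
        # compare_digest avoids leaking key contents via timing differences
        if hmac.compare_digest(supplied, key):
            return 200, client_id
    return 401, None

print(authenticate({"X-API-Key": "s3cr3t-key-123"}))  # (200, 'partner-a')
print(authenticate({}))                               # (401, None)
```

Note the 401 response carries no hint about *why* the key failed, which avoids giving attackers feedback.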
Step 6: Test API
🎯 Goal: Verify endpoints return correct responses
💻 Test Cases:
1. Valid request → 200 OK + data
2. Missing API key → 401 Unauthorized
3. Invalid filters → 400 Bad Request
4. Check pagination links
🔍 Response Format:
All responses should be JSON with consistent structure: { "data": [...], "meta": {...}, "error": null }
✅ Validation: Test with the API tester to verify your implementation!