🏗️ IBM Data Engineering Capstone Project

This capstone project showcases the practical application of key data engineering skills by simulating a real-world scenario in which I served as a Junior Data Engineer. I designed and implemented a scalable data analytics platform by working across various technologies in the data engineering lifecycle.


🚀 Project Overview

This is the final course in the IBM Data Engineering Professional Certificate. It combines the earlier courses into one end-to-end build: an OLTP database, a NoSQL catalog, a data warehouse, ETL pipelines, Spark analytics, and reporting dashboards.


🧠 What I Learned

✅ Design and build data platforms using OLTP & OLAP architectures
✅ Implement data pipelines with ETL processes using Python and Apache Airflow
✅ Query structured and semi-structured data using MySQL, PostgreSQL, and MongoDB
✅ Perform big data analytics and ML predictions using Apache Spark
✅ Visualize insights via dashboards in Google Looker Studio and IBM Cognos Analytics


🧰 Skills & Tools Used

  • 🐍 Python & SQL
  • 🐘 PostgreSQL | 🐬 MySQL | 🍃 MongoDB
  • 🛠️ Apache Airflow
  • 🔍 Apache Spark (MLlib)
  • 📊 IBM Cognos Analytics | Google Looker Studio
  • 🗃️ OLTP & Data Warehousing
  • 🧱 ETL & Data Pipelines
  • 🐧 Linux Shell Scripting
  • 📂 JSON, CSV, .tar.gz, and data transformations

📦 Modules Breakdown

| Module | Description |
| --- | --- |
| 📁 1. Data Platform Architecture & OLTP | Designed OLTP schemas & created MySQL databases |
| 🍃 2. NoSQL with MongoDB | Queried JSON documents and used MongoDB indexes |
| 🗄️ 3. Data Warehouse | Built dimensional models & populated warehouse tables |
| 📈 4. Data Analytics & Reporting | Wrote complex SQL queries with ROLLUP, CUBE, and aggregations |
| 🔁 5. ETL & Pipelines | Built ETL flows with Python scripts and Apache Airflow DAGs |
| ⚡ 6. Big Data Analytics with Spark | Trained and deployed ML models using Spark MLlib |
| 📊 7. Final Submission | Delivered final reports, dashboards, and peer-reviewed projects |

📊 Dashboard Samples

| Tool | Preview |
| --- | --- |
| Google Looker Studio | Looker Dashboard |
| IBM Cognos Analytics | Cognos Dashboard |

📂 Project Assets


📁 OLTP Database Design
📁 NoSQL Queries & Exports
📁 Data Warehouse Scripts & CSVs
📁 Airflow DAGs & Python Scripts
📁 SparkML Model & Predictions
📁 Dashboards (Google Looker, Cognos)

📌 Key Skills Demonstrated

  • 🗃️ Relational & NoSQL Database Design (MySQL, MongoDB)
  • 🏗️ Data Warehouse Modeling and Querying (PostgreSQL, IBM Db2)
  • 🔄 ETL Pipeline Development (Python, Shell, Apache Airflow)
  • 🔥 Big Data Analytics with Apache Spark
  • 📊 Data Visualization (Google Looker Studio, IBM Cognos Analytics)
  • 🐧 Linux Shell Scripting
  • 🧪 SQL queries using ROLLUP, CUBE, GROUPING SETS, and Materialized Query Tables (MQTs)

🧪 Capstone Modules & Labs Overview

📁 Module 1: Data Platform Architecture & OLTP

  • Designed an OLTP schema and created MySQL tables.
  • Imported and exported data using SQL and shell scripts.
  • Defined primary keys and indexes for optimized access.
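
The schema steps above can be sketched as follows. This is a minimal sketch, not the project's actual schema: it uses Python's built-in `sqlite3` as a stand-in for MySQL so it runs without a server, and the table name `sales_data`, its columns, the index name, and the sample rows are all illustrative assumptions.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server used in the project (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical OLTP table: one row per sale, identified by a primary key.
conn.execute(
    """
    CREATE TABLE sales_data (
        row_id      INTEGER PRIMARY KEY,
        product_id  INTEGER NOT NULL,
        customer_id INTEGER NOT NULL,
        price       REAL    NOT NULL,
        quantity    INTEGER NOT NULL,
        sold_at     TEXT    NOT NULL
    )
    """
)

# Secondary index so that lookups by timestamp avoid a full table scan.
conn.execute("CREATE INDEX idx_sales_sold_at ON sales_data (sold_at)")

# Made-up sample rows, inserted with parameter binding.
rows = [
    (1, 101, 9001, 19.99, 2, "2024-01-05 10:15:00"),
    (2, 102, 9002, 5.50, 1, "2024-01-05 11:40:00"),
    (3, 101, 9003, 19.99, 1, "2024-01-06 09:05:00"),
]
conn.executemany("INSERT INTO sales_data VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM sales_data").fetchone()[0]
# PRAGMA index_list returns one row per index; column 1 is the index name.
index_names = [r[1] for r in conn.execute("PRAGMA index_list('sales_data')")]
```

On MySQL the same idea would use `CREATE TABLE ... PRIMARY KEY` and `CREATE INDEX`, with column types adjusted to MySQL's type system.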

🍃 Module 2: Querying Data in NoSQL (MongoDB)

  • Loaded product catalog data into MongoDB.
  • Performed filter queries and aggregation pipelines.
  • Exported collections using mongoexport.
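
The filter and aggregation steps above can be pictured with the sketch below. The `pipeline` list is written in the shape MongoDB's aggregation framework uses (`$match`, `$group`), but to stay runnable without a server it is evaluated by a small hypothetical helper, `run_pipeline`, that emulates only the few operators shown. The documents and field names (`type`, `model`, `screen_size`) are invented for illustration.

```python
from collections import defaultdict

# Invented catalog documents (assumption: the real collection has other fields).
catalog = [
    {"type": "laptop", "model": "A1", "screen_size": 14},
    {"type": "laptop", "model": "A2", "screen_size": 16},
    {"type": "smart phone", "model": "B1", "screen_size": 6},
    {"type": "smart phone", "model": "B2", "screen_size": 7},
    {"type": "television", "model": "C1", "screen_size": 55},
]

# Aggregation pipeline in MongoDB's document syntax.
pipeline = [
    {"$match": {"type": {"$in": ["laptop", "smart phone"]}}},
    {"$group": {"_id": "$type",
                "count": {"$sum": 1},
                "avg_screen": {"$avg": "$screen_size"}}},
]


def matches(doc, query):
    """Emulate equality and $in filters only."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict) and "$in" in cond:
            if value not in cond["$in"]:
                return False
        elif value != cond:
            return False
    return True


def run_pipeline(docs, stages):
    """Hypothetical in-memory evaluator for $match and $group stages."""
    for stage in stages:
        if "$match" in stage:
            docs = [d for d in docs if matches(d, stage["$match"])]
        elif "$group" in stage:
            spec = stage["$group"]
            key_field = spec["_id"].lstrip("$")
            buckets = defaultdict(list)
            for d in docs:
                buckets[d[key_field]].append(d)
            grouped = []
            for key, members in buckets.items():
                out = {"_id": key}
                for name, acc in spec.items():
                    if name == "_id":
                        continue
                    op, arg = next(iter(acc.items()))
                    if op == "$sum":  # only a constant such as {"$sum": 1}
                        out[name] = arg * len(members)
                    elif op == "$avg":
                        values = [m[arg.lstrip("$")] for m in members]
                        out[name] = sum(values) / len(values)
                    else:
                        raise ValueError(f"unsupported accumulator: {op}")
                grouped.append(out)
            docs = grouped
        else:
            raise ValueError(f"unsupported stage: {stage}")
    return docs


result = sorted(run_pipeline(catalog, pipeline), key=lambda d: d["_id"])
```

Against a live database, the same `pipeline` list could be passed to a PyMongo collection's `aggregate` method instead of the helper.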

🏗️ Module 3: Building a Data Warehouse

  • Created a star schema with dimension and fact tables in PostgreSQL.
  • Imported e-commerce sales data.
  • Performed OLAP queries with CUBE, ROLLUP, and GROUPING SETS.
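
A small star schema and a ROLLUP-style query might look like the sketch below. It is an illustration under stated assumptions: `sqlite3` stands in for PostgreSQL, the table and column names (`dim_date`, `dim_country`, `fact_sales`) and the data are invented, and because SQLite has no `ROLLUP`, the subtotal levels are spelled out with `UNION ALL`. In PostgreSQL the same result would come from `GROUP BY ROLLUP (country, year)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two dimension tables and one fact table that references them (star schema).
conn.executescript(
    """
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER NOT NULL);
    CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT NOT NULL);
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        date_id    INTEGER NOT NULL REFERENCES dim_date (date_id),
        country_id INTEGER NOT NULL REFERENCES dim_country (country_id),
        amount     REAL    NOT NULL
    );
    INSERT INTO dim_date VALUES (1, 2023), (2, 2024);
    INSERT INTO dim_country VALUES (1, 'Canada'), (2, 'India');
    INSERT INTO fact_sales VALUES
        (1, 1, 1, 100.0), (2, 2, 1, 150.0), (3, 1, 2, 200.0), (4, 2, 2, 50.0);
    """
)

# ROLLUP (country, year) produces three grouping levels:
# (country, year), (country), and the grand total.
rollup_query = """
    WITH joined AS (
        SELECT c.country, d.year, f.amount
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_id = f.date_id
        JOIN dim_country AS c ON c.country_id = f.country_id
    )
    SELECT country, year, SUM(amount) FROM joined GROUP BY country, year
    UNION ALL
    SELECT country, NULL, SUM(amount) FROM joined GROUP BY country
    UNION ALL
    SELECT NULL, NULL, SUM(amount) FROM joined
"""
rollup_rows = conn.execute(rollup_query).fetchall()
```

Rows where `year` is NULL are per-country subtotals, and the row where both columns are NULL is the grand total. `CUBE` would add the per-year subtotals as well.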

📈 Module 4: Data Analytics

  • Wrote analytical SQL queries to uncover trends in sales data.
  • Used Materialized Query Tables to improve performance.

🔁 Module 5: ETL & Data Pipelines

  • Wrote Python scripts for extract, transform, and load processes.
  • Automated the pipeline using Apache Airflow DAGs.
  • Processed and cleaned web logs into structured format.
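
The extract, transform, and load steps above can be sketched in plain Python. The log format (Apache-style access lines), the sample lines, and the function and field names are assumptions made for illustration; the project's actual log layout is not given here.

```python
import csv
import io
import re

# Invented sample input; the third line is deliberately malformed.
RAW_LOG = """\
198.51.100.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "POST /cart HTTP/1.1" 302 512
this line is malformed and should be skipped
198.51.100.7 - - [10/Oct/2023:13:57:12 +0000] "GET /missing HTTP/1.1" 404 0
"""

# One named group per field we want to keep.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)

FIELDS = ["ip", "timestamp", "method", "path", "status", "size"]


def extract(raw_text):
    """Split the raw log text into lines."""
    return raw_text.splitlines()


def transform(lines):
    """Parse each line, drop lines that do not match, cast numeric fields."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue
        record = match.groupdict()
        record["status"] = int(record["status"])
        record["size"] = int(record["size"])
        records.append(record)
    return records


def load(records, out_file):
    """Write the cleaned records as CSV to any file-like object."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO()
records = transform(extract(RAW_LOG))
load(records, buffer)
csv_text = buffer.getvalue()
```

In an Airflow DAG, each of these functions would typically become its own task (for example via `PythonOperator`), chained in extract, transform, load order so that a failure in one step can be retried independently.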

⚡ Module 6: Big Data Analytics with Apache Spark

  • Used Spark to load and transform product review data.
  • Built a machine learning model using Spark MLlib.
  • Saved and reloaded the trained model for prediction tasks.

📊 Module 7: Dashboards & Final Submission

  • Built sales dashboards using:
    • Google Looker Studio: Interactive charts, filters, KPIs.
    • IBM Cognos Analytics: Custom visualizations and report generation.
  • Submitted final project artifacts for peer review.

🧠 Summary

This project helped solidify my knowledge of:

  • Building data infrastructure from the ground up
  • Managing both structured and semi-structured data
  • Automating and scaling data workflows
  • Communicating data insights through visual tools

🏁 Outcome

  • Proficiency in end-to-end data engineering workflows
  • Readiness for real-world junior-level data engineering roles


🧠 Reflections

This project was the culmination of weeks of learning and hands-on practice. I strengthened my data engineering foundations and became confident in building real-world data solutions end-to-end. 🧩💡


💼 Ideal For

  • Hiring managers evaluating full-stack data engineers
  • Recruiters seeking professionals skilled in data architecture, pipelines, and analytics
  • Anyone interested in practical data engineering workflows

🔗 Looker Dashboards


🏁 Let's Connect!

If you're interested in my other data projects or collaborations:
🌐 My Portfolio | 💼 LinkedIn | 📂 GitHub Projects