Local Data Lakehouse Iceberg

Figure 1: Data Lakehouse Architecture.

Project Summary

This project is about running and experimenting with a local data lakehouse.

Project Scope

The goal is to create a local data lakehouse using the Apache Iceberg open table format.

Project Documentation Link

You can read an explanation of the project in this Medium blog.

Technologies and Libraries

Technologies

Apache Gravitino – for the Iceberg REST Catalog.
Apache Spark – for the distributed compute engine.
Trino – for the distributed query engine.
PostgreSQL – for the Iceberg REST Catalog's backing storage.
MinIO – for the data lake object storage.
Apache Iceberg – for the open table format.
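
To illustrate how these components connect, the sketch below assembles the Spark properties that typically point a session at an Iceberg REST catalog backed by S3-compatible storage. The property keys are standard Iceberg Spark settings and the catalog name comes from the queries later in this README. The URI, warehouse, and endpoint values are placeholders assumed for illustration and may differ from this repository's actual configuration.

```python
def iceberg_rest_catalog_conf(
    catalog: str = "catalog_iceberg",
    rest_uri: str = "http://gravitino:9001/iceberg/",  # assumed host, port, and path
    warehouse: str = "s3://warehouse/",                # assumed bucket name
    s3_endpoint: str = "http://minio:9000",            # assumed MinIO address
) -> dict[str, str]:
    """Return Spark properties for an Iceberg REST catalog on S3-compatible storage."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg-specific SQL syntax in Spark.
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        # Registers the catalog and tells Iceberg to talk to a REST endpoint.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
        f"{prefix}.warehouse": warehouse,
        # Reads and writes data files through an S3-compatible API.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


if __name__ == "__main__":
    # Print the properties in the form accepted by spark-sql and spark-submit.
    for key, value in iceberg_rest_catalog_conf().items():
        print(f"--conf {key}={value}")
```

In this arrangement Spark and Trino never read catalog metadata from PostgreSQL directly. They ask the REST catalog where a table lives and then fetch the data files from object storage.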

Python Libraries for Apache Spark

ipython – for a better interactive Python environment when testing Spark code.
pandas – for working with tabular data easily (DataFrames).
pyarrow – for faster data transfer between Spark and pandas.
numpy – for numerical and array operations.
pyspark – the Python API for interacting with Apache Spark.

Installation and Setup

This setup guide assumes you have Docker installed, including the Docker Compose plugin.

Clone the repository

git clone https://github.com/marcellinus-witarsah/local-data-lakehouse-iceberg.git
cd local-data-lakehouse-iceberg

Run all of the infrastructure
```
docker compose up --detach --build
```

Create the schema if it does not exist

docker exec -it spark-master bash
spark-sql -f /opt/spark/apps/setup/create_schema.sql
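
The contents of create_schema.sql are not shown here. As a rough sketch, a script like it would most likely issue an idempotent statement such as the one rendered below. The catalog and schema names are taken from the queries later in this README; the helper function itself is hypothetical and not part of the repository.

```python
import re

# Plain SQL identifiers only, so a name can be interpolated without quoting.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_schema_sql(
    catalog: str = "catalog_iceberg",
    schema: str = "schema_iceberg",
) -> str:
    """Render a CREATE SCHEMA statement that is safe to run repeatedly."""
    for name in (catalog, schema):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain SQL identifier: {name!r}")
    return f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}"


if __name__ == "__main__":
    print(create_schema_sql() + ";")
```

Because of IF NOT EXISTS, rerunning the setup step after a restart leaves an existing schema untouched.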

Experiment

You can create your own Python pipelines inside the pipelines/ folder. A pipeline usually contains transformation operations, but the example here only creates a table. Run this command inside the bash shell of the spark-master container.

python /opt/spark/apps/pipelines/create_example_table.py
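
The source of create_example_table.py is not reproduced in this README. The following is a minimal sketch of what a table-creation pipeline of this kind could look like, assuming the Spark session inside the container is already configured for the Iceberg catalog. The column names and sample rows are invented for illustration; only the table name comes from the queries below.

```python
TABLE = "catalog_iceberg.schema_iceberg.table_iceberg"

# Hypothetical sample rows; the real pipeline may write different data.
ROWS = [(1, "alpha"), (2, "beta"), (3, "gamma")]


def create_example_table(spark, table: str = TABLE, rows=ROWS) -> int:
    """Write the rows to an Iceberg table and return the resulting row count."""
    df = spark.createDataFrame(rows, schema="id INT, name STRING")
    # DataFrameWriterV2: create the table, or replace it if it already exists.
    df.writeTo(table).using("iceberg").createOrReplace()
    return spark.table(table).count()


if __name__ == "__main__":
    # Imported here so the function above can be exercised without pyspark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("create_example_table").getOrCreate()
    print(f"{TABLE} now holds {create_example_table(spark)} rows")
    spark.stop()
```

Using createOrReplace keeps the script rerunnable, at the cost of overwriting whatever the table held before.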

To see the table that you just created, open either the Spark SQL CLI or the Trino CLI.

docker exec -it spark-master spark-sql
select * from catalog_iceberg.schema_iceberg.table_iceberg;

or

docker exec -it trino trino
select * from catalog_iceberg.schema_iceberg.table_iceberg;
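
If you prefer to check the table from a script instead of an interactive session, the sketch below builds the equivalent non-interactive commands. It assumes the container names shown above and relies on the `-e` option of spark-sql and the `--execute` option of the Trino CLI; the helper function is hypothetical.

```python
import shlex
import subprocess

QUERY = "select * from catalog_iceberg.schema_iceberg.table_iceberg"


def query_command(engine: str, query: str = QUERY) -> list[str]:
    """Build a docker exec command that runs one query and exits."""
    if engine == "spark":
        return ["docker", "exec", "spark-master", "spark-sql", "-e", query]
    if engine == "trino":
        return ["docker", "exec", "trino", "trino", "--execute", query]
    raise ValueError(f"unknown engine: {engine!r}")


if __name__ == "__main__":
    for engine in ("spark", "trino"):
        command = query_command(engine)
        print("$", shlex.join(command))
        # Requires the containers from docker-compose.yml to be running.
        subprocess.run(command, check=True)
```

Seeing the same rows from both engines is a quick check that Spark and Trino resolve the table through the same catalog.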
