Figure 1: Data Lakehouse Architecture.
This project is about running and experimenting with a local data lakehouse.
The goal is to build a local data lakehouse using the Apache Iceberg open table format.
For an explanation of the project, see this Medium blog post.
- Apache Gravitino – for the Iceberg REST Catalog.
- Apache Spark – for the distributed compute engine.
- Trino – for the distributed query engine.
- PostgreSQL – for the Iceberg REST Catalog's backing storage.
- MinIO – for the data lake storage.
- Apache Iceberg – for the open table format.
- ipython – for a better interactive Python environment for testing Spark code.
- pandas – for working with tabular data easily (DataFrames).
- pyarrow – for faster data transfer between Spark and pandas.
- numpy – for numerical and array operations.
- pyspark – a Python API to interact with Apache Spark.
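As a rough sketch of how pyarrow fits between Spark and pandas: PySpark can use Arrow to speed up `toPandas()` when the `spark.sql.execution.arrow.pyspark.enabled` setting is turned on. The names `ARROW_CONF` and `table_to_pandas` below are hypothetical and not part of this repository, and the function assumes you already have a SparkSession.

```python
# Hypothetical helpers for illustration; they are not part of this repository.

# Spark setting that enables Arrow-based conversion between Spark and pandas.
ARROW_CONF = {"spark.sql.execution.arrow.pyspark.enabled": "true"}


def table_to_pandas(spark, table_name: str):
    """Read a Spark table and return it as a pandas DataFrame.

    `spark` is an existing SparkSession (assumed to be created elsewhere).
    """
    for key, value in ARROW_CONF.items():
        spark.conf.set(key, value)
    return spark.table(table_name).toPandas()
```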
This setup guide assumes you have Docker (with the Compose plugin) installed.
- Clone the repository
    git clone https://github.com/marcellinus-witarsah/local-data-lakehouse-iceberg.git
    cd local-data-lakehouse-iceberg
- Run all of the infrastructure
    docker compose up --detach --build
- Create schema if not exists
    docker exec -it spark-master bash
    spark-sql -f /opt/spark/apps/setup/create_schema.sql
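The contents of create_schema.sql are not shown here. As a hedged sketch, the same idempotent step could be written in PySpark as follows. The catalog and schema names are taken from the queries later in this README. The helper names `create_schema_sql` and `ensure_schema` are hypothetical, and the sketch assumes the SparkSession is already configured with the Iceberg catalog.

```python
# Hypothetical helpers for illustration; the repository's own setup uses
# create_schema.sql, whose contents are not reproduced here.

def create_schema_sql(catalog: str = "catalog_iceberg",
                      schema: str = "schema_iceberg") -> str:
    """Build a CREATE SCHEMA statement that is safe to run repeatedly."""
    return f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}"


def ensure_schema(spark, catalog: str = "catalog_iceberg",
                  schema: str = "schema_iceberg") -> None:
    """Create the schema through an existing SparkSession if it is missing."""
    spark.sql(create_schema_sql(catalog, schema))
```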
You can create your own Python pipelines inside the pipelines/ folder. A pipeline usually contains transformation operations, but this example only creates a table. Run the following command from a bash shell inside the spark-master container:
python /opt/spark/apps/pipelines/create_example_table.py
To see the table you just created, open either the Spark SQL CLI or the Trino CLI:
docker exec -it spark-master spark-sql
select * from catalog_iceberg.schema_iceberg.table_iceberg;
or
docker exec -it trino trino
select * from catalog_iceberg.schema_iceberg.table_iceberg;
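For reference, a table-creation pipeline like the one described above might look roughly like this. This is only a sketch and not the repository's create_example_table.py. The column definitions (`id`, `name`) and the helper names `create_table_sql` and `run` are assumptions, and the SparkSession is assumed to be already configured with the Iceberg catalog, as it is inside the spark-master container.

```python
# Hypothetical pipeline sketch; the real create_example_table.py may differ.

# Fully qualified table name, as used in the queries above.
TABLE_NAME = "catalog_iceberg.schema_iceberg.table_iceberg"


def create_table_sql(table_name: str = TABLE_NAME) -> str:
    """Build an idempotent CREATE TABLE statement for an Iceberg table.

    The columns (id, name) are illustrative assumptions.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        "(id BIGINT, name STRING) USING iceberg"
    )


def run(spark, table_name: str = TABLE_NAME):
    """Create the table if needed, then return a DataFrame over its rows."""
    spark.sql(create_table_sql(table_name))
    return spark.sql(f"SELECT * FROM {table_name}")
```

Inside the spark-master container, you could call it with `run(SparkSession.builder.getOrCreate()).show()` after importing `SparkSession` from `pyspark.sql`.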