How One Team Built a Vendor-Free On-Premise Data Lakehouse Using Open-Source Tools
A development team has described how it built a fully on-premise Data Lakehouse without proprietary software or cloud dependencies, in response to budget and compliance constraints. The stack combines MinIO for object storage, Apache Iceberg as the table format, Project Nessie as the metadata catalog, and Trino as the SQL engine. It runs on bare-metal servers, with supporting services hosted in Docker. The architecture follows the three-tier Medallion model (Bronze, Silver, and Gold layers), with governance responsibilities split between IT and business-facing teams such as QA and BI. Dagster orchestrates the pipelines, dlt handles data ingestion, and dbt handles transformation. The team plans to move toward real-time ingestion in a later version by introducing Change Data Capture with Debezium and Kafka, and intends to publish follow-up posts on securing AI-generated access to the platform.
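To make the Medallion layering concrete, here is a minimal sketch in plain Python of how data might move from Bronze to Silver to Gold. It is an illustration only and not the team's implementation: the sample records, the field names, and the functions `to_bronze`, `to_silver`, and `to_gold` are all hypothetical, and the sketch uses in-memory lists rather than Iceberg tables, dlt, dbt, or Dagster.

```python
# Hypothetical raw records, standing in for what an ingestion step
# would land in the Bronze layer. None of this comes from the article.
RAW_ORDERS = [
    {"order_id": "1", "amount": "10.50", "region": "north"},
    {"order_id": "2", "amount": "n/a", "region": "north"},    # invalid amount
    {"order_id": "3", "amount": "4.50", "region": "south"},
    {"order_id": "1", "amount": "10.50", "region": "north"},  # duplicate
]


def to_bronze(raw):
    """Bronze: keep records exactly as received (copied, not modified)."""
    return [dict(row) for row in raw]


def to_silver(bronze):
    """Silver: cast types, drop rows that fail casting, dedupe by order_id."""
    seen = set()
    cleaned = []
    for row in bronze:
        try:
            amount = float(row["amount"])
            order_id = int(row["order_id"])
        except ValueError:
            continue  # skip rows with unparseable values
        if order_id in seen:
            continue  # skip duplicates, keeping the first occurrence
        seen.add(order_id)
        cleaned.append(
            {"order_id": order_id, "amount": amount, "region": row["region"]}
        )
    return cleaned


def to_gold(silver):
    """Gold: business-facing aggregate, here total amount per region."""
    totals = {}
    for row in silver:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    bronze = to_bronze(RAW_ORDERS)
    silver = to_silver(bronze)
    gold = to_gold(silver)
    print(len(bronze), len(silver), gold)
```

In a stack like the one described, each of these steps would more plausibly be a table written in the Iceberg format and queried through Trino, with the cleaning and aggregation logic expressed as dbt models. The sketch only shows the division of responsibility between the three layers.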
This is an AI-generated summary. ShortSingh links to the original source for the complete article.