
Key Concepts

Understand the template's structure


This section explains the core concepts and architecture of this template.

Code Structure

The template's code is organized into three main components:

📁
├── 📁 pipelines/             # Data pipelines:
│   ├── 📁 ingest/                      # Data ingestion layer
│   ├── 📁 transform/                   # Data transformation layer
│   └── 📁 orchestrate/                 # Workflow orchestration layer
│
├── 📁 base/                  # Cloud infrastructure (VPC, roles, users, compute cluster, etc.)
│
└── 📁 live/                  # Environment-specific deployment configuration

Each component is documented separately here:

  • pipelines/
  • base/aws/
  • live/

Data Flow

  • Source data is ingested into landing tables: code in pipelines/ingest/<source_name>-*/

  • Data transformations are applied to create staging tables using the Amazon Athena SQL engine: code in pipelines/transform/

  • Infrastructure as Code files in pipelines/*.tf deploy these containers (as serverless functions (AWS Lambda) or container tasks (Amazon ECS))

Data Pipeline Architecture

Our data platform follows a layered architecture:

1. Data Ingestion Layer

For each source, the ingestion layer is structured as follows:

📁 pipelines/
├── 📁 ingest/
│   ├── 📁 <source>-ingestion/      # Core ingestion logic
│   │
│   └── 📁 <source>-schema/         # Iceberg Table schema definitions
│       ├── <table_name>.py
│       └── ...
│
└── <source>_*.tf                   # Infrastructure definition (serverless functions, containers, etc.)

Each source has:

  • A folder pipelines/ingest/<source>-ingestion/ containing the core ingestion logic packaged in a container

  • A folder for the management of the landing tables (<source>-schema/)

More info about landing table schema evolution in Iceberg Landing Table Schema Evolution.

The template comes with an example data ingestion pipeline deployed as a serverless function (AWS Lambda) using dlt; more details here: Ingestion: dlt + lambda.
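For illustration, here is a minimal sketch of what such a dlt-based Lambda function could look like. The chess.com endpoint, names, and handler wiring are assumptions for this sketch, not the template's actual code:

```python
import dlt
import requests

@dlt.resource(name="player_games", write_disposition="append", table_format="iceberg")
def player_games(username: str):
    # Pull one month of games from the public chess.com API
    # (hypothetical source; substitute your own).
    url = f"https://api.chess.com/pub/player/{username}/games/2024/01"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    yield from response.json()["games"]

def handler(event, context):
    # Lambda entry point: run the dlt pipeline into the Iceberg landing tables.
    pipeline = dlt.pipeline(
        pipeline_name="chess_ingestion",
        destination="athena",   # dlt's Athena destination; Iceberg via table_format above
        dataset_name="landing",
    )
    load_info = pipeline.run(player_games("magnuscarlsen"))
    return {"status": "ok", "loads": str(load_info)}
```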

2. Data Transformation Layer

The transformation layer is a dbt project that transforms the data into Iceberg staging tables using the Amazon Athena SQL query engine. This project is located in the pipelines/transform folder:

📁 pipelines/
├── 📁 transform/                   # SQL transformation project
│   ├── 📁 models/
│   │   ├── 📁 staging/            # Raw table connections
│   │   └── 📁 marts/              # Transformations
│   │
│   ├── dbt_project.yml
│   └── Dockerfile                  # For container deployment
│
└── ecs_task_dbt.tf                 # Infrastructure definition for running dbt container

This transformation project runs on container infrastructure (Amazon ECS Fargate).

More details on how this transformation project is structured here: Transformation: dbt.
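As a rough sketch of how such a container might invoke dbt, here is a hypothetical entrypoint; the command and paths are assumptions, not the template's actual Dockerfile setup:

```python
# Hypothetical entrypoint for the dbt container image (illustrative only).
import subprocess
import sys

def main() -> int:
    # `dbt build` compiles and runs the models (and tests) in the project,
    # writing Iceberg staging tables through the configured Athena profile.
    result = subprocess.run(
        ["dbt", "build", "--profiles-dir", "."],
        cwd="/app/transform",  # assumed location of the dbt project inside the image
    )
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())
```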

3. Workflow Orchestration Layer

The orchestration layer coordinates the execution of the ingestion and transformation layers using workflow automation.

📁 pipelines/
├── 📁 orchestrate/
│   └── <source>_step_function.json  # Workflow definition
│
└── <source>_step_function.tf        # Creates an orchestration workflow in AWS Step Functions

This template proposes an example orchestration using AWS Step Functions: Chess Pipeline Workflow.
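For illustration, a minimal sketch of starting such a workflow from Python with boto3; the state machine ARN and input payload are assumptions:

```python
import json
import boto3

def start_pipeline(state_machine_arn: str) -> str:
    # Start the Step Functions workflow that chains ingestion and transformation.
    client = boto3.client("stepfunctions")
    response = client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"source": "chess"}),  # hypothetical workflow input
    )
    return response["executionArn"]
```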

Deployment

This template is ready to be deployed.

The stack deployment is structured in 3 steps:

  • First, the infrastructure modules (base/ and pipelines/) are deployed using Terragrunt for infrastructure management

  • Then, the containers for the ingestion and transformation layers are built and pushed to the container registry

  • Finally, the schema evolution scripts of the Iceberg landing tables are run (a sketch follows below)
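As an illustration of the final step, here is a minimal schema script sketch using pyiceberg; the catalog name, table, and fields are assumptions, and the template's actual scripts may differ:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import NoSuchTableError
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# Desired schema for one landing table (hypothetical fields).
SCHEMA = Schema(
    NestedField(1, "game_id", StringType(), required=True),
    NestedField(2, "pgn", StringType(), required=False),
    NestedField(3, "white_rating", LongType(), required=False),
)

def ensure_table() -> None:
    catalog = load_catalog("default")  # e.g. backed by AWS Glue
    try:
        # Existing tables can be evolved with table.update_schema();
        # here we only create the table if it does not exist yet.
        catalog.load_table("landing.player_games")
    except NoSuchTableError:
        catalog.create_table("landing.player_games", schema=SCHEMA)

if __name__ == "__main__":
    ensure_table()
```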

If you want to get started quickly and deploy the template from your machine, follow this guide: Get Started.

To get started deploying from CI/CD (GitHub Actions), head here: CI Deployment.

Makefile

The template includes many Makefiles that provide utility commands.

Here are some examples:

  • make deploy in the root folder will deploy the template from your machine

  • make build in a folder with a Dockerfile will build the container

  • make local-run in a serverless function folder will test the function locally

  • etc.

Wherever you see a Makefile, run make to list the available actions.
