Key Concepts

Understand the template's structure

This section explains the core concepts and architecture of this template.

Code Structure

The template's code is organized into three main components:

📁
├── 📁 pipelines/             # Data pipelines:
│   ├── 📁 ingest/            # Data ingestion layer
│   ├── 📁 transform/         # Data transformation layer
│   └── 📁 orchestrate/       # Workflow orchestration layer
├── 📁 base/                  # Cloud infrastructure (VPC, roles, users, compute cluster, etc.)
└── 📁 live/                  # Environment-specific deployment configuration

Each component is documented separately:

  • pipelines/

  • base/

  • aws/

  • live/

Data Flow

  1. Source data is ingested into Apache Iceberg landing tables: code in pipelines/ingest/<source_name>-*/

  2. Data transformations are applied to create staging tables using a SQL engine (Amazon Athena): code in pipelines/transform/ (see the query sketch below)
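
For illustration, once transformations have run, a staging table can be queried through the Athena API. Here is a minimal sketch using boto3; the staging.chess_games table and the S3 results location are illustrative, not part of the template:

import boto3

# Assumption: database, table, and S3 output location below are illustrative
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT count(*) AS games FROM staging.chess_games",
    QueryExecutionContext={"Database": "staging"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# The query runs asynchronously; poll get_query_execution() with this id for completion
print(response["QueryExecutionId"])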

Data Pipeline Architecture

Our data platform follows a layered architecture:

1. Data Ingestion Layer

For each source, the ingestion layer is structured as follows:

📁 pipelines/
├── 📁 ingest/
│   ├── 📁 <source>-ingestion/      # Core ingestion logic
│   │
│   └── 📁 <source>-schema/         # Iceberg table schema definitions
│       ├── <table_name>.py
│       └── ...
│
└── <source>_*.tf                   # Infrastructure definition (serverless functions, containers, etc.)

Each source has:

  • A folder pipelines/ingest/<source>-ingestion/ containing the core ingestion logic, packaged in a container

  • Infrastructure as Code files in pipelines/*.tf for deploying this ingestion container, either as serverless functions (AWS Lambda) or as container tasks (Amazon ECS)

  • A folder for managing the landing tables (<source>-schema/)

More info about landing table schema evolution can be found in Iceberg Landing Table Schema Evolution.
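
To give a sense of what a schema definition in <source>-schema/ can look like, here is a minimal sketch using pyiceberg against an AWS Glue catalog; the landing.chess_games identifier and column names are illustrative stand-ins, not the template's actual code:

from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import NoSuchTableError
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestampType

# Assumption: the Glue catalog and the landing.chess_games identifier are illustrative
catalog = load_catalog("default", **{"type": "glue"})

schema = Schema(
    NestedField(field_id=1, name="game_id", field_type=StringType(), required=True),
    NestedField(field_id=2, name="played_at", field_type=TimestampType(), required=False),
)

try:
    table = catalog.load_table("landing.chess_games")
except NoSuchTableError:
    table = catalog.create_table("landing.chess_games", schema=schema)

# Additive schema evolution: add a column without rewriting data
# (raises if the column already exists)
with table.update_schema() as update:
    update.add_column("opening", StringType())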

The template comes with an example data ingestion pipeline deployed as a serverless function (AWS Lambda) using dlt; more details here:

Ingestion: dlt + lambda
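
As a rough sketch of what such a function can look like, here is a minimal dlt pipeline wrapped in a Lambda handler; the source, resource, and pipeline names are hypothetical stand-ins, not the template's actual code:

import dlt

@dlt.resource(table_name="players_games", write_disposition="append", table_format="iceberg")
def players_games():
    # Hypothetical stand-in for calls to the source API
    yield {"player": "magnus", "game_id": 1}

def handler(event, context):
    # Assumption: pipeline and dataset names are illustrative
    pipeline = dlt.pipeline(
        pipeline_name="chess_ingestion",
        destination="athena",   # dlt's Athena destination can write Iceberg tables
        dataset_name="landing",
    )
    load_info = pipeline.run(players_games())
    return {"statusCode": 200, "body": str(load_info)}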

2. Data Transformation Layer

The transformation layer is a dbt project that transforms the data into Iceberg staging tables using the SQL query engine Amazon Athena.

This project is located in the pipelines/transform folder:

📁 pipelines/
├── 📁 transform/                   # SQL transformation project
│   ├── 📁 models/
│   │   ├── 📁 staging/             # Raw table connections
│   │   └── 📁 marts/               # Transformations
│   │
│   ├── dbt_project.yml
│   └── Dockerfile                  # For container deployment
│
└── ecs_task_dbt.tf                 # Infrastructure definition for running the dbt container

This transformation project runs on container infrastructure (Amazon ECS Fargate).

More details on how this transformation project is structured here:

Transformation: dbt
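
For illustration, the container's entrypoint could invoke dbt programmatically rather than shelling out to the CLI; a minimal sketch using dbt-core's dbtRunner (available from dbt 1.5), with an illustrative target name:

from dbt.cli.main import dbtRunner

def main() -> None:
    # Assumption: the "prod" target is illustrative; the template's profiles may differ
    result = dbtRunner().invoke(["run", "--target", "prod"])
    if not result.success:
        raise SystemExit(1)  # fail the ECS task so the orchestrator can react

if __name__ == "__main__":
    main()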

3. Workflow Orchestration Layer

The orchestration layer coordinates the execution of the ingestion and transformation layers using workflow automation.

This template proposes an example orchestration using AWS Step Functions:

📁 pipelines/
├── 📁 orchestrate/
│   └── <source>_step_function.json  # Workflow definition
│
└── <source>_step_function.tf        # Creates an orchestration workflow in AWS Step Functions

(Diagram: Chess Pipeline Workflow)
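
Once deployed, the workflow can be triggered on demand (or on a schedule) through the Step Functions API. A minimal sketch with boto3, assuming an illustrative state machine ARN and input payload:

import json
import boto3

sfn = boto3.client("stepfunctions")

# Assumption: the ARN and input payload below are illustrative
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:chess-pipeline",
    input=json.dumps({"run_date": "2024-01-01"}),
)
print(response["executionArn"])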

Deployment

This template is ready to be deployed.

The stack deployment is structured in 3 steps:

  • First, the infrastructure modules (base/ and pipelines/) are deployed with Terragrunt

  • Then, the containers for the ingestion and transformation layers are built and pushed to the container registry

  • Finally, the schema evolution scripts for the Iceberg landing tables are run

If you want to get started quickly and deploy the template from your machine, follow this guide:

Get Started

To deploy from GitHub Actions CI/CD instead, head here:

CI Deployment

Makefile

The template includes several Makefiles that provide utility commands.

Here are some examples:

  • make deploy in the root folder will deploy the template from your machine

  • make build in a folder with a Dockerfile will build the container

  • make local-run in a serverless function folder will test the function locally

  • etc.

Wherever you see a Makefile, run make to list the available actions.
