Transformation: dbt
Overview
This directory contains a data transformation pipeline that:
Takes data from Iceberg tables in the landing zone
Transforms it using dbt (data build tool)
Creates analytics-ready tables in staging and mart schemas
The pipeline runs as an AWS ECS Fargate task using a Docker container.
How It Works
Infrastructure Components
AWS Athena: SQL query engine for data transformation
Amazon S3: Hosts the Iceberg tables for both source and transformed data
AWS Glue: Provides the catalog for Iceberg tables
Amazon ECS: Orchestrates the dbt container execution
Amazon ECR: Stores the dbt Docker container image
Terraform: Provisions and manages all infrastructure
Project Structure
pipelines/
├── transform/ # dbt project root
│ ├── Dockerfile
│ ├── dbt_project.yml # dbt project configuration
│ ├── sources/
│ │ ├── <source_name>.yml # Lists all landing tables for a source
│ ├── models/
│ │ ├── staging/ # Staging models (first transformation layer)
│ │ └── mart/ # Final business-ready models
│ └── ...
└── ecs_task_dbt.tf # Terraform creating the ECS task
Data Transformation Flow
The pipeline follows these transformation layers:
Sources: Raw data from landing tables created by ingestion pipelines
Staging: Initial cleaning, type conversion, deduplication and renaming
Mart: Final models organized by business domain, ready for analytics and reporting
Sources
Sources are defined in the sources/ folder and reference the landing tables created by the ingestion pipelines:
sources:
  - name: <source_name>
    schema: <landing_schema>
    tables:
      - name: <source_name>__dlt_version
      - name: <source_name>__dlt_loads
      ...
You can generate this file automatically using the BoringData CLI:
cd pipelines/transform
uvx boringdata dbt import-source --source ../ingest/<source_name>-schema/
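Once a source is declared, staging models can select from its landing tables with dbt's source() function. Below is a minimal sketch of a staging model; the orders table, its column names, and the dlt-provided _dlt_load_id column are assumptions used to illustrate the cleaning, renaming, and deduplication step:

-- models/staging/stg_<source_name>__orders.sql (hypothetical example)
with source as (

    -- reference the landing table declared in sources/<source_name>.yml
    select * from {{ source('<source_name>', 'orders') }}

),

renamed as (

    select
        order_id,                                   -- hypothetical columns
        cast(order_ts as timestamp) as ordered_at,  -- type conversion
        lower(status) as order_status,              -- cleaning + renaming
        row_number() over (
            partition by order_id
            order by _dlt_load_id desc              -- keep the latest load per key
        ) as row_num
    from source

)

select
    order_id,
    ordered_at,
    order_status
from renamed
where row_num = 1  -- deduplication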
Models Structure
The dbt models follow a layered architecture pattern:
Each folder in the models directory corresponds to a distinct schema in Athena:
models/staging/ ➡️ <environment>_staging schema
models/mart/ ➡️ <environment>_mart schema
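Mart models then build on the staging layer through dbt's ref() function, which also records lineage between the two schemas. A minimal sketch, reusing the hypothetical staging model above:

-- models/mart/orders_daily.sql (hypothetical example)
select
    date_trunc('day', ordered_at) as order_date,
    order_status,
    count(*) as order_count
from {{ ref('stg_<source_name>__orders') }}
group by 1, 2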
Development Guide
Option 1: Execute dbt Locally
For rapid development with local dbt execution:
Set up your environment:
uv venv --python=python3.12
uv pip install -r requirements.txt
uv run dbt deps
Configure dbt profile: Create or update ~/.dbt/profiles.yml with:
local:
  target: <environment>
  outputs:
    <environment>:
      type: athena
      database: awsdatacatalog
      region_name: "{{ env_var('AWS_REGION') }}"
      schema: "<environment>_staging"
      s3_staging_dir: "s3://<environment>-<region>-staging-bucket/athena"
      s3_data_dir: "s3://<environment>-<region>-staging-bucket/data"
      s3_tmp_table_dir: "s3://<environment>-<region>-staging-bucket/tmp"
Run dbt commands:
export DBT_PROFILE=local
export AWS_PROFILE=<your_profile>
export AWS_REGION=<your_region>

# Run a specific model
uv run dbt run --select model_name

# Run with Makefile shortcut
make run-local cmd="run --select model_name"
Option 2: Execute in AWS ECS Fargate
Once your template is deployed to AWS, you can run dbt in the cloud environment:
export AWS_PROFILE=<your_profile>
export ENVIRONMENT=<your_environment>
make run cmd="run"
This will trigger an ECS Fargate task to execute the specified dbt command and store results in Iceberg.
Deployment
For manual deployment:
# Set required environment variables
export AWS_PROFILE=<your_profile>
export ENVIRONMENT=<your_environment>
cd pipelines/transform
# Build and deploy
make deploy
This process:
Builds the Docker image locally
Pushes it to ECR
The next time you trigger an ECS task, it will use the latest image.
Common Commands
# Development
make run-local cmd="run" # Run dbt locally with specified command
make run-local cmd="test" # Run dbt tests locally
make run-local cmd="docs generate" # Generate dbt documentation
# Cloud Execution
make run cmd="run" # Run dbt in ECS Fargate
make run cmd="test" # Run tests in ECS Fargate
# Deployment
make build # Build Docker image
make deploy # Build and deploy to ECR