> For the complete documentation index, see [llms.txt](https://docs.boringdata.io/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.boringdata.io/template-aws-iceberg/project-structure/pipelines/transform.md).

# Transformation: dbt

## Overview

This directory contains a data transformation pipeline that:

1. Takes data from Iceberg tables in the landing zone
2. Transforms it using [dbt](https://www.getdbt.com/) (data build tool)
3. Creates analytics-ready tables in staging and mart schemas

The pipeline runs as an AWS ECS Fargate task using a Docker container.

## How It Works

### Infrastructure Components

* **AWS Athena**: SQL query engine for data transformation
* **Amazon S3**: Hosts the Iceberg tables for both source and transformed data
* **AWS Glue**: Provides the catalog for Iceberg tables
* **Amazon ECS**: Orchestrates the dbt container execution
* **Amazon ECR**: Stores the dbt docker container image
* **Terraform**: Provisions and manages all infrastructure

### Project Structure

```
pipelines/
├── transform/                     # dbt project root
│   ├── Dockerfile
│   ├── dbt_project.yml            # dbt project configuration
│   ├── sources/
│   │   ├──<source_name>.yml       # List all landing tables for a source
│   ├── models/
│   │   ├── staging/               # Staging models (first transformation layer)
│   │   └── mart/                  # Final business-ready models
│   └── ...
└── ecs_task_dbt.tf                # Terraform creating the ECS task
```

### Data Transformation Flow

The pipeline follows these transformation layers:

1. **Sources**: Raw data from landing tables created by ingestion pipelines
2. **Staging**: Initial cleaning, type conversion, deduplication and renaming
3. **Mart**: Final models organized by business domain, ready for analytics and reporting

## Sources

Sources are defined in the `sources/` folder and reference the landing tables created by the ingestion pipelines:

{% code title="sources/\<source\_name>.yml" %}

```yaml
sources:
  - name: <source_name>
    schema: <landing_schema>
    tables:
      - name: <source_name>__dlt_version
      - name: <source_name>__dlt_loads
      ...
```

{% endcode %}

You can generate this file automatically using the BoringData CLI:

```bash
cd pipelines/transform
uvx boringdata dbt import-source --source ../ingest/<source_name>-schema/
```

## Models Structure

The dbt models follow a layered architecture pattern:

* Each folder in the `models` directory corresponds to a distinct schema in Athena
* `models/staging/` ➡️ `<environment>_staging` schema in Athena
* `models/mart/` ➡️ `<environment>_mart` schema in Athena

## Development Guide

### Option 1: Execute dbt Locally

For rapid development with local dbt execution:

1. **Setup your environment**:

   ```bash
   uv venv --python=python3.12
   uv pip install -r requirements.txt
   uv run dbt deps
   ```
2. **Configure dbt profile**:\
   Create or update `~/.dbt/profiles.yml` with:

   ```yaml
   local:
     target: <environment>
     outputs:
       <environment>:
         type: athena
         database: awsdatacatalog
         region_name: "{{ env_var('AWS_REGION') }}"
         schema: "<environment>_staging"
         s3_staging_dir: "s3://<environment>-<region>-staging-bucket/athena"
         s3_data_dir: "s3://<environment>-<region>-staging-bucket/data"
         s3_tmp_table_dir: "s3://<environment>-<region>-staging-bucket/tmp"
   ```
3. **Run dbt commands**:

   ```bash
   export DBT_PROFILE=local
   export AWS_PROFILE=<your_profile>
   export AWS_REGION=<your_region>

   # Run a specific model
   uv run dbt run --select model_name

   # Run with Makefile shortcut
   make run-local cmd="run --select model_name"
   ```

### Option 2: Execute in AWS ECS Fargate

Once your template is deployed to AWS you can run dbt in the cloud environment:

```bash
export AWS_PROFILE=<your_profile>
export ENVIRONMENT=<your_environment>
make run cmd="run"
```

This will trigger an ECS Fargate task to execute the specified dbt command and store results in Iceberg.

## Deployment

For manual deployment:

```bash
# Set required environment variables
export AWS_PROFILE=<your_profile>
export ENVIRONMENT=<your_environment>
cd pipelines/transform

# Build and deploy
make deploy
```

This process:

1. Builds the Docker image locally
2. Pushes it to ECR

The next time you trigger an ECS task, it will use the latest image.

## Common Commands

```bash
# Development
make run-local cmd="run"              # Run dbt locally with specified command
make run-local cmd="test"             # Run dbt tests locally
make run-local cmd="docs generate"    # Generate dbt documentation

# Cloud Execution
make run cmd="run"                    # Run dbt in ECS Fargate
make run cmd="test"                   # Run tests in ECS Fargate

# Deployment
make build                            # Build Docker image
make deploy                           # Build and deploy to ECR
```

## Resources

* [dbt Documentation](https://docs.getdbt.com/)
* [AWS Athena User Guide](https://aws.amazon.com/athena/)
* [Apache Iceberg Documentation](https://iceberg.apache.org/)
* [BoringData CLI Guide](https://docs.boringdata.io/)


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://docs.boringdata.io/template-aws-iceberg/project-structure/pipelines/transform.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
