# Transformation: dbt

## Overview

This directory contains a data transformation pipeline that:

1. Takes data from Iceberg tables in the landing zone
2. Transforms it using [dbt](https://www.getdbt.com/) (data build tool)
3. Creates analytics-ready tables in staging and mart schemas

The pipeline runs as an AWS ECS Fargate task using a Docker container.

## How It Works

### Infrastructure Components

* **AWS Athena**: SQL query engine for data transformation
* **Amazon S3**: Stores the data and metadata files of both the source and transformed Iceberg tables
* **AWS Glue**: Provides the catalog for Iceberg tables
* **Amazon ECS**: Orchestrates the dbt container execution
* **Amazon ECR**: Stores the dbt Docker image
* **Terraform**: Provisions and manages all infrastructure

### Project Structure

```
pipelines/
├── transform/                     # dbt project root
│   ├── Dockerfile
│   ├── dbt_project.yml            # dbt project configuration
│   ├── sources/
│   │   └── <source_name>.yml      # Lists all landing tables for a source
│   ├── models/
│   │   ├── staging/               # Staging models (first transformation layer)
│   │   └── mart/                  # Final business-ready models
│   └── ...
└── ecs_task_dbt.tf                # Terraform creating the ECS task
```

### Data Transformation Flow

The pipeline follows these transformation layers:

1. **Sources**: Raw data from landing tables created by ingestion pipelines
2. **Staging**: Initial cleaning, type conversion, deduplication and renaming
3. **Mart**: Final models organized by business domain, ready for analytics and reporting
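
To illustrate the layering, a staging model typically selects from a landing table with dbt's `source()` function and applies the cleanup described above. The sketch below is only an example: the model, table, and column names (including the ordering column used for deduplication) are hypothetical, not the template's actual models.

```sql
-- models/staging/stg_<source_name>__orders.sql (hypothetical example)
with source as (

    select * from {{ source('<source_name>', 'orders') }}

),

deduplicated as (

    select
        *,
        row_number() over (partition by order_id order by _dlt_load_id desc) as row_num
    from source

)

select
    cast(order_id as varchar)   as order_id,      -- type conversion
    cast(order_ts as timestamp) as ordered_at,    -- renaming
    lower(status)               as order_status   -- cleaning
from deduplicated
where row_num = 1                                 -- deduplication
```

A mart model would then build on this output with `{{ ref('stg_<source_name>__orders') }}` rather than querying the landing table directly.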

## Sources

Sources are defined in the `sources/` folder and reference the landing tables created by the ingestion pipelines:

{% code title="sources/\<source\_name>.yml" %}

```yaml
sources:
  - name: <source_name>
    schema: <landing_schema>
    tables:
      - name: <source_name>__dlt_version
      - name: <source_name>__dlt_loads
      ...
```

{% endcode %}

You can generate this file automatically using the BoringData CLI:

```bash
cd pipelines/transform
uvx boringdata dbt import-source --source ../ingest/<source_name>-schema/
```

## Models Structure

The dbt models follow a layered architecture pattern:

* Each folder in the `models` directory corresponds to a distinct schema in Athena:
  * `models/staging/` ➡️ `<environment>_staging` schema
  * `models/mart/` ➡️ `<environment>_mart` schema
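
This mapping is driven by the model configuration in `dbt_project.yml`. A minimal sketch is shown below; the project name, the materializations, and the exact schema-name resolution (which depends on the project's `generate_schema_name` macro) are assumptions, not the template's actual values.

```yaml
# dbt_project.yml (sketch; names and materializations are assumptions)
models:
  transform:
    staging:
      +schema: staging        # resolves to <environment>_staging in Athena
      +materialized: view
    mart:
      +schema: mart           # resolves to <environment>_mart in Athena
      +materialized: table
```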

## Development Guide

### Option 1: Execute dbt Locally

For rapid development with local dbt execution:

1. **Setup your environment**:

   ```bash
   uv venv --python=python3.12
   uv pip install -r requirements.txt
   uv run dbt deps
   ```
2. **Configure dbt profile**:\
   Create or update `~/.dbt/profiles.yml` with:

   ```yaml
   local:
     target: <environment>
     outputs:
       <environment>:
         type: athena
         database: awsdatacatalog
         region_name: "{{ env_var('AWS_REGION') }}"
         schema: "<environment>_staging"
         s3_staging_dir: "s3://<environment>-<region>-staging-bucket/athena"
         s3_data_dir: "s3://<environment>-<region>-staging-bucket/data"
         s3_tmp_table_dir: "s3://<environment>-<region>-staging-bucket/tmp"
   ```
3. **Run dbt commands**:

   ```bash
   export DBT_PROFILE=local
   export AWS_PROFILE=<your_profile>
   export AWS_REGION=<your_region>

   # Run a specific model
   uv run dbt run --select model_name

   # Run with Makefile shortcut
   make run-local cmd="run --select model_name"
   ```

### Option 2: Execute in AWS ECS Fargate

Once your template is deployed to AWS, you can run dbt in the cloud environment:

```bash
export AWS_PROFILE=<your_profile>
export ENVIRONMENT=<your_environment>
make run cmd="run"
```

This triggers an ECS Fargate task that executes the specified dbt command and stores the results in Iceberg tables.
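
Under the hood, `make run` starts a one-off Fargate task from the task definition created by `ecs_task_dbt.tf`, presumably equivalent to an `aws ecs run-task` call along these lines (the cluster, task definition, network, and container names are placeholders, not the template's actual values):

```bash
aws ecs run-task \
  --cluster <cluster_name> \
  --task-definition <environment>-dbt \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[<subnet_id>],securityGroups=[<sg_id>],assignPublicIp=ENABLED}" \
  --overrides '{"containerOverrides": [{"name": "dbt", "command": ["run"]}]}'
```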

## Deployment

For manual deployment:

```bash
# Set required environment variables
export AWS_PROFILE=<your_profile>
export ENVIRONMENT=<your_environment>
cd pipelines/transform

# Build and deploy
make deploy
```

This process:

1. Builds the Docker image locally
2. Pushes it to ECR

The next time you trigger an ECS task, it will use the latest image.
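
The `make deploy` target presumably wraps the standard ECR workflow, roughly equivalent to the following (the account ID, region, and repository name are placeholders):

```bash
# Authenticate Docker against the ECR registry
aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com

# Build the dbt image and push it to the repository
docker build -t <account_id>.dkr.ecr.<region>.amazonaws.com/<repository>:latest .
docker push <account_id>.dkr.ecr.<region>.amazonaws.com/<repository>:latest
```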

## Common Commands

```bash
# Development
make run-local cmd="run"              # Run dbt locally with specified command
make run-local cmd="test"             # Run dbt tests locally
make run-local cmd="docs generate"    # Generate dbt documentation

# Cloud Execution
make run cmd="run"                    # Run dbt in ECS Fargate
make run cmd="test"                   # Run tests in ECS Fargate

# Deployment
make build                            # Build Docker image
make deploy                           # Build and deploy to ECR
```

## Resources

* [dbt Documentation](https://docs.getdbt.com/)
* [Amazon Athena](https://aws.amazon.com/athena/)
* [Apache Iceberg Documentation](https://iceberg.apache.org/)
* [BoringData CLI Guide](https://docs.boringdata.io/)
