FAQ
How do I integrate it into my existing Terraform stack?
Our templates are organized into two types of modules:
• Base modules (base/aws) – Infrastructure components.
• Pipeline modules (pipelines/) – Pipeline-specific components.
Typically, a data team manages the pipelines/ module, while the company's infrastructure team manages the resources defined in the base/aws module. Because this split is already in place in the template, you can hand the base/aws module to your infra team as a "spec" for the infrastructure your pipelines need.
There are too many files—I don't know where to start!
For codebase discovery, LLMs are our best allies.
Get Cursor or Copilot and start asking questions in the chat interface.
The documentation is included in the repo as Markdown files, and LLMs usually find the necessary information independently.
What is an "environment"?
Throughout this documentation, you will see references to the ENVIRONMENT. In our template, the environment represents a specific version or instance of your project, such as prod, dev, or ctlq.
This value is used as a prefix for all resources created in AWS, ensuring that each deployment is isolated and clearly identified.
How Environments are Used
Resource Naming: Every resource (e.g., S3 buckets, Lambda functions) is prefixed with the environment name. This makes it easy to distinguish between resources belonging to different environments.
Deployment Isolation: With Terragrunt, you can deploy the project to multiple environments concurrently. Each environment can have its own set of custom input values and configuration settings. For example, you can deploy the same project in different AWS regions or accounts.
Configuration Customization: Different environments allow you to adjust resource configurations according to your needs. You might choose different Lambda settings in production compared to development.
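As a minimal sketch of how this prefixing works (the variable and resource names below are illustrative assumptions, not the template's actual ones):

import os

# Illustrative only: the environment name prefixes every resource name,
# keeping each deployment isolated. These resource names are hypothetical.
environment = os.environ["ENVIRONMENT"]          # e.g. "prod", "dev", "ctlq"
bucket_name = f"{environment}-landing-bucket"    # e.g. "prod-landing-bucket"
lambda_name = f"{environment}-ingest-function"   # e.g. "dev-ingest-function"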
Choosing a Name for Your Environment
When selecting a name for your environment, follow these guidelines:
Keep it Short and Lowercase: Use concise, lowercase names such as dev, prod, or qa.
Avoid Special Characters or Spaces: Stick to alphanumeric characters and simple words to ensure compatibility across all resource naming conventions.
Using clear and consistent environment names helps maintain organization, prevents resource conflicts, and simplifies management across your AWS deployments.
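If you want to validate names programmatically, a minimal sketch could look like this (the helper name and the eight-character limit are assumptions, not rules from the template):

import re

def is_valid_environment_name(name: str) -> bool:
    """Hypothetical check: short (1-8 chars), lowercase alphanumeric only."""
    return re.fullmatch(r"[a-z0-9]{1,8}", name) is not None

assert is_valid_environment_name("prod")
assert not is_valid_environment_name("My Env")  # uppercase and spaces rejected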
Iceberg Landing Table Schema Evolution
Overview
Schema evolution in our Iceberg landing tables is managed through Python files in the pipelines/ingest/<source>-schema/ directory.
How It Works
Schema Definition Files
Each table has a dedicated Python file (e.g., chess__dlt_version.py)
Schemas are defined using PyArrow and managed by PyIceberg
Files are designed to be idempotent (safe to run multiple times)
Making Schema Changes
Add new PyIceberg operations at the bottom of the schema file
Never modify existing operations; this preserves backward compatibility
Only use idempotent operations
Example schema file:
...
catalog.create_table_if_not_exists(
    (NAMESPACE, "table_name"),
    pa.schema([
        pa.field("column1", "string", nullable=False),
        pa.field("column2", "int64", nullable=True),
    ]),
    location=f"s3://{os.environ.get('S3_BUCKET_NAME')}/path/to/table",
)

# New schema evolution operations go here
# Example: adding a new column (StringType comes from pyiceberg.types;
# columns added this way are optional/nullable by default)
table = catalog.load_table((NAMESPACE, "table_name"))
if "new_column" not in table.schema().column_names:
    with table.update_schema() as update:
        update.add_column("new_column", StringType())
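The membership check against table.schema().column_names keeps the file idempotent: rerunning the migration simply skips columns that already exist.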
You can also add partitioning to your table.
# Update partitioning on an existing table
# (requires: from pyiceberg.transforms import BucketTransform, DayTransform)
with table.update_spec() as update:
    update.add_field("id", BucketTransform(16), "bucketed_id")
    update.add_field("event_ts", DayTransform(), "day_ts")
Applying Schema Changes
Use the Makefile in the schema directory:
# Migrate all tables
cd pipelines/ingest/<source>-schema/
make migrate

# Migrate specific table
make migrate table_name=<table_name>
CI/CD Integration
Schema migrations run automatically in CI/CD pipelines. The typical CI workflow is the following:
1. Deploy Terraform
2. Deploy Docker images
3. Run schema migrations
Best Practices
Ensure changes don't break existing data pipelines: prefer adding new columns, and never delete or modify an existing column
Add new columns as nullable to avoid breaking existing writes
Consider the impact on downstream consumers