FAQ
How do I integrate it into my existing Terraform stack?
Our templates are organized into two types of modules:
• Base modules (base/aws) – Infrastructure components.
• Pipeline modules (pipelines/) – Pipeline-specific components.
Typically, a data team manages the pipelines/ module, while the company's infrastructure team manages the resources defined in the base/aws module. Because this split is already in place in the template, you can hand the base/aws module to your infra team as a "spec" for the infrastructure your pipelines need.
There are too many files—I don't know where to start!
For codebase discovery, LLMs are our best allies.
Get Cursor or Copilot and start asking questions in the chat interface.
The documentation is included in the repo as Markdown files, and LLMs usually find the necessary information independently.
What is an "environment"?
Throughout this documentation, you will see references to the ENVIRONMENT. In our template, the environment represents a specific version or instance of your project, such as prod, dev, or ctlq.
This value is used as a prefix for all resources created in AWS, ensuring that each deployment is isolated and clearly identified.
How Environments are Used
Resource Naming: Every resource (e.g., S3 buckets, Lambda functions) is prefixed with the environment name. This makes it easy to distinguish between resources belonging to different environments.
Deployment Isolation: With Terragrunt, you can deploy the project to multiple environments concurrently. Each environment can have its own set of custom input values and configuration settings. For example, you can deploy the same project in different AWS regions or accounts.
Configuration Customization: Different environments allow you to adjust resource configurations according to your needs. You might choose different Lambda settings in production compared to development.
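As a minimal sketch of how this prefixing works (the variable and resource names below are illustrative assumptions, not the template's actual ones):

import os

# Illustrative only: the environment name prefixes every resource name,
# keeping each deployment isolated. These resource names are hypothetical.
environment = os.environ["ENVIRONMENT"]          # e.g. "prod", "dev", "ctlq"
bucket_name = f"{environment}-landing-bucket"    # e.g. "prod-landing-bucket"
lambda_name = f"{environment}-ingest-function"   # e.g. "dev-ingest-function"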
Choosing a Name for Your Environment
When selecting a name for your environment, follow these guidelines:
Keep it Short and Lowercase: Use concise, lowercase names such as dev, prod, or qa.
Avoid Special Characters or Spaces: Stick to alphanumeric characters and simple words to ensure compatibility across all resource naming conventions.
Using clear and consistent environment names helps maintain organization, prevents resource conflicts, and simplifies management across your AWS deployments.
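If you want to validate names programmatically, a minimal sketch could look like this (the helper name and the eight-character limit are assumptions, not rules from the template):

import re

def is_valid_environment_name(name: str) -> bool:
    """Hypothetical check: short (1-8 chars), lowercase alphanumeric only."""
    return re.fullmatch(r"[a-z0-9]{1,8}", name) is not None

assert is_valid_environment_name("prod")
assert not is_valid_environment_name("My Env")  # uppercase and spaces rejected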
Iceberg Landing Table Schema Evolution
Overview
Schema evolution in our Iceberg landing tables is managed through Python files in the pipelines/ingest/<source>-schema/ directory.
How It Works
Schema Definition Files
Each table has a dedicated Python file (e.g., chess__dlt_version.py)
Schemas are defined using PyArrow and managed by PyIceberg
Files are designed to be idempotent (safe to run multiple times)
Making Schema Changes
Add new PyIceberg operations at the bottom of the schema file
Never modify existing operations; this preserves backward compatibility
Only use idempotent operations
Example schema file:
...
catalog.create_table_if_not_exists(
    (NAMESPACE, "table_name"),
    pa.schema([
        pa.field("column1", "string", nullable=False),
        pa.field("column2", "int64", nullable=True),
    ]),
    location=f"s3://{os.environ.get('S3_BUCKET_NAME')}/path/to/table",
)

# New schema evolution operations go here
# Example: adding a new column (StringType comes from pyiceberg.types;
# columns added this way are optional/nullable by default)
table = catalog.load_table((NAMESPACE, "table_name"))
if "new_column" not in table.schema().column_names:
    with table.update_schema() as update:
        update.add_column("new_column", StringType())
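The membership check against table.schema().column_names keeps the file idempotent: rerunning the migration simply skips columns that already exist.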
You can also add partitioning to your table.
# Update partitioning on an existing table
# (requires: from pyiceberg.transforms import BucketTransform, DayTransform)
with table.update_spec() as update:
    update.add_field("id", BucketTransform(16), "bucketed_id")
    update.add_field("event_ts", DayTransform(), "day_ts")
Applying Schema Changes
Use the Makefile in the schema directory:
# Migrate all tables
cd pipelines/ingest/<source>-schema/
make migrate

# Migrate specific table
make migrate table_name=<table_name>
CI/CD Integration
Schema migrations run automatically in CI/CD pipelines. The typical CI workflow is the following:
1. Deploy Terraform
2. Deploy Docker images
3. Run schema migrations
Best Practices
Ensure changes don't break existing data pipelines: prefer adding new columns, and never delete or modify an existing column
Add new columns as nullable to avoid breaking existing writes
Consider the impact on downstream consumers