FAQ
Last updated
Our templates are organized into two types of modules:
• Base modules (base/aws) – Infrastructure components.
• Pipeline modules (pipeline/) – Pipeline-specific components.
Typically, a data team manages the pipeline/ module, while the company's infra team manages the resources defined in the base/aws module. Having this split built into the template makes it easy to use base/aws as a "spec" for your infra team.
For codebase discovery, LLMs are our best allies.
Get Cursor or Copilot and start asking questions in the chat interface.
The documentation is included in the repo as Markdown files, and LLMs usually find the necessary information independently.
Throughout this documentation, you will see references to the ENVIRONMENT.
In our template, the environment represents a specific version or instance of your project, such as prod, dev, or ctlq.
This value is used as a prefix for all resources created in both AWS and Snowflake, ensuring that each deployment is isolated and clearly identified.
• Resource Naming – Every resource (e.g., S3 buckets, Lambda functions, Snowflake databases) is prefixed with the environment name. This makes it easy to distinguish between resources belonging to different environments.
• Deployment Isolation – With Terragrunt, you can deploy the project to multiple environments concurrently. Each environment can have its own set of custom input values and configuration settings. For example, you can deploy the same project in different AWS regions or accounts.
• Configuration Customization – Different environments allow you to adjust resource configurations according to your needs. You might choose a larger warehouse size or different Lambda settings in production compared to development.
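For illustration, an environment-level terragrunt.hcl might pass the environment name and per-environment overrides as inputs. The file path and input names below are assumptions for the sketch, not the template's actual layout:

```hcl
# live/prod/terragrunt.hcl — hypothetical per-environment configuration
include "root" {
  path = find_in_parent_folders()
}

inputs = {
  environment    = "prod"       # used as a prefix for every AWS/Snowflake resource
  aws_region     = "eu-west-1"  # each environment can target its own region/account
  warehouse_size = "MEDIUM"     # larger than what dev would use
}
```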
When selecting a name for your environment, follow these guidelines:
• Keep it Short and Lowercase – Use concise, lowercase names such as dev, prod, or qa.
• Avoid Special Characters or Spaces – Stick to alphanumeric characters and simple words to ensure compatibility across all resource naming conventions.
Using clear and consistent environment names helps maintain organization, prevents resource conflicts, and simplifies management across your AWS and Snowflake deployments.
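As a quick illustration, these guidelines could be checked with a simple pattern. The helper and the exact length limit are assumptions for the sketch, not part of the template:

```python
import re

# Hypothetical guard: short (2-10 chars), lowercase, alphanumeric only.
ENV_NAME = re.compile(r"^[a-z][a-z0-9]{1,9}$")

def is_valid_env_name(name: str) -> bool:
    """Return True if the name follows the guidelines above."""
    return ENV_NAME.fullmatch(name) is not None

print(is_valid_env_name("prod"))    # True
print(is_valid_env_name("My Env"))  # False: uppercase and a space
```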
We maintain a YAML file for every data source that defines the structure of the landing tables in Snowflake.
This YAML schema ensures each source's data model is clearly documented and versioned.
The format of this YAML is inspired by the Data Contract Specification.
• Documentation of the Data Model – Having a human-readable YAML file allows any team member to quickly see the tables, columns, and data types for a particular source, making it easier to understand how the data flows through the pipeline.
• Automated Table Creation – Our Terraform Snowpipe configuration can automatically read the YAML file to create landing tables for each source. You don't have to manually create or update your Snowflake tables whenever you change your schema.
• Data Contract Enforcement – The YAML schema acts as a contract between the ingestion layer and Snowflake. If the data in your pipeline doesn't match the declared schema, it can trigger validation rules or highlight mismatches, preventing corrupt or malformed data from being loaded.
• Version Control – By checking the YAML schema into Git, you can track when and why schema changes occur (adding, removing, or modifying columns). This history helps you audit and review changes before they reach production.
• Data Quality Visibility – Discrepancies between the defined schema and the real data can indicate potential data issues. Because schema mismatches are surfaced early, issues can be caught quickly—before they cause bigger downstream problems.
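To make the contract-enforcement idea concrete, here is a minimal sketch of how mismatches between a record and the declared fields could be surfaced. `find_violations` is a hypothetical helper for illustration, not part of the template:

```python
# Sketch: compare an incoming record against the declared schema fields.
def find_violations(record: dict, fields: dict) -> list[str]:
    """Return a human-readable list of contract violations."""
    problems = []
    for name, spec in fields.items():
        if spec.get("required") and record.get(name) is None:
            problems.append(f"missing required field: {name}")
    for name in record:
        if name not in fields:
            problems.append(f"undeclared field: {name}")
    return problems

# Declared fields, mirroring the _dlt_loads model shown below.
fields = {
    "load_id": {"type": "VARCHAR", "required": True},
    "status": {"type": "NUMBER(19,0)", "required": True},
}

print(find_violations({"load_id": "17099", "extra": 1}, fields))
# -> ['missing required field: status', 'undeclared field: extra']
```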
Below is an abbreviated example of a YAML schema file for a Chess data source.
It illustrates how multiple tables (or "models") are defined within a single file, listing each column's data type and whether it's required, unique, or part of a primary key.
```yaml
dataContractSpecification: 1.1.0
id: chess
info:
  title: chess
  version: 1.1.0
models:
  players_profiles:
    description: ''
    fields:
      username:
        type: VARCHAR
        required: false
        primaryKey: false
        unique: false
        description: ''
      last_online:
        type: TIMESTAMP_TZ
        required: false
        primaryKey: false
        unique: false
        description: ''
      joined:
        type: TIMESTAMP_TZ
        required: false
        primaryKey: false
        unique: false
        description: ''
    file_format: PARQUET
  players_games:
    description: ''
    fields:
      end_time:
        type: TIMESTAMP_TZ
        required: false
        primaryKey: false
        unique: false
        description: ''
      white__username:
        type: VARCHAR
        required: false
        primaryKey: false
        unique: false
        description: ''
      black__username:
        type: VARCHAR
        required: false
        primaryKey: false
        unique: false
        description: ''
    file_format: PARQUET
  _dlt_loads:
    description: Created by DLT. Tracks completed loads
    fields:
      load_id:
        type: VARCHAR
        required: true
        primaryKey: false
        unique: false
        description: ''
      status:
        type: NUMBER(19,0)
        required: true
        primaryKey: false
        unique: false
        description: ''
    file_format: JSON
```
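As a sketch of the automated-table-creation step, a schema like the one above (once parsed, e.g. with PyYAML) can be rendered into environment-prefixed Snowflake DDL. The `<env>_<id>.landing` naming and the `landing_ddl` helper are assumptions for illustration, not the template's actual implementation:

```python
# Sketch: render CREATE TABLE statements from a parsed data-contract schema.
# The dict mirrors an abbreviated slice of the YAML above.
schema = {
    "id": "chess",
    "models": {
        "players_profiles": {
            "fields": {
                "username": {"type": "VARCHAR", "required": False},
                "last_online": {"type": "TIMESTAMP_TZ", "required": False},
            },
        },
    },
}

def landing_ddl(environment: str, schema: dict) -> list[str]:
    """One CREATE TABLE per model, with the database prefixed by the environment."""
    database = f"{environment}_{schema['id']}"
    statements = []
    for table, model in schema["models"].items():
        columns = ",\n  ".join(
            f"{name} {spec['type']}{' NOT NULL' if spec.get('required') else ''}"
            for name, spec in model["fields"].items()
        )
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {database}.landing.{table} (\n  {columns}\n);"
        )
    return statements

print(landing_ddl("dev", schema)[0])
```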