FAQ
Last updated
Our templates are organized into two types of modules:
• Base modules (base/aws) – Infrastructure components.
• Pipeline modules (pipeline/) – Pipeline-specific components.
Typically, a data team manages the pipeline/ module, while the company's infra team manages the resources defined in the base/aws module. Having this split built into the template makes it easy to use base/aws as a "spec" for your infra team.
For codebase discovery, LLMs are our best allies.
Get Cursor or Copilot and start asking questions in the chat interface.
The documentation is included in the repo as Markdown files, and LLMs usually find the necessary information independently.
Throughout this documentation, you will see references to the ENVIRONMENT.
In our template, the environment represents a specific version or instance of your project, such as prod, dev, or ctlq.
This value is used as a prefix for all resources created in both AWS and Snowflake, ensuring that each deployment is isolated and clearly identified.
• Resource Naming – Every resource (e.g., S3 buckets, Lambda functions, Snowflake databases) is prefixed with the environment name. This makes it easy to distinguish between resources belonging to different environments.
• Deployment Isolation – With Terragrunt, you can deploy the project to multiple environments concurrently. Each environment can have its own set of custom input values and configuration settings. For example, you can deploy the same project in different AWS regions or accounts.
• Configuration Customization – Different environments allow you to adjust resource configurations according to your needs. You might choose a larger warehouse size or different Lambda settings in production compared to development.
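For illustration, an environment-level terragrunt.hcl might pass the environment name and per-environment overrides as inputs. The file path and input names below are assumptions for the sketch, not the template's actual layout:

```hcl
# live/prod/terragrunt.hcl — hypothetical per-environment configuration
include "root" {
  path = find_in_parent_folders()
}

inputs = {
  environment    = "prod"       # used as a prefix for every AWS/Snowflake resource
  aws_region     = "eu-west-1"  # each environment can target its own region/account
  warehouse_size = "MEDIUM"     # larger than what dev would use
}
```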
When selecting a name for your environment, follow these guidelines:
• Keep it Short and Lowercase – Use concise, lowercase names such as dev, prod, or qa.
• Avoid Special Characters or Spaces – Stick to alphanumeric characters and simple words to ensure compatibility across all resource naming conventions.
Using clear and consistent environment names helps maintain organization, prevents resource conflicts, and simplifies management across your AWS and Snowflake deployments.
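As a quick illustration, these guidelines could be checked with a simple pattern. The helper and the exact length limit are assumptions for the sketch, not part of the template:

```python
import re

# Hypothetical guard: short (2-10 chars), lowercase, alphanumeric only.
ENV_NAME = re.compile(r"^[a-z][a-z0-9]{1,9}$")

def is_valid_env_name(name: str) -> bool:
    """Return True if the name follows the guidelines above."""
    return ENV_NAME.fullmatch(name) is not None

print(is_valid_env_name("prod"))    # True
print(is_valid_env_name("My Env"))  # False: uppercase and a space
```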
We maintain a YAML file for every data source that defines the structure of the landing tables in Snowflake.
This YAML schema ensures each source's data model is clearly documented and versioned.
The format of this YAML is inspired by the Data Contract Specification.
• Documentation of the Data Model – Having a human-readable YAML file allows any team member to quickly see the tables, columns, and data types for a particular source, making it easier to understand how the data flows through the pipeline.
• Automated Table Creation – Our Terraform Snowpipe configuration can automatically read the YAML file to create landing tables for each source. You don't have to manually create or update your Snowflake tables whenever you change your schema.
• Data Contract Enforcement – The YAML schema acts as a contract between the ingestion layer and Snowflake. If the data in your pipeline doesn't match the declared schema, it can trigger validation rules or highlight mismatches, preventing corrupt or malformed data from being loaded.
• Version Control – By checking the YAML schema into Git, you can track when and why schema changes occur (adding, removing, or modifying columns). This history helps you audit and review changes before they reach production.
• Data Quality Visibility – Discrepancies between the defined schema and the real data can indicate potential data issues. Because schema mismatches are surfaced early, issues can be caught quickly—before they cause bigger downstream problems.
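To make the contract-enforcement idea concrete, here is a minimal sketch of how mismatches between a record and the declared fields could be surfaced. `find_violations` is a hypothetical helper for illustration, not part of the template:

```python
# Sketch: compare an incoming record against the declared schema fields.
def find_violations(record: dict, fields: dict) -> list[str]:
    """Return a human-readable list of contract violations."""
    problems = []
    for name, spec in fields.items():
        if spec.get("required") and record.get(name) is None:
            problems.append(f"missing required field: {name}")
    for name in record:
        if name not in fields:
            problems.append(f"undeclared field: {name}")
    return problems

# Declared fields, mirroring the _dlt_loads model shown below.
fields = {
    "load_id": {"type": "VARCHAR", "required": True},
    "status": {"type": "NUMBER(19,0)", "required": True},
}

print(find_violations({"load_id": "17099", "extra": 1}, fields))
# -> ['missing required field: status', 'undeclared field: extra']
```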
Below is an abbreviated example of a YAML schema file for a Chess data source.
It illustrates how multiple tables (or "models") are defined within a single file, listing each column's data type and whether it's required, unique, or part of a primary key.
```yaml
dataContractSpecification: 1.1.0
id: chess
info:
  title: chess
  version: 1.1.0
models:
  players_profiles:
    description: ''
    fields:
      username:
        type: VARCHAR
        required: false
        primaryKey: false
        unique: false
        description: ''
      last_online:
        type: TIMESTAMP_TZ
        required: false
        primaryKey: false
        unique: false
        description: ''
      joined:
        type: TIMESTAMP_TZ
        required: false
        primaryKey: false
        unique: false
        description: ''
    file_format: PARQUET
  players_games:
    description: ''
    fields:
      end_time:
        type: TIMESTAMP_TZ
        required: false
        primaryKey: false
        unique: false
        description: ''
      white__username:
        type: VARCHAR
        required: false
        primaryKey: false
        unique: false
        description: ''
      black__username:
        type: VARCHAR
        required: false
        primaryKey: false
        unique: false
        description: ''
    file_format: PARQUET
  _dlt_loads:
    description: Created by DLT. Tracks completed loads
    fields:
      load_id:
        type: VARCHAR
        required: true
        primaryKey: false
        unique: false
        description: ''
      status:
        type: NUMBER(19,0)
        required: true
        primaryKey: false
        unique: false
        description: ''
    file_format: JSON
```
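As a sketch of the automated-table-creation step, a schema like the one above (once parsed, e.g. with PyYAML) can be rendered into environment-prefixed Snowflake DDL. The `<env>_<id>.landing` naming and the `landing_ddl` helper are assumptions for illustration, not the template's actual implementation:

```python
# Sketch: render CREATE TABLE statements from a parsed data-contract schema.
# The dict mirrors an abbreviated slice of the YAML above.
schema = {
    "id": "chess",
    "models": {
        "players_profiles": {
            "fields": {
                "username": {"type": "VARCHAR", "required": False},
                "last_online": {"type": "TIMESTAMP_TZ", "required": False},
            },
        },
    },
}

def landing_ddl(environment: str, schema: dict) -> list[str]:
    """One CREATE TABLE per model, with the database prefixed by the environment."""
    database = f"{environment}_{schema['id']}"
    statements = []
    for table, model in schema["models"].items():
        columns = ",\n  ".join(
            f"{name} {spec['type']}{' NOT NULL' if spec.get('required') else ''}"
            for name, spec in model["fields"].items()
        )
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {database}.landing.{table} (\n  {columns}\n);"
        )
    return statements

print(landing_ddl("dev", schema)[0])
```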