pipelines/
Contents of the pipelines/ folder
Data pipelines are built using a 2-layer architecture:
ingest/ for data ingestion from source to S3
transform/ for data transformation in Snowflake via dbt
For each layer, the application code lives in a dedicated folder, while the underlying infrastructure is defined in Terraform files at the root of pipelines/.
pipelines/
├── ingest/
├── transform/
└── *.tf
A typical data flow looks like this: data source -> ingestion Lambda (dlt) -> S3 -> Snowpipe -> Snowflake landing tables -> dbt (STAGING -> MART).
Ingestion Layer
The ingest layer is composed of three artifacts:
the ingestion code in pipelines/ingest/{SOURCE_NAME}-ingestion/
the data source schema in pipelines/ingest/{SOURCE_NAME}_source_schema.yml
the infrastructure code (Terraform) in pipelines/*.tf
Let's take the chess.com pipeline example provided in this repo:
pipelines/
├── ingest/
│ ├── chess-lambda/
│ │ ├── lambda_handler.py # Lambda code embedding DLT for Chess.com ingestion
│ │ └── ...
│ └── chess_source_schema.yml # YAML file defining the Chess.com data schema
├── chess_lambda.tf # Terraform creating the lambda function
├── ingestion_bucket.tf # Terraform creating target S3 bucket
...
The ingestion is done in a Lambda function embedding dlt, with:
source code in pipelines/ingest/chess-lambda/
Terraform in pipelines/chess_lambda.tf
This Lambda writes to the S3 bucket defined in ingestion_bucket.tf.
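To make the pattern concrete, here is a minimal sketch of what such a handler could look like, assuming dlt's verified chess.com source and its filesystem destination pointed at S3. The import path, environment variables, and parameters below are illustrative, not the template's actual code.

# Illustrative only -- assumes the dlt verified chess.com source is vendored
# next to the handler and INGESTION_BUCKET_URL is injected by Terraform.
import os

import dlt
from chess import source  # hypothetical import of the vendored verified source


def lambda_handler(event, context):
    # Point dlt's filesystem destination at the bucket from ingestion_bucket.tf
    os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = os.environ["INGESTION_BUCKET_URL"]

    pipeline = dlt.pipeline(
        pipeline_name="chess",
        destination="filesystem",  # writes load files (e.g. JSONL) to S3
        dataset_name="chess",
    )
    load_info = pipeline.run(
        source(players=["magnuscarlsen"], start_month="2024/01", end_month="2024/02")
    )
    return {"status": "ok", "load_info": str(load_info)}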
We maintain a YAML file for each data source (pipelines/ingest/{source_name}_source_schema.yml) to track the source schema and automatically create the landing tables in the Snowflake warehouse.
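The exact format of these schema files is defined by the template; purely as a hypothetical illustration, such a file could enumerate each table with its columns and Snowflake types, which is enough for Terraform to derive matching landing tables:

# Hypothetical shape of a {source_name}_source_schema.yml -- not the template's actual format
tables:
  games:
    columns:
      url: VARCHAR
      end_time: TIMESTAMP_NTZ
      rated: BOOLEAN
  players:
    columns:
      username: VARCHAR
      joined: TIMESTAMP_NTZ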
Transform Layer
The transform layer is composed of two artifacts:
the transformation code in transform/ (typically a dbt project)
the infrastructure code (Terraform) in pipelines/*.tf
S3 -> Snowflake
ingestion_snowpipe.tf automatically reads all the YAML files in the ingest/ folder and creates:
all the landing tables in Snowflake
all the Snowpipes that automatically copy the data from S3 into these tables
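As a rough mental model, the provisioned objects are equivalent to the Snowflake SQL below. This is an illustrative sketch, not the template's actual Terraform output; the table, stage, and column names are hypothetical.

-- Illustrative only; the real objects are generated from the YAML schema files.
CREATE TABLE IF NOT EXISTS LANDING.CHESS_GAMES (
    URL      VARCHAR,
    END_TIME TIMESTAMP_NTZ,
    RATED    BOOLEAN
);

-- AUTO_INGEST lets S3 event notifications trigger the load automatically.
CREATE PIPE IF NOT EXISTS LANDING.CHESS_GAMES_PIPE AUTO_INGEST = TRUE AS
    COPY INTO LANDING.CHESS_GAMES (URL, END_TIME, RATED)
    FROM (
        SELECT $1:url::VARCHAR, $1:end_time::TIMESTAMP_NTZ, $1:rated::BOOLEAN
        FROM @LANDING.INGESTION_STAGE/chess/games/
    )
    FILE_FORMAT = (TYPE = 'JSON');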
dbt
transform/ is a standard dbt project with models split into two folders (each materialized in its own Snowflake schema):
STAGING: for the transformed data
MART: for the data ready to be used by the business
The dbt project is run in an ECS task (ecs_task_dbt.tf).
Let's take the example of the chess pipeline provided in this repo:
pipelines/
├── transform/
│   ├── models/
│   │   └── staging/
│   │       └── chess/ # Chess staging models
│   │           ├── stg_chess_games.sql
│   │           ├── stg_chess_players.sql
│   │           ├── stg_chess_players_games.sql
│   │           ├── stg_chess_...
│   │
│   └── sources/
│       └── chess.yml
├── ingestion_snowpipe.tf # Terraform for Snowflake landing table + Snowpipe creation
├── ecs_task_dbt.tf # Terraform for creating the ECS task running dbt in AWS
...
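A staging model such as stg_chess_games.sql typically just selects from the landing table declared in chess.yml, renaming and typing columns. The dbt source() syntax below is standard, but the column names are hypothetical, not the repo's actual model.

-- transform/models/staging/chess/stg_chess_games.sql (illustrative sketch)
-- Reads the landing table declared as a dbt source in sources/chess.yml.
with games as (
    select * from {{ source('chess', 'games') }}
)

select
    url      as game_url,
    end_time as ended_at,
    rated    as is_rated
from games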
Terraform Module
Example Usage
module "chess_lambda" {
source = "git::https://github.com/boringdata/boringdata-template-aws-snowflake.git//modules/chess_lambda"
environment = "prod"
vpc_name = "vpc-12345678"
ecs_cluster_name = "ecs-cluster-12345678"
}