Boring Data
Template: AWS+Snowflake

pipelines/

Contents of the pipelines/ folder


Last updated 4 months ago


Data pipelines are built using a two-layer architecture:

  • ingest/ for data ingestion from source to S3

  • transform/ for data transformation in Snowflake via dbt

For each layer, the application code lives in its own folder, while the underlying infrastructure is defined in Terraform files directly under pipelines/.

pipelines/
├── ingest/
├── transform/
├── *.tf

A typical data flow looks like this:

Ingestion Layer

The ingest layer is composed of three artifacts:

  • the ingestion code in pipelines/ingest/{SOURCE_NAME}-lambda/

  • the data source schema in ingest/{SOURCE_NAME}_source_schema.yml

  • the infrastructure code (terraform) in pipelines/*.tf

Let's take the chess.com pipeline example provided in this repo:

pipelines/
├── ingest/
│   ├── chess-lambda/
│   │   ├── lambda_handler.py     # Lambda code embedding DLT for Chess.com ingestion
│   │   └── ...
│   └── chess_source_schema.yml   # YAML file defining the Chess.com data schema
├── chess_lambda.tf               # Terraform creating the lambda function
├── ingestion_bucket.tf           # Terraform creating target S3 bucket
...

The ingestion is done in a Lambda function embedding dlt, with:

  • Source code in pipelines/ingest/chess-lambda/

  • Terraform in pipelines/chess_lambda.tf

This lambda writes to a bucket defined in ingestion_bucket.tf.

Get more info on how to run/test this Lambda here.
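A minimal sketch of what such a handler could look like (purely illustrative: the real handler embeds dlt, and the event shape, helper name, and key layout below are assumptions, not the repo's actual code):

```python
import datetime
import json


def build_object_key(source_name: str, run_date: datetime.date) -> str:
    """Build a date-partitioned S3 key for one ingestion run (illustrative layout)."""
    return f"{source_name}/{run_date:%Y/%m/%d}/data.jsonl"


def lambda_handler(event, context):
    """Hypothetical entry point: take extracted records and land them in S3 as JSONL.

    The actual handler in pipelines/ingest/chess-lambda/lambda_handler.py embeds
    dlt to extract from Chess.com; this sketch only shows the overall shape.
    """
    import boto3  # provided by the AWS Lambda runtime

    bucket = event["bucket"]            # ingestion bucket name, assumed in the event
    records = event.get("records", [])  # extracted rows (source-dependent)
    key = build_object_key("chess", datetime.date.today())

    body = "\n".join(json.dumps(r) for r in records)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode())
    return {"status": "ok", "key": key, "rows": len(records)}
```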

We maintain a YAML file for each data source ingest/{source_name}_source_schema.yml to track the source schema and automatically create landing tables in the Snowflake Warehouse.
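To illustrate, a schema file for the chess source could look like this (table and column names here are assumed, not the template's actual schema):

```yaml
# ingest/chess_source_schema.yml — illustrative content
tables:
  players:
    columns:
      username: VARCHAR
      country: VARCHAR
      joined_at: TIMESTAMP_NTZ
  games:
    columns:
      game_id: VARCHAR
      end_time: TIMESTAMP_NTZ
      pgn: VARCHAR
```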

Why do you need to generate a YAML file for each source?

Transform Layer

The transform layer is composed of two artifacts:

  • The transformation code in transform/ (typically a dbt project)

  • The infrastructure code (terraform) in pipelines/*.tf

S3 -> Snowflake

ingestion_snowpipe.tf automatically reads all the YAML files in the ingest/ folder and creates:

  • all landing tables in Snowflake

  • all the Snowflake pipes that automatically copy the data from S3 into these tables
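A sketch of how a Terraform file like this can fan out over the schema files (the resource shape and names are assumptions; `fileset` and `yamldecode` are standard Terraform functions):

```hcl
locals {
  # One entry per ingest/{source_name}_source_schema.yml
  source_schemas = {
    for f in fileset("${path.module}/ingest", "*_source_schema.yml") :
    trimsuffix(f, "_source_schema.yml") => yamldecode(file("${path.module}/ingest/${f}"))
  }
}

# One landing table per source schema (columns would be derived from the YAML)
resource "snowflake_table" "landing" {
  for_each = local.source_schemas
  name     = upper("${each.key}_landing")
  # ...
}
```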

dbt

transform/ is a standard dbt project with models split into two folders (mapped to schemas in Snowflake):

  • STAGING: for the transformed data

  • MART: for data ready to be consumed by the business
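For example, a staging model such as stg_chess_games.sql might look like this (the source and column names are illustrative, assuming a chess source declared in sources/chess.yml):

```sql
-- models/staging/chess/stg_chess_games.sql — illustrative
select
    game_id,
    white_player,
    black_player,
    end_time::timestamp_ntz as ended_at
from {{ source('chess', 'games') }}
```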

The dbt project runs in an ECS task (ecs_task_dbt.tf).

You can get more info on this project and how to run dbt locally and remotely here.

Let's take the example of the chess pipeline provided in this repo:


pipelines/
├── transform/
│   ├── models/
│   │   └── staging/
│   │       └── chess/                    # Chess staging models
│   │           ├── stg_chess_games.sql
│   │           ├── stg_chess_players.sql
│   │           ├── stg_chess_players_games.sql
│   │           ├── stg_chess_...
│   │
│   └── sources/
│       └── chess.yml
├── ingestion_snowpipe.tf       # Terraform for Snowflake landing table + snowpipe creation
├── ecs_task_dbt.tf             # Terraform for creating the ECS task running dbt in AWS
...

Terraform Module

Example Usage

module "chess_lambda" {
  source = "git::https://github.com/boringdata/boringdata-template-aws-snowflake.git//modules/chess_lambda"
  environment = "prod"
  vpc_name = "vpc-12345678"
  ecs_cluster_name = "ecs-cluster-12345678"
}

Diagram

Requirements

Name        Version
terraform   >=1.5.7
aws         >=5.63.1
snowflake   >=1.0.0

Providers

Name        Version
aws         5.92.0
null        3.2.3
snowflake   1.0.4
time        0.13.0

Modules

Name                             Source                                                      Version
bucket_ingestion_read_policies   terraform-aws-modules/iam/aws//modules/iam-policy           5.39.1
chess_ecr                        terraform-aws-modules/ecr/aws                               n/a
chess_lambda_function            terraform-aws-modules/lambda/aws                            7.2.1
chess_pipeline                   terraform-aws-modules/step-functions/aws                    4.2.1
chess_secrets                    terraform-aws-modules/secrets-manager/aws                   1.1.2
dbt_ecr                          terraform-aws-modules/ecr/aws                               n/a
dbt_task_definition              terraform-aws-modules/ssm-parameter/aws                     1.1.1
ecs_task_definition_dbt          terraform-aws-modules/ecs/aws//modules/service              5.11.2
iam_role_assumable_snowflake     terraform-aws-modules/iam/aws//modules/iam-assumable-role   5.39.1
ingestion_bucket                 terraform-aws-modules/s3-bucket/aws                         4.1.0
s3_notifications                 terraform-aws-modules/s3-bucket/aws//modules/notification   n/a

Resources

Name                                                         Type
null_resource.chess_empty_image                              resource
snowflake_pipe.ingestion_pipes                               resource
snowflake_stage.snowflake_landing_stage                      resource
snowflake_storage_integration.storage_integration            resource
snowflake_table.snowflake_landing_tables                     resource
time_sleep.wait_for_iam_role                                 resource
aws_caller_identity.this                                     data source
aws_ecs_cluster.ecs-cluster                                  data source
aws_iam_policy_document.bucket_ingestion_read_write_policy   data source
aws_region.this                                              data source
aws_subnets.subnets                                          data source
aws_vpc.vpc                                                  data source

Inputs

Name               Description                                                            Type     Default   Required
ecs_cluster_name   The name of the ECS cluster                                            string   null      no
environment        The environment to deploy to - will prefix the name of all resources   string   n/a       yes
vpc_name           The name of the VPC to deploy the ECS cluster in                       string   null      no

Outputs

No outputs.