pipelines/

Contents of the pipelines/ folder

Data pipelines are built using a two-layer architecture:

  • ingest/ for data ingestion from source to S3

  • transform/ for data transformation in Snowflake via dbt

For each layer, the application code lives in its own folder, while the underlying infrastructure is defined in Terraform files at the root of pipelines/.

pipelines/
├── ingest/
├── transform/
├── *.tf

A typical data flow looks like this: data is pulled from the source by a lambda function embedding dlt, landed as files in S3, copied into Snowflake landing tables by Snowpipe, and finally transformed by dbt into staging and mart models.

Ingestion Layer

The ingest layer is composed of three artifacts:

  • the ingestion code in pipelines/ingest/{SOURCE_NAME}-ingestion/

  • the data source schema in pipelines/ingest/{SOURCE_NAME}_source_schema.yml

  • the infrastructure code (terraform) in pipelines/*.tf

Let's take the chess.com pipeline example provided in this repo:

pipelines/
├── ingest/
│   ├── chess-ingestion/
│   │   ├── lambda_handler.py     # Lambda code embedding DLT for Chess.com ingestion
│   │   └── ...
│   └── chess_source_schema.yml   # YAML file defining the Chess.com data schema
├── chess_lambda.tf               # Terraform creating the lambda function
├── ingestion_bucket.tf           # Terraform creating target S3 bucket
...

The ingestion is done in a lambda function embedding dlt with:

  • Source code in pipelines/ingest/chess-ingestion

  • Terraform in pipelines/chess_lambda.tf

This lambda writes to a bucket defined in ingestion_bucket.tf.
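
For illustration, here is a minimal sketch of what such a handler can look like, using dlt's filesystem destination to write to the ingestion bucket. The environment variable, resource definition and default usernames are placeholders for the example, not the exact code shipped in this repo:

# Illustrative sketch only: names and defaults below are assumptions, not the repo's actual handler.
import os

import dlt
import requests


@dlt.resource(name="players", write_disposition="replace")
def chess_players(usernames):
    # Pull player profiles from the public chess.com API and yield them as rows
    for username in usernames:
        resp = requests.get(f"https://api.chess.com/pub/player/{username}", timeout=10)
        resp.raise_for_status()
        yield resp.json()


def lambda_handler(event, context):
    # dlt's filesystem destination writes the load packages directly to S3
    pipeline = dlt.pipeline(
        pipeline_name="chess",
        destination=dlt.destinations.filesystem(bucket_url=f"s3://{os.environ['INGESTION_BUCKET']}"),
        dataset_name="chess",
    )
    load_info = pipeline.run(chess_players(event.get("usernames", ["magnuscarlsen"])))
    return {"statusCode": 200, "body": str(load_info)}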

Get more info on how to run/test this lambda here

We maintain a YAML file for each data source, pipelines/ingest/{source_name}_source_schema.yml, to track the source schema and automatically create the landing tables in the Snowflake warehouse.
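
Purely as an illustration, the sketch below assumes a simple tables/columns layout for such a schema file and shows how it could be turned into landing-table DDL; the actual file format may differ, and in this repo that step is handled by Terraform (see ingestion_snowpipe.tf below):

# Illustration only: the tables/columns layout below is an assumed example,
# not necessarily the schema format used by this template.
import yaml

example_schema = """
tables:
  - name: chess_games
    columns:
      - {name: game_id, type: VARCHAR}
      - {name: end_time, type: TIMESTAMP_NTZ}
      - {name: pgn, type: VARCHAR}
"""

schema = yaml.safe_load(example_schema)
for table in schema["tables"]:
    columns = ", ".join(f"{col['name']} {col['type']}" for col in table["columns"])
    # One landing table per entry in the schema file
    print(f"CREATE TABLE IF NOT EXISTS LANDING.{table['name'].upper()} ({columns});")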

Transform Layer

The transform layer is composed of two artifacts:

  • The transformation code in pipelines/transform/ (typically a dbt project)

  • The infrastructure code (terraform) in pipelines/*.tf

S3 -> Snowflake

ingestion_snowpipe.tf automatically reads all the yml files in the ingest/ folder and creates:

  • all landing tables in Snowflake

  • all the Snowpipe pipes that automatically copy the data from S3 into these tables

dbt

transform/ is a standard dbt project with models split into two folders (which map to schemas in Snowflake):

  • STAGING: for the transformed data

  • MART: for the data ready to be used by the business

The dbt project is run in an ECS task (ecs_task_dbt.tf).
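
As a rough sketch of what the ECS task boils down to, the entrypoint below invokes dbt programmatically (dbt-core 1.5+); this is only one possible setup, and the prod target name is a placeholder:

# Illustrative ECS entrypoint: equivalent to running `dbt run` inside the container.
# The "prod" target name is an assumption for the example.
import sys

from dbt.cli.main import dbtRunner


def main() -> int:
    result = dbtRunner().invoke(["run", "--target", "prod"])
    return 0 if result.success else 1


if __name__ == "__main__":
    sys.exit(main())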

You can get more info on this project and how to run dbt locally and remotely here.

Let's take the example of the chess pipeline provided in this repo:


pipelines/
├── transform/
│   ├── models/
│   │   └── staging/
│   │       └── chess/                    # Chess staging models
│   │           ├── stg_chess_games.sql
│   │           ├── stg_chess_players.sql
│   │           ├── stg_chess_players_games.sql
│   │           ├── stg_chess_...
│   │
│   └── sources/
│       └── chess.yml
├── ingestion_snowpipe.tf       # Terraform for Snowflake landing table + snowpipe creation
├── ecs_task_dbt.tf             # Terraform for creating the ECS task running dbt in AWS
...

Terraform Module

Example Usage

module "chess_lambda" {
  source = "git::https://github.com/boringdata/boringdata-template-aws-snowflake.git//modules/chess_lambda"
  environment = "prod"
  vpc_name = "vpc-12345678"
  ecs_cluster_name = "ecs-cluster-12345678"
}

Diagram

Requirements

Name
Version

>=1.5.7

>=5.63.1

>=1.0.0

Providers

Name
Version

5.92.0

3.2.3

0.13.0

Modules

  • terraform-aws-modules/iam/aws//modules/iam-policy (version 5.39.1)

  • terraform-aws-modules/ecr/aws (version n/a)

  • terraform-aws-modules/lambda/aws (version 7.2.1)

  • terraform-aws-modules/step-functions/aws (version 4.2.1)

  • terraform-aws-modules/secrets-manager/aws (version 1.1.2)

  • terraform-aws-modules/ecr/aws (version n/a)

  • terraform-aws-modules/ssm-parameter/aws (version 1.1.1)

  • terraform-aws-modules/ecs/aws//modules/service (version 5.11.2)

  • terraform-aws-modules/iam/aws//modules/iam-assumable-role (version 5.39.1)

  • terraform-aws-modules/s3-bucket/aws (version 4.1.0)

  • terraform-aws-modules/s3-bucket/aws//modules/notification (version n/a)

Resources

Inputs

  • ecs_cluster_name (string, default: null, optional): The name of the ECS cluster

  • environment (string, required): The environment to deploy to - will prefix the name of all resources

  • vpc_name (string, default: null, optional): The name of the VPC to deploy the ECS cluster in

Outputs

No outputs.
