Boring Data
Template: AWS+Snowflake

pipelines/

Contents of the pipelines/ folder


Last updated 4 months ago


Data pipelines are built using a two-layer architecture:

  • ingest/ for data ingestion from source to S3

  • transform/ for data transformation in Snowflake via dbt

For each layer, the application code lives in its own folder, while the underlying infrastructure is defined in Terraform files directly under pipelines/.

pipelines/
├── ingest/
├── transform/
├── *.tf

A typical data flow looks like this:

Ingestion Layer

The ingest layer is composed of three artifacts:

  • the ingestion code in pipelines/ingest/{SOURCE_NAME}-lambda/

  • the data source schema in ingest/{SOURCE_NAME}_source_schema.yml

  • the infrastructure code (terraform) in pipelines/*.tf

Let's take the chess.com pipeline example provided in this repo:

pipelines/
├── ingest/
│   ├── chess-lambda/
│   │   ├── lambda_handler.py     # Lambda code embedding DLT for Chess.com ingestion
│   │   └── ...
│   └── chess_source_schema.yml   # YAML file defining the Chess.com data schema
├── chess_lambda.tf               # Terraform creating the lambda function
├── ingestion_bucket.tf           # Terraform creating target S3 bucket
...

The ingestion is done in a Lambda function embedding dlt, with:

  • Source code in pipelines/ingest/chess-lambda/

  • Terraform in pipelines/chess_lambda.tf

This lambda writes to a bucket defined in ingestion_bucket.tf.

Get more info on how to run/test this Lambda here.
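A minimal sketch of what such a handler could look like (purely illustrative: the real handler embeds dlt, and the event shape, helper name, and key layout below are assumptions, not the repo's actual code):

```python
import datetime
import json


def build_object_key(source_name: str, run_date: datetime.date) -> str:
    """Build a date-partitioned S3 key for one ingestion run (illustrative layout)."""
    return f"{source_name}/{run_date:%Y/%m/%d}/data.jsonl"


def lambda_handler(event, context):
    """Hypothetical entry point: take extracted records and land them in S3 as JSONL.

    The actual handler in pipelines/ingest/chess-lambda/lambda_handler.py embeds
    dlt to extract from Chess.com; this sketch only shows the overall shape.
    """
    import boto3  # provided by the AWS Lambda runtime

    bucket = event["bucket"]            # ingestion bucket name, assumed in the event
    records = event.get("records", [])  # extracted rows (source-dependent)
    key = build_object_key("chess", datetime.date.today())

    body = "\n".join(json.dumps(r) for r in records)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode())
    return {"status": "ok", "key": key, "rows": len(records)}
```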

We maintain a YAML file for each data source ingest/{source_name}_source_schema.yml to track the source schema and automatically create landing tables in the Snowflake Warehouse.
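To illustrate, a schema file for the chess source could look like this (table and column names here are assumed, not the template's actual schema):

```yaml
# ingest/chess_source_schema.yml — illustrative content
tables:
  players:
    columns:
      username: VARCHAR
      country: VARCHAR
      joined_at: TIMESTAMP_NTZ
  games:
    columns:
      game_id: VARCHAR
      end_time: TIMESTAMP_NTZ
      pgn: VARCHAR
```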

Why do you need to generate a YAML file for each source?

Transform Layer

The transform layer is composed of two artifacts:

  • The transformation code in transform/ (typically a dbt project)

  • The infrastructure code (terraform) in pipelines/*.tf

S3 -> Snowflake

ingestion_snowpipe.tf automatically reads all the YAML files in the ingest/ folder and creates:

  • all landing tables in Snowflake

  • all the Snowflake pipes that automatically copy the data from S3 into these tables
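A sketch of how a Terraform file like this can fan out over the schema files (the resource shape and names are assumptions; `fileset` and `yamldecode` are standard Terraform functions):

```hcl
locals {
  # One entry per ingest/{source_name}_source_schema.yml
  source_schemas = {
    for f in fileset("${path.module}/ingest", "*_source_schema.yml") :
    trimsuffix(f, "_source_schema.yml") => yamldecode(file("${path.module}/ingest/${f}"))
  }
}

# One landing table per source schema (columns would be derived from the YAML)
resource "snowflake_table" "landing" {
  for_each = local.source_schemas
  name     = upper("${each.key}_landing")
  # ...
}
```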

dbt

transform/ is a standard dbt project with models split into two folders (mapped to schemas in Snowflake):

  • STAGING: for the transformed data

  • MART: for data ready to be consumed by the business
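For example, a staging model such as stg_chess_games.sql might look like this (the source and column names are illustrative, assuming a chess source declared in sources/chess.yml):

```sql
-- models/staging/chess/stg_chess_games.sql — illustrative
select
    game_id,
    white_player,
    black_player,
    end_time::timestamp_ntz as ended_at
from {{ source('chess', 'games') }}
```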

The dbt project runs in an ECS task (ecs_task_dbt.tf).

You can get more info on this project and how to run dbt locally and remotely here.

Let's take the example of the chess pipeline provided in this repo:


pipelines/
├── transform/
│   ├── models/
│   │   └── staging/
│   │       └── chess/                    # Chess staging models
│   │           ├── stg_chess_games.sql
│   │           ├── stg_chess_players.sql
│   │           ├── stg_chess_players_games.sql
│   │           ├── stg_chess_...
│   │
│   └── sources/
│       └── chess.yml
├── ingestion_snowpipe.tf       # Terraform for Snowflake landing table + snowpipe creation
├── ecs_task_dbt.tf             # Terraform for creating the ECS task running dbt in AWS
...

Terraform Module

Example Usage

module "chess_lambda" {
  source = "git::https://github.com/boringdata/boringdata-template-aws-snowflake.git//modules/chess_lambda"
  environment = "prod"
  vpc_name = "vpc-12345678"
  ecs_cluster_name = "ecs-cluster-12345678"
}

Diagram

Requirements

Name        Version
terraform   >=1.5.7
aws         >=5.63.1
snowflake   >=1.0.0

Providers

Name        Version
aws         5.92.0
null        3.2.3
snowflake   1.0.4
time        0.13.0

Modules

Name                             Source                                                      Version
bucket_ingestion_read_policies   terraform-aws-modules/iam/aws//modules/iam-policy           5.39.1
chess_ecr                        terraform-aws-modules/ecr/aws                               n/a
chess_lambda_function            terraform-aws-modules/lambda/aws                            7.2.1
chess_pipeline                   terraform-aws-modules/step-functions/aws                    4.2.1
chess_secrets                    terraform-aws-modules/secrets-manager/aws                   1.1.2
dbt_ecr                          terraform-aws-modules/ecr/aws                               n/a
dbt_task_definition              terraform-aws-modules/ssm-parameter/aws                     1.1.1
ecs_task_definition_dbt          terraform-aws-modules/ecs/aws//modules/service              5.11.2
iam_role_assumable_snowflake     terraform-aws-modules/iam/aws//modules/iam-assumable-role   5.39.1
ingestion_bucket                 terraform-aws-modules/s3-bucket/aws                         4.1.0
s3_notifications                 terraform-aws-modules/s3-bucket/aws//modules/notification   n/a

Resources

Name                                                         Type
null_resource.chess_empty_image                              resource
snowflake_pipe.ingestion_pipes                               resource
snowflake_stage.snowflake_landing_stage                      resource
snowflake_storage_integration.storage_integration            resource
snowflake_table.snowflake_landing_tables                     resource
time_sleep.wait_for_iam_role                                 resource
aws_caller_identity.this                                     data source
aws_ecs_cluster.ecs-cluster                                  data source
aws_iam_policy_document.bucket_ingestion_read_write_policy   data source
aws_region.this                                              data source
aws_subnets.subnets                                          data source
aws_vpc.vpc                                                  data source

Inputs

Name               Description                                                            Type     Default   Required
ecs_cluster_name   The name of the ECS cluster                                            string   null      no
environment        The environment to deploy to - will prefix the name of all resources   string   n/a       yes
vpc_name           The name of the VPC to deploy the ECS cluster in                       string   null      no

Outputs

No outputs.