Project Structure
Contents of the pipelines/ folder
Data pipelines are built using a two-layer architecture:
ingest/ for data ingestion from source to S3
transform/ for data transformation in Snowflake via dbt
For each layer, the application code is in a separate folder while the underlying infrastructure is defined in terraform files in pipelines/.
```
pipelines/
├── ingest/
├── transform/
├── *.tf
```
A typical data flow looks like this:
[diagram: data flow from the source through the Lambda, S3, and Snowflake]
Ingestion Layer
The ingest layer is composed of three artifacts:
the ingestion code in pipelines/ingest/{SOURCE_NAME}-ingestion/
the data source schema in ingest/{SOURCE_NAME}_source_schema.yml
the infrastructure code (terraform) in pipelines/*.tf
Let's take the example of the chess.com pipeline provided in this repo:
```
pipelines/
├── ingest/
│   ├── chess-lambda/
│   │   ├── lambda_handler.py       # Lambda code embedding dlt for Chess.com ingestion
│   │   └── ...
│   └── chess_source_schema.yml     # YAML file defining the Chess.com data schema
├── chess_lambda.tf                 # Terraform creating the Lambda function
├── ingestion_bucket.tf             # Terraform creating the target S3 bucket
...
```
The ingestion is done in a Lambda function embedding dlt, with:
Source code in pipelines/ingest/chess-ingestion
Terraform in pipelines/chess_lambda.tf
This Lambda writes to a bucket defined in ingestion_bucket.tf.
We maintain a YAML file for each data source, ingest/{source_name}_source_schema.yml, to track the source schema and automatically create the landing tables in the Snowflake warehouse.
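As an illustration, such a schema file might look like the sketch below. The exact layout, keys, and column names here are assumptions for illustration; the real format is whatever ingestion_snowpipe.tf expects:

```yaml
# Hypothetical ingest/chess_source_schema.yml — layout and names are illustrative
tables:
  players:
    columns:
      username: VARCHAR
      rating: NUMBER
      last_online: TIMESTAMP_NTZ
  games:
    columns:
      game_id: VARCHAR
      white: VARCHAR
      black: VARCHAR
      end_time: TIMESTAMP_NTZ
```

Keeping the schema next to the ingestion code means a single file drives both the source extraction and the landing tables created downstream.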
Transformation Layer
The transform layer is composed of two artifacts:
The transformation code in transform/ (typically a dbt project)
The infrastructure code (terraform) in pipelines/*.tf
S3 -> Snowflake
ingestion_snowpipe.tf automatically reads all the YAML files in the ingest/ folder and creates:
all the landing tables in Snowflake
all the Snowflake pipes that automatically copy the data from S3 into these tables
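The generated objects are roughly equivalent to the following Snowflake DDL. The table, stage, and pipe names below are hypothetical; the real ones are derived from the schema YAML by ingestion_snowpipe.tf:

```sql
-- Hypothetical landing table generated from chess_source_schema.yml
CREATE TABLE IF NOT EXISTS landing.chess_players (
    username    VARCHAR,
    rating      NUMBER,
    last_online TIMESTAMP_NTZ
);

-- Snowpipe that copies new files from the S3 ingestion bucket into the table
CREATE PIPE IF NOT EXISTS landing.chess_players_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO landing.chess_players
  FROM @landing.ingestion_stage/chess/players/
  FILE_FORMAT = (TYPE = 'JSON');
```

With AUTO_INGEST = TRUE, an S3 bucket event notification triggers the pipe whenever a new file lands, so no scheduler is needed for the S3 -> Snowflake copy.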
transform/ is a standard dbt project with models split into two folders (each mapped to a schema in Snowflake):
STAGING: for the transformed data
MART: for the data ready to be used by the business
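For example, a STAGING model reading from a landing table could look like the following sketch (the model, source, and column names are hypothetical):

```sql
-- models/staging/stg_chess_players.sql (hypothetical)
-- Cleans the raw landing data into the STAGING schema
select
    username,
    rating::number                       as rating,
    convert_timezone('UTC', last_online) as last_online_utc
from {{ source('landing', 'chess_players') }}
```

A MART model would then select from `{{ ref('stg_chess_players') }}` to build the business-facing tables.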
The dbt project is run in an ECS task (ecs_task_dbt.tf).
Let's take the example of the chess pipeline provided in this repo:
[diagram of the chess pipeline]
Terraform modules used in pipelines/:
terraform-aws-modules/iam/aws//modules/iam-policy
terraform-aws-modules/ecr/aws
terraform-aws-modules/lambda/aws
terraform-aws-modules/step-functions/aws
terraform-aws-modules/secrets-manager/aws
terraform-aws-modules/ecr/aws
terraform-aws-modules/ssm-parameter/aws
terraform-aws-modules/ecs/aws//modules/service
terraform-aws-modules/iam/aws//modules/iam-assumable-role
terraform-aws-modules/s3-bucket/aws
terraform-aws-modules/s3-bucket/aws//modules/notification
Inputs:
The name of the ECS cluster
The environment to deploy to (will prefix the name of all resources)
The name of the VPC to deploy the ECS cluster in
No outputs.