pipelines/
Data pipelines are built using a 2-layer architecture:
ingest/: data ingestion from source to S3
transform/: data transformation in Snowflake via dbt
For each layer, the application code is in a separate folder while the underlying infrastructure is defined in terraform files in pipelines/.
A typical data flow looks like this:
The ingest layer is composed of three artifacts:
the ingestion code in pipelines/ingest/{SOURCE_NAME}-ingestion/
the data source schema in ingest/{SOURCE_NAME}_source_schema.yml
the infrastructure code (terraform) in pipelines/*.tf
Let's take the chess.com pipeline provided in this repo as an example:
The ingestion is done in a lambda function embedding dlt with:
Source code in pipelines/ingest/chess-ingestion
Terraform in pipelines/chess_lambda.tf
This lambda writes to a bucket defined in ingestion_bucket.tf.
We maintain a YAML file for each data source, ingest/{source_name}_source_schema.yml, to track the source schema and automatically create landing tables in the Snowflake warehouse.
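As a rough illustration of how a schema file can drive landing-table creation, the sketch below turns an already-parsed schema into CREATE TABLE statements. The schema layout (a `tables` mapping of column names to Snowflake types), the `LANDING` schema name, the table naming convention, and the `build_landing_ddl` helper are all assumptions made for illustration. The real `_source_schema.yml` format may differ, and in this repo the tables are created by terraform rather than by Python.

```python
# Hypothetical parsed content of a chess source schema file (layout assumed).
CHESS_SCHEMA = {
    "tables": {
        "players_games": {
            "url": "VARCHAR",
            "end_time": "TIMESTAMP_NTZ",
            "rated": "BOOLEAN",
        },
    }
}


def build_landing_ddl(source_name, schema, landing_schema="LANDING"):
    """Return one CREATE TABLE statement per table declared in the schema."""
    statements = []
    for table_name, columns in schema["tables"].items():
        # Columns are emitted in the order they are declared in the schema.
        column_sql = ", ".join(
            f"{name} {sql_type}" for name, sql_type in columns.items()
        )
        full_name = f"{landing_schema}.{source_name}_{table_name}".upper()
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {full_name} ({column_sql});"
        )
    return statements


for statement in build_landing_ddl("chess", CHESS_SCHEMA):
    print(statement)
```

Keeping the schema in a versioned file means a change to the source's columns shows up as a reviewable diff before any landing table is altered.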
The transform layer is composed of two artifacts:
The transformation code in transform/ (typically a dbt project)
The infrastructure code (terraform) in pipelines/*.tf
ingestion_snowpipe.tf automatically reads all the yml files in the ingest/ folder and creates:
all landing tables in Snowflake
all Snowflake pipes that automatically copy the data from S3 to these tables
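The discovery step described above can be sketched in Python, purely as an illustration of the idea (the repo does this in terraform). The `discover_sources` and `build_pipe_sql` helpers, the `LANDING` schema, the `LANDING.INGESTION_STAGE` stage name, the S3 path layout, and the JSON file format are all assumptions, not taken from the repo.

```python
import tempfile
from pathlib import Path

SCHEMA_SUFFIX = "_source_schema.yml"


def discover_sources(ingest_dir):
    """Return the source names that have a schema file in ingest_dir, sorted."""
    return sorted(
        path.name[: -len(SCHEMA_SUFFIX)]
        for path in Path(ingest_dir).glob(f"*{SCHEMA_SUFFIX}")
    )


def build_pipe_sql(source_name, table_name, stage="LANDING.INGESTION_STAGE"):
    """Build a CREATE PIPE statement copying staged S3 files into a landing table."""
    target = f"LANDING.{source_name}_{table_name}".upper()
    return (
        f"CREATE PIPE IF NOT EXISTS {target}_PIPE AUTO_INGEST = TRUE AS "
        f"COPY INTO {target} FROM @{stage}/{source_name}/{table_name}/ "
        "FILE_FORMAT = (TYPE = 'JSON') "
        "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;"
    )


# Demo on a throwaway folder: only files matching the suffix are picked up.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "chess_source_schema.yml").touch()
    (Path(demo_dir) / "README.md").touch()
    sources = discover_sources(demo_dir)

print(sources)
print(build_pipe_sql("chess", "players_games"))
```

Because the list of sources is derived from the files present, adding a new schema file is enough to get a new landing table and pipe on the next deployment.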
transform/ is a standard dbt project with models split into two folders (schemas in Snowflake):
STAGING: for the transformed data
MART: for the data ready to be used by the business
The dbt project is run in an ECS task (ecs_task_dbt.tf).
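One way to picture how an ECS task can run dbt is to look at the command passed to the container. The sketch below builds the `overrides` payload accepted by the ECS RunTask API, with a `dbt build` command selecting both model folders. The container name `dbt`, the lowercase folder names `staging` and `mart`, the `prod` target, and the `dbt_container_overrides` helper are assumptions for illustration; the task defined in ecs_task_dbt.tf may well set its command differently.

```python
def dbt_container_overrides(container_name, models, target="prod"):
    """Build the `overrides` payload for an ECS RunTask call that runs dbt."""
    # `dbt build` runs and tests the selected models in dependency order.
    command = ["dbt", "build", "--select", *models, "--target", target]
    return {
        "containerOverrides": [
            {"name": container_name, "command": command},
        ]
    }


overrides = dbt_container_overrides("dbt", ["staging", "mart"])
print(" ".join(overrides["containerOverrides"][0]["command"]))
```

The resulting dictionary could be passed as the `overrides` argument of a RunTask request, which leaves the image and task definition untouched while changing which models are built.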
Let's take the example of the chess pipeline provided in this repo:
Requirements (version constraints): >=1.5.7, >=5.63.1, >=1.0.0
Providers (versions): 5.92.0, 3.2.3, 1.0.4, 0.13.0
Modules (source, version):
terraform-aws-modules/iam/aws//modules/iam-policy, 5.39.1
terraform-aws-modules/ecr/aws, n/a
terraform-aws-modules/lambda/aws, 7.2.1
terraform-aws-modules/step-functions/aws, 4.2.1
terraform-aws-modules/secrets-manager/aws, 1.1.2
terraform-aws-modules/ecr/aws, n/a
terraform-aws-modules/ssm-parameter/aws, 1.1.1
terraform-aws-modules/ecs/aws///modules/service, 5.11.2
terraform-aws-modules/iam/aws//modules/iam-assumable-role, 5.39.1
terraform-aws-modules/s3-bucket/aws, 4.1.0
terraform-aws-modules/s3-bucket/aws//modules/notification, n/a
Resources: 6 resources and 6 data sources.
Inputs (description, type, default, required):
The name of the ECS cluster, string, null, no
The environment to deploy to - will prefix the name of all resources, string, n/a, yes
The name of the VPC to deploy the ECS cluster in, string, null, no
No outputs.
Get more info on how to run/test this lambda
You can get more info on this project and how to run dbt locally and remotely