pipelines/
Contents of the pipelines/ folder
Data pipelines are built using a 2-layer architecture:

- ingest/ for data ingestion from source to S3
- transform/ for data transformation in Snowflake via dbt
For each layer, the application code lives in its own folder, while the underlying infrastructure is defined in Terraform files at the root of pipelines/.
pipelines/
├── ingest/
├── transform/
├── *.tf

A typical data flow looks like this:

source -> Lambda (dlt) -> S3 -> Snowpipe -> Snowflake landing tables -> dbt (STAGING -> MART)
Ingestion Layer
The ingest layer is composed of three artifacts:
- the ingestion code in pipelines/ingest/{SOURCE_NAME}-ingestion/
- the data source schema in pipelines/ingest/{SOURCE_NAME}_source_schema.yml
- the infrastructure code (Terraform) in pipelines/*.tf
Let's take the chess.com pipeline example provided in this repo:
pipelines/
├── ingest/
│ ├── chess-lambda/
│ │ ├── lambda_handler.py # Lambda code embedding DLT for Chess.com ingestion
│ │ └── ...
│ └── chess_source_schema.yml # YAML file defining the Chess.com data schema
├── chess_lambda.tf # Terraform creating the lambda function
├── ingestion_bucket.tf # Terraform creating target S3 bucket
...

The ingestion is done in a Lambda function embedding dlt, with:

- source code in pipelines/ingest/chess-lambda/
- Terraform in pipelines/chess_lambda.tf
This Lambda writes to the S3 bucket defined in ingestion_bucket.tf.
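For illustration, here is a minimal sketch of what such a handler could look like. It is not the repo's actual lambda_handler.py: the resource body, endpoint, and player list are assumptions, and dlt's filesystem destination is pointed at S3 through its bucket_url setting (e.g. the DESTINATION__FILESYSTEM__BUCKET_URL environment variable).

```python
import dlt
import requests


@dlt.resource(name="players_games", write_disposition="append")
def players_games(players):
    # Hypothetical resource: pull one month of games per player from the
    # public Chess.com API and yield the raw records to dlt.
    for player in players:
        resp = requests.get(f"https://api.chess.com/pub/player/{player}/games/2022/11")
        resp.raise_for_status()
        yield from resp.json()["games"]


def lambda_handler(event, context):
    # destination="filesystem" makes dlt write its load packages to the
    # bucket configured via DESTINATION__FILESYSTEM__BUCKET_URL
    # (e.g. s3://... pointing at the bucket created by ingestion_bucket.tf).
    pipeline = dlt.pipeline(
        pipeline_name="chess",
        destination="filesystem",
        dataset_name="chess_data",
    )
    load_info = pipeline.run(players_games(event.get("players", ["magnuscarlsen"])))
    return {"status": "ok", "loads": str(load_info)}
```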
We maintain a YAML file for each data source, pipelines/ingest/{SOURCE_NAME}_source_schema.yml, to track the source schema and automatically create the landing tables in the Snowflake warehouse.
Transform Layer
The transform layer is composed of two artifacts:
- the transformation code in transform/ (typically a dbt project)
- the infrastructure code (Terraform) in pipelines/*.tf
S3 -> Snowflake
ingestion_snowpipe.tf automatically reads all the yml files in the ingest/ folder and creates:

- all the landing tables in Snowflake
- all the Snowflake pipes that automatically copy the data from S3 into these tables
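To make that mapping concrete, here is an illustrative Python sketch of the logic that ingestion_snowpipe.tf expresses in Terraform. The schema layout (table_name, columns) and the @ingestion_stage name are hypothetical; the repo's actual yml format may differ.

```python
# Illustrative only: in this repo the yml -> Snowflake mapping is done by
# Terraform in ingestion_snowpipe.tf, not by Python. Assumes a hypothetical
# schema format such as:
#   table_name: chess_games
#   columns:
#     - {name: url, type: varchar}
#     - {name: end_time, type: timestamp_ntz}
import glob

import yaml


def render_sql(schema_path: str, stage: str = "@ingestion_stage") -> str:
    with open(schema_path) as f:
        schema = yaml.safe_load(f)
    table = schema["table_name"]
    cols = ", ".join(f"{c['name']} {c['type']}" for c in schema["columns"])
    return (
        f"CREATE TABLE IF NOT EXISTS landing.{table} ({cols});\n"
        f"CREATE PIPE IF NOT EXISTS landing.{table}_pipe AUTO_INGEST = TRUE AS\n"
        f"  COPY INTO landing.{table} FROM {stage}/{table};"
    )


# One landing table and one pipe per source schema file.
for path in glob.glob("pipelines/ingest/*_source_schema.yml"):
    print(render_sql(path))
```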
dbt
transform/ is a standard dbt project with models split into two folders (schemas in Snowflake):

- STAGING: for the transformed data
- MART: for the data ready to be used by the business
The dbt project is run in an ECS task (ecs_task_dbt.tf).
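As a sketch of what that ECS task might execute, assuming dbt-core >= 1.5 (which ships the programmatic dbtRunner API) and the project living in transform/, an entrypoint could look like this; the actual task definition in ecs_task_dbt.tf may simply invoke the dbt CLI instead.

```python
# Hypothetical ECS task entrypoint; the real task may just run `dbt run`
# via the CLI. Requires dbt-core >= 1.5 for the programmatic dbtRunner.
from dbt.cli.main import dbtRunner


def main() -> None:
    runner = dbtRunner()
    # Build the STAGING and MART models defined in the transform/ dbt project.
    result = runner.invoke(["run", "--project-dir", "transform"])
    if not result.success:
        raise SystemExit(1)


if __name__ == "__main__":
    main()
```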
Taking the chess pipeline example again, the dbt project in transform/ follows the standard dbt layout, with the models split between the two schema folders:

transform/
├── dbt_project.yml
├── models/
│   ├── staging/
│   └── mart/
...