pipelines/

Contents of the pipelines/ folder

This template implements a modern data architecture using AWS services and Apache Iceberg, featuring a clean separation between ingestion and transformation layers.

Architecture Overview

Data pipelines are built on a two-layer architecture:

  • Ingestion Layer (ingest/): Extracts data from external sources and loads it into Apache Iceberg tables

  • Transformation Layer (transform/): Transforms raw data using dbt and AWS Athena into analytics-ready tables

Project Structure

pipelines/
├── ingest/                    # Ingestion layer code
│   ├── {source}-ingestion/    # Source-specific ingestion code (Lambda)
│   └── {source}-schema/       # Schema definitions for landing tables
├── transform/                 # Transformation layer code (dbt project)
│   ├── models/                # dbt models
│   ├── sources/               # dbt source definitions
│   └── ...
└── *.tf                       # Terraform infrastructure definitions

Data Flow Diagram

(Diagram: end-to-end data flow through the ingestion and transformation layers.)

Ingestion Layer

The ingestion layer extracts data from external sources and loads it into Apache Iceberg landing tables. It consists of three main components:

  1. Source-specific ingestion code in pipelines/ingest/{source}-ingestion/

  2. Schema definitions in pipelines/ingest/{source}-schema/

  3. Infrastructure as code in Terraform files (pipelines/*.tf)

Example: Chess.com Pipeline

This repository includes an example pipeline that ingests data from Chess.com.

Ingestion Process

The data ingestion process follows these steps:

  1. Extraction & Load: A Lambda function uses dlt (data load tool) to extract data from external sources and store it as Parquet files in S3

  2. Table Management: The same Lambda then uses PyIceberg to register those Parquet files with the Iceberg landing tables (both steps are sketched below)
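
A minimal sketch of these two steps, assuming a Glue-backed Iceberg catalog; the resource, bucket name, and table identifier are illustrative rather than the repository's actual values (the real implementation lives in pipelines/ingest/chess-ingestion/):

import boto3
import dlt
import requests
from pyiceberg.catalog import load_catalog

BUCKET = "example-landing-bucket"  # assumption: the landing S3 bucket name

@dlt.resource(name="player_games", write_disposition="append")
def player_games(username: str, year: int, month: int):
    # Chess.com publishes monthly per-player game archives on its public API.
    url = f"https://api.chess.com/pub/player/{username}/games/{year}/{month:02d}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    yield from response.json()["games"]

def handler(event, context):
    # Step 1: extract with dlt and load the records as Parquet files in S3.
    # The filesystem destination reads its bucket URL from the environment
    # (DESTINATION__FILESYSTEM__BUCKET_URL=s3://<bucket>).
    pipeline = dlt.pipeline(
        pipeline_name="chess_ingestion",
        destination="filesystem",
        dataset_name="chess_landing",
    )
    pipeline.run(player_games("hikaru", 2024, 1), loader_file_format="parquet")

    # Step 2: register the new Parquet files with the Iceberg landing table.
    # For brevity this lists every Parquet object under the dataset prefix;
    # the real Lambda would track exactly which files the load produced.
    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="chess_landing/player_games/")
    parquet_files = [
        f"s3://{BUCKET}/{obj['Key']}"
        for obj in listing.get("Contents", [])
        if obj["Key"].endswith(".parquet")
    ]
    catalog = load_catalog("default", type="glue")
    table = catalog.load_table("landing.chess_games")
    table.add_files(parquet_files)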

For detailed instructions on running and testing the Chess.com Lambda function, see the chess-ingestion README.

Landing Table Management

The landing tables are defined and managed through schema scripts in pipelines/ingest/{source}-schema/. These scripts run automatically during deployment to:

  • Create new tables if they don't exist

  • Update existing table schemas when needed

  • Maintain table properties and metadata

When schema changes are required, you modify and redeploy these definition files.
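
As an illustration, a minimal schema script could look like the following sketch; the catalog configuration, field names, and table identifier are assumptions rather than the repository's actual definitions:

from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType, TimestampType

schema = Schema(
    NestedField(1, "game_id", LongType(), required=True),
    NestedField(2, "username", StringType(), required=False),
    NestedField(3, "played_at", TimestampType(), required=False),
)

catalog = load_catalog("default", type="glue")

# Create the landing table on first deployment; later runs are a no-op.
table = catalog.create_table_if_not_exists("landing.chess_games", schema=schema)

# On subsequent deployments, merge any newly added columns into the
# existing table schema without touching columns that already exist.
with table.update_schema() as update:
    update.union_by_name(schema)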

For more details on schema evolution, see Iceberg Landing Table Schema Evolution.

Transformation Layer

The transformation layer processes data from landing tables into analytics-ready formats using dbt. It consists of:

  1. dbt project in transform/

  2. Infrastructure code in pipelines/*.tf (especially ecs_task_dbt.tf); see the sketch below
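
As an illustration of what the ECS task's entrypoint could do, the sketch below invokes dbt through its programmatic Python API (available in dbt 1.5+); the prod target name is an assumption:

from dbt.cli.main import dbtRunner, dbtRunnerResult

def main() -> None:
    # Run the project's models against the Athena target, exactly as the
    # `dbt build` CLI command would.
    result: dbtRunnerResult = dbtRunner().invoke(["build", "--target", "prod"])
    if not result.success:
        # A non-zero exit marks the ECS task as failed.
        raise SystemExit(1)

if __name__ == "__main__":
    main()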

For details on developing and running dbt models, see the transform README.

Infrastructure Overview

(Diagram: AWS infrastructure components and their relationships.)

Module Documentation

Requirements

Name        Version
terraform   >=1.5.7
aws         >=5.63.1

Providers

Name   Version
aws    5.92.0
null   3.2.3

Modules

Source                                               Version
terraform-aws-modules/iam/aws//modules/iam-policy    5.39.1
terraform-aws-modules/iam/aws//modules/iam-policy    5.39.1
terraform-aws-modules/ecr/aws                        n/a
terraform-aws-modules/lambda/aws                     7.2.1
terraform-aws-modules/step-functions/aws             4.2.1
terraform-aws-modules/secrets-manager/aws            1.1.2
terraform-aws-modules/ecr/aws                        n/a
terraform-aws-modules/ssm-parameter/aws              1.1.1
terraform-aws-modules/ecs/aws//modules/service       5.11.2
terraform-aws-modules/s3-bucket/aws                  4.1.0
terraform-aws-modules/s3-bucket/aws                  4.1.0

Resources

Inputs

Description                                                             Type     Default   Required
The name of the ECS cluster                                             string   null      no
The environment to deploy to - will prefix the name of all resources    string   n/a       yes
The name of the VPC to deploy the ECS cluster in                        string   null      no

Outputs

No outputs.
