pipelines/
Contents of the pipelines/ folder
This template implements a modern data architecture using AWS services and Apache Iceberg, featuring a clean separation between ingestion and transformation layers.
Architecture Overview
Data pipelines are built using a 2-layer architecture:
Ingestion Layer (ingest/): Extracts data from external sources and loads it into Apache Iceberg tables
Transformation Layer (transform/): Transforms raw data using dbt and AWS Athena into analytics-ready tables
Project Structure
pipelines/
├── ingest/ # Ingestion layer code
│ ├── {source}-ingestion/ # Source-specific ingestion code (Lambda)
│ └── {source}-schema/ # Schema definitions for landing tables
├── transform/ # Transformation layer code (dbt project)
│ ├── models/ # dbt models
│ ├── sources/ # dbt source definitions
│ └── ...
└── *.tf # Terraform infrastructure definitions
Data Flow Diagram
The following diagram illustrates how data flows through the system:
Ingestion Layer
The ingestion layer extracts data from external sources and loads it into Apache Iceberg landing tables. It consists of three main components:
Source-specific ingestion code in pipelines/ingest/{SOURCE_NAME}-ingestion/
Schema definitions in pipelines/ingest/{SOURCE_NAME}-schema/
Infrastructure as code in Terraform files (pipelines/*.tf)
Example: Chess.com Pipeline
This repository includes an example pipeline that ingests data from Chess.com:
Ingestion Process
The data ingestion process follows these steps:
Extraction & Load: A Lambda function uses Data Load Tool (dlt) to extract data from external sources and store it as Parquet files in S3
Table Management: The same Lambda then uses PyIceberg to register these files with the Iceberg landing tables
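The two steps above can be sketched as a single Lambda handler. This is a minimal illustration of the orchestration only: the helper names and event fields are hypothetical, and the stubs stand in for the real dlt and PyIceberg calls.

```python
# Sketch of the two-step ingestion Lambda. Helper names and event shape
# are hypothetical; the real code would call dlt and PyIceberg here.

def extract_to_parquet(source: str, bucket: str) -> list[str]:
    """Step 1 (stub): in the real Lambda, dlt extracts from the external
    source and writes Parquet files to S3. Here we just return the keys
    such a run might produce."""
    return [f"s3://{bucket}/{source}/part-000.parquet"]

def register_in_iceberg(table_name: str, files: list[str]) -> int:
    """Step 2 (stub): in the real Lambda, PyIceberg attaches the Parquet
    files to the Iceberg landing table without rewriting them. Here we
    just report how many files were registered."""
    return len(files)

def handler(event: dict, context=None) -> dict:
    source = event["source"]    # e.g. "chess-com" (hypothetical field)
    bucket = event["bucket"]    # landing bucket name (hypothetical field)
    table = f"landing.{source.replace('-', '_')}"

    files = extract_to_parquet(source, bucket)    # Extraction & Load
    added = register_in_iceberg(table, files)     # Table Management

    return {"table": table, "files_added": added}
```

The key design point this illustrates is that extraction and table registration happen in the same Lambda invocation, so a file is only added to the Iceberg table after it has landed in S3.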
Landing Table Management
The landing tables are defined and managed through schema scripts in pipelines/ingest/{SOURCE_NAME}-schema/. These scripts are automatically executed during deployment to:
Create new tables if they don't exist
Update existing table schemas when needed
Maintain table properties and metadata
When schema changes are required, you modify and redeploy these definition files.
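The idempotent create-or-update behavior of a schema script can be sketched as follows. The in-memory "catalog" dict and column names are stand-ins for illustration; a real script would work against the Iceberg catalog via PyIceberg.

```python
# Sketch of an idempotent schema script. The dict-based "catalog" and
# the example columns are hypothetical; a real script would use a
# PyIceberg catalog and its schema-evolution API instead.

DESIRED_SCHEMA = {
    "player_id": "string",
    "rating": "int",
    "fetched_at": "timestamp",
}

def apply_schema(catalog: dict, table: str, desired: dict) -> str:
    """Create the table if it doesn't exist, otherwise add any new
    columns; a no-op when the schema already matches."""
    if table not in catalog:
        catalog[table] = dict(desired)          # create new table
        return "created"
    current = catalog[table]
    new_cols = {k: v for k, v in desired.items() if k not in current}
    if new_cols:
        current.update(new_cols)                # evolve existing schema
        return "updated"
    return "unchanged"
```

Because the script converges on the desired state rather than issuing one-off DDL, it is safe to run on every deployment.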
Transformation Layer
The transformation layer processes data from landing tables into analytics-ready formats using dbt. It consists of:
dbt project in transform/
Infrastructure code in pipelines/*.tf (especially ecs_task_dbt.tf)
Infrastructure Overview
The following diagram shows the AWS infrastructure components and their relationships:
Module documentation
Requirements
Providers
Modules
Resources
Inputs
Outputs
No outputs.