pipelines/
Contents of the pipelines/ folder
This template implements a modern data architecture using AWS services and Apache Iceberg, featuring a clean separation between ingestion and transformation layers.
Architecture Overview
Data pipelines are built using a 2-layer architecture:
Ingestion Layer (ingest/): Extracts data from external sources and loads it into Apache Iceberg tables
Transformation Layer (transform/): Transforms raw data using dbt and AWS Athena into analytics-ready tables
Project Structure
pipelines/
├── ingest/ # Ingestion layer code
│ ├── {source}-ingestion/ # Source-specific ingestion code (Lambda)
│ └── {source}-schema/ # Schema definitions for landing tables
├── transform/ # Transformation layer code (dbt project)
│ ├── models/ # dbt models
│ ├── sources/ # dbt source definitions
│ └── ...
└── *.tf # Terraform infrastructure definitions
Data Flow Diagram
The following diagram illustrates how data flows through the system:
Ingestion Layer
The ingestion layer extracts data from external sources and loads it into Apache Iceberg landing tables. It consists of three main components:
Source-specific ingestion code in pipelines/ingest/{SOURCE_NAME}-ingestion/
Schema definitions in pipelines/ingest/{SOURCE_NAME}-schema/
Infrastructure as code in Terraform files (pipelines/*.tf)
Example: Chess.com Pipeline
This repository includes an example pipeline that ingests data from Chess.com:
Ingestion Process
The data ingestion process follows these steps:
Extraction & Load: A Lambda function uses Data Load Tool (dlt) to extract data from external sources and store it as Parquet files in S3
Table Management: The same Lambda then uses PyIceberg to register these files with the Iceberg landing tables
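The two steps above can be sketched as a single Lambda handler. This is a minimal illustration of the orchestration only: the helper names and event fields are hypothetical, and the stubs stand in for the real dlt and PyIceberg calls.

```python
# Sketch of the two-step ingestion Lambda. Helper names and event shape
# are hypothetical; the real code would call dlt and PyIceberg here.

def extract_to_parquet(source: str, bucket: str) -> list[str]:
    """Step 1 (stub): in the real Lambda, dlt extracts from the external
    source and writes Parquet files to S3. Here we just return the keys
    such a run might produce."""
    return [f"s3://{bucket}/{source}/part-000.parquet"]

def register_in_iceberg(table_name: str, files: list[str]) -> int:
    """Step 2 (stub): in the real Lambda, PyIceberg attaches the Parquet
    files to the Iceberg landing table without rewriting them. Here we
    just report how many files were registered."""
    return len(files)

def handler(event: dict, context=None) -> dict:
    source = event["source"]    # e.g. "chess-com" (hypothetical field)
    bucket = event["bucket"]    # landing bucket name (hypothetical field)
    table = f"landing.{source.replace('-', '_')}"

    files = extract_to_parquet(source, bucket)    # Extraction & Load
    added = register_in_iceberg(table, files)     # Table Management

    return {"table": table, "files_added": added}
```

The key design point this illustrates is that extraction and table registration happen in the same Lambda invocation, so a file is only added to the Iceberg table after it has landed in S3.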
Landing Table Management
The landing tables are defined and managed through schema scripts in pipelines/ingest/{SOURCE_NAME}-schema/. These scripts are automatically executed during deployment to:
Create new tables if they don't exist
Update existing table schemas when needed
Maintain table properties and metadata
When schema changes are required, you modify and redeploy these definition files.
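The idempotent create-or-update behavior of a schema script can be sketched as follows. The in-memory "catalog" dict and column names are stand-ins for illustration; a real script would work against the Iceberg catalog via PyIceberg.

```python
# Sketch of an idempotent schema script. The dict-based "catalog" and
# the example columns are hypothetical; a real script would use a
# PyIceberg catalog and its schema-evolution API instead.

DESIRED_SCHEMA = {
    "player_id": "string",
    "rating": "int",
    "fetched_at": "timestamp",
}

def apply_schema(catalog: dict, table: str, desired: dict) -> str:
    """Create the table if it doesn't exist, otherwise add any new
    columns; a no-op when the schema already matches."""
    if table not in catalog:
        catalog[table] = dict(desired)          # create new table
        return "created"
    current = catalog[table]
    new_cols = {k: v for k, v in desired.items() if k not in current}
    if new_cols:
        current.update(new_cols)                # evolve existing schema
        return "updated"
    return "unchanged"
```

Because the script converges on the desired state rather than issuing one-off DDL, it is safe to run on every deployment.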
Transformation Layer
The transformation layer processes data from landing tables into analytics-ready formats using dbt. It consists of:
dbt project in transform/
Infrastructure code in pipelines/*.tf (especially ecs_task_dbt.tf)
Infrastructure Overview
The following diagram shows the AWS infrastructure components and their relationships:
Module documentation
Requirements
Providers
Modules
Resources
Inputs
Outputs
No outputs.