
Ingestion: dlt + lambda

Overview

This example demonstrates a serverless data ingestion pipeline that:

  1. Fetches chess data from an external source and processes it using dlt

  2. Writes the data to Apache Iceberg tables on S3

The pipeline runs as an AWS Lambda function packaged in a Docker container.
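The repository's actual module layout isn't reproduced here, but a minimal sketch of such a handler could look like the following (the resource, player list, and handler name are illustrative assumptions, not the project's real code):

```python
import dlt
import requests


@dlt.resource(name="players_profiles", write_disposition="append")
def players_profiles(players):
    # Illustrative resource hitting the public chess.com API.
    for player in players:
        yield requests.get(f"https://api.chess.com/pub/player/{player}", timeout=10).json()


def handler(event, context):
    # Filesystem destination: dlt writes Parquet files to the ingestion S3 bucket.
    # Bucket and credentials come from env vars / Secrets Manager at runtime.
    pipeline = dlt.pipeline(
        pipeline_name="chess",
        destination="filesystem",
        dataset_name="chess_data",
    )
    load_info = pipeline.run(
        players_profiles(event.get("players", ["magnuscarlsen"])),
        loader_file_format="parquet",
    )
    return {"status": "ok", "load_info": str(load_info)}
```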

How It Works

Infrastructure Components

  • AWS Lambda: Executes the ingestion code on demand

  • Amazon ECR: Stores the Docker container image

  • AWS Glue: Hosts the Iceberg catalog

  • Amazon S3: Hosts the Iceberg tables

  • AWS Secrets Manager: Stores credentials and configuration

  • Terraform: Provisions and manages all infrastructure

Code Structure

Data Flow Process

The pipeline follows these steps:

  1. Extraction and Loading to S3: dlt loads data as Parquet files to an ingestion S3 bucket following this path pattern:

  2. Iceberg Integration: PyIceberg adds these files to Iceberg tables, which live in the same ingestion S3 bucket under:

Files are inserted in append mode.
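As an illustration of step 2, a hedged PyIceberg sketch (the catalog name, namespace, table, and bucket path are placeholders) could look like:

```python
from pyiceberg.catalog import load_catalog

# AWS Glue acts as the Iceberg catalog.
catalog = load_catalog("glue", **{"type": "glue"})
table = catalog.load_table("chess_data.players_profiles")  # <database>.<table>

parquet_files = [
    "s3://my-ingestion-bucket/chess_data/players_profiles/part-0001.parquet",
]

# Append-only: add_files registers the existing Parquet files as new data files
# without rewriting them.
table.add_files(file_paths=parquet_files)
```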

Development Guide

1. Local Development with DuckDB

For rapid iteration without AWS resources, use DuckDB as the destination:

  1. Create a .env.local file with:

  2. Run the pipeline locally:

  3. Examine results in the local .duckdb database
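For step 3, a minimal sketch of inspecting the local database, assuming dlt's default `<pipeline_name>.duckdb` file name and the dataset/table names used elsewhere in this page:

```python
import duckdb

# Open the database file produced by the local dlt run.
con = duckdb.connect("chess.duckdb")

# List all tables across schemas, then peek at one of them.
con.sql("SHOW ALL TABLES").show()
con.sql("SELECT * FROM chess_data.players_profiles LIMIT 5").show()
```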

2. Local Development with S3

To run the Lambda code with the Iceberg destination:

  1. Configure .env.local with:

  2. Run with the same command:
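The actual `.env.local` keys aren't reproduced here; as a hedged illustration of step 1, dlt's filesystem destination can be pointed at S3 through environment variables of this shape (bucket, credentials, and region are placeholders):

```python
import os

# Equivalent configuration expressed in Python: dlt reads these settings
# from the environment at runtime.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "s3://my-ingestion-bucket"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID"] = "..."
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_REGION"] = "eu-west-1"
```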

3. Testing on AWS

Once your code is deployed to AWS, you can invoke the Lambda with:
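The exact invocation command isn't shown here; one hedged option is a boto3 call like the following (the function name is a placeholder for whatever Terraform actually provisions):

```python
import json

import boto3

client = boto3.client("lambda")
response = client.invoke(
    FunctionName="chess-ingestion",        # placeholder function name
    InvocationType="RequestResponse",      # synchronous invocation
    Payload=json.dumps({"players": ["magnuscarlsen"]}),
)
print(response["StatusCode"], response["Payload"].read().decode())
```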

4. VSCode Debugging

For interactive debugging, add this to .vscode/launch.json:

Schema Management

dlt needs to run once first to produce the target schema definition.

After running the pipeline locally (see above), generate a source schema definition:

This generates one schema file per table in the ingest/chess-schema folder; these files are used to create the Iceberg tables in the AWS Glue Catalog.
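As a hedged sketch of what this step amounts to (the real script and its output format may differ), the schema inferred by the local run can be read back from the pipeline and written out per table:

```python
from pathlib import Path

import dlt
import yaml

# Attach to the existing local pipeline state and grab its inferred schema.
pipeline = dlt.pipeline(pipeline_name="chess")
schema = pipeline.default_schema

out_dir = Path("ingest/chess-schema")
out_dir.mkdir(parents=True, exist_ok=True)

for table_name, table in schema.tables.items():
    if table_name.startswith("_dlt"):  # skip dlt's internal tables
        continue
    (out_dir / f"{table_name}.yaml").write_text(yaml.safe_dump(table))
```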

These scripts will be run automatically by the CI/CD pipeline.

More details about schema management can be found here.

Manual Deployment

For manual deployment:

This process:

  1. Builds the Docker image locally

  2. Pushes it to ECR

  3. Updates the Lambda to use the new image
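A hedged sketch of these three steps (the repository presumably wraps them in a script or Make target; the image URI and function name are placeholders, and authentication against ECR is assumed to be already set up):

```python
import subprocess

import boto3

image = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/chess-ingestion:latest"

subprocess.run(["docker", "build", "-t", image, "."], check=True)  # 1. build locally
subprocess.run(["docker", "push", image], check=True)              # 2. push to ECR

# 3. point the Lambda at the new image
boto3.client("lambda").update_function_code(
    FunctionName="chess-ingestion",
    ImageUri=image,
)
```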

Common Commands

Resources
