Ingestion: dlt + Lambda

Overview

This example demonstrates a serverless data ingestion pipeline that:

  1. Fetches chess data from an external source and processes it using dlt

  2. Writes the data to Apache Iceberg tables on S3

The pipeline runs as an AWS Lambda function packaged in a Docker container.
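As a rough sketch of that packaging, a Dockerfile based on AWS's public Lambda Python base image might look like the following (the handler module path, dependency file, and Python version are assumptions, not this repo's actual layout):

```dockerfile
# Hypothetical Dockerfile; handler path and dependency file are assumptions
FROM public.ecr.aws/lambda/python:3.12

# Install Python dependencies into the Lambda task root
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the ingestion code
COPY ingest/ ${LAMBDA_TASK_ROOT}/ingest/

# Lambda invokes this module.function as the entry point
CMD ["ingest.handler.handler"]
```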

How It Works

Infrastructure Components

  • AWS Lambda: Executes the ingestion code on demand

  • Amazon ECR: Stores the Docker container image

  • AWS Glue: Hosts the Iceberg catalog

  • Amazon S3: Hosts the Iceberg tables

  • AWS Secrets Manager: Stores credentials and configuration

  • Terraform: Provisions and manages all infrastructure (see the sketch after this list)
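As an illustration of how these pieces fit together in Terraform, a container-image Lambda resource might look roughly like this (all names and the referenced resources are assumptions, not this project's actual configuration):

```hcl
# Hypothetical sketch of the container-based Lambda; all names are assumptions
resource "aws_lambda_function" "ingestion" {
  function_name = "chess-ingestion"
  package_type  = "Image"
  image_uri     = "${aws_ecr_repository.ingestion.repository_url}:latest"
  role          = aws_iam_role.ingestion.arn
  timeout       = 300
  memory_size   = 1024
}
```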

Code Structure

Data Flow Process

The pipeline follows these steps:

  1. Extraction and Loading to S3: dlt loads the data as Parquet files into an ingestion S3 bucket, following a fixed path pattern (see the sketch after this list)

  2. Iceberg Integration: PyIceberg registers these Parquet files with Iceberg tables, which live in the same ingestion S3 bucket under a separate prefix (see the sketch after this list)

Files are inserted in append mode: new files are added without overwriting existing table data.
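The exact bucket layout and table names are repo-specific and not reproduced here. As an illustrative sketch of the append step, registering staged Parquet files with an Iceberg table through PyIceberg and the Glue catalog looks roughly like this (database, table, and path names are all assumptions):

```python
from pyiceberg.catalog import load_catalog

# Connect to the Glue-backed Iceberg catalog; AWS credentials and region
# are picked up from the environment
catalog = load_catalog("glue", **{"type": "glue"})

# Database and table names here are hypothetical
table = catalog.load_table("chess.players_games")

# Append mode: register staged Parquet files without rewriting existing data
table.add_files(file_paths=[
    "s3://ingestion-bucket/chess/players_games/1700000000.parquet",  # illustrative
])
```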

Development Guide

1. Local Development with DuckDB

For rapid iteration without AWS resources, use DuckDB as the destination:

  1. Create a .env.local file selecting DuckDB as the destination (a hypothetical example follows this list)

  2. Run the pipeline locally (command sketched after this list)

  3. Examine results in the local .duckdb database
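The original snippets are not reproduced here; a minimal hypothetical version, assuming the destination is selected through an environment variable and the entry point lives under ingest/, might look like:

```
# .env.local -- key names are assumptions, not this repo's actual settings
DESTINATION=duckdb
```

```bash
# Run the ingestion entry point locally (path is an assumption)
python ingest/handler.py
```

The resulting tables can then be inspected with the DuckDB CLI, for example `duckdb chess.duckdb -c "SHOW TABLES;"` (the database file name is an assumption).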

2. Local Development with S3

To run the Lambda locally with the Iceberg destination on S3:

  1. Configure .env.local with AWS credentials and the Iceberg destination (a hypothetical example follows this list)

  2. Run with the same command as above
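Again, the actual keys are repo-specific; a hypothetical .env.local for this mode might look like:

```
# .env.local -- illustrative only; actual keys depend on the repo's config
DESTINATION=iceberg
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=eu-west-1                  # assumption
INGESTION_BUCKET=my-ingestion-bucket  # assumption
```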

3. Testing on AWS

Once your code is deployed to AWS, you can invoke the Lambda from the command line.
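A hypothetical invocation with the AWS CLI (the function name is an assumption, not this project's actual name):

```bash
# Invoke the deployed function synchronously and print its response
aws lambda invoke \
  --function-name chess-ingestion \
  --payload '{}' \
  --cli-binary-format raw-in-base64-out \
  response.json
cat response.json
```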

4. VSCode Debugging

For interactive debugging, add a launch configuration to .vscode/launch.json.
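A hypothetical configuration, assuming the entry point ingest/handler.py and the .env.local file described above (both assumptions):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug ingestion pipeline",
      "type": "debugpy",
      "request": "launch",
      "program": "${workspaceFolder}/ingest/handler.py",
      "envFile": "${workspaceFolder}/.env.local",
      "console": "integratedTerminal"
    }
  ]
}
```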

Schema Management

dlt needs to run once first to produce the target schema definition.

After running the pipeline locally (see above), generate a source schema definition.
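The repo's schema-generation command is not reproduced here. As a nearby point of reference, dlt's own CLI can display the schema inferred by the local run (the pipeline name "chess" is an assumption):

```bash
# Print the schema that the local pipeline run produced
dlt pipeline chess schema
```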

The schema scripts generate one schema file per table in the ingest/chess-schema folder; these files are used to create the Iceberg tables in the AWS Glue catalog.

These scripts will be run automatically by the CI/CD pipeline.

More details about schema management can be found here.

Manual Deployment

For manual deployment:
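The repo's actual deployment command is not reproduced here; a hypothetical equivalent using the Docker and AWS CLIs (account ID, region, repository, and function names are all assumptions):

```bash
# All identifiers below are illustrative, not this project's actual values
AWS_ACCOUNT=123456789012
REGION=eu-west-1
REPO=chess-ingestion
IMAGE=$AWS_ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest

# 1. Build the Docker image locally
docker build -t $IMAGE .

# 2. Push it to ECR
aws ecr get-login-password --region $REGION |
  docker login --username AWS --password-stdin $AWS_ACCOUNT.dkr.ecr.$REGION.amazonaws.com
docker push $IMAGE

# 3. Update the Lambda to use the new image
aws lambda update-function-code --function-name $REPO --image-uri $IMAGE
```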

This process:

  1. Builds the Docker image locally

  2. Pushes it to ECR

  3. Updates the Lambda to use the new image

Common Commands

Resources
