Ingestion: dlt + lambda
Overview
This example demonstrates a serverless data ingestion pipeline that:
Fetches chess data from an external source and processes it using dlt
Writes the data to Apache Iceberg tables on S3
The pipeline runs as an AWS Lambda function packaged in a Docker container.
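For orientation, here is a minimal sketch of what lambda_handler.py could look like. The resource, the chess.com endpoint, and the event shape are illustrative assumptions, not the project's actual code:

```python
import dlt
import requests

# Illustrative resource: the real project may use dlt's verified chess source instead.
@dlt.resource(name="players", write_disposition="append")
def players(usernames):
    for username in usernames:
        # chess.com public API; swap in whatever source the pipeline actually uses
        yield requests.get(f"https://api.chess.com/pub/player/{username}", timeout=30).json()

def handler(event, context):
    pipeline = dlt.pipeline(
        pipeline_name="chess",
        destination="filesystem",  # S3 bucket, configured via env vars / Secrets Manager
        dataset_name="chess_raw",
    )
    info = pipeline.run(
        players(event.get("usernames", ["magnuscarlsen"])),
        loader_file_format="parquet",
    )
    return {"status": "ok", "loads": info.loads_ids}
```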

How It Works
Infrastructure Components
AWS Lambda: Executes the ingestion code on demand
Amazon ECR: Stores the Docker container image
AWS Glue: Hosts the Iceberg catalog
Amazon S3: Hosts the Iceberg tables
AWS Secrets Manager: Stores credentials and configuration
Terraform: Provisions and manages all infrastructure
Code Structure
pipelines/
├── chess_lambda.tf # Terraform creating the Lambda function and ECR repository
├── ingestion_bucket.tf # Terraform creating the ingestion S3 bucket
└── ingest/
└── chess-ingestion/ # Lambda function code
├── Dockerfile
├── lambda_handler.py # Lambda handler running the dlt pipeline
└── ...
└── chess-schema/ # Iceberg schema definition in Glue Catalog
├── table_schema.py
└── ...
Data Flow Process
The pipeline follows these steps:
Extraction and Loading to S3: dlt loads the data as Parquet files into the ingestion S3 bucket, following this path pattern:
{source_name}/raw/{table_name}/{load_id}.{file_id}.{ext}
Iceberg Integration: PyIceberg adds these files to the Iceberg tables, which live in the same ingestion S3 bucket under:
{source_name}/landing/{table_name}/
Files are inserted in append mode.
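The second step can be pictured with a small PyIceberg sketch; the catalog name, table identifier, and file path below are assumptions for illustration only:

```python
from pyiceberg.catalog import load_catalog

# Register the Parquet files dlt just wrote with the Iceberg table (append-style,
# no data rewrite). Region and credentials come from the standard AWS config.
catalog = load_catalog("glue", **{"type": "glue"})
table = catalog.load_table("chess_landing.players")

table.add_files([
    "s3://my-ingestion-bucket/chess/raw/players/1700000000.123.0.parquet",
])
```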
Development Guide
1. Local Development with DuckDB
For rapid iteration without AWS resources, use DuckDB as the destination:
Create a .env.local file with:
DESTINATION=duckdb
# Add any source-specific credentials here
Run the pipeline locally:
make run-local
Examine results in the local .duckdb database
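Under the hood, the destination switch can be as simple as reading the env var; a sketch (names are illustrative):

```python
import os
import dlt

# DESTINATION=duckdb (from .env.local) makes dlt write to a local chess.duckdb file,
# named after the pipeline, which you can open with the duckdb CLI or any SQL client.
pipeline = dlt.pipeline(
    pipeline_name="chess",
    destination=os.getenv("DESTINATION", "duckdb"),
    dataset_name="chess_raw",
)
```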
2. Local Development with S3
To run the Lambda with the Iceberg (S3) destination:
Configure .env.local with:
DESTINATION=filesystem
AWS_REGION=<your-aws-region>
S3_BUCKET_NAME=<your-s3-bucket-name>
AWS_PROFILE=<your-aws-profile>
# Add any source-specific credentials here
Run with the same command:
make run-local
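With DESTINATION=filesystem, the same pipeline can point at S3 by building the destination from the env vars above; a hedged sketch, not the project's exact wiring:

```python
import os
import dlt
from dlt.destinations import filesystem

# S3_BUCKET_NAME comes from .env.local; credentials are resolved from AWS_PROFILE.
destination = filesystem(bucket_url=f"s3://{os.environ['S3_BUCKET_NAME']}")

pipeline = dlt.pipeline(
    pipeline_name="chess",
    destination=destination,
    dataset_name="chess_raw",
)
```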
3. Testing on AWS
Once your code is deployed to AWS, you can invoke the Lambda with:
export AWS_PROFILE=<your_profile>
make run-lambda env=<your_environment>
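If you prefer not to use the make target, the same invocation can be done with boto3; the function name below is a placeholder for whatever name Terraform actually creates:

```python
import json
import boto3

client = boto3.client("lambda")
response = client.invoke(
    FunctionName="chess-ingestion",          # placeholder; use the deployed function's name
    InvocationType="RequestResponse",
    Payload=json.dumps({}).encode("utf-8"),
)
print(response["Payload"].read().decode("utf-8"))
```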
4. VSCode Debugging
For interactive debugging, add this to .vscode/launch.json:
{
    "name": "Debug chess lambda",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/pipelines/ingest/chess-ingestion/lambda_handler.py",
    "console": "integratedTerminal",
    "cwd": "${workspaceFolder}/pipelines/ingest/chess-ingestion",
    "justMyCode": false
}
Schema Management
dlt needs to run once first to produce the target schema definition.
After running the pipeline locally (see above), generate the source schema definition:
cd pipelines/
uvx boringdata dlt get-schema chess
This generates one schema file per table in the ingest/chess-schema
folder; these files are used to create the Iceberg tables in the AWS Glue Catalog.
These scripts will be run automatically by the CI/CD pipeline.
More details about schema management can be found here.
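To give an idea of what a generated schema file might do, here is a hedged PyIceberg sketch of creating an Iceberg table in the Glue Catalog; the field list, table identifier, and location are placeholders, and the real files are produced by the boringdata command above:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestampType

# Placeholder schema derived from the dlt run
schema = Schema(
    NestedField(1, "username", StringType(), required=False),
    NestedField(2, "last_online", TimestampType(), required=False),
)

catalog = load_catalog("glue", **{"type": "glue"})
catalog.create_table(
    identifier="chess_landing.players",
    schema=schema,
    location="s3://my-ingestion-bucket/chess/landing/players/",
)
```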
Manual Deployment
For manual deployment:
# Set required environment variables
export AWS_PROFILE=<your_profile>
# Build and deploy
make deploy env=<your_environment>
This process:
Builds the Docker image locally
Pushes it to ECR
Updates the Lambda to use the new image
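The last step is roughly equivalent to this boto3 call; the function name and image URI are placeholders:

```python
import boto3

# Point the Lambda function at the freshly pushed container image.
lambda_client = boto3.client("lambda")
lambda_client.update_function_code(
    FunctionName="chess-ingestion",
    ImageUri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/chess-ingestion:latest",
)
```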
Common Commands
# Development
make run-local # Run locally with settings from .env.local
make run-lambda env=<environment> # Execute on AWS Lambda
# Deployment
make build env=<environment> # Build Docker image
make deploy env=<environment> # Build and deploy to ECR
# Utilities
make help # Show all available commands