Ingestion: dlt + lambda
This example demonstrates a serverless data ingestion pipeline that:
Fetches chess data from an external source and processes it with dlt
Writes the data to Apache Iceberg tables on S3
The pipeline runs as an AWS Lambda function packaged in a Docker container.
AWS Lambda: Executes the ingestion code on demand
Amazon ECR: Stores the Docker container image
AWS Glue: Hosts the Iceberg Catalog
Amazon S3: Hosts the Iceberg tables
AWS Secrets Manager: Stores credentials and configuration
Terraform: Provisions and manages all infrastructure
The pipeline follows these steps:
Extraction and Loading to S3: dlt loads data as Parquet files to an ingestion S3 bucket, following this path pattern:
Iceberg Integration: PyIceberg adds these files to Iceberg tables, which are located in the same ingestion S3 bucket under:
Files are inserted in append mode.
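Below is a minimal, hypothetical sketch of these two steps. The bucket, dataset, table, and catalog names are placeholders (the real pipeline reads its configuration from Secrets Manager), and a small inline record stands in for the actual chess source.

```python
# Sketch only: illustrative names, not the project's actual configuration.
import boto3
import dlt
from pyiceberg.catalog import load_catalog

BUCKET = "my-ingestion-bucket"  # hypothetical ingestion bucket

# 1. Extraction and loading: dlt writes Parquet files to the ingestion bucket.
pipeline = dlt.pipeline(
    pipeline_name="chess",
    destination=dlt.destinations.filesystem(bucket_url=f"s3://{BUCKET}/raw"),
    dataset_name="chess_data",
)
pipeline.run(
    [{"player": "magnuscarlsen", "rating": 2830}],  # stand-in for the real chess source
    table_name="players",
    loader_file_format="parquet",
)

# 2. Iceberg integration: PyIceberg appends the new Parquet files to the table
#    registered in the AWS Glue catalog, without rewriting existing data.
catalog = load_catalog("glue", **{"type": "glue"})
table = catalog.load_table("chess_db.players")  # hypothetical database.table

s3 = boto3.client("s3")
new_files = [
    f"s3://{BUCKET}/{obj['Key']}"
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/chess_data/players/").get("Contents", [])
    if obj["Key"].endswith(".parquet")
]
# In practice only the files produced by the latest load would be selected here.
table.add_files(file_paths=new_files)  # append mode: files are added, not overwritten
```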
For rapid iteration without AWS resources, use DuckDB as the destination:
Create a .env.local file with:
Run the pipeline locally:
Examine results in the local .duckdb database
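As a sketch of this local loop, assuming the destination is switched to DuckDB and using placeholder pipeline and table names:

```python
# Sketch only: placeholder names, stand-in data instead of the real chess source.
import dlt
import duckdb

pipeline = dlt.pipeline(
    pipeline_name="chess",
    destination="duckdb",   # local file instead of S3/Iceberg
    dataset_name="chess_data",
)
pipeline.run(
    [{"player": "hikaru", "rating": 2800}],  # stand-in data
    table_name="players",
)

# Examine the results in the local .duckdb database created by the pipeline
# (by default dlt names the file after the pipeline).
con = duckdb.connect("chess.duckdb")
print(con.sql("SELECT * FROM chess_data.players").fetchall())
```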
To run the lambda with the Iceberg destination:
Configure .env.local with:
Run with the same command:
Once your code is deployed to AWS you can run the lambda with:
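As an illustration only (the repository's actual invocation command may differ), the deployed function can also be called from Python with boto3; the function name below is a placeholder:

```python
# Hypothetical invocation of the deployed ingestion Lambda.
import json
import boto3

client = boto3.client("lambda")
response = client.invoke(
    FunctionName="chess-ingestion",    # placeholder function name
    InvocationType="RequestResponse",  # wait for the result synchronously
    Payload=json.dumps({}).encode("utf-8"),
)
print(json.loads(response["Payload"].read()))
```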
For interactive debugging, add this to .vscode/launch.json:
dlt needs to run once first to provide the target schema definition.
After running the pipeline locally (see above), generate a source schema definition:
This generates one schema file per table in the ingest/chess-schema folder; these files are used to create the Iceberg tables in the AWS Glue Catalog.
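A hedged sketch of what such a schema export could look like, using dlt's schema API with placeholder names; the real script in this repository may differ:

```python
# Sketch only: dumps one file per table from the locally-run pipeline's schema.
from pathlib import Path

import dlt
import yaml

# Attach to the pipeline that was run locally (see above) to read its schema.
pipeline = dlt.attach(pipeline_name="chess")
out_dir = Path("ingest/chess-schema")
out_dir.mkdir(parents=True, exist_ok=True)

for table in pipeline.default_schema.data_tables():
    # One file per table, listing column names and dlt data types; these
    # definitions are later used to create the Iceberg tables in Glue.
    columns = {name: col.get("data_type") for name, col in table["columns"].items()}
    (out_dir / f"{table['name']}.yaml").write_text(yaml.safe_dump(columns))
```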
These scripts will be run automatically by the CI/CD pipeline.
For manual deployment:
This process:
Builds the Docker image locally
Pushes it to ECR
Updates the Lambda to use the new image
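As a sketch of the last step only, assuming the image has already been built and pushed with the Docker CLI and using placeholder account, region, and function names, the Lambda can be pointed at the new image with boto3:

```python
# Hypothetical final deployment step: repoint the Lambda at the new image.
import boto3

IMAGE_URI = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/chess-ingestion:latest"  # placeholder

client = boto3.client("lambda")
client.update_function_code(
    FunctionName="chess-ingestion",  # placeholder function name
    ImageUri=IMAGE_URI,
)
# Wait until the update has finished before invoking the function again.
client.get_waiter("function_updated").wait(FunctionName="chess-ingestion")
```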
More details about schema management can be found in the schema documentation.