# Ingestion: dlt + lambda

## Overview

This example demonstrates a serverless data ingestion pipeline that:

1. Fetches chess data from an external source and processes it using [dlt](https://dlthub.com/)
2. Writes the data to Apache Iceberg tables on S3

The pipeline runs as an AWS Lambda function packaged in a Docker container.

<figure><img src="https://762120491-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FryeUyIxiKpsTawnfUoTV%2Fuploads%2Fgit-blob-de38ead041a6b63d990dd4bde3f35a35ad5f0c23%2Farchitecture.png?alt=media" alt=""><figcaption></figcaption></figure>

## How It Works

### Infrastructure Components

* **AWS Lambda**: Executes the ingestion code on demand
* **Amazon ECR**: Stores the Docker container image
* **AWS Glue**: Hosts the Iceberg catalog
* **Amazon S3**: Hosts the Iceberg tables
* **AWS Secrets Manager**: Stores credentials and configuration
* **Terraform**: Provisions and manages all infrastructure

### Code Structure

```
pipelines/
├── chess_lambda.tf           # Terraform creating the Lambda function and ECR repository
├── ingestion_bucket.tf       # Terraform creating the S3 bucket
└── ingest/
    ├── chess-ingestion/      # Lambda function code
    │   ├── Dockerfile
    │   ├── lambda_handler.py # Lambda handler running the dlt pipeline
    │   └── ...
    └── chess-schema/         # Iceberg schema definitions for the Glue Catalog
        ├── table_schema.py
        └── ...
```

### Data Flow Process

The pipeline follows these steps:

1. **Extraction and Loading to S3**: dlt loads data as Parquet files to an ingestion S3 bucket following this path pattern:

   ```
   {source_name}/raw/{table_name}/{load_id}.{file_id}.{ext}
   ```
2. **Iceberg Integration**: PyIceberg adds these files to Iceberg tables.\
   Iceberg tables are located in the same ingestion S3 bucket under:

   ```
   {source_name}/landing/{table_name}/
   ```

Files are inserted in append mode.
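Under these conventions, the key layout can be sketched as two small helpers (the function names are illustrative, not part of the repo):

```python
def raw_key(source_name: str, table_name: str, load_id: str, file_id: str, ext: str) -> str:
    """S3 key for a Parquet file written by dlt (pattern from the docs above)."""
    return f"{source_name}/raw/{table_name}/{load_id}.{file_id}.{ext}"


def landing_prefix(source_name: str, table_name: str) -> str:
    """S3 prefix under which PyIceberg maintains the Iceberg table."""
    return f"{source_name}/landing/{table_name}/"


# Example:
# raw_key("chess", "players_games", "1700000000.1", "0001", "parquet")
# → "chess/raw/players_games/1700000000.1.0001.parquet"
```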

## Development Guide

### 1. Local Development with DuckDB

For rapid iteration without AWS resources, use DuckDB as the destination:

1. Create a `.env.local` file with:

   ```
   DESTINATION=duckdb
   # Add any source-specific credentials here
   ```
2. Run the pipeline locally:

   ```bash
   make run-local
   ```
3. Examine the results in the local `.duckdb` database file

### 2. Local Development with S3

To run the Lambda locally with the Iceberg/S3 destination:

1. Configure `.env.local` with:

   ```
   DESTINATION=filesystem
   AWS_REGION=<your-aws-region>
   S3_BUCKET_NAME=<your-s3-bucket-name>
   AWS_PROFILE=<your-aws-profile>
   # Add any source-specific credentials here
   ```
2. Run with the same command:

   ```bash
   make run-local
   ```
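A minimal sketch of how the handler might switch destinations based on these variables (the function name and return shape are illustrative; only the environment variable names come from `.env.local` above):

```python
import os


def destination_config() -> dict:
    """Map .env.local settings to a dlt destination choice (illustrative)."""
    dest = os.environ.get("DESTINATION", "duckdb")
    if dest == "filesystem":
        # Iceberg/S3 mode: dlt's filesystem destination writes Parquet
        # files to the ingestion bucket.
        return {
            "destination": "filesystem",
            "bucket_url": f"s3://{os.environ['S3_BUCKET_NAME']}",
        }
    # Default: local DuckDB for rapid iteration without AWS resources.
    return {"destination": "duckdb"}
```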

### 3. Testing on AWS

Once your code is deployed to AWS, you can invoke the Lambda with:

```bash
export AWS_PROFILE=<your_profile>
make run-lambda env=<your_environment>
```

### 4. VSCode Debugging

For interactive debugging, add this to `.vscode/launch.json`:

```json
{
    "name": "Debug chess lambda",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/pipelines/ingest/chess-ingestion/lambda_handler.py",
    "console": "integratedTerminal",
    "cwd": "${workspaceFolder}/pipelines/ingest/chess-ingestion",
    "justMyCode": false
}
```

## Schema Management

dlt must run once before the target schema definition can be generated.

After running the pipeline locally (see above), generate a source schema definition:

```bash
cd pipelines/
uvx boringdata dlt get-schema chess
```

This generates one schema file per table in the `ingest/chess-schema` folder; these files create the Iceberg tables in the AWS Glue Catalog.

These scripts are run automatically by the CI/CD pipeline.

More details about schema management can be found [here](https://dlthub.com/docs/guides/schema-management).

## Manual Deployment

For manual deployment:

```bash
# Set required environment variables
export AWS_PROFILE=<your_profile>

# Build and deploy
make deploy env=<your_environment>
```

This process:

1. Builds the Docker image locally
2. Pushes it to ECR
3. Updates the Lambda to use the new image

## Common Commands

```bash
# Development
make run-local                        # Run locally with settings from .env.local
make run-lambda env=<environment>     # Execute on AWS Lambda

# Deployment
make build env=<environment>          # Build Docker image
make deploy env=<environment>         # Build and deploy to ECR

# Utilities
make help                             # Show all available commands
```

## Resources

* [DLT Documentation](https://dlthub.com/docs/)
* [PyIceberg Documentation](https://py.iceberg.apache.org/)
