# Ingestion: dlt + lambda

## Overview

This example demonstrates a serverless data ingestion pipeline that:

1. Fetches chess data from an external source and processes it using [dlt](https://dlthub.com/)
2. Writes the data to S3

The pipeline runs as an AWS Lambda function packaged in a Docker container.

<figure><img src="/files/9Ho9dQB2YawHq63zrGKf" alt=""><figcaption></figcaption></figure>

## How It Works

### Infrastructure Components

* **AWS Lambda**: Executes the ingestion code on demand
* **Amazon ECR**: Stores the Docker container image
* **Amazon S3**: Temporary storage for data files before loading to Snowflake
* **AWS Secrets Manager**: Stores credentials and configuration
* **Terraform**: Provisions and manages all infrastructure

### Code Structure

```
pipelines/
├── chess_lambda.tf           # Terraform creating the lambda function and ECR repository
└── ingest/
    ├── chess_source_schema.yml   # Snowflake table schema definitions in YAML format
    └── chess-ingestion/          # Lambda function code
        ├── Dockerfile
        ├── lambda_handler.py # Lambda handler that runs the dlt pipeline
        └── ...
```

### Data Flow Process

The pipeline follows these steps:

1. **Extraction**: dlt extracts data from the source
2. **Transformation**: dlt applies basic transformations (type inference, normalization)
3. **Loading**: dlt writes the data directly to S3
4. **Schema Management**: Table schemas are defined in YAML files and managed by the pipeline
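
The steps above can be sketched as a small handler. This is a minimal illustration, not the project's actual `lambda_handler.py`: the `fetch_players` extractor, the pipeline name `chess_sketch`, the dataset name `chess_data`, and the `destination` event key are all hypothetical, and `dlt` is imported lazily so the extraction part can be read and run on its own.

```python
from typing import Any, Dict, Iterator, List


def fetch_players(usernames: List[str]) -> Iterator[Dict[str, Any]]:
    """Hypothetical extractor: yields one record per player.

    A real source would call an external API here; static records keep
    the sketch runnable offline.
    """
    for name in usernames:
        yield {"username": name, "profile": {"title": "GM", "followers": 0}}


def run_pipeline(destination: str = "duckdb") -> Any:
    """Extract, normalize, and load with dlt (assumes dlt is installed)."""
    import dlt  # imported lazily; e.g. `pip install "dlt[duckdb]"`

    pipeline = dlt.pipeline(
        pipeline_name="chess_sketch",  # hypothetical name
        destination=destination,
        dataset_name="chess_data",  # hypothetical name
    )
    # During normalization dlt infers column types and flattens nested dicts.
    return pipeline.run(fetch_players(["player_one"]), table_name="players")


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Entry point in the shape AWS Lambda expects: (event, context)."""
    load_info = run_pipeline(event.get("destination", "duckdb"))
    return {"status": "ok", "load_info": str(load_info)}
```

Keeping extraction in a plain generator makes it easy to test without any destination configured.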

## Development Guide

### 1. Local Development with DuckDB

For rapid iteration without AWS resources, use DuckDB as the destination:

1. Create a `.env.local` file with:

   ```
   DESTINATION=duckdb
   # Add any source-specific credentials here
   ```
2. Run the pipeline locally:

   ```bash
   make run-local
   ```
3. Examine the results in the local `.duckdb` database

### 2. Local Development with S3

To run the Lambda code locally with S3 as the temporary destination:

1. Configure `.env.local` with:

   ```
   DESTINATION=filesystem
   AWS_REGION=<your-aws-region>
   S3_BUCKET_NAME=<your-s3-bucket-name>
   AWS_PROFILE=<your-aws-profile>
   # Add any source-specific credentials here
   ```
2. Run with the same command:

   ```bash
   make run-local
   ```
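
How the handler chooses a destination is not shown on this page. A minimal sketch, assuming it reads the `DESTINATION` and `S3_BUCKET_NAME` variables from the `.env.local` examples above, could look like this. The function name `resolve_destination` and the returned dictionary shape are hypothetical; `bucket_url` mirrors the setting name used by dlt's filesystem destination.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_destination(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Map environment variables to destination settings (hypothetical helper)."""
    env = os.environ if env is None else env
    destination = env.get("DESTINATION", "duckdb")

    if destination == "duckdb":
        return {"destination": "duckdb"}

    if destination == "filesystem":
        bucket = env.get("S3_BUCKET_NAME")
        if not bucket:
            raise ValueError("S3_BUCKET_NAME is required when DESTINATION=filesystem")
        # The filesystem destination addresses S3 through an s3:// URL.
        return {"destination": "filesystem", "bucket_url": f"s3://{bucket}"}

    raise ValueError(f"Unsupported DESTINATION: {destination!r}")
```

Failing fast on a missing bucket name surfaces configuration mistakes before any data is extracted.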

### 3. Testing on AWS

Once your code is deployed to AWS, you can run the Lambda with:

```bash
export AWS_PROFILE=<your_profile>
make run-lambda env=<your_environment>
```
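
What `make run-lambda` does internally is not shown here. As an alternative, a deployed function can also be invoked from Python with boto3. The sketch below assumes boto3 is installed and credentials are available (for example through `AWS_PROFILE`); the helper names and the function name you pass in are hypothetical and depend on what Terraform created.

```python
import json
from typing import Any, Dict, Optional


def build_invoke_kwargs(
    function_name: str, payload: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build the arguments for a synchronous Lambda invocation."""
    return {
        "FunctionName": function_name,
        "InvocationType": "RequestResponse",  # wait for the function result
        "Payload": json.dumps(payload or {}).encode("utf-8"),
    }


def invoke(function_name: str, payload: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Invoke the function and return its status code and response body."""
    import boto3  # imported lazily; requires AWS credentials at call time

    client = boto3.client("lambda")
    response = client.invoke(**build_invoke_kwargs(function_name, payload))
    body = response["Payload"].read().decode("utf-8")
    return {"status_code": response["StatusCode"], "body": body}
```

Calling `invoke("<your-function-name>")` returns the HTTP status code of the invocation together with the handler's raw response body.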

### 4. VSCode Debugging

For interactive debugging, add this to `.vscode/launch.json`:

```json
{
    "name": "Debug chess lambda",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/pipelines/ingest/chess-ingestion/lambda_handler.py",
    "console": "integratedTerminal",
    "cwd": "${workspaceFolder}/pipelines/ingest/chess-ingestion",
    "justMyCode": false
}
```

## Schema Management

Snowflake landing table schemas are defined in YAML, one file per source, at `pipelines/ingest/<source-name>_source_schema.yml`.

After running the pipeline locally, generate a source schema definition:

```bash
cd pipelines/
uvx boringdata dlt get-schema chess
```

This generates a schema file, `chess_source_schema.yml`, in the pipelines folder; the file defines the Snowflake tables.

## Manual Deployment

To build and deploy manually:

```bash
# Set required environment variables
export AWS_PROFILE=<your_profile>

# Build and deploy
make deploy env=<your_environment>
```

This process:

1. Builds the Docker image locally
2. Pushes it to ECR
3. Updates the Lambda to use the new image

## Common Commands

```bash
# Development
make run-local                        # Run locally with settings from .env.local
make run-lambda env=<environment>     # Execute on AWS Lambda

# Deployment
make build env=<environment>          # Build Docker image
make deploy env=<environment>         # Build, push to ECR, and update the Lambda

# Utilities
make help                             # Show all available commands
```

## Resources

* [dlt Documentation](https://dlthub.com/docs/)
* [Snowflake SQL API Documentation](https://docs.snowflake.com/en/developer-guide/sql-api/index)

