# Add a New Pipeline

This guide explains how to add a new data pipeline to the template.

The pipeline architecture includes:

1. Data ingestion using serverless functions (AWS Lambda) and an ELT tool (dlt)
2. Data lake storage in cloud object storage (AWS S3)
3. Data transformation using an SQL transformation engine ([Amazon Athena](https://aws.amazon.com/athena/)) and dbt

The boringdata CLI automates many steps along the way.

Before you start, make sure you have installed the boringdata CLI:

{% tabs %}
{% tab title="SSH GitHub auth" %}
{% code overflow="wrap" %}

```bash
uv tool install git+ssh://git@github.com/boringdata/boringdata-cli.git --python 3.12
```

{% endcode %}
{% endtab %}

{% tab title="HTTPS GitHub auth" %}
{% code overflow="wrap" %}

```bash
uv tool install git+https://github.com/boringdata/boringdata-cli.git --python 3.12
```

{% endcode %}
{% endtab %}
{% endtabs %}

You can then use the boringdata CLI from any directory:

<pre class="language-bash"><code class="lang-bash"><strong>uvx boringdata --help
</strong></code></pre>

## Step 1: Add a New Data Source

Let's start by adding a new data source for ingestion.

The template uses [dlt](https://dlthub.com/docs/intro) as the ingestion framework. Check the [dlt ecosystem](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) to find the connector you want.

You can then generate a full ingestion pipeline for this connector by running:

```bash
cd pipelines && uvx boringdata dlt add-source <connector_name> --destination iceberg
```

This command will create the following files:

* `pipelines/<source_name>-lambda.tf`: serverless function infrastructure
* `pipelines/ingest/<source_name>-ingestion/*`: ingestion code embedded in a serverless function

Boringdata will also run some helpful operations:

* Set up a Python virtual environment and install necessary dependencies
* Copy `.env.example` to `.env.local`
* Initialize the data connector
* Parse required secrets from configuration files and update both environment variables and infrastructure configurations

Example using the [Notion API](https://developers.notion.com/) as a source:

```bash
cd pipelines && uvx boringdata dlt add-source notion --destination iceberg
```

{% hint style="info" %}
You can give your source a different name from the connector name.

To do so, pass the CLI option `--source-name <source_name>`
{% endhint %}

## Step 2: Configure Secrets

If your source requires secrets (for example, an API key), update the `.env.example` file.

After deployment, update these secrets manually in [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) if needed.

Example for Notion integration:

The following line should be present in the `.env` file:

```bash
SOURCES__NOTION__API_KEY="your_api_key_here"
```
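
For context, the double underscores in the variable name follow dlt's configuration convention: `SOURCES__NOTION__API_KEY` resolves to the config key `sources.notion.api_key`, which the Notion source reads. A quick way to confirm the value is visible to dlt, assuming the variable is exported in your shell (for example from `.env.local`):

```python
import dlt

# SOURCES__NOTION__API_KEY maps to the config key "sources.notion.api_key"
# under dlt's environment-variable naming convention.
api_key = dlt.secrets["sources.notion.api_key"]
print("API key loaded:", bool(api_key))
```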

## Step 3: Customize the Ingestion Logic

Edit `pipelines/ingest/<source_name>-ingestion/lambda_handler.py`

{% code title="pipelines/ingest/\<source\_name>-lambda/lambda\_handler.py" %}

```python
# Add missing imports
from <source_name> import <source_functions>
...

# Update the scope of data to be loaded
load_data = ...
```

{% endcode %}

Example for Notion integration:

```python
from notion import notion_databases
...

# Update the scope of data to be loaded
load_data = notion_databases(database_ids=["your_database_id"])
```

{% hint style="info" %}
Use the `<connector_name>_pipeline.py` generated by the framework as inspiration.
{% endhint %}
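
If you are unsure how `load_data` fits into the rest of the handler, the outline below is a minimal sketch of a dlt run, not the template's generated handler: the pipeline, destination, and dataset names are placeholders, and the deployed function targets Iceberg on S3 rather than local DuckDB.

```python
import dlt
from notion import notion_databases  # dlt verified source module


def handler(event, context):
    # Scope the load to the databases you need (placeholder ID below)
    load_data = notion_databases(database_ids=["your_database_id"])

    # Placeholder pipeline/destination/dataset names; the generated handler
    # defines its own and writes to Iceberg on S3 in the deployed setup.
    pipeline = dlt.pipeline(
        pipeline_name="notion",
        destination="duckdb",
        dataset_name="notion_data",
    )
    load_info = pipeline.run(load_data)
    return {"status": "ok", "tables_loaded": str(load_info)}
```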

## Step 4: Test the Ingestion Function Locally

To verify your changes, run the function locally (using [DuckDB](https://duckdb.org/) as a local target):

```bash
cd pipelines/ingest/<source_name>-ingestion/ && make run-local
```

This step allows you to test the function and inspect the output data format.
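
To inspect what the local run produced, you can open the DuckDB file it wrote. The file, schema, and table names below are assumptions (dlt's default local target is `<pipeline_name>.duckdb` in the working directory), so adjust them to match your output:

```python
import duckdb

# Assumed file name; match whatever the local run actually wrote.
con = duckdb.connect("notion.duckdb", read_only=True)

# List every table across schemas, then preview one (assumed names).
print(con.sql("SHOW ALL TABLES"))
print(con.sql("SELECT * FROM notion_data.notion_databases LIMIT 5"))
```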

## Step 5: Generate the Source Schema

After running the pipeline locally (see above), generate a source schema definition:

```bash
cd pipelines/
uvx boringdata dlt get-schema <source_name> \
    --engine iceberg \
    --output-folder ingest
```
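
If you want to eyeball the schema dlt inferred during the local run before generating the schema files, you can print it from the pipeline's local state. The pipeline name here is an assumption and must match whatever the generated handler passes to `dlt.pipeline(...)`:

```python
import dlt

# Attach to the pipeline state left behind by the local run
# ("notion" is an assumed pipeline name).
pipeline = dlt.pipeline(pipeline_name="notion")
print(pipeline.default_schema.to_pretty_yaml())
```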

## Step 6: Create Transformation Models

Based on the schema files generated in step 5, boringdata can automatically generate corresponding SQL transformation models for each of the tables using [Amazon Athena](https://aws.amazon.com/athena/):

```bash
cd pipelines/transform
uvx boringdata dbt import-source \
    --source-yml ../ingest/<source_name>-schema/
```

## Step 7: (Optional) Add Workflow Automation

To coordinate the ingestion and transformation steps, add workflow automation using [AWS Step Functions](https://aws.amazon.com/step-functions/):

```bash
cd pipelines
uvx boringdata aws step-function lambda-dbt \
    --source-name <source_name>
```
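
After deployment you can trigger the workflow manually to check the wiring end to end. The state machine ARN below is a placeholder to replace with the real one from the Step Functions console or your Terraform outputs:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Replace with the real ARN after `make deploy` (placeholder below).
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:<region>:<account_id>:stateMachine:<source_name>-pipeline",
    input=json.dumps({}),
)
print(response["executionArn"])
```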

## Step 8: Deploy the Infrastructure

Finally, deploy the project from the root directory:

```bash
export AWS_PROFILE=your_aws_profile
export ENVIRONMENT=dev
make deploy
```
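
If your source needs a real API key (see Step 2), remember to set it in AWS Secrets Manager once the infrastructure exists. A minimal sketch with boto3, where the secret name and payload structure are placeholders you'd look up in the Secrets Manager console or your Terraform outputs:

```python
import boto3

sm = boto3.client("secretsmanager")

# Placeholder secret name and payload; use the ones created for your source.
sm.put_secret_value(
    SecretId="<source_name>-ingestion-secrets",
    SecretString='{"SOURCES__NOTION__API_KEY": "your_real_api_key"}',
)
```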
