Add a New Pipeline
This guide explains how to add a new data pipeline to the template.
The pipeline architecture includes:
Data ingestion using serverless functions (AWS Lambda) and an ELT tool (dlt)
Data lake storage in cloud object storage (AWS S3)
Data transformation using an SQL transformation engine (Amazon Athena) and dbt
The boringdata CLI automates many steps along the way.
Before you start, make sure you have installed the boringdata CLI:
uv tool install git+ssh://[email protected]/boringdata/boringdata-cli.git --python 3.12
You can then use the boringdata CLI from any directory:
uvx boringdata --help
Step 1: Add a New Data Source
Let's start by adding a new data source for ingestion.
The template uses dlt as the ingestion framework. Check the dlt ecosystem to find the connector you want.
You can then generate a full ingestion pipeline for this connector by running:
cd pipelines && uvx boringdata dlt add-source <connector_name> --destination iceberg
This command will create the following files:
pipelines/<source_name>-lambda.tf: the serverless function infrastructure
pipelines/ingest/<source_name>-ingestion/*: the ingestion code embedded in a serverless function
Boringdata will also perform several helpful setup operations:
Set up a Python virtual environment and install necessary dependencies
Copy .env.example to .env.local
Initialize the data connector
Parse required secrets from configuration files and update both environment variables and infrastructure configurations
Example using the Notion API as a source:
cd pipelines && uvx boringdata dlt add-source notion --destination iceberg
Step 2: Configure Secrets
If your source requires secrets (for example, an API key), update the .env.example file accordingly.
After deployment, update these secrets manually in AWS Secrets Manager if needed.
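If you prefer scripting that update over clicking through the console, a minimal boto3 sketch is shown below; the secret name and JSON layout are assumptions, so match them to whatever the deployed stack actually creates.
import boto3

secrets_client = boto3.client("secretsmanager")

# Secret name and payload shape are assumptions: check the secret the
# template provisioned before running this.
secrets_client.put_secret_value(
    SecretId="<source_name>-ingestion-secrets",
    SecretString='{"SOURCES__NOTION__API_KEY": "your_api_key_here"}',
)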
Example for Notion integration:
The following line should be present in the .env file:
SOURCES__NOTION__API_KEY="your_api_key_here"
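dlt resolves double-underscore environment variables into its configuration hierarchy, so SOURCES__NOTION__API_KEY is read as the sources.notion.api_key secret at runtime. A quick, illustrative way to confirm the value is picked up locally:
import dlt

# Assumes SOURCES__NOTION__API_KEY is exported in the current shell;
# dlt maps the double-underscore path to sources.notion.api_key.
api_key = dlt.secrets["sources.notion.api_key"]
print("API key loaded:", bool(api_key))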
Step 3: Customize the Ingestion Logic
Edit pipelines/ingest/<source_name>-ingestion/lambda_handler.py
# Add missing imports
from <source_name> import <source_functions>
...
# Update the scope of data to be loaded
load_data = <source_function>(...)
Example for Notion integration:
from notion import notion_databases
...
# Update the scope of data to be loaded
load_data = notion_databases(database_ids=["your_database_id"])
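For reference, here is a minimal sketch of what the customized handler can end up looking like, assuming the generated code follows the standard dlt pattern of building a pipeline and running the source against it; the pipeline, destination, and dataset names are illustrative, not what the template generates.
import dlt
from notion import notion_databases

def lambda_handler(event, context):
    # Illustrative pipeline setup; the generated handler wires up the real
    # Iceberg/S3 destination and naming for you.
    pipeline = dlt.pipeline(
        pipeline_name="notion_ingestion",
        destination="athena",
        dataset_name="notion_raw",
    )
    # Scope of data to load, as customized above.
    load_data = notion_databases(database_ids=["your_database_id"])
    load_info = pipeline.run(load_data)
    return {"status": "ok", "load_info": str(load_info)}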
Step 4: Test the Ingestion Function Locally
To verify your changes, run the function locally (using DuckDB as a local target):
cd pipelines/ingest/<source_name>-ingestion/ && make run-local
This step allows you to test the function and inspect the output data format.
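Once the local run finishes, you can inspect what dlt wrote to the local DuckDB file; the file, dataset, and table names below are assumptions and depend on how the generated pipeline is named.
import duckdb

# File name is an assumption: dlt typically names the local DuckDB file
# after the pipeline (<pipeline_name>.duckdb).
con = duckdb.connect("notion_ingestion.duckdb")
print(con.sql("SHOW ALL TABLES"))
# Dataset and table names are illustrative; use whatever the listing shows.
print(con.sql("SELECT * FROM notion_raw.notion_databases LIMIT 5"))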
Step 5: Generate the Source Schema
After running the pipeline locally (see above), generate a source schema definition:
cd pipelines/
uvx boringdata dlt get-schema <source_name> \
--engine iceberg \
--output-folder ingest
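If you want to eyeball the schema dlt inferred before handing it to the CLI, you can also dump it straight from the local pipeline's state; the pipeline name is an assumption and should match the one used by your local run.
import dlt

# Attaches to the pipeline state left behind by the local run
# (pipeline name is an assumption).
pipeline = dlt.pipeline(pipeline_name="notion_ingestion")
print(pipeline.default_schema.to_pretty_yaml())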
Step 6: Create Transformation Models
Based on the schema files generated in step 5, boringdata can automatically generate a corresponding SQL transformation model for each table, targeting Amazon Athena:
cd pipelines/transform
uvx boringdata dbt import-source \
--source-yml ../ingest/<source_name>-schema/
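Once the models are generated, you can build just the models downstream of the new source. One option, assuming the dbt project and Athena profile in pipelines/transform are already configured for local runs, is dbt's programmatic runner:
from dbt.cli.main import dbtRunner

# Runs only the models downstream of the new source
# (the selector assumes the source is registered as "notion" in dbt).
result = dbtRunner().invoke(["run", "--select", "source:notion+"])
print("dbt run succeeded:", result.success)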
Step 7: (Optional) Add Workflow Automation
To coordinate the ingestion and transformation steps, add workflow automation using AWS Step Functions:
cd pipelines
uvx boringdata aws step-function lambda-dbt \
--source-name <source_name>
Step 8: Deploy the Infrastructure
Finally, deploy the project from the root directory:
export AWS_PROFILE=your_aws_profile
export ENVIRONMENT=dev
make deploy
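Once the deployment has finished, you can trigger the Step Function from step 7 to exercise the end-to-end flow. The name filter below is an assumption; adjust it to the state machine the template actually creates.
import boto3

sfn = boto3.client("stepfunctions")

# Find the state machine for the new source (name filter is an assumption).
machines = sfn.list_state_machines()["stateMachines"]
target = next(m for m in machines if "notion" in m["name"])

execution = sfn.start_execution(stateMachineArn=target["stateMachineArn"])
print("Started:", execution["executionArn"])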