
Add a New Pipeline

This guide explains how to add a new data pipeline to the template.

The pipeline architecture includes:

  1. Data ingestion using serverless functions (AWS Lambda) and an ELT tool (dlt)

  2. Staging in cloud object storage (Amazon S3)

  3. Automated data loading into Snowflake landing tables

  4. Data transformation using SQL analytics with dbt

The boringdata CLI automates many steps along the way.

Before you start, make sure you have installed the boringdata CLI:

```shell
uv tool install git+ssh://[email protected]/boringdata/boringdata-cli.git --python 3.12
```

You can then use the boringdata CLI from any directory:

```shell
uvx boringdata --help
```

Step 1: Add a New Data Source

Let's start by adding a new data source for ingestion.

The template uses dlt as the ingestion framework. Check the dlt ecosystem to find the connector you want.

You can then generate a full ingestion pipeline for this connector by running:

```shell
cd pipelines && uvx boringdata dlt add-source <connector_name>
```

This command will create the following files:

pipelines/<source_name>_lambda.tf: the serverless function (AWS Lambda) infrastructure

pipelines/ingest/<source_name>-ingestion/*: the Lambda's dockerized code

Boringdata will also run some helpful operations:

  • Set up a Python virtual environment and install the necessary dependencies

  • Copy .env.example to .env.local

  • Initialize the dlt data connector

  • Parse required secrets from configuration files and update both environment variables and infrastructure configurations

Example using the Notion API as a source:
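Using the `add-source` command above with Notion's dlt connector (connector names come from the dlt ecosystem page; `notion` is assumed to be the connector you picked):

```shell
cd pipelines && uvx boringdata dlt add-source notion
```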

Note: You can assign a different name to your source than the connector name.

To do so, add the CLI option: --source-name <source_name>
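For instance, to register the Notion connector under a custom source name (notion_marketing is an arbitrary example name):

```shell
cd pipelines && uvx boringdata dlt add-source notion --source-name notion_marketing
```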

Step 2: Configure Secrets

If your source requires secrets (for example, an API key), update the .env.example file accordingly.

After deployment, update these secrets manually in AWS Secrets Manager if needed.

Example for Notion integration:

The following lines should be present in the .env file:
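The exact variable name is whatever the generated connector expects; NOTION_API_KEY below is an assumed name used for illustration:

```shell
# .env / .env.local -- hypothetical secret entry for a Notion source
NOTION_API_KEY=secret_xxxxxxxxxxxxxxxx
```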

Step 3: Customize the Ingestion Logic

Edit pipelines/ingest/<source_name>-ingestion/lambda_handler.py to implement the ingestion logic for your source.

Example for Notion integration:
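The handler's exact shape depends on the code the CLI generates, but the general pattern is: read the secret, build a dlt pipeline, and run it. A minimal sketch (the NOTION_API_KEY variable, dataset name, and event fields are assumptions; the dlt calls follow dlt's public API):

```python
import os


def get_notion_api_key() -> str:
    """Read the Notion secret from the environment.

    In AWS the template injects secrets via Secrets Manager; locally the
    value comes from .env.local. NOTION_API_KEY is an assumed name.
    """
    key = os.environ.get("NOTION_API_KEY")
    if not key:
        raise RuntimeError("NOTION_API_KEY is not set")
    return key


def lambda_handler(event, context):
    # Imported lazily so the module can be loaded without dlt installed.
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="notion",
        destination=event.get("destination", "filesystem"),  # e.g. S3 staging
        dataset_name="notion_raw",
    )
    # Placeholder data: a real handler would run the generated Notion
    # connector here instead of this inline sample row.
    info = pipeline.run(
        [{"page_id": "demo", "api_key_set": bool(get_notion_api_key())}],
        table_name="pages",
    )
    return {"status": "ok", "load_info": str(info)}
```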

Note: Use the <connector_name>_pipeline.py file generated by dlt as inspiration.

Step 4: Test the Ingestion Function Locally

To verify your changes, run the function locally (using DuckDB as a local target):

This step allows you to test the function and inspect the output data format.
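Once the local run has completed, you can inspect the resulting DuckDB file directly; the file name depends on your pipeline configuration (notion_pipeline.duckdb is an assumed example):

```shell
# Hypothetical inspection of the DuckDB file produced by a local run
duckdb notion_pipeline.duckdb -c "SHOW TABLES;"
```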

Step 5: Generate the Source Schema

Generate a YAML file that defines your source's data structure (used to create data warehouse tables in Snowflake):
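The shape below is purely illustrative; the actual format is defined by the boringdata CLI, and table and column names depend on your source:

```yaml
# Illustrative only -- one entry per table, with column types that the
# CLI can turn into Snowflake landing tables.
tables:
  pages:
    columns:
      page_id: varchar
      created_at: timestamp
      url: varchar
```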

Step 6: Create Transformation Models

Based on the YAML file generated in step 5, boringdata can automatically generate corresponding SQL transformation models for each of the tables:
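A generated staging model typically just selects and renames columns from the landing table. A hand-written equivalent might look like this (the source, table, and column names are assumptions for illustration; the syntax is standard dbt):

```sql
-- Hypothetical dbt staging model, e.g. models/staging/stg_notion__pages.sql
select
    id                      as page_id,
    created_time::timestamp as created_at,
    url
from {{ source('notion', 'pages') }}
```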

Step 7: (Optional) Add Workflow Automation

To coordinate the ingestion and transformation steps, add workflow automation using AWS Step Functions:
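Conceptually, the state machine chains the ingestion Lambda and the transformation step. A minimal sketch in Amazon States Language (the function names and truncated ARNs are placeholders, not the template's actual resources):

```json
{
  "Comment": "Hypothetical flow: ingest from the source, then run dbt",
  "StartAt": "Ingest",
  "States": {
    "Ingest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:notion-ingestion",
      "Next": "Transform"
    },
    "Transform": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:dbt-runner",
      "End": true
    }
  }
}
```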

Step 8: Deploy the Infrastructure

Finally, deploy the project:
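Since the template describes its infrastructure in Terraform (the generated <source_name>_lambda.tf file), deployment would follow the standard Terraform flow; the exact commands may differ if the boringdata CLI wraps them:

```shell
cd pipelines
terraform init
terraform plan   # review the resources to be created
terraform apply
```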
