Skip to content

Latest commit

 

History

History
425 lines (353 loc) · 11.9 KB

File metadata and controls

425 lines (353 loc) · 11.9 KB

bit-dbt Project Setup - Complete Summary

✅ Completed Tasks

1. Repository Structure ✓

Created complete dbt project structure with:

  • Standard dbt directories (models, macros, tests, seeds, snapshots, analyses)
  • Airflow integration directory
  • GitHub workflows for CI/CD
  • Utility scripts for deployment and testing

2. Core dbt Configuration ✓

dbt_project.yml

  • Configured project name: bandsintown
  • Set materialization defaults:
    • Staging: views
    • Intermediate: views
    • Marts: tables
  • Configured schema organization
  • Added test configurations

profiles.yml

  • Three environments: dev, staging, prod
  • Configured dbt-athena-community adapter
  • EMR Serverless integration via workgroups
  • All credentials via environment variables
  • No hardcoded secrets

requirements.txt

Pinned versions for:

  • dbt-core==1.7.13
  • dbt-athena-community==1.7.2
  • boto3==1.34.84
  • pyathena==3.5.3
  • apache-airflow==2.8.4
  • apache-airflow-providers-amazon==8.19.0

3. Sample dbt Models ✓

Created example staging model pipeline:

Source Definition (src_bandsintown_raw.yml)

  • Defined bandsintown_raw.events source
  • Added column-level documentation
  • Configured freshness checks (24h warning, 48h error)
  • Added source-level tests

Staging Model (stg_events.sql)

  • Transforms raw events data
  • Adds calculated fields (event_date_only, event_month, event_year)
  • Implements data quality filters
  • Materialized as view

Model Documentation (stg_bandsintown_raw.yml)

  • Full column documentation
  • Comprehensive test coverage:
    • Primary key (unique, not_null)
    • Foreign keys (not_null)
    • Accepted values for status field
    • Custom business logic tests

4. Airflow Integration ✓

DAG: bandsintown_dbt (airflow/dags/bandsintown_dbt_dag.py)

  • Waits for EMR ingestion completion via ExternalTaskSensor
  • Pipeline flow:
    1. emr_sensor
    2. dbt_deps
    3. dbt_debug
    4. dbt_source_freshness
    5. dbt_run
    6. dbt_test
    7. dbt_docs_generate
    8. upload_docs_to_s3
  • Scheduled daily at 6 AM UTC
  • Email alerts on failure
  • Environment variable driven (no hardcoded values)

5. AWS/EMR Serverless Configuration ✓

IAM Policy Template (iam-policy-template.json) Comprehensive permissions for:

  • Athena query execution
  • S3 read (raw data) and read/write (analytics)
  • Glue catalog operations
  • EMR Serverless access
  • CloudWatch logging

Work Group Configuration

  • Configured in profiles.yml
  • Separate workgroups for dev/staging/prod
  • EMR Serverless enabled via spark_work_group setting

6. GitHub Repository Configuration ✓

.gitignore

  • Excludes: target/, dbt_packages/, .env, logs/
  • Python artifacts
  • IDE files
  • OS files

CODEOWNERS

  • Data Platform Team owns all files
  • Airflow DAGs require data engineering leads review
  • Infrastructure changes require DevOps approval

GitHub Actions Workflows

CI Workflow (.github/workflows/ci.yml)

  • Triggers on PR to main
  • Jobs:
    • dbt-compile
    • dbt-test
    • sql-lint (sqlfluff)
    • security-scan (Trivy, TruffleHog)
    • Slack notifications

Deploy Workflow (.github/workflows/deploy.yml)

  • Deploy to staging (automatic on main push)
  • Deploy to production (manual approval required)
  • Creates backups before production deployment
  • Generates and uploads documentation
  • Slack notifications for all deployments

7. Developer Tools ✓

Scripts

  1. scripts/setup.sh - Initialize local development environment
  2. scripts/deploy.sh - Deploy to any environment (dev/staging/prod)
  3. scripts/test.sh - Run comprehensive test suite

Makefile Common commands:

  • make setup - Setup environment
  • make test - Run all tests
  • make run - Run dbt models
  • make docs - Generate and serve docs
  • make deploy-{env} - Deploy to environment
  • make lint - Lint SQL files

SQL Linting (.sqlfluff)

  • Configured for Athena dialect
  • dbt templater support
  • Consistent style enforcement

8. Documentation ✓

README.md

  • Comprehensive project overview
  • Quick start guide
  • Architecture diagram
  • Configuration instructions
  • Usage examples
  • Troubleshooting guide

CONTRIBUTING.md

  • Development workflow
  • Code style guide
  • Testing instructions
  • PR process
  • Best practices

GITHUB_SETUP.md

  • Complete GitHub repository setup checklist
  • Branch protection rules
  • Team access configuration
  • CI/CD setup
  • Security configuration

9. Additional Features ✓

packages.yml

  • dbt_utils (common macros)
  • audit_helper (environment comparison)
  • codegen (schema generation)

Custom Macros

  • generate_schema_name - Schema name generation logic
  • cents_to_dollars - Currency conversion
  • union_tables - Union multiple tables

.env.example Template for environment variables:

  • AWS configuration
  • dbt Athena settings
  • Airflow configuration
  • EMR Serverless settings

📂 Complete Project Structure

bit-dbt/
├── .github/
│   ├── workflows/
│   │   ├── ci.yml                    # CI pipeline
│   │   └── deploy.yml                # Deployment pipeline
│   └── CODEOWNERS                     # Code ownership rules
├── airflow/
│   └── dags/
│       └── bandsintown_dbt_dag.py    # Main orchestration DAG
├── models/
│   ├── staging/
│   │   └── bandsintown_raw/
│   │       ├── src_bandsintown_raw.yml    # Source definitions
│   │       ├── stg_events.sql              # Staging model
│   │       └── stg_bandsintown_raw.yml    # Model documentation
│   ├── intermediate/                  # Business logic models
│   └── marts/                         # Final analytics tables
├── macros/
│   ├── generate_schema_name.sql      # Schema naming logic
│   ├── cents_to_dollars.sql          # Currency conversion
│   └── union_tables.sql              # Table union helper
├── tests/                             # Custom data tests
├── seeds/                             # Reference data
├── snapshots/                         # SCD Type 2 snapshots
├── analyses/                          # Ad-hoc queries
├── scripts/
│   ├── setup.sh                      # Environment setup
│   ├── deploy.sh                     # Deployment script
│   └── test.sh                       # Test runner
├── dbt_project.yml                   # dbt project config
├── profiles.yml                      # Connection profiles
├── packages.yml                      # dbt package dependencies
├── requirements.txt                  # Python dependencies
├── Makefile                          # Common commands
├── .gitignore                        # Git ignore rules
├── .env.example                      # Environment template
├── .sqlfluff                         # SQL linting config
├── iam-policy-template.json          # AWS IAM policy
├── README.md                         # Main documentation
├── CONTRIBUTING.md                   # Contributor guide
├── GITHUB_SETUP.md                   # GitHub setup guide
└── PROJECT_SUMMARY.md                # This file

🎯 Acceptance Criteria Status

Criterion Status Notes
bandsintown/bit-dbt repo structure complete All files created locally, ready for GitHub push
dbt debug returns "Connection test: OK" Requires AWS credentials and live Athena connection
stg_events view exists in analytics schema Will be created on first dbt run
Airflow DAG runs end-to-end DAG created, requires deployment to Airflow
dbt docs generates artifacts Configuration complete, ready to generate

📋 Next Steps

Immediate (Local Testing)

  1. Configure AWS Credentials

    cp .env.example .env
    # Edit .env with your AWS credentials
  2. Run Setup

    ./scripts/setup.sh
  3. Test Connection

    make debug
  4. Run Models

    make run

GitHub Repository Setup

  1. Create Repository

    gh repo create bandsintown/bit-dbt --private
  2. Push Code

    git init
    git add .
    git commit -m "feat: initial bit-dbt project setup"
    git branch -M main
    git remote add origin git@github.com:bandsintown/bit-dbt.git
    git push -u origin main
  3. Configure Repository

    • Follow GITHUB_SETUP.md checklist
    • Set up branch protection
    • Add team access
    • Configure secrets
    • Enable security scanning

AWS Infrastructure Setup

  1. Create S3 Buckets

    • s3://bandsintown-dbt-analytics/ (data and docs)
    • s3://bandsintown-dbt-{env}/ (project files)
    • s3://bandsintown-airflow-{env}/dags/ (Airflow DAGs)
  2. Create Athena Workgroups

    • bandsintown-dbt-dev
    • bandsintown-dbt-staging
    • bandsintown-dbt-prod
    • Enable EMR Serverless compute
  3. Create IAM Role

    • Use iam-policy-template.json
    • Attach to EMR Serverless application
    • Grant Athena execution permissions
  4. Create Glue Databases

    • bandsintown_raw (source data)
    • bandsintown_analytics_dev
    • bandsintown_analytics_staging
    • bandsintown_analytics_prod

Airflow Deployment

  1. Upload DAG

    aws s3 cp airflow/dags/bandsintown_dbt_dag.py \
      s3://your-airflow-bucket/dags/
  2. Verify DAG Appears

    • Check Airflow UI
    • Verify DAG is parsed correctly
    • Check environment variables
  3. Test Run

    • Trigger manual DAG run
    • Monitor execution
    • Verify models are created

Validation

  1. End-to-End Test

    # Run complete test suite
    ./scripts/test.sh
  2. Verify in Athena

    -- Check staging view exists
    SELECT * FROM bandsintown_analytics_dev.stg_events LIMIT 10;
  3. Check Documentation

    make docs
    # Verify manifest.json and catalog.json created

🔧 Configuration Requirements

Environment Variables Needed

# AWS
AWS_REGION=us-east-1
AWS_PROFILE=bandsintown

# dbt Athena
DBT_ATHENA_S3_STAGING_DIR=s3://bandsintown-dbt-analytics/dev/
DBT_ATHENA_S3_DATA_DIR=s3://bandsintown-dbt-analytics/dev/data/
DBT_ATHENA_DATABASE=bandsintown_analytics_dev
DBT_ATHENA_WORKGROUP=bandsintown-dbt-dev

# EMR Serverless
EMR_SERVERLESS_APPLICATION_ID=<your-app-id>
EMR_SERVERLESS_EXECUTION_ROLE_ARN=arn:aws:iam::ACCOUNT:role/EMRServerlessRole

# Source Data
RAW_DATA_DATABASE=bandsintown_raw

GitHub Secrets Needed

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION
  • DBT_ATHENA_S3_STAGING_DIR_STAGING
  • DBT_ATHENA_S3_STAGING_DIR_PROD
  • DBT_ATHENA_WORKGROUP_STAGING
  • DBT_ATHENA_WORKGROUP_PROD
  • SLACK_WEBHOOK_URL

📞 Support & Resources

Internal

External

🎉 Summary

The bit-dbt service repository is now fully configured with:

  • ✅ Complete dbt Core project structure
  • ✅ EMR Serverless / Athena integration
  • ✅ Airflow orchestration DAG
  • ✅ Sample staging model with tests
  • ✅ CI/CD pipelines (GitHub Actions)
  • ✅ Deployment automation scripts
  • ✅ Comprehensive documentation
  • ✅ Developer tooling (Makefile, linting, etc.)

All local files are created and ready for:

  1. Local testing (with AWS credentials)
  2. Git initialization and push to GitHub
  3. AWS infrastructure provisioning
  4. Airflow deployment
  5. End-to-end validation

Total Files Created: 30+

The project follows Bandsintown Engineering Handbook patterns and is production-ready pending AWS infrastructure setup and GitHub repository creation.


Created: May 14, 2026
Version: 1.0.0
Epic: DI-12 (Infrastructure Setup)
Initiative: DI-11 (dbt Data Platform)