AI Agent Instructions for mollerdb

This document provides comprehensive instructions for AI agents working on the mollerdb project. It summarizes the design, decisions, and current state of the project.

Last Updated: 2025-11-01 18:54:00 UTC
Last User: wdconinc

1. Project Overview

The mollerdb project is a high-performance, dual-language Software Development Kit (SDK) for accessing the MOLLER experiment's analysis database. MOLLER (Measurement Of a Lepton Lepton Electroweak Reaction) is a precision physics experiment at Jefferson Lab that will measure the parity-violating asymmetry in electron-electron scattering. The SDK gives collaborators convenient access to the experiment's database without requiring proficiency in SQL.

2. Core Design Decisions

2.1. Overall SDK Architecture

The SDK will support both C++ and Python users without duplicating core logic. The chosen architecture is:

  • C++ Core Library (libmollerdb): A central, high-performance C++ library that contains all database interaction logic.
  • Python Bindings: A thin wrapper around the C++ core, exposing its functionality to Python.

2.2. Key Technologies

  • Build System: scikit-build-core was chosen as the modern PEP 517 build backend. It will orchestrate the build process by invoking CMake. This replaces an earlier suggestion of using setuptools with a custom CMakeBuild class.
  • C++ Database Driver: sqlpp23 is the designated library for connecting to and interacting with the PostgreSQL database from C++. It was chosen over libpqxx because it is a header-only library with fewer dependencies, making it more platform-independent and easier to manage. It will be included as a git submodule.
  • Python Bindings: pybind11 is the standard tool chosen to create the bindings between C++ and Python.
  • Data Interchange Format: Apache Arrow was identified as the critical technology for efficient, zero-copy data transfer between the C++ core and Python. C++ functions will query the database and construct Arrow Table objects, which can be converted to Pandas DataFrames in Python with minimal overhead.

2.3. Naming and Structure

After some discussion, the following naming convention was finalized for consistency:

  • Repository: JeffersonLab/mollerdb
  • Python Package Name: mollerdb
  • Compiled Python Module: mollerdb (the .so or .pyd extension module built with pybind11).
  • The final Python package structure will be located in python/mollerdb/.

3. Associated Database Schema Design

The SDK design was informed by a proposed redesign of the underlying database schema (qwparity_schema.dbml). Key schema suggestions include:

  • Unified Results Table: Merging detector-specific tables (md_data, lumi_data, beam) into a single, flexible results table.
  • Generalized Sensitivities Table: Abstracting "slopes" into a generic sensitivities table to store any linear correlation (detector-monitor, detector-detector, etc.). This requires a master quantity lookup table.
  • Versioning: Adding versioning (e.g., valid_from_run, valid_to_run) to detector and quantity tables to ensure long-term reproducibility.
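
The versioning idea can be sketched with an in-memory SQLite database. Only valid_from_run and valid_to_run come from the proposal above. The detector table layout, the gain column, the detector name md1, and the convention that a NULL valid_to_run means "still valid" are assumptions made for illustration, and the real schema targets PostgreSQL rather than SQLite.

```python
import sqlite3


def detector_version_for_run(conn, name, run_number):
    """Return the (version, gain) row of `name` that was valid for `run_number`."""
    return conn.execute(
        """
        SELECT version, gain
        FROM detector
        WHERE name = ?
          AND valid_from_run <= ?
          AND (valid_to_run IS NULL OR valid_to_run >= ?)
        """,
        (name, run_number, run_number),
    ).fetchone()


# Hypothetical table: one row per detector configuration and validity range.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE detector (
        name TEXT NOT NULL,
        version INTEGER NOT NULL,
        gain REAL NOT NULL,
        valid_from_run INTEGER NOT NULL,
        valid_to_run INTEGER          -- NULL: open-ended (assumed convention)
    )
    """
)
conn.executemany(
    "INSERT INTO detector VALUES (?, ?, ?, ?, ?)",
    [
        ("md1", 1, 1.00, 1000, 1999),
        ("md1", 2, 1.05, 2000, None),
    ],
)

print(detector_version_for_run(conn, "md1", 1500))  # (1, 1.0)
print(detector_version_for_run(conn, "md1", 2500))  # (2, 1.05)
```

Re-running an old analysis against run 1500 keeps returning version 1 even after version 2 is added, which is the reproducibility property the versioning columns are meant to provide.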

4. Project Structure

Status: Schema integration is complete. The database schema from MOLLER-parity-schema is integrated as a git submodule and C++ headers are generated during the build process.

Documentation Structure:

  • Place topical documentation files in the docs/ directory
  • Keep high-level project overview in README.md
  • Keep agent-specific guidance in AGENTS.md
  • See docs/SCHEMA_INTEGRATION.md for detailed information on the schema integration approach

mollerdb/
├── include/          # C++ header files
├── src/              # C++ source files
│   └── Database.cpp  # Core database interaction logic
├── python/           # Python package
│   ├── mollerdb/     # Python package directory
│   │   └── __init__.py
│   └── bindings.cpp  # pybind11 bindings
├── thirdparty/       # Third-party dependencies (git submodules)
├── docs/             # Documentation (published to GitHub Pages)
├── CMakeLists.txt    # CMake build configuration
└── pyproject.toml    # Python package configuration

5. Development Guidelines

5.1. Building the Project

Python Package:

pip install -e .

C++ Library (via CMake):

cmake -S . -B build
cmake --build build

5.2. Git Submodules

The project uses git submodules for dependencies like sqlpp23. Always ensure submodules are initialized:

git submodule update --init --recursive
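
A build script could check for uninitialized submodules before invoking CMake. The sketch below relies on the documented behavior of git submodule status, which prefixes uninitialized submodules with "-". The helper names and the sample paths (thirdparty/sqlpp23, thirdparty/other) are illustrative assumptions, not part of the project.

```python
import subprocess


def uninitialized_submodules(status_output):
    """Return paths of submodules that `git submodule status` marks with '-'."""
    paths = []
    for line in status_output.splitlines():
        if line.startswith("-"):
            # Line format: "-<sha1> <path>"
            paths.append(line[1:].split()[1])
    return paths


def check_repository(repo_dir="."):
    """Run git in `repo_dir` and list submodules that still need initializing."""
    result = subprocess.run(
        ["git", "submodule", "status", "--recursive"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return uninitialized_submodules(result.stdout)


# Sample output with hypothetical paths: the first submodule is not initialized.
sample = (
    "-0123456789abcdef0123456789abcdef01234567 thirdparty/sqlpp23\n"
    " 89abcdef0123456789abcdef0123456789abcdef thirdparty/other (v1.0)\n"
)
print(uninitialized_submodules(sample))  # ['thirdparty/sqlpp23']
```

If check_repository() returns a non-empty list, running the update command above should resolve it.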

5.3. Code Style

  • Follow the existing code style in each file
  • C++ code uses the C++23 standard (as specified in CMakeLists.txt)
  • Python code should follow PEP 8 guidelines
  • Keep the C++ core focused on database operations
  • Keep Python bindings thin and delegate to C++ core

5.4. Data Flow

  1. C++ functions query the database using sqlpp23
  2. Results are converted to Apache Arrow Table objects
  3. Arrow tables are passed to Python without copying the underlying buffers (zero-copy)
  4. Python users can convert Arrow tables to Pandas DataFrames

5.5. Naming Conventions

  • Python Package: mollerdb
  • Compiled Module: mollerdb
  • C++ Namespace: mollerdb
  • Use snake_case for Python code
  • Use camelCase for C++ code (following existing conventions)

6. Dependencies

6.1. C++ Dependencies

  • sqlpp23 (git submodule)
  • Apache Arrow C++ library
  • PostgreSQL client library (libpq)
  • pybind11

6.2. Python Dependencies

  • scikit-build-core (build)
  • pybind11 (build)
  • pyarrow (runtime - for Arrow table interaction)
  • pandas (optional - for DataFrame conversion in user code and examples)

7. CI/CD

The project uses GitHub Actions for continuous integration (see .github/workflows/). When modifying code:

  • Ensure builds pass on supported platforms
  • Git submodules must be checked out in CI workflows
  • Both C++ and Python components should be tested

8. Documentation

8.1. Documentation Structure

The project documentation is located in the docs/ directory and is published to GitHub Pages using Docsify. The documentation structure includes:

  • docs/README.md: Main documentation file with installation and usage instructions
  • docs/_sidebar.md: Sidebar navigation configuration
  • docs/index.html: Docsify configuration file
  • docs/.nojekyll: Disables Jekyll processing for GitHub Pages

8.2. Documentation Maintenance Guidelines

IMPORTANT: When developing new features or making changes, agents MUST keep the documentation up to date:

  1. When Adding New Features:

    • Update docs/README.md with usage examples for the new feature
    • Add API reference documentation for new classes/methods
    • Update installation instructions if new dependencies are required
    • Add to the appropriate section in docs/_sidebar.md if creating new documentation pages
  2. When Modifying Existing Features:

    • Update all affected examples in docs/README.md
    • Ensure API reference documentation reflects the changes
    • Update any affected usage instructions
  3. Periodic Documentation Review:

    • Before completing major features, review the documentation for accuracy
    • Verify that all code examples are correct and runnable
    • Check that installation instructions are current
    • Ensure API reference documentation matches the actual implementation
  4. Documentation Testing:

    • When making significant documentation changes, verify that:
      • All code examples are syntactically correct
      • Installation instructions work on supported platforms
      • Links to external resources are valid
      • The docsify sidebar navigation works correctly
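
The syntax check on code examples can be partly automated. The following is a minimal sketch, assuming the examples are fenced Markdown blocks tagged python. It only compiles each block and never executes it, so it catches syntax errors but not runtime failures. The function name is an illustrative choice.

```python
import re

TICKS = "`" * 3  # built indirectly so this example can itself sit inside a fence
BLOCK = re.compile(TICKS + r"python\n(.*?)" + TICKS, re.DOTALL)


def syntax_errors(markdown_text):
    """Return (block_index, message) for each Python block that fails to compile."""
    errors = []
    for index, source in enumerate(BLOCK.findall(markdown_text)):
        try:
            compile(source, f"<block {index}>", "exec")
        except SyntaxError as exc:
            errors.append((index, exc.msg))
    return errors


# Two blocks: the first is valid, the second has a deliberate syntax error.
sample = "\n".join(
    [
        "Usage:",
        TICKS + "python",
        "x = 1 + 1",
        TICKS,
        TICKS + "python",
        "def broken(:",
        TICKS,
    ]
)
print([index for index, _ in syntax_errors(sample)])  # [1]
```

The same function could be applied to the contents of docs/README.md before a documentation change is pushed.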

The documentation is automatically deployed to GitHub Pages via the .github/workflows/pages.yml workflow when changes are pushed to the main branch.

9. Common Tasks

9.1. Adding a New Database Query Function

  1. Implement the C++ function in src/Database.cpp
  2. Use sqlpp23 for database interaction
  3. Return results as Arrow Table objects
  4. Expose the function in python/bindings.cpp using pybind11
  5. Update Python package __init__.py if needed
  6. Update documentation in docs/README.md with usage examples

9.2. Modifying Build Configuration

  • C++ build: Edit CMakeLists.txt
  • Python package: Edit pyproject.toml
  • Dependencies: Update both files as needed

9.3. Working with Arrow Tables

  • Construct Arrow tables in C++ using the Arrow C++ API
  • Pass table pointers through pybind11
  • Python receives Arrow tables that can be converted to Pandas DataFrames

10. Important Notes

  • Minimal Changes: Make the smallest possible changes to achieve goals
  • Testing: Run existing tests before and after changes
  • Dependencies: Avoid adding new dependencies unless absolutely necessary
  • Platform Independence: Keep code portable (Linux, macOS, Windows)
  • Performance: The C++ core is designed for high performance; maintain this in all changes
  • Documentation: Update docs/README.md and docstrings when changing public APIs

11. Security Considerations

  • Never commit database credentials or connection strings
  • Use environment variables for sensitive configuration
  • Validate all database inputs to prevent SQL injection (sqlpp23's typed query construction helps, but do not rely on it alone)
  • Review dependencies for known vulnerabilities
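
One way to keep credentials out of the repository is to read them from the environment at run time. The sketch below uses the standard libpq variable names (PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD). Whether mollerdb reads these variables is an assumption, and the helper names and example values are hypothetical.

```python
import os

REQUIRED = ("PGHOST", "PGDATABASE", "PGUSER", "PGPASSWORD")


def connection_settings(environ=os.environ):
    """Collect connection settings from the environment instead of source code."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "host": environ["PGHOST"],
        "port": int(environ.get("PGPORT", "5432")),  # PostgreSQL default port
        "dbname": environ["PGDATABASE"],
        "user": environ["PGUSER"],
        "password": environ["PGPASSWORD"],
    }


def redacted(settings):
    """Return a copy of the settings that is safe to print or log."""
    return {**settings, "password": "***"}


# Placeholder values for demonstration only; real values come from the shell or CI secrets.
example_env = {
    "PGHOST": "db.example.org",
    "PGDATABASE": "example_db",
    "PGUSER": "reader",
    "PGPASSWORD": "not-a-real-password",
}
print(redacted(connection_settings(example_env)))
```

Failing early on missing variables avoids a silent fallback to defaults that might end up hard-coded later.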

12. Next Steps

The project is ready for ongoing development:

  1. Implementing Core Logic: Flesh out the src/Database.cpp file to perform actual database queries using sqlpp23.
  2. Integrating Apache Arrow: Add the logic to build Arrow Table objects from the query results and implement the C++-to-Python type conversions for these tables.
  3. CI/CD Setup: Create a GitHub Actions workflow to build and test the C++ and Python components on various platforms, ensuring that external dependencies such as Apache Arrow are correctly handled. The workflow must check out git submodules so that sqlpp23 is available.
  4. Documentation and Examples: Expand the docs/README.md and add an examples/ directory showing how to use the SDK in both Python and C++.

13. Contact

For questions about design decisions or project direction, contact the maintainers at wdconinc@jlab.org.