This document provides comprehensive instructions for AI agents working on the mollerdb project. It summarizes the design, decisions, and current state of the project.
Last Updated: 2025-11-01 18:54:00 UTC Last User: wdconinc
The mollerdb project is a high-performance, dual-language Software Development Kit (SDK) for accessing the MOLLER experiment's analysis database. MOLLER (Measurement Of a Lepton Lepton Electroweak Reaction) is a precision physics experiment at Jefferson Lab that will measure parity-violating asymmetry in electron-electron scattering. This SDK provides convenient access to the experiment's database for collaborators who may not be proficient in SQL.
The SDK will support both C++ and Python users without duplicating core logic. The chosen architecture is:
- C++ Core Library (`libmollerdb`): A central, high-performance C++ library that contains all database interaction logic.
- Python Bindings: A thin wrapper around the C++ core, exposing its functionality to Python.
- Build System: `scikit-build-core` was chosen as the modern PEP 517 build backend. It will orchestrate the build process by invoking CMake. This replaces an earlier suggestion of using `setuptools` with a custom `CMakeBuild` class.
- C++ Database Driver: `sqlpp23` is the designated library for connecting to and interacting with the PostgreSQL database from C++. It was chosen over `libpqxx` because it is a header-only library with fewer dependencies, making it more platform-independent and easier to manage. It will be included as a git submodule.
- Python Bindings: `pybind11` is the standard tool chosen to create the bindings between C++ and Python.
- Data Interchange Format: Apache Arrow was identified as the critical technology for efficient, zero-copy data transfer between the C++ core and Python. C++ functions will query the database and construct Arrow `Table` objects, which can be converted to Pandas DataFrames in Python with minimal overhead.
After some discussion, the following naming convention was finalized for consistency:
- Repository: `JeffersonLab/mollerdb`
- Python Package Name: `mollerdb`
- Compiled Python Module: `mollerdb` (This is the `.so` or `.pyd` file generated by `pybind11`).
- The final Python package structure will be located in `python/mollerdb/`.
The SDK design was informed by a proposed redesign of the underlying database schema (qwparity_schema.dbml). Key schema suggestions include:
- Unified Results Table: Merging detector-specific tables (`md_data`, `lumi_data`, `beam`) into a single, flexible `results` table.
- Generalized Sensitivities Table: Abstracting "slopes" into a generic `sensitivities` table to store any linear correlation (detector-monitor, detector-detector, etc.). This requires a master `quantity` lookup table.
- Versioning: Adding versioning (e.g., `valid_from_run`, `valid_to_run`) to detector and quantity tables to ensure long-term reproducibility.
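To illustrate how run-range versioning supports reproducibility, here is a small sketch of selecting the record version that applies to a given run. Only `valid_from_run` and `valid_to_run` come from the schema suggestion. The `VersionedRecord` class, the `calibration` field, the inclusive bounds, and the use of `None` for an open-ended range are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VersionedRecord:
    """One version of a detector or quantity row (hypothetical fields)."""

    name: str
    valid_from_run: int
    valid_to_run: Optional[int]  # None means "still valid" (assumed convention)
    calibration: float


def record_for_run(
    records: Sequence[VersionedRecord], name: str, run: int
) -> Optional[VersionedRecord]:
    """Return the version of `name` whose inclusive run range contains `run`."""
    for rec in records:
        if rec.name != name:
            continue
        upper_ok = rec.valid_to_run is None or run <= rec.valid_to_run
        if rec.valid_from_run <= run and upper_ok:
            return rec
    return None


# Two versions of the same (made-up) detector entry.
HISTORY = [
    VersionedRecord("det_a", 1, 499, 1.00),
    VersionedRecord("det_a", 500, None, 1.02),
]
```

Re-analyzing an old run with `record_for_run(HISTORY, "det_a", 120)` picks the first version even after a newer one has been added, which is the property the versioning columns are meant to provide.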
Status: Schema integration is complete. The database schema from MOLLER-parity-schema is integrated as a git submodule and C++ headers are generated during the build process.
Documentation Structure:
- Place topical documentation files in the `docs/` directory
- Keep high-level project overview in `README.md`
- Keep agent-specific guidance in `AGENTS.md`
- See `docs/SCHEMA_INTEGRATION.md` for detailed information on the schema integration approach
mollerdb/
├── include/ # C++ header files
├── src/ # C++ source files
│ └── Database.cpp # Core database interaction logic
├── python/ # Python package
│ ├── mollerdb/ # Python package directory
│ │ └── __init__.py
│ └── bindings.cpp # pybind11 bindings
├── thirdparty/ # Third-party dependencies (git submodules)
├── docs/ # Documentation (published to GitHub Pages)
├── CMakeLists.txt # CMake build configuration
└── pyproject.toml # Python package configuration
Python Package:
pip install -e .
C++ Library (via CMake):
mkdir build && cd build
cmake ..
make
The project uses git submodules for dependencies like sqlpp23. Always ensure submodules are initialized:
git submodule update --init --recursive
- Follow the existing code style in each file
- C++ code uses C++23 standard (as specified in CMakeLists.txt)
- Python code should follow PEP 8 guidelines
- Keep the C++ core focused on database operations
- Keep Python bindings thin and delegate to C++ core
- C++ functions query the database using sqlpp23
- Results are converted to Apache Arrow `Table` objects
- Arrow tables are passed to Python with zero-copy
- Python users can convert Arrow tables to Pandas DataFrames
- Python Package: `mollerdb`
- Compiled Module: `mollerdb`
- C++ Namespace: `mollerdb`
- Use snake_case for Python code
- Use camelCase for C++ code (following existing conventions)
- sqlpp23 (git submodule)
- Apache Arrow C++ library
- PostgreSQL client library (libpq)
- pybind11
- scikit-build-core (build)
- pybind11 (build)
- pyarrow (runtime - for Arrow table interaction)
- pandas (optional - for DataFrame conversion in user code and examples)
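A quick way to see which of the Python-side dependencies listed above are available in the current environment is sketched below. It uses only the standard library. The grouping into `RUNTIME_MODULES` and `OPTIONAL_MODULES` mirrors the list above, and the helper names are assumptions for illustration.

```python
from importlib.util import find_spec

RUNTIME_MODULES = ("pyarrow",)   # needed to work with returned Arrow tables
OPTIONAL_MODULES = ("pandas",)   # only needed for DataFrame conversion


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


def check_environment():
    """Report missing runtime and optional modules without importing them."""
    return {
        "required": missing_modules(RUNTIME_MODULES),
        "optional": missing_modules(OPTIONAL_MODULES),
    }
```

An empty `"required"` list means the runtime dependency is importable; entries under `"optional"` only matter for code paths that convert to DataFrames.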
The project uses GitHub Actions for continuous integration (see .github/workflows/). When modifying code:
- Ensure builds pass on supported platforms
- Git submodules must be checked out in CI workflows
- Both C++ and Python components should be tested
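Because a missing submodule checkout tends to surface only as a confusing build failure, a pre-build check can help both locally and in CI. The sketch below reads the `path = ...` entries from `.gitmodules` and reports any submodule directory that is missing or empty. The function names are hypothetical, and treating an empty directory as "not initialized" is a heuristic assumed for this example.

```python
from __future__ import annotations

from pathlib import Path


def submodule_paths(repo_root: Path) -> list[str]:
    """Read the `path = ...` entries from .gitmodules under repo_root."""
    gitmodules = repo_root / ".gitmodules"
    if not gitmodules.is_file():
        return []
    paths = []
    for line in gitmodules.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "path":
            paths.append(value.strip())
    return paths


def uninitialized_submodules(repo_root: Path) -> list[str]:
    """Return submodule paths whose directories are missing or empty."""
    missing = []
    for rel in submodule_paths(repo_root):
        directory = repo_root / rel
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(rel)
    return missing
```

Calling `uninitialized_submodules(Path.cwd())` from the repository root and failing the job when the result is non-empty would catch a checkout that skipped `git submodule update --init --recursive`.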
The project documentation is located in the docs/ directory and is published to GitHub Pages using Docsify. The documentation structure includes:
- docs/README.md: Main documentation file with installation and usage instructions
- docs/_sidebar.md: Sidebar navigation configuration
- docs/index.html: Docsify configuration file
- docs/.nojekyll: Disables Jekyll processing for GitHub Pages
IMPORTANT: When developing new features or making changes, agents MUST keep the documentation up to date:
- When Adding New Features:
  - Update `docs/README.md` with usage examples for the new feature
  - Add API reference documentation for new classes/methods
  - Update installation instructions if new dependencies are required
  - Add to the appropriate section in `docs/_sidebar.md` if creating new documentation pages
- When Modifying Existing Features:
  - Update all affected examples in `docs/README.md`
  - Ensure API reference documentation reflects the changes
  - Update any affected usage instructions
- Periodic Documentation Review:
  - Before completing major features, review the documentation for accuracy
  - Verify that all code examples are correct and runnable
  - Check that installation instructions are current
  - Ensure API reference documentation matches the actual implementation
- Documentation Testing:
  - When making significant documentation changes, verify that:
    - All code examples are syntactically correct
    - Installation instructions work on supported platforms
    - Links to external resources are valid
    - The Docsify sidebar navigation works correctly
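The "syntactically correct" check for Python examples can be partly automated. The sketch below extracts fenced Python blocks from Markdown text and compiles each one without executing it. It is an illustrative helper, not an existing project tool, and it only recognizes blocks whose opening fence is immediately followed by the word `python`.

```python
import re

FENCE = "`" * 3  # built this way so the example can itself sit inside a fenced block
PYTHON_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def python_blocks(markdown_text):
    """Return the source of every fenced Python block in a Markdown string."""
    return PYTHON_BLOCK.findall(markdown_text)


def syntax_errors(markdown_text):
    """Compile each Python block (without running it) and collect syntax errors."""
    errors = []
    for index, source in enumerate(python_blocks(markdown_text), start=1):
        try:
            compile(source, f"<block {index}>", "exec")
        except SyntaxError as exc:
            errors.append((index, exc.msg))
    return errors
```

Feeding it the contents of `docs/README.md` gives a list of `(block_number, message)` pairs. It does not check that the examples run correctly, only that they parse.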
The documentation is automatically deployed to GitHub Pages via the .github/workflows/pages.yml workflow when changes are pushed to the main branch.
- Implement the C++ function in `src/Database.cpp`
- Use sqlpp23 for database interaction
- Return results as Arrow `Table` objects
- Expose the function in `python/bindings.cpp` using pybind11
- Update Python package `__init__.py` if needed
- Update documentation in `docs/README.md` with usage examples
- C++ build: Edit `CMakeLists.txt`
- Python package: Edit `pyproject.toml`
- Dependencies: Update both files as needed
- Construct Arrow tables in C++ using the Arrow C++ API
- Pass table pointers through pybind11
- Python receives Arrow tables that can be converted to Pandas DataFrames
- Minimal Changes: Make the smallest possible changes to achieve goals
- Testing: Run existing tests before and after changes
- Dependencies: Avoid adding new dependencies unless absolutely necessary
- Platform Independence: Keep code portable (Linux, macOS, Windows)
- Performance: The C++ core is designed for high performance; maintain this in all changes
- Documentation: Update docs/README.md and docstrings when changing public APIs
- Never commit database credentials or connection strings
- Use environment variables for sensitive configuration
- Validate all database inputs to prevent SQL injection (sqlpp23 provides protection)
- Review dependencies for known vulnerabilities
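Two of the points above, reading credentials from the environment and keeping user input out of SQL text, can be sketched in a few lines. The environment variable names (`MOLLERDB_HOST` and so on), the table layout, and the helper names are hypothetical. An in-memory SQLite database stands in for PostgreSQL purely so the example is self-contained.

```python
import os
import sqlite3

# Hypothetical variable names; the project may choose different ones.
REQUIRED_VARS = ("MOLLERDB_HOST", "MOLLERDB_USER", "MOLLERDB_PASSWORD")


def load_db_settings(env=os.environ):
    """Collect connection settings from the environment, failing if any are unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def make_demo_connection():
    """In-memory SQLite stand-in for the real database (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (run_number INTEGER, detector TEXT)")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?)",
        [(1, "md1"), (2, "md1"), (3, "lumi1")],
    )
    return conn


def runs_for_detector(conn, detector):
    """Look up run numbers with a bound parameter instead of string formatting."""
    cursor = conn.execute(
        "SELECT run_number FROM results WHERE detector = ? ORDER BY run_number",
        (detector,),
    )
    return [row[0] for row in cursor]
```

With the bound parameter, an input such as `"md1' OR '1'='1"` is compared as a literal string and matches nothing, rather than altering the query.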
The project is ready for ongoing development:
- Implementing Core Logic: Flesh out the `src/Database.cpp` file to perform actual database queries using `sqlpp23`.
- Integrating Apache Arrow: Add the logic to build Arrow `Table` objects from the query results and implement the C++-to-Python type conversions for these tables.
- CI/CD Setup: Create a GitHub Actions workflow to build and test the C++ and Python components on various platforms, ensuring all dependencies (`arrow`) are correctly handled. The workflow must ensure git submodules are checked out to provide `sqlpp23`.
- Documentation and Examples: Expand the `docs/README.md` and add an `examples/` directory showing how to use the SDK in both Python and C++.
For questions about design decisions or project direction, contact the maintainers at wdconinc@jlab.org.