These guidelines apply when writing, reviewing, or modifying code in this repository. Apply the sections relevant to the files being changed.
Style nitpicks are handled by build_tooling/format.py — focus on substantive issues.
- Title clarity: Concise, grammatically correct summary of what the PR does. Avoid vague titles like "Changes" or "WIP".
- Grammar and spelling: Check for typos and unclear phrasing.
- Description completeness: Explain what changed and why. Flag any significant diff changes not mentioned in the description.
- API changes called out: Public Python API changes (signatures, return types, defaults, exceptions) must be explicitly called out.
- Breaking changes labelled: On-disk format or breaking changes must be highlighted
and the PR labelled appropriately (
enhancement,bug, etc.). - Linked issues: Reference related issues (e.g., "Fixes #1234").
- Scope match: Title/description should match the actual scope of changes.
ArcticDB has a layered Python API: Arctic -> Library -> NativeVersionStore
-> C++ PythonVersionStore via pybind11.
- Breaking changes to
Arctic,Library, orNativeVersionStoremust be called out in PR descriptions and labelled. - Deprecation protocol: Removed or renamed parameters must go through a deprecation cycle.
- Library option defaults: Changes to defaults like rows_per_segment, columns_per_segment, dynamic_schema affect all users silently.
- QueryBuilder: New operations must follow the existing fluent builder pattern.
- Return type changes from read/write operations affect downstream code.
- Exception hierarchy: Preserve error codes and exception types users may catch.
- Docstrings: Public methods must have accurate docstrings matching signatures.
- GIL management: Long-running C++ operations must release the GIL
(
py::call_guard<py::gil_scoped_release>()). Callbacks into Python must reacquire it. - Exception translation: C++ exceptions must map to appropriate Python exceptions.
- Object lifetime: pybind11
py::object,py::arrayheld across GIL release can cause use-after-free. - Type conversion: Handle None/NaN, empty arrays, and dtype edge cases.
- Normalization checks: Data validation at the boundary must not be bypassed.
- String handling: Python str/bytes to C++ must handle Unicode correctly (UTF-8).
ArcticDB has a custom memory management system with detachable memory for Arrow interop, buffer pools, chunked buffers, and an optional slab allocator.
- Allocator usage: Use correct
AllocationType(DYNAMIC, PRESIZED, DETACHABLE). - Buffer ownership: Buffers must not be used after being moved or returned to a pool.
- ChunkedBuffer: Correct chunk management and iterator validity.
- RAII compliance: All resources must use RAII. No raw
new/delete. - Move semantics: Use
ARCTICDB_MOVE_ONLY_DEFAULT,ARCTICDB_NO_MOVE_OR_COPY, etc. Accidental copies of large buffers are performance bugs. - Slab allocator compatibility: Allocation patterns must be compatible when enabled.
- Memory leaks: New leak sanitizer suppressions need strong justification.
ArcticDB guarantees that newer clients can read data written by older clients and vice versa.
- Binary layout changes: Packed structs (
FixedHeader,Block,EncodedField,FieldStats) have fixed sizes validated bystatic_assert. ANY change breaks stored data. - Magic numbers: Section markers in encoded segments must not be changed.
- Encoding version compatibility: Both V1 (protobuf headers) and V2 (binary headers) paths must remain readable.
- Key type changes:
KeyTypeenum values are persisted. Adding is fine, removing or renumbering breaks storage. - ValueType enum: Must match the protobuf descriptor exactly. New types require fallback handling in older clients.
- Version chain markers:
__write__,__tombstone__,__tombstone_all__, etc. must not change. - Protobuf schema: No removed/renumbered fields, use
optionalfor new fields. - Compat tests: On-disk format changes must update pinned-version compatibility tests.
- Segment structure: Changes to segment format affect all read/write paths (high-risk).
- Codec correctness: LZ4, ZSTD, Passthrough, and PFOR must produce lossless roundtrips.
- Hash integrity: Block-level hashing changes break data validation.
- Encoded field ordering: Must maintain deterministic ordering.
- Default codec changes: Affect storage efficiency and compatibility.
- Buffer size calculations: Must be exact. Off-by-one errors cause buffer overflows.
Each backend (S3, Azure, LMDB, MongoDB, Memory) implements an abstract Storage interface with methods for write, read, remove, iterate, and key_exists.
- Interface compliance: Implement ALL required virtual methods.
- Atomic write support: Classification (NO, YES, NEEDS_TEST) must be correct. Incorrect classification causes data corruption under concurrent access.
- Error handling: Throw appropriate storage exception subtypes, not generic exceptions.
- Key encoding: Consistent format across backends.
- Mock consistency: Mocks must mirror real client behaviour.
- URI parsing: Changes can break user connection strings.
- Credential handling: No hardcoded credentials, tokens, or secrets. No logging of sensitive connection parameters.
The versioning engine manages a version chain (linked list of versions per symbol), a symbol list, and snapshots with a lock-free concurrency model (last-writer-wins).
- Version chain integrity: Incorrect pointer manipulation corrupts the chain.
- Tombstone handling: Incorrect load types cause stale reads or missed tombstones.
- Symbol list consistency: Race conditions cause
list_symbols()to return incorrect results. - Snapshot safety: Snapshots must be atomic — partial snapshots are corrupt.
- De-duplication: Incorrect dedup causes data loss or storage waste.
- Schema checks: Compatibility validation on
append()/update()must not be bypassed. - Batch operations: Must maintain atomicity guarantees.
The pipeline handles read/write data serialization, and the processing layer implements pushdown query execution (filter, project, aggregate, resample).
- Slicing correctness: Default 100K rows x 127 columns. Off-by-one errors cause data loss or duplication.
- Frame reconstruction: Must preserve column ordering, index alignment, and dtypes.
- Query clause composition: Clauses must compose correctly (filter -> project -> aggregate -> concat).
- Expression evaluation: Type promotion must follow NumPy rules.
- Aggregation correctness: Handle NaN, empty groups, and overflow for sum/mean.
- Resample operations: Handle timezone-aware timestamps, DST transitions, irregular intervals.
- String pool: Pool corruption causes garbled string data.
- Sparse data: Correctly track present/absent values via bitmap format.
The async infrastructure uses Folly executors with CPU and IO thread pools.
- Thread pool sizing: Unbounded task submission can exhaust memory.
- Future composition: Dropped futures silently discard errors.
- Deadlock potential: No
.get()calls from within thread pool callbacks. - Exception propagation: Exceptions must be captured in futures and re-thrown on
.get().
- Arrow output format: Must produce valid Arrow arrays/tables.
- Memory interop: Arrow buffers backed by ArcticDB's detachable allocator must have correct lifetime management.
- Type mapping: Arrow types must map correctly to ArcticDB's
ValueTypeenum. - String format options: Handle both small and large string types.
- Column type handling: All
ValueTypevariants must be handled in switch statements. Missing cases cause undefined behaviour. - Type promotion: Mixed-type columns must promote correctly following descriptor merging logic.
- Empty column handling: Empty DataFrames, all-NaN columns, zero-row appends must not crash or corrupt state.
- Statistics: Must be updated correctly on append/update.
- Memory segment: Correct reference counting, bounds checking, iterator validity.
- Key immutability:
AtomKeyandRefKeyshould be treated as immutable after construction. - StreamDescriptor changes: Must remain backwards-compatible.
- FieldCollection: Bounds-checked index-based access.
- NativeTensor: Correctly handle stride and shape matching numpy's memory layout.
ArcticDB supports protobuf versions 3 through 6 via separate generated bindings.
- Proto backwards compatibility: No removed or renumbered fields. New fields must be
optional. - Multi-version support: Changes to
.protofiles must regenerate bindings for all supported versions. - msgpack compatibility: Preserve wire format.
- Protobuf mapping: C++ proto-to-native type mappings must be correct.
- Copy-pasted logic: Search the repo to confirm before duplicating. Extract shared functions or use existing utilities.
- Parallel implementations: C++ and Python implementations of the same concept must be consistent.
- Storage backend boilerplate: Use shared base class methods or templates.
- Test duplication: Parametrize or extend existing tests instead of near-identical new ones.
- Error handling boilerplate: Use existing error handling macros and utilities.
- Switch statement duplication: Use dispatch tables or templated approaches.
- Test coverage: Every behavioural change needs corresponding tests (Google Test for C++, pytest for Python).
- Edge cases: Empty DataFrames, single-row/single-column, max segment size, Unicode/special characters, concurrent access.
- Integration tests: Required for storage backend or version engine changes.
- Property-based tests: Useful for type promotion, slicing, aggregation.
- Benchmark regression: Watch for O(n^2) patterns, unnecessary copies, allocation hotspots.
- Backwards compat tests: Required for on-disk format changes.
- Mock consistency: Mocks must accurately represent real backend behaviour.
- Error codes: Use correct
ErrorCategoryand unique codes. - Preconditions: Internal assertion macros for invariants, not user input validation.
- User-facing errors: Clear, actionable messages referencing the failed operation.
- Log levels: No debug logging in hot paths. No sensitive data in log output.
- Storage exceptions: Use storage-specific exception types.
- CMake targets: New source files must be added to the correct target.
- Python dependencies: Changes to install_requires/build deps affect all users.
- Submodule updates: Verify new commits are from upstream on official branches/tags.
- Compiler compatibility: C++20 features must work on GCC, Clang, MSVC.
- Sanitizer compatibility: Code must work under ASan, TSan, UBSan.
- vcpkg submodule pin: Must point to a specific commit on
microsoft/vcpkgmaster. - Manifest version pinning: All dependencies must have explicit version pins or overrides.
- Version override consistency: Overrides must match versions at the pinned vcpkg commit.
- Custom overlay ports: Must be updated when dependencies are upgraded.
- Builtin-baseline: Must match the checked-out vcpkg submodule commit hash.
- New dependencies: Must include version pin and not introduce license-incompatible transitive deps.
- Removed dependencies: Also remove overlay ports, CMake find modules, and CMakeLists usage.
- Conda environment sync: Update conda dev environment to match vcpkg version changes.
- No hardcoded credentials: Secrets, tokens, API keys must not appear in code or config.
- Input validation at boundaries: Validate URIs, symbol names, metadata, query expressions.
- Buffer overflows: Validate sizes before accessing memory in C++ code handling raw bytes.
- Unnecessary copies: Use
std::move, preferstd::string_viewoverconst std::stringparameters. Watch for pass-by-value of large types. - Allocation patterns: Minimize allocations in hot paths. Prefer stack allocation, buffer reuse, or pool allocation.
- Algorithmic complexity: Watch for O(n^2) patterns in version chain traversal, segment iteration, and query processing.
- Lazy evaluation: Use lazy DataFrame evaluation where possible.
## ArcticDB Code Review Summary
### API & Compatibility
- [ ] No breaking changes to public Python API (`Arctic`, `Library`, `NativeVersionStore`)
- [ ] Deprecation protocol followed for removed/renamed parameters
- [ ] On-disk format unchanged (or migration path documented)
- [ ] Protobuf schema backwards-compatible
- [ ] Key types and ValueType enum unchanged (or additive only)
### Memory & Safety
- [ ] RAII used for all resource management
- [ ] No use-after-move or use-after-free patterns
- [ ] Buffer size calculations correct (no overflow potential)
- [ ] GIL correctly managed at Python-C++ boundary
- [ ] No accidental copies of large objects
### Correctness
- [ ] All ValueType/KeyType switch statements exhaustive
- [ ] Edge cases handled (empty data, NaN, Unicode, concurrent access)
- [ ] Error codes unique and in correct ErrorCategory
- [ ] Storage backend interface fully implemented
- [ ] Query clause composition correct
### Code Quality
- [ ] No duplicated logic (search repo for existing utilities)
- [ ] C++ and Python implementations of shared concepts are consistent
- [ ] Tests are not duplicating existing test scenarios
### Testing
- [ ] Behavioural changes have corresponding tests
- [ ] Edge cases covered in tests
- [ ] Integration tests for storage/version engine changes
- [ ] No regression in existing test expectations
### Build & Dependencies
- [ ] New source files added to CMakeLists.txt
- [ ] Dependency changes justified and version-pinned
- [ ] Cross-platform compatibility maintained
- [ ] vcpkg submodule points to official upstream commit
- [ ] vcpkg manifest versions pinned and consistent with submodule baseline
- [ ] Custom overlay ports up-to-date with dependency versions
- [ ] Conda environment synced with vcpkg dependency versions
### Security
- [ ] No hardcoded credentials or secrets
- [ ] Input validation at system boundaries
- [ ] No buffer overflow potential in C++ code
### PR Title & Description
- [ ] Title is clear, concise, and uses imperative verb
- [ ] Title and description are free of typos and grammatical errors
- [ ] Description explains what changed and why
- [ ] All significant changes in the diff are mentioned in the description
- [ ] API/breaking changes explicitly called out in the description
- [ ] Linked issues referenced where applicable
- [ ] PR labelled appropriately (enhancement, bug, etc.)
### Documentation
- [ ] Public API docstrings accurate
- [ ] Breaking changes flagged with appropriate labels