ArcticDB is a high-performance, serverless DataFrame database for Python data science. It provides a Python API backed by a C++ processing engine with columnar storage, compression, and versioning. Data can be stored on S3, Azure Blob, LMDB, MongoDB, or in-memory.
ArcticDB stores data as keys in the underlying storage. Each key contains a segment with either data or references to other keys, forming a tree called the version chain.
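The key tree can be pictured with a small sketch. This is illustrative only (not ArcticDB's actual classes): a key either carries a data segment or references other keys, forming a tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Key:
    name: str
    data: Optional[bytes] = None                          # leaf: holds a data segment
    children: List["Key"] = field(default_factory=list)   # internal: references other keys

def count_data_segments(key: Key) -> int:
    """Count the data-bearing leaves reachable from a key."""
    if key.data is not None:
        return 1
    return sum(count_data_segments(child) for child in key.children)

# A toy version chain: VERSION -> TABLE_INDEX -> two TABLE_DATA keys
index = Key("TABLE_INDEX", children=[
    Key("TABLE_DATA/0", data=b"segment-0"),
    Key("TABLE_DATA/1", data=b"segment-1"),
])
version = Key("VERSION", children=[index])
print(count_data_segments(version))  # 2
```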
```shell
# Build (release/debug)
make build                # or: make build-debug
make configure            # CMake configure only

# Build wheel
make wheel
```

CMake presets available:

```
linux-debug, linux-release, linux-conda-debug, linux-conda-release
windows-cl-debug, windows-cl-release, macos-debug, macos-release
```
### Environment Variables

| Variable | Purpose |
|---|---|
| `ARCTIC_CMAKE_PRESET` | Select CMake preset (e.g., `linux-debug`) |
| `CMAKE_BUILD_PARALLEL_LEVEL` | Number of parallel compile jobs |
| `ARCTICDB_PROTOC_VERS` | Protobuf versions to compile (e.g., `4` or `456`) |
| `ARCTICDB_USING_CONDA` | Use conda-forge dependencies instead of vcpkg |
| `ARCTICDB_KEEP_VCPKG_SOURCES` | Retain vcpkg buildtrees after the build |
| `VCPKG_BINARY_SOURCES` | Control vcpkg binary caching (`clear` to disable) |
## 7. Testing

| Test Type | Location | Framework |
|---|---|---|
| Python unit | `python/tests/unit/` | pytest |
| Python integration | `python/tests/integration/` | pytest |
| Property-based | `python/tests/hypothesis/` | hypothesis |
| Stress | `python/tests/stress/` | pytest |
| C++ unit | `cpp/arcticdb/*/test/` | Google Test |
| C++ benchmarks | `cpp/arcticdb/*/test/benchmark_*.cpp` | Google Benchmark |
| Python benchmarks | `python/benchmarks/` | ASV |
### Running Tests

```shell
# Python tests
make test-py                      # unit tests (default)
make test-py TYPE=integration     # integration tests

# C++ tests (builds with -DTEST=ON automatically)
make test-cpp-debug FILTER=TestSuite.*

# Python benchmarks
make bench-py
```
## 8. Core Functionality Areas

### Write Path

1. **Normalization** - Convert the pandas DataFrame to the internal representation
2. **Slicing** - Split into row/column slices based on segment size (default 100K rows)
3. **Encoding** - Apply compression (LZ4/ZSTD) to each segment
4. **Storage** - Write TABLE_DATA keys, then TABLE_INDEX, then VERSION

```
DataFrame ──► Normalize ──► Slice ──► Compress ──► Store
                  │            │          │           │
                  ▼            ▼          ▼           ▼
              Internal     Segments    Encoded     Keys in
               Format      (row/col)   Segments    Storage
```
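Step 2 (row slicing) can be sketched as follows, assuming only the default of roughly 100K rows per segment stated above; this is not ArcticDB's actual implementation.

```python
# Split n_rows into half-open row ranges of at most rows_per_segment rows;
# each range corresponds to one data segment.
ROWS_PER_SEGMENT = 100_000

def row_slices(n_rows: int, rows_per_segment: int = ROWS_PER_SEGMENT):
    """Return (start, end) half-open row ranges, one per data segment."""
    return [(start, min(start + rows_per_segment, n_rows))
            for start in range(0, n_rows, rows_per_segment)]

print(row_slices(250_000))
# [(0, 100000), (100000, 200000), (200000, 250000)]
```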
### Read Path

1. **Version Resolution** - Find the VERSION key (via VERSION_REF or an explicit version)
2. **Index Lookup** - Read TABLE_INDEX to find the required TABLE_DATA keys
3. **Parallel Fetch** - Retrieve and decompress the relevant segments
4. **Reconstruction** - Assemble the DataFrame from the segments

```
Request ──► Resolve Version ──► Read Index ──► Fetch Data ──► Decompress ──► DataFrame
                  │                  │              │              │
                  ▼                  ▼              ▼              ▼
             VERSION_REF        TABLE_INDEX    TABLE_DATA      Segments
              + VERSION                        (parallel)     reassembled
```
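Steps 1 and 2 amount to following a chain of pointers through the key/value store. A toy model, assuming a flat dictionary of keys (the names are illustrative, not ArcticDB's on-disk layout):

```python
# Each entry maps a key name to either another key name or a list of data keys.
store = {
    "VERSION_REF/sym":   "VERSION/sym/1",        # mutable pointer to the latest version
    "VERSION/sym/0":     "TABLE_INDEX/sym/0",
    "VERSION/sym/1":     "TABLE_INDEX/sym/1",
    "TABLE_INDEX/sym/1": ["TABLE_DATA/sym/1/0", "TABLE_DATA/sym/1/1"],
}

def data_keys(symbol, as_of=None):
    """Follow VERSION_REF (or an explicit version) down to the data keys."""
    version_key = (f"VERSION/{symbol}/{as_of}" if as_of is not None
                   else store[f"VERSION_REF/{symbol}"])
    index_key = store[version_key]
    return store[index_key]          # these segments would be fetched in parallel

print(data_keys("sym"))
# ['TABLE_DATA/sym/1/0', 'TABLE_DATA/sym/1/1']
```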
### Query Processing (Pushdown)

ArcticDB supports pushing filter and projection operations down to the storage layer, avoiding full data materialization:

```python
from arcticdb import QueryBuilder

# Only reads the required columns and applies the filter at the storage level
q = QueryBuilder()
q = q[q["price"] > 100]
lib.read("symbol", columns=["price", "volume"], query_builder=q)
```
### Versioning Operations

| Operation | Description |
|---|---|
| `write` | Create a new version (overwrites the symbol) |
| `append` | Add rows to the latest version |
| `update` | Modify rows by index range |
| `delete` | Tombstone a version (lazy deletion) |
| `snapshot` | Create a named point-in-time reference |
| `prune_previous_versions` | Remove old versions and reclaim storage |
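A minimal in-memory model of the version lifecycle above, illustrative only (real ArcticDB versions live as keys in storage, not in a Python dict):

```python
versions = {}    # symbol -> list of [payload, tombstoned] entries

def write(symbol, payload):
    """Create a new version; earlier versions remain readable."""
    versions.setdefault(symbol, []).append([payload, False])
    return len(versions[symbol]) - 1          # new version number

def delete_version(symbol, version):
    """Tombstone a version: mark it deleted without removing data (lazy)."""
    versions[symbol][version][1] = True

def prune_previous_versions(symbol):
    """Drop everything but the latest version, reclaiming storage."""
    versions[symbol] = [versions[symbol][-1]]

v0 = write("sym", "rows-a")
v1 = write("sym", "rows-a + rows-b")          # e.g. the result of an append
delete_version("sym", v0)                      # v0 tombstoned, data still present
prune_previous_versions("sym")                 # now the old data is actually gone
print(len(versions["sym"]))  # 1
```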
### Concurrency Model

- **No locks for reads or writes** - ArcticDB does not take locks for symbol reads or writes
- **Last writer wins** - Concurrent writes to the same symbol use a last-writer-wins policy. This is guaranteed by writing the unique atom keys (data keys, index keys, version keys) first, then updating the non-unique VERSION_REF key; the last writer to update VERSION_REF wins.
- **Concurrent write caveats** - Last-writer-wins can have surprising consequences for modification operations such as `append()`: concurrent appends may appear out of order, or one may be dropped. Parallel writes to the same symbol are not recommended.
- **Symbol list** - `list_symbols()` is backed by a lock-free concurrent data structure; LOCK keys are used only for the compaction phase of the symbol list
- **Async I/O** - Segments are fetched in parallel during reads
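The last-writer-wins mechanism can be sketched as two writers racing through the same two steps (a toy model over a plain dictionary, not ArcticDB's storage code):

```python
store = {}

def concurrent_write(writer_id, payload):
    # Step 1: write unique atom keys - these never collide between writers
    data_key = f"TABLE_DATA/sym/{writer_id}"
    version_key = f"VERSION/sym/{writer_id}"
    store[data_key] = payload
    store[version_key] = data_key
    # Step 2: update the non-unique ref key - the last writer to reach
    # this point is the one whose version readers will see
    store["VERSION_REF/sym"] = version_key

concurrent_write("writer-a", "payload-a")
concurrent_write("writer-b", "payload-b")     # finishes second, so it wins

winner = store[store[store["VERSION_REF/sym"]]]
print(winner)  # payload-b
```

Note that writer-a's atom keys survive the race; only the shared pointer was overwritten, which is why the losing write is invisible rather than corrupted.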
## 9. Key Source Files Reference

### C++ Entry Points

| File | Purpose |
|---|---|
| `version/version_store_api.cpp` | Main C++ API exposed to Python |
| `version/local_versioned_engine.cpp` | Core versioned storage engine |
| `pipeline/write_frame.cpp` | DataFrame serialization |
| `pipeline/read_frame.cpp` | DataFrame deserialization |
| `processing/clause.cpp` | Query clause execution |
| `storage/storage_factory.cpp` | Storage backend instantiation |
### Python Entry Points

| File | Purpose |
|---|---|
| `arcticdb/arctic.py` | `Arctic` class - library management |
| `arcticdb/version_store/library.py` | `Library` class - user API |
| `arcticdb/version_store/_store.py` | `NativeVersionStore` - C++ bridge |
| `arcticdb/version_store/processing.py` | `QueryBuilder` for queries |
### Proto Definitions

| File | Purpose |
|---|---|
| `cpp/proto/arcticc/pb2/storage.proto` | Library and storage configuration |
| `cpp/proto/arcticc/pb2/descriptors.proto` | Data type descriptors |
| `cpp/proto/arcticc/pb2/encoding.proto` | Segment encoding format |
| `cpp/proto/arcticc/pb2/s3_storage.proto` | S3 storage configuration |
## 10. Documentation

- **User documentation**: `docs/mkdocs/` (MkDocs site)
- **API reference**: generated from docstrings
- **C++ documentation**: `docs/doxygen/` (Doxygen)
- **GitHub wiki**: development guides and architecture details