Add docs on SQL & Parquet schema / format, as well as a new CLI command: `trackio query --sql SQL_QUERY` (#502)
Conversation
🪼 branch checks and previews

🦄 change detected — This Pull Request includes changes to the following packages.

Install Trackio from this PR (includes built frontend):

pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/f85a62c1b8bb6bd1a167656c95ce3fbd83525b02/trackio-0.23.0-py3-none-any.whl"

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull request overview
Adds end-user documentation for Trackio’s SQLite/Parquet storage format and introduces a new CLI/API path to run read-only SQL queries against a project database (locally or via a remote Space), intended as a “catch-all” when trackio list / trackio get aren’t sufficient.
Changes:
- Add `SQLiteStorage.query_project()` with a read-only authorizer and expose it via the server API (`query_project`).
- Add a `trackio query project --project ... --sql ...` CLI command with human-readable and JSON output formatting.
- Add new docs pages/entries describing the SQLite schema, Parquet export layout, and query examples (plus agent skill docs updates).
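A minimal sketch of what such read-only validation can look like, before SQLite's authorizer does the real enforcement (the helper name `validate_read_only_query` and the exact rules here are illustrative assumptions, not the PR's code):

```python
import re

def validate_read_only_query(query: str) -> str:
    """Reject anything that doesn't look like a read-only statement.

    This is only a first line of defense; the durable safety comes from
    SQLite's authorizer callback, which vets every action at prepare time.
    """
    normalized = query.strip().rstrip(";").strip()
    if not re.match(r"^(select|with)\b", normalized, re.IGNORECASE):
        raise ValueError("Only read-only SELECT/WITH queries are allowed.")
    return normalized

print(validate_read_only_query("  SELECT 1;  "))
```

String-prefix checks alone are easy to bypass (e.g. a `WITH` clause wrapping a writable CTE in other databases), which is why pairing this with the authorizer matters.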
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| `trackio/sqlite_storage.py` | Implements read-only SQL validation/authorizer and `query_project()` result shaping |
| `trackio/server.py` | Exposes `query_project` as a server API endpoint |
| `trackio/cli_helpers.py` | Adds table-like formatting for query results |
| `trackio/cli.py` | Adds the `trackio query project ...` command, local/remote execution paths, and skills doc inclusion |
| `docs/source/storage_schema.md` | New documentation page for SQLite schema + Parquet layout + query usage |
| `docs/source/cli_commands.md` | Documents the new `query` command and notes remote support |
| `docs/source/_toctree.yml` | Adds the new storage schema doc to the docs nav |
| `README.md` | Mentions `trackio query` and links to the storage schema docs |
| `.changeset/cyan-forks-hang.md` | Announces the feature for release notes |
| `.agents/skills/trackio/storage_schema.md` | New agent skill reference for storage schema + direct SQL |
| `.agents/skills/trackio/retrieving_metrics.md` | Updates agent CLI guidance to include `trackio query` |
| `.agents/skills/trackio/SKILL.md` | Adds the new storage schema / SQL querying skill link |
```python
    cursor = conn.cursor()
    cursor.execute(normalized_query)
    description = cursor.description or []
    columns = [column[0] for column in description]
    rows = []
    for row in cursor.fetchall():
        rows.append({column: row[column] for column in columns})
except sqlite3.DatabaseError as e:
```
query_project fetches all rows into memory (cursor.fetchall()) and then materializes a list of dicts. A query like SELECT * FROM metrics on a large project can cause high memory usage / long blocking time (especially on a remote Space), effectively becoming a DoS vector. Consider enforcing a server-side max row limit (e.g., fetchmany(max_rows+1) and error if exceeded) and/or adding a progress handler / timeout to abort overly expensive queries.
Good catch — added a max_rows cap (default 10,000) via fetchmany(max_rows + 1) that raises ValueError when exceeded. Users can add a LIMIT clause to refine.
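The cap described above can be sketched as follows, assuming a plain `sqlite3` connection (the function name `fetch_capped` and the table are illustrative, not Trackio's actual code):

```python
import sqlite3

def fetch_capped(conn: sqlite3.Connection, sql: str, max_rows: int = 10_000) -> list[dict]:
    """Run a query but refuse to materialize more than max_rows rows.

    Fetching max_rows + 1 distinguishes "exactly at the limit" from
    "over the limit" without counting the whole result set.
    """
    conn.row_factory = sqlite3.Row
    cursor = conn.execute(sql)
    rows = cursor.fetchmany(max_rows + 1)
    if len(rows) > max_rows:
        raise ValueError(f"Query returned more than {max_rows} rows; add a LIMIT clause.")
    return [dict(row) for row in rows]

# Demo against an in-memory database with 20 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (step INTEGER, value REAL)")
conn.executemany("INSERT INTO metrics VALUES (?, ?)", [(i, i * 0.5) for i in range(20)])

print(len(fetch_capped(conn, "SELECT * FROM metrics", max_rows=50)))  # within the cap
try:
    fetch_capped(conn, "SELECT * FROM metrics", max_rows=10)  # exceeds the cap
except ValueError as e:
    print(e)
```

Note this only bounds the result set, not query cost; a `Connection.set_progress_handler` timeout would be the complementary mitigation for expensive-but-small queries.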
```python
return {
    "project": project,
    "query": normalized_query,
    "columns": columns,
    "rows": rows,
    "row_count": len(rows),
}
```
The query result may include non-JSON-serializable values (e.g., SELECT randomblob(4) returns bytes). Since the CLI --json path uses json.dumps without a default=..., this will raise TypeError at runtime. Consider normalizing result values before returning (e.g., bytes -> base64/hex string) so JSON output is robust.
Agreed — added _normalize_query_value that converts bytes/bytearray/memoryview to hex strings so json.dumps works on the CLI --json output.
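The normalization can be sketched like this (the name mirrors the `_normalize_query_value` helper mentioned above, but the body is an assumption):

```python
import json
import sqlite3

def normalize_query_value(value):
    """Make SQLite result values JSON-serializable.

    sqlite3 returns BLOB columns as bytes (or memoryview via some
    adapters), which json.dumps rejects; encode them as hex instead.
    """
    if isinstance(value, (bytes, bytearray, memoryview)):
        return bytes(value).hex()
    return value

# randomblob(4) is exactly the failure case from the review comment.
conn = sqlite3.connect(":memory:")
(blob,) = conn.execute("SELECT randomblob(4)").fetchone()
row = {"value": normalize_query_value(blob)}
print(json.dumps(row))  # e.g. {"value": "9f1c02ab"} -- 8 hex chars
```

An alternative is passing `default=...` to `json.dumps` in the CLI, but normalizing in `query_project` keeps the server API's return value uniformly serializable for all callers.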
```python
def query_project(project: str, query: str) -> dict[str, Any]:
    SQLiteStorage._ensure_hub_loaded()
    db_path = SQLiteStorage.get_project_db_path(project)
    if not db_path.exists():
        raise FileNotFoundError(f"Project '{project}' not found.")

    normalized_query = SQLiteStorage._validate_read_only_query(query)
    with SQLiteStorage._get_connection(db_path) as conn:
        conn.set_authorizer(SQLiteStorage._query_authorizer)
```
query_project is newly introduced behavior but there are no corresponding unit tests in tests/unit/test_sqlite_storage.py. Adding tests for (1) allowed SELECT/WITH queries, (2) denied write attempts (e.g., INSERT/UPDATE), and (3) safe PRAGMA allowlist behavior would prevent regressions and verify the authorizer is doing what the docs promise.
Added tests in tests/unit/test_sqlite_storage.py covering allowed SELECT/WITH/safe PRAGMA, denied writes (INSERT/UPDATE/DELETE/DROP and unsafe PRAGMA), the new row limit, bytes normalization, and missing-project handling.
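The allow/deny behavior those tests exercise can be reproduced standalone with `sqlite3`'s authorizer hook (a self-contained sketch of the technique, not Trackio's test code; the allowed-action set is an assumption):

```python
import sqlite3

# Actions a read-only query legitimately needs; everything else is denied.
ALLOWED_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}

def read_only_authorizer(action, arg1, arg2, db_name, trigger):
    return sqlite3.SQLITE_OK if action in ALLOWED_ACTIONS else sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (step INTEGER, value REAL)")
conn.execute("INSERT INTO metrics VALUES (1, 0.5)")
conn.set_authorizer(read_only_authorizer)

# SELECT passes the authorizer.
assert conn.execute("SELECT value FROM metrics").fetchone() == (0.5,)

# Any write is rejected at prepare time with "not authorized".
try:
    conn.execute("INSERT INTO metrics VALUES (2, 0.7)")
    raise AssertionError("write should have been denied")
except sqlite3.DatabaseError:
    pass
```

Because the authorizer fires during statement preparation, denial happens before any write can execute, which is what makes it stronger than string inspection of the SQL.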
```python
server.api(fn=get_system_logs, name="get_system_logs")
server.api(fn=get_snapshot, name="get_snapshot")
server.api(fn=get_logs, name="get_logs")
server.api(fn=query_project, name="query_project")
```
Registering query_project as a public server API enables arbitrary read-only SQL execution against the live project DB on a remote Space. Even with the authorizer, this expands the attack surface (large result sets, expensive queries/functions, and access to internal tables like pending_uploads that can contain absolute local file paths). Consider gating this API behind the same auth used for run mutations, and/or restricting accessible tables/adding server-side limits to mitigate data-leak and DoS risk.
Partially disagree. All the other read endpoints on this server (get_logs, get_metrics_for_run, get_runs_for_project, etc.) are exposed under the same unauthenticated policy — query_project doesn't newly expand the read surface, it just gives a generic SQL handle to the same data. Gating only this one behind the write-token would be inconsistent and would break trackio query against public Spaces, which is the feature's point.
The DoS angle is a fair concern, and that's now addressed with the server-side max_rows cap. pending_uploads is worth flagging separately — happy to scope the authorizer to user-facing tables in a follow-up if we decide the local-path leak is a real risk in practice.
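Scoping the authorizer to user-facing tables, as floated above, could look like this (a sketch under the assumption that `pending_uploads` is the internal table to hide; not the current implementation):

```python
import sqlite3

BLOCKED_TABLES = {"pending_uploads"}  # internal table(s) to hide from ad-hoc queries

def scoped_authorizer(action, table, column, db_name, trigger):
    # Reads against blocked tables are rejected outright; other read-style
    # actions are allowed, and all writes remain denied.
    if action == sqlite3.SQLITE_READ and table in BLOCKED_TABLES:
        return sqlite3.SQLITE_DENY
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (step INTEGER)")
conn.execute("CREATE TABLE pending_uploads (local_path TEXT)")
conn.set_authorizer(scoped_authorizer)

conn.execute("SELECT step FROM metrics")  # allowed
try:
    conn.execute("SELECT local_path FROM pending_uploads")
    raise AssertionError("read of blocked table should fail")
except sqlite3.DatabaseError:
    pass
```

Returning `sqlite3.SQLITE_IGNORE` instead of `SQLITE_DENY` for the blocked columns would silently substitute NULLs rather than erroring, which is another option if hard failures are too noisy.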
```md
---
"trackio": minor
---

feat: Add docs on SQL & Parquet schema / format, as well as a new CLI command: `trackio query --sql SQL_QUERY`
```
The changeset message says the new command is trackio query --sql SQL_QUERY, but the CLI implemented in this PR requires the project subcommand (trackio query project --project ... --sql ...). Please update the changeset text to match the actual CLI interface so release notes aren’t misleading.
Fixed — updated the changeset to trackio query project --project PROJECT --sql SQL_QUERY.
- query_project: enforce max_rows cap via fetchmany(max_rows+1)
- normalize bytes/bytearray/memoryview to hex for JSON safety
- add unit tests for allow/deny/limit/bytes/missing-project
- update changeset to match actual CLI (trackio query project ...)
This is a new CLI addition and doesn't really touch existing code, so I'm going to merge it in to unblock other PRs, such as #505.
Adds a new `trackio query project` command so users and agents can run read-only SQL against Trackio project data locally or via `--space`. Also documents the SQLite/Parquet storage schema and direct-query workflow across the docs, README, and Trackio skill references for easier data inspection and automation.