Add docs on SQL & Parquet schema / format, as well as a new CLI command: trackio query --sql SQL_QUERY #502

Merged
abidlabs merged 4 commits into main from format on Apr 17, 2026

@abidlabs
Member

@abidlabs abidlabs commented Apr 16, 2026

Adds a new trackio query project command so users and agents can run read-only SQL against Trackio project data, either locally or on a remote Space via --space. Also documents the SQLite/Parquet storage schema and the direct-query workflow across the docs, README, and Trackio skill references, for easier data inspection and automation.

@gradio-pr-bot
Contributor

gradio-pr-bot commented Apr 16, 2026

🪼 branch checks and previews

Name Status URL
🦄 Changes detected! Details

@gradio-pr-bot
Contributor

gradio-pr-bot commented Apr 16, 2026

🦄 change detected

This Pull Request includes changes to the following packages.

Package Version
trackio minor

  • Add docs on SQL & Parquet schema / format, as well as a new CLI command: trackio query project --project PROJECT --sql SQL_QUERY

‼️ Changeset not approved. Ensure the version bump is appropriate for all packages before approving.

  • Maintainers can approve the changeset by checking this checkbox.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 16, 2026

🪼 branch checks and previews

Name Status URL
Spaces ready! Spaces preview

Install Trackio from this PR (includes built frontend)

pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/f85a62c1b8bb6bd1a167656c95ce3fbd83525b02/trackio-0.23.0-py3-none-any.whl"

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

Copilot AI left a comment

Pull request overview

Adds end-user documentation for Trackio’s SQLite/Parquet storage format and introduces a new CLI/API path to run read-only SQL queries against a project database (locally or via a remote Space), intended as a “catch-all” when trackio list / trackio get aren’t sufficient.

Changes:

  • Add SQLiteStorage.query_project() with a read-only authorizer and expose it via the server API (query_project).
  • Add trackio query project --project ... --sql ... CLI command with human-readable and JSON output formatting.
  • Add new docs pages/entries describing the SQLite schema, Parquet export layout, and query examples (plus agent skill docs updates).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

File Description
trackio/sqlite_storage.py Implements read-only SQL validation/authorizer and query_project() result shaping
trackio/server.py Exposes query_project as a server API endpoint
trackio/cli_helpers.py Adds table-like formatting for query results
trackio/cli.py Adds trackio query project ... command, local/remote execution paths, and skills doc inclusion
docs/source/storage_schema.md New documentation page for SQLite schema + Parquet layout + query usage
docs/source/cli_commands.md Documents the new query command and notes remote support
docs/source/_toctree.yml Adds the new storage schema doc to the docs nav
README.md Mentions trackio query and links to the storage schema docs
.changeset/cyan-forks-hang.md Announces the feature for release notes
.agents/skills/trackio/storage_schema.md New agent skill reference for storage schema + direct SQL
.agents/skills/trackio/retrieving_metrics.md Updates agent CLI guidance to include trackio query
.agents/skills/trackio/SKILL.md Adds the new storage schema / SQL querying skill link

Comment thread trackio/sqlite_storage.py
Comment on lines +1230 to +1237
    cursor = conn.cursor()
    cursor.execute(normalized_query)
    description = cursor.description or []
    columns = [column[0] for column in description]
    rows = []
    for row in cursor.fetchall():
        rows.append({column: row[column] for column in columns})
except sqlite3.DatabaseError as e:

Copilot AI Apr 16, 2026

query_project fetches all rows into memory (cursor.fetchall()) and then materializes a list of dicts. A query like SELECT * FROM metrics on a large project can cause high memory usage / long blocking time (especially on a remote Space), effectively becoming a DoS vector. Consider enforcing a server-side max row limit (e.g., fetchmany(max_rows+1) and error if exceeded) and/or adding a progress handler / timeout to abort overly expensive queries.

Member Author

Good catch. Added a max_rows cap (default 10,000) via fetchmany(max_rows + 1), which raises ValueError when the cap is exceeded. Users can add a LIMIT clause to narrow the result set.
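
A minimal sketch of the row cap described in this reply. Only fetchmany(max_rows + 1), the 10,000 default, and the ValueError come from the comment; the helper name fetch_capped, the in-memory database, and the metrics columns are made up for the illustration and are not Trackio's actual code.

```python
import sqlite3


def fetch_capped(conn: sqlite3.Connection, query: str, max_rows: int = 10_000) -> list[dict]:
    """Run `query` and return at most `max_rows` rows as dicts (hypothetical helper)."""
    cursor = conn.cursor()
    cursor.execute(query)
    columns = [col[0] for col in (cursor.description or [])]
    # Ask for one row more than the cap: if it arrives, the result is too large.
    fetched = cursor.fetchmany(max_rows + 1)
    if len(fetched) > max_rows:
        raise ValueError(f"Query returned more than {max_rows} rows; add a LIMIT clause.")
    return [dict(zip(columns, row)) for row in fetched]


# Toy data standing in for a project database (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (step INTEGER, loss REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)", [(i, 1.0 / (i + 1)) for i in range(5)]
)

rows = fetch_capped(conn, "SELECT step, loss FROM metrics ORDER BY step", max_rows=5)

try:
    fetch_capped(conn, "SELECT * FROM metrics", max_rows=3)
    too_many = False
except ValueError:
    too_many = True
```

Requesting max_rows + 1 rows means the query runs once and at most one extra row is read before the error is raised, rather than counting the full result first.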

Comment thread trackio/sqlite_storage.py
Comment on lines +1242 to +1248
return {
    "project": project,
    "query": normalized_query,
    "columns": columns,
    "rows": rows,
    "row_count": len(rows),
}

Copilot AI Apr 16, 2026

The query result may include non-JSON-serializable values (e.g., SELECT randomblob(4) returns bytes). Since the CLI --json path uses json.dumps without a default=..., this will raise TypeError at runtime. Consider normalizing result values before returning (e.g., bytes -> base64/hex string) so JSON output is robust.

Member Author

Agreed — added _normalize_query_value that converts bytes/bytearray/memoryview to hex strings so json.dumps works on the CLI --json output.
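
The normalization described here can be sketched as below. The function normalize_value is a hypothetical stand-in for _normalize_query_value, whose real body is not shown in this thread; the query and column aliases are also invented for the example.

```python
import json
import sqlite3
from typing import Any


def normalize_value(value: Any) -> Any:
    """Convert binary values to hex strings; leave everything else unchanged."""
    if isinstance(value, (bytes, bytearray, memoryview)):
        return bytes(value).hex()
    return value


conn = sqlite3.connect(":memory:")
cursor = conn.execute("SELECT x'deadbeef' AS data, 1 AS n, 'ok' AS s")
columns = [col[0] for col in cursor.description]
row = cursor.fetchone()

# Without normalization, json.dumps would raise TypeError on the bytes value.
record = {col: normalize_value(val) for col, val in zip(columns, row)}
payload = json.dumps(record)
```

Hex keeps the output readable and round-trippable with bytes.fromhex; base64, which the review also mentions, would be more compact for large blobs.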

Comment thread trackio/sqlite_storage.py Outdated
Comment on lines +1220 to +1228
def query_project(project: str, query: str) -> dict[str, Any]:
    SQLiteStorage._ensure_hub_loaded()
    db_path = SQLiteStorage.get_project_db_path(project)
    if not db_path.exists():
        raise FileNotFoundError(f"Project '{project}' not found.")

    normalized_query = SQLiteStorage._validate_read_only_query(query)
    with SQLiteStorage._get_connection(db_path) as conn:
        conn.set_authorizer(SQLiteStorage._query_authorizer)

Copilot AI Apr 16, 2026

query_project is newly introduced behavior but there are no corresponding unit tests in tests/unit/test_sqlite_storage.py. Adding tests for (1) allowed SELECT/WITH queries, (2) denied write attempts (e.g., INSERT/UPDATE), and (3) safe PRAGMA allowlist behavior would prevent regressions and verify the authorizer is doing what the docs promise.

Member Author

Added tests in tests/unit/test_sqlite_storage.py covering allowed SELECT/WITH/safe PRAGMA, denied writes (INSERT/UPDATE/DELETE/DROP and unsafe PRAGMA), the new row limit, bytes normalization, and missing-project handling.
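
The allow/deny checks listed above could look roughly like this. The authorizer is a hypothetical stand-in written for the example, not Trackio's _query_authorizer, and the PRAGMA allowlist (table_info only) and the metrics table are assumptions; only the standard-library sqlite3 calls are real.

```python
import sqlite3

# Action codes a pure read needs; everything else is denied.
READ_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}
SAFE_PRAGMAS = {"table_info"}  # assumed allowlist, for illustration only


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Allow reads and allowlisted PRAGMAs; deny every other action."""
    if action in READ_ACTIONS:
        return sqlite3.SQLITE_OK
    # For SQLITE_PRAGMA, arg1 carries the pragma name.
    if action == sqlite3.SQLITE_PRAGMA and (arg1 or "").lower() in SAFE_PRAGMAS:
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


def is_allowed(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the statement runs, False if SQLite rejects it."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.DatabaseError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (step INTEGER, loss REAL)")
conn.execute("INSERT INTO metrics VALUES (0, 1.5)")
conn.commit()
conn.set_authorizer(read_only_authorizer)

allowed = {
    "select": is_allowed(conn, "SELECT step, loss FROM metrics"),
    "with": is_allowed(conn, "WITH m AS (SELECT * FROM metrics) SELECT count(*) FROM m"),
    "safe_pragma": is_allowed(conn, "PRAGMA table_info(metrics)"),
}
denied = {
    "insert": is_allowed(conn, "INSERT INTO metrics VALUES (1, 0.5)"),
    "update": is_allowed(conn, "UPDATE metrics SET loss = 0"),
    "delete": is_allowed(conn, "DELETE FROM metrics"),
    "drop": is_allowed(conn, "DROP TABLE metrics"),
    "unsafe_pragma": is_allowed(conn, "PRAGMA journal_mode = WAL"),
}
remaining = conn.execute("SELECT count(*) FROM metrics").fetchone()[0]
```

Because the authorizer is consulted while a statement is compiled, a denied write fails before it touches any data, which is why the row count is unchanged afterwards.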

Comment thread trackio/server.py Outdated
Comment on lines +739 to +742
server.api(fn=get_system_logs, name="get_system_logs")
server.api(fn=get_snapshot, name="get_snapshot")
server.api(fn=get_logs, name="get_logs")
server.api(fn=query_project, name="query_project")

Copilot AI Apr 16, 2026

Registering query_project as a public server API enables arbitrary read-only SQL execution against the live project DB on a remote Space. Even with the authorizer, this expands the attack surface (large result sets, expensive queries/functions, and access to internal tables like pending_uploads that can contain absolute local file paths). Consider gating this API behind the same auth used for run mutations, and/or restricting accessible tables/adding server-side limits to mitigate data-leak and DoS risk.

Member Author

Partially disagree. All the other read endpoints on this server (get_logs, get_metrics_for_run, get_runs_for_project, etc.) are exposed under the same unauthenticated policy — query_project doesn't newly expand the read surface, it just gives a generic SQL handle to the same data. Gating only this one behind the write-token would be inconsistent and would break trackio query against public Spaces, which is the feature's point.

The DoS angle is a fair concern, and that's now addressed with the server-side max_rows cap. pending_uploads is worth flagging separately — happy to scope the authorizer to user-facing tables in a follow-up if we decide the local-path leak is a real risk in practice.
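
One way the follow-up mentioned here might be approached is to have the authorizer check the table name on every column read. This is a sketch under assumptions, not a proposal for the actual implementation: the USER_TABLES allowlist, the table schemas, and the file_path column are invented for the example.

```python
import sqlite3

# Hypothetical allowlist; the real set of user-facing tables is not given in this thread.
USER_TABLES = {"metrics"}


def scoped_authorizer(action, arg1, arg2, db_name, source):
    """Allow SELECTs, but only column reads from allowlisted tables."""
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_FUNCTION):
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_READ:
        # For SQLITE_READ, arg1 is the table name and arg2 the column name.
        return sqlite3.SQLITE_OK if arg1 in USER_TABLES else sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_DENY


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (step INTEGER, loss REAL)")
conn.execute("CREATE TABLE pending_uploads (file_path TEXT)")
conn.execute("INSERT INTO metrics VALUES (0, 1.5)")
conn.execute("INSERT INTO pending_uploads VALUES ('/home/user/run/file.bin')")
conn.commit()
conn.set_authorizer(scoped_authorizer)

metrics_rows = conn.execute("SELECT step, loss FROM metrics").fetchall()

try:
    conn.execute("SELECT file_path FROM pending_uploads").fetchall()
    internal_readable = True
except sqlite3.DatabaseError:
    internal_readable = False
```

An allowlist fails closed: a table added later stays hidden until it is explicitly listed, at the cost of also blocking schema tables such as sqlite_master unless they are added.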

Comment thread .changeset/cyan-forks-hang.md Outdated
"trackio": minor
---

feat:Add docs on SQL & Parquet schema / format, as well as a new CLI command: `trackio query --sql SQL_QUERY`

Copilot AI Apr 16, 2026

The changeset message says the new command is trackio query --sql SQL_QUERY, but the CLI implemented in this PR requires the project subcommand (trackio query project --project ... --sql ...). Please update the changeset text to match the actual CLI interface so release notes aren’t misleading.

Member Author

Fixed — updated the changeset to trackio query project --project PROJECT --sql SQL_QUERY.

- query_project: enforce max_rows cap via fetchmany(max_rows+1)
- normalize bytes/bytearray/memoryview to hex for JSON safety
- add unit tests for allow/deny/limit/bytes/missing-project
- update changeset to match actual CLI (trackio query project ...)
@abidlabs
Member Author

This is a new CLI addition that doesn't really touch existing code, so I'm going to merge it to unblock other PRs, such as #505.

@abidlabs abidlabs merged commit 3b397df into main Apr 17, 2026
9 checks passed

4 participants