Conversation
Avoid duplicate column names when flattening JSON fields for parquet export and use concat to prevent DataFrame fragmentation during export. Made-with: Cursor
🦄 change detected. This Pull Request includes changes to the following packages.
🪼 branch checks and previews
Install Trackio from this PR (includes built frontend): `pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/363b7d90358d308c3dea7d03245949a1a5cfa8a7/trackio-0.21.2-py3-none-any.whl"`
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull request overview
Fixes Parquet export failures caused by duplicate columns when flattening JSON payloads (e.g., JSON keys colliding with structural table columns), and reduces DataFrame fragmentation in the flattening path.
Changes:
- Drop flattened JSON keys that would duplicate existing DataFrame columns before export.
- Use `pd.concat(..., axis=1)` to append expanded columns in one operation (avoids per-column assignment fragmentation).
- Add a Changesets entry to release/version the change.
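The first two bullets can be sketched as a small, self-contained function. This is a hypothetical stand-in for trackio's `_flatten_json_column()`; the real signature, column names, and surrounding logic may differ:

```python
import json

import pandas as pd


def flatten_json_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Hypothetical sketch of the flattening fix described above."""
    expanded = pd.json_normalize(df[col].map(json.loads).tolist())
    expanded.index = df.index
    # Drop flattened keys that collide with existing structural columns,
    # so e.g. a JSON key "run_name" cannot duplicate the table's run_name.
    expanded = expanded.loc[:, ~expanded.columns.isin(df.columns)]
    # Append all expanded columns in one concat instead of per-column
    # assignment, which fragments the DataFrame's internal blocks.
    return pd.concat([df, expanded], axis=1)
```

With unique columns guaranteed, a subsequent `to_parquet()` call no longer hits duplicate-column errors.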
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `trackio/sqlite_storage.py` | Updates `_flatten_json_column()` to filter duplicate keys and concatenate expanded columns efficiently before Parquet export. |
| `.changeset/funny-files-burn.md` | Adds a Changesets release note/version bump entry for the fix. |
```diff
-        df[c] = expanded[c]
-    return df
+    expanded = expanded.loc[:, ~expanded.columns.isin(df.columns)]
+    return pd.concat([df, expanded], axis=1)
```
The new duplicate-column drop behavior isn't covered by existing tests. Please add or extend a unit test that exercises `_flatten_json_column()` when the JSON payload contains a key that collides with an existing structural column (e.g., `run_name`), and assert that (1) the structural column is not overwritten and (2) the resulting DataFrame has unique columns so that `to_parquet()` succeeds.
```diff
-    return pd.concat([df, expanded], axis=1)
+    flattened = pd.concat([df, expanded], axis=1)
+    return flattened.loc[:, ~flattened.columns.duplicated()]
```
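The suggestion deduplicates after the concat rather than filtering before it. `Index.duplicated()` keeps the first occurrence by default, so the structural column, which comes first in the concat, survives either way. A quick illustration with invented values:

```python
import pandas as pd

df = pd.DataFrame([["run-a", 0.5]], columns=["run_name", "loss"])
extra = pd.DataFrame([["sneaky"]], columns=["run_name"])  # colliding key

# Concatenating produces duplicate labels: run_name, loss, run_name.
merged = pd.concat([df, extra], axis=1)

# duplicated() marks later occurrences, so the mask keeps the first one.
deduped = merged.loc[:, ~merged.columns.duplicated()]
print(list(deduped.columns))           # ['run_name', 'loss']
print(deduped["run_name"].iloc[0])     # 'run-a'
```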
```md
---
"trackio": minor
---

feat:Fix duplicate columns in parquet export
```
This changeset reads like a bug fix (the PR title/description says "Fix duplicate columns …"), but it's marked as a `minor` release and categorized as `feat:`. That will both bump the version more than necessary and place the entry under "Features" in the changelog (see the `^(feat|fix|highlight)` parsing in `.changeset/changeset.cjs`). Consider changing the bump to `patch` and the summary prefix to `fix:` (with a space after the colon).
znation left a comment:
Looks good, thanks for fixing!
Using `pd.concat(..., axis=1)` to avoid DataFrame fragmentation in the flattening path also silences the pandas `PerformanceWarning` about a highly fragmented DataFrame that otherwise appears in the terminal.
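A minimal sketch contrasting the two append styles (column names and counts invented). Per-column insertion is what can trigger pandas' fragmentation `PerformanceWarning` once enough blocks accumulate, while a single concat builds the result in one operation:

```python
import warnings

import pandas as pd
from pandas.errors import PerformanceWarning

base = pd.DataFrame({"step": range(8)})
expanded = pd.DataFrame({f"metric_{i}": range(8) for i in range(200)})

# Per-column assignment inserts one block at a time, fragmenting the frame;
# pandas may warn once the frame becomes highly fragmented.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    frag = base.copy()
    for c in expanded.columns:
        frag[c] = expanded[c]
per_column_warned = any(
    issubclass(w.category, PerformanceWarning) for w in caught
)

# A single concat appends all columns at once and stays warning-free.
merged = pd.concat([base, expanded], axis=1)

# Both approaches produce the same data; only the block layout differs.
assert frag.equals(merged)
```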