Fix duplicate columns in parquet export #484

Merged: abidlabs merged 3 commits into main from flatten on Apr 10, 2026

Conversation

abidlabs commented Apr 10, 2026

Use pd.concat(..., axis=1) to avoid DataFrame fragmentation in the flattening path. This avoids the following warning in the terminal:

pandas PerformanceWarning: the DataFrame was highly fragmented...
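
For illustration, here is a minimal sketch of the two ways to append many columns. The helper names and the sample data are hypothetical and are not taken from trackio. Assigning columns one at a time in a loop is the pattern pandas may flag as fragmented, while building the extra columns first and calling pd.concat once avoids it.

```python
import warnings

import numpy as np
import pandas as pd


def add_columns_one_by_one(df: pd.DataFrame, new_cols: dict) -> pd.DataFrame:
    """Append columns by repeated assignment (the pattern pandas may warn about)."""
    out = df.copy()
    for name, values in new_cols.items():
        out[name] = values  # each assignment inserts one more column
    return out


def add_columns_with_concat(df: pd.DataFrame, new_cols: dict) -> pd.DataFrame:
    """Build all new columns first, then join them in a single concat."""
    extra = pd.DataFrame(new_cols, index=df.index)
    return pd.concat([df, extra], axis=1)


base = pd.DataFrame({"step": range(3)})
new_cols = {f"metric_{i}": np.arange(3) * i for i in range(200)}

# Capture any warnings so the comparison below runs quietly either way.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    slow = add_columns_one_by_one(base, new_cols)

fast = add_columns_with_concat(base, new_cols)
```

Both helpers produce the same frame; only the way the columns are attached differs.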

abidlabs and others added 2 commits April 10, 2026 11:23
Avoid duplicate column names when flattening JSON fields for parquet export and use concat to prevent DataFrame fragmentation during export.

Made-with: Cursor

gradio-pr-bot commented Apr 10, 2026

🦄 change detected

This Pull Request includes changes to the following packages.

| Package | Version |
| --- | --- |
| trackio | patch |

  • Fix duplicate columns in parquet export

‼️ Changeset not approved. Ensure the version bump is appropriate for all packages before approving.

  • Maintainers can approve the changeset by checking this checkbox.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

gradio-pr-bot commented Apr 10, 2026

🪼 branch checks and previews

Name Status URL
🦄 Changes detected! Details

HuggingFaceDocBuilderDev commented Apr 10, 2026

🪼 branch checks and previews

Name Status URL
Spaces ready! Spaces preview

Install Trackio from this PR (includes built frontend)

pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/363b7d90358d308c3dea7d03245949a1a5cfa8a7/trackio-0.21.2-py3-none-any.whl"

HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copilot AI left a comment

Pull request overview

Fixes Parquet export failures caused by duplicate columns when flattening JSON payloads (e.g., JSON keys colliding with structural table columns), and reduces DataFrame fragmentation in the flattening path.

Changes:

  • Drop flattened JSON keys that would duplicate existing DataFrame columns before export.
  • Use pd.concat(..., axis=1) to append expanded columns in one operation (avoids per-column assignment fragmentation).
  • Add a Changesets entry to release/version the change.
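
As a rough sketch of the first two changes, assuming a JSON string column named "metrics" and a stand-in helper named flatten_json_column_sketch (both hypothetical; the actual _flatten_json_column() in trackio/sqlite_storage.py may parse and name things differently):

```python
import json

import pandas as pd


def flatten_json_column_sketch(df: pd.DataFrame, col: str = "metrics") -> pd.DataFrame:
    """Hypothetical stand-in: expand a JSON string column into flat columns."""
    # Parse every JSON string into a dict, then build one column per key.
    parsed = df[col].apply(json.loads)
    expanded = pd.DataFrame(parsed.tolist(), index=df.index)
    # Drop flattened keys that would duplicate an existing column.
    expanded = expanded.loc[:, ~expanded.columns.isin(df.columns)]
    # Attach all remaining columns in one operation.
    return pd.concat([df, expanded], axis=1)


df = pd.DataFrame(
    {
        "run_name": ["run-a", "run-b"],
        "step": [0, 1],
        "metrics": [
            '{"loss": 0.5, "run_name": "oops"}',
            '{"loss": 0.4, "acc": 0.9}',
        ],
    }
)
flat = flatten_json_column_sketch(df)
```

In this sketch the JSON key "run_name" is discarded because a structural column with that name already exists, so the result has unique column names.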

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| trackio/sqlite_storage.py | Updates _flatten_json_column() to filter duplicate keys and concatenate expanded columns efficiently before Parquet export. |
| .changeset/funny-files-burn.md | Adds a Changesets release note/version bump entry for the fix. |

Comment thread: trackio/sqlite_storage.py
- df[c] = expanded[c]
- return df
+ expanded = expanded.loc[:, ~expanded.columns.isin(df.columns)]
+ return pd.concat([df, expanded], axis=1)

Copilot AI Apr 10, 2026

The new duplicate-column drop behavior isn’t covered by existing tests. Please add/extend a unit test to exercise _flatten_json_column() when the JSON payload contains a key that collides with an existing structural column (e.g., run_name), and assert that (1) the structural column is not overwritten and (2) the resulting DataFrame has unique columns so to_parquet() succeeds.

Suggested change
- return pd.concat([df, expanded], axis=1)
+ flattened = pd.concat([df, expanded], axis=1)
+ return flattened.loc[:, ~flattened.columns.duplicated()]
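
A test along the lines requested above could look like the following sketch. The concat_unique helper is a hypothetical stand-in that applies only the suggested de-duplication step; it is not the real _flatten_json_column(), and the to_parquet() call is left out so the example does not depend on a Parquet engine being installed.

```python
import pandas as pd


def concat_unique(df: pd.DataFrame, expanded: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: concat, then keep the first column of each name."""
    flattened = pd.concat([df, expanded], axis=1)
    # duplicated() marks every repeat after the first occurrence as True.
    return flattened.loc[:, ~flattened.columns.duplicated()]


def test_structural_column_survives_collision():
    df = pd.DataFrame({"run_name": ["run-a", "run-b"], "step": [0, 1]})
    # "run_name" collides with the structural column above.
    expanded = pd.DataFrame({"run_name": ["x", "y"], "loss": [0.5, 0.4]})

    result = concat_unique(df, expanded)

    # (1) The structural column is not overwritten.
    assert result["run_name"].tolist() == ["run-a", "run-b"]
    # (2) Column names are unique, which Parquet export requires.
    assert result.columns.is_unique
    assert list(result.columns) == ["run_name", "step", "loss"]


test_structural_column_survives_collision()
```

Because df comes first in the concat, keeping the first occurrence of each name preserves the structural columns and drops the colliding JSON keys.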

Comment on lines +1 to +5 of .changeset/funny-files-burn.md
---
"trackio": minor
---

feat:Fix duplicate columns in parquet export

Copilot AI Apr 10, 2026

This changeset reads like a bug fix (and the PR title/description says “Fix duplicate columns …”), but it’s marked as a minor release and categorized as feat:. That will both bump the version more than necessary and place the entry under “Features” in the changelog (see .changeset/changeset.cjs parsing ^(feat|fix|highlight)). Consider changing the bump to patch and the summary prefix to fix: (with a space after the colon).
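
To see why the prefix matters, here is a small Python sketch of prefix matching with the pattern quoted above. The real parsing lives in .changeset/changeset.cjs (JavaScript) and may do more than this; the classify helper and the corrected summary string are assumptions for illustration only.

```python
import re
from typing import Optional

# Pattern quoted in the review comment; everything else here is hypothetical.
PREFIX = re.compile(r"^(feat|fix|highlight)")


def classify(summary: str) -> Optional[str]:
    """Return the changelog category implied by the summary prefix, if any."""
    match = PREFIX.match(summary)
    return match.group(1) if match else None


current = classify("feat:Fix duplicate columns in parquet export")
proposed = classify("fix: Fix duplicate columns in parquet export")
```

With the current summary the entry is classified as feat; with the proposed prefix it is classified as fix.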

abidlabs merged commit cc05ada into main on Apr 10, 2026
8 of 9 checks passed

znation left a comment

Looks good, thanks for fixing!

5 participants