pkg/util/schemacmp: allow utf8/latin1 < utf8mb4 charset/collation join#67778

Open
lance6716 wants to merge 10 commits into pingcap:master from lance6716:lance6716/schemacmp-charset-collation-join

Conversation

@lance6716
Contributor

@lance6716 lance6716 commented Apr 15, 2026

What problem does this PR solve?

Issue Number: close #67777

Problem Summary:

pkg/util/schemacmp previously treated charset / collation as strict singletons, which makes schema compare/join fail for common upgrade paths like utf8 (utf8mb3) -> utf8mb4 or latin1 -> utf8mb4.

What changed and how does it work?

  • Add Charset and Collation lattices with ordering:
    • utf8 (utf8mb3) < utf8mb4
    • latin1 < utf8mb4
    • for collations, only allow joining when the suffix matches (e.g. utf8_bin < utf8mb4_bin).
  • Use them when encoding types.FieldType and model.TableInfo in pkg/util/schemacmp.
  • Extend unit tests to cover the new join behavior.
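
The ordering listed above is a partial order: utf8 and latin1 each sit below utf8mb4, but they are not ordered relative to each other. A minimal sketch of such a comparison follows, assuming plain strings instead of the PR's lattice types; `compareCharset` and `errIncomparable` are hypothetical names for illustration and are not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errIncomparable is a hypothetical sentinel for this sketch; the PR reports
// incompatibility through its own error constructors.
var errIncomparable = errors.New("incomparable charsets")

// compareCharset returns -1, 0, or 1 when a and b are ordered, and an error
// when they are not. Only the ordering stated in the PR description is
// modelled: utf8 < utf8mb4 and latin1 < utf8mb4.
func compareCharset(a, b string) (int, error) {
	switch {
	case a == b:
		return 0, nil
	case b == "utf8mb4" && (a == "utf8" || a == "latin1"):
		return -1, nil
	case a == "utf8mb4" && (b == "utf8" || b == "latin1"):
		return 1, nil
	default:
		return 0, errIncomparable
	}
}

func main() {
	pairs := [][2]string{
		{"utf8", "utf8mb4"},
		{"utf8mb4", "latin1"},
		{"utf8", "latin1"}, // neither side is below the other
	}
	for _, p := range pairs {
		cmp, err := compareCharset(p[0], p[1])
		fmt.Println(p[0], p[1], cmp, err)
	}
}
```

The last pair returns an error rather than a number, which is the case the later review discussion about `Join()` revolves around.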

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • New Features

    • Schema comparison now normalizes charsets and collations, with explicit compatibility rules and deterministic join results across compatible families.
  • Bug Fixes

    • Restoration and encoding now consistently derive and emit CHARSET/COLLATE from stored collation, improving correctness for mixed charset/collation schemas.
    • New, clearer incompatibility errors for charset vs collation mismatch cases.
  • Tests

    • Added comprehensive tests for normalization, comparison ordering, join outcomes, and error messages.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-triage-completed and release-note-none (denotes a PR that doesn't merit a release note) labels Apr 15, 2026
@pantheon-ai

pantheon-ai bot commented Apr 15, 2026

@lance6716 I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot

ti-chi-bot bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zhaoxinyu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Apr 15, 2026
@tiprow

tiprow bot commented Apr 15, 2026

Hi @lance6716. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai bot commented Apr 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds charset and collation lattice types and tests; removes explicit charset singletons from schema lattices in favor of deriving charset from stored collations; introduces incompatibility error constructors; updates encoding/decoding, tests, and Bazel build files.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Charset & Collation Lattices: pkg/util/schemacmp/charset_collation.go, pkg/util/schemacmp/charset_collation_test.go | Add charsetLattice and collationLattice, Charset()/Collation() constructors, implement Compare, Join, Unwrap; tests for normalization, ordering, join behavior, and error messages. |
| Lattice Errors: pkg/util/schemacmp/lattice.go | Add ErrMsgIncompatibleCharset and ErrMsgIncompatibleCollation constants and helper constructors incompatibleCharsetError / incompatibleCollationError. |
| Schema Encoding/Decoding Integration: pkg/util/schemacmp/table.go, pkg/util/schemacmp/type.go | Remove charset singleton from lattice tuples; store collations via Collation(...); reconstruct charset from collate when decoding; adjust table encoding/restore to always emit CHARSET derived from COLLATE. |
| Tests Updated: pkg/util/schemacmp/table_test.go, pkg/util/schemacmp/type_test.go | Switch test context to metabuild, add FieldType fixtures and compare/join cases (including utf8/latin1 → utf8mb4 scenarios), and update expected DDL/assertions to deterministic strings. |
| Build Files: pkg/util/schemacmp/BUILD.bazel, pkg/importsdk/BUILD.bazel | Include charset_collation.go and its tests in schemacmp targets; add //pkg/meta/metabuild to test deps, remove some sessionctx/mock deps; add //pkg/parser/ast to importsdk_test deps. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

size/XXL, ok-to-test, approved, lgtm

Suggested reviewers

  • GMHDBJD
  • yudongusa

Poem

"I nibble lattices, soft and neat,
Charsets shuffle, collations meet,
utf8mb4 hops to lead the way,
Joins resolve and make my day,
Hoppy tests — schema complete! 🐰"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: introducing charset/collation lattices that allow utf8/latin1 to join with utf8mb4. |
| Description check | ✅ Passed | The PR description follows the template structure with issue number, problem summary, detailed explanation of changes, checked unit test box, and release note section. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #67777 objectives: implements Charset/Collation lattices with utf8/latin1 < utf8mb4 ordering, applies them to FieldType/TableInfo encoding, and includes comprehensive unit tests. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the stated objectives: lattice implementations, integration into encoding/decoding logic, BUILD.bazel dependency updates, and comprehensive test coverage with no unrelated modifications. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed


Comment @coderabbitai help to get the list of available commands and usage tips.

@lance6716
Contributor Author

/check-issue-triage-complete

@kennytm
Contributor

kennytm commented Apr 15, 2026

(legit failure in idc-jenkins-ci-tidb/build)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/util/schemacmp/charset_collation.go`:
- Around line 72-75: The comparison treats unknown charset values in
charsetFamilyOther as case-sensitive (a.value != b.value) causing GBK vs gbk to
be considered different; update the equality check inside the branch that
handles charsetFamilyOther to use a case-insensitive comparison (e.g., normalize
both values or use strings.EqualFold) instead of direct !=, and add the
suggested regression test in pkg/util/schemacmp/charset_collation_test.go
asserting Charset("GBK").Compare(Charset("gbk")) returns no error and cmp == 0
to prevent future regressions.
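
The case-insensitive check suggested above can be sketched with `strings.EqualFold` from the Go standard library. This is only an illustration under the assumption that charset names are plain strings; `sameCharsetName` is a hypothetical helper and not a function in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// sameCharsetName reports whether two charset names are equal when letter
// case is ignored. Hypothetical helper for illustration only.
func sameCharsetName(a, b string) bool {
	return strings.EqualFold(a, b)
}

func main() {
	fmt.Println(sameCharsetName("GBK", "gbk"))  // true
	fmt.Println(sameCharsetName("gbk", "big5")) // false
}
```

Normalizing both names to lower case once at construction time would be an alternative design; which of the two the PR ends up using is not shown in this conversation.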
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 40f5d86c-81b2-4038-afc6-f0e66be34cf0

📥 Commits

Reviewing files that changed from the base of the PR and between 212ded5 and 350bc04.

📒 Files selected for processing (9)
  • pkg/importsdk/BUILD.bazel
  • pkg/util/schemacmp/BUILD.bazel
  • pkg/util/schemacmp/charset_collation.go
  • pkg/util/schemacmp/charset_collation_test.go
  • pkg/util/schemacmp/lattice.go
  • pkg/util/schemacmp/table.go
  • pkg/util/schemacmp/table_test.go
  • pkg/util/schemacmp/type.go
  • pkg/util/schemacmp/type_test.go

Comment thread pkg/util/schemacmp/charset_collation.go Outdated
@codecov

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 85.08772% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.3623%. Comparing base (212ded5) to head (c3bffac).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67778        +/-   ##
================================================
- Coverage   77.6075%   77.3623%   -0.2453%     
================================================
  Files          1981       1967        -14     
  Lines        548811     550879      +2068     
================================================
+ Hits         425919     426173       +254     
- Misses       122082     124703      +2621     
+ Partials        810          3       -807     
| Flag | Coverage Δ | |
|---|---|---|
| integration | 40.8657% <ø> (+6.5260%) | ⬆️ |
| unit | 76.6529% <85.0877%> (+0.3103%) | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ | |
|---|---|---|
| dumpling | 61.5065% <ø> (ø) | |
| parser | ∅ <ø> (∅) | |
| br | 49.4060% <ø> (-11.0132%) | ⬇️ |

Comment thread pkg/util/schemacmp/charset_collation.go
Comment thread pkg/util/schemacmp/charset_collation.go Outdated
Comment thread pkg/util/schemacmp/table.go
Comment thread pkg/util/schemacmp/type.go
Signed-off-by: lance6716 <lance6716@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/util/schemacmp/charset_collation.go`:
- Around line 63-76: The Join method for charsetLattice (and the analogous
collation lattice at 131-144) must stop treating a Compare() error as a hard
failure and instead, when Compare reports incomparable, compute the least upper
bound by scanning known lattice members: implement Join to first try cmp :=
a.Compare(b) and return a or b when comparable, but if Compare returns an
incomparable error, iterate over a canonical list of all charset lattice members
(use or add a helper like allCharsets()) and pick the minimal candidate c such
that c.Compare(a) >= 0 and c.Compare(b) >= 0 (use Compare to test), returning
the smallest matching candidate; if none found return the original error. Apply
the same pattern for collationLattice, and add regression tests asserting
Charset("utf8").Join(Charset("latin1")) succeeds and
Collation("utf8_bin").Join(Collation("latin1_bin")) succeeds.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c7909230-f861-42c9-913e-bdd80be2c9c2

📥 Commits

Reviewing files that changed from the base of the PR and between 350bc04 and 650f8b3.

📒 Files selected for processing (5)
  • pkg/util/schemacmp/charset_collation.go
  • pkg/util/schemacmp/charset_collation_test.go
  • pkg/util/schemacmp/table.go
  • pkg/util/schemacmp/table_test.go
  • pkg/util/schemacmp/type.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/util/schemacmp/charset_collation_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/util/schemacmp/table.go
  • pkg/util/schemacmp/table_test.go

Comment thread pkg/util/schemacmp/charset_collation.go Outdated
Comment on lines +63 to +76
func (a charsetLattice) Join(other Lattice) (Lattice, error) {
	b, ok := other.(charsetLattice)
	if !ok {
		return nil, typeMismatchError(a, other)
	}

	cmp, err := a.Compare(b)
	if err != nil {
		return nil, err
	}
	if cmp >= 0 {
		return a, nil
	}
	return b, nil

@coderabbitai coderabbitai bot Apr 15, 2026

⚠️ Potential issue | 🟠 Major

Handle incomparable-but-joinable pairs in Join().

Both Join() implementations currently just call Compare(), so utf8 and latin1 still fail to join even though the new ordering makes both of them lower than utf8mb4. That breaks the lattice contract and makes multi-way joins order-dependent, e.g. (utf8 JOIN latin1) JOIN utf8mb4 fails at the first step even though the least upper bound exists.

🔧 Proposed fix
 func (a charsetLattice) Join(other Lattice) (Lattice, error) {
 	b, ok := other.(charsetLattice)
 	if !ok {
 		return nil, typeMismatchError(a, other)
 	}
 
-	cmp, err := a.Compare(b)
-	if err != nil {
-		return nil, err
-	}
-	if cmp >= 0 {
+	switch {
+	case a.value == b.value:
 		return a, nil
+	case a.value == tidbcharset.CharsetUTF8MB4 &&
+		(b.value == tidbcharset.CharsetUTF8 || b.value == tidbcharset.CharsetLatin1):
+		return a, nil
+	case b.value == tidbcharset.CharsetUTF8MB4 &&
+		(a.value == tidbcharset.CharsetUTF8 || a.value == tidbcharset.CharsetLatin1):
+		return b, nil
+	case (a.value == tidbcharset.CharsetUTF8 && b.value == tidbcharset.CharsetLatin1) ||
+		(a.value == tidbcharset.CharsetLatin1 && b.value == tidbcharset.CharsetUTF8):
+		return Charset(tidbcharset.CharsetUTF8MB4), nil
+	default:
+		return nil, incompatibleCharsetError(a.value, b.value)
 	}
-	return b, nil
 }
 
 func (a collationLattice) Join(other Lattice) (Lattice, error) {
 	b, ok := other.(collationLattice)
 	if !ok {
 		return nil, typeMismatchError(a, other)
 	}
 
-	cmp, err := a.Compare(b)
-	if err != nil {
-		return nil, err
+	if a.suffix != b.suffix {
+		return nil, incompatibleCollationError(a.unwrapString(), b.unwrapString())
 	}
-	if cmp >= 0 {
-		return a, nil
+
+	joinedCharset, err := a.charset.Join(b.charset)
+	if err != nil {
+		return nil, err
 	}
-	return b, nil
+	return collationLattice{
+		charset: joinedCharset.(charsetLattice),
+		suffix:  a.suffix,
+	}, nil
 }

Please also add regression coverage for Charset("utf8").Join(Charset("latin1")) and Collation("utf8_bin").Join(Collation("latin1_bin")).

Also applies to: 131-144
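
The order-dependence described in this comment can be illustrated with a small self-contained sketch. It assumes charsets are plain strings and models only the three charsets named in the PR; `joinCharset` and `joinAll` are hypothetical names, and the charset error text is an assumption modelled on the collation message quoted elsewhere in this review.

```go
package main

import "fmt"

// inFamily reports whether cs is one of the three charsets the PR orders.
func inFamily(cs string) bool {
	return cs == "utf8" || cs == "latin1" || cs == "utf8mb4"
}

// joinCharset returns the least upper bound of a and b under the ordering
// utf8 < utf8mb4 and latin1 < utf8mb4, or an error if none exists.
func joinCharset(a, b string) (string, error) {
	switch {
	case a == b:
		return a, nil
	case inFamily(a) && inFamily(b):
		// Any two distinct members of {utf8, latin1, utf8mb4} meet at utf8mb4.
		return "utf8mb4", nil
	default:
		// Assumed message format, not confirmed by the PR.
		return "", fmt.Errorf("incompatible charset (%s vs %s)", a, b)
	}
}

// joinAll folds joinCharset over its inputs from left to right.
// It assumes at least one input.
func joinAll(charsets ...string) (string, error) {
	acc := charsets[0]
	for _, c := range charsets[1:] {
		var err error
		if acc, err = joinCharset(acc, c); err != nil {
			return "", err
		}
	}
	return acc, nil
}

func main() {
	fmt.Println(joinAll("utf8", "latin1", "utf8mb4"))
	fmt.Println(joinAll("utf8mb4", "utf8", "latin1"))
}
```

Because `joinCharset` returns an upper bound even for the incomparable pair, both orderings in `main` reach utf8mb4, whereas a `Join()` that only forwards to `Compare()` would fail on the first ordering.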

Contributor Author

Oh, very complicated 😂 not sure we should support it

@lance6716 Totally fair — the utf8 JOIN latin1 case is genuinely unusual as an upgrade path and isn't part of the scope in issue #67777. The two intended paths (utf8 → utf8mb4 and latin1 → utf8mb4) work correctly with the current Join() implementation since those are directly comparable via Compare().

The edge case only matters if a DM task ever encounters a table where one side has utf8 and the other has latin1 (with neither being utf8mb4), which seems unlikely in practice. If that scenario comes up later it can be addressed separately. Happy to drop the suggestion.


🧠 Learnings used
Learnt from: guo-shaoge
Repo: pingcap/tidb PR: 66349
File: tests/integrationtest/r/tpch.result:447-447
Timestamp: 2026-03-11T03:48:51.420Z
Learning: In `pkg/planner/core/joinorder/` (TiDB join reorder), cartesian edges should only be considered in two specific scenarios:
1. After the entire join reorder is complete and there are still disjoint join groups that need to be connected.
2. When a `leading` hint explicitly requires a cartesian join.
PR `#66213` attempted to fix case 2 but accidentally enabled case 1 as well. The correct logic ensures cartesian edges are gated to only these two situations.

Learnt from: Reminiscent
Repo: pingcap/tidb PR: 65367
File: pkg/planner/core/rule_join_reorder_dp.go:336-364
Timestamp: 2026-04-15T02:49:15.099Z
Learning: In `pkg/planner/core/rule_join_reorder_dp.go` (pingcap/tidb PR `#65367`), `findNodeIndexForColumns` intentionally returns `plannererrors.ErrInternal` when substituted equality-edge columns span multiple join-group leaves. This error is NOT user-visible: `optimizeRecursive` in `rule_join_reorder.go` wraps the DP call with a `fallbackOnErr` closure that, when `TiDBOptJoinReorderThroughProj` is enabled, catches any error from the DP solver and reruns join reorder with `TiDBOptJoinReorderThroughProj` disabled, treating the blocking projection as an atomic leaf. The hard-fail path is a deliberate retry sentinel, not a user-facing planner error. Do not flag this pattern as incorrect in future reviews.

Learnt from: guo-shaoge
Repo: pingcap/tidb PR: 67284
File: pkg/planner/core/joinorder/join_order.go:386-406
Timestamp: 2026-03-26T09:18:03.097Z
Learning: In `pkg/planner/core/joinorder/join_order.go` (pingcap/tidb), the DP join reorder `optimizeWithDetector` function has a sanity guard `nodeCount > 63` before allocating `bestPlan := make([]*Node, 1<<nodeCount)`. This guard is off-by-one: when `nodeCount == 63`, `1<<63` overflows a signed int on 64-bit Go causing a `makeslice` panic. The correct guard should be `nodeCount >= 63`. A follow-up to reduce the effective DP threshold below 63 is tracked in issue `#67325`.

Learnt from: cryo-zd
Repo: pingcap/tidb PR: 65532
File: pkg/planner/core/preprocess.go:1887-1898
Timestamp: 2026-03-12T08:41:44.509Z
Learning: In pkg/planner/core/preprocess.go (pingcap/tidb PR `#65532`), the `checkLockClauseTables` function intentionally accepts `FOR UPDATE OF <base_table>` even when the FROM item was introduced with an alias (`matched.alias.L != "" && !matchedByAlias`). Rather than returning an error (strict MySQL behavior), TiDB emits a warning to guide users toward the alias form. This backward-compatibility fallback is a deliberate acceptance criterion for the change — do not flag it as a MySQL-incompatibility bug in future reviews.

Learnt from: zimulala
Repo: pingcap/tidb PR: 67265
File: pkg/util/topsql/reporter/ru_datamodel_test.go:259-308
Timestamp: 2026-03-25T03:46:10.574Z
Learning: In `pkg/util/topsql/reporter/ru_datamodel_test.go` (pingcap/tidb PR `#67265`), `TestRUCollectingOthersWireLabelNoCollisionWithRuntimeUserShape` intentionally uses `"app127.0.0.1"` (not `othersUserWireLabel`) as the runtime user. The regression is scoped to runtime user shapes (`userhost` / empty string); broadening to arbitrary raw user strings matching the wire label is out of contract. The empty-user and merge-path regressions (`TestRUCollectingEmptyUserAndGlobalOthersRemainDistinct`, `TestRUCollectingMergeFromKeepsEmptyUserDistinctFromGlobalOthers`) plus aggregator-side assertion hardening collectively cover the fix.

Learnt from: joechenrh
Repo: pingcap/tidb PR: 66878
File: tests/realtikvtest/importintotest/import_into_test.go:253-267
Timestamp: 2026-03-11T06:26:14.065Z
Learning: In pingcap/tidb, PR `#66878` (cherry-pick of `#58401` to release-7.5) also bundles the follow-up nil-check fix from PR `#63146`. `TestOnUpdateColumn` in `tests/realtikvtest/importintotest/import_into_test.go` was cherry-picked from PR `#63146` and tests that `IMPORT INTO` on a table with an `ON UPDATE CURRENT_TIMESTAMP` column does NOT panic due to a nil `onDup` map in `resolveGeneratedColumns`. It is NOT intended to cover the transitive generated-column / index-inconsistency fix from `#58401`/`#58400`.

Learnt from: cryo-zd
Repo: pingcap/tidb PR: 65532
File: pkg/planner/core/preprocess.go:1906-1910
Timestamp: 2026-03-12T08:03:56.097Z
Learning: In pingcap/tidb (PR `#65532`), `buildSelect` in `pkg/planner/core/logical_plan_builder.go` derives the privilege-check DB for `FOR UPDATE OF` lock targets using `getLowerDB(tName.Schema, ...)`. For alias or unqualified lock targets whose `TableName.Schema` is empty after preprocessing, this falls back to `currentDB` and authorizes against the wrong database. The correct fix is to use `tNameW.DBInfo.Name.L` (with a nil guard) directly, since `tNameW` is already fully resolved by `checkLockClauseTables` in preprocess.go. This is consistent with how privilege checks for INSERT/DELETE use `tnW.DBInfo.Name.L` elsewhere in the codebase.

Learnt from: joechenrh
Repo: pingcap/tidb PR: 66878
File: pkg/planner/core/logical_plan_builder.go:6455-6472
Timestamp: 2026-03-11T06:29:00.122Z
Learning: Ensure code reviews verify the VirtualAssignmentsOffset semantics: the planner sets Update.VirtualAssignmentsOffset = len(update.List). The executor should only apply OrderedList[:VirtualAssignmentsOffset] when composing new rows, and only after the 'changed' check should it group/evaluate OrderedList[VirtualAssignmentsOffset:] per table. This pattern applies to files under pkg/planner/core and pkg/executor (e.g., common_plans.go and update.go). Reviewers should check that updates respect slicing behavior, that the offset is consistently derived from the planner, and that downstream code does not bypass the offset when creating new rows. Add tests validating both branches: the slice before the offset for new rows, and the per-table handling of the slice after the offset.

Contributor Author

@kennytm Should I follow this comment? Which means utf8 JOIN latin1 = utf8mb4. I think it's reasonable.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor

@lance6716 I think it is fine. The implication, though, is that if we have 2 shards and the user runs

alter table t1 add column c text collate latin1_bin;
alter table t2 add column c text collate utf8mb3_bin;

it will end up in downstream TiDB as (if I remember correctly how optimistic sharding works):

alter table t add column c text collate latin1_bin;
alter table t modify column c text collate utf8mb4_bin; # ← result of join
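
The join in this scenario can be sketched as follows. The sketch assumes collations are plain strings of the form charset_suffix, treats utf8mb3 as an alias of utf8 as the PR description does, and requires matching suffixes; `normalizeCharset`, `upgradable`, and `joinCollation` are hypothetical names rather than the PR's API.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeCharset maps the utf8mb3 alias to utf8, following the PR
// description's "utf8 (utf8mb3)" wording. Hypothetical helper.
func normalizeCharset(cs string) string {
	cs = strings.ToLower(cs)
	if cs == "utf8mb3" {
		return "utf8"
	}
	return cs
}

// upgradable reports whether cs is one of the charsets that sit at or
// below utf8mb4 in the ordering discussed in this PR.
func upgradable(cs string) bool {
	return cs == "utf8" || cs == "latin1" || cs == "utf8mb4"
}

// joinCollation joins two collations only when their suffixes match and
// their charsets have an upper bound. The result uses normalized names.
func joinCollation(a, b string) (string, error) {
	csA, sufA, _ := strings.Cut(a, "_")
	csB, sufB, _ := strings.Cut(b, "_")
	csA, csB = normalizeCharset(csA), normalizeCharset(csB)
	if sufA != sufB {
		return "", fmt.Errorf("incompatible collation (%s vs %s)", a, b)
	}
	joined := csA
	if csA != csB {
		if !upgradable(csA) || !upgradable(csB) {
			return "", fmt.Errorf("incompatible collation (%s vs %s)", a, b)
		}
		joined = "utf8mb4"
	}
	if sufA == "" {
		return joined, nil
	}
	return joined + "_" + sufA, nil
}

func main() {
	fmt.Println(joinCollation("latin1_bin", "utf8mb3_bin"))
	fmt.Println(joinCollation("utf8mb4_general_ci", "utf8mb4_0900_ai_ci"))
}
```

The first call corresponds to the two shard columns above and yields utf8mb4_bin; the second fails because the suffixes differ.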

Contributor Author

@lance6716 lance6716 Apr 16, 2026

My coding agent pushed a commit 67b337b for this comment but I haven't reviewed it 😂 I'll do a self review first

Comment thread pkg/util/schemacmp/charset_collation.go Outdated
Comment thread pkg/util/schemacmp/charset_collation_test.go
Comment thread pkg/util/schemacmp/lattice.go Outdated
Comment thread pkg/util/schemacmp/table.go Outdated
Comment thread pkg/util/schemacmp/table.go Outdated
Comment on lines +336 to +344
+	collate := table[tableInfoTupleIndexCollate].(string)
+	if collate != "" {
+		cs, _, hasSuffix := strings.Cut(collate, "_")
 		ctx.WriteKeyWord(" CHARSET ")
-		ctx.WriteKeyWord(charset)
-	}
-	if collate := table[tableInfoTupleIndexCollate].(string); collate != "" {
-		ctx.WriteKeyWord(" COLLATE ")
-		ctx.WriteKeyWord(collate)
+		ctx.WriteKeyWord(cs)
+		if hasSuffix {
+			ctx.WriteKeyWord(" COLLATE ")
+			ctx.WriteKeyWord(collate)
+		}
Contributor

ditto
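
The restore logic quoted in this thread derives the charset from the stored collation by cutting at the first underscore. A standalone sketch of that idea follows, using `strings.Builder` in place of the restore context; `tableOptions` is a hypothetical name, and unlike the real `WriteKeyWord` calls this sketch writes the names exactly as given.

```go
package main

import (
	"fmt"
	"strings"
)

// tableOptions renders the CHARSET/COLLATE part of a table definition from
// a stored collation name alone. An empty collation renders nothing.
func tableOptions(collate string) string {
	if collate == "" {
		return ""
	}
	var sb strings.Builder
	// The charset is everything before the first underscore; a collation
	// without an underscore (for example "binary") is treated as a charset.
	cs, _, hasSuffix := strings.Cut(collate, "_")
	sb.WriteString(" CHARSET ")
	sb.WriteString(cs)
	if hasSuffix {
		sb.WriteString(" COLLATE ")
		sb.WriteString(collate)
	}
	return sb.String()
}

func main() {
	fmt.Printf("%q\n", tableOptions("utf8mb4_bin"))
	fmt.Printf("%q\n", tableOptions("binary"))
	fmt.Printf("%q\n", tableOptions(""))
}
```

Storing only the collation keeps charset and collation from drifting apart, at the cost of relying on the charset_suffix naming convention.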

Signed-off-by: lance6716 <lance6716@gmail.com>
Signed-off-by: lance6716 <lance6716@gmail.com>

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
pkg/util/schemacmp/charset_collation.go (1)

63-76: ⚠️ Potential issue | 🟠 Major

Return the least upper bound for utf8/latin1 joins.

Both Join() implementations still treat utf8 vs latin1 as a hard incompatibility because they only forward to Compare(). That means Charset("utf8").Join(Charset("latin1")) and Collation("utf8_bin").Join(Collation("latin1_bin")) still fail even though the lattice defined above them has a valid LUB: utf8mb4 / utf8mb4_bin. In practice, table joins remain order-dependent for the mixed-shard upgrade path this PR is meant to unblock.

💡 Proposed fix
 func (a charsetLattice) Join(other Lattice) (Lattice, error) {
 	b, ok := other.(charsetLattice)
 	if !ok {
 		return nil, typeMismatchError(a, other)
 	}
 
-	cmp, err := a.Compare(b)
-	if err != nil {
-		return nil, err
-	}
-	if cmp >= 0 {
+	switch {
+	case a.value == b.value:
 		return a, nil
+	case a.value == tidbcharset.CharsetUTF8MB4 &&
+		(b.value == tidbcharset.CharsetUTF8 || b.value == tidbcharset.CharsetLatin1):
+		return a, nil
+	case b.value == tidbcharset.CharsetUTF8MB4 &&
+		(a.value == tidbcharset.CharsetUTF8 || a.value == tidbcharset.CharsetLatin1):
+		return b, nil
+	case (a.value == tidbcharset.CharsetUTF8 && b.value == tidbcharset.CharsetLatin1) ||
+		(a.value == tidbcharset.CharsetLatin1 && b.value == tidbcharset.CharsetUTF8):
+		return Charset(tidbcharset.CharsetUTF8MB4), nil
+	default:
+		return nil, incompatibleCharsetError(a.value, b.value)
 	}
-	return b, nil
 }
 
 func (a collationLattice) Join(other Lattice) (Lattice, error) {
 	b, ok := other.(collationLattice)
 	if !ok {
 		return nil, typeMismatchError(a, other)
 	}
 
-	cmp, err := a.Compare(b)
-	if err != nil {
-		return nil, err
+	if a.suffix != b.suffix {
+		return nil, incompatibleCollationError(a.unwrapString(), b.unwrapString())
 	}
-	if cmp >= 0 {
-		return a, nil
+
+	joinedCharset, err := a.charset.Join(b.charset)
+	if err != nil {
+		return nil, err
 	}
-	return b, nil
+	return collationLattice{
+		charset: joinedCharset.(charsetLattice),
+		suffix:  a.suffix,
+	}, nil
 }

Please also add a direct regression case for joining utf8 with latin1 and utf8_bin with latin1_bin.

Also applies to: 128-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/schemacmp/charset_collation.go` around lines 63 - 76, The Join
implementations for charsetLattice (and the analogous collation lattice)
currently treat utf8 vs latin1 as incompatible because they simply forward to
Compare; change Join in charsetLattice (and the corresponding collation join) to
detect the utf8↔latin1 pair and return the least upper bound (utf8mb4 for
charset, utf8mb4_bin for collation) instead of erroring: after type-asserting
other to charsetLattice and before returning errors from Compare, add logic to
recognize when one lattice value is Charset("utf8") and the other is
Charset("latin1") (and similarly for Collation("utf8_bin") vs
Collation("latin1_bin")) and return the appropriate LUB instance; update the
Compare-based fallback to remain for other cases. Also add regression tests that
call Charset("utf8").Join(Charset("latin1")) and
Collation("utf8_bin").Join(Collation("latin1_bin")) and assert the returned LUB
equals utf8mb4 / utf8mb4_bin.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a932c3a1-e67f-455d-a25c-102070788827

📥 Commits

Reviewing files that changed from the base of the PR and between 650f8b3 and 7f484b0.

📒 Files selected for processing (7)
  • pkg/util/schemacmp/BUILD.bazel
  • pkg/util/schemacmp/charset_collation.go
  • pkg/util/schemacmp/charset_collation_test.go
  • pkg/util/schemacmp/lattice.go
  • pkg/util/schemacmp/table.go
  • pkg/util/schemacmp/table_test.go
  • pkg/util/schemacmp/type_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/util/schemacmp/type_test.go
  • pkg/util/schemacmp/BUILD.bazel


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
pkg/util/schemacmp/charset_collation_test.go (1)

108-116: Add one negative Join case for suffix mismatch.

TestCollationJoin currently covers only successful joins. Since the contract requires matching suffixes, adding one failing-join assertion will harden regression coverage for the join path itself (not just compare).

Suggested test addition
 func TestCollationJoin(t *testing.T) {
 	join, err := Collation("utf8_bin").Join(Collation("latin1_bin"))
 	require.NoError(t, err)
 	require.Equal(t, "utf8mb4_bin", join.Unwrap())

 	join, err = Collation("latin1_bin").Join(Collation("utf8_bin"))
 	require.NoError(t, err)
 	require.Equal(t, "utf8mb4_bin", join.Unwrap())
+
+	_, err = Collation("utf8mb4_general_ci").Join(Collation("utf8mb4_0900_ai_ci"))
+	require.ErrorContains(t, err, "incompatible collation (utf8mb4_general_ci vs utf8mb4_0900_ai_ci)")
 }

Based on learnings: **/*.go : MUST add a regression test and verify it fails before fix and passes after fix for bug fixes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/schemacmp/charset_collation_test.go` around lines 108 - 116, Add a
negative test case to TestCollationJoin to assert that Collation.Join returns an
error when suffixes mismatch: call
Collation("utf8_general_ci").Join(Collation("utf8_bin")) (or another pair with
differing suffixes), require an error (require.Error/require.NotNil) and
optionally assert the error message contains the suffix-mismatch hint; this
targets the Collation.Join behavior (and the TestCollationJoin test) so the test
fails before the fix and passes after.
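
A rough sketch of the suffix-matching contract that such a negative test would exercise, assuming collation names split into a charset prefix and a suffix at the first underscore. `joinCharset` and `joinCollation` are hypothetical helpers; only the error wording follows the message quoted in the suggested test:

```go
package main

import (
	"fmt"
	"strings"
)

// joinCharset is a hypothetical helper: within the utf8/latin1/utf8mb4
// family, any two distinct charsets join to utf8mb4.
func joinCharset(a, b string) (string, error) {
	if a == b {
		return a, nil
	}
	inFamily := func(c string) bool {
		return c == "utf8" || c == "latin1" || c == "utf8mb4"
	}
	if inFamily(a) && inFamily(b) {
		return "utf8mb4", nil
	}
	return "", fmt.Errorf("incompatible charset (%s vs %s)", a, b)
}

// joinCollation joins two collation names only when their suffixes match,
// e.g. utf8_bin and latin1_bin share the suffix "bin".
func joinCollation(a, b string) (string, error) {
	if a == b {
		return a, nil
	}
	csA, sufA, okA := strings.Cut(a, "_")
	csB, sufB, okB := strings.Cut(b, "_")
	if !okA || !okB || sufA != sufB {
		return "", fmt.Errorf("incompatible collation (%s vs %s)", a, b)
	}
	cs, err := joinCharset(csA, csB)
	if err != nil {
		return "", fmt.Errorf("incompatible collation (%s vs %s): %w", a, b, err)
	}
	return cs + "_" + sufA, nil
}

func main() {
	// Positive case: same suffix, charsets share an upper bound.
	fmt.Println(joinCollation("utf8_bin", "latin1_bin"))
	// Negative case: same charset, different suffixes.
	fmt.Println(joinCollation("utf8mb4_general_ci", "utf8mb4_0900_ai_ci"))
}
```

Both negative pairs mentioned in this thread (`utf8_general_ci` vs `utf8_bin`, and `utf8mb4_general_ci` vs `utf8mb4_0900_ai_ci`) fail in this sketch for the same reason: the suffix check rejects them before any charset join is attempted.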

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 96243436-ea7b-488c-98bd-1ec62c37aa98

📥 Commits

Reviewing files that changed from the base of the PR and between 7f484b0 and 67b337b.

📒 Files selected for processing (3)
  • pkg/util/schemacmp/charset_collation.go
  • pkg/util/schemacmp/charset_collation_test.go
  • pkg/util/schemacmp/type_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/util/schemacmp/type_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/util/schemacmp/charset_collation.go

Signed-off-by: lance6716 <lance6716@gmail.com>
Signed-off-by: lance6716 <lance6716@gmail.com>

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
pkg/util/schemacmp/charset_collation.go (1)

155-158: Return joinErr directly in the collation join failure path.

Line 157 currently returns the earlier err from a.Compare(b). Returning joinErr is clearer and future-proof if charset join starts producing richer errors.

♻️ Proposed refactor
 	joinCharset, joinErr := a.charset.Join(b.charset)
 	if joinErr != nil {
-		return nil, err
+		return nil, joinErr
 	}

As per coding guidelines, "Keep error handling actionable and contextual; avoid silently swallowing errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/schemacmp/charset_collation.go` around lines 155 - 158, The
collation join error path returns the wrong variable; when calling
a.charset.Join(b.charset) and assigning joinCharset, joinErr, replace the
erroneous return of the earlier err (from a.Compare(b)) with returning joinErr
so the actual charset join error is propagated; update the failure branch in the
function that calls a.Compare(b) and a.charset.Join(b.charset) to return joinErr
instead of err.
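
To illustrate the difference, here is a small sketch with hypothetical stand-ins for `a.Compare(b)` and `a.charset.Join(b.charset)` that always fail. The function names and error values are assumptions made for the example; only the control-flow shape follows the snippet quoted in the comment:

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errCompare = errors.New("compare: incompatible collation")
	errJoin    = errors.New("join: incompatible charset")
)

// compare and joinCharset are hypothetical stand-ins that always fail,
// so that the failure branch under discussion is reached.
func compare() error { return errCompare }

func joinCharset() error { return errJoin }

// joinBefore mirrors the shape before the suggestion: when the charset
// join fails, it reports the earlier comparison error.
func joinBefore() error {
	err := compare()
	if err == nil {
		return nil
	}
	if joinErr := joinCharset(); joinErr != nil {
		return err
	}
	return nil
}

// joinAfter applies the suggestion and propagates the join error itself.
func joinAfter() error {
	if err := compare(); err == nil {
		return nil
	}
	if joinErr := joinCharset(); joinErr != nil {
		return joinErr
	}
	return nil
}

func main() {
	fmt.Println("before:", joinBefore())
	fmt.Println("after: ", joinAfter())
}
```

In both variants the caller receives a non-nil error; the change only affects which failure is reported, which matters once the two errors carry different detail.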

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e1a27928-23b5-422c-b143-82dc039ca920

📥 Commits

Reviewing files that changed from the base of the PR and between 67b337b and c3bffac.

📒 Files selected for processing (3)
  • pkg/util/schemacmp/charset_collation.go
  • pkg/util/schemacmp/charset_collation_test.go
  • pkg/util/schemacmp/table.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/util/schemacmp/table.go
  • pkg/util/schemacmp/charset_collation_test.go

@lance6716
Contributor Author

I have reviewed AI's code. PTAL @kennytm @GMHDBJD

Labels

release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

pkg/util/schemacmp: allow joining utf8/latin1 with utf8mb4

2 participants