pkg/util/schemacmp: allow utf8/latin1 < utf8mb4 charset/collation join by lance6716 · Pull Request #67778 · pingcap/tidb

lance6716 · 2026-04-15T06:27:58Z

What problem does this PR solve?

Issue Number: close #67777

Problem Summary:

pkg/util/schemacmp previously treated charset / collation as strict singletons, which makes schema compare/join fail for common upgrade paths like utf8 (utf8mb3) -> utf8mb4 or latin1 -> utf8mb4.

What changed and how does it work?

Add Charset and Collation lattices with ordering:
- utf8 (utf8mb3) < utf8mb4
- latin1 < utf8mb4
- for collations, only allow joining when the suffix matches (e.g. utf8_bin < utf8mb4_bin).
Use them when encoding types.FieldType and model.TableInfo in pkg/util/schemacmp.
Extend unit tests to cover the new join behavior.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

New Features
- Schema comparison now normalizes charsets and collations, with explicit compatibility rules and deterministic join results across compatible families.
Bug Fixes
- Restoration and encoding now consistently derive and emit CHARSET/COLLATE from stored collation, improving correctness for mixed charset/collation schemas.
- New, clearer incompatibility errors for charset vs collation mismatch cases.
Tests
- Added comprehensive tests for normalization, comparison ordering, join outcomes, and error messages.

pantheon-ai · 2026-04-15T06:28:04Z

@lance6716 I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

_{ℹ️ Learn more details on Pantheon AI.}

ti-chi-bot · 2026-04-15T06:28:06Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zhaoxinyu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tiprow · 2026-04-15T06:28:20Z

Hi @lance6716. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-04-15T06:28:22Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds charset and collation lattice types and tests; removes explicit charset singletons from schema lattices in favor of deriving charset from stored collations; introduces incompatibility error constructors; updates encoding/decoding, tests, and Bazel build files.

Changes

Cohort / File(s)	Summary
Charset & Collation Lattices `pkg/util/schemacmp/charset_collation.go`, `pkg/util/schemacmp/charset_collation_test.go`	Add `charsetLattice` and `collationLattice`, `Charset()`/`Collation()` constructors, implement `Compare`, `Join`, `Unwrap`; tests for normalization, ordering, join behavior, and error messages.
Lattice Errors `pkg/util/schemacmp/lattice.go`	Add `ErrMsgIncompatibleCharset` and `ErrMsgIncompatibleCollation` constants and helper constructors `incompatibleCharsetError` / `incompatibleCollationError`.
Schema Encoding/Decoding Integration `pkg/util/schemacmp/table.go`, `pkg/util/schemacmp/type.go`	Remove charset singleton from lattice tuples; store collations via `Collation(...)`; reconstruct charset from collate when decoding; adjust table encoding/restore to always emit `CHARSET` derived from `COLLATE`.
Tests Updated `pkg/util/schemacmp/table_test.go`, `pkg/util/schemacmp/type_test.go`	Switch test context to `metabuild`, add FieldType fixtures and compare/join cases (including utf8/latin1 → utf8mb4 scenarios), and update expected DDL/assertions to deterministic strings.
Build Files `pkg/util/schemacmp/BUILD.bazel`, `pkg/importsdk/BUILD.bazel`	Include `charset_collation.go` and its tests in schemacmp targets; add `//pkg/meta/metabuild` to test deps, remove some sessionctx/mock deps; add `//pkg/parser/ast` to `importsdk_test` deps.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

importsdk, importer, importinto: add import size estimate #67241: also modifies pkg/importsdk/BUILD.bazel to add //pkg/parser/ast to test/library dependencies (closely related build-target change).

Suggested labels

size/XXL, ok-to-test, approved, lgtm

Suggested reviewers

GMHDBJD
yudongusa

Poem

"I nibble lattices, soft and neat,
Charsets shuffle, collations meet,
utf8mb4 hops to lead the way,
Joins resolve and make my day,
Hoppy tests — schema complete! 🐰"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: introducing charset/collation lattices that allow utf8/latin1 to join with utf8mb4.
Description check	✅ Passed	The PR description follows the template structure with issue number, problem summary, detailed explanation of changes, checked unit test box, and release note section.
Linked Issues check	✅ Passed	The PR fully addresses issue `#67777` objectives: implements Charset/Collation lattices with utf8/latin1 < utf8mb4 ordering, applies them to FieldType/TableInfo encoding, and includes comprehensive unit tests.
Out of Scope Changes check	✅ Passed	All changes directly support the stated objectives: lattice implementations, integration into encoding/decoding logic, BUILD.bazel dependency updates, and comprehensive test coverage with no unrelated modifications.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

lance6716 · 2026-04-15T06:33:13Z

/check-issue-triage-complete

kennytm · 2026-04-15T06:41:05Z

(legit failure in idc-jenkins-ci-tidb/build)

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/util/schemacmp/charset_collation.go`:
- Around line 72-75: The comparison treats unknown charset values in
charsetFamilyOther as case-sensitive (a.value != b.value) causing GBK vs gbk to
be considered different; update the equality check inside the branch that
handles charsetFamilyOther to use a case-insensitive comparison (e.g., normalize
both values or use strings.EqualFold) instead of direct !=, and add the
suggested regression test in pkg/util/schemacmp/charset_collation_test.go
asserting Charset("GBK").Compare(Charset("gbk")) returns no error and cmp == 0
to prevent future regressions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 40f5d86c-81b2-4038-afc6-f0e66be34cf0

📥 Commits

Reviewing files that changed from the base of the PR and between 212ded5 and 350bc04.

📒 Files selected for processing (9)

pkg/importsdk/BUILD.bazel
pkg/util/schemacmp/BUILD.bazel
pkg/util/schemacmp/charset_collation.go
pkg/util/schemacmp/charset_collation_test.go
pkg/util/schemacmp/lattice.go
pkg/util/schemacmp/table.go
pkg/util/schemacmp/table_test.go
pkg/util/schemacmp/type.go
pkg/util/schemacmp/type_test.go

codecov · 2026-04-15T07:07:42Z

Codecov Report

❌ Patch coverage is 85.08772% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.3623%. Comparing base (212ded5) to head (c3bffac).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #67778        +/-   ##
================================================
- Coverage   77.6075%   77.3623%   -0.2453%     
================================================
  Files          1981       1967        -14     
  Lines        548811     550879      +2068     
================================================
+ Hits         425919     426173       +254     
- Misses       122082     124703      +2621     
+ Partials        810          3       -807

Flag	Coverage Δ
integration	`40.8657% <ø> (+6.5260%)`	⬆️
unit	`76.6529% <85.0877%> (+0.3103%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`61.5065% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`49.4060% <ø> (-11.0132%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: lance6716 <lance6716@gmail.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/util/schemacmp/charset_collation.go`:
- Around line 63-76: The Join method for charsetLattice (and the analogous
collation lattice at 131-144) must stop treating a Compare() error as a hard
failure and instead, when Compare reports incomparable, compute the least upper
bound by scanning known lattice members: implement Join to first try cmp :=
a.Compare(b) and return a or b when comparable, but if Compare returns an
incomparable error, iterate over a canonical list of all charset lattice members
(use or add a helper like allCharsets()) and pick the minimal candidate c such
that c.Compare(a) >= 0 and c.Compare(b) >= 0 (use Compare to test), returning
the smallest matching candidate; if none found return the original error. Apply
the same pattern for collationLattice, and add regression tests asserting
Charset("utf8").Join(Charset("latin1")) succeeds and
Collation("utf8_bin").Join(Collation("latin1_bin")) succeeds.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c7909230-f861-42c9-913e-bdd80be2c9c2

📥 Commits

Reviewing files that changed from the base of the PR and between 350bc04 and 650f8b3.

📒 Files selected for processing (5)

pkg/util/schemacmp/charset_collation.go
pkg/util/schemacmp/charset_collation_test.go
pkg/util/schemacmp/table.go
pkg/util/schemacmp/table_test.go
pkg/util/schemacmp/type.go

✅ Files skipped from review due to trivial changes (1)

pkg/util/schemacmp/charset_collation_test.go

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/util/schemacmp/table.go
pkg/util/schemacmp/table_test.go

coderabbitai · 2026-04-15T10:14:49Z

+func (a charsetLattice) Join(other Lattice) (Lattice, error) {
+	b, ok := other.(charsetLattice)
+	if !ok {
+		return nil, typeMismatchError(a, other)
+	}
+
+	cmp, err := a.Compare(b)
+	if err != nil {
+		return nil, err
+	}
+	if cmp >= 0 {
+		return a, nil
+	}
+	return b, nil


⚠️ Potential issue | 🟠 Major

Handle incomparable-but-joinable pairs in Join().

Both Join() implementations currently just call Compare(), so utf8 and latin1 still fail to join even though the new ordering makes both of them lower than utf8mb4. That breaks the lattice contract and makes multi-way joins order-dependent, e.g. (utf8 JOIN latin1) JOIN utf8mb4 fails at the first step even though the least upper bound exists.

🔧 Proposed fix

func (a charsetLattice) Join(other Lattice) (Lattice, error) { b, ok := other.(charsetLattice) if !ok { return nil, typeMismatchError(a, other) } - cmp, err := a.Compare(b) - if err != nil { - return nil, err - } - if cmp >= 0 { + switch { + case a.value == b.value: return a, nil + case a.value == tidbcharset.CharsetUTF8MB4 && + (b.value == tidbcharset.CharsetUTF8 || b.value == tidbcharset.CharsetLatin1): + return a, nil + case b.value == tidbcharset.CharsetUTF8MB4 && + (a.value == tidbcharset.CharsetUTF8 || a.value == tidbcharset.CharsetLatin1): + return b, nil + case (a.value == tidbcharset.CharsetUTF8 && b.value == tidbcharset.CharsetLatin1) || + (a.value == tidbcharset.CharsetLatin1 && b.value == tidbcharset.CharsetUTF8): + return Charset(tidbcharset.CharsetUTF8MB4), nil + default: + return nil, incompatibleCharsetError(a.value, b.value) } - return b, nil } func (a collationLattice) Join(other Lattice) (Lattice, error) { b, ok := other.(collationLattice) if !ok { return nil, typeMismatchError(a, other) } - cmp, err := a.Compare(b) - if err != nil { - return nil, err + if a.suffix != b.suffix { + return nil, incompatibleCollationError(a.unwrapString(), b.unwrapString()) } - if cmp >= 0 { - return a, nil + + joinedCharset, err := a.charset.Join(b.charset) + if err != nil { + return nil, err } - return b, nil + return collationLattice{ + charset: joinedCharset.(charsetLattice), + suffix: a.suffix, + }, nil }

Please also add regression coverage for Charset("utf8").Join(Charset("latin1")) and Collation("utf8_bin").Join(Collation("latin1_bin")).

Also applies to: 131-144

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/util/schemacmp/charset_collation.go` around lines 63 - 76, The Join method for charsetLattice (and the analogous collation lattice at 131-144) must stop treating a Compare() error as a hard failure and instead, when Compare reports incomparable, compute the least upper bound by scanning known lattice members: implement Join to first try cmp := a.Compare(b) and return a or b when comparable, but if Compare returns an incomparable error, iterate over a canonical list of all charset lattice members (use or add a helper like allCharsets()) and pick the minimal candidate c such that c.Compare(a) >= 0 and c.Compare(b) >= 0 (use Compare to test), returning the smallest matching candidate; if none found return the original error. Apply the same pattern for collationLattice, and add regression tests asserting Charset("utf8").Join(Charset("latin1")) succeeds and Collation("utf8_bin").Join(Collation("latin1_bin")) succeeds.

Oh, very complicated 😂 not sure we should support it

@lance6716 Totally fair — the utf8 JOIN latin1 case is genuinely unusual as an upgrade path and isn't part of the scope in issue #67777. The two intended paths (utf8 → utf8mb4 and latin1 → utf8mb4) work correctly with the current Join() implementation since those are directly comparable via Compare().

The edge case only matters if a DM task ever encounters a table where one side has utf8 and the other has latin1 (with neither being utf8mb4), which seems unlikely in practice. If that scenario comes up later it can be addressed separately. Happy to drop the suggestion.

🧠 Learnings used

Learnt from: guo-shaoge Repo: pingcap/tidb PR: 66349 File: tests/integrationtest/r/tpch.result:447-447 Timestamp: 2026-03-11T03:48:51.420Z Learning: In `pkg/planner/core/joinorder/` (TiDB join reorder), cartesian edges should only be considered in two specific scenarios: 1. After the entire join reorder is complete and there are still disjoint join groups that need to be connected. 2. When a `leading` hint explicitly requires a cartesian join. PR `#66213` attempted to fix case 2 but accidentally enabled case 1 as well. The correct logic ensures cartesian edges are gated to only these two situations.

Learnt from: Reminiscent Repo: pingcap/tidb PR: 65367 File: pkg/planner/core/rule_join_reorder_dp.go:336-364 Timestamp: 2026-04-15T02:49:15.099Z Learning: In `pkg/planner/core/rule_join_reorder_dp.go` (pingcap/tidb PR `#65367`), `findNodeIndexForColumns` intentionally returns `plannererrors.ErrInternal` when substituted equality-edge columns span multiple join-group leaves. This error is NOT user-visible: `optimizeRecursive` in `rule_join_reorder.go` wraps the DP call with a `fallbackOnErr` closure that, when `TiDBOptJoinReorderThroughProj` is enabled, catches any error from the DP solver and reruns join reorder with `TiDBOptJoinReorderThroughProj` disabled, treating the blocking projection as an atomic leaf. The hard-fail path is a deliberate retry sentinel, not a user-facing planner error. Do not flag this pattern as incorrect in future reviews.

Learnt from: guo-shaoge Repo: pingcap/tidb PR: 67284 File: pkg/planner/core/joinorder/join_order.go:386-406 Timestamp: 2026-03-26T09:18:03.097Z Learning: In `pkg/planner/core/joinorder/join_order.go` (pingcap/tidb), the DP join reorder `optimizeWithDetector` function has a sanity guard `nodeCount > 63` before allocating `bestPlan := make([]*Node, 1<<nodeCount)`. This guard is off-by-one: when `nodeCount == 63`, `1<<63` overflows a signed int on 64-bit Go causing a `makeslice` panic. The correct guard should be `nodeCount >= 63`. A follow-up to reduce the effective DP threshold below 63 is tracked in issue `#67325`.

Learnt from: cryo-zd Repo: pingcap/tidb PR: 65532 File: pkg/planner/core/preprocess.go:1887-1898 Timestamp: 2026-03-12T08:41:44.509Z Learning: In pkg/planner/core/preprocess.go (pingcap/tidb PR `#65532`), the `checkLockClauseTables` function intentionally accepts `FOR UPDATE OF <base_table>` even when the FROM item was introduced with an alias (`matched.alias.L != "" && !matchedByAlias`). Rather than returning an error (strict MySQL behavior), TiDB emits a warning to guide users toward the alias form. This backward-compatibility fallback is a deliberate acceptance criterion for the change — do not flag it as a MySQL-incompatibility bug in future reviews.

Learnt from: zimulala Repo: pingcap/tidb PR: 67265 File: pkg/util/topsql/reporter/ru_datamodel_test.go:259-308 Timestamp: 2026-03-25T03:46:10.574Z Learning: In `pkg/util/topsql/reporter/ru_datamodel_test.go` (pingcap/tidb PR `#67265`), `TestRUCollectingOthersWireLabelNoCollisionWithRuntimeUserShape` intentionally uses `"app127.0.0.1"` (not `othersUserWireLabel`) as the runtime user. The regression is scoped to runtime user shapes (`userhost` / empty string); broadening to arbitrary raw user strings matching the wire label is out of contract. The empty-user and merge-path regressions (`TestRUCollectingEmptyUserAndGlobalOthersRemainDistinct`, `TestRUCollectingMergeFromKeepsEmptyUserDistinctFromGlobalOthers`) plus aggregator-side assertion hardening collectively cover the fix.

Learnt from: joechenrh Repo: pingcap/tidb PR: 66878 File: tests/realtikvtest/importintotest/import_into_test.go:253-267 Timestamp: 2026-03-11T06:26:14.065Z Learning: In pingcap/tidb, PR `#66878` (cherry-pick of `#58401` to release-7.5) also bundles the follow-up nil-check fix from PR `#63146`. `TestOnUpdateColumn` in `tests/realtikvtest/importintotest/import_into_test.go` was cherry-picked from PR `#63146` and tests that `IMPORT INTO` on a table with an `ON UPDATE CURRENT_TIMESTAMP` column does NOT panic due to a nil `onDup` map in `resolveGeneratedColumns`. It is NOT intended to cover the transitive generated-column / index-inconsistency fix from `#58401/`#58400.

Learnt from: cryo-zd Repo: pingcap/tidb PR: 65532 File: pkg/planner/core/preprocess.go:1906-1910 Timestamp: 2026-03-12T08:03:56.097Z Learning: In pingcap/tidb (PR `#65532`), `buildSelect` in `pkg/planner/core/logical_plan_builder.go` derives the privilege-check DB for `FOR UPDATE OF` lock targets using `getLowerDB(tName.Schema, ...)`. For alias or unqualified lock targets whose `TableName.Schema` is empty after preprocessing, this falls back to `currentDB` and authorizes against the wrong database. The correct fix is to use `tNameW.DBInfo.Name.L` (with a nil guard) directly, since `tNameW` is already fully resolved by `checkLockClauseTables` in preprocess.go. This is consistent with how privilege checks for INSERT/DELETE use `tnW.DBInfo.Name.L` elsewhere in the codebase.

Learnt from: joechenrh Repo: pingcap/tidb PR: 66878 File: pkg/planner/core/logical_plan_builder.go:6455-6472 Timestamp: 2026-03-11T06:29:00.122Z Learning: Ensure code reviews verify the VirtualAssignmentsOffset semantics: the planner sets Update.VirtualAssignmentsOffset = len(update.List). The executor should only apply OrderedList[:VirtualAssignmentsOffset] when composing new rows, and only after the 'changed' check should it group/evaluate OrderedList[VirtualAssignmentsOffset:] per table. This pattern applies to files under pkg/planner/core and pkg/executor (e.g., common_plans.go and update.go). Reviewers should check that updates respect slicing behavior, that the offset is consistently derived from the planner, and that downstream code does not bypass the offset when creating new rows. Add tests validating both branches: the slice before the offset for new rows, and the per-table handling of the slice after the offset.

@kennytm Should I follow this comment? Which means utf8 JOIN latin1 = utf8mb4. I think it's reasonable.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@lance6716 I think it is fine. The implication is that though, if we have 2 shards and the user goes

alter table t1 add column c text collate latin1_bin; alter table t2 add column c text collate utf8mb3_bin;

it will end up in downstream TiDB as (if i remembered what optimistic sharding works correctly):

alter table t add column c text collate latin1_bin; alter table t modify column c text collate utf8mb4_bin; # ← result of join

My coding agent pushed a commit 67b337b for this comment but I haven't reviewed it 😂 I'll do a self review first

kennytm · 2026-04-15T15:35:21Z

+	collate := table[tableInfoTupleIndexCollate].(string)
+	if collate != "" {
+		cs, _, hasSuffix := strings.Cut(collate, "_")
 		ctx.WriteKeyWord(" CHARSET ")
-		ctx.WriteKeyWord(charset)
-	}
-	if collate := table[tableInfoTupleIndexCollate].(string); collate != "" {
-		ctx.WriteKeyWord(" COLLATE ")
-		ctx.WriteKeyWord(collate)
+		ctx.WriteKeyWord(cs)
+		if hasSuffix {
+			ctx.WriteKeyWord(" COLLATE ")
+			ctx.WriteKeyWord(collate)
+		}


Signed-off-by: lance6716 <lance6716@gmail.com>

coderabbitai

♻️ Duplicate comments (1)

pkg/util/schemacmp/charset_collation.go (1)

63-76: ⚠️ Potential issue | 🟠 Major

Return the least upper bound for utf8/latin1 joins.

Both Join() implementations still treat utf8 vs latin1 as a hard incompatibility because they only forward to Compare(). That means Charset("utf8").Join(Charset("latin1")) and Collation("utf8_bin").Join(Collation("latin1_bin")) still fail even though the lattice defined above them has a valid LUB: utf8mb4 / utf8mb4_bin. In practice, table joins remain order-dependent for the mixed-shard upgrade path this PR is meant to unblock.

💡 Proposed fix

 func (a charsetLattice) Join(other Lattice) (Lattice, error) {
 	b, ok := other.(charsetLattice)
 	if !ok {
 		return nil, typeMismatchError(a, other)
 	}
 
-	cmp, err := a.Compare(b)
-	if err != nil {
-		return nil, err
-	}
-	if cmp >= 0 {
+	switch {
+	case a.value == b.value:
 		return a, nil
+	case a.value == tidbcharset.CharsetUTF8MB4 &&
+		(b.value == tidbcharset.CharsetUTF8 || b.value == tidbcharset.CharsetLatin1):
+		return a, nil
+	case b.value == tidbcharset.CharsetUTF8MB4 &&
+		(a.value == tidbcharset.CharsetUTF8 || a.value == tidbcharset.CharsetLatin1):
+		return b, nil
+	case (a.value == tidbcharset.CharsetUTF8 && b.value == tidbcharset.CharsetLatin1) ||
+		(a.value == tidbcharset.CharsetLatin1 && b.value == tidbcharset.CharsetUTF8):
+		return Charset(tidbcharset.CharsetUTF8MB4), nil
+	default:
+		return nil, incompatibleCharsetError(a.value, b.value)
 	}
-	return b, nil
 }
 
 func (a collationLattice) Join(other Lattice) (Lattice, error) {
 	b, ok := other.(collationLattice)
 	if !ok {
 		return nil, typeMismatchError(a, other)
 	}
 
-	cmp, err := a.Compare(b)
-	if err != nil {
-		return nil, err
+	if a.suffix != b.suffix {
+		return nil, incompatibleCollationError(a.unwrapString(), b.unwrapString())
 	}
-	if cmp >= 0 {
-		return a, nil
+
+	joinedCharset, err := a.charset.Join(b.charset)
+	if err != nil {
+		return nil, err
 	}
-	return b, nil
+	return collationLattice{
+		charset: joinedCharset.(charsetLattice),
+		suffix:  a.suffix,
+	}, nil
 }

Please also add a direct regression case for utf8↔latin1 and utf8_bin↔latin1_bin.

Also applies to: 128-141

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/util/schemacmp/charset_collation.go` around lines 63 - 76, The Join
implementations for charsetLattice (and the analogous collation lattice)
currently treat utf8 vs latin1 as incompatible because they simply forward to
Compare; change Join in charsetLattice (and the corresponding collation join) to
detect the utf8↔latin1 pair and return the least upper bound (utf8mb4 for
charset, utf8mb4_bin for collation) instead of erroring: after type-asserting
other to charsetLattice and before returning errors from Compare, add logic to
recognize when one lattice value is Charset("utf8") and the other is
Charset("latin1") (and similarly for Collation("utf8_bin") vs
Collation("latin1_bin")) and return the appropriate LUB instance; update the
Compare-based fallback to remain for other cases. Also add regression tests that
call Charset("utf8").Join(Charset("latin1")) and
Collation("utf8_bin").Join(Collation("latin1_bin")) and assert the returned LUB
equals utf8mb4 / utf8mb4_bin.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/util/schemacmp/charset_collation.go`:
- Around line 63-76: The Join implementations for charsetLattice (and the
analogous collation lattice) currently treat utf8 vs latin1 as incompatible
because they simply forward to Compare; change Join in charsetLattice (and the
corresponding collation join) to detect the utf8↔latin1 pair and return the
least upper bound (utf8mb4 for charset, utf8mb4_bin for collation) instead of
erroring: after type-asserting other to charsetLattice and before returning
errors from Compare, add logic to recognize when one lattice value is
Charset("utf8") and the other is Charset("latin1") (and similarly for
Collation("utf8_bin") vs Collation("latin1_bin")) and return the appropriate LUB
instance; update the Compare-based fallback to remain for other cases. Also add
regression tests that call Charset("utf8").Join(Charset("latin1")) and
Collation("utf8_bin").Join(Collation("latin1_bin")) and assert the returned LUB
equals utf8mb4 / utf8mb4_bin.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a932c3a1-e67f-455d-a25c-102070788827

📥 Commits

Reviewing files that changed from the base of the PR and between 650f8b3 and 7f484b0.

📒 Files selected for processing (7)

pkg/util/schemacmp/BUILD.bazel
pkg/util/schemacmp/charset_collation.go
pkg/util/schemacmp/charset_collation_test.go
pkg/util/schemacmp/lattice.go
pkg/util/schemacmp/table.go
pkg/util/schemacmp/table_test.go
pkg/util/schemacmp/type_test.go

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/util/schemacmp/type_test.go
pkg/util/schemacmp/BUILD.bazel

coderabbitai

🧹 Nitpick comments (1)

pkg/util/schemacmp/charset_collation_test.go (1)

108-116: Add one negative Join case for suffix mismatch.

TestCollationJoin currently covers only successful joins. Since the contract requires matching suffixes, adding one failing-join assertion will harden regression coverage for the join path itself (not just compare).

Suggested test addition

 func TestCollationJoin(t *testing.T) {
 	join, err := Collation("utf8_bin").Join(Collation("latin1_bin"))
 	require.NoError(t, err)
 	require.Equal(t, "utf8mb4_bin", join.Unwrap())

 	join, err = Collation("latin1_bin").Join(Collation("utf8_bin"))
 	require.NoError(t, err)
 	require.Equal(t, "utf8mb4_bin", join.Unwrap())
+
+	_, err = Collation("utf8mb4_general_ci").Join(Collation("utf8mb4_0900_ai_ci"))
+	require.ErrorContains(t, err, "incompatible collation (utf8mb4_general_ci vs utf8mb4_0900_ai_ci)")
 }

Based on learnings: **/*.go : MUST add a regression test and verify it fails before fix and passes after fix for bug fixes.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/util/schemacmp/charset_collation_test.go` around lines 108 - 116, Add a
negative test case to TestCollationJoin to assert that Collation.Join returns an
error when suffixes mismatch: call
Collation("utf8_general_ci").Join(Collation("utf8_bin")) (or another pair with
differing suffixes), require an error (require.Error/require.NotNil) and
optionally assert the error message contains the suffix-mismatch hint; this
targets the Collation.Join behavior (and the TestCollationJoin test) so the test
fails before the fix and passes after.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/util/schemacmp/charset_collation_test.go`:
- Around line 108-116: Add a negative test case to TestCollationJoin to assert
that Collation.Join returns an error when suffixes mismatch: call
Collation("utf8_general_ci").Join(Collation("utf8_bin")) (or another pair with
differing suffixes), require an error (require.Error/require.NotNil) and
optionally assert the error message contains the suffix-mismatch hint; this
targets the Collation.Join behavior (and the TestCollationJoin test) so the test
fails before the fix and passes after.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 96243436-ea7b-488c-98bd-1ec62c37aa98

📥 Commits

Reviewing files that changed from the base of the PR and between 7f484b0 and 67b337b.

📒 Files selected for processing (3)

pkg/util/schemacmp/charset_collation.go
pkg/util/schemacmp/charset_collation_test.go
pkg/util/schemacmp/type_test.go

✅ Files skipped from review due to trivial changes (1)

pkg/util/schemacmp/type_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/util/schemacmp/charset_collation.go

Signed-off-by: lance6716 <lance6716@gmail.com>

coderabbitai

🧹 Nitpick comments (1)

pkg/util/schemacmp/charset_collation.go (1)
155-158: Return joinErr directly in the collation join failure path.

Line 157 currently returns the earlier err from a.Compare(b). Returning joinErr is clearer and future-proof if charset join starts producing richer errors.
♻️ Proposed refactor
 	joinCharset, joinErr := a.charset.Join(b.charset)
 	if joinErr != nil {
-		return nil, err
+		return nil, joinErr
 	}
As per coding guidelines, "Keep error handling actionable and contextual; avoid silently swallowing errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/util/schemacmp/charset_collation.go` around lines 155 - 158, The
collation join error path returns the wrong variable; when calling
a.charset.Join(b.charset) and assigning joinCharset, joinErr, replace the
erroneous return of the earlier err (from a.Compare(b)) with returning joinErr
so the actual charset join error is propagated; update the failure branch in the
function that calls a.Compare(b) and a.charset.Join(b.charset) to return joinErr
instead of err.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/util/schemacmp/charset_collation.go`:
- Around line 155-158: The collation join error path returns the wrong variable;
when calling a.charset.Join(b.charset) and assigning joinCharset, joinErr,
replace the erroneous return of the earlier err (from a.Compare(b)) with
returning joinErr so the actual charset join error is propagated; update the
failure branch in the function that calls a.Compare(b) and
a.charset.Join(b.charset) to return joinErr instead of err.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e1a27928-23b5-422c-b143-82dc039ca920

📥 Commits

Reviewing files that changed from the base of the PR and between 67b337b and c3bffac.

📒 Files selected for processing (3)

pkg/util/schemacmp/charset_collation.go
pkg/util/schemacmp/charset_collation_test.go
pkg/util/schemacmp/table.go

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/util/schemacmp/table.go
pkg/util/schemacmp/charset_collation_test.go

lance6716 · 2026-04-16T14:15:20Z

I have reviewed AI's code. PTAL @kennytm @GMHDBJD

pkg/util/schemacmp: allow utf8/latin1 < utf8mb4 charset/collation join

0162d39

ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 15, 2026

ti-chi-bot bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 15, 2026

lance6716 mentioned this pull request Apr 15, 2026

schemacmp: allow latin1/utf8 < utf8mb4 charset/collation join pingcap/tidb-tools#890

Closed

4 tasks

pkg/util/schemacmp: use ErrorContains in charset/collation tests

3a97fea

ti-chi-bot bot removed the do-not-merge/needs-triage-completed label Apr 15, 2026

lance6716 requested review from GMHDBJD and kennytm April 15, 2026 06:33

pkg/util/schemacmp: reorder collationLattice fields

350bc04

coderabbitai bot reviewed Apr 15, 2026

View reviewed changes

Comment thread pkg/util/schemacmp/charset_collation.go Outdated

kennytm reviewed Apr 15, 2026

View reviewed changes

Comment thread pkg/util/schemacmp/charset_collation.go

Comment thread pkg/util/schemacmp/charset_collation.go Outdated

Comment thread pkg/util/schemacmp/table.go

Comment thread pkg/util/schemacmp/type.go

address comment

650f8b3

Signed-off-by: lance6716 <lance6716@gmail.com>

coderabbitai bot reviewed Apr 15, 2026

View reviewed changes

kennytm reviewed Apr 15, 2026

View reviewed changes

lance6716 added 3 commits April 16, 2026 16:21

address comment

393cea7

Signed-off-by: lance6716 <lance6716@gmail.com>

remove a unused branch

d505f24

Signed-off-by: lance6716 <lance6716@gmail.com>

pkg/util/schemacmp: fix bazel deps and nogo lint

7f484b0

coderabbitai bot reviewed Apr 16, 2026

View reviewed changes

pkg/util/schemacmp: handle join for incomparable pairs

67b337b

coderabbitai bot reviewed Apr 16, 2026

View reviewed changes

lance6716 added 2 commits April 16, 2026 21:44

refine code

cd24200

Signed-off-by: lance6716 <lance6716@gmail.com>

address comment

c3bffac

Signed-off-by: lance6716 <lance6716@gmail.com>

coderabbitai bot reviewed Apr 16, 2026

View reviewed changes

Conversation

lance6716 commented Apr 15, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What changed and how does it work?

Check List

Release note

Summary by CodeRabbit

Uh oh!

pantheon-ai bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ti-chi-bot bot commented Apr 15, 2026

Uh oh!

tiprow bot commented Apr 15, 2026

Uh oh!

coderabbitai bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

lance6716 commented Apr 15, 2026

Uh oh!

kennytm commented Apr 15, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lance6716 Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

lance6716 Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

kennytm Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

lance6716 Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kennytm Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

lance6716 commented Apr 15, 2026 •

edited by coderabbitai bot

Loading

pantheon-ai bot commented Apr 15, 2026 •

edited

Loading

coderabbitai bot commented Apr 15, 2026 •

edited

Loading

codecov bot commented Apr 15, 2026 •

edited

Loading

coderabbitai bot Apr 15, 2026 •

edited

Loading

lance6716 Apr 16, 2026 •

edited

Loading