Skip to content

MDEV-39318 MTR: fix test failures when running with --ssl#4929

Open
FaramosCZ wants to merge 4 commits intoMariaDB:10.11from
FaramosCZ:MDEV-39318
Open

MDEV-39318 MTR: fix test failures when running with --ssl#4929
FaramosCZ wants to merge 4 commits intoMariaDB:10.11from
FaramosCZ:MDEV-39318

Conversation

@FaramosCZ
Copy link
Copy Markdown
Contributor

@FaramosCZ FaramosCZ commented Apr 13, 2026

When MTR runs with --ssl (as Fedora's %check does), all client connections use TLS. This causes four categories of test failure. All fixes are test-side, backward-compatible, and produce identical results without --ssl.

Upstream CI doesn't use --ssl, so these failures go unnoticed.

1. TLS error codes replace MySQL protocol error codes

The connector's TLS layer intercepts server-side errors before the MySQL protocol can report them. For example, error 2026 (CR_SSL_CONNECTION_ERROR) replaces 2013 (CR_SERVER_LOST) on killed/crashed connections, and error 2002 (CR_CONNECTION_ERROR) replaces 1040 (ER_CON_COUNT_ERROR) when max_connections is exceeded.

This was partially addressed in two earlier commits (CONC-603 in 2022, MDEV-30452 in 2023), but only for tests actively failing upstream at the time. Thirteen tests remained unfixed.

This is arguably a connector bug — the connector should propagate the server's error code through the TLS layer rather than replacing it with a generic TLS error. The test-side fix (adding the TLS error codes to --error directives) is needed until the connector is fixed, and becomes a harmless no-op afterward.

2. Value differences in test output

connection_type in performance_schema.threads shows SSL/TLS instead of Socket. The server_audit plugin logs the TLS version (e.g. TLSv1.3) in CONNECT/DISCONNECT events.

Fix: --replace_result and --replace_regex to normalize the output. Without --ssl the replacements are no-ops.

3. Missing PFS socket instrumentation

Under TLS, data flows through SSL_read/SSL_write instead of direct recv/send syscalls, completely bypassing the performance_schema socket I/O instrumentation hooks. Tests that verify socket wait events or byte counters get zero values.

Fix: detect active TLS session via Ssl_cipher status variable and skip the affected tests. The existing not_ssl.inc checks @@have_ssl (server capability), which is always YES on OpenSSL builds and would skip too aggressively.

4. Ssl_cipher_list truncation

Ssl_cipher_list contains the full list of supported ciphers (~2047 chars on Fedora's OpenSSL 3.x), exceeding the VARCHAR(1024) VARIABLE_VALUE column in PFS and information_schema tables. This causes ER_DATA_TOO_LONG in multi-table UPDATEs with STRICT_TRANS_TABLES, and Warning 1265 in information_schema SELECTs.

Fix: temporarily relax sql_mode for PFS-joining UPDATEs; wrap information_schema SELECTs with --disable_warnings.


Each test was verified in the upstream BuildBot Fedora 42 container (both RelWithDebInfo and Debug builds):

  • without --ssl, before patch: pass (baseline)
  • with --ssl, before patch: fail
  • without --ssl, after patch: pass (no regression)
  • with --ssl, after patch: pass (fix works)

When MTR's '--ssl' flag is active, all client connections use
TLS. When a connection is killed or rejected, the TLS layer
intercepts the error before the MySQL protocol can report it:

- Error 2026 (CR_SSL_CONNECTION_ERROR) replaces error 2013
  (CR_SERVER_LOST) or 0 on killed connections.
- Error 2002 (CR_CONNECTION_ERROR) replaces error 1040
  (ER_CON_COUNT_ERROR) when max_connections is exceeded.

Upstream CI does not use '--ssl', so these go unnoticed. Distros
like Fedora run the test suite with '--ssl' enabled, causing
spurious failures.

Add the TLS error codes to each affected test's '--error'
directive. This is backward-compatible (tests still pass without
'--ssl') and follows the pattern already used elsewhere in the
tree.

The pattern was introduced in two waves, both incomplete:

- f9315b3 "CC 3.1 update" (CONC-603): Connector/C 3.1 fixed
  TLS read/write error handling so the client no longer always
  returns 2013 on connection loss -- it may now correctly return
  2026 or 5014. Updated ssl_timeout.test,
  wait_until_disconnected, kill.test, and openssl tests.

- e3e7264 "MDEV-30452": OpenSSL 3.x and recent GnuTLS changed
  error indication when the peer omits close_notify. Updated
  several galera, innodb, and sys_vars tests.

Both waves only fixed tests that were actively failing in their
CI at the time, leaving the remaining tests unfixed.

main.kill was also evaluated but excluded: it already tolerates
error 5014 (write error) which covers the TLS case, so it
passes with --ssl without changes.

Each test was verified in the upstream BuildBot Fedora 42
container (both RelWithDebInfo and Debug builds):

  - without --ssl, before patch: pass (baseline)
  - with --ssl, before patch: fail (got 2026)
  - without --ssl, after patch: pass (no regression)
  - with --ssl, after patch: pass (fix works)

Co-Authored-By: Claude AI <noreply@anthropic.com>
@FaramosCZ FaramosCZ marked this pull request as ready for review April 13, 2026 19:06
FaramosCZ and others added 3 commits April 14, 2026 00:02
When MTR's '--ssl' flag is active, connection_type in
performance_schema.threads shows 'SSL/TLS' instead of
'Socket', and server_audit logs include the TLS version
(e.g. 'TLSv1.3') in CONNECT/DISCONNECT events.

Fix with --replace_result and --replace_regex to normalize
the output.  Without '--ssl' the replacements are no-ops.

Same patterns already in the tree:
- --replace_result for TLS version normalization:
  main/chained_ssl_certificates.test:42,55
  main/ssl_crl.test:6
- The server_audit --replace_regex already normalizes
  timestamps and session IDs; the TLS version pattern
  is appended to the same directive.

Co-Authored-By: Claude AI <noreply@anthropic.com>
Under TLS, data flows through SSL_read/SSL_write instead
of direct recv/send syscalls, completely bypassing the
performance_schema socket I/O instrumentation hooks.
Socket wait events are never recorded.

Skip the tests when the session is actually encrypted,
detected via a non-empty Ssl_cipher status variable.
The existing not_ssl.inc checks @@have_ssl (server
capability), which is always 'YES' on OpenSSL builds
and would skip too aggressively.

Same Ssl_cipher session-level check already in the tree:
- main/ssl_timeout.test:11
- main/userstat.test:41

Co-Authored-By: Claude AI <noreply@anthropic.com>
Under TLS, Ssl_cipher_list contains the full list of
supported ciphers (~2047 chars on Fedora OpenSSL 3.x),
exceeding the VARCHAR(1024) VARIABLE_VALUE column in
PFS and information_schema tables.

In show_aggregate.inc, multi-table UPDATEs joining PFS
status tables fail with ER_DATA_TOO_LONG when
STRICT_TRANS_TABLES is active.  Fix: temporarily relax
sql_mode for PFS-joining UPDATEs.

In feedback_plugin_load.test, every SELECT from
information_schema.feedback triggers Warning 1265.
Fix: wrap SELECTs with --disable_warnings.

--disable_warnings/--enable_warnings for known benign
warnings is the standard MTR idiom, used extensively
throughout the test suite.

Co-Authored-By: Claude AI <noreply@anthropic.com>
@FaramosCZ FaramosCZ changed the title MDEV-39318 MTR: accept TLS errors with --ssl MDEV-39318 MTR: fix test failures when running with --ssl Apr 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant