[SPARK-56000][BUILD] Upgrade arrow-java to 19.0.0 #54820

LuciferYang wants to merge 3 commits into apache:master

Conversation
There are test failures that require investigation. |
Thanks @LuciferYang for upgrading this. The PySpark side has a few efforts waiting for fixes in arrow-java 19.0.0. Do we think this can be backported to the 4.0 and 4.1 branches? cc @HyukjinKwon @dbtsai as well
No, @Yicong-Huang. The Apache Spark community doesn't allow that kind of backporting. Instead, you need to ask the Apache Arrow community to deliver maintenance releases. In your case, we may then consume those bug-fix maintenance releases.
Got it. Thanks @dongjoon-hyun for the explanation. I will talk to the arrow-java community about corresponding maintenance releases.
Thank you, @Yicong-Huang.
```scala
      throw new IllegalStateException(
        s"Expected $expectedNumRows rows in arrow batch but got $numRecordsInBatch.")
    }
    val messagesInBatch = messages.result()
```
After making this modification, all tests pass. However, I haven't yet examined the specific changes in version 19.0.0 that necessitated it, so I'm temporarily unable to confirm whether version 18.3.0 is truly issue-free or whether problems simply haven't been uncovered yet. Let's hold off for a while. If this turns out to be a lingering issue, I will submit a PR to fix it first.
Personally, I'm more inclined to think that this is a lingering issue.
1. Problem Statement
After upgrading arrow-java from 18.3.0 to 19.0.0, Spark Connect client tests (e.g. CatalogSuite, DataFrameTableValuedFunctionsSuite) fail in afterAll() when allocator.close() is called:

```
java.lang.IllegalStateException: Allocator[ROOT] closed with outstanding buffers allocated (12).
```
The stack trace points to:
SparkResult.processResponses()
→ MessageIterator.next()
→ MessageSerializer.deserializeRecordBatch()
→ ArrowBuf.slice()
2. Root Cause
2.1 Pre-existing bug in SparkResult.processResponses()
In SparkResult.processResponses(), when a deserialized Arrow batch contains 0 rows (numRecordsInBatch == 0), the ArrowMessage objects are neither stored in resultMap nor closed:

```scala
// Before fix
if (numRecordsInBatch > 0) {
  numRecords += numRecordsInBatch
  resultMap.put(nextResultIndex, (reader.bytesRead, messages.result()))
  nextResultIndex += 1
  // ...
}
// When numRecordsInBatch == 0: messages.result() is silently dropped, never closed
```

SparkResultCloseable.close() only releases messages stored in resultMap. Empty-batch messages fall through, and their underlying Arrow buffers are never released.
2.2 Arrow GH-343 made it observable
Arrow-Java GH-343 fixed offset buffer IPC serialization for empty vectors (valueCount == 0). This fix, included in v19.0.0, changed the IPC body size of empty batches from 0 bytes to a non-zero value, which turned the previously-silent Spark bug into a visible allocator failure.
The relevant commits between v18.3.0 and v19.0.0:

| Commit | Scope |
|---|---|
| 0f8a0808f (PR #967) | Fix ListVector / LargeListVector offset buffer when valueCount == 0 |
| 77df3ecb2 (PR #989) | Fix BaseVariableWidthVector / BaseLargeVariableWidthVector offset buffer when valueCount == 0 |
What changed: when valueCount == 0, setReaderAndWriterIndex() previously set offsetBuffer.writerIndex(0), making readableBytes() == 0 and writing 0 bytes to the IPC stream. The Arrow spec requires that offset buffers always contain at least one entry [0], so GH-343 changed this to offsetBuffer.writerIndex(OFFSET_WIDTH), making readableBytes() == 4.
| Version | setReaderAndWriterIndex() when valueCount == 0 | IPC body size |
|---|---|---|
| v18.3.0 | offsetBuffer.writerIndex(0) → readableBytes() = 0 | 0 bytes |
| v19.0.0 | offsetBuffer.writerIndex(OFFSET_WIDTH = 4) → readableBytes() = 4 | > 0 bytes |
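The size arithmetic behind this table can be sketched in a few lines. This is a hypothetical illustration, not Arrow's actual code; the method names `specBytes` and `v18Bytes` are invented for the example.

```java
// Hypothetical sketch (not arrow-java code): the Arrow format requires an
// offset buffer with valueCount + 1 entries, so even an empty vector should
// serialize a 4-byte offset buffer; pre-GH-343, arrow-java wrote 0 bytes.
class OffsetBufferSize {
    static final int OFFSET_WIDTH = 4; // bytes per 32-bit offset entry

    // Bytes the offset buffer should contribute to the IPC body, per the spec.
    static long specBytes(int valueCount) {
        return (long) (valueCount + 1) * OFFSET_WIDTH;
    }

    // v18.3.0 behavior: writerIndex(0) when valueCount == 0, so nothing written.
    static long v18Bytes(int valueCount) {
        return valueCount == 0 ? 0 : (long) (valueCount + 1) * OFFSET_WIDTH;
    }

    public static void main(String[] args) {
        System.out.println(specBytes(0)); // 4
        System.out.println(v18Bytes(0));  // 0
    }
}
```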
3. Detailed Causal Chain
| Step | v18.3.0 | v19.0.0 |
|---|---|---|
| 1. Server serializes empty batch (valueCount=0) | Offset buffer writes 0 bytes → IPC body = 0 bytes | Offset buffer writes 4+ bytes → IPC body > 0 bytes |
| 2. Client calls readMessageBody(in, bodyLength, allocator) | allocator.buffer(0) returns the singleton getEmpty(), backed by EmptyReferenceManager (not tracked by allocator) | allocator.buffer(bodyLength > 0) allocates a real ArrowBuf (tracked by allocator) |
| 3. deserializeRecordBatch calls body.slice() per field buffer | Slices share EmptyReferenceManager; retain()/release() are no-ops | Slices share a real BufferLedger; retain() increments the refcount |
| 4. ArrowRecordBatch constructor calls retain() per slice | No-op | Refcount increases |
| 5. body.getReferenceManager().release() | No-op | Refcount decreases by 1, but slices still hold references |
| 6. ArrowRecordBatch.close() never called (Spark bug) | No impact; empty buffers are untracked | Buffer leak; refcount > 0, tracked buffers remain |
| 7. allocator.close() | Succeeds; no outstanding tracked buffers | Throws IllegalStateException |
Key mechanism: BaseAllocator.buffer(0) returns an untracked empty buffer

```java
// BaseAllocator.java
public ArrowBuf buffer(final long initialRequestSize, BufferManager manager) {
  if (initialRequestSize == 0) {
    return getEmpty(); // singleton, EmptyReferenceManager; not tracked
  }
  // ... allocate a real buffer, tracked by the allocator
}
```

In v18.3.0, the empty-batch IPC body is 0 bytes → allocator.buffer(0) → getEmpty() (untracked). All downstream slice(), retain(), and release() calls are no-ops. The missing close() in SparkResult is harmless.
In v19.0.0, empty-batch IPC body is > 0 bytes → allocator.buffer(n) → real tracked buffer. The missing close() becomes a real off-heap memory leak.
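The refcounting interplay above can be modeled without Arrow at all. The following toy allocator (every name here is invented for illustration; it is not Arrow's API) reproduces the failure mode: a tracked buffer's refcount is bumped by a slice retain(), the body release() brings it back to 1, and without a batch close() the allocator still sees an outstanding buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the causal chain: each tracked buffer has a refcount ledger;
// slices retain() it, owners release() it, and close() fails if any ledger
// is still positive. Zero-size buffers mimic v18.3.0's untracked getEmpty().
class ToyAllocator {
    private final List<AtomicInteger> ledgers = new ArrayList<>();

    AtomicInteger buffer(long size) {
        AtomicInteger ledger = new AtomicInteger(size == 0 ? 0 : 1);
        if (size > 0) ledgers.add(ledger); // only real buffers are tracked
        return ledger;
    }

    long outstanding() {
        return ledgers.stream().filter(l -> l.get() > 0).count();
    }

    void close() {
        long n = outstanding();
        if (n > 0) {
            throw new IllegalStateException(
                "Allocator[ROOT] closed with outstanding buffers allocated (" + n + ").");
        }
    }

    public static void main(String[] args) {
        ToyAllocator alloc = new ToyAllocator();
        AtomicInteger body = alloc.buffer(4);    // v19.0.0 path: tracked, refcount 1
        body.incrementAndGet();                  // slice retain() in the record batch
        body.decrementAndGet();                  // release() of the body buffer
        // Missing ArrowRecordBatch.close(): the slice's reference keeps refcount at 1.
        System.out.println(alloc.outstanding()); // 1
        body.decrementAndGet();                  // what the batch's close() would do
        alloc.close();                           // now succeeds
    }
}
```

Running the v18.3.0 path through the same model (`alloc.buffer(0)`) adds nothing to the ledger list, which is why `close()` never complained before the upgrade.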
4. Does v18.3.0 have an actual memory leak?

No. Under v18.3.0, empty batches cause no memory leak at all:

- Off-heap memory: zero leak. `allocator.buffer(0)` returns the pre-allocated singleton empty buffer. No additional off-heap memory is allocated for empty batches, so there is nothing to leak.
- Java heap objects: the orphaned `ArrowRecordBatch`/`ArrowBuf` wrapper objects hold no strong references after `processResponses()` returns, and are collected by normal GC.
- Allocator tracking: `EmptyReferenceManager` is a no-op singleton. The allocator never registers these buffers, so `allocator.close()` sees no outstanding allocations.

The bug in SparkResult is logically present in v18.3.0, but it is structurally impossible for it to cause any resource leak, because the entire empty-buffer path (from allocation through slicing to reference counting) operates on untracked no-op objects.

Under v19.0.0 without the fix, the situation is different:

- `allocator.buffer(bodyLength > 0)` allocates real off-heap memory.
- `ArrowRecordBatch` is never `close()`-d, so the `BufferLedger` refcount never reaches 0.
- `ArrowBuf` has no finalizer or `Cleaner`, so GC of the Java wrapper does not decrement the off-heap refcount.
- The off-heap memory is permanently leaked until `allocator.close()` detects and reports it.
5. The Fix
When numRecordsInBatch == 0, the deserialized ArrowMessage objects are explicitly closed. This calls ArrowRecordBatch.close(), which invokes release() on each sliced buffer, allowing the BufferLedger refcount to reach 0 and the off-heap memory to be freed.
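The shape of that fix can be sketched with stand-in types (Spark's real code uses `ArrowMessage` in Scala; `FakeMessage` and `handleBatch` below are invented for the example): messages from a zero-row batch are explicitly closed instead of being silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fix's shape, with hypothetical types standing in for Spark's.
class EmptyBatchFix {
    // A stand-in for ArrowMessage that records whether it was released.
    static class FakeMessage implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    static void handleBatch(int numRecordsInBatch, List<FakeMessage> messages) {
        if (numRecordsInBatch > 0) {
            // store messages in resultMap for later consumption and release (elided)
        } else {
            // the fix: release empty-batch messages so buffer refcounts reach 0
            for (FakeMessage m : messages) {
                m.close();
            }
        }
    }

    public static void main(String[] args) {
        List<FakeMessage> msgs = new ArrayList<>();
        msgs.add(new FakeMessage());
        handleBatch(0, msgs);
        System.out.println(msgs.get(0).closed); // true
    }
}
```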
After conducting research, I believe this issue has no material impact on version 18.3.0, so I personally prefer to address it with a fix in the current PR.
dongjoon-hyun left a comment:
Thank you for sharing the impressive research result, @LuciferYang .
+1, LGTM.
Feel free to merge this. Just FYI, the master branch CI is currently broken and will be fixed by the following.
Merged into master. Thanks @dongjoon-hyun and @Yicong-Huang.
… nested array with empty outer array

### What changes were proposed in this pull request?

Add tests to verify that writing triple-nested arrays (and nested arrays with maps) with an empty outer array no longer triggers a SIGSEGV.

### Why are the changes needed?

SPARK-55056 reported a segmentation fault when deserializing triple-nested arrays with an empty outer array via Arrow IPC. The root cause was in arrow-java: `ListVector.getBufferSizeFor(0)` returned 0, causing the offset buffer to be omitted for empty vectors, which violates the Arrow spec (the offset buffer must have N+1 entries even when N=0). This has been fixed upstream in arrow-java 19.0.0 ([apache/arrow-java#343](apache/arrow-java#343)), which Spark adopted in SPARK-56000 (PR #54820). These tests confirm the fix works correctly without any Spark-side workaround.

### Does this PR introduce _any_ user-facing change?

No (test only).

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54880 from Yicong-Huang/SPARK-55056-test.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Remove the SPARK-51112 workaround in `_convert_arrow_table_to_pandas()` that bypassed PyArrow's `to_pandas()` for empty tables.

### Why are the changes needed?

The workaround was added because arrow-java's `ListVector.getBufferSizeFor(0)` returned 0, causing the offset buffer to be omitted for empty nested arrays in IPC serialization, which led to a segmentation fault in PyArrow. This has been fixed upstream in arrow-java 19.0.0 ([apache/arrow-java#343](apache/arrow-java#343)), which Spark adopted in SPARK-56000 (PR #54820). The workaround is no longer necessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test `test_to_pandas_for_empty_df_with_nested_array_columns` passes.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53824 from Yicong-Huang/SPARK-55059/refactor/remove-empty-table-workaround.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?

This PR aims to upgrade `arrow-java` from 18.3.0 to 19.0.0.

It also fixes a buffer leak in `SparkResult.processResponses()` that only manifests after this upgrade and has no actual impact under Arrow 18.3.0. The issue is that when a deserialized Arrow batch contains 0 rows, the `ArrowMessage` objects were silently dropped without calling `close()`, and were not stored in `resultMap` (so `SparkResultCloseable.close()` would not release them either). Under Arrow 18.3.0 this was completely harmless: empty batches produced a 0-byte IPC body, which goes through `BaseAllocator.buffer(0)` → `getEmpty()` (a singleton backed by `EmptyReferenceManager`, whose `retain()`/`release()` are no-ops and which is not tracked by the allocator), so no off-heap memory was ever allocated or leaked. However, Arrow 19.0.0 includes GH-343, which correctly serializes offset buffers for empty vectors per the Arrow spec, making the IPC body non-zero. This causes real tracked off-heap buffers to be allocated, and the missing `close()` becomes a real memory leak detectable by `allocator.close()`. Therefore this fix is included as a necessary companion change for the 19.0.0 upgrade.

Why are the changes needed?
The full release notes are as follows:
Does this PR introduce any user-facing change?
No
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No