[SPARK-56000][BUILD] Upgrade arrow-java to 19.0.0 #54820

LuciferYang wants to merge 3 commits into apache:master

Conversation
There are test failures that require investigation. |
Thanks @LuciferYang for upgrading this. The PySpark side has a few efforts waiting for fixes in arrow-java 19.0.0. Do we think this can be backported to the 4.0 and 4.1 branches? cc @HyukjinKwon @dbtsai as well
No, @Yicong-Huang. The Apache Spark community doesn't allow that kind of backporting. Instead, you need to ask the Apache Arrow community to deliver maintenance releases. In your case, we may then consume those bug-fix maintenance releases.
Got it. Thanks @dongjoon-hyun for the explanation. I will talk to the arrow-java community about corresponding maintenance releases.
Thank you, @Yicong-Huang.
```scala
      throw new IllegalStateException(
        s"Expected $expectedNumRows rows in arrow batch but got $numRecordsInBatch.")
    }
    val messagesInBatch = messages.result()
```
After making this modification, all tests pass. However, I haven't yet examined the specific changes in version 19.0.0 that necessitated it, so I'm temporarily unable to confirm whether version 18.3.0 is truly issue-free or whether problems simply haven't been uncovered yet. Let's hold off for a while. If this turns out to be a lingering issue, I will submit a PR to fix it first.
Personally, I'm more inclined to think that this is a lingering issue.
1. Problem Statement
After upgrading arrow-java from 18.3.0 to 19.0.0, Spark Connect client tests (e.g. CatalogSuite, DataFrameTableValuedFunctionsSuite) fail in afterAll() when allocator.close() is called:

```
java.lang.IllegalStateException: Allocator[ROOT] closed with outstanding buffers allocated (12).
```
The stack trace points to:
SparkResult.processResponses()
→ MessageIterator.next()
→ MessageSerializer.deserializeRecordBatch()
→ ArrowBuf.slice()
2. Root Cause
2.1 Pre-existing bug in SparkResult.processResponses()
In SparkResult.processResponses(), when a deserialized Arrow batch contains 0 rows (numRecordsInBatch == 0), the ArrowMessage objects are neither stored in resultMap nor closed:

```scala
// Before fix
if (numRecordsInBatch > 0) {
  numRecords += numRecordsInBatch
  resultMap.put(nextResultIndex, (reader.bytesRead, messages.result()))
  nextResultIndex += 1
  // ...
}
// When numRecordsInBatch == 0: messages.result() is silently dropped, never closed
```

SparkResultCloseable.close() only releases messages stored in resultMap. Empty-batch messages fall through, and their underlying Arrow buffers are never released.
2.2 Arrow GH-343 made it observable
Arrow-Java GH-343 fixed offset buffer IPC serialization for empty vectors (valueCount == 0). This fix, included in v19.0.0, changed the IPC body size of empty batches from 0 bytes to a non-zero value, which turned the previously-silent Spark bug into a visible allocator failure.
The relevant commits between v18.3.0 and v19.0.0:

| Commit | Scope |
|---|---|
| 0f8a0808f (PR #967) | Fix ListVector / LargeListVector offset buffer when valueCount == 0 |
| 77df3ecb2 (PR #989) | Fix BaseVariableWidthVector / BaseLargeVariableWidthVector offset buffer when valueCount == 0 |
What changed: when valueCount == 0, setReaderAndWriterIndex() previously set offsetBuffer.writerIndex(0), making readableBytes() == 0 and writing 0 bytes to the IPC stream. The Arrow spec requires that offset buffers always contain at least one entry [0], so GH-343 changed this to offsetBuffer.writerIndex(OFFSET_WIDTH), making readableBytes() == 4.
| Version | setReaderAndWriterIndex() when valueCount == 0 | IPC body size |
|---|---|---|
| v18.3.0 | offsetBuffer.writerIndex(0) → readableBytes() = 0 | 0 bytes |
| v19.0.0 | offsetBuffer.writerIndex(OFFSET_WIDTH = 4) → readableBytes() = 4 | > 0 bytes |
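The size arithmetic behind this table can be sketched in a few lines. This is a hypothetical illustration, not Arrow's actual code; the method names `specBytes` and `v18Bytes` are invented for the example.

```java
// Hypothetical sketch (not arrow-java code): the Arrow format requires an
// offset buffer with valueCount + 1 entries, so even an empty vector should
// serialize a 4-byte offset buffer; pre-GH-343, arrow-java wrote 0 bytes.
class OffsetBufferSize {
    static final int OFFSET_WIDTH = 4; // bytes per 32-bit offset entry

    // Bytes the offset buffer should contribute to the IPC body, per the spec.
    static long specBytes(int valueCount) {
        return (long) (valueCount + 1) * OFFSET_WIDTH;
    }

    // v18.3.0 behavior: writerIndex(0) when valueCount == 0, so nothing written.
    static long v18Bytes(int valueCount) {
        return valueCount == 0 ? 0 : (long) (valueCount + 1) * OFFSET_WIDTH;
    }

    public static void main(String[] args) {
        System.out.println(specBytes(0)); // 4
        System.out.println(v18Bytes(0));  // 0
    }
}
```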
3. Detailed Causal Chain
| Step | v18.3.0 | v19.0.0 |
|---|---|---|
| 1. Server serializes empty batch (valueCount=0) | Offset buffer writes 0 bytes → IPC body = 0 bytes | Offset buffer writes 4+ bytes → IPC body > 0 bytes |
| 2. Client calls readMessageBody(in, bodyLength, allocator) | allocator.buffer(0) returns the singleton getEmpty(), backed by EmptyReferenceManager (not tracked by allocator) | allocator.buffer(bodyLength > 0) allocates a real ArrowBuf (tracked by allocator) |
| 3. deserializeRecordBatch calls body.slice() per field buffer | Slices share EmptyReferenceManager; retain()/release() are no-ops | Slices share a real BufferLedger; retain() increments the refcount |
| 4. ArrowRecordBatch constructor calls retain() per slice | No-op | Refcount increases |
| 5. body.getReferenceManager().release() | No-op | Refcount decreases by 1, but slices still hold references |
| 6. ArrowRecordBatch.close() never called (Spark bug) | No impact; empty buffers are untracked | Buffer leak; refcount > 0, tracked buffers remain |
| 7. allocator.close() | Succeeds; no outstanding tracked buffers | Throws IllegalStateException |
Key mechanism: BaseAllocator.buffer(0) returns an untracked empty buffer

```java
// BaseAllocator.java
public ArrowBuf buffer(final long initialRequestSize, BufferManager manager) {
  if (initialRequestSize == 0) {
    return getEmpty(); // singleton, EmptyReferenceManager; not tracked
  }
  // ... allocate a real buffer, tracked by the allocator
}
```

In v18.3.0, the empty-batch IPC body is 0 bytes → allocator.buffer(0) → getEmpty() (untracked). All downstream slice(), retain(), and release() calls are no-ops. The missing close() in SparkResult is harmless.
In v19.0.0, empty-batch IPC body is > 0 bytes → allocator.buffer(n) → real tracked buffer. The missing close() becomes a real off-heap memory leak.
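The refcounting interplay above can be modeled without Arrow at all. The following toy allocator (every name here is invented for illustration; it is not Arrow's API) reproduces the failure mode: a tracked buffer's refcount is bumped by a slice retain(), the body release() brings it back to 1, and without a batch close() the allocator still sees an outstanding buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the causal chain: each tracked buffer has a refcount ledger;
// slices retain() it, owners release() it, and close() fails if any ledger
// is still positive. Zero-size buffers mimic v18.3.0's untracked getEmpty().
class ToyAllocator {
    private final List<AtomicInteger> ledgers = new ArrayList<>();

    AtomicInteger buffer(long size) {
        AtomicInteger ledger = new AtomicInteger(size == 0 ? 0 : 1);
        if (size > 0) ledgers.add(ledger); // only real buffers are tracked
        return ledger;
    }

    long outstanding() {
        return ledgers.stream().filter(l -> l.get() > 0).count();
    }

    void close() {
        long n = outstanding();
        if (n > 0) {
            throw new IllegalStateException(
                "Allocator[ROOT] closed with outstanding buffers allocated (" + n + ").");
        }
    }

    public static void main(String[] args) {
        ToyAllocator alloc = new ToyAllocator();
        AtomicInteger body = alloc.buffer(4);    // v19.0.0 path: tracked, refcount 1
        body.incrementAndGet();                  // slice retain() in the record batch
        body.decrementAndGet();                  // release() of the body buffer
        // Missing ArrowRecordBatch.close(): the slice's reference keeps refcount at 1.
        System.out.println(alloc.outstanding()); // 1
        body.decrementAndGet();                  // what the batch's close() would do
        alloc.close();                           // now succeeds
    }
}
```

Running the v18.3.0 path through the same model (`alloc.buffer(0)`) adds nothing to the ledger list, which is why `close()` never complained before the upgrade.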
4. Does v18.3.0 have an actual memory leak?

No. Under v18.3.0, empty batches cause no memory leak at all:

- Off-heap memory: zero leak. `allocator.buffer(0)` returns the pre-allocated singleton empty buffer. No additional off-heap memory is allocated for empty batches, so there is nothing to leak.
- Java heap objects: the orphaned `ArrowRecordBatch`/`ArrowBuf` wrapper objects hold no strong references after `processResponses()` returns, and are collected by normal GC.
- Allocator tracking: `EmptyReferenceManager` is a no-op singleton. The allocator never registers these buffers, so `allocator.close()` sees no outstanding allocations.

The bug in SparkResult is logically present in v18.3.0, but it is structurally impossible for it to cause any resource leak, because the entire empty-buffer path (from allocation through slicing to reference counting) operates on untracked no-op objects.

Under v19.0.0 without the fix, the situation is different:

- `allocator.buffer(bodyLength > 0)` allocates real off-heap memory.
- `ArrowRecordBatch` is never `close()`-d, so the `BufferLedger` refcount never reaches 0.
- `ArrowBuf` has no finalizer or `Cleaner`, so GC of the Java wrapper does not decrement the off-heap refcount.
- The off-heap memory is permanently leaked until `allocator.close()` detects and reports it.
5. The Fix
When numRecordsInBatch == 0, the deserialized ArrowMessage objects are explicitly closed. This calls ArrowRecordBatch.close(), which invokes release() on each sliced buffer, allowing the BufferLedger refcount to reach 0 and the off-heap memory to be freed.
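The shape of that fix can be sketched with stand-in types (Spark's real code uses `ArrowMessage` in Scala; `FakeMessage` and `handleBatch` below are invented for the example): messages from a zero-row batch are explicitly closed instead of being silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fix's shape, with hypothetical types standing in for Spark's.
class EmptyBatchFix {
    // A stand-in for ArrowMessage that records whether it was released.
    static class FakeMessage implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    static void handleBatch(int numRecordsInBatch, List<FakeMessage> messages) {
        if (numRecordsInBatch > 0) {
            // store messages in resultMap for later consumption and release (elided)
        } else {
            // the fix: release empty-batch messages so buffer refcounts reach 0
            for (FakeMessage m : messages) {
                m.close();
            }
        }
    }

    public static void main(String[] args) {
        List<FakeMessage> msgs = new ArrayList<>();
        msgs.add(new FakeMessage());
        handleBatch(0, msgs);
        System.out.println(msgs.get(0).closed); // true
    }
}
```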
After conducting research, I believe this issue has no material impact on version 18.3.0, so I personally prefer to address it with a fix in the current PR.
dongjoon-hyun left a comment:
Thank you for sharing the impressive research result, @LuciferYang .
+1, LGTM.
Feel free to merge this. Just FYI, the master branch CI is currently broken and will be fixed by the following.
Merged into master. Thanks @dongjoon-hyun and @Yicong-Huang.
… nested array with empty outer array

### What changes were proposed in this pull request?

Add tests to verify that writing triple-nested arrays (and nested arrays with maps) with an empty outer array no longer triggers a SIGSEGV.

### Why are the changes needed?

SPARK-55056 reported a segmentation fault when deserializing triple-nested arrays with an empty outer array via Arrow IPC. The root cause was in arrow-java: `ListVector.getBufferSizeFor(0)` returned 0, causing the offset buffer to be omitted for empty vectors, which violates the Arrow spec (the offset buffer must have N+1 entries even when N=0). This has been fixed upstream in arrow-java 19.0.0 ([apache/arrow-java#343](apache/arrow-java#343)), which Spark adopted in SPARK-56000 (PR #54820). These tests confirm the fix works correctly without any Spark-side workaround.

### Does this PR introduce _any_ user-facing change?

No (test only).

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54880 from Yicong-Huang/SPARK-55056-test.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Remove the SPARK-51112 workaround in `_convert_arrow_table_to_pandas()` that bypassed PyArrow's `to_pandas()` for empty tables.

### Why are the changes needed?

The workaround was added because arrow-java's `ListVector.getBufferSizeFor(0)` returned 0, causing the offset buffer to be omitted for empty nested arrays in IPC serialization, which led to a segmentation fault in PyArrow. This has been fixed upstream in arrow-java 19.0.0 ([apache/arrow-java#343](apache/arrow-java#343)), which Spark adopted in SPARK-56000 (PR #54820). The workaround is no longer necessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test `test_to_pandas_for_empty_df_with_nested_array_columns` passes.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53824 from Yicong-Huang/SPARK-55059/refactor/remove-empty-table-workaround.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?

This PR aims to upgrade `arrow-java` from 18.3.0 to 19.0.0.

It also fixes a buffer leak in `SparkResult.processResponses()` that only manifests after this upgrade and has no actual impact under Arrow 18.3.0. The issue is that when a deserialized Arrow batch contains 0 rows, the `ArrowMessage` objects were silently dropped without calling `close()`, and were not stored in `resultMap` (so `SparkResultCloseable.close()` would not release them either). Under Arrow 18.3.0 this was completely harmless: empty batches produced a 0-byte IPC body, which goes through `BaseAllocator.buffer(0)` → `getEmpty()` (a singleton backed by `EmptyReferenceManager`, whose `retain()`/`release()` are no-ops and which is not tracked by the allocator), so no off-heap memory was ever allocated or leaked. However, Arrow 19.0.0 includes GH-343, which correctly serializes offset buffers for empty vectors per the Arrow spec, making the IPC body non-zero. This causes real tracked off-heap buffers to be allocated, and the missing `close()` becomes a real memory leak detectable by `allocator.close()`. Therefore this fix is included as a necessary companion change for the 19.0.0 upgrade.

Why are the changes needed?
The full release notes are as follows:
Does this PR introduce any user-facing change?
No
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No