
[SPARK-56000][BUILD] Upgrade arrow-java to 19.0.0 #54820

Closed
LuciferYang wants to merge 3 commits into apache:master from LuciferYang:SPARK-56000

Conversation

Contributor

@LuciferYang LuciferYang commented Mar 16, 2026

What changes were proposed in this pull request?

This PR aims to upgrade arrow-java from 18.3.0 to 19.0.0.

It also fixes a buffer leak in SparkResult.processResponses() that only manifests after this upgrade and has no actual impact under Arrow 18.3.0. The issue is that when a deserialized Arrow batch contains 0 rows, the ArrowMessage objects are silently dropped without calling close(), and are not stored in resultMap (so SparkResultCloseable.close() does not release them either). Under Arrow 18.3.0 this was completely harmless: empty batches produced a 0-byte IPC body, which goes through BaseAllocator.buffer(0) → getEmpty() (a singleton backed by EmptyReferenceManager whose retain()/release() are no-ops and not tracked by the allocator), so no off-heap memory was ever allocated or leaked. However, Arrow 19.0.0 includes GH-343, which correctly serializes offset buffers for empty vectors per the Arrow spec, making the IPC body non-zero. This causes real tracked off-heap buffers to be allocated, and the missing close() becomes a real memory leak detectable by allocator.close(). The fix is therefore included as a necessary companion change for the 19.0.0 upgrade.

Why are the changes needed?

The full release notes are as follows:

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

No

@LuciferYang LuciferYang marked this pull request as draft March 16, 2026 11:07
@LuciferYang
Contributor Author

There are test failures that require investigation.

@Yicong-Huang
Contributor

Thanks @LuciferYang for upgrading this. The PySpark side has a few efforts waiting for fixes in arrow-java 19.0.0. Do we think this can be backported to the 4.0 and 4.1 branches?

cc @HyukjinKwon @dbtsai as well

@dongjoon-hyun
Member

The PySpark side has a few efforts waiting for fixes in arrow-java 19.0.0. Do we think this can be backported to the 4.0 and 4.1 branches?

cc @HyukjinKwon @dbtsai as well

No, @Yicong-Huang. The Apache Spark community doesn't allow that kind of backporting. Instead, you need to ask the Apache Arrow community to deliver maintenance releases. For example, for your request:

  • arrow-java 18.3.1 for branch-4.1
  • arrow-java 18.1.x for branch-4.0

We may consume those bug-fixed maintenance releases.

@Yicong-Huang
Contributor

Got it. Thanks @dongjoon-hyun for the explanation. I will talk to the arrow-java community about corresponding maintenance releases.

@dongjoon-hyun
Member

Thank you, @Yicong-Huang .

    throw new IllegalStateException(
      s"Expected $expectedNumRows rows in arrow batch but got $numRecordsInBatch.")
  }
  val messagesInBatch = messages.result()
Contributor Author


After making this modification, all tests pass. However, I haven't yet examined the specific changes in 19.0.0 that necessitated it, so I can't yet confirm whether 18.3.0 is truly unaffected or whether problems there simply haven't surfaced yet. Let's hold off for a while. If this turns out to be a lingering issue, I will submit a PR to fix it first.

Personally, I'm more inclined to think that this is a lingering issue.

Contributor Author

@LuciferYang LuciferYang Mar 17, 2026


1. Problem Statement

After upgrading arrow-java from 18.3.0 to 19.0.0, Spark Connect client tests (e.g. CatalogSuite, DataFrameTableValuedFunctionsSuite) fail in afterAll() when allocator.close() is called:

java.lang.IllegalStateException: Allocator[ROOT] closed with outstanding buffers allocated (12).

The stack trace points to:

SparkResult.processResponses()
  → MessageIterator.next()
    → MessageSerializer.deserializeRecordBatch()
      → ArrowBuf.slice()
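
For orientation, here is a minimal, purely illustrative sketch of the pattern in which the failure surfaces (the structure and names are assumptions, not the actual suite code): the suite-level allocator is closed in afterAll(), and any buffer that was deserialized during the tests but never released makes that close() throw.

// Illustrative sketch only; the real suites use the Spark Connect client's allocator.
import org.apache.arrow.memory.RootAllocator

val allocator = new RootAllocator()
try {
  // ... run Spark Connect client queries that deserialize Arrow batches with `allocator` ...
} finally {
  // Throws "Allocator[ROOT] closed with outstanding buffers allocated (N)"
  // if any deserialized buffer was never released.
  allocator.close()
}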

2. Root Cause

2.1 Pre-existing bug in SparkResult.processResponses()

In SparkResult.processResponses(), when a deserialized Arrow batch contains 0 rows (numRecordsInBatch == 0), the ArrowMessage objects are neither stored in resultMap nor closed:

// Before fix
if (numRecordsInBatch > 0) {
  numRecords += numRecordsInBatch
  resultMap.put(nextResultIndex, (reader.bytesRead, messages.result()))
  nextResultIndex += 1
  // ...
}
// When numRecordsInBatch == 0: messages.result() is silently dropped — no close()

SparkResultCloseable.close() only releases messages stored in resultMap. Empty-batch messages fall through and their underlying Arrow buffers are never released.

2.2 Arrow GH-343 made it observable

Arrow-Java GH-343 fixed offset buffer IPC serialization for empty vectors (valueCount == 0). This fix, included in v19.0.0, changed the IPC body size of empty batches from 0 bytes to a non-zero value, which turned the previously-silent Spark bug into a visible allocator failure.

The relevant commits between v18.3.0 and v19.0.0:

  • 0f8a0808f (PR #967): fix ListVector / LargeListVector offset buffer when valueCount == 0
  • 77df3ecb2 (PR #989): fix BaseVariableWidthVector / BaseLargeVariableWidthVector offset buffer when valueCount == 0

What changed: when valueCount == 0, setReaderAndWriterIndex() previously set offsetBuffer.writerIndex(0), making readableBytes() == 0 and writing 0 bytes to the IPC stream. The Arrow spec requires that offset buffers always contain at least one entry [0], so GH-343 changed this to offsetBuffer.writerIndex(OFFSET_WIDTH), making readableBytes() == 4.

Behavior of setReaderAndWriterIndex() when valueCount == 0:

  • v18.3.0: offsetBuffer.writerIndex(0) → readableBytes() = 0 → IPC body size 0 bytes
  • v19.0.0: offsetBuffer.writerIndex(OFFSET_WIDTH = 4) → readableBytes() = 4 → IPC body size > 0 bytes
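
To see the difference concretely, here is a hedged sketch (Scala against the Arrow Java API; the names are illustrative). The SPARK-55056 commit message referenced below notes that ListVector.getBufferSizeFor(0) returned 0 before this fix, so the sketch simply probes that value; under 19.0.0 it should be non-zero.

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.complex.ListVector

// Sketch: probe the buffer size Arrow reports for an empty list vector.
// Under 18.3.0 this reportedly returned 0 (offset buffer omitted from IPC);
// under 19.0.0 (GH-343) the offset buffer contributes at least one 4-byte entry.
val allocator = new RootAllocator()
val vector = ListVector.empty("values", allocator)
try {
  vector.setValueCount(0)
  println(s"buffer size for 0 values: ${vector.getBufferSizeFor(0)}")
} finally {
  vector.close()
  allocator.close()
}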

3. Detailed Causal Chain

Step 1. Server serializes empty batch (valueCount = 0)
  • v18.3.0: offset buffer writes 0 bytes → IPC body = 0 bytes
  • v19.0.0: offset buffer writes 4+ bytes → IPC body > 0 bytes

Step 2. Client calls readMessageBody(in, bodyLength, allocator)
  • v18.3.0: allocator.buffer(0) → returns the singleton getEmpty(), backed by EmptyReferenceManager (not tracked by the allocator)
  • v19.0.0: allocator.buffer(bodyLength > 0) → allocates a real ArrowBuf (tracked by the allocator)

Step 3. deserializeRecordBatch calls body.slice() per field buffer
  • v18.3.0: slices share EmptyReferenceManager; retain()/release() are no-ops
  • v19.0.0: slices share a real BufferLedger; retain() increments the refcount

Step 4. ArrowRecordBatch constructor calls retain() per slice
  • v18.3.0: no-op
  • v19.0.0: refcount increases

Step 5. body.getReferenceManager().release()
  • v18.3.0: no-op
  • v19.0.0: refcount decreases by 1, but slices still hold references

Step 6. ArrowRecordBatch.close() never called (Spark bug)
  • v18.3.0: no impact — empty buffers are untracked
  • v19.0.0: buffer leak — refcount > 0, tracked buffers remain

Step 7. allocator.close()
  • v18.3.0: succeeds — no outstanding tracked buffers
  • v19.0.0: throws IllegalStateException

Key mechanism: BaseAllocator.buffer(0) returns untracked empty buffer

// BaseAllocator.java
public ArrowBuf buffer(final long initialRequestSize, BufferManager manager) {
    if (initialRequestSize == 0) {
        return getEmpty();  // singleton, EmptyReferenceManager — not tracked
    }
    // ... allocate real buffer — tracked by allocator
}

In v18.3.0, empty-batch IPC body is 0 bytes → allocator.buffer(0) → getEmpty() (untracked). All downstream slice(), retain(), and release() calls are no-ops. The missing close() in SparkResult is harmless.

In v19.0.0, empty-batch IPC body is > 0 bytes → allocator.buffer(n) → real tracked buffer. The missing close() becomes a real off-heap memory leak.
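
To illustrate the tracking difference in isolation, here is a hedged standalone sketch against the Arrow Java allocator API (not Spark code; the sizes are arbitrary): a zero-length request is invisible to the allocator's accounting, while a non-zero request must be released before the allocator can close.

import org.apache.arrow.memory.RootAllocator

val allocator = new RootAllocator()

// buffer(0) returns the shared empty buffer; the allocator does not track it,
// so dropping it without close() has no effect on accounting.
val empty = allocator.buffer(0)
assert(allocator.getAllocatedMemory == 0)

// buffer(16) is a real, tracked allocation; leaving it unreleased would make
// allocator.close() throw IllegalStateException for outstanding buffers.
val real = allocator.buffer(16)
assert(allocator.getAllocatedMemory > 0)

real.close()      // release the tracked buffer
allocator.close() // now succeeds with no outstanding allocations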

4. Does v18.3.0 have an actual memory leak?

No. Under v18.3.0, empty batches cause no memory leak at all:

  • Off-heap memory: Zero leak. allocator.buffer(0) returns the pre-allocated singleton empty buffer. No additional off-heap memory is allocated for empty batches, so there is nothing to leak.
  • Java heap objects: The orphaned ArrowRecordBatch / ArrowBuf wrapper objects hold no strong references after processResponses() returns, and are collected by normal GC.
  • Allocator tracking: EmptyReferenceManager is a no-op singleton. The allocator never registers these buffers, so allocator.close() sees no outstanding allocations.

The bug in SparkResult is logically present in v18.3.0, but it is structurally impossible to cause any resource leak because the entire empty buffer path — from allocation through slicing to reference counting — operates on untracked no-op objects.

Under v19.0.0 without the fix, the situation is different:

  • allocator.buffer(bodyLength > 0) allocates real off-heap memory.
  • ArrowRecordBatch is never close()-d, so the BufferLedger refcount never reaches 0.
  • ArrowBuf has no finalizer or Cleaner, so GC of the Java wrapper does not decrement the off-heap refcount.
  • The off-heap memory is permanently leaked until allocator.close() detects and reports it.

5. The Fix

When numRecordsInBatch == 0, the deserialized ArrowMessage objects are explicitly closed. This calls ArrowRecordBatch.close(), which invokes release() on each sliced buffer, allowing the BufferLedger refcount to reach 0 and the off-heap memory to be freed.
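
A minimal sketch of the shape of the fix, reusing the variables from the snippet in section 2.1 (messages, resultMap, reader, nextResultIndex); the actual patch may differ in its details:

// After fix (sketch): empty batches are closed instead of silently dropped.
if (numRecordsInBatch > 0) {
  numRecords += numRecordsInBatch
  resultMap.put(nextResultIndex, (reader.bytesRead, messages.result()))
  nextResultIndex += 1
  // ...
} else {
  // 0-row batch: close the deserialized messages so their sliced Arrow
  // buffers are released and the allocator sees no outstanding buffers.
  messages.result().foreach(_.close())
}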

Contributor Author


After conducting research, I believe this issue will not have a material impact on version 18.3.0, so I personally prefer to address it with a fix in the current PR.

@LuciferYang LuciferYang marked this pull request as ready for review March 17, 2026 07:55
Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for sharing the impressive research result, @LuciferYang .

+1, LGTM.

@dongjoon-hyun
Member

Feel free to merge this.

Just FYI, the master branch CI is currently broken and will be fixed by the following.

@LuciferYang
Contributor Author

Merged into master. Thanks @dongjoon-hyun and @Yicong-Huang

HyukjinKwon pushed a commit that referenced this pull request Mar 18, 2026
… nested array with empty outer array

### What changes were proposed in this pull request?

Add tests to verify that writing triple-nested arrays (and nested arrays with maps) with an empty outer array no longer triggers a SIGSEGV.

### Why are the changes needed?

SPARK-55056 reported a segmentation fault when deserializing triple-nested arrays with an empty outer array via Arrow IPC. The root cause was in arrow-java: `ListVector.getBufferSizeFor(0)` returned 0, causing the offset buffer to be omitted for empty vectors, which violates the Arrow spec (offset buffer must have N+1 entries even when N=0).

This has been fixed upstream in arrow-java 19.0.0 ([apache/arrow-java#343](apache/arrow-java#343)), which Spark adopted in SPARK-56000 (PR #54820). These tests confirm the fix works correctly without any Spark-side workaround.

### Does this PR introduce _any_ user-facing change?

No (test only).

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54880 from Yicong-Huang/SPARK-55056-test.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Mar 19, 2026
### What changes were proposed in this pull request?
Remove the SPARK-51112 workaround in `_convert_arrow_table_to_pandas()` that bypassed PyArrow's `to_pandas()` for empty tables.

### Why are the changes needed?

The workaround was added because arrow-java's `ListVector.getBufferSizeFor(0)` returned 0, causing the offset buffer to be omitted for empty nested arrays in IPC serialization, which led to a segmentation fault in PyArrow.

This has been fixed upstream in arrow-java 19.0.0 ([apache/arrow-java#343](apache/arrow-java#343)), which Spark adopted in SPARK-56000 (PR #54820). The workaround is no longer necessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test `test_to_pandas_for_empty_df_with_nested_array_columns` passes.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53824 from Yicong-Huang/SPARK-55059/refactor/remove-empty-table-workaround.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>