Skip to content

Commit 0c4e1aa

Browse files
adriangbclaude
andcommitted
docs: clarify expr_arc_id scope for expression deduplication
Add documentation explaining that expr_arc_id is only valid for deduplication within a single serialized plan from a single process. Arc pointer addresses can collide across different processes or even within the same process over time, so a fresh DefaultPhysicalProtoConverter instance must be used for each plan deserialization. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 0f76e5b commit 0c4e1aa

3 files changed

Lines changed: 31 additions & 0 deletions

File tree

datafusion/proto/proto/datafusion.proto

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -842,6 +842,14 @@ message PhysicalExprNode {
842842
// When serializing, this is set to Arc::as_ptr(expr) as u64.
843843
// When deserializing, if this ID has been seen before, the cached Arc is returned
844844
// instead of creating a new one, enabling expression sharing.
845+
//
846+
// IMPORTANT: This field is ONLY valid as a deduplication key within a single
847+
// serialized plan from a single process. It MUST NOT be used to deduplicate
848+
// across different plans or across different nodes/processes, as Arc pointer
849+
// addresses can collide (the same address may be reused for different allocations
850+
// in different processes, or even within the same process over time).
851+
//
852+
// The deserializer should use a fresh dedup cache for each plan it deserializes.
845853
optional uint64 expr_arc_id = 30;
846854

847855
oneof ExprType {

datafusion/proto/src/generated/prost.rs

Lines changed: 8 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

datafusion/proto/src/physical_plan/mod.rs

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3673,10 +3673,25 @@ struct DataEncoderTuple {
36733673
/// instead of creating a new one. This enables expression sharing and can
36743674
/// significantly reduce memory usage for plans with duplicate expressions
36753675
/// (e.g., large IN lists).
3676+
///
3677+
/// # Important: Scope of Deduplication
3678+
///
3679+
/// The `expr_arc_id` is only valid as a deduplication key **within a single
3680+
/// serialized plan from a single process**. Arc pointer addresses can collide:
3681+
/// - Different processes may allocate Arcs at the same address
3682+
/// - The same process may reuse addresses after deallocation
3683+
///
3684+
/// Therefore, you **must create a fresh `DefaultPhysicalProtoConverter` instance
3685+
/// for each plan you deserialize**. Do not reuse the same converter instance
3686+
/// across multiple plans from different sources, as this could incorrectly
3687+
/// deduplicate unrelated expressions that happen to share the same pointer address.
36763688
#[derive(Default)]
36773689
pub struct DefaultPhysicalProtoConverter {
36783690
/// Cache for expression deduplication during deserialization.
36793691
/// Maps expr_arc_id (the original Arc pointer address) to the deserialized expression.
3692+
///
3693+
/// This cache should only be used for a single plan deserialization.
3694+
/// Create a new converter instance for each plan to avoid cross-plan collisions.
36803695
dedup_cache: RwLock<HashMap<u64, Arc<dyn PhysicalExpr>>>,
36813696
}
36823697

0 commit comments

Comments
 (0)