Add Iceberg sink plugin #6734
Open
lawofcycles wants to merge 11 commits into opensearch-project:main from
Conversation
Signed-off-by: Sotaro Hikita <bering1814@gmail.com>
Opening this as a draft. The implementation is functional and tested, but I want to carefully review the design decisions, processing flow, and edge cases before requesting reviews. Will mark as ready once that is complete.
EventHandles are collected during output() and associated with WriteResultPartition IDs on flush. A polling thread checks the coordination store for completed partitions and releases the corresponding EventHandles, ensuring events are acknowledged only after the Iceberg commit succeeds. Also moves the coordinator null check to the top of the IcebergSinkService constructor. Signed-off-by: Sotaro Hikita <bering1814@gmail.com>
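The acknowledgment flow this commit describes can be sketched roughly as below. `AckPoller` and its `EventHandle` interface are simplified stand-ins, not the actual Data Prepper types, and a plain `Set` of partition IDs stands in for what the coordination store reports as committed:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: handles are grouped by WriteResultPartition ID on flush, and a
// polling pass releases only the handles whose partition has been committed.
public class AckPoller {
    // Stand-in for Data Prepper's EventHandle; only release() matters here.
    public interface EventHandle { void release(boolean success); }

    private final Map<String, List<EventHandle>> pendingByPartition = new HashMap<>();

    // Called on flush: associate the buffered handles with the partition ID.
    public void registerPartition(String partitionId, List<EventHandle> handles) {
        pendingByPartition.put(partitionId, new ArrayList<>(handles));
    }

    // One polling cycle: release handles for committed partitions only,
    // so events are acknowledged strictly after the Iceberg commit succeeds.
    // Returns the number of handles released in this cycle.
    public int pollOnce(Set<String> committedPartitionIds) {
        int released = 0;
        Iterator<Map.Entry<String, List<EventHandle>>> it =
                pendingByPartition.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, List<EventHandle>> entry = it.next();
            if (committedPartitionIds.contains(entry.getKey())) {
                for (EventHandle handle : entry.getValue()) {
                    handle.release(true);
                    released++;
                }
                it.remove();
            }
        }
        return released;
    }

    public int pendingCount() { return pendingByPartition.size(); }
}
```

In the real plugin the polling thread would query the coordination store for completed `WriteResultPartition`s; here that query result is passed in directly to keep the sketch self-contained.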
Check at startup for static routing, and at TaskWriter creation for dynamic routing. Throws IllegalArgumentException with the unknown column name instead of NullPointerException. Signed-off-by: Sotaro Hikita <bering1814@gmail.com>
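A minimal sketch of the fail-fast check this commit describes; `RoutingColumnValidator` and its parameters are hypothetical stand-ins for the plugin's actual validation code:

```java
import java.util.Set;

// Sketch: reject an unknown routing column with an IllegalArgumentException
// naming the column, instead of letting a later lookup surface as an NPE.
public class RoutingColumnValidator {
    // tableColumns stands in for the Iceberg table schema's column names.
    public static void validateRoutingColumn(String routingColumn, Set<String> tableColumns) {
        if (!tableColumns.contains(routingColumn)) {
            throw new IllegalArgumentException(
                    "Unknown routing column: " + routingColumn
                    + "; table columns are " + tableColumns);
        }
    }
}
```

Per the commit, this check would run at startup for static routing and at `TaskWriter` creation for dynamic routing.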
Make TableContext immutable and replace it atomically in the ConcurrentHashMap when schema changes. Each thread detects stale writers by comparing schema IDs and recreates them. Also replace fully qualified class names with imports in IcebergSinkIT. Signed-off-by: Sotaro Hikita <bering1814@gmail.com>
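The immutable-context pattern this commit describes might look roughly like the following; `TableContextDemo` and its fields are illustrative only, not the plugin's real classes:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: schema changes swap in a whole new immutable context, and each
// writer thread detects a stale writer by comparing schema IDs before writing.
public class TableContextDemo {
    // Immutable: a schema change produces a new instance rather than mutation.
    static final class TableContext {
        final String tableName;
        final int schemaId;
        TableContext(String tableName, int schemaId) {
            this.tableName = tableName;
            this.schemaId = schemaId;
        }
    }

    private final ConcurrentHashMap<String, TableContext> contexts = new ConcurrentHashMap<>();

    public void evolveSchema(String tableName, int newSchemaId) {
        // Atomic replacement: concurrent readers see either the old or the
        // new context, never a partially updated one.
        contexts.compute(tableName, (name, old) -> new TableContext(name, newSchemaId));
    }

    // Per-thread check: a writer created against an older schema ID is stale
    // and must be recreated against the current context before writing.
    public boolean isWriterStale(String tableName, int writerSchemaId) {
        TableContext current = contexts.get(tableName);
        return current != null && current.schemaId != writerSchemaId;
    }
}
```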
When operation is configured (CDC mode) but neither identifier_columns nor the table's identifier-field-ids are set, fail at startup with a clear error message instead of silently producing broken equality deletes. Signed-off-by: Sotaro Hikita <bering1814@gmail.com>
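A sketch of the startup validation this commit describes, with hypothetical names; the real check would read the sink config and the Iceberg table metadata:

```java
import java.util.List;

// Sketch: when CDC mode is enabled, equality deletes need key columns from
// either the sink's identifier_columns setting or the table's
// identifier-field-ids; fail loudly at startup if neither is present.
public class CdcConfigValidator {
    public static void validate(boolean cdcModeEnabled,
                                List<String> configuredIdentifierColumns,
                                List<String> tableIdentifierFields) {
        if (!cdcModeEnabled) {
            return;
        }
        boolean hasConfigured = configuredIdentifierColumns != null
                && !configuredIdentifierColumns.isEmpty();
        boolean hasTableFields = tableIdentifierFields != null
                && !tableIdentifierFields.isEmpty();
        if (!hasConfigured && !hasTableFields) {
            throw new IllegalStateException(
                    "CDC mode requires identifier_columns in the sink config "
                    + "or identifier-field-ids on the table; equality deletes "
                    + "cannot be produced without key columns.");
        }
    }
}
```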
Schema evolution bug: after evolveSchema(), the local writerManager and ctx variables in output() were stale. Data written to the old writer was never flushed, and the old RecordConverter produced records with the wrong schema (ArrayIndexOutOfBoundsException). Re-read both variables from their respective maps after evolveSchema().

Graceful shutdown: previously, shutdown() flushed all writers and registered WriteResultPartitions but immediately killed the CommitScheduler, so the final data was never committed to Iceberg. Now CommitScheduler supports a shutdownRequested flag. On shutdown, the executor sends an interrupt to wake the scheduler from its sleep, and the scheduler executes one final commit cycle before exiting. This ensures data buffered at shutdown time is committed rather than relying on source re-delivery.

Also update the schemaEvolution_concurrentThreads integration test to verify data via shutdown() (which now guarantees flush and commit) instead of polling with Awaitility. Add debug-level logging for EventHandle release to aid E2E verification.

Signed-off-by: Sotaro Hikita <bering1814@gmail.com>
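The graceful-shutdown pattern described in this commit (a volatile flag plus an interrupt to wake the sleeping scheduler, followed by one final commit cycle) can be sketched as below. This is a simplified stand-in for the actual CommitScheduler, with a counter in place of the real commit logic:

```java
// Sketch: the scheduler loop sleeps between commit cycles; shutdown() sets a
// volatile flag and interrupts the sleep, and the loop always runs one final
// commit cycle before exiting, so data buffered at shutdown is still committed.
public class CommitScheduler implements Runnable {
    private volatile boolean shutdownRequested = false;
    private volatile Thread schedulerThread;
    private int commitCycles = 0;

    public Thread start() {
        Thread t = new Thread(this, "commit-scheduler");
        schedulerThread = t;
        t.start();
        return t;
    }

    @Override
    public void run() {
        while (true) {
            if (!shutdownRequested) {
                try {
                    Thread.sleep(60_000);  // wait until the next scheduled commit
                } catch (InterruptedException e) {
                    // shutdown() interrupts the sleep to wake us early;
                    // fall through to the (final) commit cycle below.
                }
            }
            commitCycle();
            if (shutdownRequested) {
                return;  // final cycle done; safe to exit
            }
        }
    }

    public void shutdown() {
        shutdownRequested = true;
        Thread t = schedulerThread;
        if (t != null) {
            t.interrupt();   // wake the scheduler from its sleep
            try {
                t.join();    // wait for the final commit cycle to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Stand-in for collecting WriteResultPartitions and committing to Iceberg.
    private synchronized void commitCycle() { commitCycles++; }

    public synchronized int commitCycleCount() { return commitCycles; }
}
```

The key property is that setting the flag before interrupting guarantees the loop observes the shutdown request after its last commit cycle, so `join()` returns only once that final cycle has completed.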
This PR is ready for review. It adds an `iceberg` sink plugin; the full design is documented in #6664. I would really appreciate it if you could review it.
Description
Adds a new `iceberg` sink plugin that writes Data Prepper events into Apache Iceberg tables. Supports append-only writes and row-level deletes (equality deletes) for handling INSERT, UPDATE, and DELETE operations from CDC sources. Commit coordination across multiple nodes uses the existing `EnhancedSourceCoordinator` infrastructure. Marked as `@Experimental`. See #6664 for the full design.
Framework change: adds an `EnhancedSinkCoordination` interface in `data-prepper-api` and 8 lines to `Pipeline.execute()` to inject a coordination store into Sink plugins.

Example: Append only (REST Catalog)
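The example configurations themselves were not captured in this page. As a rough illustration only, an append-only pipeline might be shaped like the sketch below; every option name under `iceberg` here is an assumption for illustration, not the plugin's actual configuration schema:

```yaml
# Hypothetical sketch only: option names under "iceberg" are assumptions,
# not the plugin's actual configuration schema.
append-only-pipeline:
  source:
    http:                        # any source producing events
  sink:
    - iceberg:
        catalog:
          type: rest
          uri: http://localhost:8181
          warehouse: s3://example-warehouse/
        table: db.events
```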
Example: Row level deletes (REST Catalog)
Example: S3 Tables
Testing
In addition to the unit and integration tests included in this PR, the following E2E verification was performed.
E2E tests (Data Prepper process with DynamoDB coordination store, SeaweedFS + REST Catalog): 26 scenarios, all passing.
- `${/table}` expression routing events to different tables in one batch
- E2E verified on Amazon S3 Tables (us-east-1): append only and CDC both confirmed working
Issues Resolved
Resolves #6664
Check List