[SPARK-55276][DOCS] Document how SDP datasets are stored and refreshed #55277
Open
moomindani wants to merge 1 commit into apache:master from
Add a new section to the Spark Declarative Pipelines programming guide that explains the storage and refresh mechanics, including:
- Default table format and how to specify a different format
- How materialized views are refreshed (full recomputation via TRUNCATE + append)
- How streaming tables are refreshed (incremental processing with checkpoints)
- Full refresh behavior for both dataset types
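The two refresh strategies in the list above can be modeled with a small, hypothetical pure-Python sketch. The function and variable names here are illustrative only, not SDP APIs: a materialized view is rebuilt from scratch (truncate, then append the full recomputation), while a streaming table consumes only the input past its checkpoint offset.

```python
# Hypothetical in-memory model of the two refresh strategies.
def refresh_materialized_view(table, compute_all):
    table.clear()                 # TRUNCATE: drop all existing rows
    table.extend(compute_all())   # append the full recomputation

def refresh_streaming_table(table, source, checkpoint):
    new_rows = source[checkpoint["offset"]:]  # only rows not yet processed
    table.extend(new_rows)                    # incremental append
    checkpoint["offset"] = len(source)        # advance the checkpoint

# Materialized view: every refresh recomputes the whole result.
mv = [1, 2]
refresh_materialized_view(mv, lambda: [10, 20, 30])
print(mv)  # [10, 20, 30]

# Streaming table: each refresh picks up where the checkpoint left off.
st, src, ckpt = [], [1, 2, 3], {"offset": 0}
refresh_streaming_table(st, src, ckpt)
src += [4, 5]
refresh_streaming_table(st, src, ckpt)
print(st)  # [1, 2, 3, 4, 5]
```

The design consequence this models: a materialized view's refresh cost scales with the full dataset, while a streaming table's refresh cost scales only with the new input since the last checkpoint.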
Closes #55276.
What changes were proposed in this pull request?
Add a new "How Datasets are Stored and Refreshed" section to the Spark Declarative Pipelines programming guide. This section covers:
- The default table format (`parquet` via `spark.sql.sources.default`) and how to specify a different format, with Python and SQL examples
- How materialized views are refreshed (full recomputation via TRUNCATE + append)
- How streaming tables are refreshed (incremental processing with checkpoints)
- Full refresh behavior for both dataset types

Why are the changes needed?
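The default-format behavior described in the first bullet can be sketched as a small, hypothetical resolution function. The function name and structure are illustrative, not SDP internals; the only facts assumed from the source are that the session config key is `spark.sql.sources.default`, its default is `parquet`, and an explicitly specified format takes precedence.

```python
# Hypothetical sketch of table-format resolution: an explicit format on the
# dataset wins; otherwise fall back to the session default config
# (spark.sql.sources.default, which defaults to "parquet").
def resolve_table_format(explicit_format=None, session_conf=None):
    if explicit_format:
        return explicit_format
    conf = session_conf or {}
    return conf.get("spark.sql.sources.default", "parquet")

print(resolve_table_format())                     # parquet
print(resolve_table_format(explicit_format="orc"))  # orc
print(resolve_table_format(
    session_conf={"spark.sql.sources.default": "delta"}))  # delta
```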
The current programming guide explains how to define datasets but does not explain how they are stored or refreshed. Users need to understand:
- How their tables are stored and in what format
- How materialized views and streaming tables are refreshed
- What `--full-refresh` actually does for each dataset type

Without this information, users cannot make informed decisions about table formats, storage configurations, or pipeline performance.
Does this PR introduce any user-facing change?
No. Documentation only.
How was this patch tested?
Documentation change only. Verified the content is accurate by reading the SDP implementation (`DatasetManager.scala`, `FlowExecution.scala`).

Was this patch authored or co-authored using generative AI tooling?
Yes.