gdal CLI: surface update intent in JSON usage#14322
gdal CLI: surface update intent in JSON usage#14322brownag wants to merge 8 commits intoOSGeo:masterfrom
Conversation
| /** Declare that dataset is opened for update if any of these arguments are used. | ||
| */ | ||
| GDALAlgorithmArgDecl & | ||
| SetOpenForUpdateIfAnyOf(const std::vector<std::string> &argNames) |
There was a problem hiding this comment.
I'm not a big fan of adding those new methods in the API and requiring algorithms to use them, which can be easy to forget when new algorithms will be written. I feel like they shouln't be necessary since GDALAlgorithm::ProcessDatasetArg() knows if it must open in update mode or not. Perhaps part of the logic of ProcessDatasetArg() should be moved to an auxiliary method that can be used by both ProcessDataseArg() and GetUsageAsJson().
I might be wrong, but I would like to see some investigation along those lines to be done first
There was a problem hiding this comment.
thanks @rouault, I definitely agree that it would be best to not have to repeat this logic both in process arg and in the algorithm declaration.
It was a conscious decision on my part to implement this first draft of the PR without trying to abstract out or modify the existing runtime logic. I recognize the issues with expanding the API surface and additional cognitive overhead for developers, but I would argue that adding these types of conditional update features should not be very common. GDAL_OF_UPDATE signals intent very well, but quite a few of these other tools have complex logic. There is value in being explicit in the algorithm declaration.
All that said, I've been tinkering with some alternative options, so far I have not come up with something I am happy with that seems simpler. But I suspect I can improve on what I have so far. I will need a bit more time. If you can confirm the first 3 commits on this PR are adequate that will help me know I can safely more forward with an alternate implementation that respects the same schema.
|
/me giving up on all AI assisted contributions |
If you would rather just close this PR then that is fine by me. I spent a few hours of my Saturday considering ways to refactor in response to your prior comment. I have had a busy weekend though, and did not want to respond before I had fully considered your suggestion. Please feel free to close. I disclosed AI use as is required, but I don't think it is a fair characterization to say I did not give any thought to this work, or that I primarily relied on AI to construct this PR. |
|
Is overwrite handled for datasets? In this PR, "vector rasterize" gets .SetOpenForUpdateIfAnyOf({"update", "add"});But if |
But I agree that overwrite is potentially mutating the output. On some level this is semantics of what an "update" is. There are cases where whole algorithms are GDAL_OF_UPDATE, and some cases where GDAL_OF_UPDATE is triggered at runtime, but overwrite is neither of those. |
I was thinking of it in terms of the additional context given in #14290
I would consider
I don't understand "overwrite is neither of those"? A file that is overwritten is still opened with update access (at a least a file with same name is, and the previous file no longer existing) and it's triggered at runtime. |
|
My goal with #14290 was to identify in the JSON usage which algorithms open datasets for in-place update by default, and also which algorithm arguments modify the default behavior declared for the algorithm. This is metadata that bindings can use to reason about determinism/reproducibility without them having to maintain custom lists and rules. At JSON usage emission time, we can only track the declarations, not the runtime conditions.
I agree in the general sense of the file system, but I don't think that overwrite should be included in the proposed "open_for_update" "if_any_of" metadata. I think the distinction is not about whether the overwritten file gets changed, clearly it does change (at least the timestamp). What makes an overwrite operation different from update is that overwrite does not depend on what was there before.
--overwrite-layer opens a dataset with GDAL_OF_UPDATE, and that is semantically different from --overwrite.
Sure but when rendering the JSON usage we can't know how a user is going to run the algorithm. We can know which algorithms or arguments will trigger opening a dataset for update. Also, we can separately know that overwrite will always create a new dataset. Perhaps we do not need the JSON usage flags to map to the code paths that attach or disable GDAL_OF_UPDATE... but that is what I proposed and attempted to do here. |
|
Thanks. That makes sense to me.
I wasn't suggesting these flags aren't needed. I asked about the implications of overwrite wrt knowing whether a file is mutated. The PR (in its current form at least) adds incremental complexity and another item for algorithm developers to track and implement consistently. The benefit for determinism/reproducibility should be clear. You articulated well what it adds without having runtime information on whether an existing file is actually modified by overwrite. I see value in that but don't have strong opinion on cost/benefit. Maintainer perspective obviously carries a lot of weight on it, and there are small number of algorithm developers. |
Agreed. I think what @rouault suggested about abstracting out the logic from As I have looked around I have noticed inconsistencies in algorithm definitions, so I realize that the additional cognitive overhead of adding SetOpenForUpdateIfAnyOf/UnlessAnyOf means that, as implemented, it is likely to be omitted unintentionally as the codebase develops. Though as implemented in this PR omission is not a huge "risk" as only the JSON usage output is affected, not the actual runtime behavior. I probably should have couched this PR a bit better: I want to get feedback on this relatively low-impact implementation of the idea (i.e. is the proposed JSON schema right, are we capturing everything we need to, are there more algorithms with edge cases I am missing). Your point about |
I didn't say or imply that. More that I've had an indigestion of AI contributions over various projects recently that makes me paranoid in general. The submitter knows how much personal thoughts they have put. On my side, I'm in the dark. AI also increases the volume of contributions : maintainers do not scale up. |
What does this PR do?
Extends the GDAL algorithm JSON usage schema to include
"open_for_update"metadata which describes when datasets are opened in update mode. The schema, as suggested by @rouault in #14290, includes:"by_default" (boolean set flag),"if_any_of"(update conditional on specific arguments), and"unless_any_of"(update suppressed by specific arguments). This set of metadata items handles the diversity of tools with varying default update behavior that can be modified by additional arguments.Adds builder methods (SetOpenForUpdateIfAnyOf()/UnlessAnyOf()) for the algorithm declaration API. Updates
GetUsageAsJSON()to evaluate each algorithm and serialize necessary usage metadata for update intent.These new conditions apply to several CLI tools, a couple of which were specifically discussed in the original issue, and others were identified while drafting this PR. Includes: raster/vector update (previously missing
GDAL_OF_UPDATE), raster edit, raster overview (add/delete/refresh), raster clean collar, vector concat (along with all vector algorithms built on the pipeline framework), and vector rasterize.New tests validate the serialization of
"by_default"and the conditional paths.EDIT: I want to get feedback on this draft of the idea (i.e. is the proposed JSON schema right, are we capturing everything we need to, are there more algorithms with edge cases I am missing).
What are related issues/pull requests?
#14290
AI tool usage
Tasklist
Environment