gdal CLI: surface update intent in JSON usage#14322

Draft
brownag wants to merge 8 commits into OSGeo:master from brownag:of-update-json1

brownag commented Apr 10, 2026

What does this PR do?

Extends the GDAL algorithm JSON usage schema to include "open_for_update" metadata, which describes when datasets are opened in update mode. The schema, as suggested by @rouault in #14290, includes: "by_default" (a boolean flag), "if_any_of" (update is conditional on specific arguments), and "unless_any_of" (update is suppressed by specific arguments). Together, these metadata items cover the diversity of tools whose default update behavior varies and can be modified by additional arguments.
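
To make the three keys concrete, here is a minimal sketch of how a binding might evaluate them once it has parsed the JSON usage. Only the key names "by_default", "if_any_of" and "unless_any_of" come from this PR; the `OpenForUpdate` struct, the `WouldOpenForUpdate()` function, the placeholder argument names, and in particular the rule that "unless_any_of" wins over "if_any_of" are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory form of the "open_for_update" metadata.
// Only the three JSON key names come from the PR description.
struct OpenForUpdate
{
    bool byDefault = false;                // "by_default"
    std::vector<std::string> ifAnyOf;      // "if_any_of"
    std::vector<std::string> unlessAnyOf;  // "unless_any_of"
};

// True if at least one name in `names` also appears in `usedArgs`.
static bool AnyUsed(const std::vector<std::string> &names,
                    const std::vector<std::string> &usedArgs)
{
    return std::any_of(names.begin(), names.end(),
                       [&usedArgs](const std::string &name)
                       {
                           return std::find(usedArgs.begin(), usedArgs.end(),
                                            name) != usedArgs.end();
                       });
}

// Assumed evaluation order (not stated in the PR): start from the default,
// let "if_any_of" switch update on, then let "unless_any_of" switch it off.
bool WouldOpenForUpdate(const OpenForUpdate &meta,
                        const std::vector<std::string> &usedArgs)
{
    bool update = meta.byDefault;
    if (AnyUsed(meta.ifAnyOf, usedArgs))
        update = true;
    if (AnyUsed(meta.unlessAnyOf, usedArgs))
        update = false;
    return update;
}

// Usage sketch with placeholder argument names.
void DemoOpenForUpdate()
{
    OpenForUpdate meta;
    meta.ifAnyOf = {"update", "add"};
    assert(!WouldOpenForUpdate(meta, {}));      // nothing triggers update
    assert(WouldOpenForUpdate(meta, {"add"}));  // a listed argument is used
}
```

A binding that keeps the metadata in this form would not need its own list of per-algorithm rules, which is the stated motivation in #14290.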

Adds builder methods (SetOpenForUpdateIfAnyOf()/UnlessAnyOf()) for the algorithm declaration API. Updates GetUsageAsJSON() to evaluate each algorithm and serialize necessary usage metadata for update intent.

These new conditions apply to several CLI tools. A couple of them were specifically discussed in the original issue; the others were identified while drafting this PR. They include: raster/vector update (previously missing GDAL_OF_UPDATE), raster edit, raster overview (add/delete/refresh), raster clean collar, vector concat (along with all vector algorithms built on the pipeline framework), and vector rasterize.

New tests validate the serialization of "by_default" and the conditional paths.

EDIT: I want to get feedback on this draft of the idea (i.e. is the proposed JSON schema right, are we capturing everything we need to, are there more algorithms with edge cases I am missing).

What are related issues/pull requests?

#14290

AI tool usage

  • AI (Gemini, Claude) supported my development of this PR. See our policy about AI tool use. Use of AI tools must be indicated.

Tasklist

  • Make sure code is correctly formatted (cf pre-commit configuration)
  • Add test case(s)
  • Add documentation
  • Updated Python API documentation (swig/include/python/docs/)
  • Review
  • Adjust for comments
  • All CI builds and checks have passed

Environment

  • OS: Ubuntu 24.04 LTS
  • Compiler: GCC 13.3.0

Comment thread gcore/gdalalgorithm_cpp.h
/** Declare that dataset is opened for update if any of these arguments are used.
*/
GDALAlgorithmArgDecl &
SetOpenForUpdateIfAnyOf(const std::vector<std::string> &argNames)
Member

I'm not a big fan of adding those new methods in the API and requiring algorithms to use them, which can be easy to forget when new algorithms are written. I feel like they shouldn't be necessary since GDALAlgorithm::ProcessDatasetArg() knows if it must open in update mode or not. Perhaps part of the logic of ProcessDatasetArg() should be moved to an auxiliary method that can be used by both ProcessDatasetArg() and GetUsageAsJSON().
I might be wrong, but I would like to see some investigation along those lines done first.
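
One way to read this suggestion is sketched below with stand-in types rather than the real GDALAlgorithm classes: a single description of the update conditions is consulted both by the runtime path and by the usage serializer, so the two cannot drift apart. `UpdateRule`, `ShouldOpenForUpdate()` and `UpdateRuleAsJson()` are hypothetical names, the flat JSON layout is an assumption, and the real logic in ProcessDatasetArg() is considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical single source of truth for one dataset argument.
struct UpdateRule
{
    bool byDefault = false;
    std::vector<std::string> ifAnyOf;
    std::vector<std::string> unlessAnyOf;
};

// Runtime side: what a ProcessDatasetArg()-like routine could ask.
// Assumption: a matching "unless" argument overrides a matching "if" one.
bool ShouldOpenForUpdate(const UpdateRule &rule,
                         const std::set<std::string> &setArgs)
{
    bool update = rule.byDefault;
    for (const auto &name : rule.ifAnyOf)
        if (setArgs.count(name) > 0)
            update = true;
    for (const auto &name : rule.unlessAnyOf)
        if (setArgs.count(name) > 0)
            update = false;
    return update;
}

// Serialize names as a JSON array of strings. No escaping is done:
// argument names are assumed to be plain identifiers.
static std::string NamesAsJson(const std::vector<std::string> &names)
{
    std::string out = "[";
    for (std::size_t i = 0; i < names.size(); ++i)
    {
        if (i > 0)
            out += ",";
        out += "\"" + names[i] + "\"";
    }
    return out + "]";
}

// Usage side: what a GetUsageAsJSON()-like routine could emit
// from the very same rule object.
std::string UpdateRuleAsJson(const UpdateRule &rule)
{
    return std::string("{\"by_default\":") +
           (rule.byDefault ? "true" : "false") +
           ",\"if_any_of\":" + NamesAsJson(rule.ifAnyOf) +
           ",\"unless_any_of\":" + NamesAsJson(rule.unlessAnyOf) + "}";
}

// Usage sketch: both consumers read the same rule.
void DemoSharedRule()
{
    UpdateRule rule;
    rule.ifAnyOf = {"update", "add"};
    assert(ShouldOpenForUpdate(rule, {"add"}));
    assert(!UpdateRuleAsJson(rule).empty());
}
```

The trade-off raised in this thread still applies: extracting such a rule from existing runtime code touches core argument processing, whereas declaring it separately risks the declaration being forgotten.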

Author

brownag Apr 13, 2026

Thanks @rouault, I definitely agree that it would be best not to have to repeat this logic both in ProcessDatasetArg() and in the algorithm declaration.

It was a conscious decision on my part to implement this first draft of the PR without trying to abstract out or modify the existing runtime logic. I recognize the issues with expanding the API surface and additional cognitive overhead for developers, but I would argue that adding these types of conditional update features should not be very common. GDAL_OF_UPDATE signals intent very well, but quite a few of these other tools have complex logic. There is value in being explicit in the algorithm declaration.

All that said, I've been tinkering with some alternative options. So far I have not come up with something simpler that I am happy with, but I suspect I can improve on what I have. I will need a bit more time. If you can confirm that the first 3 commits on this PR are adequate, that will help me know I can safely move forward with an alternate implementation that respects the same schema.

rouault added the "AI assisted" label (⚠️ AI assisted coding involved. Review with extreme scepticism.) Apr 12, 2026
Member

rouault commented Apr 12, 2026

/me giving up on all AI assisted contributions

Author

brownag commented Apr 13, 2026

> /me giving up on all AI assisted contributions

If you would rather just close this PR then that is fine by me. I spent a few hours of my Saturday considering ways to refactor in response to your prior comment. I have had a busy weekend though, and did not want to respond before I had fully considered your suggestion. Please feel free to close. I disclosed AI use as is required, but I don't think it is a fair characterization to say I did not give any thought to this work, or that I primarily relied on AI to construct this PR.

Contributor

ctoney commented Apr 13, 2026

Is overwrite handled for datasets? In this PR, "vector rasterize" gets

.SetOpenForUpdateIfAnyOf({"update", "add"});

But if --overwrite is used, then the output file is potentially being mutated conditional on whether it already exists (i.e., an existing file would be deleted and then a new file with the same name opened for update).

Author

brownag commented Apr 13, 2026

> Is overwrite handled for datasets? In this PR, "vector rasterize" gets
>
> .SetOpenForUpdateIfAnyOf({"update", "add"});
>
> But if --overwrite is used, then the output file is potentially being mutated conditional on whether it already exists (i.e., an existing file would be deleted and then a new file with the same name opened for update).

--overwrite is not handled because it doesn't need the dataset to be opened in update mode; as you say, it re-creates a new raster file. If I understand the logic in ProcessDatasetArg() properly, with --overwrite set the handling for in-place update will not trigger. Also, if an argument that opens for update is used together with --overwrite, overwrite takes precedence.

But I agree that overwrite is potentially mutating the output. On some level this is semantics of what an "update" is. There are cases where whole algorithms are GDAL_OF_UPDATE, and some cases where GDAL_OF_UPDATE is triggered at runtime, but overwrite is neither of those.

Contributor

ctoney commented Apr 13, 2026

> On some level this is semantics of what an "update" is.

I was thinking of it in terms of the additional context given in #14290

> In-place modifications break idempotency. Adding this property will help bindings know when a file is being mutated to maintain reproducibility.

I would consider --overwrite as a case of the file being mutated in that context. Otherwise, overwriting a layer in an existing, single-layer GPKG would be "update", but --overwrite the file with a new single-layer would not. Overwriting a single-band raster file with a new single band would not be considered update, when that is basically the raster analog of overwriting a vector layer which is "update". Aren't there cases where the same output could be produced with a choice of algorithm arguments, one considered "update" and the other not, but the choice of which way to do it is more or less arbitrary?

> There are cases where whole algorithms are GDAL_OF_UPDATE, and some cases where GDAL_OF_UPDATE is triggered at runtime, but overwrite is neither of those.

I don't understand "overwrite is neither of those". A file that is overwritten is still opened with update access (at least a file with the same name is, with the previous file no longer existing), and it's triggered at runtime.

Author

brownag commented Apr 13, 2026

My goal with #14290 was to identify in the JSON usage which algorithms open datasets for in-place update by default, and also which algorithm arguments modify the default behavior declared for the algorithm. This is metadata that bindings can use to reason about determinism/reproducibility without them having to maintain custom lists and rules.

At JSON usage emission time, we can only track the declarations, not the runtime conditions.

> I would consider --overwrite as a case of the file being mutated in that context.

I agree in the general sense of the file system, but I don't think that overwrite should be included in the proposed "open_for_update" "if_any_of" metadata.

I think the distinction is not about whether the overwritten file gets changed; clearly it does change (at least the timestamp). What makes an overwrite operation different from an update is that the result of an overwrite does not depend on what was there before.

> Otherwise, overwriting a layer in an existing, single-layer GPKG would be "update", but --overwrite the file with a new single-layer would not. Overwriting a single-band raster file with a new single band would not be considered update, when that is basically the raster analog of overwriting a vector layer which is "update".

--overwrite-layer opens a dataset with GDAL_OF_UPDATE, and that is semantically different from --overwrite.

> Aren't there cases where the same output could be produced with a choice of algorithm arguments, one considered "update" and the other not, but the choice of which way to do it is more or less arbitrary?

Sure, but when rendering the JSON usage we can't know how a user is going to run the algorithm. We can know which algorithms or arguments will trigger opening a dataset for update. Separately, we can also know that overwrite will always create a new dataset.

Perhaps we do not need the JSON usage flags to map to the code paths that attach or disable GDAL_OF_UPDATE... but that is what I proposed and attempted to do here.
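
The distinction drawn in this comment could be sketched as a small classification on the binding side. The `OutputEffect` enum and `ClassifyOutputEffect()` are hypothetical names, not part of GDAL or of this PR; the only behavior taken from the discussion is that overwrite takes precedence when it is combined with an argument that would otherwise open the dataset for update.

```cpp
#include <cassert>

// Hypothetical categories a binding might use when reasoning about
// reproducibility of a call.
enum class OutputEffect
{
    CreatesNew,        // plain creation of the output
    ReplacesExisting,  // --overwrite: result does not depend on prior content
    UpdatesInPlace     // opened for update: result may depend on prior content
};

// `opensForUpdate` would be derived from the "open_for_update" metadata,
// `overwrite` from whether --overwrite was passed. Overwrite is checked
// first because, per the discussion, it takes precedence over update.
OutputEffect ClassifyOutputEffect(bool opensForUpdate, bool overwrite)
{
    if (overwrite)
        return OutputEffect::ReplacesExisting;
    if (opensForUpdate)
        return OutputEffect::UpdatesInPlace;
    return OutputEffect::CreatesNew;
}

// Usage sketch: combining an update-triggering argument with --overwrite.
void DemoClassifyOutputEffect()
{
    assert(ClassifyOutputEffect(true, true) == OutputEffect::ReplacesExisting);
    assert(ClassifyOutputEffect(true, false) == OutputEffect::UpdatesInPlace);
}
```

Keeping overwrite as its own category lets a binding flag both kinds of mutation while still recording that only the in-place update depends on what the file contained before.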

Contributor

ctoney commented Apr 14, 2026

Thanks. That makes sense to me.

> Perhaps we do not need the JSON usage flags to map to the code paths that attach or disable GDAL_OF_UPDATE... but that is what I proposed and attempted to do here.

I wasn't suggesting these flags aren't needed. I asked about the implications of overwrite with respect to knowing whether a file is mutated. The PR (in its current form at least) adds incremental complexity and another item for algorithm developers to track and implement consistently. The benefit for determinism/reproducibility should be clear. You articulated well what it adds without having runtime information on whether an existing file is actually modified by overwrite. I see value in that but don't have a strong opinion on cost/benefit. Maintainer perspective obviously carries a lot of weight on it, and there are a small number of algorithm developers.

Author

brownag commented Apr 14, 2026

> I see value in that but don't have a strong opinion on cost/benefit. Maintainer perspective obviously carries a lot of weight on it, and there are a small number of algorithm developers.

Agreed.

I think what @rouault suggested about abstracting out the logic from ProcessDatasetArg() and using it in GetUsageAsJSON() is the most parsimonious approach. However, ProcessDatasetArg() has a lot of logic related to this that gets pretty complex. The scope of things that could break increases quite a bit if I start messing around with the core argument processing routine for what I originally conceived of as improved machine-readable documentation. When I proposed it I don't think I had a full appreciation for how much of this logic was determined at runtime as opposed to just part of the algorithm definition...

As I have looked around I have noticed inconsistencies in algorithm definitions, so I realize that the additional cognitive overhead of adding SetOpenForUpdateIfAnyOf/UnlessAnyOf means that, as implemented, it is likely to be omitted unintentionally as the codebase develops. Though as implemented in this PR omission is not a huge "risk" as only the JSON usage output is affected, not the actual runtime behavior.

I probably should have couched this PR a bit better: I want to get feedback on this relatively low-impact implementation of the idea (i.e. is the proposed JSON schema right, are we capturing everything we need to, are there more algorithms with edge cases I am missing). Your point about overwrite is well taken, so thank you for weighing in.

brownag marked this pull request as draft April 14, 2026 15:36
Member

rouault commented Apr 14, 2026

> but I don't think it is a fair characterization to say I did not give any thought to this work, or that I primarily relied on AI to construct this PR.

I didn't say or imply that. It's more that I've had an indigestion of AI contributions over various projects recently, which makes me paranoid in general. The submitter knows how much personal thought they have put in. On my side, I'm in the dark. AI also increases the volume of contributions: maintainers do not scale up.
