CSharp bindings: marshal all strings as UTF-8 #14283

Merged
rouault merged 14 commits into OSGeo:master from Mbucari:master on Apr 15, 2026

@Mbucari
Contributor

Mbucari commented Apr 2, 2026

The Problem

Java's runtime uses native UTF-16 strings and the JNI uses UTF-8 by default, so strings with wide characters are handled properly. Likewise, Python converts to/from UTF-8 (or binary) without issue. C#, however, marshals all strings as ANSI by default.

The problem is in the SWIG library's SWIGStringHelper. It registers a managed callback with the wrapper. This callback accepts a string argument and returns a string. The wrapper passes the native string to this callback function. The callback creates a managed string and then returns the managed string. Its purpose is to allow the .NET runtime to create a copy of the unmanaged string while giving the unmanaged caller a chance to free the unmanaged string before returning the managed string.

The problem with this method is that the string goes through the marshaller several times before it's returned to the managed caller. Each pass through the marshaller is a chance to down-convert the string to ANSI if the proper attributes are not applied. More importantly, each pass through the marshaller creates another copy of the string. With the default SWIG implementation, output strings are copied three times before they're returned to the managed caller.

  1. The first copy happens when the unmanaged string argument is passed to the callback and marshalled as a .NET string. This is also when the UTF-8 string is converted to ANSI: the managed delegate has no [MarshalAs] attribute, so it defaults to ANSI.
  2. The second copy happens when the callback returns the string. The managed .NET string is marshalled as an LPStr, so the string that the native code gets from this callback is a copy.
  3. The callback returns, and the native code has the copy of the string made in step 2. This is an unmanaged string. The native code returns it to the managed caller, during which it is marshalled (copied) back to a managed string.
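
The effect of the ANSI pass in step 1 can be illustrated in isolation. The following is a minimal sketch, not code from this PR: the class and method names are hypothetical, and ISO-8859-1 is used as a stand-in for whatever single-byte ANSI code page the marshaller would pick on a given system.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of GDAL or SWIG.
public static class AnsiMisdecodeDemo
{
    // Decode bytes that are really UTF-8 as if they were single-byte "ANSI" text.
    // Assumption: ISO-8859-1 stands in for the system ANSI code page.
    public static string DecodeAsAnsi(byte[] utf8Bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Decode the same bytes with the encoding they were actually written in.
    public static string DecodeAsUtf8(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

For example, the two UTF-8 bytes of "é" (0xC3 0xA9) come back as the two characters "Ã©" from `DecodeAsAnsi`, while `DecodeAsUtf8` returns the original single character.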

What does this PR do?

Fix string marshalling with C# bindings so that all strings are marshalled to/from unmanaged UTF-8.

Replace SWIG's built-in SWIGStringHelper with a modified version named Utf8StringHelper

  • Disable SWIG's string helper by defining the SWIG_CSHARP_NO_STRING_HELPER constant.
  • Add a new SWIG interface, csharp_strings.i, to handle all string conversions in C# bindings.
  • Add Utf8StringHelper to csharp_strings.i. It uses the same callback pattern as the built-in SWIGStringHelper and defines the SWIG_csharp_string_callback function (which is used by Gdal and by other SWIG routines), but it correctly handles UTF-8 strings and only makes two copies. When the callback is called from native code, the managed callback creates a length-prefixed UTF-16 string in unmanaged memory (first copy), and then returns a pointer to that string. The native code then returns that pointer to the managed code, and the csout and csvarout typemaps convert that length-prefixed UTF-16 string to a managed .NET string (second copy) and release the unmanaged memory.
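
As a rough illustration of explicit UTF-8 marshalling that avoids the default marshaller entirely, the sketch below encodes a managed string into NUL-terminated UTF-8 in unmanaged memory and decodes it again. It is not the code in csharp_strings.i: the class and method names are hypothetical, it uses only APIs available on netstandard2.0, and it does not reproduce the length-prefixed UTF-16 scheme described above.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper; the names are illustrative and not taken from the PR.
public static class Utf8MarshalSketch
{
    // Encode a managed string as NUL-terminated UTF-8 in unmanaged memory.
    // The caller must release the result with Marshal.FreeHGlobal.
    public static IntPtr ToUtf8Unmanaged(string s)
    {
        if (s == null) return IntPtr.Zero;
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        IntPtr ptr = Marshal.AllocHGlobal(bytes.Length + 1);
        Marshal.Copy(bytes, 0, ptr, bytes.Length);
        Marshal.WriteByte(ptr, bytes.Length, 0); // terminating NUL
        return ptr;
    }

    // Decode a NUL-terminated UTF-8 buffer into a managed string.
    public static string FromUtf8Unmanaged(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero) return null;
        int len = 0;
        while (Marshal.ReadByte(ptr, len) != 0) len++; // find the terminator
        byte[] bytes = new byte[len];
        Marshal.Copy(ptr, bytes, 0, len);
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Because neither direction goes through a [MarshalAs]-controlled conversion, no step can silently fall back to ANSI. The cost is that ownership is manual: every pointer returned by `ToUtf8Unmanaged` has to be freed exactly once.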

Change C# and Java string constants from runtime to compile time.

Use the %javaconst(flag) and %csconst(flag) directives to mark all string constants as compile time. This was necessary because SWIG was generating string constants for each #define, but it was using the imtype to assign them, which is now IntPtr. There may be another way to apply a different typemap to these constants, but making them compile time solves the problem and is more performant anyway.

Minor Refactor

  • Deleted StringToUtf8Bytes. StringToUtf8Unmanaged is the only UTF-8 encoder now.
  • Moved StringToUtf8Unmanaged from $module class to $imclassname in csharp_strings.i.
  • Moved all utf8_string definitions into csharp_strings.i.

Tasklist

  • Make sure code is correctly formatted (cf pre-commit configuration)
  • Add test case(s)
  • All CI builds and checks have passed
  • Evaluate for memory leaks
  • Review
  • Adjust for comments

Environment

Mono and dotnet

Fix string marshalling with C# bindings so that all strings are marshalled to/from unmanaged UTF-8.

- Replace SWIG's built-in `SWIGStringHelper` with a modified version named `Utf8StringHelper`.
- Change C# and Java string constants from runtime to compile time.
- Minor Refactor
@rouault
Member

rouault commented Apr 2, 2026

@Mbucari The PR template has a checkbox about AI use. I feel you would need to have ticked it. Please avoid copying verbatim AI-generated PR descriptions. They are 10 times bigger than needed...

CC @runette @szekerest

@Mbucari
Contributor Author

Mbucari commented Apr 2, 2026

The PR template has a checkbox about AI use.

@rouault I guess I can only take that as a compliment, because I did not use an ounce of AI. I sat down last night and typed every character of my PR description into the GitHub editor. I proofread it, edited it, and formatted it myself. I am literate.

EDIT* Every character except for the template parts which I kept. And I suppose if you want to be pedantic, I also copied and pasted class names, file names, etc. from my PR changes and from Microsoft's website.

EDIT2* As to being bigger than needed, this PR changes how every string is marshalled to and from GDAL across the entire API surface. I feel like that's significant enough to warrant a slightly longer description.

@rouault
Member

rouault commented Apr 2, 2026

because I did not use an ounce of AI

ok sorry. Looks like my AI detection radar needs to be improved :-)

@rouault
Member

rouault commented Apr 2, 2026

the fedora:rawhide failure is unrelated and (hopefully) addressed in latest master

Use StringToUtf8Bytes instead of StringToUtf8Unmanaged to marshal string from managed to unmanaged. The unmanaged string was not being freed, which would lead to memory leaks. Used a managed byte[] instead so GC will free it.
@Mbucari
Contributor Author

Mbucari commented Apr 2, 2026

7b65807 fixed a memory leak.

@Mbucari
Contributor Author

Mbucari commented Apr 7, 2026

@rouault can you please re-run the failing test, Linux Builds / Ubuntu 26.04, clang ASAN?

This PR works and is ready for full review.

@rouault
Member

rouault commented Apr 7, 2026

can you please re-run the failing test, Linux Builds / Ubuntu 26.04, clang ASAN?

I believe just restarting it will not be enough. You have to git rebase on top of latest master and force push

@rouault
Member

rouault commented Apr 8, 2026

@runette @szekerest review of this PR appreciated when you've a chance

@runette
Contributor

runette commented Apr 10, 2026

@rouault I will have a look today

Mbucari added 2 commits April 10, 2026 15:38
Update typemaps for `out string` and `ref string` to work with changes made to string marshalling in 2049d72.
@Mbucari
Contributor Author

Mbucari commented Apr 10, 2026

@rouault I had to fix typemaps for out string and ref string parameters. I also added tests for out and ref strings.

String cases covered:

  • input as function parameter
  • output returned by function
  • input to property setter
  • output from property getter
  • output to out string parameter
  • input from ref string parameter
  • string array input to function
  • string array output returned by function

The only string marshalling case not handled is output back to a ref string parameter, but I don't think that's used anywhere. The only ref string typemap I see is for char **ignorechange, which doesn't need output.

@Mbucari
Contributor Author

Mbucari commented Apr 11, 2026

While doing the above fix, I learned a way to improve performance for all strings on input.

I'm done now. Sorry for the late modifications to this PR.

Encode input strings directly to UTF8 unmanaged memory instead of managed byte array.

Apply string typemap to `void* callback_data`
Mbucari added 3 commits April 10, 2026 22:05
- Move all string typedefs into csharp_strings.i
- Add documentation
@Mbucari
Contributor Author

Mbucari commented Apr 12, 2026

I reorganized the C# typedefs by moving all string and string[] definitions into csharp_strings.i. I also updated the OGR error code messages.

@Mbucari
Contributor Author

Mbucari commented Apr 12, 2026

@rouault can you please rerun those failing tests? The 'Memory leak detected' error on Ubuntu 26.04, clang ASAN seems concerning, but none of my changes should have caused that, and the test succeeded in my repo's workflow runs.

@rouault
Member

rouault commented Apr 12, 2026

The 'Memory leak detected' error on Ubuntu 26.04, clang ASAN seems concerning

Done. It often fails due to a timeout in the GMLAS tests. Someone should fix that...

@runette
Contributor

runette commented Apr 13, 2026

@Mbucari It is quite difficult to review if you keep adding commits.

Can you let me know when you think it is ready for review?

@runette
Contributor

runette commented Apr 13, 2026

I reorganized the C# typedefs by moving all string and string[] definitions into csharp_strings.i. I also updated the OGR error code messages.

There are a number of other issues related to UTF8. Does this PR also fix these?

#1245
#1433
#3107
#3633
#10883

If yes - do we need to / can we add specific test to the sample code to show that?
If no - can we add those fixes?

@Mbucari
Contributor Author

Mbucari commented Apr 13, 2026

@runette I'm done making additions to this PR, unless I find another bug (I'm currently using this build in a project, so I may come across additional bugs).

I've explored all the issues you linked, and I've outlined my findings below. Tl;dr: This PR is good as-is, and I added tests.


#1245 - I can't reproduce this error with the MaxRev-Win64 3.12.2 package, so I assume the problem has been fixed in the intervening years. I added a test for this case.


#1433 - Using the shapefile that OP posted, I cannot reproduce the error with the MaxRev-Win64 3.12.2 package. I get the expected field names: Drâ’ Berrani and Drâ’ Ali Ben Amar. I added a test for this case.

Note that if you are creating these field values from code, the layer may need to be created with the "ENCODING=UTF-8" creation option.


#3107 - This bug has 2 parts, both of which have been fixed. The first part was that OP's NuGet package was using proj6, which didn't support Unicode file paths. The second part was that GetProjSearchPaths wasn't decoding string lists as UTF-8, and that was fixed in a9d1885.


#3633 - This will be fixed by this PR, and I've added a test.


#10883 - This will be fixed by this PR, but the layer on which the field will be created must have been created with the "ENCODING=UTF-8" creation option. I've added a test.

@runette
Contributor

runette commented Apr 13, 2026

@Mbucari Thank you for taking the time to do that. And thank you for the PR and your detailed analysis.

@Mbucari
Contributor Author

Mbucari commented Apr 13, 2026

@runette Sorry for the extra commit. I noticed that my workflow tests failed because the test apps were being compiled with C# language version 7.0, and I'd used features from a later version. Do you know of a CMake flag that I can use to manually specify the C# language version for my build environment?

@Mbucari
Contributor Author

Mbucari commented Apr 13, 2026

Interestingly, my last commit failed on the line if (shpSrc is null) with the following error.

error CS1644: Feature `pattern matching' cannot be used because it is not part of the C# 7.0 language specification

But that's wrong. Pattern matching was introduced in C# 7.0, so that appears to be a Mono bug.

@runette
Contributor

runette commented Apr 13, 2026

Interestingly, my last commit failed on the line if (shpSrc is null) with the following error.

error CS1644: Feature `pattern matching' cannot be used because it is not part of the C# 7.0 language specification

But that's wrong. Pattern matching was introduced in C# 7.0, so that appears to be a Mono bug.

The CMake build process "grandfathered" in legacy support for compiling for Mono using mcs, which has "partial" support for C# 7.0, and it seems that pattern matching is one of the gaps.

There is no CMake flag for the C# version.

For dotnet builds there are flags for the target framework for the libraries and apps (separately specified). For no very good reason these ended up saying CSHARP_xx_VERSION which is not ideal.

-DCSHARP_LIBRARY_VERSION=netstandard2.0
and
-DCSHARP_APPLICATION_VERSION=xx

The sample applications are only built for testing purposes, so you would think the target framework should be flexible. The problem is the Mono builds, which effectively limit the C# version to 7.0-ish!

It has not really been much of a problem until now, because as you have probably seen the sample applications have not changed very much in recent years, and when they do it is a manual check. However, I did have exactly the same problem a few weeks ago on another PR (also with using statements).

What we probably should have done is include the language version in the csproj files. However, that should be another PR. It is also probably a question whether mcs support should be retained.

If you are building on Windows - I think you can specify the -DCSHARP_APPLICATION_VERSION=net48 which should default the C# version to 7.3 and catch most of the language problems (but not all). I cannot do that because I build on Linux for testing.
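
For test code that also has to compile under mcs, one way to sidestep the gap is to avoid the pattern-matching form of the null check altogether. The sketch below assumes, based only on the CS1644 error quoted above, that mcs rejects `x is null`. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper; not part of the GDAL sample applications.
public static class NullCheckSketch
{
    // Behaves like `value is null` for reference types without using pattern-matching syntax.
    // ReferenceEquals ignores any overloaded == operator, as `is null` does.
    public static bool IsNull(object value)
    {
        return ReferenceEquals(value, null);
    }
}
```

A plain `value == null` is the more common spelling and gives the same result whenever the type does not overload `==`.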

Contributor

runette left a comment

Given the depth of these changes, it would be good to get @szekerest to comment, given your greater knowledge of the structure of the bindings.

Comment thread swig/include/csharp/csharp_strings.i Outdated
Comment thread swig/include/csharp/csharp_strings.i Outdated
Comment thread swig/include/csharp/csharp_strings.i Outdated
Comment thread swig/include/csharp/csharp_strings.i Outdated
Comment thread swig/include/csharp/csharp_strings.i Outdated
Comment thread swig/include/csharp/csharp_strings.i Outdated
Comment thread swig/include/csharp/csharp_strings.i Outdated
Comment thread swig/include/gdalconst.i Outdated
Comment thread swig/include/csharp/csharp_strings.i Outdated
Comment thread swig/include/csharp/csharp_strings.i
@runette
Contributor

runette commented Apr 14, 2026

Based on the conversation above

fixes #1245
fixes #1433
fixes #3107
fixes #3633
fixes #10883

Mbucari requested a review from runette April 14, 2026 17:05
@runette
Contributor

runette commented Apr 14, 2026

@Mbucari Thanks

I think we have to wait for @rouault on the Java question and @szekerest for any comments. Will revisit tomorrow.

@Mbucari
Contributor Author

Mbucari commented Apr 14, 2026

Thanks

You're very welcome. It's been my pleasure, and I'm looking forward to GDAL 3.13.

I think we have to wait for rouault on the Java question

He already OK'd it here.

@runette
Contributor

runette commented Apr 14, 2026

Thanks

You're very welcome. It's been my pleasure, and I'm looking forward to GDAL 3.13.

I think we have to wait for rouault on the Java question

He already OK'd it here.

sorry - missed that

@rouault
Member

rouault commented Apr 14, 2026

I'll probably cut a GDAL 3.13.0 beta1 tomorrow or Thursday and would like it to include this PR. If there are adjustments needed, we'll have ~2 weeks before the final release.

@Mbucari
Contributor Author

Mbucari commented Apr 14, 2026

@runette I was thinking it might be a good idea to expose the string marshaller to users and allow them to use their own custom marshaller. Something like the below interface. It could be a last resort for users to manage exactly how strings are read, serving a purpose similar to #3825. Beyond the scope of this PR, but what do you think?

/// <summary>
/// Interface for marshalling strings to and from the GDAL library.
/// </summary>
public interface IGdalStringMarshaler
{
    /// <summary>
    /// Copies a native GDAL string to an unmanaged memory location and returns a pointer to it.
    /// The caller is responsible for freeing the unmanaged memory using <see cref="FreeUnmanagedString"/>.
    /// </summary>
    /// <param name="nativeString">A pointer to a native GDAL char* string.</param>
    IntPtr CopyNativeToUnmanaged(IntPtr nativeString);
    /// <summary>
    /// Converts a managed string to an unmanaged string and returns a pointer to it.
    /// The caller is responsible for freeing the unmanaged memory using <see cref="FreeUnmanagedString"/>.
    /// </summary>
    /// <param name="managedString">The managed string to convert.</param>
    IntPtr StringToUnmanaged(string managedString);
    /// <summary>
    /// Converts an unmanaged string (allocated by CopyNativeToUnmanaged or StringToUnmanaged) back to a managed string.
    /// </summary>
    /// <param name="unmanagedString">A pointer to an unmanaged string created by CopyNativeToUnmanaged or StringToUnmanaged.</param>
    string UnmanagedToString(IntPtr unmanagedString);
    /// <summary>
    /// Releases the memory allocated for an unmanaged string pointer.
    /// Must be called exactly once for each pointer returned by <see cref="CopyNativeToUnmanaged"/>
    /// or <see cref="StringToUnmanaged"/> to prevent memory leaks or double frees.
    /// </summary>
    /// <param name="unmanagedString">A pointer returned by CopyNativeToUnmanaged or StringToUnmanaged.</param>
    void FreeUnmanagedString(IntPtr unmanagedString);
}

@runette
Contributor

runette commented Apr 14, 2026

I'll probably cut a GDAL 3.13.0 beta1 tomorrow or Thursday and would like it to include this PR. If there are adjustments needed, we'll have ~2 weeks before the final release.

I am happy with the PR. The code looks very good to me and I have tested it every way I can think of. It definitely builds on netstandard2.0 (this time), so my vote would be yes.

@runette
Contributor

runette commented Apr 14, 2026

@runette I was thinking it might be a good idea to expose the string marshaller to users and allow them to use their own custom marshaller. Something like the below interface. It could be a last resort for users to manage exactly how strings are read, serving a purpose similar to #3825. Beyond the scope of this PR, but what do you think?

/// <summary>
/// Interface for marshalling strings to and from the GDAL library.
/// </summary>
public interface IGdalStringMarshaler
{
    /// <summary>
    /// Copies a native GDAL string to an unmanaged memory location and returns a pointer to it.
    /// The caller is responsible for freeing the unmanaged memory using <see cref="FreeUnmanagedString"/>.
    /// </summary>
    /// <param name="nativeString">A pointer to a native GDAL char* string.</param>
    IntPtr CopyNativeToUnmanaged(IntPtr nativeString);
    /// <summary>
    /// Converts a managed string to an unmanaged string and returns a pointer to it.
    /// The caller is responsible for freeing the unmanaged memory using <see cref="FreeUnmanagedString"/>.
    /// </summary>
    /// <param name="managedString">The managed string to convert.</param>
    IntPtr StringToUnmanaged(string managedString);
    /// <summary>
    /// Converts an unmanaged string (allocated by CopyNativeToUnmanaged or StringToUnmanaged) back to a managed string.
    /// </summary>
    /// <param name="unmanagedString">A pointer to an unmanaged string created by CopyNativeToUnmanaged or StringToUnmanaged.</param>
    string UnmanagedToString(IntPtr unmanagedString);
    /// <summary>
    /// Releases the memory allocated for an unmanaged string pointer.
    /// Must be called exactly once for each pointer returned by <see cref="CopyNativeToUnmanaged"/>
    /// or <see cref="StringToUnmanaged"/> to prevent memory leaks or double frees.
    /// </summary>
    /// <param name="unmanagedString">A pointer returned by CopyNativeToUnmanaged or StringToUnmanaged.</param>
    void FreeUnmanagedString(IntPtr unmanagedString);
}

I think it is an excellent idea - but maybe another PR

@coveralls
Collaborator

Coverage Status

coverage: 71.688% (+0.02%) from 71.664% — Mbucari:master into OSGeo:master

@runette
Contributor

runette commented Apr 15, 2026

I'll probably cut a GDAL 3.13.0 beta1 tomorrow or Thursday and would like it to include this PR. If there are adjustments needed, we'll have ~2 weeks before the final release.

I am happy with the PR. The code looks very good to me and I have tested it every way I can think of. It definitely builds on netstandard2.0 (this time), so my vote would be yes.

Hold on that - I may have found a problem

@runette
Contributor

runette commented Apr 15, 2026

I'll probably cut a GDAL 3.13.0 beta1 tomorrow or Thursday and would like it to include this PR. If there are adjustments needed, we'll have ~2 weeks before the final release.

I am happy with the PR. The code looks very good to me and I have tested it every way I can think of. It definitely builds on netstandard2.0 (this time), so my vote would be yes.

Hold on that - I may have found a problem

Sorry for the alarm - it was a problem with my build environment.

@Mbucari
Contributor Author

Mbucari commented Apr 15, 2026

Sorry for the alarm - it was a problem with my build environment.

Thank goodness.

If you wouldn't mind, @runette, would you please take a look at some changes I made on a separate branch?

I was working on implementing the user-defined marshaller I mentioned above, and in the process I found a solution that is simpler and less error-prone: using a pinned GCHandle to a managed string instead of using unmanaged memory. I can make it a separate PR, but I think this new approach is superior and should be added to this PR.
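
The general pinning technique mentioned here can be sketched as follows. This is only an illustration of how a pinned GCHandle exposes a managed string's UTF-16 buffer at a stable address. It makes no claim about how the separate branch is written, and the class and method names are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical demo class; not taken from the PR or the separate branch.
public static class PinnedStringSketch
{
    // Pin a managed string, read its UTF-16 buffer through the raw pointer, then unpin.
    public static string RoundTripThroughPinnedHandle(string managed)
    {
        if (managed == null) return null;
        // While pinned, the GC will not move the string, so the address stays valid.
        GCHandle handle = GCHandle.Alloc(managed, GCHandleType.Pinned);
        try
        {
            IntPtr utf16 = handle.AddrOfPinnedObject(); // address of the first char
            // Stand-in for code on the other side of the boundary reading the buffer.
            return Marshal.PtrToStringUni(utf16, managed.Length);
        }
        finally
        {
            handle.Free(); // unpin; forgetting this keeps the string alive and pinned
        }
    }
}
```

Compared with allocating unmanaged memory, the string's lifetime stays under GC control and there is no separate buffer to free, only the handle itself.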
