CSharp bindings: marshal all strings as UTF-8#14283
Conversation
Fix string marshalling with C# bindings so that all strings are marshalled to/from unmanaged UTF-8. - Replace SWIG's built-in `SWIGStringHelper` with a modified version named `Utf8StringHelper`. - Change C# and Java string constants from runtime to compile time. - Minor Refactor
|
@Mbucari There PR template has a checkbox about AI use. I feel you would need to have ticked it. Please avoid copying verbatim AI generate PR descriptions. They are 10 times bigger than needed... |
@rouault I guess I can only take that as a compliment, because I did not use an ounce of AI. I sat down last night and typed every character of my PR description into the GitHub editor. I proofread it, edited it, and formatted it myself. I am literate. EDIT* Every character except for the template parts which I kept. And I suppose if you want to be pedantic, I also copied and pasted class names, file names, etc. from my PR changes and from Microsoft's website. EDIT2* As to being bigger than needed, this PR changes how every string is marshal to and from gdal across the entire API surface. I feel like that's significant enough to warrant a slightly longer description. |
ok sorry. Looks like my AI detection radar needs to be improved :-) |
|
the fedora:rawhide failure is unrelated and (hopefully) addressed in latest master |
Use StringToUtf8Bytes instead of StringToUtf8Unmanaged to marshal string from managed to unmanaged. The unmanaged string was not being freed, which would lead to memory leaks. Used a managed byte[] instead so GC will free it.
|
7b65807 fixed a memory leak. |
|
@rouault can you please re-run the failing failing test, Linux Builds / Ubuntu 26.04, clang ASAN? This PR works and is ready for full review. |
I believe just restarting it will not be enough. You have to git rebase on top of latest master and force push |
|
@runette @szekerest review of this PR appreciated when you've a chance |
|
@rouault I will have a look today |
Update typemaps for `out string` and `ref string` to work with changes made to string marshalling in 2049d72.
|
@rouault I had to fix typemaps for String cases covered:
The only string marshaling case not handled is output back to |
|
While doping the above fix, I learned a way to improve performance for all strings on input. I'm done now. Sorry for the late modifications to this PR. |
Encode input strings directly to UTF8 unmanaged memory instead of managed byte array. Apply string typemap to `void* callback_data`
- Move all string typedefs into csharp_strings.i - Add documentation
|
I reorganized the C# typedefs by moving all |
|
@rouault can you please rerun those failing tests? The 'Memory leak detected' error on Ubuntu 26.04, clang ASAN seems concerning, but none of my changes should have caused that, and the test succeeded in my repo's workflow runs. |
done. it often fails due to a timeout in the GMLAS tests . Somone should fix that... |
|
@Mbucari It is quite difficult to review if you keep adding commits. Can you let me know when you think that it is ready for review. |
There are a number of other issues related to UTF8. Does this PR also fix these? #1245 If yes - do we need to / can we add specific test to the sample code to show that? |
|
@runette I'm done making additions to this PR, unless I find another bug (I'm currently using this build in a project, so I may come across additional bugs). I've explored all the issues you linked, and I've outlined my findings below. Tl;dr: This PR is good as-is, and I added tests. #1245 - I can't reproduce this error with the MaxRev-Win64 3.12.2 package, so I assume the problem has been fixed in the intervening years. I added a test for this case. #1433 - Using the shapefile that OP posted, I cannot reproduce the error with the MaxRev-Win64 3.12.2 package. I get the expected field names: Note that if you are creating these field values from code, the layer may need to be created with the #3107 - This bug has 2 parts, both of which have been fixed. The first part was that OP's nuget package was using proj6 which didn't support unicode file path. The second part was that #3633 - This will be fixed by this PR, and I've added a test. #10883 - This will be fixed by this PR, but the layer on which the field will be created must have been created with the |
|
@Mbucari Thank you for taking the time to do that. And thank you for the PR and your detailed analysis. |
|
@runette Sorry for the extra commit. I noticed that my workflow tests failed because the test apps were being compiled with C# language version 7.0, and I'd used features from a later version. Do you know of a CMAKE flag that I can use to manually specify the C# language version for my build environment? |
|
Interestingly, my last commit failed on the line But that's wrong. Pattern matching was introduced in C# 7.0, so that appears to be a Mono bug. |
The CMAKE build process "grandfathered" in legacy support for compiling for Mono using mcs which has "partial" support for C# 7.0 and it seems that pattern matching is one of the gaps. There is no CMAKE flag for C# version For dotnet builds there are flags for the target framework for the libraries and apps (separately specified). For no very good reason these ended up saying CSHARP_xx_VERSION which is not ideal.
The sample applcations are only built for testing purposes, so you would think the target framework should be flexible. The problem is the Mono builds that effectively limit the C# version to 7.0 -ish! It has not really been much of a problem until now - because as you have probably seen the sample applications have not changed very much in recent years and when they do it is a manual check. However, I did have exaclty the same problem a few weeks ago on another PR (also with What we probably should have done is include the language vesion in the csproj files. However, that should be another PR. It is also probably a question if mcs support should be retained. If you are building on Windows - I think you can specify the |
runette
left a comment
There was a problem hiding this comment.
Given the depths of these changes - it would be good to get @szekerest to comment given your greater knowledge on the structure of the bindings
|
@Mbucari Thanks I think we have to wait for @rouault on the Java question and @szekerest for any comments. Will revist tomorrow |
You're very welcome. It's been my pleasure, and I'm looking forward to GDAL 3.13.
|
sorry - missed that |
|
I'll probably cut a GDAL 3.13.0 beta1 tomorrow or thursday and would like it to include this PR in it. If there are adjustments needed, we'll have ~ 2 weeks before final release |
|
@runette I was thinking it might be a good idea to expose the string marshaller to users and allow them to use their own custom marshaller. Something like the below interface. It could be a last resort for users to manage exactly how strings are read, serving a purpose similar to #3825. Beyond the scope of this PR, but what do you think? /// <summary>
/// Interface for marshalling strings to and from the GDAL library
/// </summary>
public interface IGdalStringMarshaler
{
/// <summary>
/// Copies a native GDAL string to an unmanaged memory location and returns a pointer to it.
/// The caller is responsible for freeing the unmanaged memory using <FreeUnmanagedString>.
/// </summary>
/// <param name="nativeString">A pointer to a native GDAL char* string.</param>
public IntPtr CopyNativeToUnmanaged(IntPtr nativeString);
/// <summary>
/// Convert a managed string to an unmanaged string and returns a pointer to it.
/// The caller is responsible for freeing the unmanaged memory using <FreeUnmanagedString>.
/// </summary>
/// <param name="managedString">
public IntPtr StringToUnmanaged(string managedString);
/// <summary>
/// Converts an unmanaged string (allocated by CopyNativeToUnmanaged or StringToUnmanaged) back to a managed string.
/// </summary>
/// <param name="unmanagedString">A pointer to an unmanaged string created by CopyNativeToUnmanaged or StringToUnmanaged</param>
public string UnmanagedToString(IntPtr unmanagedString);
/// <summary>
/// Releases the memory allocated for an unmanaged string pointer.
/// Must be called exactly once for each pointer returned by <CopyNativeToUnmanaged>
/// or <StringToUnmanaged> to prevent memory leaks or double frees.
/// </summary>
public void FreeUnmanagedString(IntPtr unmanagedString);
} |
I am happy with the PR. The code looks very good to me and I have tested it everyway I can think of. It defnitely buidls on netstandard2.0 (this time) so my vote would be yes |
I think it is an excellent idea - but maybe another PR |
Hold on that - I may have found a problem |
Sorry for the alarm - it was a problem with my build environment. |
Thank goodness. If you wouldn't mind, @runette, would you please take a look at some changes I made on a separate branch? I was working on implementing the user-defined marsheller I mentioned above, and in the process I found a solution that is simpler and less error prone, using pinned |
The Problem
Java's runtime uses native UTF-16 strings and the JNI uses UTF-8 by default, so strings with wide characters are handled properly. Likewise, python converts to/from UTF-8 (or binary) without issue. C#, however, converts all strings to ANSI.
The problem is in SWIG library's SWIGStringHelper. It registers a managed callback with the wrapper. This callback accepts a string argument and returns a string. The wrapper passes the native string to this callback function. The callback creates a managed string and then returns the managed string. Its purpose is to allow the .NET runtime to create a copy of the unmanaged string wile giving the unmanaged caller a chance to free the unmanaged string before returning the managed string.
The problem with this method is that the string goes through the marshaller several times before it's returned to the managed caller. Each pass through the marshaller is a chance to down-convert the string to ANSI if the proper attributes are not applied. More importantly, each pass through the marshaller creates another copy of the string. With the default SWIG implementation, output strings are copied three times before they're returned to the managed caller.
[MarshalAs]attribute, it defaults to ANSI.What does this PR do?
Fix string marshalling with C# bindings so that all strings are marshalled to/from unmanaged UTF-8.
Replace SWIG's built-in
SWIGStringHelperwith a modified version namedUtf8StringHelperSWIG_CSHARP_NO_STRING_HELPERconstant.csharp_strings.i, to handle all string conversions in C# bindings.Utf8StringHelpertocsharp_strings.i. It uses the same callback pattern as the built-inSWIGStringHelperand defines theSWIG_csharp_string_callbackfunction (which is used by Gdal and by other SWIG routines), but it correctly handles UTF-8 strings and only makes two copies. When the callback is called from native code, the managed callback creates a length-prefixed UTF-16 string in unmanaged memory (first copy), and then returns a pointer to that string. The native code then returns that pointer to the managed code, and thecsoutandcsvarouttypemaps convert that length-prefixed UTF-16 string to a managed .NET string (second copy) and release the unmanaged memory.Change C# and Java string constants from runtime to compile time.
Use the
%javaconst(flag)and%csconst(flag)directives to mark all string constants as compile time. This was necessary because SWIG was generating string constants for each#define, but it was using theimtypeto assign them, which is nowIntPtr. There may be another way to apply a different typemap to these constants, but making them compile time solves the problem and is more performant anyway.Minor Refactor
StringToUtf8Bytes.StringToUtf8Unmanagedis the only UTF-8 encoder now.StringToUtf8Unmanagedfrom$moduleclass to$imclassnameincsharp_strings.i.utf8_stringdefinitions intocsharp_strings.i.Tasklist
Environment
Mono and dotnet