Fix handling of version files that contain CRLFs #1789
Merged
From looking at metrics, I noticed that the `cat --show-nonprinting` approach added in #1783 to aid the visualisation of invisible characters (such as ASCII control characters) also escapes carriage return characters (to `^M`), which we don't want.

Instead, `sed` is now used to replace anything not matching the `:print:` and `:space:` character classes with the Unicode replacement character (`�`, U+FFFD). This means the classic buildpack now also matches the CNB's behaviour:
https://github.com/heroku/buildpacks-python/blob/f2fdd00edf0f63b298cf88c82377885c88666440/src/python_version_file.rs#L16-L19

For a mapping of what `:print:` and `:space:` cover, see the chart at the bottom of this page:
https://en.cppreference.com/w/cpp/string/byte/isprint

GUS-W-18225347.
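As a rough illustration of the approach described above (not the buildpack's actual code), the substitution could be sketched as follows. The function name and the use of `LC_ALL=C` are assumptions made for this example so that character classification is per byte and predictable.

```bash
# Hypothetical helper: the name is illustrative, not taken from the buildpack.
# Replaces every byte that is neither printable nor whitespace with U+FFFD.
# LC_ALL=C is an assumption made here so that classification happens per byte;
# the buildpack's actual locale handling may differ.
replace_invisible_characters() {
    LC_ALL=C sed 's/[^[:print:][:space:]]/�/g'
}

# A version file with a stray control character (0x01) and a CRLF line ending.
# The control character is replaced, while the carriage return passes through
# untouched because it belongs to the [:space:] class.
printf '3.13\001\r\n' | replace_invisible_characters
```

By comparison, piping the same input through `cat -v` prints `3.13^A^M`, which is the carriage return escaping this PR avoids. One caveat of this sketch: under `LC_ALL=C`, each byte of a legitimate multi-byte UTF-8 character is also treated as non-printable and replaced.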
runesoerensen
approved these changes
May 6, 2025
edmorley
added a commit
that referenced
this pull request
May 8, 2025
Since otherwise any field that contains a carriage return character could span multiple lines in the metrics data store and wouldn't be read back in its entirety by `bin/report`, leading to e.g. a truncated field in Honeycomb. Such characters are much less likely after #1789. However, they can still be present in user-provided input (which ends up in fields like `failure_detail`) in some cases, and from a general correctness point of view, the key-value store should escape all forms of newline characters. I've also changed the escaping strategy to use the literal sequences `\n` and `\r`, so it's easier to distinguish between multi-line and single-line (but space-delimited) values. GUS-W-18471014.
edmorley
added a commit
that referenced
this pull request
May 8, 2025
Since otherwise any field that contains a carriage return character could span multiple lines in the internal metrics data store and wouldn't be read back in its entirety by `bin/report`, leading to a truncated field in Honeycomb. Such characters are much less likely after #1789. However, they can still be present in user-provided input in some cases (which ends up in metrics fields like `failure_detail`), and from a general correctness point of view, the key-value store's attribute saving functions should escape all forms of newline characters. I've also changed the escaping strategy to use the literal sequences `\n` and `\r`, so it's easier to distinguish between multi-line and single-line (but space-delimited) values in Honeycomb. GUS-W-18471014.
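To make the escaping strategy concrete, here is a minimal Bash sketch of turning line feeds and carriage returns into the literal sequences `\n` and `\r` before a value is stored. The function name is hypothetical and this is not the buildpack's actual implementation. It also doesn't escape backslashes already present in the value, which a real store may need to handle.

```bash
#!/usr/bin/env bash

# Hypothetical helper: the name is illustrative, not taken from the buildpack.
# Rewrites each line feed as the two characters `\n` and each carriage return
# as the two characters `\r`, so the value always fits on a single line.
escape_line_endings() {
    local value="$1"
    local lf=$'\n'
    local cr=$'\r'
    value="${value//${lf}/\\n}"
    value="${value//${cr}/\\r}"
    printf '%s' "${value}"
}

# A multi-line value containing both an LF and a CRLF line ending.
escape_line_endings "$(printf 'line1\nline2\r\nline3')"
```

Because the escaped value never contains a raw line break, a line-oriented reader such as the one described for `bin/report` would see the whole field, and a multi-line value stays distinguishable from one that merely contains spaces.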