
Preserve synthetic paragraph markers in CommonMark output#788

Closed
sandmor wants to merge 1 commit into kivikakk:main from sandmor:fix-fences

Conversation

sandmor commented Apr 1, 2026

Summary

Preserve block-like text as plain text when formatting synthetic Paragraph nodes to CommonMark.

Before this change, a synthetic paragraph containing text like ~~~, ``````, :::, `: details`, or `~ details` could be emitted as real Markdown block markers, so reparsing the formatter output changed the document structure.

Why

The formatter should preserve AST semantics.

If Markdown is parsed from source text, fences and other block markers should still be recognized normally. But if the formatter is given a synthetic Paragraph node with that same text, it should render it as plain text, not as a new Markdown block.

Example

Before:

synthetic Paragraph("~~~")
-> ~~~

Reparsing that output turns the paragraph into a fenced block.

After:

synthetic Paragraph("~~~")
-> \~~~

The text stays a paragraph after round-trip.

What Changed

  • escape fence, block-directive, and description-list markers when they appear as synthetic paragraph text at valid block-start positions
  • keep ordered-list handling separate so plain text like 1~~~ is not over-escaped
  • add round-trip tests for fences, directives, description-list markers, and numeric-prefix non-regressions
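
As an illustration of the first two bullets, here is a minimal sketch of block-start escaping. It is not the PR's code or Comrak's implementation: the function name `escape_block_start` and the marker set are assumptions made for the example, and only fence-like markers (`~~~`, `:::`) are handled. The description-list markers mentioned above are left out.

```rust
/// Hypothetical helper (not part of Comrak): backslash-escape a line of
/// paragraph text if it would otherwise open a fence or directive block.
fn escape_block_start(line: &str) -> String {
    // Count leading spaces. Up to three still allow a block to start;
    // with four or more the line cannot open a fence.
    let indent = line.len() - line.trim_start_matches(' ').len();
    if indent > 3 {
        return line.to_string();
    }
    let rest = &line[indent..];
    // Assumed marker set: tilde fences and colon-fenced directives.
    if rest.starts_with("~~~") || rest.starts_with(":::") {
        // Escaping the first marker character breaks the run of three.
        format!("{}\\{}", &line[..indent], rest)
    } else {
        // Text such as "1~~~" does not begin with a marker and is untouched.
        line.to_string()
    }
}
```

Because the check only looks at what follows the indentation, a numeric prefix such as `1~~~` falls through unchanged, which matches the non-regression case in the second bullet.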

Testing

  • cargo fmt --check
  • cargo test commonmark_ -- --nocapture

Owner

kivikakk commented Apr 1, 2026

Did not read the contributing guidelines.

@kivikakk kivikakk closed this Apr 1, 2026
Author

sandmor commented Apr 2, 2026

But I read them?

What is the issue? Note that for contribution guidelines I only found the contributing section in the README, as pointed to by the contributing file.

Owner

kivikakk commented Apr 2, 2026

The PR body is LLM-generated.

Author

sandmor commented Apr 2, 2026

The tests are, indeed, as I don't usually find writing text to lock in behavior very gratifying.

The fix itself, though, is a heavily modified version of an initial LLM draft. In that draft the LLM was trying to use the same logic as the ordered-list case, where it would check for something like a fence using begin_content.

The fix itself is quite simple: it adds an escape when a fence appears at the start of a line, with a check for up to three leading spaces, since Markdown generally does not treat anything indented further than that as a block starter.

If even the tests, or using an LLM to sound off problems, are off-limits, I can send a smaller PR containing only minimal changes if preferred. If you would prefer not to include the leading-spaces check for block starters, that can be changed too.

Owner

kivikakk commented Apr 2, 2026

Thanks for your response. To be super clear, the contributing part of the README has this subsection:

Policy on LLM contributions

This policy is based on and inspired by Servo's AI contributions policy (permalink) and QEMU's Use of AI-generated content policy. Please see those links for a more detailed analysis.

Contributions must not include content generated by large language models or other probabilistic tools, including but not limited to Copilot or ChatGPT. This policy covers code, documentation, pull requests, issues, comments, and any other contributions to Comrak.

I certainly don't mind you using LLMs to sound off problems; it's just that all the actual code, tests and documentation submitted to Comrak itself needs to be human-authored. I ask that (per above) the pull requests/issues/comments themselves be so too [1]. This is why I was quite terse in my above comments.

It's worth noting this is a bug in CommonMark roundtripping even with non-synthetic paragraphs:

$ printf '\~~~' | comrak
<p>~~~</p>
$ printf '\~~~' | comrak -t commonmark
~~~
$ printf '\~~~' | comrak -t commonmark | comrak
<pre style="background-color:#2b303b;"><code></code></pre>
$

It's not a problem with \``` because we escape all backticks by default:

$ printf '\```' | comrak
<p>```</p>
$ printf '\```' | comrak -t commonmark
\`\`\`
$ printf '\```' | comrak -t commonmark | comrak
<p>```</p>
$

(This approach and its limitations are inherited from cmark, though there's nothing binding us to them.)

A simpler solution might be to escape ~ and : whenever outputting CommonMark. (--experimental-minimize-commonmark does help somewhat with reducing the number of escapes again without affecting the output, but is very expensive.)
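
For comparison, the blanket approach suggested here can be sketched in a few lines. This is an illustrative assumption, not Comrak's formatter code; `escape_tilde_colon` is a hypothetical name, and the sketch ignores contexts (such as code spans) where a real formatter would not escape at all.

```rust
/// Hypothetical sketch of unconditional escaping: every `~` and `:` in the
/// text gets a backslash, regardless of where it appears on the line.
fn escape_tilde_colon(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if ch == '~' || ch == ':' {
            // Both are ASCII punctuation, so a backslash escape is valid
            // CommonMark and renders as the literal character.
            out.push('\\');
        }
        out.push(ch);
    }
    out
}
```

The trade-off is noisier output: `~~~` becomes `\~\~\~` and `a: b` becomes `a\: b`, even though the second never needed escaping. In exchange, no position or indentation analysis is required.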

At any rate, I am happy to accept a completely human-authored PR that improves upon this situation to any degree 🤍

Footnotes

  1. Note the broken formatting in your PR summary, as LLMs do not tend to produce compliant CommonMark by default.
