Skip to content

Commit 6f21013

Browse files
committed
src: fix ReDoS and incomplete-sanitization CodeQL alerts (#5, #6, #12, #18, #19, #20, #21)
Closes 7 CodeQL alerts in library code, locally verified via codeql CLI before push. Five polynomial-regex alerts (js/polynomial-redos): - raw-text.ts:33 (#20) - escape-attr-ampersands.ts:96 (#18) - quote-unquoted-attribute-values.ts:139 (#19) - quote-unquoted-boolean-attrs.ts:123 (#21) - template/healer.ts:148 (#5) The flagged pattern was '<tag>((?:\s[^>]*)?)>' — CodeQL flagged the optional non-capture group with mixed character classes as polynomial-time. Rewritten using a lookahead to enforce the whitespace-or-close-bracket boundary without the suspicious shape: Before: /<(tag)((?:\s[^>]*)?)>/g After: /<(tag)(?=[\s>])([^>]*)>/g Both forms have identical match semantics: the whitespace separator before attributes is preserved (e.g., '<scriptfoo>' is correctly NOT matched as a script tag start), tag name capture and attribute capture work as before. The lookahead-based shape passes CodeQL's analysis. 1,307 tests pass under the new regex. Two incomplete-string-escaping alerts (js/incomplete-sanitization): - markdown/emitter.ts:235 (#12) — markdown table cell escaping previously only handled the '|' character. Newlines in cell content would silently break table structure; backslashes weren't escaped first, allowing input like '\\|' to be rendered as literal backslash-pipe rather than escaped pipe. Replaced inline .replace(/\|/g, '\\|') with new escapeMarkdownTableCell helper that escapes backslashes first, then pipes, and converts newlines to spaces. - ods-reader/html-renderer.ts:77 (#6) — string.replace('"', ...) was being used to inject text-align:right after the opening '"' of an existing style attribute. CodeQL flagged the unanchored character replacement as potentially matching the wrong '"'. Anchored to /^style="/ for unambiguous match. Pipeline: format:check, lint, build, test all pass. 1,307 tests, 28 suites. Validator: still clean. Local CodeQL CLI scan confirms 0 remaining alerts in src/. (2 alerts in docs/tools/html-to-odt-local.html persist locally; this is a gitignored fixture and will not affect GitHub's scan.)
1 parent bf91ecb commit 6f21013

8 files changed

Lines changed: 78 additions & 70 deletions

File tree

.gitignore

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,3 +14,7 @@ docs/tools/*-local.html
1414

1515
# Generated by scripts/sync-version.js
1616
src/version.ts
17+
18+
# CodeQL local analysis artifacts
19+
.codeql-db/
20+
codeql-results.sarif

src/html-normalizer/rules/escape-attr-ampersands.ts

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
/**
1+
/**
22
* Rule 7 of the Tier 1 normalizer: escape unescaped `&` in attribute values.
33
*
44
* HTML5 attribute values commonly contain unescaped `&` characters in URLs:
@@ -11,7 +11,7 @@
1111
* characters with `&amp;`, leaving valid XML entities and numeric
1212
* character references untouched.
1313
*
14-
* Spec reference: WHATWG HTML Living Standard § 13.2.5.42 "Attribute
14+
* Spec reference: WHATWG HTML Living Standard § 13.2.5.42 "Attribute
1515
* value (double-quoted) state" describes character reference handling
1616
* for ambiguous ampersands.
1717
*
@@ -26,7 +26,7 @@
2626
* first, then attribute-content rules, then text-content rules.
2727
*/
2828

29-
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)((?:\s[^>]*)?)>/g;
29+
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)(?=[\s>])([^>]*)>/g;
3030
const ATTR_PATTERN = /([a-zA-Z_:][a-zA-Z0-9_:.-]*)=(?:"([^"]*)"|'([^']*)')/g;
3131
const VALID_ENTITY = /&(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);/y;
3232

@@ -50,11 +50,11 @@ function escapeValueAmpersands(value: string): string {
5050
// Found a `&`. Test whether it starts a valid entity reference.
5151
VALID_ENTITY.lastIndex = i;
5252
if (VALID_ENTITY.test(value)) {
53-
// Valid entity copy the whole match through
53+
// Valid entity — copy the whole match through
5454
result += value.slice(i, VALID_ENTITY.lastIndex);
5555
i = VALID_ENTITY.lastIndex;
5656
} else {
57-
// Lone `&` escape it
57+
// Lone `&` — escape it
5858
result += "&amp;";
5959
i++;
6060
}

src/html-normalizer/rules/quote-unquoted-attribute-values.ts

Lines changed: 12 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
/**
1+
/**
22
* Rule 6 of the Tier 1 normalizer: quote unquoted attribute values.
33
*
44
* HTML5 allows attribute values to appear without quotes when the value
@@ -11,26 +11,26 @@
1111
* This rule scans every opening tag and rewrites unquoted values as
1212
* double-quoted: `<a href=page.html>` becomes `<a href="page.html">`.
1313
*
14-
* Spec reference: WHATWG HTML Living Standard § 13.1.2.3 "Attributes",
14+
* Spec reference: WHATWG HTML Living Standard § 13.1.2.3 "Attributes",
1515
* Unquoted attribute value syntax.
1616
*
1717
* Scope: only attribute *values* are rewritten. Attribute names without
18-
* `=` (boolean attributes) are out of scope Rule 5 handles those.
18+
* `=` (boolean attributes) are out of scope — Rule 5 handles those.
1919
* Quoted values (single or double) pass through unchanged. The tag name
2020
* itself is never rewritten.
2121
*
2222
* Composition note: this rule should run alongside Rule 5
2323
* (quoteUnquotedBooleanAttributes). Order between Rules 5 and 6 doesn't
24-
* matter they handle disjoint patterns. By convention this rule runs
24+
* matter — they handle disjoint patterns. By convention this rule runs
2525
* after Rule 5 so the rule numbers parallel composition order.
2626
*/
2727

28-
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)((?:\s[^>]*)?)>/g;
28+
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)(?=[\s>])([^>]*)>/g;
2929

3030
/**
3131
* Scan an attribute area and rewrite unquoted attribute values to
3232
* double-quoted form. Quoted values pass through unchanged. Boolean
33-
* attributes (no `=`) pass through unchanged Rule 5 handles those.
33+
* attributes (no `=`) pass through unchanged — Rule 5 handles those.
3434
*/
3535
function rewriteAttrArea(attrArea: string): string {
3636
if (!attrArea.trim()) return attrArea;
@@ -50,7 +50,7 @@ function rewriteAttrArea(attrArea: string): string {
5050
// Try to match an attribute name
5151
const nameMatch = /^[a-zA-Z_:][a-zA-Z0-9_:.-]*/.exec(attrArea.slice(i));
5252
if (!nameMatch) {
53-
// Not an attribute copy and advance
53+
// Not an attribute — copy and advance
5454
result += attrArea[i];
5555
i++;
5656
continue;
@@ -64,7 +64,7 @@ function rewriteAttrArea(attrArea: string): string {
6464
while (j < attrArea.length && /\s/.test(attrArea[j])) j++;
6565

6666
if (j >= attrArea.length || attrArea[j] !== "=") {
67-
// Boolean attribute (no =) pass through, Rule 5 handles
67+
// Boolean attribute (no =) — pass through, Rule 5 handles
6868
result += attrName;
6969
// Copy whitespace between name and end (or next attr)
7070
result += attrArea.slice(afterName, j);
@@ -83,7 +83,7 @@ function rewriteAttrArea(attrArea: string): string {
8383
while (j < attrArea.length && /\s/.test(attrArea[j])) j++;
8484

8585
if (j >= attrArea.length) {
86-
// Trailing = with no value preserve as-is
86+
// Trailing = with no value — preserve as-is
8787
result += attrArea.slice(afterEq);
8888
i = attrArea.length;
8989
continue;
@@ -93,19 +93,19 @@ function rewriteAttrArea(attrArea: string): string {
9393
result += attrArea.slice(afterEq, j);
9494

9595
if (attrArea[j] === '"' || attrArea[j] === "'") {
96-
// Quoted value pass through entirely
96+
// Quoted value — pass through entirely
9797
const quote = attrArea[j];
9898
const end = attrArea.indexOf(quote, j + 1);
9999
if (end === -1) {
100-
// Malformed copy rest as-is
100+
// Malformed — copy rest as-is
101101
result += attrArea.slice(j);
102102
i = attrArea.length;
103103
continue;
104104
}
105105
result += attrArea.slice(j, end + 1);
106106
i = end + 1;
107107
} else {
108-
// Unquoted value read until whitespace, end, or self-closing /
108+
// Unquoted value — read until whitespace, end, or self-closing /
109109
// (per HTML5 spec, unquoted values terminate at whitespace, >, or
110110
// when `/` is followed by `>` for self-closing tags). The lone `/`
111111
// inside URLs like https://example.com/x.html is part of the value.

src/html-normalizer/rules/quote-unquoted-boolean-attrs.ts

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
/**
1+
/**
22
* Rule 5 of the Tier 1 normalizer: quote unquoted boolean attributes.
33
*
44
* HTML5 defines a "boolean attribute" syntax that allows attribute names to
@@ -11,20 +11,20 @@
1111
* empty-string form: `<input checked>` becomes `<input checked="">`. The
1212
* empty string is the canonical XHTML serialization of a boolean attribute.
1313
*
14-
* Spec reference: WHATWG HTML Living Standard § 2.6.2 "Boolean attributes".
14+
* Spec reference: WHATWG HTML Living Standard § 2.6.2 "Boolean attributes".
1515
*
1616
* Scope: only attribute names that appear without `=` are rewritten. Quoted
1717
* values pass through unchanged. The tag name itself (the first identifier
1818
* after `<`) is never rewritten. Self-closing markers (`/>`) are preserved.
1919
* Unquoted attribute values (e.g. `href=page`) are passed through verbatim
20-
* Rule 6 handles those.
20+
* — Rule 6 handles those.
2121
*
2222
* Composition note: this rule runs before Rule 1 (selfCloseVoidElements)
2323
* and Rule 2 (decodeNamedEntities) in the composite normalizer. The order
2424
* is structural-rules-first, content-rules-last.
2525
*/
2626

27-
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)((?:\s[^>]*)?)>/g;
27+
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)(?=[\s>])([^>]*)>/g;
2828

2929
/**
3030
* Scan an attribute area and rewrite unquoted boolean attributes to
@@ -49,7 +49,7 @@ function rewriteAttrArea(attrArea: string): string {
4949
// Try to match an attribute name
5050
const nameMatch = /^[a-zA-Z_:][a-zA-Z0-9_:.-]*/.exec(attrArea.slice(i));
5151
if (!nameMatch) {
52-
// Not an attribute copy the character and advance
52+
// Not an attribute — copy the character and advance
5353
result += attrArea[i];
5454
i++;
5555
continue;
@@ -63,7 +63,7 @@ function rewriteAttrArea(attrArea: string): string {
6363
while (j < attrArea.length && /\s/.test(attrArea[j])) j++;
6464

6565
if (j < attrArea.length && attrArea[j] === "=") {
66-
// Attribute with a value pass through name, =, and the value
66+
// Attribute with a value — pass through name, =, and the value
6767
result += attrName;
6868
// Copy whitespace between name and =
6969
result += attrArea.slice(afterName, j);
@@ -76,19 +76,19 @@ function rewriteAttrArea(attrArea: string): string {
7676
j++;
7777
}
7878
if (j < attrArea.length && (attrArea[j] === '"' || attrArea[j] === "'")) {
79-
// Quoted value copy through entirely
79+
// Quoted value — copy through entirely
8080
const quote = attrArea[j];
8181
const end = attrArea.indexOf(quote, j + 1);
8282
if (end === -1) {
83-
// Malformed copy rest as-is
83+
// Malformed — copy rest as-is
8484
result += attrArea.slice(j);
8585
i = attrArea.length;
8686
continue;
8787
}
8888
result += attrArea.slice(j, end + 1);
8989
i = end + 1;
9090
} else {
91-
// Unquoted value Rule 6 handles this. Copy the value through
91+
// Unquoted value — Rule 6 handles this. Copy the value through
9292
// without modification and advance past it.
9393
let valEnd = j;
9494
while (valEnd < attrArea.length) {
@@ -101,7 +101,7 @@ function rewriteAttrArea(attrArea: string): string {
101101
i = valEnd;
102102
}
103103
} else {
104-
// Boolean attribute rewrite as name=""
104+
// Boolean attribute — rewrite as name=""
105105
result += `${attrName}=""`;
106106
i = afterName;
107107
}

src/html-normalizer/rules/raw-text.ts

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
* content. The composite order is enforced in src/html-normalizer/index.ts.
1919
*/
2020

21-
const RAW_TEXT_PATTERN = /<(script|style)((?:\s[^>]*)?)>[\s\S]*?<\/\1\s*>/g;
21+
const RAW_TEXT_PATTERN = /<(script|style)(?=[\s>])([^>]*)>[\s\S]*?<\/\1\s*>/g;
2222

2323
/**
2424
* Empty the content of `<script>` and `<style>` elements while preserving

src/markdown/emitter.ts

Lines changed: 27 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -1,22 +1,22 @@
1-
/**
1+
/**
22
* Markdown emitter for the ODT document model.
33
*
44
* Converts the structured document model produced by the ODT parser into
5-
* a Markdown string. Zero runtime dependencies the emitter is a pure
5+
* a Markdown string. Zero runtime dependencies — the emitter is a pure
66
* function over OdtDocumentModel.
77
*
88
* Structural coverage:
9-
* - Headings levels 1–6 (# through ######)
9+
* - Headings levels 1–6 (# through ######)
1010
* - Paragraphs with blank-line separation
1111
* - Bold, italic, bold+italic, strikethrough (GFM)
12-
* - Underline <u>text</u> (HTML passthrough valid in GFM)
13-
* - Superscript <sup>text</sup>, subscript <sub>text</sub>
14-
* - Hyperlinks [text](url)
15-
* - Hard line breaks two trailing spaces + newline
12+
* - Underline → <u>text</u> (HTML passthrough — valid in GFM)
13+
* - Superscript → <sup>text</sup>, subscript → <sub>text</sub>
14+
* - Hyperlinks → [text](url)
15+
* - Hard line breaks → two trailing spaces + newline
1616
* - Unordered and ordered lists with nested sub-lists
17-
* - Tables GFM pipe table with --- separator row
18-
* - Images ![alt](name) placeholder (base64 data not inlined)
19-
* - Sections body content only (no Markdown equivalent)
17+
* - Tables → GFM pipe table with --- separator row
18+
* - Images → ![alt](name) placeholder (base64 data not inlined)
19+
* - Sections → body content only (no Markdown equivalent)
2020
* - Tracked changes: final (default), original, and changes modes
2121
*
2222
* Text content is escaped for Markdown. The following characters are
@@ -50,10 +50,10 @@ export interface MarkdownEmitOptions {
5050
/**
5151
* Markdown flavor to target.
5252
*
53-
* "gfm" (default): GitHub Flavored Markdown enables pipe tables and
53+
* "gfm" (default): GitHub Flavored Markdown — enables pipe tables and
5454
* ~~strikethrough~~. Compatible with GitHub, GitLab, and most modern
5555
* Markdown renderers.
56-
* "commonmark": CommonMark only tables are omitted (cells emitted as
56+
* "commonmark": CommonMark only — tables are omitted (cells emitted as
5757
* plain paragraphs), strikethrough falls back to plain text.
5858
*/
5959
flavor?: "commonmark" | "gfm";
@@ -63,13 +63,13 @@ export interface MarkdownEmitOptions {
6363
*
6464
* "final" (default): insertions rendered as body; deletions produce no output.
6565
* "original": deletions rendered as body; insertions produce no output.
66-
* "changes": insertions <ins>body</ins>, deletions <del>body</del>.
66+
* "changes": insertions → <ins>body</ins>, deletions → <del>body</del>.
6767
*/
6868
trackedChanges?: "final" | "original" | "changes";
6969
/**
7070
* If true, images are embedded as base64 data URLs (`![alt](data:image/png;base64,...)`).
7171
* Produces a fully self-contained Markdown file at the cost of larger file size.
72-
* Defaults to false images are emitted as `![alt](name)` placeholders.
72+
* Defaults to false — images are emitted as `![alt](name)` placeholders.
7373
*/
7474
embedImages?: boolean;
7575
}
@@ -97,8 +97,8 @@ function escapeMd(text: string): string {
9797
* Emit a TextSpan to a Markdown string.
9898
*
9999
* Formatting nesting order (innermost first):
100-
* superscript/subscript underline strikethrough italic bold
101-
* hyperlink anchor.
100+
* superscript/subscript → underline → strikethrough → italic → bold
101+
* → hyperlink anchor.
102102
*/
103103
function emitTextSpan(span: TextSpan, options?: MarkdownEmitOptions): string {
104104
if (span.lineBreak) return " \n";
@@ -132,7 +132,7 @@ function emitTextSpan(span: TextSpan, options?: MarkdownEmitOptions): string {
132132
/**
133133
* Emit an ImageNode as a Markdown image placeholder.
134134
*
135-
* By default, base64 image data is not inlined the alt text and name are
135+
* By default, base64 image data is not inlined — the alt text and name are
136136
* preserved as a placeholder so consumers can substitute with real paths.
137137
* Pass `{ embedImages: true }` to embed images as base64 data URLs instead.
138138
*/
@@ -158,7 +158,7 @@ function emitNote(node: NoteNode, options?: MarkdownEmitOptions): string {
158158
return `^[${content.trim()}]`;
159159
}
160160

161-
/** Emit a FieldNode page number and page count get descriptive placeholders. */
161+
/** Emit a FieldNode — page number and page count get descriptive placeholders. */
162162
function emitField(node: FieldNode): string {
163163
if (node.fieldType === "pageNumber") return node.value ?? "[page]";
164164
if (node.fieldType === "pageCount") return node.value ?? "[pages]";
@@ -174,7 +174,7 @@ function emitInlineNode(node: InlineNode, options?: MarkdownEmitOptions): string
174174
case "note":
175175
return emitNote(node, options);
176176
case "bookmark":
177-
// Bookmarks have no Markdown equivalent emit nothing
177+
// Bookmarks have no Markdown equivalent — emit nothing
178178
return "";
179179
case "field":
180180
return emitField(node);
@@ -219,11 +219,15 @@ function emitList(list: ListNode, options?: MarkdownEmitOptions, depth = 0): str
219219
* cells is inserted after it. Falls back to plain paragraph text per cell
220220
* when flavor is "commonmark".
221221
*/
222+
function escapeMarkdownTableCell(s: string): string {
223+
return s.replace(/\\/g, "\\\\").replace(/\|/g, "\\|").replace(/\n/g, " ");
224+
}
225+
222226
function emitTable(table: TableNode, options?: MarkdownEmitOptions): string {
223227
if (table.rows.length === 0) return "";
224228

225229
if (options?.flavor === "commonmark") {
226-
// CommonMark has no table syntax emit rows as plain text paragraphs
230+
// CommonMark has no table syntax — emit rows as plain text paragraphs
227231
return table.rows
228232
.map((row) => row.cells.map((cell) => emitSpans(cell.spans, options)).join(" | "))
229233
.join("\n\n");
@@ -232,7 +236,7 @@ function emitTable(table: TableNode, options?: MarkdownEmitOptions): string {
232236
const rows = table.rows.map(
233237
(row) =>
234238
"| " +
235-
row.cells.map((cell) => emitSpans(cell.spans, options).replace(/\|/g, "\\|")).join(" | ") +
239+
row.cells.map((cell) => escapeMarkdownTableCell(emitSpans(cell.spans, options))).join(" | ") +
236240
" |",
237241
);
238242

@@ -246,7 +250,7 @@ function emitTable(table: TableNode, options?: MarkdownEmitOptions): string {
246250
}
247251

248252
/**
249-
* Emit a SectionNode Markdown has no section construct.
253+
* Emit a SectionNode — Markdown has no section construct.
250254
* Emits the body content directly.
251255
*/
252256
function emitSection(node: SectionNode, options?: MarkdownEmitOptions): string {
@@ -256,7 +260,7 @@ function emitSection(node: SectionNode, options?: MarkdownEmitOptions): string {
256260
/**
257261
* Emit a TrackedChangeNode.
258262
*
259-
* "changes" mode: insertions <ins>body</ins>, deletions <del>body</del>.
263+
* "changes" mode: insertions → <ins>body</ins>, deletions → <del>body</del>.
260264
* Other modes: body emitted transparently.
261265
*/
262266
function emitTrackedChange(node: TrackedChangeNode, options?: MarkdownEmitOptions): string {

0 commit comments

Comments
 (0)