Fix bug where parser misidentifies variables as keywords if the keyword is a prefix of the variable name#1639
Conversation
Approach: parse all tokens without looking for keywords. In a second pass look for exact matches in the keywords map
There was a problem hiding this comment.
Pull request overview
This PR fixes a lexer bug where identifiers were incorrectly tokenized as keywords when a keyword was a prefix of the identifier (e.g., as_X). It does this by lexing keywords as identifiers first, then reclassifying only exact keyword matches in a post-pass.
Changes:
- Remove explicit keyword lexing rules and introduce a keyword reclassification pass for
IDENTIFIERtokens based on exact text match. - Add new ZKC unit test sources covering keywords at the beginning/end/middle of identifiers.
- Register the new unit tests in the Go test suite.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
pkg/zkc/compiler/parser/lexer.go |
Switches keyword recognition to a post-lexing reclassification step to avoid prefix-matching issues. |
testdata/zkc/unit/basic_17.zkc |
Adds coverage for identifiers starting with keywords. |
testdata/zkc/unit/basic_18.zkc |
Adds coverage for identifiers ending with keywords. |
testdata/zkc/unit/basic_19.zkc |
Adds coverage for keywords appearing in the middle of identifiers. |
pkg/test/zkc_unit_test.go |
Adds Go test entries to run the new unit fixtures. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Reclassify identifiers whose full text is an exact keyword match. | ||
| contents := srcfile.Contents() | ||
|
|
||
| for i, tok := range tokens { | ||
| if tok.Kind == IDENTIFIER { | ||
| text := string(contents[tok.Span.Start():tok.Span.End()]) | ||
| if kind, ok := keywords[text]; ok { | ||
| tokens[i].Kind = kind | ||
| } |
There was a problem hiding this comment.
Reclassifying identifiers currently converts every IDENTIFIER token’s rune slice into a string (string(contents[start:end])). This allocates and UTF-8 encodes even for non-keywords, which can be a noticeable lexer performance regression on large sources. Consider avoiding the conversion unless the span length could match a keyword (e.g., early-continue when span length > max keyword length), or compare against keywords without allocating (e.g., rune-slice comparisons or a small switch).
…ed-in-the-name-of-a-variable
Initial attempt (adding a
whitespaceto the keyword pattern) was flawed: it would have misidentified e.g.if(condition == 1). Claude made it robust. The approach: parse all tokens without looking for keywords. In a second pass look for exact matches in the keywords mapNote
Medium Risk
Adjusts core lexer behavior for all ZKC source files; while the change is targeted, any keyword/identifier edge-case could affect parsing across the language.
Overview
Fixes a lexer bug where identifiers could be mis-tokenized as keywords when a keyword appears as a prefix of the identifier.
Keyword matching is removed from the primary lexing rules and replaced with a post-pass in
Lex()that reclassifiesIDENTIFIERtokens to keyword kinds only on exact text match via akeywordsmap.Adds new unit tests (
basic_17–basic_19) covering keywords at the beginning/end/middle of variable names to prevent regressions.Reviewed by Cursor Bugbot for commit 4577fbe. Bugbot is set up for automated code reviews on this repo. Configure here.