Start making the CST usable in Lute by JohnnyMorganz · Pull Request #71 · luau-lang/lute

JohnnyMorganz · 2025-02-18T20:18:03Z

Now that CST nodes exist in Luau, this PR extends the existing @lute/luau C++ module to return full-fidelity CST nodes to Luau, rather than just AST nodes. The aim is to support future uses of the CST for tooling such as code modding, linting, or formatting implemented fully in Luau.

We extend the serialization of some AST nodes with extra tokens, and add tests that check the validity of tokenization and the roundtrippability of the serialized state in Luau (i.e., print(parse(source)) == source, where print is fully defined in Luau). Support for further AST nodes will be added in follow-up commits.

Alongside the CST data, a key concept that this PR introduces is the idea of trivia. Right now, the Luau CST does not preserve trivia itself, but retains enough information to compute spans of trivia, allowing later lookup in the original source code. To extract trivia, we keep the original input buffer around, and then take the span between the last token and the next token, and call this "trivia". Trivia is then individually tokenized and attached to "Token" nodes.

There are some key rules for trivia:

- Trivia consists of whitespace, single-line comments and multi-line comments. Each kind is a separate trivia token.
- Whitespace trivia is split into separate tokens at each '\n' character: other whitespace characters are joined together up to and including the next newline character. For example, <space><space>\n\t\t\n\n is split into [<space><space>\n, \t\t\n, \n].
- Trivia on a token is split into leading and trailing trivia, following the Roslyn compiler's trivia rules: trailing trivia consists of everything up to and including the first \n newline character, and all further trivia is leading trivia of the next token. The idea is that local x = ...\n can then be represented as a single line including its trivia, which makes replacement easier (rather than the \n being part of the next statement).

The last rule is not yet implemented but will be introduced in a follow-up commit.
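To make the third rule concrete, here is a minimal sketch of the leading/trailing split. It is not code from this PR (the rule is not implemented yet). `TriviaSplit` and `splitTrivia` are hypothetical names, and the sketch assumes the trivia between two tokens has already been tokenized so that only whitespace pieces can end in `\n`.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Hypothetical result type: trivia that stays with the previous token,
// and trivia that moves to the next token.
struct TriviaSplit
{
    std::vector<std::string_view> trailing;
    std::vector<std::string_view> leading;
};

// Splits the trivia found between two tokens. Everything up to and including
// the first piece that ends in '\n' trails the previous token; the remainder
// leads the next token.
TriviaSplit splitTrivia(const std::vector<std::string_view>& pieces)
{
    TriviaSplit result;
    bool seenNewline = false;

    for (std::string_view piece : pieces)
    {
        assert(!piece.empty()); // assumption: the tokenizer never emits empty trivia

        if (seenNewline)
        {
            result.leading.push_back(piece);
            continue;
        }

        result.trailing.push_back(piece);
        if (piece.back() == '\n')
            seenNewline = true;
    }

    return result;
}
```

For the trivia between `local x = 1` and a following statement, written as `[" ", "-- one", "\n", "\n", "    "]`, the first three pieces would trail the `1` token and the last two would lead the next token, so the whole `local` statement can be replaced as one line.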

The API remains experimental and is subject to change. In particular, I'm not fully sold on the naming scheme for nodes and fields (particularly the fields that match keyword names), and it may require some standardisation. Also, we are somewhat mixing AST nodes with CST nodes, and we may want to separate the two.

Create an identity codemod that should, in theory, return its input unchanged (showing roundtrippability of parsing + printing). This is enough to get "b.luau" parsing and printing (currently without whitespace trivia).

Introduces the necessary steps to extract and store trivia on tokens. The serialization step now also stores the underlying source and a cursor through the source. When we serialize a token, we take everything between the cursor position and the token start position and extract that as trivia. To speed up offset computation, we compute the line offsets for the file beforehand. We convert AstExprConstantString into a style of token.

Right now, this serializes trivia into one big chunk, and it does not split trivia into leading and trailing trivia. We will probably want to split trivia into whitespace + comments. We should follow the Roslyn syntax rules, where everything up to and including a newline character is treated as trailing trivia for the previous token, and everything after that is leading trivia for the next token. This makes code modding nicer, as we can treat nodes as discrete chunks.
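The cursor-based extraction described in this commit message can be sketched roughly as follows. This is an illustration, not the code in luau.cpp: `Position`, `TriviaCursor`, `computeLineOffsets` and `takeTriviaBefore` are hypothetical names, and positions are assumed to be zero-based line/column pairs with byte columns.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical position type: zero-based line and column.
struct Position
{
    std::size_t line;
    std::size_t column;
};

// Byte offset at which each line starts, computed once per file so that
// converting a position to an offset is a single lookup.
std::vector<std::size_t> computeLineOffsets(std::string_view source)
{
    std::vector<std::size_t> offsets{0};
    for (std::size_t i = 0; i < source.size(); ++i)
        if (source[i] == '\n')
            offsets.push_back(i + 1);
    return offsets;
}

struct TriviaCursor
{
    std::string_view source;
    std::vector<std::size_t> lineOffsets;
    std::size_t cursor = 0; // end offset of the last serialized token

    explicit TriviaCursor(std::string_view src)
        : source(src)
        , lineOffsets(computeLineOffsets(src))
    {
    }

    std::size_t toOffset(Position position) const
    {
        assert(position.line < lineOffsets.size());
        return lineOffsets[position.line] + position.column;
    }

    // Returns everything between the cursor and the token start as trivia,
    // then advances the cursor to the token end. Tokens must be passed in
    // source order, otherwise the assertion fires.
    std::string_view takeTriviaBefore(Position tokenBegin, Position tokenEnd)
    {
        std::size_t start = toOffset(tokenBegin);
        std::size_t stop = toOffset(tokenEnd);
        assert(cursor <= start && start <= stop && stop <= source.size());

        std::string_view trivia = source.substr(cursor, start - cursor);
        cursor = stop;
        return trivia;
    }
};
```

With the source `local x = 1 -- one\nreturn x`, serializing the `1` token and then the `return` token would yield the trivia ` -- one\n` for `return`. Because the result is a `std::string_view` into the original buffer, the buffer has to outlive the serialized trivia.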

We need to be careful with the order of serialization, as otherwise it can mess up the cursor and trivia tracking. This commit reorganises some of the serialization so it is in line.

TODO: The serialization of commas is wrong, since we do it too late. This will cause an assertion failure once we hit a comma-separated example.

This better handles serialization of separators and related trivia
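One way to picture a punctuated sequence is as a list of (item, separator) pairs that is built and visited in source order, so that each comma is serialized right after the item it follows rather than too late. The sketch below is an assumption about the general shape, not the structure used in this PR; `Token`, `Pair`, `makePunctuated` and `printPunctuated` are hypothetical names, and items are simplified to single tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical token: its text plus the trivia that preceded it in the source.
struct Token
{
    std::string leadingTrivia;
    std::string text;
};

// One element of a punctuated sequence: an item and the separator after it, if any.
struct Pair
{
    Token item;
    std::optional<Token> separator;
};

// Interleaves items with their separators so they can be handled in source order.
std::vector<Pair> makePunctuated(const std::vector<Token>& items, const std::vector<Token>& separators)
{
    // Either every item except the last has a separator, or there is a trailing one.
    assert(separators.size() == items.size() || separators.size() + 1 == items.size());

    std::vector<Pair> pairs;
    for (std::size_t i = 0; i < items.size(); ++i)
    {
        Pair pair{items[i], std::nullopt};
        if (i < separators.size())
            pair.separator = separators[i];
        pairs.push_back(pair);
    }
    return pairs;
}

// Prints item, separator, item, separator, ... exactly as they appeared.
std::string printPunctuated(const std::vector<Pair>& pairs)
{
    std::string out;
    for (const Pair& pair : pairs)
    {
        out += pair.item.leadingTrivia + pair.item.text;
        if (pair.separator)
            out += pair.separator->leadingTrivia + pair.separator->text;
    }
    return out;
}
```

Keeping the separator next to its item means the trivia around each comma is owned by a specific token, which matters once a cursor through the source is used to extract trivia.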

We can also reuse a visitor to print a node, by just printing all tokens
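The idea of printing by visiting every token can be sketched as below. In the PR the printer is written in Luau; this C++ version only illustrates the technique. `Node`, `Visitor`, `Printer`, `walk`, `leaf` and `printTree` are hypothetical names, tokens carry only leading trivia (matching the state of the PR at this point), and an end-of-file token with empty text is assumed to hold any trivia after the last real token.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct Token
{
    std::string leadingTrivia;
    std::string text;
};

// Hypothetical tree shape: a node is either a leaf holding one token,
// or an interior node holding children in source order.
struct Node
{
    std::optional<Token> token;
    std::vector<Node> children;
};

struct Visitor
{
    virtual ~Visitor() = default;
    virtual void visitToken(const Token& token) = 0;
};

// Depth-first walk that reports tokens in source order.
void walk(const Node& node, Visitor& visitor)
{
    assert(!(node.token && !node.children.empty())); // leaf or interior, not both

    if (node.token)
        visitor.visitToken(*node.token);
    for (const Node& child : node.children)
        walk(child, visitor);
}

// Printing is just a visitor that appends each token's trivia and text.
struct Printer final : Visitor
{
    std::string out;

    void visitToken(const Token& token) override
    {
        out += token.leadingTrivia;
        out += token.text;
    }
};

std::string printTree(const Node& root)
{
    Printer printer;
    walk(root, printer);
    return printer.out;
}

Node leaf(std::string leadingTrivia, std::string text)
{
    return Node{Token{std::move(leadingTrivia), std::move(text)}, {}};
}
```

If every byte of the source ends up in exactly one token's text or trivia, `printTree` of the parsed tree reproduces the input, which is the roundtrip property the tests in this PR check.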

Split trivia into smaller tokens. Also, use string_views everywhere rather than copies of strings where possible.

Still TODO:
- Split whitespace into smaller parts (whitespace should be separated by '\n' characters; a block of whitespace is up to and including the first '\n' character)
- Split between leading trivia and trailing trivia
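A rough sketch of tokenizing a trivia span into whitespace and comment tokens with `std::string_view` follows. It is not the PR's implementation: `TriviaKind`, `Trivia` and `tokenizeTrivia` are hypothetical names, whitespace is still kept as one run (the newline split is listed as TODO at this point), and a single-line comment is assumed to end just before its newline.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

enum class TriviaKind { Whitespace, SingleLineComment, MultiLineComment };

struct Trivia
{
    TriviaKind kind;
    std::string_view text; // view into the original source, no copy
};

static bool isSpace(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Returns the number of '=' in a long bracket opener such as [[ or [==[ at pos, or -1.
static int longBracketLevel(std::string_view s, std::size_t pos)
{
    if (pos >= s.size() || s[pos] != '[')
        return -1;
    std::size_t i = pos + 1;
    while (i < s.size() && s[i] == '=')
        ++i;
    return (i < s.size() && s[i] == '[') ? static_cast<int>(i - pos - 1) : -1;
}

std::vector<Trivia> tokenizeTrivia(std::string_view s)
{
    std::vector<Trivia> out;
    std::size_t pos = 0;

    while (pos < s.size())
    {
        std::size_t start = pos;

        if (isSpace(s[pos]))
        {
            while (pos < s.size() && isSpace(s[pos]))
                ++pos;
            out.push_back({TriviaKind::Whitespace, s.substr(start, pos - start)});
            continue;
        }

        // Between two tokens, anything that is not whitespace must be a comment.
        assert(s.substr(pos, 2) == "--");
        pos += 2;

        int level = longBracketLevel(s, pos);
        if (level >= 0)
        {
            std::size_t count = static_cast<std::size_t>(level);
            std::string closer = "]" + std::string(count, '=') + "]";
            std::size_t close = s.find(closer, pos + count + 2);
            // An unterminated comment runs to the end of the span.
            pos = (close == std::string_view::npos) ? s.size() : close + closer.size();
            out.push_back({TriviaKind::MultiLineComment, s.substr(start, pos - start)});
        }
        else
        {
            std::size_t newline = s.find('\n', pos);
            pos = (newline == std::string_view::npos) ? s.size() : newline;
            out.push_back({TriviaKind::SingleLineComment, s.substr(start, pos - start)});
        }
    }

    return out;
}
```

Each `Trivia::text` points into the span that was passed in, so no strings are copied; the caller has to keep the source buffer alive for as long as the trivia tokens are used.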

Rather than whitespace being a single chunk of characters, we split it into smaller parts delimited by the '\n' newline character. The purpose of this is to allow us to later separate trivia into leading and trailing trivia, following the Roslyn trivia rules.
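The newline-delimited split can be sketched as a small helper. `splitWhitespace` is a hypothetical name, and the sketch assumes it is only ever given a run of whitespace characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Splits a run of whitespace into pieces that each end at, and include, a '\n'.
// Whitespace after the final newline forms one last piece without a newline.
std::vector<std::string_view> splitWhitespace(std::string_view whitespace)
{
    // Assumption: the caller passes whitespace only.
    assert(whitespace.find_first_not_of(" \t\r\n") == std::string_view::npos);

    std::vector<std::string_view> pieces;
    while (!whitespace.empty())
    {
        std::size_t newline = whitespace.find('\n');
        std::size_t length = (newline == std::string_view::npos) ? whitespace.size() : newline + 1;

        pieces.push_back(whitespace.substr(0, length));
        whitespace.remove_prefix(length);
    }
    return pieces;
}
```

Applied to the example from the PR description, `<space><space>\n\t\t\n\n`, this yields the three pieces `<space><space>\n`, `\t\t\n` and `\n`.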

we are creating a wrapper around transforming syntax, but not necessarily a code modder

aatxe · 2025-04-21T22:52:06Z

luau/src/luau.cpp

            luaL_error(L, "encountered illegal operator: Op__Count");
-        }
+
+        lua_pushstring(L, Luau::toString(op).data());



JohnnyMorganz force-pushed the cst branch from 5d274e3 to b86761e on February 25, 2025 17:34

JohnnyMorganz added 9 commits March 1, 2025 16:19

initial example of code mod

243b667

Split out library code into separate files

cefec9a

Serialize punctuated sequences into separate structure

55af520

Handle AstExprLocal and AstExprIndexName

dda45eb

Handle AstExprBinary

27ac6bf

Move to visitor mechanism

2b7e4ca

move from std to batteries

5955ff0

JohnnyMorganz force-pushed the cst branch from b86761e to 5100827 on March 1, 2025 15:21

JohnnyMorganz force-pushed the cst branch from 5100827 to f290be1 on March 1, 2025 16:25

Simplify op to string

c33bb44

JohnnyMorganz changed the title ~~Code modding~~ Make the CST usable in Lute Mar 4, 2025

JohnnyMorganz added 4 commits March 4, 2025 21:15

Merge branch 'primary' of https://github.com/aatxe/lute into cst

c0c28f0

codemod -> syntax

af20801

Add tests for locations

f52c277

JohnnyMorganz changed the title ~~Make the CST usable in Lute~~ Start making the CST usable in Lute Mar 4, 2025

JohnnyMorganz marked this pull request as ready for review March 4, 2025 21:02

JohnnyMorganz mentioned this pull request Mar 8, 2025

Attach trailing trivia to tokens #102

Merged

aatxe approved these changes Apr 21, 2025

aatxe merged commit bc7d3df into luau-lang:primary Apr 21, 2025
3 checks passed

aatxe merged 15 commits into luau-lang:primary from JohnnyMorganz:cst
