Start making the CST usable in Lute #71
Merged
aatxe merged 15 commits into luau-lang:primary on Apr 21, 2025
Create an identity codemod that should, in theory, return its input unchanged (showing that parsing + printing round-trips). This is enough to get "b.luau" parsing and printing (currently without whitespace trivia).
Introduces the necessary steps to extract and store trivia on tokens. The serialization step now also stores the underlying source and a cursor through the source. When we serialize a token, we take everything between the cursor position and the token's start position and extract that as trivia. To speed up offset computation, we compute the line offsets for the file beforehand. We also convert AstExprConstantString into a token-style node. Right now, this serializes trivia as one big chunk, and it does not split trivia into leading and trailing trivia. We will probably want to split trivia into whitespace + comments. We should follow the Roslyn syntax rules, where everything up to and including a newline character is treated as trailing trivia for the previous token, and everything after that is leading trivia for the next token. This makes code modding nicer, as we can treat nodes as discrete chunks.
We need to be careful with the order of serialization, as otherwise it can mess up the cursor and trivia tracking. This commit reorganises some of the serialization so that it stays in order. TODO: The serialization of commas is wrong, since we do it too late. This will cause an assertion failure once we hit an example that is comma-separated.
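The cursor-based extraction described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `Position`, `TriviaCursor`, `computeLineOffsets` and `takeTriviaBefore` are hypothetical names, and the sketch assumes zero-based line/column positions measured in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical position type: zero-based line and column, as a parser might report.
struct Position
{
    std::size_t line;
    std::size_t column;
};

// Precompute the byte offset at which each line starts, so that a
// (line, column) pair can be turned into an offset in constant time.
std::vector<std::size_t> computeLineOffsets(std::string_view source)
{
    std::vector<std::size_t> offsets{0};
    for (std::size_t i = 0; i < source.size(); ++i)
        if (source[i] == '\n')
            offsets.push_back(i + 1);
    return offsets;
}

// Hypothetical serializer state: the source, its line offsets, and a cursor
// marking how much of the source has already been consumed.
struct TriviaCursor
{
    std::string_view source;
    std::vector<std::size_t> lineOffsets;
    std::size_t cursor = 0;

    explicit TriviaCursor(std::string_view src)
        : source(src)
        , lineOffsets(computeLineOffsets(src))
    {
    }

    std::size_t offsetOf(Position p) const
    {
        assert(p.line < lineOffsets.size());
        return lineOffsets[p.line] + p.column;
    }

    // Returns everything between the cursor and the token start as trivia,
    // then moves the cursor to just past the token. Tokens must be visited
    // in source order, otherwise the first assertion fires.
    std::string_view takeTriviaBefore(Position tokenStart, std::size_t tokenLength)
    {
        std::size_t start = offsetOf(tokenStart);
        assert(start >= cursor);
        assert(start + tokenLength <= source.size());
        std::string_view trivia = source.substr(cursor, start - cursor);
        cursor = start + tokenLength;
        return trivia;
    }
};
```

Because the cursor only moves forward, serializing tokens out of source order trips the assertion, which is why the ordering of serialization matters in the later commits.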
This better handles serialization of separators and related trivia
We can also reuse a visitor to print a node, by just printing all tokens
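As a rough illustration of why printing every token is enough to reproduce the input: if each token carries the trivia that preceded it, concatenating trivia and text in order yields the original source. The sketch below uses a flat token list rather than a visitor over nodes, and the `Token` shape and `printTokens` name are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical token shape: the trivia that preceded the token, then its text.
struct Token
{
    std::string_view trivia;
    std::string_view text;
};

// Print a token stream by emitting each token's trivia followed by its text.
std::string printTokens(const std::vector<Token>& tokens)
{
    std::size_t total = 0;
    for (const Token& t : tokens)
        total += t.trivia.size() + t.text.size();

    std::string out;
    out.reserve(total);
    for (const Token& t : tokens)
    {
        out += t.trivia;
        out += t.text;
    }
    assert(out.size() == total);
    return out;
}
```

A round-trip check then amounts to comparing `printTokens(tokens)` against the source the tokens were extracted from.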
Split trivia into smaller tokens. Also, use string_views everywhere rather than copies of strings where possible. Still TODO:
- Split whitespace into smaller parts (whitespace should be separated by '\n' characters; a block of whitespace runs up to and including the first '\n' character)
- Split between leading trivia and trailing trivia
Rather than whitespace being a single chunk of characters, we split it into smaller parts delimited by a '\n' newline character. The purpose of this is to let us later separate trivia into leading and trailing trivia, following the Roslyn trivia rules.
We are creating a wrapper around transforming syntax, but not necessarily a code modder.
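The splitting rule can be sketched as below. This is an illustrative version, assuming the input is a run of whitespace only; `splitWhitespace` is a hypothetical name, not a function from the PR. The returned views point into the input, in line with the earlier move to string_views.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Split a run of whitespace into chunks, each ending at (and including)
// a '\n'. A final chunk without a newline is kept as-is.
std::vector<std::string_view> splitWhitespace(std::string_view ws)
{
    // Assumed precondition: the input contains whitespace characters only.
    assert(ws.find_first_not_of(" \t\r\n") == std::string_view::npos);

    std::vector<std::string_view> parts;
    std::size_t start = 0;
    while (start < ws.size())
    {
        std::size_t nl = ws.find('\n', start);
        std::size_t end = (nl == std::string_view::npos) ? ws.size() : nl + 1;
        parts.push_back(ws.substr(start, end - start));
        start = end;
    }
    return parts;
}
```

For the input `<space><space>\n\t\t\n\n`, this yields three chunks: `<space><space>\n`, `\t\t\n` and `\n`.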
aatxe approved these changes Apr 21, 2025
        luaL_error(L, "encountered illegal operator: Op__Count");
    }

    lua_pushstring(L, Luau::toString(op).data());
green-real pushed a commit to green-real/lute that referenced this pull request May 19, 2025
Now that CST nodes exist in Luau, this PR extends the existing `@lute/luau` C++ module to return full-fidelity CST nodes to Luau, rather than just AST nodes. This is intended to support future uses of the CST for tooling such as code modding, linting or formatting implemented fully in Luau.

We extend the serialization of some AST nodes with extra tokens, and add tests that check the validity of tokenization and the roundtrippability of the serialization state in Luau (i.e., `print(parse(source)) == source`, where `print` is fully defined in Luau). Future AST nodes will be added in follow-up commits.

Alongside the CST data, a key concept that this PR introduces is trivia. Right now, the Luau CST does not preserve trivia itself, but retains enough information to compute spans of trivia, allowing later lookup in the original source code. To extract trivia, we keep the original input buffer around, take the span between the last token and the next token, and call this "trivia". Trivia is then individually tokenized and attached to "Token" nodes.

There are some key rules for trivia:
- [x] Trivia consists of whitespace, single-line comments and multi-line comments. Each kind is a separate trivia token.
- [x] Whitespace trivia is split into separate tokens based on the '\n' character. Other whitespace characters are joined together up to and including the first newline character. For example, `<space><space>\n\t\t\n\n` is split into `[<space><space>\n, \t\t\n, \n]`.
- [ ] Trivia on a token is split into leading and trailing trivia. The leading / trailing split is based on the Roslyn compiler's trivia rules: trailing trivia consists of everything up to and including the first `\n` newline character, and all further trivia is leading trivia of the next token. The idea is that `local x = ...\n` can then be represented as a single line including its trivia, to allow easier replacement (rather than the `\n` being part of the next statement).

The last rule is not yet implemented but will be introduced in a follow-up commit.

The API remains experimental and is subject to change. In particular, I'm not fully sold on the naming scheme for nodes and fields (particularly the fields that match keyword names) and it may require some standardisation. Also, we are somewhat mixing Ast nodes with Cst nodes, and we may want to separate the two.
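Since the leading / trailing rule is not implemented yet, the following is only a sketch of how it could look, not the planned implementation. `TriviaSplit` and `splitTrivia` are hypothetical names. The sketch works on the raw text between two tokens and makes two simplifying assumptions: a span with no newline is treated entirely as trailing trivia, and a newline inside a multi-line comment is not treated specially.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type: trailing trivia belongs to the previous token,
// leading trivia belongs to the next token.
struct TriviaSplit
{
    std::string_view trailing;
    std::string_view leading;
};

// Split the text found between two tokens at the first '\n': everything up
// to and including it is trailing trivia, the rest is leading trivia.
TriviaSplit splitTrivia(std::string_view between)
{
    std::size_t nl = between.find('\n');
    TriviaSplit result;
    if (nl == std::string_view::npos)
    {
        // Assumption: with no newline, the whole span stays with the previous token.
        result.trailing = between;
        result.leading = std::string_view{};
    }
    else
    {
        result.trailing = between.substr(0, nl + 1);
        result.leading = between.substr(nl + 1);
    }
    assert(result.trailing.size() + result.leading.size() == between.size());
    return result;
}
```

With this split, the text after `local x = ...` up to and including its `\n` stays attached to that statement, so replacing the statement also replaces its line ending.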