Start making the CST usable in Lute #71

Merged
aatxe merged 15 commits into luau-lang:primary from JohnnyMorganz:cst
Apr 21, 2025

@JohnnyMorganz JohnnyMorganz commented Feb 18, 2025

Now that CST nodes exist in Luau, this PR extends the existing @lute/luau C++ module to return full-fidelity CST nodes to Luau, rather than just AST nodes. The aim is to support future uses of the CST in tooling such as code modding, linting, or formatting implemented fully in Luau.

We extend the serialization of some AST nodes with extra tokens, and create tests to check the validity of tokenization and the roundtrippability of the serialization state in Luau (i.e., print(parse(source)) == source, where print is fully defined in Luau). Further AST nodes will be added in follow-up commits.

Alongside the CST data, a key concept that this PR introduces is the idea of trivia. Right now, the Luau CST does not preserve trivia itself, but retains enough information to compute spans of trivia, allowing later lookup in the original source code. To extract trivia, we keep the original input buffer around, and then take the span between the last token and the next token, and call this "trivia". Trivia is then individually tokenized and attached to "Token" nodes.
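The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TriviaCursor` and `takeTriviaBefore` are hypothetical names, and token positions are assumed to already be available as byte offsets into the source.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: remembers how much of the source has been consumed.
struct TriviaCursor
{
    std::string_view source; // original input buffer, kept alive by the caller
    std::size_t offset = 0;  // byte offset just past the last serialized token

    // Returns the text between the previous token and the token occupying
    // [tokenStart, tokenEnd), then moves the cursor past that token.
    std::string_view takeTriviaBefore(std::size_t tokenStart, std::size_t tokenEnd)
    {
        // Tokens must be visited in source order, otherwise the span is meaningless.
        assert(offset <= tokenStart && tokenStart <= tokenEnd && tokenEnd <= source.size());
        std::string_view trivia = source.substr(offset, tokenStart - offset);
        offset = tokenEnd;
        return trivia;
    }
};
```

Because the result is a `std::string_view` into the retained buffer, no trivia text is copied. The assertion also shows why serialization order matters: visiting tokens out of order would trip it.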

There are some key rules for trivia:

  • Trivia consists of whitespace, single-line comments and multi-line comments. Each kind is a separate trivia token.
  • Whitespace trivia is split into separate tokens at each '\n' character: each token runs up to and including the next newline. For example, <space><space>\n\t\t\n\n is split into [<space><space>\n, \t\t\n, \n].
  • Trivia on a token is split into leading and trailing trivia. The split follows the Roslyn compiler's trivia rules: trailing trivia consists of everything up to and including the first \n newline character, and all further trivia is leading trivia of the next token. The idea is that local x = ...\n can then be represented as a single line including its trivia, which makes replacement easier (rather than the \n being part of the next statement).

The last rule is not yet implemented but will be introduced in a follow-up commit.
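As a rough illustration of the whitespace rule, the sketch below splits a run of whitespace into pieces that each end at a newline. The function name `splitWhitespace` is hypothetical, and the set of characters treated as whitespace (space, tab, `\r`, `\n`) is an assumption rather than something stated in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Splits a run of whitespace into pieces that each end at (and include) a '\n'.
// A final piece without a newline is kept as its own token.
std::vector<std::string_view> splitWhitespace(std::string_view whitespace)
{
    // Assumption: the caller has already separated out comments.
    assert(whitespace.find_first_not_of(" \t\r\n") == std::string_view::npos);

    std::vector<std::string_view> pieces;
    std::size_t start = 0;
    while (start < whitespace.size())
    {
        std::size_t newline = whitespace.find('\n', start);
        std::size_t end = (newline == std::string_view::npos) ? whitespace.size() : newline + 1;
        pieces.push_back(whitespace.substr(start, end - start));
        start = end;
    }
    return pieces;
}
```

For the input from the second rule, `"  \n\t\t\n\n"`, this yields the three pieces `"  \n"`, `"\t\t\n"` and `"\n"`.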

The API still remains experimental and is subject to change. In particular, I'm not fully sold on the naming scheme for nodes and fields (particularly the fields that match keyword names) and it may require some standardisation. Also, we are somewhat mixing Ast nodes with Cst nodes, and we may want to separate the two.

create an identity codemod that should in theory return the same as the input (showing
roundtrippability of parsing + printing)

enough to get "b.luau" parsing and printing (currently without whitespace trivia)
Introduces the necessary steps to extract and store trivia on tokens

The serialization step now also stores the underlying source
and a cursor through the source. When we serialize a token,
we take everything in between the cursor position and the token
start position and extract that as trivia.

To speed up offset computation, we compute the line offsets
for the file beforehand.
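A minimal sketch of such a line-offset table, assuming zero-based line numbers and columns measured in bytes from the start of the line (both are assumptions here, and the function names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Records the byte offset at which each line begins.
std::vector<std::size_t> computeLineOffsets(std::string_view source)
{
    std::vector<std::size_t> offsets;
    offsets.push_back(0); // line 0 always starts at the beginning of the buffer
    for (std::size_t i = 0; i < source.size(); ++i)
    {
        if (source[i] == '\n')
            offsets.push_back(i + 1);
    }
    return offsets;
}

// Converts a (line, column) position into a byte offset with one table lookup.
std::size_t toOffset(const std::vector<std::size_t>& lineOffsets, std::size_t line, std::size_t column)
{
    assert(line < lineOffsets.size());
    return lineOffsets[line] + column;
}
```

The table is built once in a single pass over the source, so each later position lookup is constant time instead of a rescan from the start of the file.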

We convert AstExprConstantString into a style of token.

Right now, this serializes trivia into one big chunk and does not
split it into leading and trailing trivia.
We will probably also want to split trivia into whitespace + comments.

We should follow Roslyn syntax rules where everything up to
and including a newline character is treated as trailing trivia
for the previous token, and everything after that is leading trivia
for the next token. This makes code modding nicer, as we can
treat nodes as discrete chunks.
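One way this split could look is sketched below. It is not part of this PR. It assumes the trivia between two tokens has already been tokenized so that only whitespace pieces can end in '\n'; `TriviaList` and `splitTrivia` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

using TriviaList = std::vector<std::string_view>;

// Returns {trailing trivia of the previous token, leading trivia of the next token}.
std::pair<TriviaList, TriviaList> splitTrivia(const TriviaList& pieces)
{
    std::size_t split = 0;
    while (split < pieces.size())
    {
        assert(!pieces[split].empty()); // assumption: trivia tokens are never empty
        bool endsLine = pieces[split].back() == '\n';
        ++split; // the piece that ends the line still belongs to the trailing side
        if (endsLine)
            break;
    }
    auto middle = pieces.begin() + static_cast<std::ptrdiff_t>(split);
    TriviaList trailing(pieces.begin(), middle);
    TriviaList leading(middle, pieces.end());
    return {trailing, leading};
}
```

If there is no newline between the two tokens, everything ends up as trailing trivia of the previous token, so trivia on the same line stays with the token it follows.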
We need to be careful with the order of serialization, as
otherwise it can mess up the cursor and trivia tracking.
This commit reorganises some of the serialization so it is in line

TODO: The serialization of commas is wrong, since we do it
too late. This is going to cause an assertion failure once we
hit an example that is comma separated
This better handles serialization of separators and related trivia
We can also reuse a visitor to print a node, by just printing all tokens
Split trivia into smaller tokens. Also, use string_views everywhere rather than copies of strings where possible.

Still TODO:
- Split whitespace into smaller parts (whitespace should be separated by '\n' characters - a block of whitespace is up to and including the first '\n' character)

- Split between leading trivia and trailing trivia
@JohnnyMorganz JohnnyMorganz changed the title Code modding Make the CST usable in Lute Mar 4, 2025
Rather than whitespace being a single chunk of characters,
we split it into smaller parts delimited by a '\n' newline character.

The purpose for this is to allow us to later separate trivia into
leading and trailing trivia that follows the Roslyn trivia rules.
we are creating a wrapper around transforming syntax, but not necessarily a code modder
@JohnnyMorganz JohnnyMorganz changed the title Make the CST usable in Lute Start making the CST usable in Lute Mar 4, 2025
@JohnnyMorganz JohnnyMorganz marked this pull request as ready for review March 4, 2025 21:02
luaL_error(L, "encountered illegal operator: Op__Count");
}

lua_pushstring(L, Luau::toString(op).data());

lol, nice

@aatxe aatxe merged commit bc7d3df into luau-lang:primary Apr 21, 2025
3 checks passed
green-real pushed a commit to green-real/lute that referenced this pull request May 19, 2025