Commit c8f0a35

Add sgf-parsing (#140)
* Add sgf-parsing
* update test-helpers doc
1 parent c9edddf commit c8f0a35

11 files changed

Lines changed: 773 additions & 112 deletions

config.json

Lines changed: 8 additions & 0 deletions
@@ -818,6 +818,14 @@
   "prerequisites": [],
   "difficulty": 8
 },
+{
+  "slug": "sgf-parsing",
+  "name": "SGF Parsing",
+  "uuid": "f990927e-2fae-4ac0-ba8f-0cd90d500e62",
+  "practices": [],
+  "prerequisites": [],
+  "difficulty": 8
+},
 {
   "slug": "zebra-puzzle",
   "name": "Zebra Puzzle",
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+return {
+  default = {
+    ROOT = { '.' }
+  }
+}
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# Hints
+
+## General
+
+The SGF language has a recursive definition:
+
+```none
+GameTree = "(" Sequence { GameTree } ")"
+```
+
+That is, a GameTree is a piece of text that starts with a `(`, followed by some text called a "Sequence" (a list of game nodes), _followed by **zero or more** child GameTrees_, and ends with a `)`.
+
+You might want to use regular expressions to parse an SGF text.
+However, because game trees can be nested inside other game trees to any depth, standard regular expressions are not powerful enough to keep track of the nesting levels.
+
+There are a couple of approaches to solving this exercise:
+
+### State machine
+
+You could scan the input string character-by-character, maintaining a variable holding your current position.
+You can keep track of the current state of the parsed text based on the character at the current position.
+The "current state" might be something like "reading a property name" or "inside a property value".
+There are lots of values you need to keep track of with this approach.
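The state-machine idea described in the hint above might look roughly like the following. This is only a minimal sketch: `scan_values` is a hypothetical helper (not part of this commit), the three state names are assumptions chosen for illustration, and it only collects the raw text of bracketed values. It does not build a tree and does not apply the SGF whitespace rules.

```moonscript
-- Hypothetical helper: walk the text one character at a time and collect
-- the contents of every [ ... ] value, honouring backslash escapes.
scan_values = (text) ->
  values = {}
  buffer = {}
  state = 'outside'   -- one of 'outside', 'value', 'escape'
  for i = 1, #text
    ch = text\sub i, i
    switch state
      when 'outside'
        -- Anything outside brackets is ignored in this sketch.
        if ch == '['
          state = 'value'
          buffer = {}
      when 'value'
        if ch == '\\'
          state = 'escape'
        elseif ch == ']'
          table.insert values, table.concat buffer
          state = 'outside'
        else
          table.insert buffer, ch
      when 'escape'
        -- The character after a backslash is taken as-is.
        table.insert buffer, ch
        state = 'value'
  values
```

Calling `scan_values "AB[aa][ab]"` returns a table holding `"aa"` and `"ab"`. A full solution would need more states (for example, one for reading a property name) and would have to track parentheses and semicolons as well.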
+
+### Parser
+
+A [PEG-based parser][PEG-parser] is a very good way to parse a language that has a defined grammar.
+[LPeg][LPeg] is a PEG parser, and it is one of the required dependencies of MoonScript, so you'll already have access to it.
+A parser allows you to write rules for what the text should look like.
+For example, "a property name is all uppercase letters", or "a property value is enclosed in brackets".
+The parser then does the heavy lifting of matching the input text against your rules.
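The two example rules quoted in the hint above could be written in LPeg along these lines. This is a sketch rather than the reference solution: the names `prop_name`, `prop_value` and `property` are hypothetical, and escape handling is deliberately left out.

```moonscript
lpeg = require "lpeg"
import P, R, C, Ct from lpeg

-- "A property name is all uppercase letters."
prop_name = C R("AZ")^1

-- "A property value is enclosed in brackets."
-- Escaped characters are ignored in this sketch.
prop_value = P"[" * C((1 - P"]")^0) * P"]"

-- One property: a name followed by one or more values,
-- with all captures gathered into a single table.
property = Ct prop_name * prop_value^1

result = property\match "AB[aa][ab]"
-- result is the table {"AB", "aa", "ab"}
```

If the input does not match the rules (for example `"ab[aa]"`, which starts with lowercase letters), `match` returns `nil` instead of a table.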
+
+A couple of learning resources for LPeg:
+
+* [LPeg tutorial][tutorial] in the lua-users wiki.
+* [Mastering LPeg][mastering-lpeg] is written by one of the authors of LPeg and Lua itself.
+  He is a university professor and researcher, so this document reads like an academic paper.
+
+[PEG-parser]: https://en.wikipedia.org/wiki/Parsing_expression_grammar
+[LPeg]: https://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
+[mastering-lpeg]: https://www.inf.puc-rio.br/~roberto/docs/lpeg-primer.pdf
+[tutorial]: http://lua-users.org/wiki/LpegTutorial
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# Instructions
+
+Parsing a Smart Game Format string.
+
+[SGF][sgf] is a standard format for storing board game files, in particular Go.
+
+SGF is a fairly simple format. An SGF file usually contains a single
+tree of nodes where each node is a property list. The property list
+contains key-value pairs; each key can only occur once but may have
+multiple values.
+
+The exercise will have you parse an SGF string and return a tree structure of properties.
+
+An SGF file may look like this:
+
+```text
+(;FF[4]C[root]SZ[19];B[aa];W[ab])
+```
+
+This is a tree with three nodes:
+
+- The top level node has three properties: FF\[4\] (key = "FF", value
+  = "4"), C\[root\] (key = "C", value = "root") and SZ\[19\] (key =
+  "SZ", value = "19"). (FF indicates the version of SGF, C is a
+  comment and SZ is the size of the board.)
+- The top level node has a single child which has a single property:
+  B\[aa\]. (Black plays on the point encoded as "aa", which is the
+  1-1 point).
+- The B\[aa\] node has a single child which has a single property:
+  W\[ab\].
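As a rough illustration, the three-node tree described above could be represented by a nested MoonScript table like the one below. The key names `properties` and `children`, and the choice to store each property's values in a list, are assumptions based on the example solution elsewhere in this commit; check the generated spec file for the exact shape the tests expect.

```moonscript
-- Assumed result shape for "(;FF[4]C[root]SZ[19];B[aa];W[ab])".
tree = {
  properties: { FF: {"4"}, C: {"root"}, SZ: {"19"} }
  children: {
    {
      properties: { B: {"aa"} }
      children: {
        {
          properties: { W: {"ab"} }
          children: {}
        }
      }
    }
  }
}
```

Each node carries its own `properties` table and a (possibly empty) list of `children`, so a straight line of moves becomes a chain of single-child nodes.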
+
+As you can imagine, an SGF file contains a lot of nodes with a single
+child, which is why there's a shorthand for it.
+
+SGF can encode variations of play. Go players do a lot of backtracking
+in their reviews (let's try this, doesn't work, let's try that) and SGF
+supports variations of play sequences. For example:
+
+```text
+(;FF[4](;B[aa];W[ab])(;B[dd];W[ee]))
+```
+
+Here the root node has two variations. The first (which by convention
+indicates what's actually played) is where black plays on 1-1. Black was
+sent this file by his teacher who pointed out a more sensible play in
+the second child of the root node: `B[dd]` (4-4 point, a very standard
+opening to take the corner).
+
+A key can have multiple values associated with it. For example:
+
+```text
+(;FF[4];AB[aa][ab][ba])
+```
+
+Here `AB` (add black) is used to add three black stones to the board.
+
+All property values will be the [SGF Text type][sgf-text].
+You don't need to implement any other value type.
+Although you can read the [full documentation of the Text type][sgf-text], a summary of the important points is below:
+
+- Newlines are removed if they come immediately after a `\`, otherwise they remain as newlines.
+- All whitespace characters other than newline are converted to spaces.
+- `\` is the escape character.
+  Any non-whitespace character after `\` is inserted as-is.
+  Any whitespace character after `\` follows the above rules.
+  Note that SGF does **not** have escape sequences for whitespace characters such as `\t` or `\n`.
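The Text rules summarised above can be sketched as a small character-by-character function. `sgf_text` is a hypothetical helper name, not something defined by the exercise, and the sketch assumes its argument is the raw content between `[` and `]` with only unix-style newlines.

```moonscript
-- Hypothetical helper: apply the SGF Text rules to the raw content
-- of a property value (the text between the brackets).
sgf_text = (raw) ->
  out = {}
  i = 1
  while i <= #raw
    ch = raw\sub i, i
    if ch == '\\'
      -- Look at the character following the backslash.
      i += 1
      ch = raw\sub i, i
      if ch == '\n'
        ch = ''      -- an escaped newline is removed
      elseif ch\match '%s'
        ch = ' '     -- other escaped whitespace becomes a space
      -- any other escaped character is kept as-is
    elseif ch != '\n' and ch\match '%s'
      ch = ' '       -- unescaped whitespace other than newline becomes a space
    table.insert out, ch
    i += 1
  table.concat out
```

For instance, a value containing a real tab character comes back with a space in its place, while a backslash followed by the letter `t` comes back as just the letter `t`.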
+
+Be careful not to get confused between:
+
+- The string as it is represented in a string literal in the tests
+- The string that is passed to the SGF parser
+
+Escape sequences in the string literals may have already been processed by the programming language's parser before they are passed to the SGF parser.
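The distinction above can be seen with two short MoonScript literals. The variable names are made up for illustration; the point is only how many characters reach the SGF parser in each case.

```moonscript
-- Written in source as "\\t": the language collapses the doubled backslash,
-- so the SGF parser receives two characters, a backslash and the letter t.
escaped_t = "\\t"

-- Written in source as "\t": the language turns this into a single tab
-- character before the SGF parser ever sees it.
real_tab = "\t"
```

Here `#escaped_t` is 2 and `#real_tab` is 1, so the same-looking test input can exercise quite different SGF rules depending on how the literal is written.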
+
+There are a few more complexities to SGF (and parsing in general), which
+you can mostly ignore. You should assume that the input is encoded in
+UTF-8; the tests won't contain a charset property, so don't worry about
+that. Furthermore, you may assume that all newlines are unix style (`\n`,
+no `\r` or `\r\n` will be in the tests) and that no optional whitespace
+between properties, nodes, etc. will be in the tests.
+
+[sgf]: https://en.wikipedia.org/wiki/Smart_Game_Format
+[sgf-text]: https://www.red-bean.com/sgf/sgf4.html#text
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+{
+  "authors": [
+    "glennj"
+  ],
+  "files": {
+    "solution": [
+      "sgf_parsing.moon"
+    ],
+    "test": [
+      "sgf_parsing_spec.moon"
+    ],
+    "example": [
+      ".meta/example.moon"
+    ]
+  },
+  "blurb": "Parsing a Smart Game Format string."
+}
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+lpeg = require "lpeg"
+import P, V, R, S, C, Ct, Cg from lpeg
+
+-- Learning references:
+-- https://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
+-- http://lua-users.org/wiki/LpegTutorial
+
+-- Define some "atomic" patterns that we'll use in the grammar.
+escaped_newlines = P"\\\n" / ""
+escaped_tabs = P"\\\t" / " "
+escaped = P"\\" / "" * C P 1
+plain = C 1 - S"]"
+-- SGF specifies that tabs and other whitespace characters are converted to spaces, but newlines are preserved.
+whitespace = S" \t\r\f\v" / " "
+
+prop_value = P"[" * (Ct((escaped_newlines + escaped_tabs + escaped + whitespace + plain)^0) / table.concat) * P"]"
+
+-- The Grammar
+sgf = P {
+  "GameTree"
+
+  -- A GameTree is '(' Sequence ')'
+  GameTree: P"(" * V"Sequence" * P")"
+
+  -- In Exercism, ;A;B means B is a child of A.
+  -- We nest the remainder of the sequence into the 'children' key.
+  Sequence: Ct(
+    P";" * Cg(V"PropMap", "properties") * Cg(Ct((V"Sequence" + V"Children")^-1), "children")
+  )
+
+  -- Maps tags like 'A' to a list of values ['x', 'y']
+  PropMap: Ct V"Property"^0
+
+  -- A property is a PropIdent followed by one or more PropValues.
+  -- We allow zero PropValues here to catch the case of properties without delimiters,
+  -- which is invalid but would otherwise be hard to detect.
+  Property: Cg C(V"PropIdent") * Ct(V"PropValue"^0)
+
+  -- Variations: Multiple GameTrees inside a node's children list
+  Children: V"GameTree"^1
+
+  -- PropIdent is a sequence of uppercase letters, but we allow lowercase letters here to catch
+  -- the case of properties containing lowercase, which is invalid but would otherwise be hard to detect.
+  PropIdent: (R"AZ" + R"az")^1
+  PropValue: prop_value
+}
+
+-- The LPEG parser can only capture properties as a sequence of name-value pairs,
+-- so we need to remap them into a key-value table.
+-- Plus, we have a couple of extra validations to perform.
+remap_properties = (tree) ->
+  if tree.properties
+    props = {}
+    for i = 1, #tree.properties, 2
+      name, value = tree.properties[i], tree.properties[i + 1]
+      assert name\match('^%u+$'), 'property must be in uppercase'
+      assert #value > 0, 'properties without delimiter'
+      props[name] = value
+    tree.properties = props
+
+  if tree.children
+    for child in *tree.children
+      remap_properties child
+
+{
+  parse: (input) ->
+    result = sgf\match input
+    if not result
+      -- The LPEG parser will fail if the input doesn't match the grammar,
+      -- but it won't give us any information about what went wrong.
+      -- Add some basic checks to provide more helpful error messages.
+      assert #input > 0 and input\match('^%(.*%)$'), 'tree missing'
+      assert input\find(';'), 'tree with no nodes'
+      error "unspecified parse error"
+
+    remap_properties result
+    result
+}
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+json = require 'dkjson'
+json_string = (s) -> json.encode s
+
+is_sequence = (t) ->
+  return false if type(t) != 'table'
+  size = 0
+  size += 1 for k, _ in pairs t when k != 'n'
+  size == #t
+
+is_empty = (t) -> not next t
+
+-- mostly taken from:
+-- https://github.com/leafo/moonscript/blob/7b7899741c6c1e971e436d36c9aabb56f51dc3d5/moonscript/util.moon#L58
+to_string = (what, level = 0) ->
+  seen = {}
+  _dump = (what, depth = 0) ->
+    t = type what
+    if t == 'string' then
+      json_string what
+    elseif t != 'table' then
+      tostring what
+    else
+      if seen[what] then
+        return "<cycle:#{tostring what}>"
+      seen[what] = true
+      if is_sequence what then
+        return "{" .. table.concat([to_string(v, level + depth + 1) for v in *what], ", ") .. "}"
+
+      depth += 1
+      lines = for k,v in pairs what do
+        key = if type(k) == 'number' then "[#{k}]" else k
+        (' ')\rep(depth * 2) .. "#{key}: " .. _dump(v, depth)
+      seen[what] = false
+      val = if not is_empty lines
+        table.concat [indent(line, level) .. "\n" for line in *lines]
+      else
+        ""
+      class_name = if type(what.__class) == 'table' and type(what.__class.__name) == 'string'
+        "<#{what.__class.__name}>"
+      "#{class_name or ""}{\n#{val}#{indent '}', level + depth - 1}"
+  _dump what
+
+
+{
+  module_name: 'SGFParser',
+
+  generate_test: (case, level) ->
+    lines = if case.expected.error
+      {
+        "f = -> SGFParser.parse #{json_string case.input.encoded}"
+        "assert.has_error f, #{quote case.expected.error}"
+      }
+    else
+      {
+        "result = SGFParser.#{case.property} #{json_string case.input.encoded}",
+        "expected = #{to_string case.expected, level}",
+        "assert.are.same expected, result"
+      }
+    table.concat [indent line, level for line in *lines], '\n'
+}
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+# This is an auto-generated file.
+#
+# Regenerating this file via `configlet sync` will:
+# - Recreate every `description` key/value pair
+# - Recreate every `reimplements` key/value pair, where they exist in problem-specifications
+# - Remove any `include = true` key/value pair (an omitted `include` key implies inclusion)
+# - Preserve any other key/value pair
+#
+# As user-added comments (using the # character) will be removed when this file
+# is regenerated, comments can be added via a `comment` key.
+
+[2668d5dc-109f-4f71-b9d5-8d06b1d6f1cd]
+description = "empty input"
+
+[84ded10a-94df-4a30-9457-b50ccbdca813]
+description = "tree with no nodes"
+
+[0a6311b2-c615-4fa7-800e-1b1cbb68833d]
+description = "node without tree"
+
+[8c419ed8-28c4-49f6-8f2d-433e706110ef]
+description = "node without properties"
+
+[8209645f-32da-48fe-8e8f-b9b562c26b49]
+description = "single node tree"
+
+[6c995856-b919-4c75-8fd6-c2c3c31b37dc]
+description = "multiple properties"
+
+[a771f518-ec96-48ca-83c7-f8d39975645f]
+description = "properties without delimiter"
+
+[6c02a24e-6323-4ed5-9962-187d19e36bc8]
+description = "all lowercase property"
+
+[8772d2b1-3c57-405a-93ac-0703b671adc1]
+description = "upper and lowercase property"
+
+[a759b652-240e-42ec-a6d2-3a08d834b9e2]
+description = "two nodes"
+
+[cc7c02bc-6097-42c4-ab88-a07cb1533d00]
+description = "two child trees"
+
+[724eeda6-00db-41b1-8aa9-4d5238ca0130]
+description = "multiple property values"
+
+[28092c06-275f-4b9f-a6be-95663e69d4db]
+description = "within property values, whitespace characters such as tab are converted to spaces"
+
+[deaecb9d-b6df-4658-aa92-dcd70f4d472a]
+description = "within property values, newlines remain as newlines"
+
+[8e4c970e-42d7-440e-bfef-5d7a296868ef]
+description = "escaped closing bracket within property value becomes just a closing bracket"
+
+[cf371fa8-ba4a-45ec-82fb-38668edcb15f]
+description = "escaped backslash in property value becomes just a backslash"
+
+[dc13ca67-fac0-4b65-b3fe-c584d6a2c523]
+description = "opening bracket within property value doesn't need to be escaped"
+
+[a780b97e-8dbb-474e-8f7e-4031902190e8]
+description = "semicolon in property value doesn't need to be escaped"
+
+[0b57a79e-8d89-49e5-82b6-2eaaa6b88ed7]
+description = "parentheses in property value don't need to be escaped"
+
+[c72a33af-9e04-4cc5-9890-1b92262813ac]
+description = "escaped tab in property value is converted to space"
+
+[3a1023d2-7484-4498-8d73-3666bb386e81]
+description = "escaped newline in property value is converted to nothing at all"
+
+[25abf1a4-5205-46f1-8c72-53273b94d009]
+description = "escaped t and n in property value are just letters, not whitespace"
+
+[08e4b8ba-bb07-4431-a3d9-b1f4cdea6dab]
+description = "mixing various kinds of whitespace and escaped characters in property value"
+reimplements = "11c36323-93fc-495d-bb23-c88ee5844b8c"
+
+[11c36323-93fc-495d-bb23-c88ee5844b8c]
+description = "escaped property"
+include = false
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+  parse: (input) ->
+    error 'Implement me!'
+}