Commit 4e5a50f

Expand server/readme a little
1 parent 8bad801 commit 4e5a50f

1 file changed

Lines changed: 37 additions & 11 deletions

server/README.md

@@ -8,16 +8,44 @@ This makes it so that:
 The src/ folder contains the following packages:

 ## app
-The base for spring boot and some application wide interfaces.
+The base for spring boot and spring configuration like handling special JSON serialization.

-## data
-All Galahad data is stored on disk. Hence why we have classes here like FileBackedCache and FileBackedValue.
+## web
+The Endpoints class lists all API endpoints.

-## data.corpus
+## files
+All Galahad data is stored on disk. For example, DiskValue<DocumentMetadata> stores document metadata in JSON on disk.
+Similarly, ValidatedDiskValue<CorpusMetadata> stores CorpusMetadata. However, on retrieval it first performs an isValid() check (which checks disk modification).
+CorpusMetadata stores the number of documents, so if a new file has been added, it should not be valid and is recalculated.

-## data.layer
-The annotations in a document (i.e. lemma and pos of each token) are collectively called a layer. A document can have multiple layers (as it can be tagged by multiple taggers). The original annotation layer is called the "sourceLayer".
-A layer consists of a list of terms. A term consists of a lemma, a part of speech, and a token. At some point, the system was designed for a term to be able to point to multiple tokens, hence why Term in reality has a list of tokens, called "word forms". But in reality there is only ever one token, and multi word terms were not fully developed.
+A corpus or document is represented by a folder on disk. This is the GalahadFolder class.
+
+## annotations
+The smallest unit of information in Galahad is an annotation, e.g. the part of speech "NOU-C(number=sg)".
+Annotations are bundled together on a Term. The TEI XML `<w lemma="hello" pos="INT">hello</w>` results in a Term with 3 annotations:
+- token: hello (token is also an annotation)
+- lemma: hello
+- pos: INT
+
+Terms are contained in a SentenceLayer (TEI XML `<s>`). Span annotations are defined on this level too. For example, the named entity "LOC" is defined as a span over two tokens: "The", "Netherlands".
+
+Sentences are contained in a ParagraphLayer (TEI XML `<p>`). Paragraphs are contained in a DocumentLayer (TEI XML `<text>`).
+We are not done yet however, as a single _file_ can contain multiple _documents_ in formats like TEI and CoNLL-U. And so, documents are contained in a Layer. This is the main class that is used.
+
+A Layer provides a summary: this is the number of annotations of each type. It also provides a preview of the first couple of terms.
+Converting a layer to plaintext is as simple as .toString().
+
+## corpora
+
+## documents
+
+## exceptions
+A bunch of Galahad-specific exceptions that include an HTTP status.
+
+## export
+
+## formats
+Contains all document readers, converters and mergers for the supported formats in Galahad.

 ## evaluation
 For evaluating a single layer (the frequency distribution) or comparing two layers (part of speech confusion and accuracy metrics), where one represents the absolute truth (called the "reference") and one is being tested against it (called the "hypothesis"). The main use case is setting the sourceLayer as the absolute truth reference.
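
The reference/hypothesis comparison described in the evaluation text can be sketched roughly as follows. This is an illustration only, written in Python rather than the server's own language. The function name `compare_pos`, the use of plain lists of tags as stand-ins for layers, and the example tags are all assumptions; the commit does not show how Galahad actually computes these metrics. The sketch also assumes both layers are already token-aligned.

```python
from collections import Counter


def compare_pos(reference, hypothesis):
    """Return (accuracy, confusion) for two token-aligned lists of POS tags.

    Hypothetical helper: `reference` plays the role of the trusted layer,
    `hypothesis` the layer under test.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("layers must be token-aligned")
    # Count every (reference tag, hypothesis tag) pair.
    confusion = Counter(zip(reference, hypothesis))
    # Pairs on the diagonal are tokens where both layers agree.
    correct = sum(n for (ref, hyp), n in confusion.items() if ref == hyp)
    accuracy = correct / len(reference) if reference else 0.0
    return accuracy, confusion


# Illustrative tags, not taken from a real Galahad tagset.
reference = ["DET", "NOUN", "VERB", "NOUN"]
hypothesis = ["DET", "NOUN", "NOUN", "NOUN"]
accuracy, confusion = compare_pos(reference, hypothesis)
```

Off-diagonal entries such as `confusion[("VERB", "NOUN")]` show which reference tags the hypothesis layer tends to mislabel.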
@@ -50,8 +78,6 @@ In order to show a leaderboard of taggers on datasets, we have so-called 'assays
 ## jobs
 The process of a tagger tagging a document, which creates a new annotation layer, is called a job.

-## port
-Contains all document readers, converters and mergers for the supported formats in Galahad.
-
 ## tagset & tagger
-Both relatively simple packages. Read out yaml files in a folder and make them available in a singleton-like manner.
+Both relatively simple packages. Read out yaml files in a folder and make them available in a singleton-like manner.
+
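
The `files` section added in this commit describes a value that is stored as JSON on disk and recalculated when an `isValid()` check detects a disk modification. A minimal sketch of that idea is given below, in Python rather than the server's own language. The class body, the `compute` callback, the `count_documents` helper and the choice to compare the folder's modification time are all assumptions made for illustration; the commit shows only the class names, not the implementation.

```python
import json
import os
import tempfile
from pathlib import Path


class ValidatedDiskValue:
    """Hypothetical sketch: a JSON value cached on disk, recomputed when stale."""

    def __init__(self, cache_file, watched_dir, compute):
        self.cache_file = Path(cache_file)    # must live outside watched_dir
        self.watched_dir = Path(watched_dir)  # folder whose changes invalidate it
        self.compute = compute                # rebuilds the value from the folder

    def _stamp(self):
        # A directory's mtime changes when entries are added, removed or renamed.
        return self.watched_dir.stat().st_mtime_ns

    def is_valid(self):
        if not self.cache_file.exists():
            return False
        stored = json.loads(self.cache_file.read_text())
        return stored["stamp"] == self._stamp()

    def read(self):
        if self.is_valid():
            return json.loads(self.cache_file.read_text())["value"]
        # Stale or missing: recompute and store together with the current stamp.
        value = self.compute(self.watched_dir)
        payload = {"stamp": self._stamp(), "value": value}
        self.cache_file.write_text(json.dumps(payload))
        return value


def count_documents(folder):
    # Stand-in for recalculating corpus metadata: just count the files.
    return {"num_docs": sum(1 for p in folder.iterdir() if p.is_file())}


def demo():
    with tempfile.TemporaryDirectory() as root:
        docs = Path(root) / "docs"
        docs.mkdir()
        (docs / "a.txt").write_text("hello")
        meta = ValidatedDiskValue(Path(root) / "meta.json", docs, count_documents)
        first = meta.read()  # nothing cached yet, so the value is computed
        (docs / "b.txt").write_text("world")
        # Force a distinct timestamp so the demo does not depend on clock resolution.
        st = docs.stat()
        os.utime(docs, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
        second = meta.read()  # folder changed, so the value is recomputed
        return first, second
```

Storing the folder's timestamp next to the value keeps the validity check to a single `stat` call plus one small file read. The trade-off is that an edit to an existing file's contents goes unnoticed, which is acceptable for a document count but may not be for other metadata.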