server/README.md
37 additions & 11 deletions
@@ -8,16 +8,44 @@ This makes it so that:
 The src/ folder contains the following packages:
 
 ## app
-The base for spring boot and some application wide interfaces.
+The base for Spring Boot and Spring configuration, such as handling special JSON serialization.
 
-
## data
14
-
All Galahad data is stored on disk. Hence why we have classes here like FileBackedCache and FileBackedValue.
13
+
## web
14
+
The Endpoints class lists all API endpoints.
15
15
-## data.corpus
+## files
+All Galahad data is stored on disk. For example, DiskValue<DocumentMetadata> stores document metadata as JSON on disk.
+Similarly, ValidatedDiskValue<CorpusMetadata> stores CorpusMetadata. However, on retrieval it first performs an isValid() check (which checks for modifications on disk).
+CorpusMetadata stores the number of documents, so if a new file has been added, the stored value is no longer valid and is recalculated.
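The validate-on-retrieval idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Galahad's actual implementation: the class `CachedDocumentCount`, its methods, and the choice to compare the folder's last-modified timestamp are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.stream.Stream;

// Hypothetical sketch: caches the number of files in a folder and
// recalculates it when the folder's modification time has changed.
class CachedDocumentCount {
    private final Path folder;
    private long cachedCount = -1;     // -1 means "nothing cached yet"
    private long cachedAtMillis = -1;  // folder timestamp when the count was taken
    int recalculations = 0;            // only here to make the behaviour observable

    CachedDocumentCount(Path folder) {
        this.folder = folder;
    }

    // Valid as long as a value is cached and the folder has not been modified since.
    boolean isValid() {
        return cachedCount >= 0 && lastModified() == cachedAtMillis;
    }

    // Returns the cached count, recalculating it first if it is no longer valid.
    long get() {
        if (!isValid()) {
            cachedAtMillis = lastModified();
            cachedCount = countFiles();
            recalculations++;
        }
        return cachedCount;
    }

    private long lastModified() {
        try {
            return Files.getLastModifiedTime(folder).toMillis();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private long countFiles() {
        try (Stream<Path> entries = Files.list(folder)) {
            return entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage example: returns {first count, count after adding a file, recalculations}.
    static long[] demo() {
        try {
            Path dir = Files.createTempDirectory("galahad-sketch");
            Files.createFile(dir.resolve("a.txt"));
            CachedDocumentCount count = new CachedDocumentCount(dir);
            long first = count.get();
            count.get(); // served from the cache, no recalculation
            Files.createFile(dir.resolve("b.txt"));
            // Push the folder timestamp forward so the example does not depend
            // on the timestamp resolution of the file system.
            Files.setLastModifiedTime(dir,
                    FileTime.fromMillis(System.currentTimeMillis() + 10_000));
            long second = count.get();
            return new long[] {first, second, count.recalculations};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the second `get()` is answered from the cache, and only the call after the folder changed triggers a recount.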
 
-## data.layer
-The annotations in a document (i.e. lemma and pos of each token) are collectively called a layer. A document can have multiple layers (as it can be tagged by multiple taggers). The original annotation layer is called the "sourceLayer".
-A layer consists of a list of terms. A term consists of a lemma, a part of speech, and a token. At some point, the system was designed for a term to be able to point to multiple token, hence why Term in reality has a list of tokens, called "word forms". But in reality there is only ever one token, and multi word terms were not fully developed.
+A corpus or document is represented by a folder on disk. This is the GalahadFolder class.
+
+## annotations
+The smallest unit of information in Galahad is an annotation, e.g. the part of speech "NOU-C(number=sg)".
+Annotations are bundled together on a Term. The TEI XML `<w lemma="hello" pos="INT">hello</w>` results in a Term with 3 annotations:
+- token: hello (token is also an annotation)
+- lemma: hello
+- pos: INT
+
+Terms are contained in a SentenceLayer (TEI XML `<s>`). Span annotations are defined on this level too. For example, the named entity "LOC" is defined as a span over two tokens: "The", "Netherlands".
+
+Sentences are contained in a ParagraphLayer (TEI XML `<p>`). Paragraphs are contained in a DocumentLayer (TEI XML `<text>`).
+We are not done yet, however, as a single _file_ can contain multiple _documents_ in formats like TEI and CoNLL-U. And so, documents are contained in a Layer. This is the main class that is used.
+
+A Layer provides a summary: the number of annotations of each type. It also provides a preview of the first couple of terms.
+Converting a layer to plaintext is as simple as calling .toString().
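The containment of annotations, terms and layers can be illustrated with a much simplified sketch. The classes `SketchTerm`, `SketchSentence` and `SketchLayer` below are hypothetical stand-ins, not Galahad's real classes. The paragraph and document levels are left out, and the plaintext format (tokens separated by spaces, one sentence per line) is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical, simplified stand-in for a Term: a bundle of annotations.
class SketchTerm {
    // Annotation type -> value, e.g. "token" -> "hello", "pos" -> "INT".
    final Map<String, String> annotations = new LinkedHashMap<>();

    SketchTerm(String token, String lemma, String pos) {
        annotations.put("token", token); // the token is an annotation too
        annotations.put("lemma", lemma);
        annotations.put("pos", pos);
    }
}

// Hypothetical stand-in for a sentence: an ordered list of terms.
class SketchSentence {
    final List<SketchTerm> terms;

    SketchSentence(SketchTerm... terms) {
        this.terms = Arrays.asList(terms);
    }
}

// Hypothetical stand-in for a layer, here reduced to a list of sentences.
class SketchLayer {
    final List<SketchSentence> sentences;

    SketchLayer(SketchSentence... sentences) {
        this.sentences = Arrays.asList(sentences);
    }

    // Summary: the number of annotations of each type.
    Map<String, Integer> summary() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (SketchSentence sentence : sentences) {
            for (SketchTerm term : sentence.terms) {
                for (String type : term.annotations.keySet()) {
                    counts.merge(type, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Plaintext: tokens separated by spaces, one sentence per line (assumed format).
    @Override
    public String toString() {
        StringJoiner lines = new StringJoiner("\n");
        for (SketchSentence sentence : sentences) {
            StringJoiner words = new StringJoiner(" ");
            for (SketchTerm term : sentence.terms) {
                words.add(term.annotations.get("token"));
            }
            lines.add(words.toString());
        }
        return lines.toString();
    }
}
```

With these stand-ins, `new SketchTerm("hello", "hello", "INT")` corresponds to the `<w lemma="hello" pos="INT">hello</w>` example: one term carrying three annotations.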
+
+## corpora
+
+## documents
+
+## exceptions
+A bunch of Galahad-specific exceptions that include an HTTP status.
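As a rough illustration of an exception that carries its own HTTP status, consider the sketch below. The class names, the `status` field and the message format are hypothetical. How Galahad maps such exceptions onto HTTP responses is not described here and is not shown.

```java
// Hypothetical sketch: a base exception that carries the HTTP status to respond with.
class SketchHttpException extends RuntimeException {
    final int status;

    SketchHttpException(int status, String message) {
        super(message);
        this.status = status;
    }
}

// Hypothetical example of a specific exception built on top of it.
class SketchCorpusNotFound extends SketchHttpException {
    SketchCorpusNotFound(String corpusId) {
        super(404, "Corpus not found: " + corpusId);
    }
}
```

Code that catches a `SketchHttpException` can then read `status` to decide which response code to send, without needing to know the concrete exception type.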
+
+## export
+
+## formats
+Contains all document readers, converters and mergers for the supported formats in Galahad.
 
 ## evaluation
 For evaluating a single layer (the frequency distribution) or comparing two layers (part of speech confusion and accuracy metrics), where one represents the absolute truth (called the "reference") and one is being tested against it (called the "hypothesis"). The main use case is setting the sourceLayer as the absolute truth reference.
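A comparison of a hypothesis against a reference could look roughly like the sketch below. It assumes both layers have already been reduced to part-of-speech tag lists of equal length, aligned term by term. The class `SketchEvaluation`, its method names and the `"REFERENCE -> HYPOTHESIS"` key format are hypothetical and only serve the illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: compares two aligned lists of part-of-speech tags.
class SketchEvaluation {

    // Assumption: the two layers are aligned, so position i refers to the same term.
    private static void requireAligned(List<String> reference, List<String> hypothesis) {
        if (reference.isEmpty() || reference.size() != hypothesis.size()) {
            throw new IllegalArgumentException("layers must be non-empty and of equal length");
        }
    }

    // Accuracy: the fraction of terms where the hypothesis agrees with the reference.
    static double accuracy(List<String> reference, List<String> hypothesis) {
        requireAligned(reference, hypothesis);
        int correct = 0;
        for (int i = 0; i < reference.size(); i++) {
            if (reference.get(i).equals(hypothesis.get(i))) {
                correct++;
            }
        }
        return (double) correct / reference.size();
    }

    // Confusion: how often each reference tag received each hypothesis tag,
    // keyed as "REFERENCE -> HYPOTHESIS".
    static Map<String, Integer> confusion(List<String> reference, List<String> hypothesis) {
        requireAligned(reference, hypothesis);
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < reference.size(); i++) {
            counts.merge(reference.get(i) + " -> " + hypothesis.get(i), 1, Integer::sum);
        }
        return counts;
    }
}
```

For a reference of `NOU, VRB, ADJ, NOU` and a hypothesis of `NOU, VRB, NOU, NOU`, three of four tags agree, and the confusion counts record one `ADJ -> NOU` mistake.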
@@ -50,8 +78,6 @@ In order to show a leaderboard of taggers on datasets, we have so-called 'assays
 ## jobs
 The process of a tagger tagging a document, which creates a new annotation layer, is called a job.
 
-## port
-Contains all document readers, converters and mergers for the supported formats in Galahad.
-
 ## tagset & tagger
-Both relatively simple packages. Read out yaml files in a folder and make them available in a singleton-like manner.
+Both relatively simple packages. Read out yaml files in a folder and make them available in a singleton-like manner.