Python 3 Library for Reading the Text Content and Metadata of TEI P5 (Lite) Files
The library focuses on extracting the main text content from a file and providing the available metadata about the text.
It was originally created to support importing TEI formatted corpora into GrETEL, using the corpus2alpino library.

Install the package with pip:

    pip install tei-reader

Then read a file and print its text:

    from tei_reader import TeiReader
    reader = TeiReader()
    corpora = reader.read_file('example-tei.xml') # or read_string
    print(corpora.text)

    # show element attributes before the actual element text
    print(corpora.tostring(lambda x, text: str(list(a.key + '=' + a.text for a in x.attributes)) + text))

Create a reader with TeiReader(), then call either read_file(file_name) or read_string(str). Both return a Corpora object with the following properties:

| Property | Description |
|---|---|
| corpora[] | A corpora can contain sub-corpora. |
| documents[] | The Document objects directly part of this corpora. |

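
Because a Corpora can nest other corpora, collecting every document takes a short recursive walk. The sketch below is illustrative only: `iter_documents` is a hypothetical helper, not part of tei_reader, and the `SimpleNamespace` objects are stand-ins that mimic just the properties listed above (assumed here to be plain iterables), so the example runs without a TEI file.

```python
from types import SimpleNamespace


def iter_documents(corpora):
    """Yield every document in a corpora, descending into sub-corpora."""
    # Documents that belong directly to this corpora come first.
    yield from corpora.documents
    # Then recurse into each nested corpora.
    for sub_corpora in corpora.corpora:
        yield from iter_documents(sub_corpora)


# Stand-in objects exposing only `corpora`, `documents` and `text`.
doc_a = SimpleNamespace(text="first document")
doc_b = SimpleNamespace(text="second document")
nested = SimpleNamespace(corpora=[], documents=[doc_b])
root = SimpleNamespace(corpora=[nested], documents=[doc_a])

texts = [doc.text for doc in iter_documents(root)]
print(texts)
```

With the real library, the object returned by read_file or read_string would take the place of `root`.
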
Corpora and Document both inherit from Element. Every object deriving from Element provides the following properties:

| Property | Description |
|---|---|
| attributes{} | The attributes applicable to this element. If an attribute itself contains attributes, these are returned as well (e.g. encodingDesc::editorialDecl::normalization). |
| text | The entire text content as a str. |
| divisions[] | Recursively get all the text divisions in document order. If an element contains parts or text without a tag, those are returned in order, each wrapped in a PlaceholderDivision. |
| all_parts[] | Recursively get the parts that make up the entire text, in document order, e.g. when something has emphasis, has a footnote, or is marked as foreign. Text without a container element is returned in order, wrapped in a PlaceholderPart. |
| parts[] | Get the parts in document order directly below the current element. |

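
To make the idea of placeholder parts concrete, here is a rough sketch of walking mixed content in document order using only the standard library's `xml.etree.ElementTree`. It is not how tei_reader is implemented: the function name `iter_parts` and the `"placeholder"` label are made up for illustration, and the XML snippet is a simplified, namespace-free fragment.

```python
import xml.etree.ElementTree as ET


def iter_parts(element):
    """Yield (label, text) pairs for all text below element, in document order."""
    # Text that appears before the first child has no container of its own.
    if element.text and element.text.strip():
        yield ("placeholder", element.text.strip())
    for child in element:
        if len(child) == 0:
            # Leaf element: its tag labels the text it wraps.
            if child.text and child.text.strip():
                yield (child.tag, child.text.strip())
        else:
            # Nested element: descend to reach its parts.
            yield from iter_parts(child)
        # Text following the child belongs to the parent, again without a tag.
        if child.tail and child.tail.strip():
            yield ("placeholder", child.tail.strip())


paragraph = ET.fromstring(
    "<p>Plain text with <hi>emphasis</hi> and a <foreign>mot</foreign>.</p>"
)
parts = list(iter_parts(paragraph))
for label, text in parts:
    print(label, repr(text))
```

The sketch strips surrounding whitespace to keep the output readable; a real reader would need to decide how much of that whitespace is significant.
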
Attribute, PlaceholderDivision and PlaceholderPart all support the same properties as Element.
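
The attributes of any element can be gathered into a lookup table keyed by name. This is a minimal sketch that relies only on the `key` and `text` properties used in the usage example; `attributes_by_key` is a hypothetical helper, and the `SimpleNamespace` objects and their values are invented stand-ins rather than output from a real file.

```python
from collections import defaultdict
from types import SimpleNamespace


def attributes_by_key(element):
    """Group attribute texts by key; a key may occur more than once."""
    grouped = defaultdict(list)
    for attribute in element.attributes:
        grouped[attribute.key].append(attribute.text)
    return dict(grouped)


# Stand-in element whose attributes expose only `key` and `text`.
element = SimpleNamespace(attributes=[
    SimpleNamespace(key="title", text="Example title"),
    SimpleNamespace(key="author", text="First Author"),
    SimpleNamespace(key="author", text="Second Author"),
])

metadata = attributes_by_key(element)
print(metadata)
```

Values are kept in lists so that repeated keys, such as several authors, are not silently overwritten.
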

To publish a release to PyPI, build a source distribution and upload it with twine:

    python setup.py sdist
    twine upload dist/*