For linguistic processing, such as tokenization and stemming, Vespa integrates with Apache OpenNLP. The downside is that only a limited set of languages is supported.
One way to expand language support is to use a language-independent tokenizer such as SentencePiece and index the token ids.
We could either:
1. Create a Linguistics implementation wrapping SentencePiece.
2. Use SentencePiece as-is, but add indexing language converters that turn the produced tensor into a string for indexing, and do the same on the query side.
Option 1 is straightforward; option 2 could reuse the existing embedder functionality and indexing language converters:
<container version="1.0">
    <component id="spiece"
               class="com.yahoo.language.sentencepiece.SentencePieceEmbedder"
               bundle="linguistics-components">
        <config name="language.sentencepiece.sentence-piece">
            <model>
                <item>
                    <language>unknown</language>
                    <path>model/en.wiki.bpe.vs10000.model</path>
                </item>
            </model>
        </config>
    </component>
</container>
On the document side, it could look like this, using indexing language converters:
field title_tokens type string {
    indexing: (input title || "") | embed spiece | to_string | summary
}
Note that the above fails, as the embed indexing language function expects the field type to be a tensor.
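The intent of the hypothetical to_string step can be sketched outside Vespa. Assuming the embedder produces a sequence of token ids, the conversion would simply render them as a whitespace-separated string, so each id becomes searchable as an ordinary term (the function name and the ids below are illustrative, not existing Vespa API):

```python
def tokens_to_string(token_ids):
    """Render a sequence of token ids as a whitespace-separated string.

    Indexing this string in a plain text field makes each token id
    matchable as an ordinary term, sidestepping language-specific
    tokenization entirely.
    """
    return " ".join(str(t) for t in token_ids)

# Example: ids a SentencePiece model might produce for "foo bar"
# (the concrete ids are made up for illustration)
ids = [9912, 123, 77]
print(tokens_to_string(ids))  # -> 9912 123 77
```

The same conversion would have to be applied consistently on both the document and the query side for the terms to match.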
The query side is less clear, but one would want to be able to search both the original string text and the token vocabulary ids.
{
    "query": "foo bar",
    "yql": "select * from doc where userQuery()",
    "input.query(tokens)": "embed(spiece, foo bar)"
}
The missing piece is how to express the conversion of the tensor query(tokens) into strings, and on into the query tree used for retrieval.
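One possible shape for that conversion, sketched in Python purely for illustration: a query-side step that expands the token ids into a weakAnd of term matches over the token field. The field name, helper name, and ids are assumptions for the sketch, not existing Vespa API:

```python
def tokens_to_yql(field, token_ids):
    """Build a YQL query matching any of the given token ids.

    Each id is matched as an ordinary term in the token field; weakAnd
    gives OR-like retrieval with ranking-friendly semantics.
    """
    terms = ", ".join(f'{field} contains "{t}"' for t in token_ids)
    return f"select * from doc where weakAnd({terms})"

print(tokens_to_yql("title_tokens", [9912, 123, 77]))
# -> select * from doc where weakAnd(title_tokens contains "9912",
#    title_tokens contains "123", title_tokens contains "77")
```

In practice this expansion would presumably live in a Searcher (or equivalent query-processing hook), so that the user-facing request can keep passing plain text plus the embed expression.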