For linguistic processing, such as tokenization and stemming, Vespa integrates with Apache OpenNLP. The downside is that only a limited set of languages is supported.
One way to expand language support is to use a language-independent tokenizer such as SentencePiece and index the token ids.
We could either:
1. Create a Linguistics implementation wrapping SentencePiece.
2. Use SentencePiece as-is, but add indexing language converters that turn the produced tensor into a string for indexing, and do the same on the query side.
Option 1 is straightforward; option 2 could reuse the existing embedder functionality and indexing language converters:
<container version="1.0">
    <component id="spiece"
               class="com.yahoo.language.sentencepiece.SentencePieceEmbedder"
               bundle="linguistics-components">
        <config name="language.sentencepiece.sentence-piece">
            <model>
                <item>
                    <language>unknown</language>
                    <path>model/en.wiki.bpe.vs10000.model</path>
                </item>
            </model>
        </config>
    </component>
</container>
On the document side, it could look like this, using indexing language converters:
field title_tokens type string {
    indexing: (input title || "") | embed spiece | to_string | summary
}
Note that the above fails, as the embed indexing language function expects the field type to be a tensor.
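The intent of the hypothetical to_string step can be sketched outside Vespa. Assuming the embedder produces a sequence of token ids, the conversion would simply render them as a whitespace-separated string, so each id becomes searchable as an ordinary term (the function name and the ids below are illustrative, not existing Vespa API):

```python
def tokens_to_string(token_ids):
    """Render a sequence of token ids as a whitespace-separated string.

    Indexing this string in a plain text field makes each token id
    matchable as an ordinary term, sidestepping language-specific
    tokenization entirely.
    """
    return " ".join(str(t) for t in token_ids)

# Example: ids a SentencePiece model might produce for "foo bar"
# (the concrete ids are made up for illustration)
ids = [9912, 123, 77]
print(tokens_to_string(ids))  # -> 9912 123 77
```

The same conversion would have to be applied consistently on both the document and the query side for the terms to match.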
The query side is less clear, but one would want to be able to search both the original string text and the token vocabulary ids.
{
    "query": "foo bar",
    "yql": "select * from doc where userQuery()",
    "input.query(tokens)": "embed(spiece, foo bar)"
}
The missing piece is how to express the conversion of the tensor query(tokens) into strings, and on into the query tree used for retrieval.
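One possible shape for that conversion, sketched in Python purely for illustration: a query-side step that expands the token ids into a weakAnd of term matches over the token field. The field name, helper name, and ids are assumptions for the sketch, not existing Vespa API:

```python
def tokens_to_yql(field, token_ids):
    """Build a YQL query matching any of the given token ids.

    Each id is matched as an ordinary term in the token field; weakAnd
    gives OR-like retrieval with ranking-friendly semantics.
    """
    terms = ", ".join(f'{field} contains "{t}"' for t in token_ids)
    return f"select * from doc where weakAnd({terms})"

print(tokens_to_yql("title_tokens", [9912, 123, 77]))
# -> select * from doc where weakAnd(title_tokens contains "9912",
#    title_tokens contains "123", title_tokens contains "77")
```

In practice this expansion would presumably live in a Searcher (or equivalent query-processing hook), so that the user-facing request can keep passing plain text plus the embed expression.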