
Ability to use sentencepiece tokenization as linguistic implementation #27039

@jobergum

Description

For linguistic processing, such as tokenization and stemming, Vespa integrates with Apache OpenNLP. The downside is that relatively few languages are supported.

One way to expand language support is to use language-independent tokenizers such as sentencepiece and index the token ids.

We could:

  1. Create a Linguistics implementation wrapping sentencepiece.
  2. Use sentencepiece, but add integrations in the indexing language to convert the produced tensor to a string for indexing, and do the same on the query side.

Option 1 is straightforward; option 2 could re-use the embedder functionality and the indexing language converters.

<container version="1.0">
  <component id="spiece"
             class="com.yahoo.language.sentencepiece.SentencePieceEmbedder"
             bundle="linguistics-components">
    <config name="language.sentencepiece.sentence-piece">
      <model>
        <item>
          <language>unknown</language>
          <path>model/en.wiki.bpe.vs10000.model</path>
        </item>
      </model>
    </config>
  </component>
</container>

On the document side, it could look like this, using indexing language converters:

field title_tokens type string {
    indexing: (input title || "") | embed spiece | to_string | summary
}

Note that the above currently fails, as the embed indexing-language (IL) function expects the field type to be a tensor.
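To make the intent concrete, here is a minimal Python sketch of what the to_string step would do. The vocabulary and tokenizer below are toy stand-ins for illustration only; a real implementation would load a trained sentencepiece model.

```python
# Toy stand-in vocabulary; a real setup would load a trained sentencepiece
# model (e.g. en.wiki.bpe.vs10000.model) instead.
TOY_VOCAB = {"▁foo": 908, "▁bar": 1281}

def tokenize(text):
    """Whitespace split standing in for sentencepiece segmentation.
    Returns the vocabulary ids of the pieces."""
    return [TOY_VOCAB["▁" + piece] for piece in text.split()]

def to_index_string(token_ids):
    """What a to_string converter would emit: the token ids rendered as a
    space-separated string, so they can be indexed as ordinary terms."""
    return " ".join(str(t) for t in token_ids)

print(to_index_string(tokenize("foo bar")))  # 908 1281
```

The same conversion would have to be applied on the query side so that query tokens match the indexed terms.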

On the query side, it is less clear, but one would want to be able to search for both the original string text and the token vocabulary ids.

{
  "query": "foo bar",
  "yql": "select * from doc where userQuery()",
  "input.query(tokens)": "embed(spiece, foo bar)"
}

The missing piece is how to express the conversion of the query(tokens) tensor to a string and into the query tree for retrieval.
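One conceivable shape for that conversion, sketched as a hypothetical helper (the function name is made up, and weakAnd is just one possible choice of retrieval operator, not an existing API):

```python
def tokens_to_yql(field, token_ids):
    """Hypothetical conversion of the ids in query(tokens) into a YQL
    weakAnd over string terms in the given field."""
    terms = ", ".join(f'{field} contains "{t}"' for t in token_ids)
    return f"select * from doc where weakAnd({terms})"

print(tokens_to_yql("title_tokens", [908, 1281]))
# select * from doc where weakAnd(title_tokens contains "908", title_tokens contains "1281")
```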
