Merged
48 changes: 48 additions & 0 deletions en/learn/faq.md
@@ -285,6 +285,54 @@ select * from sources * where fieldsetOrField1 contains text(@query) or fieldset
```
More details on [Stack Overflow](https://stackoverflow.com/questions/72784136/why-vepsa-easily-warning-me-this-may-lead-to-recall-and-ranking-issues).

### Why can common words like "the" hurt recall and collapse significance across a fieldset?
Symptoms, which can appear when a term's document frequency (DF) differs substantially between member fields:
- Poor recall for queries mixing a common term with a rare one (e.g. `"the cure"`, `"the X"`).
[weakAnd](../ranking/wand.html) may drop the common term, so good matches never surface.
- `term(n).significance` and `fieldMatch(field).significance` report identical values across member fields
  in rank-feature dumps, even in fields where the term is actually rare.

Cause: when a term matches a [fieldset](../reference/schemas/schemas.html#fieldset)
(including the implicit `default` used by [userQuery()](../reference/querying/yql.html#userquery)),
Vespa aggregates the document frequency across all member fields.

If the DF differs substantially between members, the high-DF field dominates
and pulls the term's significance down for the whole fieldset.

**Example:**

With `fieldset default { fields: title, artist }`, `"the"` is common in `title`
(countless _"The Watcher"_, _"The Best Of the ..."_) but rare in `artist`.

The aggregated DF is dominated by `title`, which pulls the term's significance down,
so searching for the artist `"The Cure"` loses the signal from `"the"`.
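
To make the effect concrete, the sketch below plugs invented numbers into a generic textbook IDF, `ln(N / df)`. This is not Vespa's exact significance formula, and the corpus size and per-field DFs are hypothetical. It only illustrates how far apart the two per-field values can sit, and why an aggregated DF close to the `title` DF leaves little signal for `artist`.

```bash
# Hypothetical corpus statistics; all numbers are invented for illustration.
N=1000000          # documents in the corpus
DF_TITLE=400000    # documents with "the" in title
DF_ARTIST=2000     # documents with "the" in artist

# Generic textbook IDF, ln(N / df). NOT Vespa's exact significance formula.
idf() {
  awk -v n="$1" -v df="$2" 'BEGIN { printf "%.2f\n", log(n / df) }'
}

idf "$N" "$DF_ARTIST"   # per-field IDF for artist, prints 6.21
idf "$N" "$DF_TITLE"    # per-field IDF for title, prints 0.92
```

If the aggregated DF lands near the `title` value, as described above, the whole fieldset sees an IDF close to the second number, even for matches in `artist`.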

The same aggregated DF drives every DF/IDF feature: [bm25](../ranking/bm25.html),
[nativeRank](../ranking/nativerank.html), `term(n).significance`, `fieldMatch(field).significance`.

Matches that survive retrieval are scored using the aggregated DF rather than per-field statistics.

Fix: rewrite the query as OR'd [userInput](../reference/querying/yql.html#userinput) clauses,
one per field, each with its own [defaultIndex](../reference/querying/yql.html#defaultindex) annotation.
Each field then uses its own DF. Compare the two formulations:

*Combined-fieldset DF:*

```bash
vespa query 'select * from sources * where userQuery()' \
query='the cure'
```

*Per-field DF:*

```bash
vespa query 'select * from sources * where ({defaultIndex:"title"}userInput(@q)) or ({defaultIndex:"artist"}userInput(@q))' \
q='the cure'
```
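
When the fieldset has more than two members, writing the clauses by hand gets tedious. The following is a minimal sketch of a shell helper that generates the per-field statement; `per_field_yql` is a hypothetical name, not part of the Vespa CLI, and it assumes the query text is passed as `@q` as in the example above.

```bash
# Hypothetical helper: prints a YQL statement that ORs one userInput clause
# per field, each with its own defaultIndex annotation, reading the text from @q.
per_field_yql() {
  where=""
  for field in "$@"; do
    clause="({defaultIndex:\"${field}\"}userInput(@q))"
    # Join clauses with " or "; the first clause gets no prefix.
    where="${where:+$where or }$clause"
  done
  printf 'select * from sources * where %s\n' "$where"
}

# Print the statement for the two fields used in this entry.
per_field_yql title artist
```

The output can be passed straight to the CLI, for example `vespa query "$(per_field_yql title artist)" q='the cure'`.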

{% include important.html content="BM25 and significance feature values shift scale when switching to per-field DF.
Retrain any learned ranker on features collected with the new query formulation."%}

### How is the query timeout computed?
Find query timeout details in the [Query API Guide](../querying/query-api.html#timeout)
and the [Query API Reference](../reference/api/query.html#timeout).
3 changes: 3 additions & 0 deletions en/ranking/wand.html
@@ -69,6 +69,9 @@
<code>weakAnd</code> integrates with linguistic processing (tokenization and stemming).
It uses the per-term inverted document frequency and query term weight in the inner scoring
but does not use document term frequency in the scoring.
Note that when searching a fieldset, the document frequency is aggregated across all member fields,
which can cause common terms like <em>"the"</em> to be pruned from the query — see
<a href="../learn/faq.html#why-can-common-words-like-the-hurt-recall-and-collapse-significance-across-a-fieldset">this FAQ entry</a>.
</li>
<li>
The <code>wand</code> query operator which does not integrate with linguistic processing
8 changes: 8 additions & 0 deletions en/reference/schemas/schemas.html
@@ -1381,6 +1381,14 @@ <h2 id="fieldset">fieldset</h2>
}
</pre>
<p>Adding a fieldset will not create extra index structures in memory / on disk; it is just a mapping.</p>
<p>
Note that document frequency is aggregated across all member fields when matching a fieldset,
which affects <a href="../../ranking/bm25.html">BM25</a> and significance values, and can cause
<a href="../../ranking/wand.html">weakAnd</a> to prune matches for common terms like <em>"the"</em>
when they are frequent in one member field but rare in another.
See the <a href="../../learn/faq.html#why-can-common-words-like-the-hurt-recall-and-collapse-significance-across-a-fieldset">FAQ</a>
for details and the per-field workaround.
</p>


<h2 id="compression">compression</h2>