Commit 9c950ba

Thomasht86/fieldset combined idf docs (#4670)
* faq on combined df
* cross-link faq
1 parent 9013423 commit 9c950ba

3 files changed

Lines changed: 59 additions & 0 deletions

en/learn/faq.md

Lines changed: 48 additions & 0 deletions
@@ -285,6 +285,54 @@ select * from sources * where fieldsetOrField1 contains text(@query) or fieldset
 ```
 More details on [Stack Overflow](https://stackoverflow.com/questions/72784136/why-vepsa-easily-warning-me-this-may-lead-to-recall-and-ranking-issues).

+### Why can common words like "the" hurt recall and collapse significance across a fieldset?
+Symptoms, which can appear when a term's DF differs substantially between member fields:
+- Poor recall for queries mixing a common term with a rare one (e.g. `"the cure"`, `"the X"`).
+[weakAnd](../ranking/wand.html) may drop the common term, so good matches never surface.
+- `term(n).significance` and `fieldMatch(field).significance` read identical across member fields
+in rank-feature dumps, even in fields where the term is actually rare.
+
+Cause: when a term matches a [fieldset](../reference/schemas/schemas.html#fieldset)
+(including the implicit `default` used by [userQuery()](../reference/querying/yql.html#userquery)),
+Vespa aggregates the document frequency across all member fields.
+
+If the DF differs substantially between members, the high-DF field dominates
+and pulls the term's significance down for the whole fieldset.
+
+**Example:**
+
+With `fieldset default { fields: title, artist }`, `"the"` is common in `title`
+(countless _"The Watcher"_, _"The Best Of the ..."_) but rare in `artist`.
+
+Its aggregated significance is pulled down toward the `title` DF,
+so searching for the artist `"The Cure"` loses the signal from `"the"`.
+
+The same aggregated DF drives every DF/IDF feature: [bm25](../ranking/bm25.html),
+[nativeRank](../ranking/nativerank.html), `term(n).significance`, `fieldMatch(field).significance`.
+
+Matches that survive retrieval are scored using the aggregated DF rather than per-field statistics.
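The dominance of the high-DF field can be illustrated with a small Python sketch. It rests on loud assumptions: a textbook IDF, `log(N / df)`, stands in for significance, and the fieldset DF is modeled as the number of documents where the term occurs in any member field. Vespa's actual significance formula and aggregation rule may differ, and the corpus size and document IDs are invented for illustration.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Textbook inverse document frequency; a stand-in, not Vespa's formula."""
    return math.log(num_docs / max(doc_freq, 1))


NUM_DOCS = 1000  # hypothetical corpus size

# Hypothetical document IDs in which the term "the" occurs, per field.
docs_with_the = {
    "title": set(range(0, 600)),     # common in titles
    "artist": set(range(590, 610)),  # rare in artist names
}

# Per-field statistics: each field keeps its own DF.
per_field_idf = {
    field: idf(len(doc_ids), NUM_DOCS) for field, doc_ids in docs_with_the.items()
}

# Assumption: fieldset DF = documents matching the term in any member field.
aggregated_df = len(set().union(*docs_with_the.values()))
aggregated_idf = idf(aggregated_df, NUM_DOCS)

for field, value in per_field_idf.items():
    print(f"{field}: per-field idf = {value:.2f}")
print(f"fieldset: aggregated idf = {aggregated_idf:.2f}")
```

Under these assumptions the aggregated value lands at or below the `title` value, so the strong per-field signal that `"the"` carries in `artist` is no longer visible.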
+
+Fix: rewrite as OR'd [userInput](../reference/querying/yql.html#userinput) clauses
+with a [defaultIndex](../reference/querying/yql.html#defaultindex) annotation per field.
+Each field then uses its own DF:
+
+*Combined-fieldset DF:*
+
+```bash
+vespa query 'select * from sources * where userQuery()' \
+query='the cure'
+```
+
+*Per-field DF:*
+
+```bash
+vespa query 'select * from sources * where ({defaultIndex:"title"}userInput(@q)) or ({defaultIndex:"artist"}userInput(@q))' \
+q='the cure'
+```
+
+{% include important.html content="BM25 and significance feature values shift scale when switching to per-field DF.
+Retrain any learned ranker on features collected with the new query formulation."%}
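For fieldsets with more than two members, the per-field rewrite can be generated rather than typed by hand. The following Python sketch shows one way to do that; the helper name `per_field_yql` and its parameters are hypothetical, and only the clause syntax is taken from the per-field example above.

```python
import json


def per_field_yql(fields, query_param="q"):
    """Build a YQL string that ORs one userInput clause per field.

    Hypothetical helper: each clause carries its own defaultIndex
    annotation, mirroring the per-field example in the FAQ entry.
    """
    if not fields:
        raise ValueError("need at least one field")
    clauses = [
        # json.dumps adds the double quotes around the field name.
        "({defaultIndex:%s}userInput(@%s))" % (json.dumps(field), query_param)
        for field in fields
    ]
    return "select * from sources * where " + " or ".join(clauses)


yql = per_field_yql(["title", "artist"])
print(yql)
```

The resulting string can be passed as the first argument to `vespa query`, with the user text supplied separately as `q='the cure'`, in the same way as the per-field example.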
+
 ### How is the query timeout computed?
 Find query timeout details in the [Query API Guide](../querying/query-api.html#timeout)
 and the [Query API Reference](../reference/api/query.html#timeout).

en/ranking/wand.html

Lines changed: 3 additions & 0 deletions
@@ -69,6 +69,9 @@
 <code>weakAnd</code> integrates with linguistic processing (tokenization and stemming).
 It uses the per-term inverted document frequency and query term weight in the inner scoring
 but does not use document term frequency in the scoring.
+Note that when searching a fieldset, the document frequency is aggregated across all member fields,
+which can cause common terms like <em>"the"</em> to be pruned from the query; see
+<a href="../learn/faq.html#why-can-common-words-like-the-hurt-recall-and-collapse-significance-across-a-fieldset">this FAQ entry</a>.
 </li>
 <li>
 The <code>wand</code> query operator which does not integrate with linguistic processing

en/reference/schemas/schemas.html

Lines changed: 8 additions & 0 deletions
@@ -1381,6 +1381,14 @@ <h2 id="fieldset">fieldset</h2>
 }
 </pre>
 <p>Adding a fieldset will not create extra index structures in memory / on disk; it is just a mapping.</p>
+<p>
+Note that document frequency is aggregated across all member fields when matching a fieldset,
+which affects <a href="../../ranking/bm25.html">BM25</a> and significance values, and can cause
+<a href="../../ranking/wand.html">weakAnd</a> to prune matches for common terms like <em>"the"</em>
+when they are frequent in one member field but rare in another.
+See the <a href="../../learn/faq.html#why-can-common-words-like-the-hurt-recall-and-collapse-significance-across-a-fieldset">FAQ</a>
+for details and the per-field workaround.
+</p>


 <h2 id="compression">compression</h2>
