Thomasht86/fieldset combined idf docs (#4670)

thomasht86 · web-flow · commit 9c950badb0e6 · 2026-04-24T14:42:11.000+02:00
* faq on combined df

* cross-link faq
diff --git a/en/learn/faq.md b/en/learn/faq.md
@@ -285,6 +285,54 @@ select * from sources * where fieldsetOrField1 contains text(@query) or fieldset
 ```
 More details on [stack overflow](https://stackoverflow.com/questions/72784136/why-vepsa-easily-warning-me-this-may-lead-to-recall-and-ranking-issues).
 
+### Why can common words like "the" hurt recall and collapse significance across a fieldset?
+Symptoms, which can appear when a term's document frequency (DF) differs substantially between member fields:
+- Poor recall for queries mixing a common term with a rare one (e.g. `"the cure"`, `"the X"`).
+  [weakAnd](../ranking/wand.html) may drop the common term, so good matches never surface.
+- `term(n).significance` and `fieldMatch(field).significance` are identical across member fields
+  in rank-feature dumps, even in fields where the term is actually rare.
+
+Cause: when a term matches a [fieldset](../reference/schemas/schemas.html#fieldset)
+(including the implicit `default` used by [userQuery()](../reference/querying/yql.html#userquery)),
+Vespa aggregates the document frequency across all member fields.
+
+If the DF differs substantially between members, the high-DF field dominates
+and pulls the term's significance down for the whole fieldset.
+
+**Example:**
+
+With `fieldset default { fields: title, artist }`, `"the"` is common in `title`
+(countless _"The Watcher"_, _"The Best Of the ..."_) but rare in `artist`.
+
+The aggregated DF is dominated by `title`, so the significance of `"the"` is low in `artist` as well,
+and searching for the artist `"The Cure"` loses the signal from `"the"`.
+
+The same aggregated DF drives every DF/IDF feature: [bm25](../ranking/bm25.html),
+[nativeRank](../ranking/nativerank.html), `term(n).significance`, `fieldMatch(field).significance`.
+
+Matches that survive retrieval are scored using the aggregated DF rather than per-field statistics.
+
+Fix: rewrite the query as OR'd [userInput](../reference/querying/yql.html#userinput) clauses
+with a [defaultIndex](../reference/querying/yql.html#defaultindex) annotation per field.
+Each field then uses its own DF:
+
+*Combined-fieldset DF:*
+
+```bash
+vespa query 'select * from sources * where userQuery()' \
+  query='the cure'
+```
+
+*Per-field DF:*
+
+```bash
+vespa query 'select * from sources * where ({defaultIndex:"title"}userInput(@q)) or ({defaultIndex:"artist"}userInput(@q))' \
+  q='the cure'
+```
+
+{% include important.html content="BM25 and significance feature values shift scale when switching to per-field DF.
+Retrain any learned ranker on features collected with the new query formulation."%}
+
 ### How is the query timeout computed?
 Find query timeout details in the [Query API Guide](../querying/query-api.html#timeout)
 and the [Query API Reference](../reference/api/query.html#timeout).
diff --git a/en/ranking/wand.html b/en/ranking/wand.html
@@ -69,6 +69,9 @@
   <code>weakAnd</code> integrates with linguistic processing (tokenization and stemming).
   It uses the per-term inverted document frequency and query term weight in the inner scoring
   but does not use document term frequency in the scoring.
+  Note that when searching a fieldset, the document frequency is aggregated across all member fields,
+  which can cause common terms like <em>"the"</em> to be pruned from the query — see
+  <a href="../learn/faq.html#why-can-common-words-like-the-hurt-recall-and-collapse-significance-across-a-fieldset">this FAQ entry</a>.
 </li>
 <li>
   The <code>wand</code> query operator which does not integrate with linguistic processing
diff --git a/en/reference/schemas/schemas.html b/en/reference/schemas/schemas.html
@@ -1381,6 +1381,14 @@ <h2 id="fieldset">fieldset</h2>
 }
 </pre>
 <p>Adding a fieldset will not create extra index structures in memory / on disk; it is just a mapping.</p>
+<p>
+  Note that document frequency is aggregated across all member fields when matching a fieldset,
+  which affects <a href="../../ranking/bm25.html">BM25</a> and significance values, and can cause
+  <a href="../../ranking/wand.html">weakAnd</a> to prune matches for common terms like <em>"the"</em>
+  when they are frequent in one member field but rare in another.
+  See the <a href="../../learn/faq.html#why-can-common-words-like-the-hurt-recall-and-collapse-significance-across-a-fieldset">FAQ</a>
+  for details and the per-field workaround.
+</p>
 
 
 <h2 id="compression">compression</h2>