@@ -285,6 +285,54 @@ select * from sources * where fieldsetOrField1 contains text(@query) or fieldset
285285```
286286More details on [Stack Overflow](https://stackoverflow.com/questions/72784136/why-vepsa-easily-warning-me-this-may-lead-to-recall-and-ranking-issues).
287287
288+ ### Why can common words like "the" hurt recall and collapse significance across a fieldset?
289+ Symptoms, which can appear when a term's document frequency (DF) differs substantially between member fields:
290+ - Poor recall for queries mixing a common term with a rare one (e.g. `"the cure"`, `"the X"`).
291+ [weakAnd](../ranking/wand.html) may drop the common term, so good matches never surface.
292+ - `term(n).significance` and `fieldMatch(field).significance` are identical across member fields
293+ in rank-feature dumps, even in fields where the term is actually rare.
294+
295+ Cause: when a term matches a [fieldset](../reference/schemas/schemas.html#fieldset)
296+ (including the implicit `default` used by [userQuery()](../reference/querying/yql.html#userquery)),
297+ Vespa aggregates the document frequency across all member fields.
298+
299+ If the DF differs substantially between members, the high-DF field dominates
300+ and pulls the term's significance down for the whole fieldset.
301+
302+ **Example:**
303+
304+ With `fieldset default { fields: title, artist }`, `"the"` is common in `title`
305+ (countless _"The Watcher"_, _"The Best Of the ..."_) but rare in `artist`.
306+
307+ Its aggregated significance is pulled down toward the value implied by the `title` DF,
308+ so searching for the artist `"The Cure"` loses the signal from `"the"`.
309+
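The effect can be illustrated with a small numeric sketch. It uses the textbook BM25 IDF formula, made-up corpus statistics, and one loud assumption: the fieldset DF is modeled as the largest member-field DF, which is a stand-in and not necessarily how Vespa aggregates. With these numbers the fieldset IDF for `"the"` comes out near 0.92, while the per-field IDF in `artist` is near 5.30.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Textbook BM25 inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical corpus statistics, chosen only for illustration.
DOC_COUNT = 1_000_000
DF_THE = {"title": 400_000, "artist": 5_000}

# Assumption: model the fieldset DF as the largest member-field DF.
# This is a stand-in for the aggregation, not Vespa's documented formula.
fieldset_df = max(DF_THE.values())

idf_fieldset = bm25_idf(DOC_COUNT, fieldset_df)
idf_per_field = {field: bm25_idf(DOC_COUNT, df) for field, df in DF_THE.items()}

print(f"fieldset: {idf_fieldset:.2f}")
for field, idf in idf_per_field.items():
    print(f"{field}: {idf:.2f}")
```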
310+ The same aggregated DF drives every DF/IDF feature: [bm25](../ranking/bm25.html),
311+ [nativeRank](../ranking/nativerank.html), `term(n).significance`, `fieldMatch(field).significance`.
312+
313+ Matches that survive retrieval are scored using the aggregated DF rather than per-field statistics.
314+
315+ Fix: rewrite as OR'd [userInput](../reference/querying/yql.html#userinput) clauses
316+ with a [defaultIndex](../reference/querying/yql.html#defaultindex) annotation per field.
317+ Each field then uses its own DF:
318+
319+ *Combined-fieldset DF:*
320+
321+ ``` bash
322+ vespa query 'select * from sources * where userQuery()' \
323+ query='the cure'
324+ ```
325+
326+ *Per-field DF:*
327+
328+ ``` bash
329+ vespa query 'select * from sources * where ({defaultIndex:"title"}userInput(@q)) or ({defaultIndex:"artist"}userInput(@q))' \
330+ q='the cure'
331+ ```
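For fieldsets with more than a couple of members, the per-field query can be generated instead of written by hand. The following is a minimal sketch; the helper name `per_field_yql` and its `param` argument are hypothetical and not part of any Vespa client library. It only assembles the YQL string in the shape of the per-field query above.

```python
def per_field_yql(fields, param="q"):
    """Build a YQL query with one defaultIndex-annotated userInput clause per field.

    Hypothetical helper for illustration; it only formats a string.
    """
    clauses = [f'({{defaultIndex:"{field}"}}userInput(@{param}))' for field in fields]
    return "select * from sources * where " + " or ".join(clauses)


yql = per_field_yql(["title", "artist"])
print(yql)
```

The resulting string can be passed as the YQL argument of the query, with the user's text supplied in the `q` parameter as before.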
332+
333+ {% include important.html content="BM25 and significance feature values shift scale when switching to per-field DF.
334+ Retrain any learned ranker on features collected with the new query formulation."%}
335+
288336### How is the query timeout computed?
289337Find query timeout details in the [Query API Guide](../querying/query-api.html#timeout)
290338and the [Query API Reference](../reference/api/query.html#timeout).