Merged
Changes from 4 commits
4 changes: 2 additions & 2 deletions en/basics/ranking.html
@@ -96,12 +96,12 @@ <h2 id="phased-ranking">Phased ranking</h2>

second-phase {
expression: xgboost(my_xgboost_reranker)
-rerank-count: 1000 # per content node
+total-rerank-count: 1000 # over all nodes
}

global-phase {
expression: sum(onnx(my_large_onnx_model))
-rerank-count: 20 # globally
+rerank-count: 20
}

}
2 changes: 1 addition & 1 deletion en/clients/vespa-cli.html
@@ -233,7 +233,7 @@ <h3 id="queries">Queries</h3>
<p>Example query file:</p>
<pre>{% highlight json %}
{
-"yql": "select product_id, title from products where {targetHits: 200}nearestNeighbor(dense_embedding, q_vector)",
+"yql": "select product_id, title from products where {totalTargetHits: 200}nearestNeighbor(dense_embedding, q_vector)",
"input.query(q_vector)": [-0.050548091530799866, ... ,0.028366032987833023],
"ranking": "vector_distance"
}
4 changes: 2 additions & 2 deletions en/content/attributes.html
@@ -586,7 +586,7 @@ <h2 id="paged-attributes">Paged attributes</h2>
where the number of attribute accesses is limited by the re-ranking phase count.
</p>
<p>
-For example using a second phase <a href="../reference/schemas/schemas.html#secondphase-rerank-count">rerank-count</a>
+For example using a second phase <a href="../reference/schemas/schemas.html#secondphase-total-rerank-count">total-rerank-count</a>
of 100 will limit the maximum number of page-ins/disk accesses per query to 100.
Running at 100 QPS would need up to 10K disk accesses per second.
This is the worst case if none of the accessed attribute data were paged into memory already.
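As a back-of-envelope check, the worst-case disk-access rate is simply QPS times the re-rank count. A hypothetical sizing helper (not Vespa code) illustrating the arithmetic:

```python
# Hypothetical sizing sketch, not part of Vespa: worst-case page-ins
# per second for paged attributes, assuming none of the accessed
# attribute data is already resident in memory.
def worst_case_page_ins_per_sec(qps: int, total_rerank_count: int) -> int:
    # Each query may touch up to total_rerank_count attribute values on disk.
    return qps * total_rerank_count

# 100 QPS with a re-rank count of 100 gives up to 10K accesses/second.
print(worst_case_page_ins_per_sec(100, 100))
```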
@@ -608,7 +608,7 @@ <h2 id="paged-attributes">Paged attributes</h2>
rank-profile foo {
first-phase {}
second-phase {
-rerank-count: 100
+total-rerank-count: 100
expression: sum(attribute(tensordata))
}
}
2 changes: 1 addition & 1 deletion en/learn/faq.md
@@ -87,7 +87,7 @@ of a double. This can happen in two cases:

- The [ranking](../basics/ranking.html) expression used a feature which became `NaN` (Not a Number) or infinite. For example, `log(0)` would produce
-Infinity. One can use [isNan](../reference/ranking/ranking-expressions.html#isnan-x) to guard against this.
-- Surfacing low scoring hits using [grouping](../querying/grouping.html), that is, rendering low ranking hits with `each(output(summary()))` that are outside of what Vespa computed and caches on a heap. This is controlled by the [keep-rank-count](../reference/schemas/schemas.html#keep-rank-count).
+- Surfacing low scoring hits using [grouping](../querying/grouping.html), that is, rendering low ranking hits with `each(output(summary()))` that are outside what Vespa computed and caches on a heap. This is controlled by the [total-keep-rank-count](../reference/schemas/schemas.html#total-keep-rank-count) parameter.
### How to pin query results?
To hard-code documents to positions in the result set,
13 changes: 6 additions & 7 deletions en/learn/tutorials/rag-blueprint.md
@@ -570,8 +570,7 @@ not the case for most real-world RAG applications, so this is crucial to have in

![phased ranking overview](/assets/img/phased-ranking-rag.png)

-It is worth noting that parameters such as `targetHits` (for the match phase) and `rerank-count`
-(for first and second phase) are applied **per content node**. Also note that the stateless container nodes can
+Note that the stateless container nodes can
also be [scaled independently](../../performance/sizing-search.html) to handle increased query load.

## Configuring match-phase (retrieval)
@@ -1380,8 +1379,8 @@ We run the evaluation script on a set of unseen test queries, and get the follow
```

For the first phase ranking, we care most about recall, as we just want to make sure that the candidate documents are
-ranked high enough to be included in the second-phase ranking. (the default number of documents that will be exposed to
-second-phase is 10 000, but can be controlled by the `rerank-count` parameter).
+ranked high enough to be included in the second-phase ranking. The total number of documents reranked in the second phase,
+across all content nodes, is controlled by the `total-rerank-count` parameter.
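As a sketch, a rank profile using this parameter could look like the following (the profile name, field names, and expressions are illustrative, not taken from the tutorial schema):

```
rank-profile reranking inherits default {
    first-phase {
        expression: nativeRank(title, chunk)
    }
    second-phase {
        total-rerank-count: 100
        expression: firstPhase * attribute(quality)
    }
}
```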

We can see that our results are already very good. This is of course because we have a small, synthetic dataset.
In reality, you should align the metric expectations with your dataset and test queries.
@@ -1392,7 +1391,7 @@ within your latency budget, as you want some headroom for second-phase ranking.
## Second-phase ranking

For the second-phase ranking, we can afford to use a more expensive ranking expression, since we will only run it
-on the top-k documents from the first-phase ranking (defined by the `rerank-count` parameter, which defaults to 10,000 documents).
+on the top-k documents from the first-phase ranking (decided by the `total-rerank-count` parameter).

This is where we can significantly improve ranking quality by using more sophisticated models and features that would
be too expensive to compute for all matched documents.
@@ -1589,7 +1588,7 @@ vespa query \
**Performance monitoring:**

* Monitor latency impact of second-phase ranking
-* Adjust `rerank-count` based on quality vs. performance trade-offs
+* Adjust `total-rerank-count` based on quality vs. performance trade-offs
* Consider using different models for different query types or use cases

The second-phase ranking represents a crucial step in building high-quality RAG applications,
@@ -1598,7 +1597,7 @@ providing the precision needed for effective LLM context while maintaining reaso
## (Optional) Global-phase ranking

We also have the option of configuring [global-phase](../../reference/schemas/schemas.html#globalphase-rank) ranking, which can rerank the top k
-(as set by `rerank-count` parameter) documents from the second-phase ranking.
+(as set by the `total-rerank-count` parameter) documents from the second-phase ranking.

Common options for global-phase are [cross-encoders](../../ranking/cross-encoders.html) or another GBDT model, trained to
better separate top-ranked documents using learning-to-rank objectives such as [LambdaMART](https://xgboost.readthedocs.io/en/latest/tutorials/learning_to_rank.html). For RAG applications,
14 changes: 7 additions & 7 deletions en/performance/graceful-degradation.html
@@ -177,21 +177,21 @@ <h2 id="match-phase-degradation">Match phase degradation</h2>
<p>
Match-phase works by specifying an <code>attribute</code> that measures document
quality in some way (popularity, click-through rate, pagerank, ad bid value, price, text quality).
-In addition, a <code>max-hits</code> value is specified
-that specifies how many hits are "more than enough" for the application.
+In addition, a <code>total-max-hits</code> value is specified
+that sets how many hits, in total over the content nodes, are "more than enough" for the application.
Then an estimate is made after collecting a reasonable amount of hits for the query,
-and if the estimate is higher than the configured <code>max-hits</code> value,
+and if the estimate is higher than the node's share of the <code>total-max-hits</code> value,
an extra limitation is added to the query,
ensuring that only the highest quality documents can become hits.
</p><p>
In effect, this limits the documents actually queried to the highest quality documents,
a subset of the full corpus,
where the size of the subset is calculated in such a way
-that the query is estimated to give <code>max-hits</code> hits.
+that the query is estimated to give the node's share of <code>total-max-hits</code> hits.
Since some (low-quality) hits will already have been collected to do the estimation,
-the actual number of hits returned will usually be higher than max-hits.
+the actual number of hits returned will usually be higher than total-max-hits.
But since the distribution of documents isn't perfectly smooth,
-you risk sometimes getting less than the configured <code>max-hits</code> hits back.
+you risk sometimes getting fewer than the configured <code>total-max-hits</code> hits back.
</p><p>
Note that limiting hits in the match-phase also affects <a href="../querying/grouping.html">aggregation/grouping</a>,
and total-hit-count, since matching is actually limited and the query gets fewer hits.
@@ -200,7 +200,7 @@ <h2 id="match-phase-degradation">Match phase degradation</h2>
since they both operate in the same manner,
and you would get interference between them that could cause unpredictable results.
The graph shows possible hits versus actual hits in a corpus with 100 000 documents,
-where <code>max-hits</code> is configured to 10 000.
+where the node's share of <code>total-max-hits</code> is 10 000.
The corpus is a synthetic (slightly randomized) data set;
in practice the graph will be less smooth:
</p>
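The per-node decision described above can be sketched in code (a simplified, hypothetical model; in Vespa the estimate is made incrementally while collecting hits, not as a single function call):

```python
# Hypothetical sketch of the match-phase degradation decision,
# not actual Vespa code.
def should_limit_match_phase(estimated_hits: int,
                             total_max_hits: int,
                             num_content_nodes: int) -> bool:
    # Each content node compares its hit estimate against its
    # share of the configured total-max-hits value.
    node_share = total_max_hits / num_content_nodes
    return estimated_hits > node_share

# A node estimating 5000 hits, with total-max-hits 10000 over 4 nodes
# (share 2500), adds the extra quality-based limitation.
print(should_limit_match_phase(5000, 10000, 4))
```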
2 changes: 1 addition & 1 deletion en/performance/practical-search-performance-guide.md
@@ -1122,7 +1122,7 @@ Repeating the query from above, replacing `dotProduct` with `wand`:
<button class="d-icon d-duplicate pre-copy-button" onclick="copyPreContent(this)"></button>
<pre data-test="exec" data-test-assert-contains="Vastarannan valssi">
$ vespa query \
-'yql=select track_id, title, artist, tags from track where {targetHits:10}wand(tags, @userProfile)' \
+'yql=select track_id, title, artist, tags from track where {totalTargetHits:10}wand(tags, @userProfile)' \
'userProfile={"hard rock":1, "rock":1,"metal":1, "finnish metal":1}' \
'hits=1' \
'ranking=personalized'
21 changes: 11 additions & 10 deletions en/querying/approximate-nn-hnsw.md
@@ -134,7 +134,7 @@ or exact (brute-force) search by using the [approximate query annotation](../ref

<pre>
{
-"yql": "select * from doc where {targetHits: 100, approximate:false}nearestNeighbor(image_embeddings,query_image_embedding)",
+"yql": "select * from doc where {totalTargetHits: 10, approximate:false}nearestNeighbor(image_embeddings,query_image_embedding)",
"hits": 10,
"input.query(query_image_embedding)": [0.21,0.12,....],
"ranking.profile": "image_similarity"
@@ -150,9 +150,9 @@ Note that exact searches over a large vector volume require adjustment of the
The default [query timeout](../reference/api/query.html#timeout) is 500ms,
which will be too low for an exact search over many vectors.

-In addition to [targetHits](../reference/querying/yql.html#targethits),
+In addition to [totalTargetHits](../reference/querying/yql.html#totaltargethits),
there is a [hnsw.exploreAdditionalHits](../reference/querying/yql.html#hnsw-exploreadditionalhits) parameter
-which controls how many extra nodes in the graph (in addition to `targetHits`)
+which controls how many extra nodes in the graph (in addition to `totalTargetHits`)
that are explored during the graph search. This parameter is used to tune accuracy versus query performance.
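For illustration, a query combining the two parameters might look like the following sketch (reusing the field and profile names from the example above; the values are illustrative and should be tuned against measured recall):

<pre>
{
    "yql": "select * from doc where {totalTargetHits: 100, hnsw.exploreAdditionalHits: 200}nearestNeighbor(image_embeddings,query_image_embedding)",
    "hits": 10,
    "input.query(query_image_embedding)": [0.21,0.12,....],
    "ranking.profile": "image_similarity"
}
</pre>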

## Combining approximate nearest neighbor search with filters
@@ -174,22 +174,23 @@ Note that when using `pre-filtering` the following query operators are not inclu
* [predicate](../reference/querying/yql.html#predicate)

These are instead evaluated after the approximate nearest neighbors are retrieved, more like a `post-filter`.
-This might cause the search to expose fewer hits to ranking than the wanted `targetHits`.
+This might cause the search to expose fewer hits to ranking than the wanted `totalTargetHits`.

Since {% include version.html version="8.78" %} the `pre-filter` can be evaluated using
[multiple threads per query](../performance/practical-search-performance-guide.html#multithreaded-search-and-ranking).
This can be used to reduce query latency for larger vector datasets where the cost of evaluating the `pre-filter` is significant.
Note that searching the `HNSW` index is always single-threaded per query.
Multithreaded evaluation when using `post-filtering` has always been supported,
-but this is less relevant as the `HNSW` index search first reduces the document candidate set based on `targetHits`.
+but this is less relevant as the `HNSW` index search first reduces the document candidate set based on `totalTargetHits`.

## Nearest Neighbor Search Considerations

-* **targetHits**:
-The [targetHits](../reference/querying/yql.html#targethits)
-specifies how many hits one wants to expose to [ranking](../basics/ranking.html) *per content node*.
-Approximate search exposes exactly `targetHits` hits to `first-phase` ranking on every content node
-as long as `targetHits` hits are actually found and not filtered out afterwards.
+* **totalTargetHits**:
+The [totalTargetHits](../reference/querying/yql.html#totaltargethits) parameter
+specifies how many hits one wants to expose to [ranking](../basics/ranking.html) in total over the content nodes
+participating in the query (you can also set this per node using [targetHits](../reference/querying/yql.html#targethits)).
+Approximate search exposes exactly `totalTargetHits` hits to `first-phase` ranking over the content nodes
+as long as `totalTargetHits` hits are actually found and not filtered out.
Nearest neighbor search is typically used as an efficient retriever in a [phased ranking](../ranking/phased-ranking.html)
pipeline. See [performance sizing](../performance/sizing-search.html).

Expand Down