
Knowledge search: benchmark and improve ranking quality using real note-finding tasks #233

@ColinCee

Description

What's wrong

The current knowledge search is good enough to surface the right area of the note corpus, but not consistently good enough to rank the most useful note first for broad topical queries.

A recent example: searching for "personal finance" surfaced the right cluster of notes, but the top hits included a ## Related chunk and a maths note ahead of the dedicated areas/personal-finances/ notes. That is good recall but weak top-result precision for note-finding.

This matters because the current system is more complex and slower than simple global search, but it does not yet clearly beat grep / Ctrl+F for the kinds of note-finding tasks we actually do today. With the current corpus size, exact lexical search may still be competitive for many queries.

The core gap is not "add GraphRAG" or "add a reranker" in the abstract. The gap is that we do not yet have an empirical understanding of:

  1. which real search tasks matter most,
  2. where the current hybrid search beats plain global search,
  3. where it loses, and
  4. which ranking improvements would help enough to justify their cost and complexity.

What done looks like

  1. We have a small but representative benchmark of real note-finding tasks drawn from actual use, not invented toy examples. Each task includes the query, the note(s) a human would consider correct, and whether the task is primarily exact lookup, broad topical search, or relationship discovery.
  2. The current search is evaluated against that benchmark and compared with a simple lexical baseline (for example grep / title-path matching / full-text only), so we know where hybrid retrieval is already useful and where it is not.
  3. The main failure modes are documented from real results. Expected examples include chunk-level false positives (## Related sections, formula-only chunks), missing title/path boosts, and note-level vs chunk-level ranking problems.
  4. The issue body is updated with a prioritized list of follow-up improvements based on evidence, not speculation. Candidate improvements may include document-first ranking, metadata boosts for title/path/tags, section-quality penalties, top-k reranking, and link-aware retrieval.
  5. We make an explicit decision about whether GraphRAG is warranted. It should only be considered if the benchmark shows recurring multi-hop relationship questions that simpler ranking improvements cannot handle well.
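As a concrete illustration of points 1 and 2, a benchmark task and scorer could look roughly like the sketch below. The Task fields mirror the criteria above, but the field names, the metric choices (success@1 and mean reciprocal rank), and the search functions are all suggestions, not an existing format:

```python
# Hypothetical benchmark format and scorer; search_fn is a placeholder for
# either the hybrid search or a lexical baseline, returning ranked note paths.
from dataclasses import dataclass


@dataclass
class Task:
    query: str
    relevant_notes: set  # note paths a human would consider correct
    kind: str            # "exact", "topical", or "relationship"


def evaluate(search_fn, tasks, k=10):
    """Return success@1 and mean reciprocal rank over the benchmark."""
    hits_at_1, rr_sum = 0, 0.0
    for task in tasks:
        results = search_fn(task.query)[:k]
        for rank, note in enumerate(results, start=1):
            if note in task.relevant_notes:
                rr_sum += 1.0 / rank
                hits_at_1 += rank == 1
                break  # score the first relevant hit only
    n = len(tasks)
    return {"success@1": hits_at_1 / n, "mrr": rr_sum / n}
```

Running `evaluate(hybrid_search, tasks)` and `evaluate(lexical_search, tasks)` side by side, sliced by `kind`, would directly answer where hybrid retrieval wins and loses.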

What the agent can't discover

Current search behavior from discussion and code review:

  • Query ranking is chunk-level hybrid retrieval: vector similarity plus PostgreSQL full-text keyword search, merged with Reciprocal Rank Fusion (RRF).
  • Displayed scores are RRF scores, not calibrated relevance probabilities. Low absolute values are expected and are not themselves a bug.
  • The practical ranking weakness appears to be shape, not score scale: broad topical queries can rank adjacent or low-information chunks highly even when the right note family is present.
  • The likely highest-value improvement path is to fix ranking before adding graph complexity.
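For reference, the fusion step described in the first bullet can be sketched as follows. The constant k=60 is the conventional RRF default, assumed rather than confirmed from the actual code. Note how the merged scores come out as small fractions, which matches the point that low absolute RRF values are not a bug:

```python
# Sketch of Reciprocal Rank Fusion: merge a vector-similarity ranking and a
# keyword-search ranking into one score per chunk. Chunk ids are illustrative.
def rrf_merge(rankings, k=60):
    """rankings: list of ranked lists of chunk ids (best first)."""
    scores = {}
    for ranked in rankings:
        for rank, chunk_id in enumerate(ranked, start=1):
            # each list contributes 1 / (k + rank) for every chunk it ranks
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


vector_hits = ["note-a#related", "note-b#intro", "note-c#formula"]
keyword_hits = ["note-a#related", "note-b#intro"]
merged = rrf_merge([vector_hits, keyword_hits])
```

With k=60, even a chunk ranked first in both lists scores only 2/61 ≈ 0.033, so small displayed scores say nothing about relevance quality.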

Known likely improvement candidates from discussion:

  • Rank notes first, then select the best chunk per note.
  • Boost note title, path, folder, aliases, and tags instead of ranking on chunk body alone.
  • Downrank or filter low-signal sections such as ## Related, backlink lists, navigation blocks, and formula-only chunks for broad topical search.
  • Add a lightweight reranking pass over the top candidates.
  • Use the existing note-link graph for expansion or reranking before considering full GraphRAG.
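The first candidate, document-first ranking, can be as simple as collapsing the fused chunk ranking to one best chunk per note before display. This is a minimal sketch, assuming chunk ids of the hypothetical form path#section:

```python
# Collapse a fused chunk ranking to one entry per note, keeping each note's
# best-scoring chunk, so one note cannot crowd the top results with chunks.
def document_first(chunk_scores):
    """chunk_scores: list of (chunk_id, score) pairs, best first."""
    best = {}
    for chunk_id, score in chunk_scores:
        note = chunk_id.split("#", 1)[0]  # note path from "path#section"
        if note not in best or score > best[note][1]:
            best[note] = (chunk_id, score)
    return sorted(best.values(), key=lambda cs: cs[1], reverse=True)
```

On its own this only dedupes per note; combined with the title/path boosts and low-signal-section penalties above, it is what would let the right note outrank its own ## Related chunk.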

Important product framing from discussion:

  • For the current note corpus, hybrid RAG may not yet clearly outperform simple global search.
  • The value of semantic retrieval increases when the corpus is larger, wording is less memorable, or the user is asking fuzzy topical questions rather than exact-term lookups.
  • GraphRAG is more likely to help with cross-note relationship questions (for example "how do these topics connect?") than with the current ranking-quality problem.

What must not break

  • The current SSH-driven knowledge-search workflow must continue to work.
  • Exact-term and path-based lookup must remain strong; improvements for broad semantic search must not make simple known-item retrieval worse.
  • Search output should remain easy to inspect from the terminal.
  • Any future ranking changes should be validated against real note-finding tasks rather than tuned against a single anecdotal query.

Out of scope for this issue

  • Implementing GraphRAG immediately.
  • Rewriting the entire ingestion pipeline before we have benchmark evidence.
  • Treating low RRF score magnitudes as the problem.

Metadata

Labels: enhancement (New feature or request), priority/low (Backlog — nice to have, no urgency)
