feat(data): FTS5 entity search engine #960

Merged
cpcloud merged 1 commit into micasa-dev:main from cpcloud:fts-engine
Apr 20, 2026

cpcloud commented Apr 20, 2026

Summary

  • SQLite FTS5 virtual table `entities_fts` indexing seven entity types (projects, vendors, appliances, incidents, quotes, maintenance items, service logs) with porter + unicode61 tokenization.
  • `RebuildFTSIndex` populates the index from the live entity tables. Safe to call repeatedly; skips soft-deleted rows (own rows and soft-deleted parents in JOINs).
  • `SearchEntities` runs a MATCH query and returns BM25-ranked results with a stable `entity_id` tiebreaker, capped at 20 rows.
  • `EntitySummary` fetches a tri-state result (found / stale / missing) so callers can revalidate cached search hits.
  • `truncateField` clips indexed content by rune count without splitting UTF-8 sequences.
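
To make the tri-state result concrete, here is a minimal in-memory sketch of how a caller might revalidate a cached hit. It is not the PR's code: the names `SummaryStatus`, `liveRow`, and `lookupSummary` are hypothetical, a map stands in for the entity tables, and it assumes "stale" means the row still exists but has been soft-deleted, which the description does not spell out.

```go
package main

// SummaryStatus is a hypothetical three-valued lookup result.
type SummaryStatus int

const (
	SummaryFound   SummaryStatus = iota // row exists and is live
	SummaryStale                        // assumption: row exists but is soft-deleted
	SummaryMissing                      // no row with that ID
)

// liveRow is a stand-in for one entity row; Deleted models a non-NULL deleted_at.
type liveRow struct {
	Name    string
	Deleted bool
}

// lookupSummary reports the current name of an entity and whether a cached
// search hit pointing at it can still be trusted.
func lookupSummary(rows map[int64]liveRow, id int64) (string, SummaryStatus) {
	r, ok := rows[id]
	switch {
	case !ok:
		return "", SummaryMissing
	case r.Deleted:
		return "", SummaryStale
	default:
		return r.Name, SummaryFound
	}
}
```

Under this reading, a caller would keep a hit only on `SummaryFound` and drop or refresh it otherwise.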

This PR is the index and query layer only. No callers are wired up yet — triggers, query hardening, and the eval subcommand land in follow-up PRs.

Refs #707

Adds a SQLite FTS5 virtual table `entities_fts` indexing seven entity
types (projects, vendors, appliances, incidents, quotes, maintenance
items, service logs) and the Go engine to populate and query it.
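
For orientation, the table definition could look roughly like the statement built below. This is a sketch, not the PR's schema: the helper name and the column list are assumptions (only `entity_id` and `entity_name` are named in this conversation), while the `tokenize = 'porter unicode61'` option is standard FTS5 syntax for wrapping the unicode61 tokenizer in the porter stemmer.

```go
package main

import (
	"fmt"
	"strings"
)

// createEntitiesFTSSQL is a hypothetical helper that renders the DDL for an
// FTS5 table using the porter stemmer on top of the unicode61 tokenizer.
// Column definitions are passed in because the real column set is not shown
// in the PR description.
func createEntitiesFTSSQL(table string, columns []string) string {
	return fmt.Sprintf(
		"CREATE VIRTUAL TABLE IF NOT EXISTS %s USING fts5(%s, tokenize = 'porter unicode61');",
		table, strings.Join(columns, ", "))
}
```

Calling it with `"entities_fts"` and columns such as `entity_type UNINDEXED`, `entity_id UNINDEXED`, `entity_name` yields a single statement that can be executed at schema setup.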

- `entities_fts` virtual table with porter + unicode61 tokenization for
  stem-folded cross-language matching.
- `RebuildFTSIndex` rebuilds the index from scratch from the live
  entity tables; safe to call repeatedly.
- `populateEntitiesFTS` skips soft-deleted rows (including soft-deleted
  parents when JOINing) so the index never surfaces deleted data.
- `SearchEntities` runs a MATCH query, returns BM25-ranked results
  with a stable `entity_id` tiebreaker, capped at 20 rows.
- `EntitySummary` fetches a tri-state result (found / stale / missing)
  so callers can revalidate cached search hits before using them.
- `truncateField` clips indexed content by rune count to keep the FTS
  shadow table bounded without splitting UTF-8 sequences.
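
Rune-based truncation as described for `truncateField` can be sketched as follows. The function name `truncateRunes` and its signature are assumptions; the actual helper and its limit are not shown in this conversation.

```go
package main

// truncateRunes is a hypothetical stand-in for truncateField. It keeps at
// most maxRunes runes of s and only ever cuts at a rune boundary.
func truncateRunes(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}
	// Fast path: the byte length is an upper bound on the rune count.
	if len(s) <= maxRunes {
		return s
	}
	count := 0
	// Ranging over a string yields the byte offset of each rune start.
	for i := range s {
		if count == maxRunes {
			return s[:i]
		}
		count++
	}
	return s
}
```

Slicing at a byte offset taken from the `range` loop is what guarantees a multi-byte UTF-8 sequence is never split.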

Test coverage: creation, rebuild, soft-delete exclusion (own rows
and parents), cross-entity search, stemming, graceful degradation on
corrupted schema, stale revalidation, Unicode truncation, tiebreaker
determinism.

This is the index and query layer only. Context-formatting helpers
and caller wiring land in follow-up PRs.

Refs micasa-dev#707.

cpcloud merged commit 2535cf6 into micasa-dev:main Apr 20, 2026
28 checks passed
cpcloud deleted the fts-engine branch April 20, 2026 14:46
cpcloud added a commit that referenced this pull request Apr 20, 2026
## Summary

Installs AFTER INSERT / UPDATE / DELETE triggers on every source table
that contributes to `entities_fts` (projects, vendors, appliances,
maintenance_items, incidents, service_log_entries, quotes) so the index
stays current without `RebuildFTSIndex` on every app open.

- Parent tables whose text is embedded in a child's `entity_name`
(`project.title` and `vendor.name` in quotes, `maintenance_item.name` in
service log entries) get companion `_au_cascade` triggers that rebuild
the child's FTS row when the parent is updated.
- Cascade JOINs filter on `parent.deleted_at IS NULL`, so a parent
soft-delete degrades the child's `entity_name` (the project title or
vendor name drops out of the quote; the service log entry's name goes
blank) instead of leaving stale text in the index.
- The populate path carries the same filter so initial rebuilds match
the trigger invariant.
- Trigger installation is idempotent (DROP IF EXISTS + CREATE), so
schema drift heals on the next `Store.Open`. FK constraints (RESTRICT on
quote parents, CASCADE on service log entry parents) keep the trigger
semantics consistent with the rest of the domain.
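
The idempotent DROP IF EXISTS + CREATE pattern from the list above can be sketched as a small statement generator. Everything specific here is an assumption for illustration: the function name, the `<table>_fts_ai/au/ad` trigger names, the FTS column list, and the `id` and `deleted_at` source columns. It covers only the plain per-table triggers, not the `_au_cascade` ones.

```go
package main

import "fmt"

// ftsTriggerSQL returns DROP + CREATE statements for three sync triggers on
// one source table. Arguments are expected to be compile-time constants; they
// are interpolated into SQL without escaping.
func ftsTriggerSQL(table, entityType, nameCol string) []string {
	// Remove the entity's existing FTS row, if any.
	del := fmt.Sprintf(
		"DELETE FROM entities_fts WHERE entity_type = '%s' AND entity_id = OLD.id;",
		entityType)
	// Re-add the row only when the source row is not soft-deleted.
	ins := fmt.Sprintf(
		"INSERT INTO entities_fts (entity_type, entity_id, entity_name) "+
			"SELECT '%s', NEW.id, NEW.%s WHERE NEW.deleted_at IS NULL;",
		entityType, nameCol)

	specs := []struct{ suffix, event, body string }{
		{"ai", "INSERT", ins},
		{"au", "UPDATE", del + " " + ins},
		{"ad", "DELETE", del},
	}
	var stmts []string
	for _, s := range specs {
		name := fmt.Sprintf("%s_fts_%s", table, s.suffix)
		stmts = append(stmts,
			"DROP TRIGGER IF EXISTS "+name+";",
			fmt.Sprintf("CREATE TRIGGER %s AFTER %s ON %s BEGIN %s END;",
				name, s.event, table, s.body))
	}
	return stmts
}
```

Because every CREATE is preceded by a DROP IF EXISTS for the same name, running the returned statements on each open replaces whatever trigger definition was there before.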

Stacked on top of #960 (FTS engine). Diff will shrink to just the
trigger additions once that merges.

Refs #707
cpcloud added a commit that referenced this pull request Apr 21, 2026
…962)

## Summary

Hardens `SearchEntities` against real-world natural-language queries and
against single-type result floods.

**Ranking**: a three-tier window-function query replaces the flat
`LIMIT 20`:

- Tier 1 takes exactly one row per matching entity type (guarantees
cross-type representation).
- Tier 2 raises each type up to `ftsEntityKPerType` rows so single noisy
types can't dominate.
- Tier 3 fills the remaining room up to `ftsEntityTotalCap` from
whatever's left, globally ranked. Single-type searches use the full cap
this way.

Package-level tuning constants (not user-configurable — the eval harness
is the tuning channel):

    ftsEntityKPerType    = 5
    ftsEntityRankCeiling = 0.0   // permissive; eval will tighten
    ftsEntityTotalCap    = 20

`entity_id` tiebreaks rank in every `ORDER BY` so results are stable
when BM25 produces identical ranks.
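
The tier logic can be illustrated in memory, without SQL. This is a sketch under stated assumptions, not the PR's window-function query: the `hit` type, `better`, and `selectTiered` are hypothetical names, `entity_id` is assumed to be an integer, results are assumed to come back in tier order, and `ftsEntityRankCeiling` is left out because the description does not say how it is applied. It relies on FTS5's convention that a lower BM25 value means a better match.

```go
package main

import "sort"

const (
	ftsEntityKPerType = 5
	ftsEntityTotalCap = 20
)

// hit is a hypothetical search result row.
type hit struct {
	EntityType string
	EntityID   int64
	Rank       float64 // BM25 score; lower is better in FTS5
}

// better orders by rank first and uses EntityID as the tiebreaker.
func better(a, b hit) bool {
	if a.Rank != b.Rank {
		return a.Rank < b.Rank
	}
	return a.EntityID < b.EntityID
}

// selectTiered keeps one row per type (tier 1), then up to K rows per type
// (tier 2), then fills from the rest (tier 3), capped at ftsEntityTotalCap.
func selectTiered(hits []hit) []hit {
	sorted := append([]hit(nil), hits...)
	sort.SliceStable(sorted, func(i, j int) bool { return better(sorted[i], sorted[j]) })

	var t1, t2, t3 []hit
	perType := map[string]int{} // position of the current hit within its type
	for _, h := range sorted {
		perType[h.EntityType]++
		switch pos := perType[h.EntityType]; {
		case pos == 1:
			t1 = append(t1, h)
		case pos <= ftsEntityKPerType:
			t2 = append(t2, h)
		default:
			t3 = append(t3, h)
		}
	}
	out := append(append(t1, t2...), t3...)
	if len(out) > ftsEntityTotalCap {
		out = out[:ftsEntityTotalCap]
	}
	return out
}
```

With 30 matching quotes and a single weakly ranked vendor, the vendor still appears in the output because tier 1 reserves one slot per type, while a quotes-only search fills all 20 slots through tier 3.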

**Query tolerance**:

- `prepareFTSEntityQuery` lowercases the input, strips non-alphanumeric
characters, drops short tokens and stopwords, and OR-joins the survivors
as quoted prefix phrases.
- Returns early when no content words survive, so a pure-stopword
question like "what is it?" doesn't hammer FTS with an empty MATCH.
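
A query preparation step along these lines might look like the sketch below. It is an approximation, not the PR's `prepareFTSEntityQuery`: the function name, the minimum token length, the stopword list, and the exact `"token"*` output shape are all assumptions, and non-alphanumeric characters are treated as token separators.

```go
package main

import (
	"strings"
	"unicode"
	"unicode/utf8"
)

// minTokenRunes is an assumed threshold; shorter tokens are dropped.
const minTokenRunes = 2

// stopwords is a small illustrative list, not the PR's actual list.
var stopwords = map[string]bool{
	"a": true, "an": true, "the": true, "is": true, "it": true,
	"what": true, "when": true, "did": true, "of": true, "to": true,
}

// prepareQuery turns free text into an FTS5 MATCH expression. The second
// return value is false when no content words survive, so the caller can
// skip the search entirely.
func prepareQuery(raw string) (string, bool) {
	// Lowercase, then split on anything that is not a letter or digit.
	tokens := strings.FieldsFunc(strings.ToLower(raw), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var terms []string
	for _, tok := range tokens {
		if utf8.RuneCountInString(tok) < minTokenRunes || stopwords[tok] {
			continue
		}
		// Tokens contain only letters and digits, so quoting needs no escaping.
		terms = append(terms, `"`+tok+`"*`)
	}
	if len(terms) == 0 {
		return "", false
	}
	return strings.Join(terms, " OR "), true
}
```

Quoting every token keeps words such as "or" and "and" from being parsed as FTS5 operators, and the trailing `*` turns each one into a prefix match.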

Stacked on top of #961 (triggers), which is stacked on #960 (engine).

Refs #707

Labels: data (Data layer, models, database), enhancement (New feature or request)
