refactor: type-safe embedding model system, reindexing safety fixes, and tool formatting extraction #2

Merged
Helweg merged 4 commits into Helweg:main from lucreiss:refactor/type-safe-embedding-models-and-tool-utils
Feb 22, 2026

@lucreiss (Contributor) commented Feb 20, 2026

Summary

Restructures the embedding model/provider type system to be data-driven and type-safe,
fixes several bugs in the reindexing and compatibility check flow, extracts tool formatting
logic into maintainable utilities, and adds support for Google's task-specific embeddings
and newer embedding models.

Changes

  • Config restructuring: Extract EMBEDDING_MODELS constant as single source of truth
    for all model metadata; derive EmbeddingProvider, EmbeddingModelName, and related
    types from it at compile time
  • Embedding provider improvements: Split embed() into embedQuery()/embedDocument()
    to support Google's task-type optimization (CODE_RETRIEVAL_QUERY vs RETRIEVAL_DOCUMENT);
    add gemini-embedding-001 with Matryoshka truncation; replace deprecated
    text-embedding-004 with text-embedding-005; add Google batch request support
    (up to 20 texts per call)
  • Reindexing safety fixes: checkCompatibility() no longer silently returns
    compatible: true on uninitialized state (throws instead); clearIndex() now deletes
    stale index metadata so force-rebuild works correctly with a new provider; index()
    blocks before writing if the index is incompatible; search() and findSimilar() now
    call ensureInitialized() before the compatibility check (previously ran on null state);
    findSimilar() gains a compatibility check it previously lacked; add provider mismatch
    detection via new IncompatibilityCode enum
  • Tool formatting extraction: Move all formatting functions from src/tools/index.ts
    to src/tools/utils.ts for maintainability and organization
  • Indexer type exports: Export SearchResult, HealthCheckResult, StatusResult
    interfaces from the indexer module
  • Test coverage: Comprehensive tests for config parsing, model validation, and all
    extracted formatting utilities
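
The "single source of truth" idea from the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the registry fields, the dimension values, the `ModelsOf` helper, and the `isValidModel` signature are assumptions; only the names `EMBEDDING_MODELS`, `EmbeddingProvider`, `EmbeddingModelName`, and `isValidModel` come from the PR text.

```typescript
// Hypothetical registry shape; the real EMBEDDING_MODELS fields may differ.
// Dimension values are illustrative only.
const EMBEDDING_MODELS = {
  openai: {
    "text-embedding-3-small": { dimensions: 1536 },
  },
  google: {
    "text-embedding-005": { dimensions: 768 },
    "gemini-embedding-001": { dimensions: 3072 },
  },
} as const;

// Provider names are the keys of the registry.
type EmbeddingProvider = keyof typeof EMBEDDING_MODELS;

// Model names that are valid for one given provider (hypothetical helper type).
type ModelsOf<P extends EmbeddingProvider> = keyof (typeof EMBEDDING_MODELS)[P];

// Union of every model name across all providers.
type EmbeddingModelName = { [P in EmbeddingProvider]: ModelsOf<P> }[EmbeddingProvider];

// A misspelled model name on the right-hand side would fail to compile.
const exampleModel: EmbeddingModelName = "gemini-embedding-001";

// Runtime counterpart of the compile-time types, for validating untyped config input.
function isValidModel(provider: string, model: string): boolean {
  if (!Object.prototype.hasOwnProperty.call(EMBEDDING_MODELS, provider)) return false;
  const models = EMBEDDING_MODELS[provider as EmbeddingProvider];
  return Object.prototype.hasOwnProperty.call(models, model);
}
```

Because the types are derived from the constant, adding a model to the registry updates the unions without a second edit elsewhere.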

Testing

  • Unit tests added/updated
  • Manual testing performed
  • Build passes (npm run build)
  • Tests pass (npm run test:run)
  • Lint passes (npm run lint)

Related Issues

N/A

…nd extract tool formatting

- Extract embedding model constants to src/config/constants.ts with typed EMBEDDING_MODELS registry
- Derive EmbeddingProvider and model types from EMBEDDING_MODELS for compile-time safety
- Split embed() into embedQuery() and embedDocument() for task-specific embeddings
- Add gemini-embedding-001 model, replace deprecated text-embedding-004 with text-embedding-005
- Rename DetectedProvider to ConfiguredProviderInfo, split auto-detection into tryDetectProvider()
- Fix reindexing safety: checkCompatibility() no longer silently returns compatible on
  uninitialized state; clearIndex() now deletes stale metadata so force-rebuild works
  correctly; index() blocks on incompatible index; search()/findSimilar() call
  ensureInitialized() before compatibility check; add provider mismatch detection
- Extract tool formatting logic to src/tools/utils.ts for maintainability
- Add IncompatibilityCode enum; export SearchResult, HealthCheckResult, StatusResult interfaces
- Improve Google provider batching and outputDimensionality support
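
The reindexing safety ordering described in this commit might look roughly like the sketch below. It is an assumption-laden illustration: `IndexerSketch`, `IndexMetadata`, `CompatibilityResult`, the `loadMetadata` callback, and the `DIMENSION_MISMATCH` and `MODEL_MISMATCH` codes are hypothetical; only `checkCompatibility`, `ensureInitialized`, `search`, `IncompatibilityCode`, and `PROVIDER_MISMATCH` are named in the PR. The PR uses an enum for the codes; a string-literal union is used here to keep the sketch short. In this first commit a provider mismatch is treated as blocking.

```typescript
// Modelled as a union here; the PR describes an IncompatibilityCode enum.
type IncompatibilityCode = "DIMENSION_MISMATCH" | "MODEL_MISMATCH" | "PROVIDER_MISMATCH";

interface IndexMetadata {
  provider: string;
  model: string;
  dimensions: number;
}

interface CompatibilityResult {
  compatible: boolean;
  code?: IncompatibilityCode;
}

class IndexerSketch {
  private stored: IndexMetadata | null = null;
  private initialized = false;
  private readonly current: IndexMetadata;
  private readonly loadMetadata: () => IndexMetadata | null;

  constructor(current: IndexMetadata, loadMetadata: () => IndexMetadata | null) {
    this.current = current;
    this.loadMetadata = loadMetadata;
  }

  // Loads persisted index metadata once; later calls are no-ops.
  ensureInitialized(): void {
    if (this.initialized) return;
    this.stored = this.loadMetadata();
    this.initialized = true;
  }

  checkCompatibility(): CompatibilityResult {
    // Refuse to answer instead of silently reporting "compatible".
    if (!this.initialized) {
      throw new Error("checkCompatibility() called before initialization");
    }
    const stored = this.stored;
    if (stored === null) return { compatible: true }; // nothing indexed yet
    if (stored.dimensions !== this.current.dimensions) {
      return { compatible: false, code: "DIMENSION_MISMATCH" };
    }
    if (stored.model !== this.current.model) {
      return { compatible: false, code: "MODEL_MISMATCH" };
    }
    if (stored.provider !== this.current.provider) {
      return { compatible: false, code: "PROVIDER_MISMATCH" };
    }
    return { compatible: true };
  }

  search(query: string): string {
    this.ensureInitialized(); // must run before the compatibility check
    const result = this.checkCompatibility();
    if (!result.compatible) throw new Error(`Index incompatible: ${result.code}`);
    return `searching for ${query}`;
  }
}
```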
…l utils

- Update config tests for nested EMBEDDING_MODELS structure and isValidModel validation
- Add embeddingModel parsing tests covering provider/model cross-validation
- Add DEFAULT_PROVIDER_MODELS consistency tests
- Update cost and embeddings tests for new type structure
- Update watcher test config with requireProjectMarker and debug fields
- Add comprehensive tests for all extracted tool formatting functions
…requests

The Google embedding provider was using the embedContent endpoint for
both single and batch requests, sending multiple texts as parts in one
content object. This caused embedContent to concatenate texts into a
single embedding instead of producing one per text, and the response
was parsed incorrectly (plural 'embeddings' vs singular 'embedding').

- embedQuery/embedDocument now use embedContent with correct singular
  response parsing (data.embedding.values)
- embedBatch now uses batchEmbedContents with per-text requests array,
  each carrying its own model, content, taskType, and outputDimensionality
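
The difference between the two request shapes can be sketched without any network calls. The helper names (`buildBatchBody`, `parseSingle`, `parseBatch`, `chunk`) are hypothetical, and the `models/` prefix on the model name is an assumption about the request format; the singular `embedding` versus plural `embeddings` response fields and the per-text `requests` array follow the commit message above.

```typescript
type TaskType = "CODE_RETRIEVAL_QUERY" | "RETRIEVAL_DOCUMENT";

interface EmbedContentRequest {
  model: string;
  content: { parts: { text: string }[] };
  taskType: TaskType;
  outputDimensionality?: number;
}

// One request object per text, so the response holds one embedding per text
// instead of a single embedding for the concatenated parts.
function buildBatchBody(
  model: string,
  texts: string[],
  taskType: TaskType,
  outputDimensionality?: number,
): { requests: EmbedContentRequest[] } {
  return {
    requests: texts.map((text) => ({
      model: `models/${model}`, // assumed prefix, see lead-in
      content: { parts: [{ text }] },
      taskType,
      ...(outputDimensionality !== undefined ? { outputDimensionality } : {}),
    })),
  };
}

// embedContent: the response carries a singular `embedding`.
function parseSingle(data: { embedding: { values: number[] } }): number[] {
  return data.embedding.values;
}

// batchEmbedContents: the response carries a plural `embeddings` array.
function parseBatch(data: { embeddings: { values: number[] }[] }): number[][] {
  return data.embeddings.map((e) => e.values);
}

// Splits texts into groups of at most `size` (the PR mentions up to 20 per call).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

A caller would presumably run `chunk(texts, 20)`, send one `buildBatchBody(...)` per group, and concatenate the `parseBatch(...)` results in order.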

@Helweg (Owner) left a comment

Great restructuring — the data-driven model system and embedQuery/embedDocument split for Google task types are clean improvements.

A few things:

  1. Bug in createEmbeddingProvider default branch (src/embeddings/provider.ts): The error message interpolates the whole configuredProviderInfo object, throw new Error(`Unsupported embedding provider: ${configuredProviderInfo}`), which will produce [object Object]. It should interpolate configuredProviderInfo.provider.

  2. Score label change in codebase_search: formatSearchResults now labels all scores as (similarity: X%). Previously codebase_search used (score: 0.85) while only find_similar used the percentage format. For hybrid search results (semantic + BM25 fusion), "similarity" is misleading since the fused score is a relevance rank, not a cosine similarity. Consider keeping the old format for codebase_search.

  3. Provider mismatch check: The new PROVIDER_MISMATCH incompatibility will force a full rebuild when switching between github-copilot and openai with the same model (text-embedding-3-small), even though Copilot proxies to OpenAI and produces identical embeddings. The dimension and model checks already catch genuinely incompatible scenarios — is the provider-only block intentional, or should it be a warning?

  4. Missing trailing newlines: src/config/constants.ts, src/config/index.ts, and src/tools/index.ts are all missing a final newline.

Otherwise LGTM.
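
Point 1 of the review can be reproduced in isolation. The object below is a hypothetical stand-in for configuredProviderInfo; the behaviour shown is standard JavaScript string coercion of a plain object inside a template literal.

```typescript
// Hypothetical stand-in for the PR's configuredProviderInfo value.
const providerInfo = { provider: "acme", model: "acme-embed-1" };

// Interpolating the whole object coerces it with Object.prototype.toString.
const unhelpful = `Unsupported embedding provider: ${providerInfo}`;

// Interpolating the discriminating field gives a readable message.
const helpful = `Unsupported embedding provider: ${providerInfo.provider}`;

console.log(unhelpful); // Unsupported embedding provider: [object Object]
console.log(helpful);   // Unsupported embedding provider: acme
```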

…h, and exhaustiveness

- Replace unreachable default branch in createEmbeddingProvider with
  exhaustive never check for compile-time safety
- Restore raw score format (score: 0.85) for codebase_search hybrid
  results; keep similarity percentage only for find_similar
- Downgrade PROVIDER_MISMATCH from hard incompatibility to a warning
  when model and dimensions match (avoids unnecessary rebuilds for
  providers that proxy to the same backend)
- Add missing trailing newlines to constants.ts, config/index.ts,
  and tools/index.ts
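
An exhaustive never check of the kind named in the first bullet is commonly written as below. This is one usual form of the pattern, not necessarily the PR's exact code: `ProviderInfoSketch`, its fields, and `describeProvider` are hypothetical, and only three of the four providers are modelled.

```typescript
// Hypothetical discriminated union standing in for ConfiguredProviderInfo.
type ProviderInfoSketch =
  | { provider: "openai"; apiKey: string }
  | { provider: "google"; apiKey: string }
  | { provider: "github-copilot"; token: string };

function describeProvider(info: ProviderInfoSketch): string {
  switch (info.provider) {
    case "openai":
      return "OpenAI (API key)";
    case "google":
      return "Google (API key)";
    case "github-copilot":
      return "GitHub Copilot (token)";
    default: {
      // Every variant is handled above, so `info` has type `never` here.
      // Adding a variant to the union without a matching case makes this
      // assignment a compile-time error.
      const unreachable: never = info;
      throw new Error(`Unsupported embedding provider: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The throw is kept only as a guard for values that bypass the type system, such as unvalidated JSON cast to the union type.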

@lucreiss (Contributor, Author) commented

Thanks for the thorough review! I addressed all four points:

1. createEmbeddingProvider default branch — You're right about the [object Object] bug.
I went a step further and replaced the default with an exhaustive never check. Since
ConfiguredProviderInfo is a discriminated union over the four known providers, the default
was already unreachable — removing it and adding never turns a missing case into a
compile-time error rather than a runtime one.
2. Score label in codebase_search — Good catch, I missed this when extracting
formatSearchResults into utils. codebase_search now uses (score: 0.85) format again
(appropriate for fused hybrid results that blend cosine similarity + BM25), while
find_similar keeps the (similarity: 92.0%) percentage format (pure cosine similarity).
I added a ScoreFormat parameter to formatSearchResults and two new tests covering both
formats.
3. Provider mismatch — I had this as intentional as a defensive measure — model names
like text-embedding-3-small are generic enough that a different provider could theoretically
reuse the name with different embeddings. That said, you're right that forcing a full rebuild
when switching between github-copilot and openai with matching model + dimensions is unnecessarily disruptive.
I changed it to return compatible: true with a logged warning instead. The dimension and
model checks remain as the real safety net.
4. Trailing newlines — Fixed in all three files.

All checks pass: build, typecheck, lint, 321/321 tests (2 new).
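
The ScoreFormat parameter from point 2 could be sketched as follows. The names `ScoreFormat` and `formatSearchResults` come from the PR, but the signature, the `ResultSketch` type, the `formatScore` helper, and the line layout are assumptions; only the two label formats, (score: 0.85) and (similarity: 92.0%), are taken from the discussion.

```typescript
// "score" for fused hybrid results, "similarity" for pure cosine similarity.
type ScoreFormat = "score" | "similarity";

// Hypothetical minimal result shape for the sketch.
interface ResultSketch {
  filePath: string;
  score: number;
}

function formatScore(score: number, format: ScoreFormat): string {
  return format === "similarity"
    ? `(similarity: ${(score * 100).toFixed(1)}%)`
    : `(score: ${score.toFixed(2)})`;
}

function formatSearchResults(results: ResultSketch[], format: ScoreFormat): string {
  return results
    .map((r, i) => `${i + 1}. ${r.filePath} ${formatScore(r.score, format)}`)
    .join("\n");
}
```

With this shape, codebase_search would pass "score" and find_similar would pass "similarity", so both tools share one formatter.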

@Helweg (Owner) left a comment

The never exhaustive check is a nice touch — strictly better than the runtime default. And the ScoreFormat parameter is a clean way to handle the two contexts.

LGTM, ship it.

@Helweg merged commit d73de49 into Helweg:main on Feb 22, 2026
1 check passed