refactor: type-safe embedding model system, reindexing safety fixes, and tool formatting extraction by lucreiss · Pull Request #2 · Helweg/opencode-codebase-index

lucreiss · 2026-02-20T02:42:36Z

Summary

Restructures the embedding model/provider type system to be data-driven and type-safe,
fixes several bugs in the reindexing and compatibility check flow, extracts tool formatting
logic into maintainable utilities, and adds support for Google's task-specific embeddings
and newer embedding models.

Changes

- Config restructuring: Extract the EMBEDDING_MODELS constant as the single source of truth for all model metadata; derive EmbeddingProvider, EmbeddingModelName, and related types from it at compile time.
- Embedding provider improvements: Split embed() into embedQuery()/embedDocument() to support Google's task-type optimization (CODE_RETRIEVAL_QUERY vs RETRIEVAL_DOCUMENT); add gemini-embedding-001 with Matryoshka truncation; replace deprecated text-embedding-004 with text-embedding-005; add Google batch request support (up to 20 texts per call).
- Reindexing safety fixes: checkCompatibility() no longer silently returns compatible: true on uninitialized state (it throws instead); clearIndex() now deletes stale index metadata so force-rebuild works correctly with a new provider; index() blocks before writing if the index is incompatible; search() and findSimilar() now call ensureInitialized() before the compatibility check (which previously ran on null state); findSimilar() gains a compatibility check it previously lacked; provider mismatch detection is added via a new IncompatibilityCode enum.
- Tool formatting extraction: Move all formatting functions from src/tools/index.ts to src/tools/utils.ts for maintainability and organization.
- Indexer type exports: Export the SearchResult, HealthCheckResult, and StatusResult interfaces from the indexer module.
- Test coverage: Comprehensive tests for config parsing, model validation, and all extracted formatting utilities.
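
As a rough illustration of the "derive types from EMBEDDING_MODELS" idea, a registry declared with `as const` can drive both the compile-time types and a runtime validity check. This is a minimal sketch, not the PR's actual code: the provider and model names come from this thread, but the nesting, the `dimensions` field, its values, and the `isValidModel` signature are assumptions, and only three providers are listed.

```typescript
// Illustrative registry. Provider and model names appear in this PR;
// the nesting, the `dimensions` field, and its values are assumptions.
const EMBEDDING_MODELS = {
  openai: {
    "text-embedding-3-small": { dimensions: 1536 },
  },
  "github-copilot": {
    "text-embedding-3-small": { dimensions: 1536 },
  },
  google: {
    "text-embedding-005": { dimensions: 768 },
    "gemini-embedding-001": { dimensions: 3072 },
  },
} as const;

// Both types are computed from the constant, so adding a model to the
// registry widens them without a hand-maintained union.
type EmbeddingProvider = keyof typeof EMBEDDING_MODELS;
type EmbeddingModelName = {
  [P in EmbeddingProvider]: keyof (typeof EMBEDDING_MODELS)[P];
}[EmbeddingProvider];

// Runtime counterpart for values that arrive as plain strings,
// for example from a user's config file (assumed signature).
function isValidModel(provider: EmbeddingProvider, model: string): boolean {
  return Object.prototype.hasOwnProperty.call(EMBEDDING_MODELS[provider], model);
}

// Accepted by the compiler because the name exists in the registry.
const defaultGoogleModel: EmbeddingModelName = "gemini-embedding-001";
// const stale: EmbeddingModelName = "text-embedding-004"; // would not compile
```

With this shape, removing a deprecated model from the registry turns every remaining reference to it into a type error, which is the compile-time safety the description refers to.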

Testing

Related Issues

N/A

…nd extract tool formatting

- Extract embedding model constants to src/config/constants.ts with typed EMBEDDING_MODELS registry
- Derive EmbeddingProvider and model types from EMBEDDING_MODELS for compile-time safety
- Split embed() into embedQuery() and embedDocument() for task-specific embeddings
- Add gemini-embedding-001 model, replace deprecated text-embedding-004 with text-embedding-005
- Rename DetectedProvider to ConfiguredProviderInfo, split auto-detection into tryDetectProvider()
- Fix reindexing safety: checkCompatibility() no longer silently returns compatible on uninitialized state; clearIndex() now deletes stale metadata so force-rebuild works correctly; index() blocks on incompatible index; search()/findSimilar() call ensureInitialized() before compatibility check; add provider mismatch detection
- Extract tool formatting logic to src/tools/utils.ts for maintainability
- Add IncompatibilityCode enum; export SearchResult, HealthCheckResult, StatusResult interfaces
- Improve Google provider batching and outputDimensionality support

…l utils

- Update config tests for nested EMBEDDING_MODELS structure and isValidModel validation
- Add embeddingModel parsing tests covering provider/model cross-validation
- Add DEFAULT_PROVIDER_MODELS consistency tests
- Update cost and embeddings tests for new type structure
- Update watcher test config with requireProjectMarker and debug fields
- Add comprehensive tests for all extracted tool formatting functions

…requests

The Google embedding provider was using the embedContent endpoint for both single and batch requests, sending multiple texts as parts in one content object. This caused embedContent to concatenate texts into a single embedding instead of producing one per text, and the response was parsed incorrectly (plural 'embeddings' vs singular 'embedding').

- embedQuery/embedDocument now use embedContent with correct singular response parsing (data.embedding.values)
- embedBatch now uses batchEmbedContents with per-text requests array, each carrying its own model, content, taskType, and outputDimensionality
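
The request and response shapes described in this commit can be sketched without any network code. The helper names below (`chunk`, `buildBatchBody`, `parseEmbedContent`, `parseBatchEmbedContents`) are hypothetical and not taken from the PR; the field names follow the commit message, and the batch size of 20 is the limit quoted in the PR description.

```typescript
type GoogleTaskType = "CODE_RETRIEVAL_QUERY" | "RETRIEVAL_DOCUMENT";

interface EmbedContentRequest {
  model: string;
  content: { parts: { text: string }[] };
  taskType: GoogleTaskType;
  outputDimensionality?: number;
}

// Batch limit quoted in the PR description.
const MAX_BATCH_SIZE = 20;

// Split a list into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// One request entry per text, each with exactly one part, so the API
// returns one embedding per text instead of a single concatenated one.
function buildBatchBody(
  model: string,
  texts: string[],
  taskType: GoogleTaskType,
  outputDimensionality?: number,
): { requests: EmbedContentRequest[] } {
  return {
    requests: texts.map((text) => ({
      model: `models/${model}`,
      content: { parts: [{ text }] },
      taskType,
      ...(outputDimensionality !== undefined ? { outputDimensionality } : {}),
    })),
  };
}

// embedContent answers with a singular `embedding` object.
function parseEmbedContent(data: { embedding: { values: number[] } }): number[] {
  return data.embedding.values;
}

// batchEmbedContents answers with a plural `embeddings` array.
function parseBatchEmbedContents(data: { embeddings: { values: number[] }[] }): number[][] {
  return data.embeddings.map((e) => e.values);
}
```

A caller would presumably run `chunk(texts, MAX_BATCH_SIZE)`, send one `buildBatchBody(...)` per slice, and concatenate the parsed results in order.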

Helweg

Great restructuring — the data-driven model system and embedQuery/embedDocument split for Google task types are clean improvements.

A few things:

1. Bug in createEmbeddingProvider default branch (src/embeddings/provider.ts): The error message interpolates the whole configuredProviderInfo object, throw new Error(`Unsupported embedding provider: ${configuredProviderInfo}`), which will produce [object Object]. It should interpolate configuredProviderInfo.provider.
2. Score label change in codebase_search: formatSearchResults now labels all scores as (similarity: X%). Previously codebase_search used (score: 0.85) while only find_similar used the percentage format. For hybrid search results (semantic + BM25 fusion), "similarity" is misleading since the fused score is a relevance rank, not a cosine similarity. Consider keeping the old format for codebase_search.
3. Provider mismatch check: The new PROVIDER_MISMATCH incompatibility will force a full rebuild when switching between github-copilot and openai with the same model (text-embedding-3-small), even though Copilot proxies to OpenAI and produces identical embeddings. The dimension and model checks already catch genuinely incompatible scenarios. Is the provider-only block intentional, or should it be a warning?
4. Missing trailing newlines: src/config/constants.ts, src/config/index.ts, and src/tools/index.ts are all missing a final newline.

Otherwise LGTM.
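
The [object Object] behavior from the first review point is easy to reproduce in isolation. In this small sketch the object literal is a hypothetical stand-in for whatever createEmbeddingProvider actually receives:

```typescript
// Hypothetical stand-in for the value passed to createEmbeddingProvider.
const configuredProviderInfo = { provider: "acme-embeddings", apiKey: "secret" };

// Interpolating the whole object falls back to Object.prototype.toString.
const buggyMessage = `Unsupported embedding provider: ${configuredProviderInfo}`;

// Interpolating the discriminant field yields a readable message.
const fixedMessage = `Unsupported embedding provider: ${configuredProviderInfo.provider}`;
```

Here `buggyMessage` ends in `[object Object]`, while `fixedMessage` ends in `acme-embeddings`.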

…h, and exhaustiveness

- Replace unreachable default branch in createEmbeddingProvider with exhaustive never check for compile-time safety
- Restore raw score format (score: 0.85) for codebase_search hybrid results; keep similarity percentage only for find_similar
- Downgrade PROVIDER_MISMATCH from hard incompatibility to a warning when model and dimensions match (avoids unnecessary rebuilds for providers that proxy to the same backend)
- Add missing trailing newlines to constants.ts, config/index.ts, and tools/index.ts
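
One way to picture the resulting compatibility rules (throw when uninitialized, block on dimension or model mismatch, warn on a provider-only mismatch) is the sketch below. Only the names checkCompatibility, IncompatibilityCode, and PROVIDER_MISMATCH come from the PR; the other enum members, the `IndexMetadata` and `CompatibilityResult` shapes, and the function signature are assumptions made for illustration.

```typescript
enum IncompatibilityCode {
  DIMENSION_MISMATCH = "DIMENSION_MISMATCH", // assumed member name
  MODEL_MISMATCH = "MODEL_MISMATCH",         // assumed member name
  PROVIDER_MISMATCH = "PROVIDER_MISMATCH",   // named in the review thread
}

// Assumed shape of what an index records about the embeddings it holds.
interface IndexMetadata {
  provider: string;
  model: string;
  dimensions: number;
}

interface CompatibilityResult {
  compatible: boolean;
  code?: IncompatibilityCode;
  warnings: { code: IncompatibilityCode; message: string }[];
}

function checkCompatibility(
  stored: IndexMetadata | null,
  current: IndexMetadata | null,
): CompatibilityResult {
  // Uninitialized state is an error rather than a silent "compatible".
  if (current === null) {
    throw new Error("checkCompatibility called before initialization");
  }
  // No stored metadata: nothing to conflict with (for example after clearIndex()).
  if (stored === null) {
    return { compatible: true, warnings: [] };
  }
  if (stored.dimensions !== current.dimensions) {
    return { compatible: false, code: IncompatibilityCode.DIMENSION_MISMATCH, warnings: [] };
  }
  if (stored.model !== current.model) {
    return { compatible: false, code: IncompatibilityCode.MODEL_MISMATCH, warnings: [] };
  }
  // Same model and dimensions but a different provider: warn, do not block.
  if (stored.provider !== current.provider) {
    return {
      compatible: true,
      warnings: [{
        code: IncompatibilityCode.PROVIDER_MISMATCH,
        message: `Index was built with ${stored.provider}, now using ${current.provider}`,
      }],
    };
  }
  return { compatible: true, warnings: [] };
}
```

Ordering the dimension and model checks first keeps them as the hard safety net, so the provider comparison is only reached when the embeddings are expected to be interchangeable.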

lucreiss · 2026-02-20T14:50:15Z

Thanks for the thorough review! I addressed all four points:

1. createEmbeddingProvider default branch — You're right about the [object Object] bug.
I went a step further and replaced the default with an exhaustive never check. Since
ConfiguredProviderInfo is a discriminated union over the four known providers, the default
was already unreachable — removing it and adding never turns a missing case into a
compile-time error rather than a runtime one.
2. Score label in codebase_search — Good catch, I missed this when extracting
formatSearchResults into utils. codebase_search now uses (score: 0.85) format again
(appropriate for fused hybrid results that blend cosine similarity + BM25), while
find_similar keeps the (similarity: 92.0%) percentage format (pure cosine similarity).
I added a ScoreFormat parameter to formatSearchResults and two new tests covering both
formats.
3. Provider mismatch — This was intentional, as a defensive measure: model names
like text-embedding-3-small are generic enough that a different provider could theoretically
reuse the name with different embeddings. That said, you're right that forcing a full rebuild
for github-copilot ↔ openai with matching model + dimensions is unnecessarily disruptive.
I changed it to return compatible: true with a logged warning instead. The dimension and
model checks remain as the real safety net.
4. Trailing newlines — Fixed in all three files.

All checks pass: build, typecheck, lint, 321/321 tests (2 new).
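
For readers unfamiliar with the pattern from point 1, an exhaustive never check over a discriminated union might look like the sketch below. The union members' extra fields, the fourth provider name ("ollama"), and the helper names `assertNever` and `providerLabel` are assumptions for illustration; the thread only states that ConfiguredProviderInfo covers four known providers.

```typescript
// Assumed variants: the thread names openai, github-copilot, and google;
// "ollama" and all non-discriminant fields are placeholders.
type ConfiguredProviderInfo =
  | { provider: "openai"; apiKey: string }
  | { provider: "github-copilot"; token: string }
  | { provider: "google"; apiKey: string }
  | { provider: "ollama"; baseUrl: string };

// Accepts only `never`, so it type-checks only where every variant
// has already been handled.
function assertNever(value: never): never {
  throw new Error(`Unsupported embedding provider: ${JSON.stringify(value)}`);
}

function providerLabel(info: ConfiguredProviderInfo): string {
  switch (info.provider) {
    case "openai":
      return "OpenAI";
    case "github-copilot":
      return "GitHub Copilot";
    case "google":
      return "Google";
    case "ollama":
      return "Ollama";
    default:
      // If a new variant is added to the union without a case above,
      // `info` is no longer `never` here and this line fails to compile.
      return assertNever(info);
  }
}
```

The runtime throw stays as a backstop for values that bypass the type system, such as unchecked JSON, while the common mistake of forgetting a case is caught by the compiler.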

Helweg

The never exhaustive check is a nice touch — strictly better than the runtime default. And the ScoreFormat parameter is a clean way to handle the two contexts.

LGTM, ship it.
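
The ScoreFormat parameter agreed on in this thread could be sketched roughly as follows. The two output formats are the ones quoted in the discussion; the `ResultSketch` shape, the `formatScore` helper, and the line layout are hypothetical, and the real formatSearchResults in src/tools/utils.ts may differ.

```typescript
type ScoreFormat = "score" | "similarity";

// Hypothetical, trimmed-down result shape for the sketch.
interface ResultSketch {
  filePath: string;
  score: number;
}

// "score": raw fused value, suited to hybrid (semantic + BM25) results.
// "similarity": percentage, suited to pure cosine similarity.
function formatScore(value: number, format: ScoreFormat): string {
  return format === "similarity"
    ? `(similarity: ${(value * 100).toFixed(1)}%)`
    : `(score: ${value.toFixed(2)})`;
}

function formatSearchResults(results: ResultSketch[], format: ScoreFormat): string {
  return results
    .map((r, i) => `${i + 1}. ${r.filePath} ${formatScore(r.score, format)}`)
    .join("\n");
}
```

Under this sketch, codebase_search would pass "score" and find_similar would pass "similarity", so the label always matches what the number actually measures.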

lucreiss added 3 commits February 20, 2026 02:42

Helweg reviewed Feb 20, 2026


Helweg approved these changes Feb 22, 2026


Helweg merged commit d73de49 into Helweg:main Feb 22, 2026
1 check passed

Helweg merged 4 commits into Helweg:main from lucreiss:refactor/type-safe-embedding-models-and-tool-utils
