refactor: type-safe embedding model system, reindexing safety fixes, and tool formatting extraction#2
Conversation
…nd extract tool formatting - Extract embedding model constants to src/config/constants.ts with typed EMBEDDING_MODELS registry - Derive EmbeddingProvider and model types from EMBEDDING_MODELS for compile-time safety - Split embed() into embedQuery() and embedDocument() for task-specific embeddings - Add gemini-embedding-001 model, replace deprecated text-embedding-004 with text-embedding-005 - Rename DetectedProvider to ConfiguredProviderInfo, split auto-detection into tryDetectProvider() - Fix reindexing safety: checkCompatibility() no longer silently returns compatible on uninitialized state; clearIndex() now deletes stale metadata so force-rebuild works correctly; index() blocks on incompatible index; search()/findSimilar() call ensureInitialized() before compatibility check; add provider mismatch detection - Extract tool formatting logic to src/tools/utils.ts for maintainability - Add IncompatibilityCode enum; export SearchResult, HealthCheckResult, StatusResult interfaces - Improve Google provider batching and outputDimensionality support
…l utils - Update config tests for nested EMBEDDING_MODELS structure and isValidModel validation - Add embeddingModel parsing tests covering provider/model cross-validation - Add DEFAULT_PROVIDER_MODELS consistency tests - Update cost and embeddings tests for new type structure - Update watcher test config with requireProjectMarker and debug fields - Add comprehensive tests for all extracted tool formatting functions
…requests The Google embedding provider was using the embedContent endpoint for both single and batch requests, sending multiple texts as parts in one content object. This caused embedContent to concatenate texts into a single embedding instead of producing one per text, and the response was parsed incorrectly (plural 'embeddings' vs singular 'embedding'). - embedQuery/embedDocument now use embedContent with correct singular response parsing (data.embedding.values) - embedBatch now uses batchEmbedContents with per-text requests array, each carrying its own model, content, taskType, and outputDimensionality
Helweg
left a comment
There was a problem hiding this comment.
Great restructuring — the data-driven model system and embedQuery/embedDocument split for Google task types are clean improvements.
A few things:
-
Bug in
createEmbeddingProviderdefault branch (src/embeddings/provider.ts): The error message interpolates the wholeconfiguredProviderInfoobject —throw new Error(\Unsupported embedding provider: ${configuredProviderInfo}`)— which will produce[object Object]. Should beconfiguredProviderInfo.provider`. -
Score label change in
codebase_search:formatSearchResultsnow labels all scores as(similarity: X%). Previouslycodebase_searchused(score: 0.85)while onlyfind_similarused the percentage format. For hybrid search results (semantic + BM25 fusion), "similarity" is misleading since the fused score is a relevance rank, not a cosine similarity. Consider keeping the old format forcodebase_search. -
Provider mismatch check: The new
PROVIDER_MISMATCHincompatibility will force a full rebuild when switching betweengithub-copilotandopenaiwith the same model (text-embedding-3-small), even though Copilot proxies to OpenAI and produces identical embeddings. The dimension and model checks already catch genuinely incompatible scenarios — is the provider-only block intentional, or should it be a warning? -
Missing trailing newlines:
src/config/constants.ts,src/config/index.ts, andsrc/tools/index.tsare all missing a final newline.
Otherwise LGTM.
…h, and exhaustiveness - Replace unreachable default branch in createEmbeddingProvider with exhaustive never check for compile-time safety - Restore raw score format (score: 0.85) for codebase_search hybrid results; keep similarity percentage only for find_similar - Downgrade PROVIDER_MISMATCH from hard incompatibility to a warning when model and dimensions match (avoids unnecessary rebuilds for providers that proxy to the same backend) - Add missing trailing newlines to constants.ts, config/index.ts, and tools/index.ts
|
Thanks for the thorough review! I addressed all four points: 1. All checks pass: build, typecheck, lint, 321/321 tests (2 new). |
Helweg
left a comment
There was a problem hiding this comment.
The never exhaustive check is a nice touch — strictly better than the runtime default. And the ScoreFormat parameter is a clean way to handle the two contexts.
LGTM, ship it.
Summary
Restructures the embedding model/provider type system to be data-driven and type-safe,
fixes several bugs in the reindexing and compatibility check flow, extracts tool formatting
logic into maintainable utilities, and adds support for Google's task-specific embeddings
and newer embedding models.
Changes
EMBEDDING_MODELSconstant as single source of truthfor all model metadata; derive
EmbeddingProvider,EmbeddingModelName, and relatedtypes from it at compile time
embed()intoembedQuery()/embedDocument()to support Google's task-type optimization (CODE_RETRIEVAL_QUERY vs RETRIEVAL_DOCUMENT);
add
gemini-embedding-001with Matryoshka truncation; replace deprecatedtext-embedding-004withtext-embedding-005; add Google batch request support(up to 20 texts per call)
checkCompatibility()no longer silently returnscompatible: trueon uninitialized state (throws instead);clearIndex()now deletesstale index metadata so force-rebuild works correctly with a new provider;
index()blocks before writing if the index is incompatible;
search()andfindSimilar()nowcall
ensureInitialized()before the compatibility check (previously ran on null state);findSimilar()gains a compatibility check it previously lacked; add provider mismatchdetection via new
IncompatibilityCodeenumsrc/tools/index.tsto
src/tools/utils.tsfor maintainability and organizationSearchResult,HealthCheckResult,StatusResultinterfaces from the indexer module
extracted formatting utilities
Testing
npm run build)npm run test:run)npm run lint)Related Issues
N/A