Commit e197bec
feat(embeddings): add support for multiple embedding providers
This change introduces support for multiple embedding model providers beyond the default OpenAI, allowing users to configure providers like Google Vertex AI, Google Gemini, AWS Bedrock, and Azure OpenAI via the `DOCS_MCP_EMBEDDING_MODEL` environment variable (format: `provider:model_name`).

**Key Changes:**

* **Embedding Factory (`EmbeddingFactory.ts`):**
  * Refactored to dynamically instantiate LangChain embedding classes based on the specified provider.
  * Added support for `vertex`, `gemini`, `aws`, and `microsoft` providers.
  * Includes checks for required environment variables per provider.
* **Dimension Handling (`FixedDimensionEmbeddings.ts`):**
  * Introduced a new `FixedDimensionEmbeddings` wrapper class.
  * This wrapper ensures all vectors match the database's fixed dimension (1536).
  * Pads vectors smaller than 1536 with zeros.
  * Truncates vectors larger than 1536 *only* if `allowTruncate` is true (currently enabled for the `gemini` provider, which supports MRL).
  * Throws a `DimensionError` if a non-truncatable model produces vectors > 1536.
* **Factory Integration:**
  * Updated `EmbeddingFactory.ts` to wrap the `gemini` provider's embeddings with `FixedDimensionEmbeddings(..., allowTruncate: true)`.
* **Configuration (`.env.example`, `Dockerfile`):**
  * Added necessary environment variables for all supported providers.
  * Updated examples and comments.
* **Testing:**
  * Added comprehensive tests for `FixedDimensionEmbeddings.ts`.
  * Updated tests for `EmbeddingFactory.ts` to cover new providers and the wrapper integration.
* **Documentation (`README.md`, `ARCHITECTURE.md`):**
  * Updated `README.md` to list supported providers, required environment variables, and simplified the vector dimension explanation.
  * Updated `ARCHITECTURE.md` with details on the embedding factory, the `FixedDimensionEmbeddings` wrapper, and the dimension handling logic (padding, MRL truncation, errors).
  * Removed examples using unsupported large-dimension models (e.g., `text-embedding-3-large`).

This enhancement provides greater flexibility in choosing embedding models while maintaining compatibility with the existing database schema.

Implements #28
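The padding/truncation behavior described in the commit message can be sketched as follows. This is an illustrative TypeScript sketch, not the actual `FixedDimensionEmbeddings` implementation: the `fitToDimension` helper and its signature are invented for clarity, while `DimensionError` and the pad/truncate/throw rules mirror the description above.

```typescript
// Hypothetical sketch of the dimension-handling rules; names other than
// DimensionError are invented and do not appear in the real source.
class DimensionError extends Error {
  constructor(model: string, actual: number, expected: number) {
    super(`Model ${model} produced a ${actual}-dimension vector; expected <= ${expected}`);
    this.name = "DimensionError";
  }
}

function fitToDimension(
  vector: number[],
  targetDim: number,
  allowTruncate: boolean,
  model = "unknown",
): number[] {
  if (vector.length === targetDim) return vector;
  if (vector.length < targetDim) {
    // Pad short vectors with zeros up to the fixed database dimension.
    return [...vector, ...new Array(targetDim - vector.length).fill(0)];
  }
  if (allowTruncate) {
    // MRL-capable models (e.g. Gemini) tolerate dropping trailing dimensions.
    return vector.slice(0, targetDim);
  }
  // Any other oversized vector is a configuration error.
  throw new DimensionError(model, vector.length, targetDim);
}
```

In the real code the target dimension is fixed at 1536 and `allowTruncate` is enabled only for the `gemini` provider.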
1 parent: 636978f

13 files changed: +5431 −2470 lines

.clinerules

Lines changed: 11 additions & 3 deletions
```diff
@@ -6,9 +6,8 @@
 - ALWAYS use the latest version of the programming language and libraries.
 - ALWAYS prefer the simplest solution.
 - When importing a relative path, avoid using file extensions like ".js" and ".ts".
-- ALWAYS add and update TSDoc for all classes, methods and functions. Focus on functionality and reasoning. Avoid documenting individual parameters or return values if their use can easily be derived from their name.
-- ALWAYS format Git commit messages as markdown.
-- ALWAYS adhere to the Conventional Commits specification for all Git commit messages
+- ALWAYS add and update TSDoc for all classes, methods and functions. Focus on functionality and reasoning.
+- NEVER document individual parameters or return values if their use can easily be derived from their name.

 ## Architecture Documentation Guidelines

@@ -19,3 +18,12 @@ Keep `ARCHITECTURE.md` high-level:
 - Use simple MermaidJS diagrams for visualization
 - Put implementation details in source code
 - Update when architecture changes
+
+## Git
+
+- The repository owner and name is `arabold/docs-mcp-server` on GitHub.
+- ALWAYS create new branches locally first before pushing them to the GitHub repository.
+- ALWAYS format Git commit messages as markdown.
+- ALWAYS adhere to the Conventional Commits specification for all Git commit messages
+- ALWAYS prefix branch names with the type of work being done, such as `feature/`, `bugfix/`, `chore/`, etc.
+- ALWAYS include the issue number in the branch name, such as `feature/1234-issue-name`.
```

.env.example

Lines changed: 36 additions & 11 deletions
```diff
@@ -1,18 +1,43 @@
-# OpenAI Configuration
-# Required: Your OpenAI API Key
-OPENAI_API_KEY=your-key-here
+# Embedding Model Configuration
+# Optional: Format is "provider:model_name" or just "model_name" for OpenAI (default)
+# Examples:
+# - openai:text-embedding-3-small (default if no provider specified)
+# - vertex:text-embedding-004 (Google Cloud Vertex AI)
+# - gemini:gemini-embedding-exp-03-07 (Google Generative AI)
+# - aws:amazon.titan-embed-text-v1
+# - microsoft:text-embedding-ada-002
+DOCS_MCP_EMBEDDING_MODEL=

-# Optional: Your OpenAI Organization ID (handled automatically by LangChain if set)
+# OpenAI Provider Configuration (Default)
+# Required for OpenAI provider or as fallback
+OPENAI_API_KEY=your-key-here
+# Optional: Your OpenAI Organization ID
 OPENAI_ORG_ID=
-
-# Optional: Custom base URL for OpenAI API (e.g., for Azure OpenAI or compatible APIs)
+# Optional: Custom base URL for OpenAI-compatible APIs (e.g., Ollama, Azure OpenAI)
 OPENAI_API_BASE=

-# Optional: Embedding model name (defaults to "text-embedding-3-small")
-# Must produce vectors with ≤1536 dimensions (smaller dimensions are padded with zeros)
-# Examples: text-embedding-3-small (1536), text-embedding-ada-002 (1536)
-# Note: text-embedding-3-large (3072) is not supported due to dimension limit
-DOCS_MCP_EMBEDDING_MODEL=
+# Google Cloud Vertex AI Configuration
+# Required for vertex provider: Path to service account JSON key file
+GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-key.json
+
+# Google Generative AI (Gemini) Configuration
+# Required for gemini provider: Google API key
+GOOGLE_API_KEY=your-google-api-key
+
+# AWS Bedrock Configuration
+# Required for aws provider
+AWS_ACCESS_KEY_ID=your-aws-key
+AWS_SECRET_ACCESS_KEY=your-aws-secret
+AWS_REGION=us-east-1
+# Optional: Use BEDROCK_AWS_REGION instead of AWS_REGION if needed
+# BEDROCK_AWS_REGION=us-east-1
+
+# Azure OpenAI Configuration
+# Required for microsoft provider
+AZURE_OPENAI_API_KEY=your-azure-key
+AZURE_OPENAI_API_INSTANCE_NAME=your-instance
+AZURE_OPENAI_API_DEPLOYMENT_NAME=your-deployment
+AZURE_OPENAI_API_VERSION=2024-02-01

 # Optional: Specify a custom directory to store the SQLite database file (documents.db).
 # If set, this path takes precedence over the default locations.
```
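The commit notes that the factory checks required environment variables per provider. A hypothetical sketch of such a check, using the variable names from this `.env.example` (the `REQUIRED_ENV` table and `missingEnvVars` helper are invented for illustration; the either/or AWS region requirement of `AWS_REGION` vs. `BEDROCK_AWS_REGION` is omitted for brevity):

```typescript
// Illustrative only: map each provider to the env vars it cannot run without.
const REQUIRED_ENV: Record<string, string[]> = {
  openai: ["OPENAI_API_KEY"],
  vertex: ["GOOGLE_APPLICATION_CREDENTIALS"],
  gemini: ["GOOGLE_API_KEY"],
  aws: ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
  microsoft: [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_INSTANCE_NAME",
    "AZURE_OPENAI_API_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_VERSION",
  ],
};

// Returns the names of required variables that are unset or empty.
function missingEnvVars(
  provider: string,
  env: Record<string, string | undefined>,
): string[] {
  return (REQUIRED_ENV[provider] ?? []).filter((name) => !env[name]);
}
```

A factory could call `missingEnvVars(provider, process.env)` and throw a descriptive error before instantiating the embeddings class.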

ARCHITECTURE.md

Lines changed: 95 additions & 0 deletions
````diff
@@ -153,6 +153,63 @@ graph TD

 The project uses SQLite for document storage, providing a lightweight and efficient database solution that requires no separate server setup.

+#### Embedding Generation
+
+Document embeddings are generated using a flexible provider system implemented in `src/store/embeddings/EmbeddingFactory.ts`. This factory supports multiple embedding providers through LangChain.js integrations:
+
+```mermaid
+graph TD
+    subgraph Input
+        EM[DOCS_MCP_EMBEDDING_MODEL]
+        DC[Document Content]
+    end
+
+    subgraph EmbeddingFactory
+        P[Parse provider:model]
+        PV[Provider Selection]
+        Config[Provider Configuration]
+        LangChain[LangChain Integration]
+    end
+
+    subgraph Providers
+        OpenAI[OpenAI Embeddings]
+        VertexAI[Google Vertex AI]
+        Bedrock[AWS Bedrock]
+        Azure[Azure OpenAI]
+    end
+
+    subgraph Output
+        Vec[1536d Vector]
+        Pad[Zero Padding if needed]
+    end
+
+    EM --> P
+    P --> PV
+    PV --> Config
+    Config --> LangChain
+    DC --> LangChain
+
+    LangChain --> |provider selection| OpenAI
+    LangChain --> |provider selection| VertexAI
+    LangChain --> |provider selection| Bedrock
+    LangChain --> |provider selection| Azure
+
+    OpenAI & VertexAI & Bedrock & Azure --> Vec
+    Vec --> |if dimension < 1536| Pad
+```
+
+The factory:
+
+- Parses the `DOCS_MCP_EMBEDDING_MODEL` environment variable to determine the provider and model
+- Configures the appropriate LangChain embeddings class based on provider-specific environment variables
+- Ensures consistent vector dimensions through the `FixedDimensionEmbeddings` wrapper:
+  - Models producing vectors < 1536 dimensions: Padded with zeros
+  - Models with MRL support (e.g., Gemini): Safely truncated to 1536 dimensions
+  - Other models producing vectors > 1536: Not supported, throws error
+- Maintains a fixed database dimension of 1536 for all embeddings for compatibility with `sqlite-vec`
+
+This design allows easy addition of new embedding providers while maintaining consistent vector dimensions in the database.
+
 **Database Location:** The application determines the database file (`documents.db`) location dynamically:

 1. It first checks for a `.store` directory in the current working directory (`process.cwd()`). If `.store/documents.db` exists, it uses this path. This prioritizes local development databases.
@@ -251,6 +308,44 @@ This hierarchy ensures:
 - Easy to add new tools
 - Simple to add new interfaces (e.g., REST API) using same tools

+## Testing Conventions
+
+This section outlines conventions and best practices for writing tests within this project.
+
+### Mocking with Vitest
+
+When mocking modules or functions using `vitest`, it's crucial to follow a specific order due to how `vi.mock` hoisting works. `vi.mock` calls are moved to the top of the file before any imports. This means you cannot define helper functions _before_ `vi.mock` and then use them _within_ the mock setup directly.
+
+To correctly mock dependencies, follow these steps:
+
+1. **Declare the Mock:** Call `vi.mock('./path/to/module-to-mock')` at the top of your test file, before any imports or other code.
+2. **Define Mock Implementations:** _After_ the `vi.mock` call, define any helper functions, variables, or mock implementations you'll need.
+3. **Import the Actual Module:** Import the specific functions or classes you intend to mock from the original module.
+4. **Apply the Mock:** Use the defined mock implementations to replace the behavior of the imported functions/classes. You might need to cast the imported item as a `Mock` type (`import { type Mock } from 'vitest'`).
+
+**Example Structure:**
+
+```typescript
+import { vi, type Mock } from "vitest";
+
+// 1. Declare the mock (hoisted to top)
+vi.mock("./dependency");
+
+// 2. Define mock function/variable *after* vi.mock
+const mockImplementation = vi.fn(() => "mocked result");
+
+// 3. Import the actual function/class *after* defining mocks
+import { functionToMock } from "./dependency";
+
+// 4. Apply the mock implementation
+(functionToMock as Mock).mockImplementation(mockImplementation);
+
+// ... rest of your test code using the mocked functionToMock ...
+// expect(functionToMock()).toBe('mocked result');
+```
+
+This structure ensures that mocks are set up correctly before the modules that depend on them are imported and used in your tests.
+
 ## Future Considerations

 When adding new functionality:
````
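The parsing step the factory section describes amounts to splitting `DOCS_MCP_EMBEDDING_MODEL` on the first colon, falling back to `openai` when no provider prefix is present. A minimal sketch under that assumption (the `parseEmbeddingModel` name is hypothetical; the real logic lives in `EmbeddingFactory.ts`):

```typescript
// Illustrative only: split "provider:model" into its parts, defaulting to
// the openai provider when the spec contains no colon.
function parseEmbeddingModel(spec: string): { provider: string; model: string } {
  const idx = spec.indexOf(":");
  if (idx === -1) {
    return { provider: "openai", model: spec };
  }
  // Only the first colon separates provider from model, so model names
  // containing colons would survive intact.
  return { provider: spec.slice(0, idx), model: spec.slice(idx + 1) };
}
```

For example, `vertex:text-embedding-004` selects the Vertex AI provider, while a bare `text-embedding-3-small` keeps the OpenAI default.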

Dockerfile

Lines changed: 26 additions & 5 deletions
```diff
@@ -30,13 +30,34 @@ RUN npm ci --omit=dev
 COPY --from=builder /app/dist ./dist
 RUN ln -s /app/dist/cli.js /app/docs-cli

-# Define the data directory environment variable and volume
-# Environment variables
+# Define environment variables with defaults
+# OpenAI (default provider)
+ENV OPENAI_API_BASE=""
+ENV OPENAI_ORG_ID=""
+
+# Google Cloud - Vertex AI
+ENV GOOGLE_APPLICATION_CREDENTIALS=""
+
+# Google Generative AI (Gemini)
+ENV GOOGLE_API_KEY=""
+
+# AWS Bedrock
+ENV AWS_ACCESS_KEY_ID=""
+ENV AWS_SECRET_ACCESS_KEY=""
+ENV AWS_REGION=""
+ENV BEDROCK_AWS_REGION=""
+
+# Azure OpenAI
+ENV AZURE_OPENAI_API_KEY=""
+ENV AZURE_OPENAI_API_INSTANCE_NAME=""
+ENV AZURE_OPENAI_API_DEPLOYMENT_NAME=""
+ENV AZURE_OPENAI_API_VERSION=""
+
+# Core configuration
 ENV DOCS_MCP_STORE_PATH=/data
-ENV OPENAI_API_BASE=
-ENV OPENAI_ORG_ID=
-ENV DOCS_MCP_EMBEDDING_MODEL=
+ENV DOCS_MCP_EMBEDDING_MODEL=""

+# Define volumes
 VOLUME /data

 # Set the command to run the application
```

README.md

Lines changed: 81 additions & 8 deletions
````diff
@@ -28,14 +28,43 @@ The server exposes MCP tools for:

 ## Configuration

-The following environment variables are supported to configure the OpenAI API and embedding behavior:
+The following environment variables are supported to configure the embedding model behavior:

-- `OPENAI_API_KEY`: **Required.** Your OpenAI API key for generating embeddings.
-- `OPENAI_ORG_ID`: **Optional.** Your OpenAI Organization ID (handled automatically by LangChain if set).
-- `OPENAI_API_BASE`: **Optional.** Custom base URL for OpenAI API (e.g., for Azure OpenAI or compatible APIs).
-- `DOCS_MCP_EMBEDDING_MODEL`: **Optional.** Embedding model name (defaults to "text-embedding-3-small"). Must produce vectors with ≤1536 dimensions. Smaller dimensions are automatically padded with zeros.
+### Embedding Model Configuration

-The database schema uses a fixed dimension of 1536 for embedding vectors. Models that produce larger vectors are not supported and will cause an error. Models with smaller vectors (e.g., older embedding models) are automatically padded with zeros to match the required dimension.
+- `DOCS_MCP_EMBEDDING_MODEL`: **Optional.** Format: `provider:model_name` or just `model_name` (defaults to `text-embedding-3-small`). Supported providers and their required environment variables:
+
+  - `openai` (default): Uses OpenAI's embedding models
+
+    - `OPENAI_API_KEY`: **Required.** Your OpenAI API key
+    - `OPENAI_ORG_ID`: **Optional.** Your OpenAI Organization ID
+    - `OPENAI_API_BASE`: **Optional.** Custom base URL for OpenAI-compatible APIs (e.g., Ollama, Azure OpenAI)
+
+  - `vertex`: Uses Google Cloud Vertex AI embeddings
+
+    - `GOOGLE_APPLICATION_CREDENTIALS`: **Required.** Path to service account JSON key file
+
+  - `gemini`: Uses Google Generative AI (Gemini) embeddings
+
+    - `GOOGLE_API_KEY`: **Required.** Your Google API key
+
+  - `aws`: Uses AWS Bedrock embeddings
+
+    - `AWS_ACCESS_KEY_ID`: **Required.** AWS access key
+    - `AWS_SECRET_ACCESS_KEY`: **Required.** AWS secret key
+    - `AWS_REGION` or `BEDROCK_AWS_REGION`: **Required.** AWS region for Bedrock
+
+  - `microsoft`: Uses Azure OpenAI embeddings
+    - `AZURE_OPENAI_API_KEY`: **Required.** Azure OpenAI API key
+    - `AZURE_OPENAI_API_INSTANCE_NAME`: **Required.** Azure instance name
+    - `AZURE_OPENAI_API_DEPLOYMENT_NAME`: **Required.** Azure deployment name
+    - `AZURE_OPENAI_API_VERSION`: **Required.** Azure API version
+
+### Vector Dimensions
+
+The database schema uses a fixed dimension of 1536 for embedding vectors. Only models that produce vectors with dimension ≤ 1536 are supported, except for certain providers (like Gemini) that support dimension reduction.
+
+For OpenAI-compatible APIs (like Ollama), use the `openai` provider with `OPENAI_API_BASE` pointing to your endpoint.

 These variables can be set regardless of how you run the server (Docker, npx, or from source).

@@ -92,10 +121,54 @@ This is the recommended approach for most users. It's easy, straightforward, and
 Any of the configuration environment variables (see [Configuration](#configuration) above) can be passed to the container using the `-e` flag. For example:

 ```bash
+# Example 1: Using OpenAI embeddings (default)
+docker run -i --rm \
+  -e OPENAI_API_KEY="your-key-here" \
+  -e DOCS_MCP_EMBEDDING_MODEL="text-embedding-3-small" \
+  -v docs-mcp-data:/data \
+  ghcr.io/arabold/docs-mcp-server:latest
+
+# Example 2: Using OpenAI-compatible API (like Ollama)
 docker run -i --rm \
   -e OPENAI_API_KEY="your-key-here" \
-  -e DOCS_MCP_EMBEDDING_MODEL="text-embedding-3-large" \
-  -e OPENAI_API_BASE="http://your-api-endpoint" \
+  -e OPENAI_API_BASE="http://localhost:11434/v1" \
+  -e DOCS_MCP_EMBEDDING_MODEL="embeddings" \
+  -v docs-mcp-data:/data \
+  ghcr.io/arabold/docs-mcp-server:latest
+
+# Example 3a: Using Google Cloud Vertex AI embeddings
+docker run -i --rm \
+  -e OPENAI_API_KEY="your-openai-key" \ # Keep for fallback to OpenAI
+  -e DOCS_MCP_EMBEDDING_MODEL="vertex:text-embedding-004" \
+  -e GOOGLE_APPLICATION_CREDENTIALS="/app/gcp-key.json" \
+  -v docs-mcp-data:/data \
+  -v /path/to/gcp-key.json:/app/gcp-key.json:ro \
+  ghcr.io/arabold/docs-mcp-server:latest
+
+# Example 3b: Using Google Generative AI (Gemini) embeddings
+docker run -i --rm \
+  -e OPENAI_API_KEY="your-openai-key" \ # Keep for fallback to OpenAI
+  -e DOCS_MCP_EMBEDDING_MODEL="gemini:embedding-001" \
+  -e GOOGLE_API_KEY="your-google-api-key" \
+  -v docs-mcp-data:/data \
+  ghcr.io/arabold/docs-mcp-server:latest
+
+# Example 4: Using AWS Bedrock embeddings
+docker run -i --rm \
+  -e AWS_ACCESS_KEY_ID="your-aws-key" \
+  -e AWS_SECRET_ACCESS_KEY="your-aws-secret" \
+  -e AWS_REGION="us-east-1" \
+  -e DOCS_MCP_EMBEDDING_MODEL="aws:amazon.titan-embed-text-v1" \
+  -v docs-mcp-data:/data \
+  ghcr.io/arabold/docs-mcp-server:latest
+
+# Example 5: Using Azure OpenAI embeddings
+docker run -i --rm \
+  -e AZURE_OPENAI_API_KEY="your-azure-key" \
+  -e AZURE_OPENAI_API_INSTANCE_NAME="your-instance" \
+  -e AZURE_OPENAI_API_DEPLOYMENT_NAME="your-deployment" \
+  -e AZURE_OPENAI_API_VERSION="2024-02-01" \
+  -e DOCS_MCP_EMBEDDING_MODEL="microsoft:text-embedding-ada-002" \
   -v docs-mcp-data:/data \
   ghcr.io/arabold/docs-mcp-server:latest
 ```
````
