
Commit 6229f97

feat: Refactor scraper and introduce document processing pipeline
- **Scraper Refactoring:**
  - Replaced `DocumentationScraperDispatcher` with `ScraperService` and `ScraperRegistry`.
  - Created individual scraper strategies for GitHub, npm, PyPI, and a default strategy, all using `HtmlScraper`.
  - Moved scraper strategies to `src/scraper/strategies/`.
  - Deleted old scraper and strategy files.
  - Updated tests for new scraper architecture.
- **Document Processing Pipeline:**
  - Introduced `DocumentProcessingPipeline` for handling scraping, embedding, and storing documents.
  - Added tests for the pipeline.
  - Centralized progress reporting and error handling in the pipeline.
- **Vector Store Updates:**
  - Renamed `src/store/index.ts` to `VectorStoreManager.ts`.
  - Updated `VectorStoreManager` to handle the new pipeline and store structure.
  - Added tests for `VectorStoreManager`.
- **Tooling and Type Changes:**
  - Updated `scrape` and `search` tools to use the new pipeline and `VectorStoreManager`.
  - Refactored types in `src/types/index.ts` to support the new architecture.
  - Removed old test files related to types.
- **Architecture Updates:**
  - Updated `ARCHITECTURE.md` to reflect the new scraper architecture, pipeline, and conventions.
1 parent ba8a6f1 commit 6229f97

32 files changed: +2498 −1560 lines

ARCHITECTURE.md

Lines changed: 119 additions & 54 deletions
````diff
@@ -4,9 +4,19 @@
 
 The Documentation MCP Server is designed with a modular architecture that ensures feature parity and code reuse between its two main interfaces:
 
+### File Naming and Code Quality Conventions
+
+- Files containing classes use PascalCase (e.g., `DocumentProcessingPipeline.ts`, `VectorStoreManager.ts`)
+- Other files use kebab-case or regular camelCase (e.g., `index.ts`, `scraper-service.ts`)
+- Avoid type casting where possible. Never use the `any` type; prefer `unknown` or `never`.
+
 1. Command Line Interface (CLI)
 2. Model Context Protocol (MCP) Server
 
+### Testing
+
+- We use `vitest` for testing.
+
 ## Core Design Principles
 
 ### 1. Shared Tooling
````
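The `unknown`-over-`any` convention added in this hunk can be illustrated with a small sketch (a hypothetical error formatter, not code from this repository):

```typescript
// Hypothetical illustration of the `unknown`-over-`any` convention:
// `unknown` forces a type check before the value can be used.
function formatError(error: unknown): string {
  if (error instanceof Error) {
    return error.message; // narrowed to Error, safe to access .message
  }
  return String(error); // fallback for non-Error throwables
}

formatError(new Error("boom")); // "boom"
formatError(42); // "42"
```

With `any` the compiler would allow `error.message` unconditionally; `unknown` makes the narrowing explicit.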
````diff
@@ -92,10 +102,12 @@ server.tool(
 
 ### 4. Progress Reporting
 
-Tools that involve long-running operations support progress reporting through callback functions. This allows both interfaces to provide appropriate feedback:
+The project uses a unified progress reporting system with typed callbacks for all long-running operations. This design:
 
-- CLI: Console output with progress information
-- MCP: Structured progress updates through the MCP protocol
+- Provides real-time feedback at multiple levels (page, document, storage)
+- Ensures consistent progress tracking across components
+- Supports different output formats for CLI and MCP interfaces
+- Enables parallel processing with individual progress tracking
 
 ### 5. Logging Strategy
 
````
````diff
@@ -169,85 +181,138 @@ src/
 └── url.ts             # URL normalization utilities
 ```
 
-## Scraper Strategy Pattern
+## Scraper Architecture
 
-The documentation scraper uses the Strategy pattern to handle different types of documentation sources:
+The scraper module is responsible for extracting content from various documentation sources. It employs a strategy pattern to handle different website structures and content formats.
 
 ```mermaid
 graph TD
-    A[DocumentationScraperDispatcher] --> B{Determine Strategy}
-    B -->|npmjs.org/com| C[NpmScraperStrategy]
-    B -->|pypi.org| D[PyPiScraperStrategy]
-    B -->|other domains| E[DefaultScraperStrategy]
-    C -->|extends| E
-    D -->|extends| E
+    A[ScraperService] --> B[ScraperRegistry]
+    B --> C{Select Strategy}
+    C -->|github.com| D[GitHubScraperStrategy]
+    C -->|npmjs.org| E[NpmScraperStrategy]
+    C -->|pypi.org| F[PyPiScraperStrategy]
+    C -->|other domains| G[DefaultScraperStrategy]
+    D & E & F & G --> H[HtmlScraper]
 ```
 
 ### Scraper Components
 
-1. **DocumentationScraperDispatcher**
+1. **ScraperService**
+
+   - The main entry point for scraping operations.
+   - Receives a URL and delegates to the ScraperRegistry to select the appropriate scraping strategy.
+   - Handles overall scraping process and error management.
+
+2. **ScraperRegistry**
+
+   - Responsible for selecting the appropriate scraping strategy based on the URL.
+   - Maintains a list of available strategies and their associated domains.
+   - Returns a default strategy if no specific strategy is found for the given URL.
 
-   - Entry point for scraping operations
-   - Analyzes the target URL to determine appropriate strategy
-   - Instantiates and delegates to specific scraper strategies
+3. **ScraperStrategy Interface (Implicit)**
 
-2. **ScraperStrategy Interface**
+   - Defines the contract for all scraper strategies.
+   - Each strategy must implement a `scrape` method that takes a URL and returns the scraped content.
 
-   - Defines contract for all scraper implementations
-   - Ensures consistent scraping behavior across strategies
+4. **HtmlScraper**
 
-3. **DefaultScraperStrategy**
+   - A general-purpose HTML scraper that uses `scrape-it` to extract content from web pages.
+   - Converts HTML content to Markdown using `turndown`.
+   - Implements a retry mechanism with exponential backoff to handle temporary network issues.
+   - Allows customization of content and link selectors.
 
-   - Base implementation for web scraping
-   - Handles generic documentation sites
-   - Can be extended by specific strategies
+5. **Specialized Strategies**
 
-4. **Specialized Strategies**
-   - **NpmScraperStrategy**: Optimized for npm package documentation
-     - Uses removeQuery URL normalization
-     - Extends DefaultScraperStrategy
-   - **PyPiScraperStrategy**: Handles Python Package Index docs
-     - Uses removeQuery URL normalization
-     - Extends DefaultScraperStrategy
+   - **DefaultScraperStrategy**: A base strategy that uses HtmlScraper to scrape generic web pages.
+   - **NpmScraperStrategy**: A strategy for scraping npm package documentation.
+   - **PyPiScraperStrategy**: A strategy for scraping Python Package Index documentation.
+   - **GitHubScraperStrategy**: A strategy for scraping GitHub repository documentation.
 
````
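The registry-based strategy selection introduced in this hunk can be sketched roughly as follows (simplified shapes; the actual interfaces live in `src/scraper/`, and these signatures are assumptions):

```typescript
// Simplified sketch of registry-based strategy selection. Interface and
// class shapes are assumptions, not the actual source.
interface ScraperStrategy {
  scrape(url: string): Promise<string>;
}

class DefaultScraperStrategy implements ScraperStrategy {
  async scrape(url: string): Promise<string> {
    return `default content for ${url}`;
  }
}

class GitHubScraperStrategy implements ScraperStrategy {
  async scrape(url: string): Promise<string> {
    return `github content for ${url}`;
  }
}

class ScraperRegistry {
  // Maps a hostname to its specialized strategy.
  private strategies = new Map<string, ScraperStrategy>([
    ["github.com", new GitHubScraperStrategy()],
  ]);
  private fallback = new DefaultScraperStrategy();

  // Returns the strategy registered for the URL's host, or the default.
  getStrategy(url: string): ScraperStrategy {
    const host = new URL(url).hostname;
    return this.strategies.get(host) ?? this.fallback;
  }
}

const registry = new ScraperRegistry();
registry.getStrategy("https://github.com/octocat/Hello-World"); // GitHubScraperStrategy
registry.getStrategy("https://example.com/docs"); // DefaultScraperStrategy
```

Adding a new source is then a matter of registering one more strategy, without touching existing ones.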

````diff
 ### Benefits of Strategy Pattern
 
 1. **Flexibility**
 
-   - Easy to add new strategies for different documentation sources
-   - Each strategy can implement custom scraping logic
-   - Common functionality shared through DefaultScraperStrategy
+   - New strategies can be easily added to support different documentation sources.
+   - Each strategy can be customized to handle the specific structure and content of its target website.
 
 2. **Maintainability**
 
-   - Clear separation of concerns
-   - Each strategy isolated and focused
-   - Easy to modify specific scraping behavior
+   - The scraper logic is well-organized and easy to understand.
+   - Changes to one strategy do not affect other strategies.
 
 3. **Extensibility**
-   - New strategies can be added without modifying existing code
-   - Future support for sites like Mintlify, ReadMe.com planned
 
-### Error Handling & Retry Logic
+   - The scraper can be extended to support new documentation sources without modifying existing code.
+
````
````diff
+## Vector Store Architecture
+
+The vector store module manages document storage, retrieval, and search operations using a store-centric design that ensures clear lifecycle management and operation boundaries.
+
+```mermaid
+graph TD
+    A[DocumentProcessingPipeline] --> B[VectorStoreManager]
+    B --> C{Store Operations}
+    C -->|Create| D[createStore]
+    C -->|Load| E[loadStore]
+    C -->|Delete| F[deleteStore]
+    D & E --> G[MemoryVectorStore]
+    G --> H{Document Operations}
+    H -->|Add| I[addDocument]
+    H -->|Search| J[searchStore]
+```
+
+### Vector Store Components
+
+1. **VectorStoreManager**
+
+   - Manages store lifecycle (create, load, delete)
+   - Handles store persistence and retrieval
+   - Provides store-centric document operations
+
+2. **Store Operations**
+
+   - Clear separation between store management and document operations
+   - Explicit store lifecycle (create/load/delete)
+   - Store must exist before document operations
+
+3. **Document Operations**
+
+   - Add and search operations require existing store
+   - Ensures consistent document processing
+   - Maintains store integrity
+
+### Benefits of Store-Centric Design
+
+1. **Predictability**
+
+   - Clear store lifecycle
+   - Explicit store dependencies
+   - Consistent store state
+
+2. **Performance**
+
+   - No redundant store creation
+   - Efficient document batching
+   - Optimized search operations
+
````
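The store-before-document-operations invariant described above can be sketched minimally (method names follow the diagram, but the implementation is invented for illustration — the real class persists embeddings, not strings):

```typescript
// Sketch of the store-centric invariant: document operations require an
// existing store. The in-memory string store here is purely illustrative.
class VectorStoreManager {
  private stores = new Map<string, string[]>();

  createStore(name: string): void {
    if (!this.stores.has(name)) this.stores.set(name, []);
  }

  deleteStore(name: string): void {
    this.stores.delete(name);
  }

  addDocument(name: string, doc: string): void {
    const store = this.stores.get(name);
    if (!store) throw new Error(`Store "${name}" does not exist`);
    store.push(doc);
  }

  searchStore(name: string, query: string): string[] {
    const store = this.stores.get(name);
    if (!store) throw new Error(`Store "${name}" does not exist`);
    return store.filter((doc) => doc.includes(query));
  }
}

const manager = new VectorStoreManager();
manager.createStore("react@18.0.0");
manager.addDocument("react@18.0.0", "useState lets you add state to components");
manager.searchStore("react@18.0.0", "useState"); // one match
// manager.addDocument("vue@3", "…") would throw: store was never created
```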
````diff
+### Error Handling & Retry Mechanism
+
+The `HtmlScraper` implements a robust retry mechanism to handle temporary network issues and improve scraping reliability.
 
-The scraper implements a robust error handling system with clear distinction between different types of failures:
+1. **Retry Logic**
 
-1. **Error Classification**
+   - The `scrapePageWithRetry` method attempts to scrape a page multiple times if the initial attempt fails.
+   - It uses exponential backoff to increase the delay between retries, reducing the load on the target server.
+   - The maximum number of retries and the base delay are configurable.
 
-   - **InvalidUrlError**: Validation errors for malformed URLs
-   - **ScraperError**: Base error class for scraping operations
-     - `isRetryable`: Flag indicating if error can be retried
-     - `cause`: Original error that caused the failure
-     - `statusCode`: HTTP status code if applicable
+2. **Error Classification**
 
-2. **Retry Strategy**
+   - The scraper distinguishes between different types of errors to determine whether a retry is appropriate.
+   - It retries only on 4xx errors (e.g., rate limiting), which may succeed on a later attempt.
+   - It does not retry on other errors, such as 5xx errors, which indicate server-side problems; these fail immediately.
 
-   - Only retry on 4xx HTTP errors
-   - Non-4xx errors fail immediately
-   - Exponential backoff between retry attempts
-   - Clear separation between scraping and retry logic
+3. **Customizable Options**
 
-3. **Implementation Pattern**
-   - `scrapePageContent`: Core scraping logic
-   - `scrapePageContentWithRetry`: Retry mechanism wrapper
-   - Clean separation of concerns for better maintainability
+   - The retry mechanism can be customized by passing a `RetryOptions` object to the `scrapePageWithRetry` method.
+   - The `RetryOptions` object allows you to configure the maximum number of retries and the base delay.
````
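The retry-with-exponential-backoff behavior described above might look roughly like this (a sketch; the `RetryOptions` fields and the retryable-error check are assumptions, not the actual `HtmlScraper` code):

```typescript
// Illustrative retry-with-exponential-backoff sketch. RetryOptions shape and
// the retryable check are assumptions, not the actual source.
interface RetryOptions {
  maxRetries: number;
  baseDelayMs: number;
}

class ScraperError extends Error {
  constructor(message: string, public readonly isRetryable: boolean) {
    super(message);
  }
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  { maxRetries, baseDelayMs }: RetryOptions
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error: unknown) {
      // Only retry errors flagged as retryable, up to maxRetries attempts.
      const retryable = error instanceof ScraperError && error.isRetryable;
      if (!retryable || attempt >= maxRetries) throw error;
      await sleep(baseDelayMs * 2 ** attempt); // 1x, 2x, 4x, … the base delay
    }
  }
}
```

Non-retryable errors propagate immediately; retryable ones are re-attempted with doubling delays until the retry budget is exhausted.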

README.md

Lines changed: 15 additions & 0 deletions
````diff
@@ -11,33 +11,48 @@ This is a TypeScript-based MCP server that implements a simple notes system. It
 ## Features
 
 ### Resources
+
 - List and access notes via `note://` URIs
 - Each note has a title, content and metadata
 - Plain text mime type for simple content access
 
 ### Tools
+
 - `create_note` - Create new text notes
   - Takes title and content as required parameters
   - Stores note in server state
 
 ### Prompts
+
 - `summarize_notes` - Generate a summary of all stored notes
   - Includes all note contents as embedded resources
   - Returns structured prompt for LLM summarization
 
+### Version Handling
+
+This server supports partial version matching, selecting the best available version based on these rules:
+
+- If no version is specified, the latest indexed version is used.
+- If a full version (e.g., `1.2.3`) is specified, that exact version is used, if available.
+- If a partial version (e.g., `1.2`) is specified, the latest matching version (e.g., `1.2.5`) is used.
+- If the specified version (full or partial) is not found, the server will attempt to find the closest preceding version.
+
````
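The version-matching rules added in this hunk can be sketched as a small resolver (simplified; not the server's actual implementation):

```typescript
// Simplified sketch of the partial-version matching rules described above;
// not the server's actual resolver.
function findBestVersion(available: string[], requested?: string): string | undefined {
  const sorted = [...available].sort(compareVersions);
  if (!requested) return sorted[sorted.length - 1]; // no version → latest
  if (sorted.includes(requested)) return requested; // exact match
  // Partial match: latest version starting with "<requested>."
  const partial = sorted.filter((v) => v.startsWith(`${requested}.`));
  if (partial.length > 0) return partial[partial.length - 1];
  // Fall back to the closest preceding version.
  const preceding = sorted.filter((v) => compareVersions(v, requested) < 0);
  return preceding[preceding.length - 1];
}

function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if ((pa[i] ?? 0) !== (pb[i] ?? 0)) return (pa[i] ?? 0) - (pb[i] ?? 0);
  }
  return 0;
}

const versions = ["1.1.0", "1.2.0", "1.2.5", "2.0.0"];
findBestVersion(versions); // "2.0.0" (latest)
findBestVersion(versions, "1.2"); // "1.2.5" (latest 1.2.x)
findBestVersion(versions, "1.3.0"); // "1.2.5" (closest preceding)
```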
````diff
 ## Development
 
 Install dependencies:
+
 ```bash
 npm install
 ```
 
 Build the server:
+
 ```bash
 npm run build
 ```
 
 For development with auto-rebuild:
+
 ```bash
 npm run watch
 ```
````

src/cli.ts

Lines changed: 19 additions & 28 deletions
````diff
@@ -1,12 +1,9 @@
 #!/usr/bin/env node
 import "dotenv/config";
 import { Command } from "commander";
-import fs from "node:fs/promises";
 import path from "node:path";
 import os from "node:os";
-import { VectorStoreManager } from "./store/index.js";
-import type { VectorStoreProgress } from "./types/index.js";
-import type { DocContent } from "./types/index.js";
+import { VectorStoreManager } from "./store/VectorStoreManager.js";
 import { findVersion, listLibraries } from "./tools/library.js";
 import { search } from "./tools/search.js";
 import { scrape } from "./tools/scrape.js";
@@ -19,13 +16,7 @@ const program = new Command();
 const baseDir = path.join(os.homedir(), ".docs-mcp", "data");
 
 // Initialize the store manager
-const store = new VectorStoreManager(baseDir, {
-  onProgress: (progress: VectorStoreProgress) => {
-    console.log(
-      `Processing document ${progress.documentsProcessed}/${progress.totalDocuments}: "${progress.currentDocument.title}" (${progress.currentDocument.numChunks} chunks)`
-    );
-  },
-});
+const store = new VectorStoreManager(baseDir);
 
 program
   .name("docs-mcp")
@@ -44,25 +35,25 @@ program
   )
   .action(async (library, version, url, options) => {
     try {
-      const result = await scrape({
-        url,
-        library,
-        version,
-        maxPages: Number.parseInt(options.maxPages),
-        maxDepth: Number.parseInt(options.maxDepth),
-        subpagesOnly: options.subpagesOnly,
-        store,
-        onProgress: (progress) => {
-          console.log(
-            `Scraping page ${progress.pagesScraped}/${options.maxPages} (depth ${progress.depth}/${options.maxDepth}): ${progress.currentUrl}`
-          );
-          return undefined;
+      const result = await scrape(
+        {
+          url,
+          library,
+          version,
+          options: {
+            maxPages: Number.parseInt(options.maxPages),
+            maxDepth: Number.parseInt(options.maxDepth),
+          },
         },
-      });
-
-      console.log(
-        `Successfully scraped ${result.pagesScraped} pages and indexed ${result.documentsIndexed} documents`
+        (progress) => {
+          // Log progress messages to console
+          for (const content of progress.content) {
+            console.log(content.text);
+          }
+        }
       );
+
+      console.log(`✅ Successfully scraped ${result.pagesScraped} pages`);
     } catch (error: unknown) {
       console.error(
         "Error:",
````