The Documentation MCP Server is designed with a modular architecture that ensures feature parity and code reuse between its two main interfaces:

1. Command Line Interface (CLI)
2. Model Context Protocol (MCP) Server

### File Naming and Code Quality Conventions

- Files containing classes use PascalCase (e.g., `DocumentProcessingPipeline.ts`, `VectorStoreManager.ts`)
- Other files use kebab-case or camelCase (e.g., `index.ts`, `scraper-service.ts`)
- Avoid type casting where possible. Never use the `any` type; prefer `unknown` or `never` instead.
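
As a sketch of the `unknown`-over-`any` convention, consider a hypothetical helper (the `getPort` function and its config shape are illustrative, not part of the codebase):

```typescript
// Hypothetical helper showing the convention: accept `unknown` and
// narrow it explicitly instead of using `any` or a type cast.
function getPort(raw: unknown): number {
  if (typeof raw === "object" && raw !== null && "port" in raw) {
    // `in` narrowing (TS 4.9+) makes `raw.port` accessible as `unknown`.
    if (typeof raw.port === "number") {
      return raw.port;
    }
  }
  return 3000; // fall back to a default port
}

console.log(getPort({ port: 8080 })); // 8080
console.log(getPort("not a config")); // 3000
```

The compiler forces every check here; with `any`, both calls would type-check even if the narrowing logic were missing.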

### Testing

- We use `vitest` for testing.

## Core Design Principles

### 1. Shared Tooling

### 4. Progress Reporting

The project uses a unified progress reporting system with typed callbacks for all long-running operations. This design:

- Provides real-time feedback at multiple levels (page, document, storage)
- Ensures consistent progress tracking across components
- Supports different output formats for CLI and MCP interfaces
- Enables parallel processing with individual progress tracking
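
A minimal sketch of what such a typed callback contract could look like (the names `ProgressEvent`, `ProgressCallback`, and `processPages` are illustrative assumptions, not the project's actual API):

```typescript
// Illustrative sketch of a typed progress callback (assumed names).
type ProgressEvent = {
  level: "page" | "document" | "storage"; // reporting level
  completed: number;
  total: number;
};

type ProgressCallback = (event: ProgressEvent) => void;

// A long-running operation accepts the callback and reports as it goes.
function processPages(urls: string[], onProgress: ProgressCallback): void {
  urls.forEach((_, i) => {
    onProgress({ level: "page", completed: i + 1, total: urls.length });
  });
}

// CLI-style consumer: plain console output. An MCP consumer could instead
// translate the same events into structured protocol messages.
const seen: ProgressEvent[] = [];
processPages(["a", "b", "c"], (e) => {
  seen.push(e);
  console.log(`[${e.level}] ${e.completed}/${e.total}`);
});
```

Because the callback type is shared, the CLI and MCP interfaces differ only in the consumer they plug in.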

### 5. Logging Strategy

└── url.ts # URL normalization utilities
```

## Scraper Architecture

The scraper module is responsible for extracting content from various documentation sources. It employs a strategy pattern to handle different website structures and content formats.

```mermaid
graph TD
    A[ScraperService] --> B[ScraperRegistry]
    B --> C{Select Strategy}
    C -->|github.com| D[GitHubScraperStrategy]
    C -->|npmjs.org| E[NpmScraperStrategy]
    C -->|pypi.org| F[PyPiScraperStrategy]
    C -->|other domains| G[DefaultScraperStrategy]
    D & E & F & G --> H[HtmlScraper]
```

### Scraper Components

1. **ScraperService**

   - The main entry point for scraping operations.
   - Receives a URL and delegates to the ScraperRegistry to select the appropriate scraping strategy.
   - Handles the overall scraping process and error management.

2. **ScraperRegistry**

   - Selects the appropriate scraping strategy based on the URL.
   - Maintains a list of available strategies and their associated domains.
   - Returns a default strategy if no specific strategy matches the given URL.

3. **ScraperStrategy Interface (Implicit)**

   - Defines the contract for all scraper strategies.
   - Each strategy must implement a `scrape` method that takes a URL and returns the scraped content.

4. **HtmlScraper**

   - A general-purpose HTML scraper that uses `scrape-it` to extract content from web pages.
   - Converts HTML content to Markdown using `turndown`.
   - Implements a retry mechanism with exponential backoff to handle temporary network issues.
   - Allows customization of content and link selectors.

5. **Specialized Strategies**

   - **DefaultScraperStrategy**: A base strategy that uses HtmlScraper to scrape generic web pages.
   - **NpmScraperStrategy**: A strategy for scraping npm package documentation.
   - **PyPiScraperStrategy**: A strategy for scraping Python Package Index documentation.
   - **GitHubScraperStrategy**: A strategy for scraping GitHub repository documentation.
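
The registry-and-strategy relationship described above can be sketched as follows (the class shapes, the `canHandle` method, and the simplified `scrape` bodies are assumptions based on this description, not the actual source):

```typescript
// Simplified sketch of the registry/strategy relationship (assumed shapes).
interface ScraperStrategy {
  canHandle(url: string): boolean;
  scrape(url: string): Promise<string>; // returns scraped content
}

class NpmScraperStrategy implements ScraperStrategy {
  canHandle(url: string): boolean {
    return new URL(url).hostname.endsWith("npmjs.org");
  }
  async scrape(url: string): Promise<string> {
    return `npm docs from ${url}`; // placeholder for real scraping logic
  }
}

class DefaultScraperStrategy implements ScraperStrategy {
  canHandle(): boolean {
    return true; // fallback for any domain
  }
  async scrape(url: string): Promise<string> {
    return `generic docs from ${url}`;
  }
}

class ScraperRegistry {
  // Order matters: specific strategies first, the default last.
  private strategies: ScraperStrategy[] = [
    new NpmScraperStrategy(),
    new DefaultScraperStrategy(),
  ];

  getStrategy(url: string): ScraperStrategy {
    const strategy = this.strategies.find((s) => s.canHandle(url));
    if (!strategy) throw new Error(`no strategy for ${url}`);
    return strategy;
  }
}

const registry = new ScraperRegistry();
const strategy = registry.getStrategy("https://www.npmjs.org/package/foo");
```

Adding support for a new documentation source then means writing one class and registering it, with no changes to the registry's selection logic.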

### Benefits of Strategy Pattern

1. **Flexibility**

   - New strategies can be easily added to support different documentation sources.
   - Each strategy can be customized to handle the specific structure and content of its target website.

2. **Maintainability**

   - The scraper logic is well-organized and easy to understand.
   - Changes to one strategy do not affect other strategies.

3. **Extensibility**

   - The scraper can be extended to support new documentation sources without modifying existing code.

## Vector Store Architecture

The vector store module manages document storage, retrieval, and search operations using a store-centric design that ensures clear lifecycle management and operation boundaries.

```mermaid
graph TD
    A[DocumentProcessingPipeline] --> B[VectorStoreManager]
    B --> C{Store Operations}
    C -->|Create| D[createStore]
    C -->|Load| E[loadStore]
    C -->|Delete| F[deleteStore]
    D & E --> G[MemoryVectorStore]
    G --> H{Document Operations}
    H -->|Add| I[addDocument]
    H -->|Search| J[searchStore]
```

### Vector Store Components

1. **VectorStoreManager**

   - Manages store lifecycle (create, load, delete)
   - Handles store persistence and retrieval
   - Provides store-centric document operations

2. **Store Operations**

   - Clear separation between store management and document operations
   - Explicit store lifecycle (create/load/delete)
   - A store must exist before document operations can run

3. **Document Operations**

   - Add and search operations require an existing store
   - Ensures consistent document processing
   - Maintains store integrity
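
The store-centric contract can be sketched roughly like this (a toy in-memory model with assumed method names and a substring match standing in for vector search; the real `VectorStoreManager` API may differ):

```typescript
// Toy sketch of the store-centric contract (assumed names, no real embeddings).
type Doc = { id: string; text: string };

class VectorStoreManager {
  private stores = new Map<string, Doc[]>();

  createStore(name: string): void {
    if (this.stores.has(name)) throw new Error(`store ${name} already exists`);
    this.stores.set(name, []);
  }

  deleteStore(name: string): void {
    this.stores.delete(name);
  }

  // Document operations require an existing store.
  addDocument(store: string, doc: Doc): void {
    const docs = this.stores.get(store);
    if (!docs) throw new Error(`store ${store} does not exist`);
    docs.push(doc);
  }

  searchStore(store: string, query: string): Doc[] {
    const docs = this.stores.get(store);
    if (!docs) throw new Error(`store ${store} does not exist`);
    // Naive substring match stands in for vector similarity search.
    return docs.filter((d) => d.text.includes(query));
  }
}

const manager = new VectorStoreManager();
manager.createStore("react-docs");
manager.addDocument("react-docs", { id: "1", text: "hooks let you use state" });
const hits = manager.searchStore("react-docs", "hooks");
```

Note how every document operation checks for the store first: the explicit lifecycle makes missing-store bugs fail loudly instead of silently creating stores on demand.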

### Benefits of Store-Centric Design

1. **Predictability**

   - Clear store lifecycle
   - Explicit store dependencies
   - Consistent store state

2. **Performance**

   - No redundant store creation
   - Efficient document batching
   - Optimized search operations

### Error Handling & Retry Mechanism

The `HtmlScraper` implements a robust retry mechanism to handle temporary network issues and improve scraping reliability.

1. **Retry Logic**

   - The `scrapePageWithRetry` method attempts to scrape a page multiple times if the initial attempt fails.
   - It uses exponential backoff to increase the delay between retries, reducing the load on the target server.
   - The maximum number of retries and the base delay are configurable.

2. **Error Classification**

   - The scraper distinguishes between different types of errors to determine whether a retry is appropriate.
   - It retries only on 4xx responses (for example, `429 Too Many Requests`), which can succeed on a later attempt once temporary throttling or rate limits clear.
   - It does not retry on other errors, such as 5xx responses, which the scraper treats as server-side failures that a retry is unlikely to resolve.

3. **Customizable Options**

   - The retry mechanism can be customized by passing a `RetryOptions` object to the `scrapePageWithRetry` method.
   - The `RetryOptions` object allows you to configure the maximum number of retries and the base delay.
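
As a rough sketch of how such a retry wrapper with exponential backoff typically works (the `RetryOptions` field names and the wrapper's signature are illustrative guesses, not the project's exact implementation):

```typescript
// Illustrative retry-with-exponential-backoff sketch (assumed option names).
type RetryOptions = {
  maxRetries: number; // how many retries after the first failure
  baseDelayMs: number; // delay before the first retry; doubles each attempt
};

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapePageWithRetry<T>(
  scrape: () => Promise<T>,
  { maxRetries, baseDelayMs }: RetryOptions,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await scrape();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retries exhausted
      // Exponential backoff: baseDelay * 2^attempt.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Usage: an operation that succeeds on its third attempt.
let calls = 0;
const result = await scrapePageWithRetry(
  async () => {
    calls++;
    if (calls < 3) throw new Error("HTTP 429");
    return "page content";
  },
  { maxRetries: 3, baseDelayMs: 10 },
);
```

The doubling delay (10 ms, 20 ms, 40 ms, ...) spreads retries out so a struggling server is not hammered at a fixed interval.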