Commit d058b48

feat(scraper): Implement local file scraping and refactor strategy pattern
This commit adds local file system scraping capabilities and improves the overall scraper architecture:

Core Changes:
- Add LocalFileStrategy for handling file:// URLs with directory traversal
- Rename DefaultScraperStrategy to WebScraperStrategy
- Introduce ContentFetcher & ContentProcessor abstractions
- Add HtmlProcessor and MarkdownProcessor implementations

Architecture Improvements:
- Separate content fetching from processing logic
- Add scraping strategy tests with proper mocking
- Update ARCHITECTURE.md with new component documentation

The changes make the scraper more modular and extensible while maintaining a clean separation of concerns. Local file system scraping now works with both HTML and Markdown files, using the same content processing pipeline as web content.
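The fetch/process split described in the commit message could be sketched roughly as below. Only the names ContentFetcher, ContentProcessor, HtmlProcessor, and MarkdownProcessor come from the commit; the interface shapes, the RawContent type, and the method signatures are assumptions for illustration, not the actual implementation.

```typescript
// Assumed shape of the fetched payload (hypothetical type).
interface RawContent {
  content: string;
  mimeType: string;
}

// Fetching is separated from processing: a fetcher only retrieves bytes.
interface ContentFetcher {
  canFetch(source: string): boolean;
  fetch(source: string): Promise<RawContent>;
}

// A processor normalizes fetched content (e.g., to markdown).
interface ContentProcessor {
  canProcess(raw: RawContent): boolean;
  process(raw: RawContent): Promise<string>;
}

class MarkdownProcessor implements ContentProcessor {
  canProcess(raw: RawContent): boolean {
    return raw.mimeType === "text/markdown";
  }
  async process(raw: RawContent): Promise<string> {
    // Markdown passes through unchanged.
    return raw.content;
  }
}

class HtmlProcessor implements ContentProcessor {
  canProcess(raw: RawContent): boolean {
    return raw.mimeType === "text/html";
  }
  async process(raw: RawContent): Promise<string> {
    // A real implementation would convert HTML to markdown;
    // tag-stripping here is only a placeholder.
    return raw.content.replace(/<[^>]+>/g, "");
  }
}
```

Because both processors share one interface, local files and web pages can flow through the same pipeline once fetched, which is the "same content processing pipeline" claim above.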
1 parent 59b4a33 commit d058b48

32 files changed: +1893 −1524 lines

ARCHITECTURE.md

Lines changed: 51 additions & 10 deletions
````diff
@@ -17,18 +17,52 @@ The Documentation MCP Server is designed with a modular architecture that ensure
 
 ```
 src/
-├── cli.ts # CLI interface implementation
-├── index.ts # MCP server interface
-├── pipeline/ # Document processing pipeline
-├── scraper/ # Web scraping implementation
-│   └── strategies/ # Scraping strategies for different sources
-├── splitter/ # Document splitting and chunking
-├── store/ # Vector store and document storage
-├── tools/ # Core functionality tools
-├── types/ # Shared type definitions
-└── utils/ # Common utilities and helpers
+├── cli.ts # CLI interface implementation
+├── index.ts # MCP server interface
+├── pipeline/ # Document processing pipeline
+├── scraper/ # Web scraping implementation
+│   ├── strategies/ # Scraping strategies for different sources
+│   │   ├── WebScraperStrategy.ts # Handles HTTP/HTTPS content
+│   │   └── LocalFileStrategy.ts # Handles local filesystem content
+│   ├── fetcher/ # Content fetching abstractions
+│   ├── processor/ # Content processing abstractions
+│   └── types.ts # Shared scraper types
+├── splitter/ # Document splitting and chunking
+├── store/ # Vector store and document storage
+├── tools/ # Core functionality tools
+├── types/ # Shared type definitions
+└── utils/ # Common utilities and helpers
 ```
 
+## Scraper Architecture
+
+The scraping system uses a strategy pattern combined with content abstractions to handle different documentation sources uniformly:
+
+### Content Sources
+
+- Web-based content (HTTP/HTTPS)
+- Local filesystem content (file://)
+- Package registry content (e.g., npm, PyPI)
+
+Each source type has a dedicated strategy that understands its specific protocol and structure, while sharing common processing logic.
+
+### Content Processing Flow
+
+```mermaid
+graph LR
+    S[Source URL] --> R[Registry]
+    R --> ST[Strategy Selection]
+    ST --> F[Fetch Content]
+    F --> P[Process Content]
+    P --> D[Document Creation]
+```
+
+The registry automatically selects the appropriate strategy based on the URL scheme, ensuring:
+
+- Consistent handling across different content sources
+- Unified document format for storage
+- Reusable content processing logic
+
 ## Tools Layer
 
 The project maintains a `tools/` directory containing modular implementations of core functionality. This design choice ensures that:
````
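The scheme-based strategy selection documented in the Scraper Architecture section can be sketched as follows. Only the class names WebScraperStrategy and LocalFileStrategy come from the diff; the ScraperStrategy interface, the registry class, and the method names are assumptions for illustration.

```typescript
interface ScraperStrategy {
  canHandle(url: string): boolean;
  scrape(url: string): Promise<void>;
}

class WebScraperStrategy implements ScraperStrategy {
  canHandle(url: string): boolean {
    // Handles HTTP/HTTPS content.
    const scheme = new URL(url).protocol;
    return scheme === "http:" || scheme === "https:";
  }
  async scrape(url: string): Promise<void> {
    // Fetch pages over HTTP(S), follow links, process content.
  }
}

class LocalFileStrategy implements ScraperStrategy {
  canHandle(url: string): boolean {
    // Handles local filesystem content via file:// URLs.
    return new URL(url).protocol === "file:";
  }
  async scrape(url: string): Promise<void> {
    // Walk the local directory tree and process each file.
  }
}

class ScraperRegistry {
  constructor(private readonly strategies: ScraperStrategy[]) {}

  // Return the first strategy whose scheme check accepts the URL.
  getStrategy(url: string): ScraperStrategy {
    const strategy = this.strategies.find((s) => s.canHandle(url));
    if (!strategy) throw new Error(`No scraper strategy for ${url}`);
    return strategy;
  }
}
```

With this shape, callers never branch on the source type themselves; the registry maps a URL to a strategy, and every strategy feeds the same downstream processing pipeline.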
````diff
@@ -138,3 +172,10 @@ When adding new functionality:
 4. Add CLI command in `cli.ts`
 5. Add MCP tool in `index.ts`
 6. Maintain consistent error handling and progress reporting
+
+When adding new scraping capabilities:
+
+1. Implement a new strategy in `scraper/strategies/`
+2. Update the registry to handle the new source type
+3. Reuse existing content processing where possible
+4. Consider bulk operations and progress reporting
````
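Steps 1 and 2 of the contributor checklist above might look like the sketch below for a new source type. Everything here is invented for illustration: NpmStrategy, the interface shape, and the wiring function are not part of the actual codebase.

```typescript
interface ScraperStrategy {
  canHandle(url: string): boolean;
  scrape(url: string): Promise<void>;
}

// Step 1: implement the new strategy in scraper/strategies/.
// NpmStrategy is a hypothetical example for an npm:// scheme.
class NpmStrategy implements ScraperStrategy {
  canHandle(url: string): boolean {
    return url.startsWith("npm://");
  }
  async scrape(url: string): Promise<void> {
    // Resolve the package's README and docs, then reuse the shared
    // content processing pipeline (step 3 of the checklist).
  }
}

// Step 2: register it so the registry can route npm:// URLs to it.
function buildStrategies(existing: ScraperStrategy[]): ScraperStrategy[] {
  return [...existing, new NpmStrategy()];
}
```

Because selection is driven by `canHandle`, adding a source type is purely additive: no existing strategy or caller needs to change.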
