Commit 00f9a2f

feat: refactor content processing to middleware pipeline
Refactors the scraper's content processing logic from dedicated processor classes (`HtmlProcessor`, `MarkdownProcessor`) to a more flexible middleware pipeline architecture (`src/scraper/middleware/`). Highlights:

- Introduces `ContentProcessingPipeline` and the `ContentProcessorMiddleware` interface.
- Creates individual middleware components for parsing, metadata/link extraction, sanitization/cleaning, and HTML-to-Markdown conversion.
- Updates the strategies (`WebScraperStrategy`, `LocalFileStrategy`) and `FetchUrlTool` to construct and use the appropriate middleware pipeline based on content type.
- Removes the old processor classes and related files/tests.
- Adds new tests for the individual middleware components and updates the strategy tests.
- Updates `ARCHITECTURE.md` to reflect the new design.

This improves the modularity, testability, and configurability of the content processing flow.

Closes #17
1 parent 2725c19 commit 00f9a2f

36 files changed: +2,596 −1,265 lines
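The pattern this commit introduces can be sketched in a few lines. This is a minimal illustration with simplified, assumed types; the real interfaces live in `src/scraper/middleware/types.ts` and carry more fields (such as the `dom` property and processing options):

```typescript
// Minimal sketch of the middleware pipeline pattern. Types are simplified
// assumptions, not the project's real interfaces.

interface ContentProcessingContext {
  content: string;                    // raw at first, transformed by middleware
  mimeType: string;
  source: string;
  metadata: Record<string, unknown>;  // filled in by extractor middleware
  links: string[];
  errors: Error[];
}

interface ContentProcessorMiddleware {
  process(ctx: ContentProcessingContext): Promise<void>;
}

class ContentProcessingPipeline {
  constructor(private readonly middleware: ContentProcessorMiddleware[]) {}

  // Runs each middleware in order, threading one shared context through.
  async run(ctx: ContentProcessingContext): Promise<ContentProcessingContext> {
    for (const m of this.middleware) {
      await m.process(ctx);
    }
    return ctx;
  }
}

// Toy middleware standing in for MarkdownMetadataExtractorMiddleware:
// pulls the first ATX heading out of the content as the title.
const titleExtractor: ContentProcessorMiddleware = {
  async process(ctx) {
    const match = ctx.content.match(/^#\s+(.+)$/m);
    if (match) ctx.metadata.title = match[1];
  },
};
```

A strategy would assemble something like `new ContentProcessingPipeline([titleExtractor, ...])` for the detected content type and run it over the fetched content.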

ARCHITECTURE.md

Lines changed: 64 additions & 12 deletions
````diff
@@ -29,7 +29,10 @@ src/
 │   │   └── LocalFileStrategy.ts   # Handles local filesystem content
 │   │   └── ...
 │   ├── fetcher/           # Content fetching abstractions
-│   ├── processor/         # Content processing abstractions
+│   ├── middleware/        # Content processing pipeline & middleware
+│   │   ├── Pipeline.ts    # Orchestrates middleware execution
+│   │   ├── types.ts       # Context and middleware interfaces
+│   │   └── components/    # Individual middleware implementations
 │   └── ...
 ├── splitter/              # Document splitting and chunking
 ├── store/                 # Document storage and retrieval
@@ -56,20 +59,69 @@ Each source type has a dedicated strategy that understands its specific protocol
 
 ### Content Processing Flow
 
+Raw content fetched by a strategy's `fetcher` (e.g., HTML, Markdown) is processed through a configurable middleware pipeline. See the Middleware Pipeline section below for details.
+
 ```mermaid
-graph LR
-    S[Source URL] --> R[Registry]
-    R --> ST[Strategy Selection]
-    ST --> F[Fetch Content]
-    F --> P[Process Content]
-    P --> D[Document Creation]
-```
+graph TD
+    subgraph Strategy Execution
+        F[Fetcher Fetches RawContent]
+        CtxIn[Create Initial Context]
+        Pipe[Run Pipeline]
+        CtxOut[Get Final Context]
+        Doc[Create Document from Context]
+    end
 
-The registry automatically selects the appropriate strategy based on the URL scheme, ensuring:
+    subgraph ContentProcessingPipeline
+        direction LR
+        M1[Middleware 1] --> M2[Middleware 2] --> M3[...]
+    end
+
+    F --> CtxIn
+    CtxIn --> Pipe
+    Pipe -- Passes Context --> M1
+    M1 -- Passes Context --> M2
+    M2 -- Passes Context --> M3
+    M3 -- Returns Final Context --> CtxOut
+    CtxOut --> Doc
+```
 
-- Consistent handling across different content sources
-- Unified document format for storage
-- Reusable content processing logic
+- **`ContentProcessingContext`**: An object passed through the pipeline, carrying the content (initially raw, potentially transformed), MIME type, source URL, extracted metadata, links, errors, and options. HTML processing also uses a `dom` property on the context to hold the parsed JSDOM object.
+- **`ContentProcessorMiddleware`**: Individual, reusable components that perform specific tasks on the context, such as:
+  - Parsing HTML (`HtmlDomParserMiddleware`)
+  - Extracting metadata (`HtmlMetadataExtractorMiddleware`, `MarkdownMetadataExtractorMiddleware`)
+  - Extracting links (`HtmlLinkExtractorMiddleware`, `MarkdownLinkExtractorMiddleware`)
+  - Sanitizing and cleaning HTML (`HtmlSanitizerMiddleware`)
+  - Converting HTML to Markdown (`HtmlToMarkdownMiddleware`)
+- **`ContentProcessingPipeline`**: Executes a sequence of middleware components in order, passing the context object between them.
+- **Strategies (`WebScraperStrategy`, `LocalFileStrategy`, etc.)**: Construct and run the appropriate pipeline based on the fetched content's MIME type. After the pipeline completes, the strategy uses the final `content` and `metadata` from the context to create the `Document` object.
+
+This middleware approach ensures:
+
+- **Modularity:** Processing steps are isolated and reusable.
+- **Configurability:** Pipelines can be easily assembled for different content types.
+- **Testability:** Individual middleware components can be tested independently.
+- **Consistency:** Ensures a unified document format regardless of the source.
+
+### Middleware Pipeline
+
+The core of content processing is the middleware pipeline (`ContentProcessingPipeline` located in `src/scraper/middleware/`). This pattern allows for modular and reusable processing steps.
+
+- **`ContentProcessingContext`**: An object passed through the pipeline, carrying the content (initially raw, potentially transformed), MIME type, source URL, extracted metadata, links, errors, and options. HTML processing also uses a `dom` property on the context to hold the parsed JSDOM object.
+- **`ContentProcessorMiddleware`**: Individual, reusable components that perform specific tasks on the context, such as:
+  - Parsing HTML (`HtmlDomParserMiddleware`)
+  - Extracting metadata (`HtmlMetadataExtractorMiddleware`, `MarkdownMetadataExtractorMiddleware`)
+  - Extracting links (`HtmlLinkExtractorMiddleware`, `MarkdownLinkExtractorMiddleware`)
+  - Sanitizing and cleaning HTML (`HtmlSanitizerMiddleware`)
+  - Converting HTML to Markdown (`HtmlToMarkdownMiddleware`)
+- **`ContentProcessingPipeline`**: Executes a sequence of middleware components in order, passing the context object between them.
+- **Strategies (`WebScraperStrategy`, `LocalFileStrategy`, etc.)**: Construct and run the appropriate pipeline based on the fetched content's MIME type. After the pipeline completes, the strategy uses the final `content` and `metadata` from the context to create the `Document` object.
+
+This middleware approach ensures:
+
+- **Modularity:** Processing steps are isolated and reusable.
+- **Configurability:** Pipelines can be easily assembled for different content types.
+- **Testability:** Individual middleware components can be tested independently.
+- **Consistency:** Ensures a unified document format regardless of the source.
 
 ## Tools Layer
 
````
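The HTML chain described in the diff above can be illustrated with a self-contained sketch. The step bodies here are regex-based stand-ins, not the real JSDOM-backed components, and the `Ctx` shape is a simplified assumption:

```typescript
// Simplified stand-ins for the HTML middleware chain in ARCHITECTURE.md.
// The real components operate on a parsed DOM via the context's `dom`
// property; the regexes below are for illustration only.
type Ctx = {
  content: string;
  metadata: Record<string, string>;
  links: string[];
};
type Step = (ctx: Ctx) => void;

// Stand-in for HtmlMetadataExtractorMiddleware: grab the <title>.
const extractMetadata: Step = (ctx) => {
  const m = ctx.content.match(/<title>([^<]*)<\/title>/i);
  if (m) ctx.metadata.title = m[1];
};

// Stand-in for HtmlLinkExtractorMiddleware: collect href targets.
const extractLinks: Step = (ctx) => {
  for (const m of ctx.content.matchAll(/href="([^"]+)"/g)) {
    ctx.links.push(m[1]);
  }
};

// Stand-in for HtmlSanitizerMiddleware: drop script blocks.
const stripScripts: Step = (ctx) => {
  ctx.content = ctx.content.replace(/<script[\s\S]*?<\/script>/gi, "");
};

// Order matters: extract first, then sanitize (then, in the real
// pipeline, convert to Markdown).
const htmlSteps: Step[] = [extractMetadata, extractLinks, stripScripts];

function runSteps(ctx: Ctx, steps: Step[]): Ctx {
  for (const step of steps) step(ctx);
  return ctx;
}
```

A Markdown pipeline would swap in the Markdown extractors and skip the DOM-dependent steps, which is the configurability the design aims for.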

biome.json

Lines changed: 13 additions & 1 deletion
```diff
@@ -12,5 +12,17 @@
   },
   "files": {
     "include": ["src/**/*.ts"]
-  }
+  },
+  "overrides": [
+    {
+      "include": ["src/**/*.test.ts"],
+      "linter": {
+        "rules": {
+          "style": {
+            "noNonNullAssertion": "off"
+          }
+        }
+      }
+    }
+  ]
 }
```
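The override above disables Biome's `noNonNullAssertion` style rule for test files only. A hypothetical test-file snippet shows the pattern it permits:

```typescript
// In a src/**/*.test.ts file, the override lets assertions like this pass
// lint: Map.get returns `number | undefined`, and `!` asserts non-null.
const results = new Map<string, number>([["processed", 3]]);
const processed: number = results.get("processed")!; // flagged outside tests
```

Outside test files the rule still applies, keeping production code on explicit `undefined` handling.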

package-lock.json

Lines changed: 9 additions & 2 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 1 addition & 0 deletions
```diff
@@ -35,6 +35,7 @@
     "db:push": "drizzle-kit push"
   },
   "dependencies": {
+    "@joplin/turndown-plugin-gfm": "^1.0.61",
     "@langchain/aws": "^0.1.8",
     "@langchain/community": "^0.3.34",
     "@langchain/google-genai": "^0.2.3",
```

src/cli.ts

Lines changed: 1 addition & 6 deletions
```diff
@@ -5,7 +5,6 @@ import packageJson from "../package.json";
 import { DEFAULT_MAX_CONCURRENCY, DEFAULT_MAX_DEPTH, DEFAULT_MAX_PAGES } from "./config";
 import { PipelineManager } from "./pipeline/PipelineManager";
 import { FileFetcher, HttpFetcher } from "./scraper/fetcher";
-import { HtmlProcessor } from "./scraper/processor";
 import { DocumentManagementService } from "./store/DocumentManagementService";
 import {
   FetchUrlTool,
@@ -36,11 +35,7 @@ async function main() {
     findVersion: new FindVersionTool(docService),
     scrape: new ScrapeTool(docService, pipelineManager), // Pass manager
     search: new SearchTool(docService),
-    fetchUrl: new FetchUrlTool(
-      new HttpFetcher(),
-      new FileFetcher(),
-      new HtmlProcessor(),
-    ),
+    fetchUrl: new FetchUrlTool(new HttpFetcher(), new FileFetcher()),
   };
 
   const program = new Command();
```

src/mcp/index.ts

Lines changed: 2 additions & 6 deletions
```diff
@@ -7,7 +7,6 @@ import { DEFAULT_MAX_DEPTH, DEFAULT_MAX_PAGES } from "../config";
 import { PipelineManager } from "../pipeline/PipelineManager";
 import { PipelineJobStatus } from "../pipeline/types";
 import { FileFetcher, HttpFetcher } from "../scraper/fetcher";
-import { HtmlProcessor } from "../scraper/processor";
 import { DocumentManagementService } from "../store/DocumentManagementService";
 import {
   CancelJobTool,
@@ -49,11 +48,8 @@ export async function startServer() {
     getJobInfo: new GetJobInfoTool(pipelineManager),
     cancelJob: new CancelJobTool(pipelineManager),
     remove: new RemoveTool(docService),
-    fetchUrl: new FetchUrlTool(
-      new HttpFetcher(),
-      new FileFetcher(),
-      new HtmlProcessor(),
-    ),
+    // FetchUrlTool now uses middleware pipeline internally
+    fetchUrl: new FetchUrlTool(new HttpFetcher(), new FileFetcher()),
   };
 
   const server = new McpServer(
```
