feat: refactor content processing to middleware pipeline
Refactors the scraper's content processing logic from dedicated processor classes (HtmlProcessor, MarkdownProcessor) to a more flexible middleware pipeline architecture (`src/scraper/middleware/`).
Highlights:
- Introduces `ContentProcessingPipeline` and `ContentProcessorMiddleware` interface.
- Creates individual middleware components for parsing, metadata/link extraction, sanitization/cleaning, and HTML-to-Markdown conversion.
- Updates strategies (`WebScraperStrategy`, `LocalFileStrategy`) and `FetchUrlTool` to construct and use the appropriate middleware pipeline based on content type.
- Removes old processor classes and related files/tests.
- Adds new tests for individual middleware components and updates strategy tests.
- Updates `ARCHITECTURE.md` to reflect the new design.
This improves modularity, testability, and configurability of the content processing flow.
Closes #17
@@ -56,20 +59,69 @@ Each source type has a dedicated strategy that understands its specific protocol
 
 ### Content Processing Flow
 
+Raw content fetched by a strategy's `fetcher` (e.g., HTML, Markdown) is processed through a configurable middleware pipeline. See the Middleware Pipeline section below for details.
+
 ```mermaid
-graph LR
-    S[Source URL] --> R[Registry]
-    R --> ST[Strategy Selection]
-    ST --> F[Fetch Content]
-    F --> P[Process Content]
-    P --> D[Document Creation]
-```
+graph TD
+    subgraph Strategy Execution
+        F[Fetcher Fetches RawContent]
+        CtxIn[Create Initial Context]
+        Pipe[Run Pipeline]
+        CtxOut[Get Final Context]
+        Doc[Create Document from Context]
+    end
 
-The registry automatically selects the appropriate strategy based on the URL scheme, ensuring:
+    subgraph ContentProcessingPipeline
+        direction LR
+        M1[Middleware 1] --> M2[Middleware 2] --> M3[...]
+    end
+
+    F --> CtxIn
+    CtxIn --> Pipe
+    Pipe -- Passes Context --> M1
+    M1 -- Passes Context --> M2
+    M2 -- Passes Context --> M3
+    M3 -- Returns Final Context --> CtxOut
+    CtxOut --> Doc
+```
 
-- Consistent handling across different content sources
-- Unified document format for storage
-- Reusable content processing logic
+- **`ContentProcessingContext`**: An object passed through the pipeline, carrying the content (initially raw, potentially transformed), MIME type, source URL, extracted metadata, links, errors, and options. HTML processing also uses a `dom` property on the context to hold the parsed JSDOM object.
+- **`ContentProcessorMiddleware`**: Individual, reusable components that perform specific tasks on the context, such as:
+  - Sanitizing and cleaning HTML (`HtmlSanitizerMiddleware`)
+  - Converting HTML to Markdown (`HtmlToMarkdownMiddleware`)
+- **`ContentProcessingPipeline`**: Executes a sequence of middleware components in order, passing the context object between them.
+- **Strategies (`WebScraperStrategy`, `LocalFileStrategy`, etc.)**: Construct and run the appropriate pipeline based on the fetched content's MIME type. After the pipeline completes, the strategy uses the final `content` and `metadata` from the context to create the `Document` object.
+
+This middleware approach ensures:
+
+- **Modularity:** Processing steps are isolated and reusable.
+- **Configurability:** Pipelines can be easily assembled for different content types.
+- **Testability:** Individual middleware components can be tested independently.
+- **Consistency:** Ensures a unified document format regardless of the source.
+
+### Middleware Pipeline
+
+The core of content processing is the middleware pipeline (`ContentProcessingPipeline` located in `src/scraper/middleware/`). This pattern allows for modular and reusable processing steps.
+
+- **`ContentProcessingContext`**: An object passed through the pipeline, carrying the content (initially raw, potentially transformed), MIME type, source URL, extracted metadata, links, errors, and options. HTML processing also uses a `dom` property on the context to hold the parsed JSDOM object.
+- **`ContentProcessorMiddleware`**: Individual, reusable components that perform specific tasks on the context, such as:
+  - Sanitizing and cleaning HTML (`HtmlSanitizerMiddleware`)
+  - Converting HTML to Markdown (`HtmlToMarkdownMiddleware`)
+- **`ContentProcessingPipeline`**: Executes a sequence of middleware components in order, passing the context object between them.
+- **Strategies (`WebScraperStrategy`, `LocalFileStrategy`, etc.)**: Construct and run the appropriate pipeline based on the fetched content's MIME type. After the pipeline completes, the strategy uses the final `content` and `metadata` from the context to create the `Document` object.
+
+This middleware approach ensures:
+
+- **Modularity:** Processing steps are isolated and reusable.
+- **Configurability:** Pipelines can be easily assembled for different content types.
+- **Testability:** Individual middleware components can be tested independently.
+- **Consistency:** Ensures a unified document format regardless of the source.
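
For readers skimming the diff, the flow described in the new `ARCHITECTURE.md` text can be sketched roughly as follows. This is a minimal illustration and not the code in `src/scraper/middleware/`: the field names on the context, the synchronous `process` signature, the two toy middleware (`extractTitle`, `stripTags`), and the `pipelineFor` helper are all assumptions made for the example. The real middleware is likely asynchronous and works on a parsed JSDOM object rather than regular expressions.

```typescript
// Hypothetical context shape, based on the fields listed in the description.
interface ContentProcessingContext {
  content: string; // initially raw, potentially transformed
  contentType: string; // MIME type
  source: string; // source URL
  metadata: Record<string, unknown>;
  links: string[];
  errors: Error[];
}

// Hypothetical middleware contract: each step mutates the shared context.
interface ContentProcessorMiddleware {
  process(context: ContentProcessingContext): void;
}

// Runs the middleware in order, passing the same context object to each.
class ContentProcessingPipeline {
  constructor(private readonly middleware: ContentProcessorMiddleware[]) {}

  run(context: ContentProcessingContext): ContentProcessingContext {
    for (const step of this.middleware) {
      step.process(context);
    }
    return context;
  }
}

// Toy middleware (illustrative only): pull the <title> text into metadata.
const extractTitle: ContentProcessorMiddleware = {
  process(ctx) {
    const match = /<title>([^<]*)<\/title>/i.exec(ctx.content);
    if (match) ctx.metadata.title = match[1].trim();
  },
};

// Toy middleware (illustrative only): drop tags and collapse whitespace.
const stripTags: ContentProcessorMiddleware = {
  process(ctx) {
    ctx.content = ctx.content
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim();
  },
};

// Hypothetical helper: a strategy picks a pipeline from the MIME type.
function pipelineFor(mimeType: string): ContentProcessingPipeline {
  return mimeType.startsWith("text/html")
    ? new ContentProcessingPipeline([extractTitle, stripTags])
    : new ContentProcessingPipeline([]); // pass other content through unchanged
}

// Usage: build the initial context, run the pipeline, read the final context.
const initial: ContentProcessingContext = {
  content: "<html><head><title> Hi </title></head><body><p>Hello</p></body></html>",
  contentType: "text/html",
  source: "https://example.com/",
  metadata: {},
  links: [],
  errors: [],
};
const result = pipelineFor(initial.contentType).run(initial);
```

Because each step only sees the context, a middleware can be unit-tested by handing it a hand-built context and inspecting the fields it changed, which is the testability benefit the description claims.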