Commit 45d0e93
committed
feat(scraper): enhance crawler controls with scope and redirect options
This commit introduces more powerful and flexible control over the web scraper's
crawling behavior. It closes #15:
1. Added a new 'scope' option with three levels of crawling boundaries:
- 'subpages': Only crawl URLs on the same hostname and path (default)
- 'hostname': Crawl any URL on the same hostname regardless of path
- 'domain': Crawl across subdomains of the same top-level domain
2. Added a 'followRedirects' option to control HTTP redirect handling:
- When true: Redirects are followed automatically (default)
- When false: Redirects trigger a RedirectError and are skipped
3. Removed the deprecated 'subpagesOnly' option, replacing it with the more
flexible 'scope' option for better crawling control
4. Added CLI options to expose these new features:
- '--scope <scope>' to control crawling boundaries
- '--no-follow-redirects' to disable following HTTP redirects
These changes give users more precise control over what gets crawled, helping
with cases like:
- Only crawling documentation pages in a specific section of a site
- Crawling across different sections of the same site
- Following documentation across subdomains (api.example.com, docs.example.com)
- Preventing redirects to external sites
The implementation leverages the Public Suffix List (psl) for proper domain
boundary detection and adds comprehensive tests for all new functionality.1 parent 5602894 commit 45d0e93
File tree
15 files changed
+570
-43
lines changed- src
- mcp
- scraper
- fetcher
- strategies
- tools
- utils
15 files changed
+570
-43
lines changedSome generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
48 | 48 | | |
49 | 49 | | |
50 | 50 | | |
| 51 | + | |
51 | 52 | | |
52 | 53 | | |
53 | 54 | | |
| |||
70 | 71 | | |
71 | 72 | | |
72 | 73 | | |
| 74 | + | |
73 | 75 | | |
74 | 76 | | |
75 | 77 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
54 | 54 | | |
55 | 55 | | |
56 | 56 | | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
57 | 74 | | |
58 | 75 | | |
59 | 76 | | |
| |||
65 | 82 | | |
66 | 83 | | |
67 | 84 | | |
| 85 | + | |
| 86 | + | |
68 | 87 | | |
69 | 88 | | |
70 | 89 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
66 | 66 | | |
67 | 67 | | |
68 | 68 | | |
69 | | - | |
| 69 | + | |
70 | 70 | | |
71 | 71 | | |
72 | 72 | | |
| |||
80 | 80 | | |
81 | 81 | | |
82 | 82 | | |
83 | | - | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
84 | 89 | | |
85 | 90 | | |
86 | 91 | | |
87 | | - | |
| 92 | + | |
88 | 93 | | |
89 | 94 | | |
90 | | - | |
| 95 | + | |
91 | 96 | | |
92 | 97 | | |
93 | | - | |
94 | 98 | | |
95 | 99 | | |
96 | 100 | | |
| |||
100 | 104 | | |
101 | 105 | | |
102 | 106 | | |
103 | | - | |
| 107 | + | |
| 108 | + | |
104 | 109 | | |
105 | 110 | | |
106 | 111 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | 2 | | |
3 | | - | |
| 3 | + | |
4 | 4 | | |
5 | 5 | | |
6 | 6 | | |
| |||
104 | 104 | | |
105 | 105 | | |
106 | 106 | | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
107 | 189 | | |
108 | 190 | | |
109 | 191 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
2 | | - | |
| 1 | + | |
| 2 | + | |
3 | 3 | | |
4 | 4 | | |
5 | 5 | | |
| |||
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
| 24 | + | |
| 25 | + | |
24 | 26 | | |
25 | 27 | | |
26 | 28 | | |
27 | | - | |
| 29 | + | |
28 | 30 | | |
29 | 31 | | |
30 | 32 | | |
31 | 33 | | |
32 | | - | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
33 | 39 | | |
34 | 40 | | |
35 | 41 | | |
| |||
42 | 48 | | |
43 | 49 | | |
44 | 50 | | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
45 | 59 | | |
46 | 60 | | |
47 | 61 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
27 | 27 | | |
28 | 28 | | |
29 | 29 | | |
| 30 | + | |
| 31 | + | |
30 | 32 | | |
31 | 33 | | |
32 | 34 | | |
| |||
0 commit comments