Commit 4296373

Slackluky, ascorbic, and Princesseuh authored
Feature(sitemap): named files chunking strategy (#14471)
* feat(sitemap): add chunking strategy for sitemaps

  Adds the ability to split sitemap generation into chunks based on customizable logic. This allows for better management of large sitemaps and improved performance. The new `chunks` option in the sitemap configuration allows users to define functions that categorize sitemap items into different chunks. Each chunk is then written to a separate sitemap file. This change introduces a new `writeSitemapChunk` function to handle the writing of individual sitemap chunks.

* feat(sitemap): add chunks option to sitemap config

  Adds a `chunks` option to the sitemap configuration schema. This allows users to define custom chunking strategies for generating sitemaps, providing flexibility in how the sitemap is split into multiple files.

* feat(sitemap): add sitemap chunk writing functionality

* fix(sitemap): fix empty callback in writeSitemap

  The empty callback function in the `writeSitemap` function was causing unnecessary function calls. This commit fixes this by removing the empty callback.

* feat(sitemap): add test fixture for sitemap chunking

  This commit adds a test fixture to verify the sitemap chunking functionality. It includes a configuration file, dependencies, and several pages to simulate a real-world scenario.

* test(sitemap): add test for sitemap chunking with files

* feat(sitemap): add changeset for sitemap chunking

  Adds changeset to document the new sitemap chunking feature. This feature allows splitting sitemap generation into chunks based on customizable logic, improving management of large sitemaps and performance.

* build: update dependencies and add astro

* chore: remove unused astro dependency

* chore: remove unused entries from lockfile

* refactor(sitemap): improve import ordering and formatting

* refactor(sitemap): improve import ordering

  The import order of `AstroConfig` has been moved to align with other imports, improving code readability and consistency. This change ensures that type imports are grouped together, making the codebase easier to maintain.

* refactor(sitemap): improve import ordering

* refactor(sitemap): improve import ordering

* refactor(sitemap): improve import ordering

* refactor(sitemap): improve chunk file test readability

  Simplify the chunk file test by using `path.resolve` and `includes` for better readability and maintainability. This change improves the test's clarity without altering its functionality.

* test(sitemap): fix flaky chunk file tests

  The tests were failing intermittently because the `readXML` function was not properly resolving the file path. This commit updates the `readXML` function to use `fixture.readFile` to ensure that the file path is resolved correctly. Additionally, the `flatMapUrls` function is now async to ensure that the `readXML` function is awaited.

* refactor(sitemap): improve import ordering

* Update .changeset/floppy-times-grab.md

  Co-authored-by: Matt Kane <m@mk.gg>

* chore(sitemap): update changeset to minor

  The previous changeset incorrectly marked the sitemap chunking feature as a major change. This commit corrects the changeset to reflect that it is a minor feature addition.

* feat(sitemap): add chunking support for sitemap generation

* fix: attempt to fix lockfile

* fix: conflict

* fix: lockfile

---------

Co-authored-by: Matt Kane <m@mk.gg>
Co-authored-by: Princesseuh <3019731+Princesseuh@users.noreply.github.com>
1 parent 051a62d commit 4296373

16 files changed

Lines changed: 419 additions & 34 deletions

File tree

.changeset/floppy-times-grab.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
---
'@astrojs/sitemap': minor
---

Adds the ability to split sitemap generation into chunks based on customizable logic. This allows for better management of large sitemaps and improved performance. The new `chunks` option in the sitemap configuration allows users to define functions that categorize sitemap items into different chunks. Each chunk is then written to a separate sitemap file.

```
integrations: [
  sitemap({
    serialize(item) {
      return item
    },
    chunks: { // this property is processed last, after the other options
      'blog': (item) => { // produces a sitemap file named after the key (sitemap-blog-0.xml)
        if (/blog/.test(item.url)) { // select the paths that belong in this sitemap file
          item.changefreq = 'weekly';
          item.lastmod = new Date();
          item.priority = 0.9; // set properties specific to the selected paths
          return item;
        }
      },
      'glossary': (item) => {
        if (/glossary/.test(item.url)) {
          item.changefreq = 'weekly';
          item.lastmod = new Date();
          item.priority = 0.7;
          return item;
        }
      }

      // all remaining paths are stored in `sitemap-pages-0.xml`
    },
  }),
],

```
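
The two callbacks in the changeset differ only in the matched word and the priority, so they could plausibly be produced by a small factory. The following is a minimal sketch, not part of this commit: `Item` is a simplified stand-in for the integration's `SitemapItem`, and `bySegment` is a hypothetical helper name. It matches whole path segments rather than using a regex, and it returns a copy instead of mutating the item it receives.

```typescript
// Simplified stand-in for the integration's SitemapItem type (assumption).
type Item = {
  url: string;
  changefreq?: string;
  lastmod?: Date | string;
  priority?: number;
};
type ChunkFn = (item: Item) => Item | undefined;

// Hypothetical helper: builds a chunk callback that claims URLs whose path
// contains `segment` as a whole path segment and stamps shared metadata.
function bySegment(segment: string, priority: number): ChunkFn {
  return (item) => {
    const { pathname } = new URL(item.url);
    if (!pathname.split('/').includes(segment)) {
      return undefined; // not claimed by this chunk
    }
    // Return a copy so the item shared with other callbacks stays untouched.
    return { ...item, changefreq: 'weekly', priority };
  };
}

// Shaped like the `chunks` option shown in the changeset.
const chunks = {
  blog: bySegment('blog', 0.9),
  glossary: bySegment('glossary', 0.7),
};
```

Segment matching is a design choice here: a pattern such as `/blog/` also matches a URL like `/blogging/`, which may or may not be what a site wants.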

packages/integrations/sitemap/src/index.ts

Lines changed: 76 additions & 32 deletions
@@ -7,8 +7,10 @@ import { ZodError } from 'zod';
 import { generateSitemap } from './generate-sitemap.js';
 import { validateOptions } from './validate-options.js';
 import { writeSitemap } from './write-sitemap.js';
+import { writeSitemapChunk } from './write-sitemap-chunk.js';

 export { EnumChangefreq as ChangeFreqEnum } from 'sitemap';
+
 export type ChangeFreq = `${EnumChangefreq}`;
 export type SitemapItem = Pick<
   SitemapItemLoose,
@@ -18,36 +20,35 @@ export type LinkItem = LinkItemBase;

 export type SitemapOptions =
   | {
-      filenameBase?: string;
-      filter?(page: string): boolean;
-      customSitemaps?: string[];
-      customPages?: string[];
-
-      i18n?: {
-        defaultLocale: string;
-        locales: Record<string, string>;
-      };
-      // number of entries per sitemap file
-      entryLimit?: number;
-
-      // sitemap specific
-      changefreq?: ChangeFreq;
-      lastmod?: Date;
-      priority?: number;
-
-      // called for each sitemap item just before to save them on disk, sync or async
-      serialize?(item: SitemapItem): SitemapItem | Promise<SitemapItem | undefined> | undefined;
-
-      xslURL?: string;
-
-      // namespace configuration
-      namespaces?: {
-        news?: boolean;
-        xhtml?: boolean;
-        image?: boolean;
-        video?: boolean;
-      };
-    }
+    filenameBase?: string;
+    filter?(page: string): boolean;
+    customSitemaps?: string[];
+    customPages?: string[];
+
+    i18n?: {
+      defaultLocale: string;
+      locales: Record<string, string>;
+    };
+    // number of entries per sitemap file
+    entryLimit?: number;
+    // sitemap specific
+    changefreq?: ChangeFreq;
+    lastmod?: Date;
+    priority?: number;
+
+    // called for each sitemap item just before to save them on disk, sync or async
+    serialize?(item: SitemapItem): SitemapItem | Promise<SitemapItem | undefined> | undefined;
+
+    xslURL?: string;
+    chunks?: Record<string, (item: SitemapItem) => SitemapItem | Promise<SitemapItem | undefined> | undefined>
+    // namespace configuration
+    namespaces?: {
+      news?: boolean;
+      xhtml?: boolean;
+      image?: boolean;
+      video?: boolean;
+    };
+  }
   | undefined;

 function formatConfigErrorMessage(err: ZodError) {
@@ -102,8 +103,7 @@ const createPlugin = (options?: SitemapOptions): AstroIntegration => {

         const opts = validateOptions(config.site, options);

-        const { filenameBase, filter, customPages, customSitemaps, serialize, entryLimit } = opts;
-
+        const { filenameBase, filter, customPages, customSitemaps, serialize, entryLimit, chunks } = opts;
         const outFile = `${filenameBase}-index.xml`;
         const finalSiteUrl = new URL(config.base, config.site);
         const shouldIgnoreStatus = isStatusCodePage(Object.keys(opts.i18n?.locales ?? {}));
@@ -179,9 +179,53 @@ const createPlugin = (options?: SitemapOptions): AstroIntegration => {
             return;
           }
         }
+
         const destDir = fileURLToPath(dir);
         const lastmod = opts.lastmod?.toISOString();
         const xslURL = opts.xslURL ? new URL(opts.xslURL, finalSiteUrl).href : undefined;
+
+        if (chunks) {
+          try {
+            let groupedUrlCollection: SitemapItem['url'][] = []
+            const chunksItem: Record<string, SitemapItem[]> = {};
+            for (const [key, cb] of Object.entries(chunks)) {
+              // Create a new, separate collection for each key
+              const collection: SitemapItem[] = [];
+
+              for (const item of urlData) {
+                // Await the asynchronous operation
+                const collect = await Promise.resolve(cb(item));
+                if (collect) {
+                  collection.push(collect);
+                }
+              }
+
+              // Assign the specific collection to its key
+              chunksItem[key] = collection;
+              groupedUrlCollection = [...groupedUrlCollection, ...collection.map((coll) => coll.url)]
+            }
+            chunksItem['pages'] = urlData.filter((urlDataItem) => !(groupedUrlCollection.includes(urlDataItem.url)))
+            // Process each chunk here
+            await writeSitemapChunk({
+              filenameBase,
+              hostname: finalSiteUrl.href,
+              sitemapHostname: finalSiteUrl.href,
+              sourceData: chunksItem,
+              destinationDir: destDir,
+              publicBasePath: config.base,
+              customSitemaps,
+              limit: entryLimit,
+              xslURL,
+              lastmod,
+              namespaces: opts.namespaces,
+            }, config);
+            logger.info(`\`${outFile}\` created at \`${path.relative(process.cwd(), destDir)}\``);
+            return
+          } catch (err) {
+            logger.error(`Error chunking sitemaps\n${(err as any).toString()}`);
+            return;
+          }
+        }
         await writeSitemap(
           {
             filenameBase: filenameBase,
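
The grouping done in the `if (chunks)` branch can be read in isolation from the file writing. Below is a minimal sketch of that step under stated assumptions: `Item` is a simplified stand-in for `SitemapItem`, `groupIntoChunks` is a hypothetical name, and a `Set` replaces the array that the diff rebuilds with spread on every iteration. It is an illustration of the technique, not the code in this commit.

```typescript
// Simplified stand-in for the integration's SitemapItem type (assumption).
type Item = { url: string; priority?: number };
type ChunkFn = (item: Item) => Item | undefined | Promise<Item | undefined>;

// Hypothetical helper mirroring the grouping loop in the hunk above.
async function groupIntoChunks(
  items: Item[],
  chunks: Record<string, ChunkFn>,
): Promise<Record<string, Item[]>> {
  const grouped: Record<string, Item[]> = {};
  const claimed = new Set<string>(); // URLs returned by any chunk callback

  for (const [name, fn] of Object.entries(chunks)) {
    const collection: Item[] = [];
    for (const item of items) {
      // Callbacks may be sync or async; `await` handles both.
      const result = await fn(item);
      if (result) {
        collection.push(result);
        claimed.add(result.url);
      }
    }
    grouped[name] = collection;
  }

  // Everything no callback claimed goes into the catch-all `pages` chunk.
  grouped['pages'] = items.filter((item) => !claimed.has(item.url));
  return grouped;
}
```

Two properties carry over from the loop in the diff: every callback sees every item, so one URL can land in more than one named chunk, and the `pages` key is assigned last, so a user-defined chunk with that name would be overwritten.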

packages/integrations/sitemap/src/schema.ts

Lines changed: 2 additions & 1 deletion
@@ -47,6 +47,7 @@ export const SitemapOptionsSchema = z
       })
       .optional()
       .default(SITEMAP_CONFIG_DEFAULTS.namespaces),
-  })
+    chunks: z.record(z.function().args(z.any()).returns(z.any())).optional(),
+  })
   .strict()
   .default(SITEMAP_CONFIG_DEFAULTS);

packages/integrations/sitemap/src/write-sitemap-chunk.ts

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
import { createWriteStream, type WriteStream } from 'node:fs';
import { mkdir } from 'node:fs/promises';
import { normalize, resolve } from 'node:path';
import { pipeline, Readable } from 'node:stream';
import { promisify } from 'node:util';
import type { AstroConfig } from 'astro';
import { SitemapAndIndexStream, SitemapIndexStream, SitemapStream } from 'sitemap';
import replace from 'stream-replace-string';
import type { SitemapItem } from './index.js';


type WriteSitemapChunkConfig = {
  filenameBase: string;
  hostname: string;
  sitemapHostname?: string;
  sourceData: Record<string, SitemapItem[]>;
  destinationDir: string;
  customSitemaps?: string[];
  publicBasePath?: string;
  limit?: number;
  xslURL?: string;
  lastmod?: string;
  namespaces?: {
    news?: boolean;
    xhtml?: boolean;
    image?: boolean;
    video?: boolean;
  };
};

// adapted from sitemap.js/sitemap-simple
export async function writeSitemapChunk(
  {
    filenameBase,
    hostname,
    sitemapHostname = hostname,
    sourceData,
    destinationDir,
    limit = 50000,
    customSitemaps = [],
    publicBasePath = './',
    xslURL: xslUrl,
    lastmod,
    namespaces = { news: true, xhtml: true, image: true, video: true },
  }: WriteSitemapChunkConfig,
  astroConfig: AstroConfig,
) {
  await mkdir(destinationDir, { recursive: true });

  // Normalize publicBasePath
  let normalizedPublicBasePath = publicBasePath;
  if (!normalizedPublicBasePath.endsWith('/')) {
    normalizedPublicBasePath += '/';
  }

  // Array to collect all sitemap URLs for the index
  const sitemapUrls: Array<{ url: string; lastmod?: string }> = [];

  // Process each chunk separately
  for (const [chunkName, items] of Object.entries(sourceData)) {
    const sitemapAndIndexStream = new SitemapAndIndexStream({
      limit,
      xslUrl,
      getSitemapStream: (i) => {
        const sitemapStream = new SitemapStream({
          hostname,
          xslUrl,
          // Custom namespace handling
          xmlns: {
            news: namespaces?.news !== false,
            xhtml: namespaces?.xhtml !== false,
            image: namespaces?.image !== false,
            video: namespaces?.video !== false,
          },
        });

        const path = `./${filenameBase}-${chunkName}-${i}.xml`;
        const writePath = resolve(destinationDir, path);
        const publicPath = normalize(normalizedPublicBasePath + path);

        let stream: WriteStream;
        if (astroConfig.trailingSlash === 'never' || astroConfig.build.format === 'file') {
          // workaround for trailing slash issue in sitemap.js
          const host = hostname.endsWith('/') ? hostname.slice(0, -1) : hostname;
          const searchStr = `<loc>${host}/</loc>`;
          const replaceStr = `<loc>${host}</loc>`;
          stream = sitemapStream
            .pipe(replace(searchStr, replaceStr))
            .pipe(createWriteStream(writePath));
        } else {
          stream = sitemapStream.pipe(createWriteStream(writePath));
        }

        const url = new URL(publicPath, sitemapHostname).toString();

        // Collect this sitemap URL for the index
        sitemapUrls.push({ url, lastmod });

        return [{ url, lastmod }, sitemapStream, stream];
      },
    });

    // Create a readable stream from this chunk's items
    const dataStream = Readable.from(items);

    // Write this chunk's sitemap(s)
    await promisify(pipeline)(dataStream, sitemapAndIndexStream);
  }

  // Now create the sitemap index with all the generated sitemaps
  const indexStream = new SitemapIndexStream({ xslUrl });
  const indexPath = resolve(destinationDir, `./${filenameBase}-index.xml`);
  const indexWriteStream = createWriteStream(indexPath);

  // Add custom sitemaps to the index
  for (const url of customSitemaps) {
    indexStream.write({ url, lastmod });
  }

  // Add all generated sitemaps to the index
  for (const sitemapUrl of sitemapUrls) {
    indexStream.write(sitemapUrl);
  }

  indexStream.end();

  return await promisify(pipeline)(indexStream, indexWriteStream);
}
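
The file naming and public URL construction inside `getSitemapStream` can be illustrated without any streams. The sketch below is an assumption-laden illustration, not the commit's code: `chunkFileUrls` is a hypothetical name, it assumes a non-empty chunk of `itemCount` entries is split into `Math.ceil(itemCount / limit)` files, and it relies on the WHATWG `URL` parser for path resolution where the diff calls `normalize` from `node:path`.

```typescript
// Hypothetical helper: lists the public URLs that one chunk would occupy,
// following the `${filenameBase}-${chunkName}-${i}.xml` pattern from the diff.
function chunkFileUrls(opts: {
  filenameBase: string;
  chunkName: string;
  itemCount: number;
  limit: number; // max entries per sitemap file (entryLimit)
  hostname: string;
  publicBasePath: string;
}): string[] {
  const { filenameBase, chunkName, itemCount, limit, hostname, publicBasePath } = opts;

  // Same normalization as in writeSitemapChunk: ensure a trailing slash.
  const base = publicBasePath.endsWith('/') ? publicBasePath : `${publicBasePath}/`;

  // Assumption: one file per `limit` entries, indexed from 0.
  const fileCount = Math.ceil(itemCount / limit);

  const urls: string[] = [];
  for (let i = 0; i < fileCount; i++) {
    const file = `${filenameBase}-${chunkName}-${i}.xml`;
    // The URL constructor resolves the path against the site hostname.
    urls.push(new URL(base + file, hostname).toString());
  }
  return urls;
}
```

For example, a `blog` chunk of three entries with a limit of two and a base path of `/docs` would map to `sitemap-blog-0.xml` and `sitemap-blog-1.xml` under `/docs/`, which are the kind of entries the function in the diff collects for the sitemap index.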

packages/integrations/sitemap/src/write-sitemap.ts

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@ export async function writeSitemap(
         sitemapAndIndexStream,
         { url, lastmod },
         'utf8',
-        () => {},
+        () => { },
       );
     }
     return promisify(pipeline)(src, sitemapAndIndexStream, createWriteStream(indexPath));
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
import assert from 'node:assert/strict';
import { before, describe, it } from 'node:test';
import { sitemap } from './fixtures/static/deps.mjs';
import { loadFixture, readXML } from './test-utils.js';

describe('Sitemap with chunked files', () => {
  /** @type {import('./test-utils.js').Fixture} */
  let fixture;
  /** @type {string[]} */
  let blogUrls;
  let glossaryUrls;
  let pagesUrls;

  before(async () => {
    fixture = await loadFixture({
      root: './fixtures/chunks/',
      integrations: [
        sitemap({
          serialize(item) {
            return item
          },
          chunks: {
            'blog': (item) => {
              if (item.url.includes('blog')) {
                item.changefreq = 'weekly';
                item.lastmod = new Date();
                item.priority = 0.9;
                return item;
              }
            },
            'glossary': (item) => {
              if (item.url.includes('glossary')) {
                item.changefreq = 'weekly';
                item.lastmod = new Date();
                item.priority = 0.9;
                return item;
              }
            }
          },
        }),
      ],
    });
    await fixture.build();
    const flatMapUrls = async (file) => {
      const data = await readXML(fixture.readFile(file))
      return data.urlset.url.map((url) => url.loc[0])
    };
    blogUrls = await flatMapUrls('sitemap-blog-0.xml');
    glossaryUrls = await flatMapUrls('sitemap-glossary-0.xml')
    pagesUrls = await flatMapUrls('sitemap-pages-0.xml')
  });

  it('includes defined custom pages', async () => {
    assert.equal(blogUrls.includes('http://example.com/blog/one/'), true);
    assert.equal(blogUrls.includes('http://example.com/blog/two/'), true);
    assert.equal(glossaryUrls.includes('http://example.com/glossary/one/'), true);
    assert.equal(glossaryUrls.includes('http://example.com/glossary/two/'), true);
    assert.equal(pagesUrls.includes('http://example.com/one/'), true);
    assert.equal(pagesUrls.includes('http://example.com/two/'), true);
  });
});
