Commit 52acd0a

Sylvain Pace authored
Doc/update content (#379)
* # This is a combination of 2 commits.
  # This is the 1st commit message:
  # This is a combination of 3 commits.
  # This is the 1st commit message:
  # This is a combination of 3 commits.
  # This is the 1st commit message:
  chore(deps): update dependency onchange to v4.1.0 integrate previous work enhance tyle/content reformat part 1 wait for review adance start_urls enhance attributes description fix typo proofread documentation/docsearch add apiKey mention intefrate review and small fixes finished proofreading update sclient-rendering use unseen review
  # This is the commit message #2:
  update README #269
  # This is the commit message #3:
  fix json
  # This is the commit message #2:
  enhance as algolia/docsearch-configs#387
  # This is the commit message #3:
  updating flavicon
  # This is the commit message #2:
  Update 1-customize-configuration-file.html.md.erb
* documenting algolia/docsearch-scraper#387
1 parent b510acb commit 52acd0a

2 files changed, +21 -2 lines changed

docs/source/documentation/1-docsearch/3-recommendations.html.md.erb

Lines changed: 1 addition & 1 deletion
@@ -91,4 +91,4 @@ For those reasons we highly recommend that you use a [**Sitemap**](https://www.s
 
 This lists every page of your web site and will be used as the **main source of truth**
 and it will define the roadmap of our scraping.
-Beside this exhaustivity, using a sitemap introduces a significant performance improvement for our scraper.
+Beside this exhaustivity, using a sitemap introduces a significant performance improvement for our scraper.

docs/source/documentation/2-docsearch-scraper/2-config-options.html.md.erb

Lines changed: 20 additions & 1 deletion
@@ -64,11 +64,11 @@ Name of the Algolia index where all the data will be pushed.
 **On our own infrastructure, this name must be equal to the configuration file name**
 
 We mostly attribute it on our own regarding plenty of underlying factors. The `apiKey` that we provide is generated with a restriction on the `index_name`. Changing the `index_name` would require to ask for a new key. Thus if you want to **change the name**, please **submit a new configuration**, we will generate a new key accordingly.
-
 ### `start_urls` _Mandatory_
 You can pass either a string or an array of urls. The crawler will go to each
 page in order, following every link it finds on the page. It will only stop if
 the domain is outside of the `allowed_domains` or if the link is blacklisted in
+
 `stop_urls`.
 
 Note that it currently does not follow *301* redirects.
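
Review note: the stopping rule described for `start_urls` in this hunk can be illustrated with a minimal Python sketch. It is not the scraper's code. The function name `should_follow` is hypothetical, `stop_urls` entries are assumed to be regular expressions, and domain matching is simplified to an exact hostname comparison.

```python
import re
from urllib.parse import urlparse


def should_follow(url, allowed_domains, stop_urls):
    """Return True if a crawler applying the documented rule would visit `url`.

    Hypothetical helper: only models the two stop conditions named in the docs.
    """
    # Stop when the link's host is outside the allowed domains.
    host = urlparse(url).hostname or ""
    if host not in allowed_domains:
        return False
    # Stop when the link matches any blacklisted pattern
    # (assumption: each `stop_urls` entry is treated as a regex).
    return not any(re.search(pattern, url) for pattern in stop_urls)


allowed = ["example.com"]
stops = [r"/changelog/"]
print(should_follow("https://example.com/docs/intro", allowed, stops))
print(should_follow("https://example.com/changelog/v1", allowed, stops))
print(should_follow("https://other.org/docs", allowed, stops))
```

Under these assumptions only the first URL is followed: the second matches a `stop_urls` pattern and the third is outside `allowed_domains`.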
@@ -290,12 +290,31 @@ Specifies if the matched URLs should not respect the same rules as the crawled h
 ```
 Given this configuration, every webpage of the sitemap whose URL contains '/doc/' will be scraped even if they don't comply with `start_urls` or `stop_urls`.
 
+### `sitemap_alternate_links` _Optional_
+
+This parameter is only useful when you are using a sitemap to crawl your website.
+
+It specifies if alternate links should be followed. Your sitemap should include localized versions of your page in such format:
+
+```
+<url>
+<loc>http://example.com/</loc>
+<xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
+</url>
+```
+
+If `sitemap_alternate_links` is not set, the link "http://example.com/de" will not be parsed from the sitemap.
+
+Default is `false`
+
 ### `allowed_domains` _Optional_
 
 You can pass an array of strings. This is the whitelist of
 domains the crawler will browse. If a link targets a page that is not in the
 whitelist, the crawler will not follow it.
 
+### Sitemap crawling _Optional_
+
 Default is the domain of the first elements in the `start_urls`.
 
 ### `min_indexed_level` _Optional_
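
Review note: the effect of the new `sitemap_alternate_links` option can be sketched in a few lines of Python. This is an illustration of the documented behaviour, not the scraper's implementation. The function `sitemap_urls` is hypothetical, and the sample sitemap assumes the standard `sitemaps.org` and XHTML namespace declarations, which the snippet in the diff omits.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used for lookups (assumed standard sitemap namespaces).
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Same entry as in the documentation, wrapped in a complete <urlset>.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
  </url>
</urlset>
"""


def sitemap_urls(xml_text, sitemap_alternate_links=False):
    """Collect URLs from a sitemap, optionally including alternate links."""
    root = ET.fromstring(xml_text)
    urls = []
    for url in root.findall("sm:url", NS):
        # The canonical location is always collected.
        loc = url.find("sm:loc", NS)
        if loc is not None and loc.text:
            urls.append(loc.text.strip())
        # Localized alternates are collected only when the option is enabled.
        if sitemap_alternate_links:
            for link in url.findall("xhtml:link", NS):
                if link.get("rel") == "alternate" and link.get("href"):
                    urls.append(link.get("href"))
    return urls


print(sitemap_urls(SITEMAP))
print(sitemap_urls(SITEMAP, sitemap_alternate_links=True))
```

With the default (`False`) only `http://example.com/` is returned. Enabling the option also returns `http://example.com/de`, matching the behaviour described in the added documentation.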
