Commit 4d58b3c

Merge pull request #13 from grove-platform/add-get-docs-markdown
Add a new tool to get the markdown versions of docs pages
2 parents 1a43308 + a298a8f commit 4d58b3c

13 files changed

Lines changed: 827 additions & 0 deletions

README.md

Lines changed: 2 additions & 0 deletions
@@ -10,7 +10,9 @@ uses perform various tasks related to the Grove platform.
 database in Atlas.
 - `dodec`, or the Database of Devoured Example Code: a query tool that lets us find code examples and related
 metadata in the database for reporting or to perform manual updates.
+- `create-url-list`: A Go CLI tool that extracts and ranks URLs by pageviews from CSV data containing page analytics.
 - `dependency-manager`: A Go CLI project to help us manage dependencies for multiple ecosystems in the docs monorepo
+- `get-docs-markdown`: A Go CLI tool that downloads the markdown versions of the documentation pages listed in an input CSV file.
 - `github-metrics`: a Node.js script that gets engagement metrics from GitHub for specified repos and writes them
 to a database in Atlas.
 - `query-docs-feedback`: a Go project with type definitions that queries the MongoDB

get-docs-markdown/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
get-docs-markdown

get-docs-markdown/README.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
# get-docs-markdown

A Go utility that downloads the markdown versions of the MongoDB documentation pages listed in a CSV file.

## Overview

This tool reads a CSV file containing MongoDB documentation URLs (typically output from `create-url-list`) and downloads the markdown version of each page.

## Usage

```bash
./get-docs-markdown -csv <path-to-csv> -output <output-directory> [options]
```

### Flags

- `-csv`: (Required) Path to the CSV file containing URLs
- `-output`: (Optional) Output directory for markdown files (default: `markdown-output`)
- `-workers`: (Optional) Number of concurrent download workers (default: `10`)
- `-rate-limit`: (Optional) Maximum requests per second (default: `5.0`, use `0` for unlimited)

### Examples

```bash
# Build the tool
go build

# Download markdown files with default settings (10 workers, 5 req/sec)
./get-docs-markdown -csv /path/to/top-250-dec-2025.csv -output ./markdown-files

# Use more workers and a higher rate limit for faster downloads
./get-docs-markdown -csv /path/to/top-250-dec-2025.csv -output ./markdown-files -workers 20 -rate-limit 10

# Conservative settings to reduce server load
./get-docs-markdown -csv /path/to/top-250-dec-2025.csv -output ./markdown-files -workers 5 -rate-limit 2
```

## CSV Format

The tool expects a CSV file with the following format:

```
Rank,Page,Number of Page Views
1,www.mongodb.com/docs/manual/administration/install-community/,55197
2,www.mongodb.com/docs/get-started/,45669
```

Input CSVs may include or omit the header row. If a header row is present, the tool skips it.

The tool reads the URL from the second column (index 1).
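
The CSV handling described above can be sketched with the standard `encoding/csv` package. This is a minimal illustration, not the tool's actual code: the names `readURLs` and `readURLsFromString` are hypothetical, and the header check (a first row whose `Rank` column is not an integer) is an assumed heuristic, since the README does not say how the header is detected.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// readURLs returns the second column (index 1) of every data row in r.
// Assumption: a first row whose first column is not an integer is a header.
func readURLs(r io.Reader) ([]string, error) {
	reader := csv.NewReader(r)
	reader.FieldsPerRecord = -1 // tolerate rows with differing column counts
	rows, err := reader.ReadAll()
	if err != nil {
		return nil, err
	}
	var urls []string
	for i, row := range rows {
		if len(row) < 2 {
			continue // no URL column in this row
		}
		if i == 0 {
			if _, err := strconv.Atoi(strings.TrimSpace(row[0])); err != nil {
				continue // looks like a header row, skip it
			}
		}
		if u := strings.TrimSpace(row[1]); u != "" {
			urls = append(urls, u)
		}
	}
	return urls, nil
}

// readURLsFromString is a convenience wrapper for in-memory CSV text.
func readURLsFromString(csvText string) ([]string, error) {
	return readURLs(strings.NewReader(csvText))
}

func main() {
	input := "Rank,Page,Number of Page Views\n" +
		"1,www.mongodb.com/docs/manual/administration/install-community/,55197\n" +
		"2,www.mongodb.com/docs/get-started/,45669\n"
	urls, err := readURLsFromString(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

Reading from an `io.Reader` keeps the parsing independent of where the CSV comes from, so the same function works for a file opened with `os.Open` and for test strings.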

## How It Works

1. **CSV Reading**: Reads all URLs from the CSV file into memory

2. **Concurrent Processing**: Spawns multiple worker goroutines (default: 10) to download files in parallel

3. **Rate Limiting**: Uses a token bucket algorithm to limit requests per second (default: 5 req/sec)
   - Prevents overwhelming the server
   - Ensures respectful crawling behavior

4. **URL Processing**: For each URL:
   - Removes trailing slashes
   - Removes query parameters and fragment identifiers (anchors)
   - Adds the `.md` extension to get the markdown version
   - Sends a User-Agent header with the request to avoid 503 errors

5. **Slug Extraction**: Extracts the page slug from the URL (everything after `/docs/`, plus any language prefix that precedes it)
   - Includes language and version prefixes to ensure uniqueness
   - Examples:
     - `www.mongodb.com/docs/manual/installation/` → `manual/installation`
     - `www.mongodb.com/zh-cn/docs/manual/installation/` → `zh-cn/manual/installation`
     - `www.mongodb.com/docs/v7.0/manual/installation/` → `v7.0/manual/installation`

6. **File Naming**: Saves files as `<output-dir>/<page-slug>.md`
   - Preserves the directory structure from the URL path, including language/version prefixes
   - Skips the download if the file already exists
   - Examples: `manual/installation.md`, `zh-cn/manual/installation.md`, `v7.0/manual/installation.md`

7. **Download**: Downloads the markdown content and saves it to the output directory
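
The URL processing, slug extraction, and file naming steps above can be sketched as a few small string functions. This is one possible reading of the steps, not the tool's actual implementation: the function names (`cleanURL`, `markdownURL`, `pageSlug`, `outputPath`) are hypothetical, and prepending `https://` is an assumption, since the sample CSV rows carry no scheme.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// cleanURL drops the scheme, query string, fragment, and trailing slashes.
func cleanURL(raw string) string {
	u := strings.TrimSpace(raw)
	u = strings.TrimPrefix(u, "https://")
	u = strings.TrimPrefix(u, "http://")
	if i := strings.IndexAny(u, "?#"); i >= 0 {
		u = u[:i]
	}
	return strings.TrimRight(u, "/")
}

// markdownURL builds the URL of the markdown version of a page.
// Assumption: the CSV omits the scheme, so https:// is prepended here.
func markdownURL(raw string) string {
	return "https://" + cleanURL(raw) + ".md"
}

// pageSlug returns the path after "/docs/", prefixed by any path segment
// (such as a language code) that appears before "/docs/".
// The boolean is false when the URL has no usable docs path.
func pageSlug(raw string) (string, bool) {
	u := cleanURL(raw)
	hostEnd := strings.Index(u, "/")
	if hostEnd < 0 {
		return "", false
	}
	path := u[hostEnd:] + "/" // e.g. "/zh-cn/docs/manual/installation/"
	i := strings.Index(path, "/docs/")
	if i < 0 {
		return "", false
	}
	prefix := strings.Trim(path[:i], "/")
	rest := strings.Trim(path[i+len("/docs/"):], "/")
	if rest == "" {
		return "", false
	}
	if prefix != "" {
		return prefix + "/" + rest, true
	}
	return rest, true
}

// outputPath maps a slug to <output-dir>/<page-slug>.md.
func outputPath(outDir, slug string) string {
	return filepath.Join(outDir, filepath.FromSlash(slug)+".md")
}

func main() {
	for _, raw := range []string{
		"www.mongodb.com/docs/manual/installation/",
		"www.mongodb.com/zh-cn/docs/manual/installation/",
		"www.mongodb.com/docs/v7.0/manual/installation/",
	} {
		slug, ok := pageSlug(raw)
		if !ok {
			fmt.Println("skipping", raw)
			continue
		}
		fmt.Println(markdownURL(raw), "->", outputPath("markdown-output", slug))
	}
}
```

A real run would also check whether the output path already exists (for example with `os.Stat`) before downloading, to match the skip behavior described in step 6.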

## Output

The tool creates a directory structure matching the URL paths, including language and version prefixes:

```
markdown-output/
├── manual/
│   ├── administration/
│   │   └── install-community.md
│   └── reference/
│       └── connection-string.md
├── zh-cn/
│   └── manual/
│       └── administration/
│           └── install-community.md
├── v7.0/
│   └── administration/
│       └── install-community.md
├── get-started.md
├── mongodb-shell/
│   └── install.md
└── compass/
    └── install.md
```

This ensures that different language versions and versioned documentation are saved separately without conflicts.

## Error Handling

- If a URL cannot be downloaded (404, network error, etc.), the tool logs the error and continues with the next URL
- At the end, it reports the number of successful downloads and errors
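
The log-and-continue behavior, combined with the worker pool and rate limit described earlier, can be sketched as below. This is an illustration under stated assumptions, not the tool's code. The `go.mod` entry for `golang.org/x/time` suggests the tool may use that module's `rate` package for its token bucket, but that is an inference. To stay within the standard library, this sketch swaps in a simpler `time.Ticker` that releases one job per interval, which caps the request rate but does not allow bursts the way a token bucket does. The names `processAll`, `summary`, and `fakeFetch` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
	"time"
)

// summary holds the counts reported at the end of a run.
type summary struct{ Succeeded, Failed int }

// processAll calls fetch for every URL using a pool of workers.
// At most ratePerSec jobs are dispatched per second (0 means unlimited).
// A failed fetch is logged and counted; processing continues.
func processAll(urls []string, workers int, ratePerSec float64, fetch func(string) error) summary {
	if workers < 1 {
		workers = 1
	}
	var tick <-chan time.Time
	if ratePerSec > 0 {
		if interval := time.Duration(float64(time.Second) / ratePerSec); interval > 0 {
			t := time.NewTicker(interval)
			defer t.Stop()
			tick = t.C
		}
	}
	jobs := make(chan string)
	var (
		mu sync.Mutex
		s  summary
		wg sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				err := fetch(u)
				mu.Lock()
				if err != nil {
					fmt.Printf("error: %s: %v\n", u, err)
					s.Failed++
				} else {
					s.Succeeded++
				}
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		if tick != nil {
			<-tick // wait for the next dispatch slot
		}
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return s
}

// fakeFetch stands in for an HTTP download; it fails for URLs containing "missing".
func fakeFetch(u string) error {
	if strings.Contains(u, "missing") {
		return errors.New("404 Not Found")
	}
	return nil
}

func main() {
	urls := []string{"docs/a", "docs/missing-page", "docs/b"}
	s := processAll(urls, 2, 50, fakeFetch)
	fmt.Printf("done: %d succeeded, %d failed\n", s.Succeeded, s.Failed)
}
```

Limiting the rate at the dispatcher rather than inside each worker keeps the cap global: adding workers increases parallelism for slow responses without raising the number of requests started per second.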

## Performance

With default settings (10 workers, 5 req/sec):
- **250 URLs**: ~50 seconds
- **500 URLs**: ~100 seconds (1.7 minutes)
- **1000 URLs**: ~200 seconds (3.3 minutes)

You can adjust `-workers` and `-rate-limit` to balance speed vs. server load. Higher values will download faster but may risk rate limiting or server errors.
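
The figures above match the URL count divided by the default rate limit (for example, 250 / 5 = 50 seconds), which holds as long as the rate limit, not the workers, is the bottleneck. A rough estimator along those lines is sketched below. The function `estimateSeconds` and its average-latency parameter are hypothetical additions for illustration, and the result is a lower bound that ignores startup time and retries.

```go
package main

import (
	"fmt"
	"math"
)

// estimateSeconds returns a rough lower bound on total run time.
// It takes the larger of two limits:
//   - rate limit:  urlCount / ratePerSec (skipped when ratePerSec is 0, meaning unlimited)
//   - worker pool: urlCount * avgLatencySec / workers
// avgLatencySec is an assumed average time per request, in seconds.
func estimateSeconds(urlCount, workers int, ratePerSec, avgLatencySec float64) float64 {
	if workers < 1 {
		workers = 1
	}
	workerBound := float64(urlCount) * avgLatencySec / float64(workers)
	if ratePerSec <= 0 {
		return workerBound
	}
	rateBound := float64(urlCount) / ratePerSec
	return math.Max(rateBound, workerBound)
}

func main() {
	// Default settings from the README, with an assumed 0.5 s per request.
	for _, n := range []int{250, 500, 1000} {
		fmt.Printf("%d URLs: ~%.0f seconds\n", n, estimateSeconds(n, 10, 5, 0.5))
	}
}
```

With these inputs the rate limit dominates, so raising `-workers` alone would not shorten the run; only a higher `-rate-limit` would.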

get-docs-markdown/go.mod

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
module get-docs-markdown

go 1.25.4

require golang.org/x/time v0.14.0 // indirect

get-docs-markdown/go.sum

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
golang.org/x/time v0.14.0 h1:MRx4UaLrDotUKUdCIqzPC48t1Y9hANFKIRpNx+Te8PI=
golang.org/x/time v0.14.0/go.mod h1:eL/Oa2bBBK0TkX57Fyni+NgnyQQN4LitPmob2Hjnqw4=
