| 1 | +# get-docs-markdown |
| 2 | + |
| 3 | +A Go utility that downloads the markdown version of each MongoDB documentation page listed in a CSV file. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +This tool reads a CSV file containing MongoDB documentation URLs (typically output from `create-url-list`) and downloads the markdown version of each page. |
| 8 | + |
| 9 | +## Usage |
| 10 | + |
| 11 | +```bash |
| 12 | +./get-docs-markdown -csv <path-to-csv> -output <output-directory> [options] |
| 13 | +``` |
| 14 | + |
| 15 | +### Flags |
| 16 | + |
| 17 | +- `-csv`: (Required) Path to the CSV file containing URLs |
| 18 | +- `-output`: (Optional) Output directory for markdown files (default: `markdown-output`) |
| 19 | +- `-workers`: (Optional) Number of concurrent download workers (default: `10`) |
| 20 | +- `-rate-limit`: (Optional) Maximum requests per second (default: `5.0`, use `0` for unlimited) |
| 21 | + |
| 22 | +### Examples |
| 23 | + |
| 24 | +```bash |
| 25 | +# Build the tool |
| 26 | +go build |
| 27 | + |
| 28 | +# Download markdown files with default settings (10 workers, 5 req/sec) |
| 29 | +./get-docs-markdown -csv /path/to/top-250-dec-2025.csv -output ./markdown-files |
| 30 | + |
| 31 | +# Use more workers and higher rate limit for faster downloads |
| 32 | +./get-docs-markdown -csv /path/to/top-250-dec-2025.csv -output ./markdown-files -workers 20 -rate-limit 10 |
| 33 | + |
| 34 | +# Conservative settings to avoid server load |
| 35 | +./get-docs-markdown -csv /path/to/top-250-dec-2025.csv -output ./markdown-files -workers 5 -rate-limit 2 |
| 36 | +``` |
| 37 | + |
| 38 | +## CSV Format |
| 39 | + |
| 40 | +The tool expects a CSV file with the following format: |
| 41 | + |
| 42 | +``` |
| 43 | +Rank,Page,Number of Page Views |
| 44 | +1,www.mongodb.com/docs/manual/administration/install-community/,55197 |
| 45 | +2,www.mongodb.com/docs/get-started/,45669 |
| 46 | +``` |
| 47 | + |
| 48 | +Input CSVs may include or omit the header row. If a header row is present, the tool will skip it. |
| 49 | + |
| 50 | +The tool reads the URL from the second column (index 1). |
| 51 | + |
| 52 | +## How It Works |
| 53 | + |
| 54 | +1. **CSV Reading**: Reads all URLs from the CSV file into memory |
| 55 | + |
| 56 | +2. **Concurrent Processing**: Spawns multiple worker goroutines (default: 10) to download files in parallel |
| 57 | + |
| 58 | +3. **Rate Limiting**: Uses a token bucket algorithm to limit requests per second (default: 5 req/sec) |
| 59 | +   - Prevents overwhelming the server |
| 60 | +   - Ensures respectful crawling behavior |
| 61 | + |
| 62 | +4. **URL Processing**: For each URL: |
| 63 | +   - Removes trailing slashes |
| 64 | +   - Removes query parameters and URL fragments (anchors) |
| 65 | +   - Adds the `.md` extension to request the markdown version |
| 66 | +   - Sends the request with a `User-Agent` header to avoid 503 errors |
| 67 | + |
| 68 | +5. **Slug Extraction**: Extracts the page slug from the URL path by dropping the host and the `docs/` segment |
| 69 | +   - Keeps language and version prefixes to ensure uniqueness |
| 70 | +   - Examples: |
| 71 | +     - `www.mongodb.com/docs/manual/installation/` → `manual/installation` |
| 72 | +     - `www.mongodb.com/zh-cn/docs/manual/installation/` → `zh-cn/manual/installation` |
| 73 | +     - `www.mongodb.com/docs/v7.0/manual/installation/` → `v7.0/manual/installation` |
| 74 | + |
| 75 | +6. **File Naming**: Saves files as `<output-dir>/<page-slug>.md` |
| 76 | +   - Preserves directory structure from the URL path including language/version prefixes |
| 77 | +   - Skips download if file already exists |
| 78 | +   - Examples: `manual/installation.md`, `zh-cn/manual/installation.md`, `v7.0/manual/installation.md` |
| 79 | + |
| 80 | +7. **Download**: Downloads the markdown content and saves it to the output directory |
| 81 | + |
| 82 | +## Output |
| 83 | + |
| 84 | +The tool creates a directory structure matching the URL paths, including language and version prefixes: |
| 85 | + |
| 86 | +``` |
| 87 | +markdown-output/ |
| 88 | +├── manual/ |
| 89 | +│   ├── administration/ |
| 90 | +│   │   └── install-community.md |
| 91 | +│   └── reference/ |
| 92 | +│       └── connection-string.md |
| 93 | +├── zh-cn/ |
| 94 | +│   └── manual/ |
| 95 | +│       └── administration/ |
| 96 | +│           └── install-community.md |
| 97 | +├── v7.0/ |
| 98 | +│   └── administration/ |
| 99 | +│       └── install-community.md |
| 100 | +├── get-started.md |
| 101 | +├── mongodb-shell/ |
| 102 | +│   └── install.md |
| 103 | +└── compass/ |
| 104 | +    └── install.md |
| 105 | +``` |
| 106 | + |
| 107 | +This ensures that different language versions and versioned documentation are saved separately without conflicts. |
| 108 | + |
| 109 | +## Error Handling |
| 110 | + |
| 111 | +- If a URL cannot be downloaded (404, network error, etc.), the tool logs the error and continues with the next URL |
| 112 | +- At the end, it reports the number of successful downloads and errors |
| 113 | + |
| 114 | +## Performance |
| 115 | + |
| 116 | +With default settings (10 workers, 5 req/sec), run time is bounded by the rate limit, roughly the number of URLs divided by 5: |
| 117 | +- **250 URLs**: ~50 seconds |
| 118 | +- **500 URLs**: ~100 seconds (1.7 minutes) |
| 119 | +- **1000 URLs**: ~200 seconds (3.3 minutes) |
| 120 | + |
| 121 | +You can adjust `-workers` and `-rate-limit` to balance speed against server load. Higher values download faster but risk triggering server-side rate limiting or errors. |
| 122 | + |