pdf2md

Convert academic PDF papers to clean, readable markdown with linked citations, embedded figures, and structured metadata for RAG systems.

Quick Start — install and convert a paper
Depth Levels — control how much processing is applied
Direct CLI Usage — convert PDFs locally
Service Mode — Docker microservice for remote/homelab use
Claude Code Integration — MCP server + /convert-paper command
Processing Pipeline — what happens at each stage
Local AI Setup — run with LM Studio or Ollama
Installation — extras and requirements
Batch Processing — convert many papers at once

Quick Start

# Install
uv tool install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

# Convert a paper — uses medium depth by default (Docling + postprocess + LLM retouch)
pdf2md convert paper.pdf

# Output goes to ./paper/paper.md (a paper/ folder created next to the PDF)
# Or specify an output directory explicitly:
pdf2md convert paper.pdf ./output

Depth Levels

pdf2md uses a depth-based system to control how much processing is applied. The default is medium.

Depth	Default?	What happens	AI required?
`low`		Docling extraction + rule-based postprocessing (citations, figures, sections, cleanup)	No
`medium`	yes	Everything in `low` + LLM retouch via Claude Agent SDK (author formatting, lettered section headers, figure relocation, paragraph merging)	Yes (Claude API or `--local`)
`high`		Everything in `medium` + VLM figure descriptions + code/equation enrichments	Yes (Claude API or `--local`)

# Fast, no AI needed
pdf2md convert paper.pdf -d low

# Default — includes agentic LLM cleanup (Claude)
pdf2md convert paper.pdf

# Full pipeline — adds VLM figure descriptions and RAG metadata
pdf2md convert paper.pdf -d high

# Any depth with a local LLM instead of Claude
pdf2md convert paper.pdf --local
pdf2md convert paper.pdf -d high --local
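The depth levels are cumulative, which a small Python sketch can make concrete. The stage names below are illustrative labels taken from the table, not identifiers from the pdf2md codebase.

```python
# Illustrative only: stage names are labels from the depth table,
# not actual pdf2md identifiers.
DEPTH_ORDER = ["low", "medium", "high"]

# Stages that each depth adds on top of the previous one.
STAGES_ADDED = {
    "low": ["docling_extraction", "rule_based_postprocess"],
    "medium": ["llm_retouch"],
    "high": ["vlm_figure_descriptions", "code_equation_enrichments"],
}


def stages_for(depth: str = "medium") -> list:
    """Return every stage that runs at the given depth (cumulative)."""
    if depth not in DEPTH_ORDER:
        raise ValueError(f"unknown depth: {depth!r}")
    stages = []
    for level in DEPTH_ORDER[: DEPTH_ORDER.index(depth) + 1]:
        stages.extend(STAGES_ADDED[level])
    return stages


def needs_ai(depth: str = "medium") -> bool:
    """Only `low` runs without an LLM (cloud or --local)."""
    return "llm_retouch" in stages_for(depth)
```

For example, `stages_for("high")` lists all five stages, while `needs_ai("low")` is false.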

Direct CLI Usage

`pdf2md convert` — Main Conversion

pdf2md convert paper.pdf [output_dir] [OPTIONS]

If output_dir is omitted, output goes to a folder named after the PDF, created in the same directory as the PDF (for example, ./paper/paper.md for paper.pdf).

Option	Description
`-d, --depth`	Analysis depth: `low`, `medium` (default), `high`
`-l, --local`	Use local LLM/VLM instead of cloud (Claude)
`-p, --provider`	LLM provider: `lm_studio` (default), `ollama`
`-m, --model`	Override LLM/VLM model name
`--keep-raw`	Save raw Docling extraction alongside processed output
`--raw`	Skip all processing, output only raw extraction
`--images-scale N`	Image resolution multiplier (default: 2.0)
`--min-image-width`	Minimum image width in pixels; filters out small images such as logos (default: 200)
`--min-image-height`	Minimum image height in pixels (default: 150)
`--min-image-area`	Minimum image area in square pixels (default: 40000)
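The three `--min-image-*` options can be read as a size filter. How pdf2md combines them is not stated here, so the sketch below assumes an image is kept only when it passes all three thresholds. The function name `keep_image` is hypothetical. Whether the thresholds apply before or after `--images-scale` is also not specified.

```python
def keep_image(
    width: int,
    height: int,
    *,
    min_width: int = 200,    # default of --min-image-width
    min_height: int = 150,   # default of --min-image-height
    min_area: int = 40000,   # default of --min-image-area
) -> bool:
    """Assumed rule: keep an image only if it meets all three thresholds."""
    return (
        width >= min_width
        and height >= min_height
        and width * height >= min_area
    )
```

Under this assumption a 200x150 image is dropped even though it meets the width and height minimums, because its area (30000) is below the default area threshold.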

Output:

output/paper/
├── paper.md              # Final processed markdown
├── paper_raw.md          # Raw Docling output (if --keep-raw)
├── img/
│   ├── figure1.png
│   ├── figure2.png
│   └── ...
├── enrichments.json      # All metadata (depth=high only)
├── figures.json          # Figure metadata
├── equations.json        # Equations with LaTeX
└── code_blocks.json      # Code with language detection

`pdf2md retouch` — LLM Cleanup Only

Run LLM-based cleanup on an existing markdown file:

uv run pdf2md retouch paper.md [OPTIONS]

Option	Description
`-l, --local`	Use local LLM instead of cloud (Claude)
`-p, --provider`	LLM provider: `lm_studio`, `ollama`
`-m, --model`	Override LLM model name
`-i, --images`	Path to images directory (default: `./img`)
`-v, --verbose`	Show detailed LLM progress

The retouch step fixes:

Author formatting — Extracts and formats author names, affiliations, emails
Lettered section headers — Classifies `A. Background` as a header vs `A. We conducted...` as a sentence

`pdf2md postprocess` — Rule-Based Fixes Only

uv run pdf2md postprocess paper.md [OPTIONS]

Option	Description
`-i, --images`	Path to images directory (default: `./img`)
`-o, --output`	Output path (default: overwrite input file)

`pdf2md enrich` — Extract RAG Metadata

uv run pdf2md enrich paper.pdf ./output [OPTIONS]

Option	Description
`--describe`	Generate VLM descriptions for figures
`-l, --local`	Use local VLM instead of cloud
`-p, --provider`	VLM provider: `lm_studio`, `ollama`
`-m, --model`	Override VLM model
`--images-scale N`	Image resolution multiplier (default: 2.0)

Service Mode

Run pdf2md as a Docker microservice for remote or homelab use. The service provides an HTTP API with Ed25519 signature authentication and async job processing via Redis/arq.

Docker Deployment

# Start all services (API, worker, PostgreSQL, Redis)
docker compose up -d --build

# Run database migrations
docker compose exec api alembic upgrade head

# Check logs
docker compose logs -f worker

API Endpoints

All endpoints require Ed25519 signature authentication (see Auth Setup).

Method	Endpoint	Description
`POST`	`/submit_paper`	Upload a PDF and enqueue conversion. Returns `job_id`.
`GET`	`/status/{job_id}`	Check job status, progress, and errors.
`GET`	`/retrieve/{job_id}`	Download completed results as `tar.gz`.

Submit example:

curl -X POST http://your-server:8000/submit_paper \
  -F "file=@paper.pdf" \
  -F "depth=medium" \
  -H "Authorization: Signature <base64-sig>" \
  -H "X-Timestamp: $(date +%s)" \
  -H "X-Client-Id: <your-uuid>"
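After submitting, a client typically polls `/status/{job_id}` until the job finishes. The sketch below shows one way to structure that loop in Python. The response schema is not documented here, so the `"status"` field name and the `"completed"` and `"failed"` values are assumptions. The HTTP call itself is left to a caller-supplied `fetch_status` function (hypothetical) so that signing and transport stay separate.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    *,
    done_states: frozenset = frozenset({"completed"}),  # assumed value
    failed_states: frozenset = frozenset({"failed"}),   # assumed value
    interval: float = 5.0,
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(job_id)
        state = payload.get("status")  # assumed field name
        if state in done_states:
            return payload
        if state in failed_states:
            raise RuntimeError(f"job {job_id} failed: {payload}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout}s")
        sleep(interval)
```

Once `wait_for_job` returns, the results can be downloaded from `/retrieve/{job_id}`.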

Auth Setup

The service uses Ed25519 keypairs for authentication. Each client has a UUID and a public key stored in the database; requests are signed with the corresponding private key.

Signature format: the string METHOD\nPATH\nTIMESTAMP (HTTP method, request path, and timestamp joined by newline characters), signed with the client's Ed25519 private key.

Headers required:

Authorization: Signature <base64-signature>
X-Timestamp: <unix-epoch>
X-Client-Id: <client-uuid>

Timestamps must be within 5 minutes of server time (configurable via PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS).
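The signing scheme above can be sketched in Python using only the standard library for the parts that do not need a crypto backend. The helper names are hypothetical, and several details are assumptions: UTF-8 encoding, an upper-case method, an integer-seconds timestamp, and standard base64 for the signature.

```python
import base64
import time
from typing import Optional


def signing_message(method: str, path: str, timestamp: int) -> bytes:
    """Build the METHOD\\nPATH\\nTIMESTAMP byte string that gets signed."""
    # Assumption: upper-case method and UTF-8 encoding.
    return f"{method.upper()}\n{path}\n{timestamp}".encode("utf-8")


def auth_headers(signature: bytes, timestamp: int, client_id: str) -> dict:
    """Assemble the three required headers from a raw Ed25519 signature."""
    encoded = base64.b64encode(signature).decode("ascii")
    return {
        "Authorization": f"Signature {encoded}",
        "X-Timestamp": str(timestamp),
        "X-Client-Id": client_id,
    }


def is_fresh(timestamp: int, now: Optional[int] = None, tolerance: int = 300) -> bool:
    """Mirror the documented 5-minute (300 s) freshness window."""
    current = int(time.time()) if now is None else now
    return abs(current - timestamp) <= tolerance
```

The signature itself has to come from an Ed25519 library. With the `cryptography` package, for example, `Ed25519PrivateKey.sign(message)` returns the raw 64-byte signature that `auth_headers` expects.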

Service Environment Variables

Variable	Default	Description
`PDF2MD_SERVICE_DATABASE_URL`	`postgresql+asyncpg://...`	PostgreSQL connection string
`PDF2MD_SERVICE_REDIS_URL`	`redis://localhost:6379`	Redis connection string
`PDF2MD_SERVICE_DATA_DIR`	`/data`	Root data directory
`PDF2MD_SERVICE_UPLOAD_DIR`	`/data/uploads`	PDF upload storage
`PDF2MD_SERVICE_AUTH_TIMESTAMP_TOLERANCE_SECONDS`	`300`	Signature freshness window
`PDF2MD_SERVICE_WORKER_MAX_JOBS`	`1`	Concurrent conversion jobs

Claude Code Integration

MCP Server

The mcp/server.py script exposes the service API as MCP tools for Claude Code. It loads credentials from a .env file in the repo root.

Register the server:

claude mcp add --scope user pdf2md-service -- uv run /path/to/paper-to-md/mcp/server.py

Required .env variables (not committed — see .env.example):

PDF2MD_SERVICE_URL=http://your-server:8000
PDF2MD_CLIENT_ID=00000000-0000-0000-0000-000000000001
PDF2MD_PRIVATE_KEY=<base64-ed25519-private-key>

Tools provided:

Tool	Description
`pdf2md_submit`	Upload a PDF and start conversion. Returns job ID.
`pdf2md_status`	Poll job status and progress.
`pdf2md_retrieve`	Download and extract completed results.

`/convert-paper` Command

The project-level slash command defined in .claude/commands/convert-paper.md orchestrates the full conversion workflow.

/convert-paper path/to/paper.pdf

This submits the PDF, polls for completion, downloads the results, and reports the extracted files. Claude Code discovers the command automatically when you work in this repo.

Processing Pipeline

1. Docling Extraction

Uses Docling (ML-based) to extract:

Text with structure (headings, paragraphs, lists)
Tables with formatting
Figures as images
Equations

2. Deterministic Post-Processing

Applied at all depth levels (including low):

Citations:

[7] → [[7]](#ref-7) (clickable links)
[11]-[14] → expanded to four individual linked citations
Anchors added to reference entries for link targets
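The citation rewriting can be approximated with two regular expressions. This is a minimal Python sketch, not pdf2md's actual implementation. The comma separator used when expanding a range is an assumption, and the function name is hypothetical.

```python
import re

# "[11]-[14]" or "[11]–[14]" (hyphen or en dash between two bracketed numbers).
RANGE = re.compile(r"\[(\d+)\]\s*[-\u2013]\s*\[(\d+)\]")
# A bare "[7]" that is not already inside "[[7]](...)" and not a "[7](...)" link.
SINGLE = re.compile(r"(?<!\[)\[(\d+)\](?!\()")


def link_citations(text: str) -> str:
    """Expand citation ranges, then turn each [N] into [[N]](#ref-N)."""

    def expand(match: re.Match) -> str:
        low, high = int(match.group(1)), int(match.group(2))
        if high < low:
            return match.group(0)  # leave malformed ranges untouched
        # Assumption: expanded citations are separated by ", ".
        return ", ".join(f"[{n}]" for n in range(low, high + 1))

    text = RANGE.sub(expand, text)
    return SINGLE.sub(lambda m: f"[[{m.group(1)}]](#ref-{m.group(1)})", text)
```

The lookbehind and lookahead make the function safe to run twice: an already linked `[[7]](#ref-7)` is left alone.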

Sections:

Abstract -Text here → ## Abstract\n\nText here
Hierarchical section numbering → proper markdown headers
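Both section fixes can be sketched as regex substitutions in Python. The mapping from numbering depth to header level (`1` to `##`, `1.1` to `###`) is an assumption, as is the heuristic that a heading is a short line that starts with a capital letter and does not end in punctuation.

```python
import re

# "Abstract -Text" or "Abstract - Text" at the start of a line.
ABSTRACT = re.compile(r"^Abstract[ \t]*[-\u2013\u2014][ \t]*", re.MULTILINE)
# "2.1 Related Work": dotted number, then a short capitalized title
# that does not end in sentence punctuation (heuristic, assumed).
NUMBERED = re.compile(
    r"^(\d+(?:\.\d+)*)\.?[ \t]+([A-Z][^\n]{0,78}[^\n.,;:])$", re.MULTILINE
)


def fix_sections(text: str) -> str:
    """Promote the abstract label and numbered section lines to headers."""
    text = ABSTRACT.sub("## Abstract\n\n", text)

    def to_header(match: re.Match) -> str:
        # Assumed mapping: "1" -> ##, "1.1" -> ###, "1.1.1" -> ####.
        level = match.group(1).count(".") + 2
        return f"{'#' * level} {match.group(1)} {match.group(2)}"

    return NUMBERED.sub(to_header, text)
```

A heuristic like this will still misfire on short numbered list items, which is one reason lettered and ambiguous headers are deferred to the LLM retouch stage.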

Figures:

Embeds ![Figure N](./img/figureN.png) above line-start captions
Each figure embedded exactly once
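The figure rule can be sketched in Python as a single pass over the lines that remembers which figure numbers have already been embedded. The caption pattern (`Figure N` or `Fig. N` at line start) is an assumption; the image path follows the `./img/figureN.png` convention shown above.

```python
import re

# Assumed caption pattern: a line that starts with "Figure N" or "Fig. N".
CAPTION = re.compile(r"^(?:Figure|Fig\.)\s*(\d+)\b")


def embed_figures(markdown: str, img_dir: str = "./img") -> str:
    """Insert an image embed above the first caption line of each figure."""
    seen = set()
    out = []
    for line in markdown.split("\n"):
        match = CAPTION.match(line)
        if match and match.group(1) not in seen:
            number = match.group(1)
            seen.add(number)  # guarantees each figure is embedded once
            out.append(f"![Figure {number}]({img_dir}/figure{number}.png)")
            out.append("")
        out.append(line)
    return "\n".join(out)
```

A sentence that happens to begin with "Figure 1 shows" would also match this pattern, so a production version needs a stricter caption test.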

Bibliography:

Adds <a id="ref-N"></a> anchors to reference entries
Ensures proper spacing between entries
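Adding the anchors is a one-line substitution once reference entries can be recognized. This Python sketch assumes each entry starts with `[N]` at the beginning of a line and covers only the anchors, not the spacing fix.

```python
import re

# Assumption: every reference entry begins with "[N]" at line start.
REF_ENTRY = re.compile(r"^\[(\d+)\]", re.MULTILINE)


def add_reference_anchors(references: str) -> str:
    """Prefix each reference entry with an <a id="ref-N"></a> link target."""
    return REF_ENTRY.sub(
        lambda m: f'<a id="ref-{m.group(1)}"></a>[{m.group(1)}]', references
    )
```

Because an anchored line no longer starts with `[`, running the function a second time leaves the text unchanged.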

Cleanup:

Fixes ligatures (ﬁ→fi, ﬂ→fl)
Removes GLYPH artifacts from OCR
Fixes hyphenated word breaks across lines
Merges split paragraphs
Removes OCR garbage near figure embeds
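Two of the cleanup rules are simple enough to sketch in Python. The ligature table covers only the two ligatures listed above. The hyphenation rule is a deliberately conservative assumption: it joins only lowercase-to-lowercase breaks.

```python
import re

# The two ligatures named in the list; others would need their own entries.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}

# A lowercase letter, a hyphen, a line break, then another lowercase letter.
HYPHEN_BREAK = re.compile(r"([a-z])-\n([a-z])")


def fix_ligatures(text: str) -> str:
    """Replace typographic ligatures with their plain-letter equivalents."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def join_hyphenated_breaks(text: str) -> str:
    """Rejoin words that were split with a hyphen at a line end."""
    return HYPHEN_BREAK.sub(r"\1\2", text)


def clean(text: str) -> str:
    return join_hyphenated_breaks(fix_ligatures(text))
```

This rule cannot tell a line-break hyphen from a real one, so a compound such as "well-known" split across lines would lose its hyphen.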

3. LLM Retouch (medium, high depth)

Uses an LLM to fix issues that need judgment:

Author formatting — Extracts names, affiliations, emails into structured ## Authors section
Lettered sections — Classifies `A. Background` as a header vs `A. We conducted...` as a sentence

4. VLM + Enrichments (high depth)

Extracts structured data for RAG:

File	Contents
`figures.json`	Caption, classification, VLM description, page number
`equations.json`	LaTeX representation, surrounding context
`code_blocks.json`	Code text, detected language
`enrichments.json`	All of the above combined
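For a RAG pipeline, the enrichment files can be loaded with the standard library. The Python sketch below makes no assumption about the JSON schema: it parses whichever of the four files exist and returns them keyed by file name. The function name is hypothetical.

```python
import json
from pathlib import Path

ENRICHMENT_FILES = (
    "figures.json",
    "equations.json",
    "code_blocks.json",
    "enrichments.json",
)


def load_enrichments(paper_dir) -> dict:
    """Parse whichever enrichment files exist; missing files are skipped."""
    paper_dir = Path(paper_dir)
    loaded = {}
    for name in ENRICHMENT_FILES:
        path = paper_dir / name
        if path.is_file():
            loaded[name] = json.loads(path.read_text(encoding="utf-8"))
    return loaded
```

Calling `load_enrichments("output/paper")` after a `-d high` conversion would return up to four entries; at lower depths the result may be empty.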

Local AI Setup

pdf2md supports running entirely locally using LM Studio or Ollama:

# Using LM Studio (default local provider)
export LM_STUDIO_HOST=http://localhost:1234/v1
uv run pdf2md convert paper.pdf ./output --local

# Using Ollama
export OLLAMA_HOST=http://localhost:11434
uv run pdf2md convert paper.pdf ./output --local --provider ollama

# Override model
uv run pdf2md convert paper.pdf ./output --local --model qwen3-8b

# VLM on a separate node
export PDF2MD_VLM_HOST=http://192.168.1.100:1234/v1
uv run pdf2md convert paper.pdf ./output -d high --local

Environment Variables

Variable	Default	Description
`PDF2MD_TEXT_MODEL`	`qwen3-4b`	Text LLM for retouch
`PDF2MD_VLM_MODEL`	`qwen3-vl-4b`	VLM for figure descriptions
`PDF2MD_PROVIDER`	`lm_studio`	Default provider
`LM_STUDIO_HOST`	`http://localhost:1234/v1`	LM Studio endpoint
`PDF2MD_VLM_HOST`	`http://localhost:1234/v1`	VLM endpoint (can differ from text)
`OLLAMA_HOST`	`http://localhost:11434`	Ollama endpoint
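The table above implies a lookup order of environment variable first, documented default second. The Python sketch below illustrates that resolution for the text LLM endpoint. The function names are hypothetical, and the precedence of an explicit provider argument over `PDF2MD_PROVIDER` is an assumption.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the environment variable table.
DEFAULTS = {
    "PDF2MD_TEXT_MODEL": "qwen3-4b",
    "PDF2MD_VLM_MODEL": "qwen3-vl-4b",
    "PDF2MD_PROVIDER": "lm_studio",
    "LM_STUDIO_HOST": "http://localhost:1234/v1",
    "PDF2MD_VLM_HOST": "http://localhost:1234/v1",
    "OLLAMA_HOST": "http://localhost:11434",
}

HOST_VARIABLE = {"lm_studio": "LM_STUDIO_HOST", "ollama": "OLLAMA_HOST"}


def setting(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the environment value, falling back to the documented default."""
    return env.get(name, DEFAULTS[name])


def text_endpoint(
    provider: Optional[str] = None, env: Mapping[str, str] = os.environ
) -> str:
    """Resolve the text LLM endpoint for the chosen provider."""
    # Assumption: an explicit provider overrides PDF2MD_PROVIDER.
    provider = provider or setting("PDF2MD_PROVIDER", env)
    if provider not in HOST_VARIABLE:
        raise ValueError(f"unknown provider: {provider!r}")
    return setting(HOST_VARIABLE[provider], env)
```

Passing a plain dict as `env` makes the lookup easy to test without touching the real environment.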

Installation

# Install as a standalone tool (recommended)
uv tool install paper-to-md

# Pre-download Docling ML models (~500MB, one-time)
pdf2md download-models

Alternative install methods:

# Install into a project
uv add paper-to-md

# pip works too
pip install paper-to-md

# Docker microservice dependencies (quotes stop shells like zsh from globbing the brackets)
uv tool install "paper-to-md[service]"

# Development (pytest + ruff)
uv pip install "paper-to-md[dev]"

Requirements

Python 3.10-3.12
uv is recommended for installation and dependency management

Batch Processing

# Convert all PDFs in a directory
uv run python scripts/batch_convert.py papers/ output/

# Fast batch (no AI)
uv run python scripts/batch_convert.py papers/ output/ --depth low

# Full batch with local LLM
uv run python scripts/batch_convert.py papers/ output/ --depth high --local

# Dry run to see what would be processed
uv run python scripts/batch_convert.py papers/ output/ --dry-run
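Where the batch script is not available, a similar loop can be built on the documented `pdf2md convert` CLI. The Python sketch below is not the code in `scripts/batch_convert.py`. It only builds one command per PDF, using the positional output directory, `-d`, and `--local` options described earlier. The function name is hypothetical.

```python
from pathlib import Path


def build_commands(pdf_dir, out_dir, depth: str = "medium", local: bool = False) -> list:
    """Return one `pdf2md convert` argument list per PDF in pdf_dir."""
    commands = []
    # Non-recursive and case-sensitive: only *.pdf directly inside pdf_dir.
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        command = ["pdf2md", "convert", str(pdf), str(out_dir), "-d", depth]
        if local:
            command.append("--local")
        commands.append(command)
    return commands
```

Printing the returned lists gives a dry run; passing each one to `subprocess.run(command, check=True)` performs the conversions sequentially.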

License

MIT
