FullMark

Full Marks — Every source, Perfect Markdown

Convert ANY source format into clean, structured Markdown — with images, videos, diagrams, and full provenance tracking. One command. No cloud required.

What FullMark Does

FullMark takes files, folders, URLs, videos, images, and archives and converts them into well-structured Markdown. It auto-detects the source type, routes it to the right conversion engine, and writes the result with a stable identity so you never convert the same thing twice.

Category	Formats	How it works
Documents	PDF, DOCX, RTF, TXT, EPUB	Text and tables extracted; scanned PDFs fall back to OCR
Spreadsheets	XLSX, XLS, CSV, ODS	Each sheet becomes a GFM pipe table
Presentations	PPTX, ODP	Slide titles, text blocks, and speaker notes
Notebooks	IPYNB	Markdown cells as-is; code cells in fenced blocks; outputs as text
Email	MSG (Outlook), EML	Headers + body; HTML email → Markdown
Images	JPG, PNG, BMP, TIFF, WebP	OCR text extraction (Tesseract → EasyOCR fallback); table images → GFM tables; decorative images embedded as base64 or described by vision LLM
SVG	SVG	Shapes and paths parsed → Mermaid diagram code block; text elements as bullet list
Video	MP4, AVI, MOV, MKV, WEBM	Audio → Whisper transcription with timecodes; frames sampled every 10 s, head overlay cropped, Tesseract OCR-diff detects slide changes (ignores head movement), upscaled and OCR'd (Tesseract → EasyOCR → vision LLM fallback); transcript and slide content interleaved chronologically
Audio	MP3, WAV, M4A	Whisper transcription with `[MM:SS]` timestamps
Web	HTTP/HTTPS URLs, HTML, RSS, YouTube	Page content + images → Markdown; SVG logos detected by content-type; YouTube transcripts; RSS entries as numbered list
Archives	ZIP	Auto-unpacked; each file routed to the right agent individually
URL Lists	TXT, DOCX, XLSX, CSV	One URL per line/cell — each fetched and converted; non-URL lines skipped
Source Code	`.py` `.js` `.ts` `.go` `.rs` `.java` `.c` `.cpp` `.cs` `.rb` `.php` `.sh` `.sql` `.json` `.yaml` `.toml` `.tf` and 50+ more	Each file becomes a syntax-highlighted fenced code block
GitHub Repos	`https://github.com/owner/repo`	Full repo tree via GitHub API — no git clone needed

Every output file carries:

YAML front matter — source URL/path, conversion timestamp, agent name, and a stable source_id
Provenance footnote — at the bottom of every file so the origin travels with the document

Files over 120,000 characters are split into name_001.md, name_002.md, etc.
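
The splitting rule can be sketched as follows. This is a minimal illustration that assumes a plain character-count cut at fixed offsets; the function name `split_markdown` is hypothetical, and FullMark's own splitter may prefer heading or paragraph boundaries.

```python
MAX_CHARS = 120_000  # segment limit stated above


def split_markdown(name: str, text: str, limit: int = MAX_CHARS) -> dict:
    """Return {filename: content}: one file if short, numbered files otherwise."""
    if len(text) <= limit:
        return {f"{name}.md": text}
    # Assumption: cut at fixed character offsets, numbering segments from 001.
    chunks = [text[i:i + limit] for i in range(0, len(text), limit)]
    return {f"{name}_{n:03d}.md": chunk for n, chunk in enumerate(chunks, start=1)}
```

With this sketch, a 250,000-character document named `big` would yield `big_001.md`, `big_002.md`, and `big_003.md`.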

Before You Clone — Run the Prerequisite Check

Many computers have Python, ffmpeg, or Tesseract installed but not on the system PATH, so commands like python, ffmpeg, or tesseract can fail even though the software is present.

Create a test.py file anywhere on your machine and run it first:

# test.py — paste this and run: python test.py
import shutil, sys

tools = {
    "python":    sys.executable,
    "ffmpeg":    shutil.which("ffmpeg"),
    "tesseract": shutil.which("tesseract"),
}

print(f"Python version : {sys.version}")
for name, path in tools.items():
    status = "✓  found" if path else "✗  NOT FOUND (may need PATH fix)"
    print(f"{name:12} {status}  {path or ''}")

⚠ A tool showing as "NOT FOUND" here might still be installed — it just isn't on your PATH. See the system binaries section for install links.

Python Version

FullMark requires Python 3.11 or higher. Check your version:

python --version

If you have multiple Python versions installed, use py -3.11 on Windows to select one explicitly.

Installation

1. Clone the repository

git clone https://github.com/tmprabubiz/fullmark.git
cd fullmark

2. Create a virtual environment (recommended)

# Windows (PowerShell)
python -m venv .venv
.venv\Scripts\Activate.ps1

# macOS / Linux
python3 -m venv .venv
source .venv/bin/activate

3. Install Python dependencies

pip install -r requirements.txt

Optional extras (install only what you need):

pip install pymupdf        # enhanced PDF extraction
pip install xlrd           # legacy .xls support
pip install cairosvg       # SVG → raster for vision fallback

4. System binaries

These are not Python packages — install them separately:

Tool	Purpose	Download
ffmpeg	Video/audio extraction	https://ffmpeg.org/download.html
Tesseract OCR	Image text extraction	https://github.com/UB-Mannheim/tesseract/wiki (Windows)

After installing, confirm they are on your PATH:

ffmpeg -version
tesseract --version

If either command fails, add the install directory to your system PATH, or set the full path in .env:

FFMPEG_PATH=C:\FFmpeg\bin\ffmpeg.exe
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe

5. Configure your environment

# Windows
copy .env.template .env

# macOS / Linux
cp .env.template .env

Then open .env in any text editor. Everything is optional — FullMark always has a free fallback path.

6. Verify everything is ready

python fullmark_preflight.py

Usage

Convert a single file

python fullmark_cli.py report.pdf
python fullmark_cli.py presentation.pptx
python fullmark_cli.py lecture.mp4
python fullmark_cli.py diagram.svg       # → Mermaid code block
python fullmark_cli.py screenshot.png    # → OCR extracted text

Convert a URL

python fullmark_cli.py https://example.com/article

Web pages are converted with images intact:

Raster images (JPG/PNG/WebP) → OCR run, text extracted into the document
SVG images (logos, icons) → content-type detected, saved as .svg, converted to Mermaid or text bullets
Decorative images with no extractable text → base64-embedded inline

When you run a URL for the first time, FullMark asks whether to follow links:

Source: https://example.com/docs
Single-page conversion by default (no link following).
Follow links on this page and convert multiple pages? [y/N]:

Type n (or just press Enter) for a single page. Type y to see an estimated page count and time before confirming a crawl.

Convert a YouTube video

python fullmark_cli.py https://www.youtube.com/watch?v=VIDEO_ID

Convert a video or audio file

python fullmark_cli.py lecture.mp4
python fullmark_cli.py interview.mp3

Output structure:

## Scene 1 — [00:00]

[Transcript for this time segment...]

[Frame OCR text if present]

## Scene 2 — [01:42]
...

Whisper runs locally — no audio is sent to any cloud service. Set model size with --whisper-model tiny|base|small|medium|large (default: base).

GPU acceleration (NVIDIA): The default pip install -r requirements.txt installs CPU-only PyTorch. If you have an NVIDIA GPU, install the CUDA build for significantly faster Whisper transcription and EasyOCR:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Replace cu121 with your CUDA version (cu118, cu124, etc. — check with nvidia-smi). Without this, Whisper and EasyOCR log informational messages about running in CPU/FP32 mode — these are not errors.

Video frame OCR pipeline (fully local, no API required):

Step	Tool	Notes
1. Sample frames	ffmpeg	Every `VIDEO_FRAME_INTERVAL` seconds (default: 10 s)
2. Crop head overlay	Pillow	Top 20 % of each frame is removed before comparison — ignores webcam/head movement
3. Slide-change detection	Tesseract + Jaccard similarity	OCR the content region; keep frame only when slide text changed ≥ 30 % from previous kept frame
4. Upscale for final OCR	Pillow LANCZOS	Kept frames narrower than 1280 px are upscaled before final OCR pass
5. OCR primary	Tesseract	Best for text-heavy slides and screen recordings
6. OCR fallback	EasyOCR	Catches stylised fonts and graphics that Tesseract misses
7. Vision LLM fallback	VISION_CHAIN	Only if both OCR tools return empty — e.g. hand-drawn diagrams or decorative slides. Disabled by default. Enable by setting `VISION_CHAIN` in `.env`.

The Tesseract-first change detection (steps 2–3) means:

Human head movement is invisible to the frame selector
A new slide triggers a capture even if it appears after only 10 s
Duplicate frames are discarded immediately to save disk space
If Tesseract is not installed, FullMark falls back to PySceneDetect, then fixed-interval sampling
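
The slide-change rule in step 3 can be illustrated with a short sketch. It assumes word-level token sets and treats "changed ≥ 30 %" as a Jaccard distance of at least 0.30; the function names and the tokenisation are illustrative, not FullMark's actual code.

```python
import re

CHANGE_THRESHOLD = 0.30  # "changed >= 30 %" from the pipeline table


def tokens(ocr_text: str) -> set:
    """Lower-cased word set (assumed tokenisation)."""
    return set(re.findall(r"\w+", ocr_text.lower()))


def jaccard_similarity(a: set, b: set) -> float:
    if not a and not b:
        return 1.0  # two empty frames count as identical
    return len(a & b) / len(a | b)


def select_slide_frames(ocr_texts: list) -> list:
    """Return indices of frames whose text differs enough from the last kept frame."""
    kept = []
    last = None
    for i, text in enumerate(ocr_texts):
        current = tokens(text)
        changed = last is None or 1.0 - jaccard_similarity(last, current) >= CHANGE_THRESHOLD
        if changed:
            kept.append(i)
            last = current  # compare later frames against this kept frame
    return kept
```

Because the comparison uses OCR text rather than pixels, a presenter moving in front of an unchanged slide produces the same token set and the frame is dropped.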

The final output interleaves audio transcript and frame OCR chronologically by timecode. Each ## [MM:SS] section shows the slide OCR content followed by the speech that occurred while that slide was visible. Whisper segments are never split mid-sentence.
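
A rough sketch of the interleaving step, assuming slides and Whisper segments arrive as `(start_seconds, text)` pairs. The data shapes and the names `mmss` and `interleave` are hypothetical. Each segment is assigned whole to the latest slide that started at or before it, so no segment is split.

```python
from bisect import bisect_right


def mmss(seconds: float) -> str:
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def interleave(slides: list, segments: list) -> str:
    """slides and segments are lists of (start_seconds, text) tuples."""
    slides = sorted(slides)
    starts = [start for start, _ in slides]
    speech = [[] for _ in slides]
    for seg_start, text in sorted(segments):
        # Last slide that began at or before this segment (first slide if none).
        idx = max(bisect_right(starts, seg_start) - 1, 0)
        speech[idx].append(text)  # whole segment, never split
    blocks = []
    for (start, ocr_text), spoken in zip(slides, speech):
        blocks.append(f"## {mmss(start)}")
        blocks.append(ocr_text)
        if spoken:
            blocks.append(" ".join(spoken))
    return "\n\n".join(blocks)
```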

Tune extraction with .env settings:

VIDEO_FRAME_INTERVAL=10    # seconds between sampled frames (default: 10)
WHISPER_MODEL=base         # tiny | base | small | medium | large

Manual vision extraction tool (advanced / optional)

For cases where the standard OCR pipeline is insufficient — such as very low-resolution videos that cannot be usefully upscaled, or videos consisting entirely of diagrams with no machine-readable text — a standalone script is included:

python video_vision_extractor.py "input/lecture.mp4"

This sends each unique frame to an OpenAI vision model (GPT-4o-mini by default) and writes a separate _Video.md file alongside the main transcript. It requires OPENAI_API_KEY in your .env.

OPENAI_API_KEY=sk-...
OPENAI_VISION_MODEL=gpt-4o-mini   # or gpt-4o for higher accuracy

Key options:

# Skip PySceneDetect (much faster for long videos — uses fixed-interval only)
python video_vision_extractor.py lecture.mp4 --skip-scene-detect

# Adjust sampling density (lower = more frames)
python video_vision_extractor.py lecture.mp4 --interval 10 --hash-threshold 10

# Preview which frames would be sent without spending any API tokens
python video_vision_extractor.py lecture.mp4 --dry-run --skip-scene-detect

Cost note: GPT-4o-mini vision at detail: high costs roughly $0.001–$0.003 per frame. A 1-hour lecture at 10s intervals ≈ 300 unique frames ≈ $0.30–$0.90. The standard OCR pipeline (Tesseract + EasyOCR) is completely free and should be tried first — it handles most screen-recording and slide-deck videos well.
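
The arithmetic behind the cost note, as a small helper. The per-frame prices are the rough figures quoted above, not an official price list, and the function name is illustrative. A 1-hour video sampled every 10 s gives 360 frames before deduplication, so the 300 figure presumably reflects duplicates being dropped.

```python
def vision_cost_range(unique_frames: int,
                      low_per_frame: float = 0.001,
                      high_per_frame: float = 0.003) -> tuple:
    """Return the (low, high) estimated cost in USD, rounded to cents."""
    return (round(unique_frames * low_per_frame, 2),
            round(unique_frames * high_per_frame, 2))


print(vision_cost_range(300))  # (0.3, 0.9)
```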

Convert everything in a folder

python fullmark_cli.py ./my_documents/

Use the `input/` folder (no-argument mode)

Place files into input/ and run with no arguments:

python fullmark_cli.py

FullMark scans input/ and converts everything it finds.

Convert a list of URLs

Create a .txt file with one URL per line:

https://example.com/page-one
https://example.com/page-two
https://docs.python.org/3/library/os.html

Save as input/urls.txt and run:

python fullmark_cli.py input/urls.txt
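
The "non-URL lines skipped" behaviour can be approximated like this. The sketch assumes a line counts as a URL only if it parses as an absolute http(s) address; `extract_urls` is a hypothetical name, and FullMark's real filter may differ.

```python
from urllib.parse import urlparse


def extract_urls(lines) -> list:
    """Keep lines that parse as absolute http(s) URLs and skip everything else."""
    urls = []
    for line in lines:
        candidate = line.strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls
```

Blank lines, comments, and stray text in `urls.txt` would simply be ignored under this rule.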

Crawl a site recursively

python fullmark_cli.py https://example.com/docs --follow-links --crawl-depth 2 --max-pages 30 --crawl-delay 2

⚠ Large crawls can consume significant LLM tokens. Start with --max-pages 10 to sample first. Interrupted runs resume automatically — already-converted sources are skipped.

Convert a GitHub repository — no clone needed

# Entire repo
python fullmark_cli.py https://github.com/owner/repo

# Specific branch
python fullmark_cli.py https://github.com/owner/repo/tree/main

# Just a subdirectory (recommended for large repos)
python fullmark_cli.py https://github.com/owner/repo/tree/main/src

Output contains the repo tree overview, all text files grouped by directory in syntax-highlighted fenced code blocks, and binary files noted but not embedded.

Rate limits:

Mode	Limit
Unauthenticated	60 API requests/hour
With `GITHUB_TOKEN`	5,000 API requests/hour

GITHUB_TOKEN=ghp_yourTokenHere

Get a free token at github.com/settings/tokens — no scopes needed for public repos.

Source code and config files

python fullmark_cli.py ./my-project/

Supported code/config extensions (50+):

Category	Extensions
Python	`.py` `.pyw` `.pyi`
JavaScript / TypeScript	`.js` `.mjs` `.cjs` `.jsx` `.ts` `.tsx`
JVM	`.java` `.kt` `.scala` `.groovy`
C family	`.c` `.h` `.cpp` `.cs`
Systems	`.go` `.rs` `.swift` `.zig` `.dart`
Scripting	`.rb` `.php` `.pl` `.lua` `.r`
Shell	`.sh` `.bash` `.ps1` `.bat` `.cmd`
Data / config	`.json` `.yaml` `.toml` `.ini` `.cfg` `.xml`
Database	`.sql` `.graphql` `.proto`
Infrastructure	`.tf` `.tfvars` `.bicep` `.nix`
Web front-end	`.css` `.scss` `.vue` `.svelte`
Docs-as-code	`.rst` `.mdx`
Misc	`.dockerfile` `.gitignore` `.editorconfig` `.lock`

All CLI options

python fullmark_cli.py --help

Arguments:
  SOURCE               File, directory, or URL (optional — omit to use input/)

Options:
  -o, --output DIR     Output directory (default: ./output)
  -w, --whisper-model  Whisper model: tiny|base|small|medium|large
  -v, --verbose        Show debug logs
  --follow-links       Follow hyperlinks found on URL sources
  --crawl-depth N      Link-hop depth (default: 1)
  --crawl-delay SECS   Sleep between requests (default: 2.0)
  --max-pages N        Hard cap on crawled pages (default: 50)
  --force              Reconvert even if already in conversion_log.json
  --version            Show version

Conversion Log and Deduplication

FullMark tracks every conversion in three files inside output/:

File	Purpose
`conversion_log.json`	Machine-readable dedup index — the tool reads this
`conversion_log.md`	Human-readable summary table with source ID column
`conversion_skipped.log`	Plain-text log of every skipped source with date, file location, and how to reconvert

Deduplication is on by default

Deduplication needs no flag: when you re-run FullMark on a source it has already converted, the source is skipped automatically. Every source gets a stable source_id (fm-<16hex>) computed from its normalised URL or its file content:

Source type	Identity basis
URL	SHA256 of normalised URL — tracking params stripped (`utm_*`, `fbclid`, `gclid`, `ref`), `.git` suffix removed, trailing `/` removed
File < 10 MB	SHA256 of full file bytes — same file under a different name is detected
File ≥ 10 MB (video/audio)	SHA256 of first 4 MB — fast fingerprint without reading the whole file

This means https://example.com/page?utm_source=newsletter and https://example.com/page produce the same source_id and are treated as the same source.
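
A minimal sketch of how such an ID could be derived. Several details are assumptions: taking the first 16 hex characters of the digest, dropping URL fragments, lower-casing scheme and host, and reading "10 MB" and "4 MB" as binary megabytes. The function names are hypothetical.

```python
import hashlib
import os
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid", "ref"}
LARGE_FILE = 10 * 1024 * 1024   # assumed meaning of "10 MB"
HEAD_BYTES = 4 * 1024 * 1024    # assumed meaning of "first 4 MB"


def normalise_url(url: str) -> str:
    parts = urlsplit(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.lower().startswith("utm_") and k.lower() not in TRACKING_KEYS]
    path = parts.path.rstrip("/").removesuffix(".git")
    # Assumption: the fragment is dropped and scheme/host are lower-cased.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))


def url_source_id(url: str) -> str:
    digest = hashlib.sha256(normalise_url(url).encode("utf-8")).hexdigest()
    return "fm-" + digest[:16]


def file_source_id(path: str) -> str:
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Small files: hash everything. Large files: hash only the head.
        data = fh.read() if size < LARGE_FILE else fh.read(HEAD_BYTES)
    return "fm-" + hashlib.sha256(data).hexdigest()[:16]
```

Because small files are hashed by content, a renamed copy of the same file maps to the same ID in this sketch.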

Skip notices — always know what was skipped

When a source is skipped, a notice is appended to output/conversion_skipped.log:

[2026-06-07 21:47:00 UTC] SKIPPED — already converted
  Source    : https://docs.cloud.google.com/managed-spark/docs
  ID        : fm-5003ea12a85b9f75
  Converted : 2026-06-06 21:47:01 UTC
  Output    :
    output\doc_managed-spark_docs\doc_managed-spark_docs.md
  To reconvert : fullmark_cli.py https://... --force
  To delete & redo: delete the output file(s) above, then run again

The skip log is plain text — easy to grep by date, filename, or source ID.

Deleted output = automatic reconversion

If you delete an output .md file, FullMark detects that the file is gone and reconverts automatically — no --force flag needed. The dedup check verifies files actually exist on disk, not just that they appear in the log.

Workflow for replacing an output:

Delete the .md file in output/
Run python fullmark_cli.py <source> — FullMark reconverts without objection
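
The skip decision described in this section might look roughly like the sketch below. The log schema (a dict keyed by source_id with an `outputs` list) is an assumption made for illustration; the real layout of `conversion_log.json` is not documented here.

```python
from pathlib import Path


def should_skip(source_id: str, log: dict, force: bool = False) -> bool:
    """Skip only if the ID is logged AND every recorded output still exists on disk."""
    if force:
        return False  # --force or FORCE_RECONVERT=true always reconverts
    entry = log.get(source_id)
    if entry is None:
        return False  # never converted
    outputs = entry.get("outputs", [])  # hypothetical field name
    return bool(outputs) and all(Path(p).exists() for p in outputs)
```

Deleting an output file makes the existence check fail, which is why reconversion then happens without `--force`.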

Force reconversion (output file still exists)

python fullmark_cli.py https://example.com/article --force

Or set for a whole batch run:

FORCE_RECONVERT=true

Provenance — Every Output Carries Its Origin

Every converted .md file has a provenance trail built in so the source and identity travel with the document wherever it goes.

YAML front matter (top of every file):

---
source: https://docs.cloud.google.com/managed-spark/docs
converted: 2026-06-07T10:22:33Z
agent: WebAgent
source_id: fm-5003ea12a85b9f75
---

Footnote (bottom of every file / last segment):

---
*Converted by [FullMark](https://github.com/tmprabubiz/fullmark) · source: `https://...` · id: `fm-5003ea12a85b9f75`*

Copy the file into a knowledge base, share it, or embed it in a document store — the origin is always traceable. The source_id ties the output back to the log entry.

Image Handling in Detail

Image type	What happens
Photo / screenshot with text	Tesseract OCR → text blocks; aligned columns → GFM table
SVG logo / icon	Content-type detected from HTTP header (not blindly saved as `.jpg`); shapes → Mermaid code block; `<text>` elements → bullet list
Decorative image (no text)	Base64-embedded inline or described by vision LLM
Web page images	Downloaded alongside the page; extension set from `Content-Type` header with byte-sniff fallback (PNG magic, JPEG `FF D8`, SVG `<svg` tag)
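
The extension logic in the last row can be sketched as below: trust a recognised `Content-Type` first, then fall back to sniffing the leading bytes. The mapping table, the 1024-byte SVG search window, and the function name are assumptions for illustration.

```python
from typing import Optional

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8"

CONTENT_TYPE_EXT = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/svg+xml": ".svg",
    "image/webp": ".webp",
}


def image_extension(content_type: Optional[str], data: bytes) -> Optional[str]:
    """Pick a file extension from the header, falling back to a byte sniff."""
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime in CONTENT_TYPE_EXT:
        return CONTENT_TYPE_EXT[mime]
    if data.startswith(PNG_MAGIC):
        return ".png"
    if data.startswith(JPEG_MAGIC):
        return ".jpg"
    if b"<svg" in data[:1024].lower():  # assumed search window
        return ".svg"
    return None  # unknown: caller decides what to do
```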

OCR pipeline: Tesseract → EasyOCR fallback → embed if both fail.

Video and Audio in Detail

MP4 / AVI / MOV / MKV / WEBM
  └─ ffmpeg extracts audio
       └─ Whisper (local, free) → timestamped transcript
  └─ PySceneDetect finds scene changes
       └─ opencv extracts one frame per scene → ImageAgent OCR
  └─ CompilerAgent merges transcript + frame OCR → structured Markdown

MP3 / WAV / M4A
  └─ Whisper → [MM:SS] timestamped Markdown

Whisper runs fully locally — no audio leaves your machine.

LLM Configuration (optional)

FullMark uses an LLM to structure video/audio transcripts and describe decorative images. Entirely optional — if you have no API keys, it falls back to mechanical formatting.

Provider chain in `.env`

# Try left-to-right; first to respond wins.
# Put your most reliable/preferred provider first.
# Recommended default (Gemini free tier is fast and generous):
COMPILER_CHAIN=gemini,groq,openrouter_free,ollama

# VISION_CHAIN: used only when Tesseract + EasyOCR both return empty on a frame.
# Put your preferred vision provider first — order is fully user-configurable.
# OpenAI (gpt-4o-mini) is a reliable, low-cost vision option:
VISION_CHAIN=openai,gemini,gemini_free,anthropic,openrouter_free

Important: If .env contains more than one active COMPILER_CHAIN= or VISION_CHAIN= line, only one of them takes effect and the rest are silently ignored. FullMark logs a warning at startup when it detects duplicates, so the problem is never invisible.

Provider tiers

Tier	Providers	Cost
Free APIs	Gemini free, Groq, Cerebras, NVIDIA, OpenRouter (free models), Mistral free	Free
Low-cost	OpenAI (gpt-4o-mini), DeepSeek, Together AI, Fireworks, Cohere	Pay-per-use, cheap
Premium	OpenAI (gpt-4o), Anthropic, Gemini Pro	Pay-per-use
Local / offline	Ollama (any model you've pulled)	Free

PROVIDER_MAX_RETRIES=2    # retries per provider on 429 before moving on
PROVIDER_RETRY_DELAY=5    # seconds (exponential: 5s, 10s)
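
How the chain and the two retry settings could interact, as a hedged sketch. `RateLimited` is a hypothetical stand-in for an HTTP 429 response, and providers are modelled as plain callables; this is not FullMark's `model_client` code.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def call_chain(chain, providers, prompt, max_retries=2, retry_delay=5, sleep=time.sleep):
    """Try providers left to right; the first to respond wins."""
    for name in chain:
        call = providers.get(name)
        if call is None:
            continue  # provider not configured (e.g. no API key)
        for attempt in range(max_retries + 1):
            try:
                return name, call(prompt)
            except RateLimited:
                if attempt < max_retries:
                    sleep(retry_delay * 2 ** attempt)  # 5 s, then 10 s
            except Exception:
                break  # any other failure: move on to the next provider
    return None, None  # caller falls back to mechanical formatting
```

With the defaults, a rate-limited provider is tried three times in total (one call plus two retries) before the next entry in the chain gets its turn.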

Output Structure

output/
  report.md                         ← single file (< 120k chars)
  big_document_001.md               ← auto-segmented (> 120k chars)
  big_document_002.md
  doc_managed-spark_docs/
    doc_managed-spark_docs.md       ← web conversion in subfolder
    image-001.svg                   ← companion images, correct extension
    image-002.png
  git_owner_repo/
    git_owner_repo_001.md           ← large repo, 3 segments
    git_owner_repo_002.md
    git_owner_repo_003.md
  conversion_log.json               ← machine-readable dedup index
  conversion_log.md                 ← human-readable summary table with source_id
  conversion_skipped.log            ← skip notices: date, file, how to reconvert

Running in a Terminal

FullMark is a command-line tool: run it from a terminal (Command Prompt, PowerShell, or a macOS/Linux shell), not by double-clicking.

cd G:\fullmark
python fullmark_cli.py input/

Long Sessions and Logging

# Windows PowerShell — capture everything to a file
python fullmark_cli.py ./docs/ 2>&1 | Tee-Object -FilePath run.log

# Bash / macOS / Linux
python fullmark_cli.py ./docs/ 2>&1 | tee run.log

Verbose mode — see every routing decision, provider attempt, and retry:

python fullmark_cli.py report.pdf -v

Architecture

ORCHESTRATOR  (extension + MIME routing, dedup, source_id, footnote)
    │
    ├── DocumentAgent   → PDF DOCX XLSX CSV PPTX EPUB IPYNB MSG EML RTF TXT
    ├── CodeAgent       → .py .js .ts .go .rs .java .json .yaml .toml + 50 more
    ├── WebAgent        → URLs HTML RSS YouTube  +  UrlListAgent
    │       └── ImageAgent  (per downloaded image — OCR / SVG / embed)
    ├── ImageAgent      → JPG PNG SVG BMP TIFF WebP  (Tesseract → EasyOCR → base64)
    ├── VideoAgent      → MP4 AVI MOV MP3 WAV M4A   (ffmpeg + Whisper + PySceneDetect)
    │       └── CompilerAgent  (LLM merge of transcript + frame OCR)
    └── RepoAgent       → https://github.com/owner/repo  (GitHub Trees API, no clone)

Running Tests

python -m pytest tests/ -v

147 tests, all mocked — no internet connection or external tools required.

Test Coverage Gaps (pending)

The unit suite mocks all I/O. The following test cases are not yet written and should be added before a production release:

Security (from Opus 4.8 review)

ZIP traversal: member ../../evil.txt must not escape temp_dir
YAML front matter: source values containing #, :, or " produce valid YAML
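
As a starting point for these two cases, here is one possible shape for the checks under test. Both helpers are hypothetical: `is_safe_member` resolves the target path and rejects anything outside the extraction root, and `yaml_quote` relies on a JSON string also being a valid YAML double-quoted scalar.

```python
import json
import os


def is_safe_member(temp_dir: str, member_name: str) -> bool:
    """True if extracting member_name would stay inside temp_dir."""
    root = os.path.realpath(temp_dir)
    target = os.path.realpath(os.path.join(root, member_name))
    return os.path.commonpath([root, target]) == root


def yaml_quote(value: str) -> str:
    """Double-quote a front-matter value so '#', ':' and '"' cannot break the YAML."""
    return json.dumps(value)


# Example assertions a future test could make:
assert is_safe_member("sandbox", "docs/readme.txt")
assert not is_safe_member("sandbox", "../../evil.txt")
assert yaml_quote('a: b # c "d"') == '"a: b # c \\"d\\""'
```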

Output naming

Two sources with the same filename stem → two distinct output paths (hash suffix)
Summary links for subfolder outputs are relative, not bare filenames

CLI / routing

https://github.com/a/b → RepoAgent (not crawlable)
https://github.com/a/b/issues/1 → WebAgent (crawlable)
https://github.com/a/b/blob/main/f.py → WebAgent (crawlable)
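
The routing expectations above, together with the `/tree/...` URLs shown in the usage section, could be expressed as a small classifier. This is a sketch of the intended behaviour, not the orchestrator's actual routing code; the function name is illustrative.

```python
from urllib.parse import urlsplit


def route_github_url(url: str) -> str:
    """Return the agent name a URL is expected to route to."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        return "WebAgent"
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) == 2:
        return "RepoAgent"  # owner/repo
    if len(segments) >= 4 and segments[2] == "tree":
        return "RepoAgent"  # owner/repo/tree/<branch>[/subdir]
    return "WebAgent"       # issues, blobs, and everything else
```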

Video OCR pipeline

Frame is upscaled to ≥1280px before Tesseract runs
EasyOCR is not instantiated when Tesseract returns text
Vision LLM fallback is gated on VISION_CHAIN env var

Manual Integration Tests

These require real files or network and cannot be mocked:

#	Input	What to verify
1	PDF with text	Output contains paragraph text, not just front matter
2	Scanned (image) PDF	pytesseract OCR fallback produces text
3	XLSX	GFM table output
4	HTTP URL	Images download, HTML → clean Markdown
5	YouTube URL	Transcript with `[MM:SS]` timestamps
6	Short MP4 (< 5 min)	Whisper transcript + at least some OCR frame text
7	ZIP with mixed files	Each file routes to correct agent, all outputs collected
8	Crafted ZIP (traversal)	`../../test.txt` member extracts safely, not outside temp dir
9	Two files with same name from different dirs	Both saved, no silent overwrite
10	GitHub repo URL	RepoAgent runs, not WebAgent

Project Structure

fullmark/
  __init__.py          ← AgentError, FullMarkError
  orchestrator.py      ← routing, dedup, source_id injection, output writing
  agents/
    document_agent.py  ← PDF DOCX XLSX CSV PPTX EPUB IPYNB MSG EML RTF TXT
    code_agent.py      ← source code + config files (50+ extensions)
    web_agent.py       ← URLs HTML RSS YouTube; image content-type detection
    image_agent.py     ← raster OCR + SVG→Mermaid
    video_agent.py     ← Whisper + scene detection + frame OCR
    compiler_agent.py  ← LLM merge of transcript + frame data
    repo_agent.py      ← GitHub repo → Markdown (no clone, GitHub Trees API)
  utils/
    model_client.py    ← provider fallback chain (Gemini → OpenAI-compat → Ollama)
    markdown_utils.py  ← front matter, inject_source_id, append_footnote, GFM tables
    file_utils.py      ← extension detection, ZIP unpacking, URL naming
    metadata_logger.py ← JSON + Markdown log, dedup, skip notices
    crawler.py         ← recursive URL crawler (BFS, depth/delay/domain control)
tests/                 ← pytest suite (147 tests)
fullmark_cli.py        ← CLI entry point
fullmark_preflight.py  ← system dependency checker
.env.template          ← configuration template
