Awesome Open-Source Media Parsers for AI / LLM / RAG

A curated list of open-source tools that ingest, parse, chunk, embed, retrieve, and evaluate media for AI/LLM/RAG pipelines.

🎯 Defining "Media" in the Age of AI

In the context of Large Language Models and RAG systems, "Media" is no longer just audio or video. It represents any unstructured or semi-structured data format that must be extracted, parsed, or transformed before an AI can reason over it.

This repository covers the complete ingestion spectrum, including:

Visual & Audio Media: Images, Videos, Speech, Medical Scans (DICOM)
Document Media: PDFs, Word Docs, Spreadsheets, E-books, Presentations
Digital Media: Web pages, HTML, Emails, Repositories, Source Code
Spatial & Specialized Media: 3D Models, GIS/Geospatial Data, Architectural Diagrams

(Note: Tools meant purely for inference, UI, TTS generation, or infrastructure deployment are out of scope. This list focuses on the data ingestion and representation layer.)

1. Document Extraction & Parsing

PDF & Documents

End-to-End PDF → Markdown/JSON for RAG

Tool	License	Stars	Notes
Docling	Apache 2.0	61k+	IBM's layout-aware parser. ML models for layout/table detection. LangChain, LlamaIndex integrations.
Marker	GPL 3.0	25k+	GPU-accelerated. Pipeline: OCR (surya) → layout → formatting. Handles tables, equations, code.
OpenDataLoader PDF	Apache 2.0	3k+	#1 in benchmarks (0.907). XY-Cut++ reading order. Bounding boxes. CPU-only local mode.
pdfmux	MIT	900+	Self-healing: routes each page to the best backend, audits output, re-extracts failures.
PyMuPDF4LLM	AGPL 3.0 / Commercial	2k+	Wrapper around PyMuPDF tuned for LLM/RAG. Markdown extraction, LlamaIndex adapter.
LiteParse	Apache 2.0	9k+	Fast spatial text parsing via PDFium. Built-in Tesseract OCR. Bounding boxes. Rust/Python/Node/WASM.
MegaParse	Apache 2.0	5k+	Document-to-Markdown for RAG ingestion. Handles PDF, DOCX, PPTX, images.
MarkItDown	MIT	50k+	Microsoft's multi-format (PDF, Office, images, HTML, CSV) to Markdown converter.
MinerU	AGPL 3.0	15k+	Full pipeline: OCR → layout → formula → table → Markdown.
PPX	Source-available	500+	Local CPU. OCR + layout + formula → Markdown/JSON. Optional LLM backend.
GoPDF	MIT	900+	Pure Go, deterministic. Per-page signals for OCR routing.
kaos-pdf	Apache 2.0	—	PDF → typed AST with provenance. MCP tools for agentic workflows.
docproc	MIT	200+	Document-to-Markdown with vision LLM for images, equations, figures.
pagewise-pdf-extractor	Apache 2.0	—	Page-wise routing: PyMuPDF → Marker OCR → Ollama vision fallback.
pdfmark-ai	MIT	1k+	Renders PDF pages as images, uses multimodal LLM to produce Markdown.
pdf-to-markdown-pipeline	MIT	100+	Docling + markitdown → clean → chunk. CPU-only. Scientific focus.
mupdf4llm	MIT	—	TypeScript/Bun port of pymupdf4llm. WASM-based. LlamaIndex adapter.
pypdf	BSD 3-Clause	9k+	Pure Python. Most popular PDF library. Simple text extraction.
PyMuPDF (fitz)	AGPL 3.0 / Commercial	7k+	Fastest text extraction. Foundation of many RAG tools.
pdfplumber	MIT	5k+	Best classic library for table extraction. Visual debugging.
pdfminer.six	MIT	6k+	Community-maintained fork of PDFMiner. Text extraction and layout analysis.
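
Several entries above (pdfmux, pagewise-pdf-extractor) describe per-page routing: try a cheap backend first, audit the output, and fall back to a heavier one when the audit fails. The sketch below illustrates that idea only. The backends, the page dictionaries, and the `looks_valid` heuristic are hypothetical stand-ins, not the API of any listed tool.

```python
from typing import Callable, Dict, List

# Hypothetical backend signature: takes a page record, returns extracted text.
Backend = Callable[[dict], str]

def fast_text_layer(page: dict) -> str:
    """Cheap path: return the embedded text layer, if any."""
    return page.get("text_layer", "")

def ocr_fallback(page: dict) -> str:
    """Expensive path: stand-in for an OCR or vision-model call."""
    return page.get("ocr_text", "")

def looks_valid(text: str, min_chars: int = 20) -> bool:
    """Audit step: reject short or mostly non-alphanumeric output."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) >= 0.7

def extract_document(pages: List[dict], backends: List[Backend]) -> List[Dict[str, object]]:
    """Try backends in order for each page; keep the first output that passes the audit."""
    results = []
    for index, page in enumerate(pages):
        chosen, text = "none", ""
        for backend in backends:
            candidate = backend(page)
            if looks_valid(candidate):
                chosen, text = backend.__name__, candidate
                break
        results.append({"page": index, "backend": chosen, "text": text})
    return results

pages = [
    {"text_layer": "A born-digital page with a clean embedded text layer."},
    {"text_layer": "", "ocr_text": "A scanned page that only OCR can read properly."},
]
report = extract_document(pages, [fast_text_layer, ocr_fallback])
for entry in report:
    print(entry["page"], entry["backend"])
```

In a real pipeline the two stand-in functions would wrap actual parsers, and the audit would likely check more than character ratios (for example, table structure or reading order).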

OCR for Scanned PDFs

Tool	License	Stars	Notes
Surya	GPL 3.0	18k+	OCR, layout, reading order, table recognition in 90+ languages.
PaddleOCR	Apache 2.0	50k+	Lightweight multilingual OCR. 80+ languages.
EasyOCR	Apache 2.0	25k+	Ready-to-use OCR with 80+ languages.
OCRmyPDF	GPL 3.0	16k+	Adds OCR text layer to scanned PDFs. Uses Tesseract.
Tesseract	Apache 2.0	65k+	The definitive open-source OCR engine.
DeepSeek-OCR	MIT	3k+	Vision-language model for document OCR.

Academic & Scientific Papers

Tool	License	Stars	Notes
Nougat	MIT	9k+	Meta's neural academic PDF → Markdown (math formulas).
olmOCR	Apache 2.0	3k+	Allen AI. LLM-based PDF-to-text for training datasets.
GROBID	Apache 2.0	4k+	ML-driven scholarly PDF structure extraction. Java.
paper-to-md (pdf2md)	MIT	100+	Docling + LLM retouch. Citations, figures, RAG metadata.

Table Extraction

Tool	License	Stars	Notes
Camelot	MIT	3k+	Lattice + stream table detection. Specialist.
Tabula	MIT	2k+	Java-based table extraction with GUI.
Table Transformer (TATR)	MIT	3k+	Microsoft's DETR-based table detection + structure.

Presentations & Office

Tool	License	Stars	Notes
python-pptx	MIT	4k+	Read/write PowerPoint files.
python-docx	MIT	5k+	Read/write Word documents.
Docling	Apache 2.0	61k+	Handles PPTX, DOCX, XLSX. Unified structured output.
MarkItDown	MIT	50k+	Converts PPTX, DOCX to Markdown.

Spreadsheets & Structured Data

Tool	License	Stars	Notes
openpyxl	MIT	6k+	Read/write Excel files. Used by most RAG pipelines.
pandas	BSD 3-Clause	46k+	Swiss army knife for structured data.
TabLib	MIT	—	Table extraction from spreadsheets.
MarkItDown	MIT	50k+	Handles CSV/XLSX → Markdown.
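
For simple tabular data, the CSV → Markdown step can be approximated with the standard library alone. This is a minimal sketch of the conversion idea, not MarkItDown's implementation; the function name `csv_to_markdown` is made up for illustration.

```python
import csv
import io

def csv_to_markdown(csv_text: str) -> str:
    """Convert CSV text to a Markdown table; the first row is treated as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    width = len(header)

    def fmt(cells):
        # Pad or truncate to the header width and escape pipes inside cells.
        padded = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in padded) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)

sample = "tool,license\npypdf,BSD 3-Clause\npdfplumber,MIT\n"
table = csv_to_markdown(sample)
print(table)
```

Markdown tables tend to work well for small sheets. For wide or very long sheets, row-wise serialization or a DataFrame-based approach is usually easier on the token budget.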

E-books & EPUB

Tool	License	Stars	Notes
calibre	GPL 3.0	22k+	eBook management. Command-line EPUB conversion.
ebooklib	AGPL 3.0	2k+	Python EPUB library.
pandoc	GPL 2.0	37k+	Universal document converter. Handles EPUB, HTML, LaTeX, Markdown.

Email & Messages

Tool	License	Stars	Notes
mail-parser	Apache 2.0	600+	Python email parsing. Attachments, headers, body.
extract-msg	GPL 3.0	400+	Outlook .msg file parser.
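
To show the kind of record an email parser produces (headers, body, attachments), here is a sketch using only Python's built-in `email` package. It is a stand-in, not the mail-parser or extract-msg API, and the sample message and the `parse_email` helper are invented for illustration.

```python
from email import message_from_string, policy

RAW_EML = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="utf-8"

Hi Bob, the report is attached.
--XYZ
Content-Type: text/csv; charset="utf-8"
Content-Disposition: attachment; filename="report.csv"

region,revenue
EMEA,100
--XYZ--
"""

def parse_email(raw: str) -> dict:
    """Return headers, plain-text body, and attachment metadata for one message."""
    msg = message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    attachments = [
        {"filename": part.get_filename(), "content_type": part.get_content_type()}
        for part in msg.iter_attachments()
    ]
    return {
        "from": str(msg["From"]),
        "subject": str(msg["Subject"]),
        "body": body.get_content().strip() if body else "",
        "attachments": attachments,
    }

record = parse_email(RAW_EML)
print(record["subject"], record["attachments"])
```

For RAG, the body and each attachment would typically become separate documents that share the message headers as metadata.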

Archives & Compression

Tool	License	Stars	Notes
kreuzberg	MIT	—	Rust core. 91+ file formats. Handles ZIP, TAR, 7z. Async. Multiple OCR backends.
extractous	Apache 2.0	1.7k+	Rust. High-speed text/metadata extraction. Apache Tika backend. Python/Java/JS bindings.
goblintools	MIT	—	Python. 30+ archive formats. Magic-byte sniffing. Built-in OCR (Tesseract/AWS Textract).
zipstream-ai	MIT	21	Stream ZIP/TAR directly to LLMs without extraction. Auto-detect CSV/JSON. DataFrame integration.
exarch	MIT	—	Secure archive library. CVE protection. TAR/ZIP/7z. Rust core. Python/Node.js bindings.
dedoc	Apache 2.0	—	Document → unified format pipeline. Auto-extracts archives. REST API.
libarchive	BSD	—	Industry-standard C library. Reads 20+ archive formats (ZIP, TAR, 7z, RAR, CAB, ISO).
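
Two ideas recur in the entries above: reading archive members without unpacking to disk, and identifying file types by magic bytes rather than by extension. The sketch below shows both with the standard library. It is illustrative only; the signature table is deliberately tiny and the helper names are assumptions, not any listed tool's API.

```python
import io
import zipfile

# A few well-known file signatures (magic bytes).
MAGIC = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip-based (zip, docx, xlsx, pptx, epub)",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}

def sniff(head: bytes) -> str:
    """Guess a file type from its first bytes."""
    for signature, kind in MAGIC.items():
        if head.startswith(signature):
            return kind
    return "unknown"

def iter_archive(data: bytes):
    """Yield (name, kind, size) per member, reading from memory only."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as member:
                head = member.read(8)  # enough for the signatures above
            yield info.filename, sniff(head), info.file_size

# Build a small in-memory archive to exercise the functions.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("paper.pdf", b"%PDF-1.7\n...")
    archive.writestr("notes.txt", b"plain text notes")

listing = list(iter_archive(buffer.getvalue()))
print(listing)
```

Production code should also guard against oversized or deeply nested archives before reading members, which is the kind of protection exarch advertises.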

2. Web & Code Ingestion

Web & HTML

Tool	License	Stars	Notes
Essence	MIT	—	Fastest. Rust. HTTP-first with Chromium fallback. MCP server. Single binary.
Crawl4AI	Apache 2.0	35k+	Async web crawler for LLMs. Markdown output. JS rendering.
Firecrawl	AGPL 3.0	30k+	API-first. Crawl, scrape, search. Markdown output. Self-hostable.
rdrr	MIT	78+	TypeScript. 20+ site-specific extractors (Wikipedia, Reddit, GitHub, YouTube).
MarkCrawl	MIT	—	Python. Crawl → Markdown + JSONL. Supabase/pgvector upload. MCP server.
pulldown	MIT	—	Python. 5 detail levels. HTTP-first. Chromium optional. CLI + MCP.
site-to-md	MIT	—	Generates /llms.txt + clean Markdown per page.
LLMParser	MIT	—	Python. Full site crawl. RSS. Typed content blocks. No LLM dependencies.
readdown	MIT	—	JS/TS. Replaces Readability + Turndown. Token estimation built-in.
h2m-parser	MIT	—	TS. Mozilla Readability + streaming renderer. 4x faster than alternatives.
scrapedown	MIT	—	HTML → Markdown with CSS/XPath annotations for LLM scraping.
url-to-markdown	MIT	—	Self-hostable API. Handles JS SPAs, PDF, DOCX.
WebToMD	MIT	—	CLI. JS rendering. Design system extraction.
Trafilatura	Apache 2.0	3k+	Python. Reliable web text extraction. Used as backend by many tools.
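
The core of web ingestion is dropping page chrome (scripts, styles, navigation) and keeping readable text. Here is a deliberately naive sketch using Python's built-in `html.parser`; the tools above use far more robust heuristics, and the `SKIP` tag set is an assumption chosen for this example.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside boilerplate tags."""

    SKIP = {"script", "style", "nav", "footer", "header", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)

def html_to_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav>Home | About</nav><h1>Title</h1><p>First paragraph.</p>"
    "<script>track()</script><footer>Copyright</footer></body></html>"
)
text = html_to_text(page)
print(text)
```

This approach does not render JavaScript and does not emit Markdown structure, which is where crawlers with a browser fallback earn their keep.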

Code

Tool	License	Stars	Notes
omnichunk	MIT	—	AST-based code chunking for 15+ languages. Context-rich.
tree-sitter	MIT	20k+	Incremental parser for 100+ languages. Used by many code tools.
LangChain code splitters	MIT	105k+	Language-aware splitting via RecursiveCharacterTextSplitter.from_language.
Julienne (CodeChunker)	MIT	—	Rust. AST-based chunking for Python and Rust.
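
AST-based chunking splits source code at syntactic boundaries (functions, classes) instead of at arbitrary character counts. The sketch below uses Python's built-in `ast` module as a single-language stand-in for what tree-sitter makes possible across many languages. The chunk dictionary layout is an assumption for this example.

```python
import ast

def chunk_python_source(source: str) -> list:
    """Return one chunk per top-level function or class, with line ranges."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            if node.decorator_list:
                # Include decorators, which sit above the def/class line.
                start = min(d.lineno for d in node.decorator_list)
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start": start,
                "end": node.end_lineno,  # available on Python 3.8+
                "text": "\n".join(lines[start - 1:node.end_lineno]),
            })
    return chunks

SOURCE = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

chunks = chunk_python_source(SOURCE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"], chunk["start"], chunk["end"])
```

Keeping the line range with each chunk lets a retriever cite the exact location in the file, and module-level imports can be prepended to each chunk as context.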

3. Visual Media & OCR

Images & Vision

Image Captioning

Tool	License	Stars	Notes
BLIP / BLIP-2	BSD 3-Clause	4k+	Salesforce. Vision-language pre-training. Captioning + VQA.
BLIP-2 (FLAN-T5-XL)	BSD 3-Clause	10k+	State-of-the-art image captioning. Used in many RAG pipelines.
MetaCaptioner	Apache 2.0	1k+	ICLR 2026. GPT-4.1-level caption quality. 89.5% cost reduction.
CapRL / CapRL++	Apache 2.0	200+	ICLR 2026. RL-trained dense captioning. Image + video.
ScaleCap	MIT	—	ICLR 2026. Inference-time scalable captioning. 450k dataset.
Florence-2	MIT	3k+	Microsoft. Unified vision-language model. Captioning + OCR + detection.

Vision Language Models (VLMs)

Tool	License	Params	Notes
Qwen3-VL	Apache 2.0	7B–72B	Alibaba. Top-tier multimodal reasoning. Agentic capabilities.
Molmo	Apache 2.0	1B–72B	Allen AI. On par with GPT-4V. Open weights.
Pixtral	Apache 2.0	12B	Mistral's first multimodal model. Images + text.
GLM-4.6V	Apache 2.0	—	Zhipu AI. Native multimodal tool use. 128K context.
InternVL2.5	MIT	1B–76B	Strong document understanding. Charts, tables.
LLaVA	Apache 2.0	7B–34B	Pioneering open VLM. Large ecosystem.
Gemma 3 Vision	Gemma	4B–27B	Google. Lightweight. Image + short video.
DeepSeek-VL	MIT	7B	MoE efficiency. Strong technical/scientific visuals.

Multi-modal Embeddings

Tool	License	Stars	Notes
CLIP	MIT	28k+	OpenAI. Text + image in shared vector space. The workhorse.
SigLIP / SigLIP-2	Apache 2.0	—	Google. Strong open-source visual encoder. Better than CLIP on docs.
ImageBind	CC BY-NC 4.0	9k+	Meta. 6 modalities: text, image, audio, depth, thermal, IMU.
ColPali	Apache 2.0	3k+	Late-interaction model for document image retrieval. Bypasses OCR.
ColQwen	Apache 2.0	3k+	Qwen-based ColPali. Better accuracy on documents.
Qwen3-VL-Embedding	Apache 2.0	75k+	Vision encoder usable standalone for embeddings.
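
Late-interaction retrievers such as ColPali keep one vector per query token and one per document patch, then score with a ColBERT-style MaxSim: for each query vector take the best-matching document vector and sum those maxima. Below is a dependency-free sketch of that scoring rule with toy 2-D vectors. Real models use learned, higher-dimensional embeddings, and the document names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def max_sim(query_vecs, doc_vecs):
    """Sum over query vectors of the maximum dot product with any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy embeddings: two query tokens, two patches per document page.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "page_a": [[1.0, 0.0], [0.5, 0.5]],
    "page_b": [[0.0, 1.0], [0.0, 1.0]],
}

scores = {name: max_sim(query, vecs) for name, vecs in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Compared with single-vector models like CLIP, this keeps fine-grained matches at the cost of storing many vectors per page.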

Diagrams & Charts

Tool	License	Stars	Notes
diagram2graph	Apache 2.0	41	VLM extracts nodes/edges from process diagrams → structured KG JSON. Fine-tuned Qwen2.5-VL.
DiagramAgent	—	—	Diagram → structured code. Qwen2-VL based.
Schematex	AGPL 3.0	30	DSL for professional diagrams (medical, electrical, legal) → pure SVG. Standards-as-code.
fcp-drawio	MIT	3	MCP server for creating/editing draw.io diagrams via intent-level commands.
PlantUML	MIT	11k+	Text-based UML diagram generation. Widely used.
Mermaid	MIT	75k+	JS-based diagramming and charting. Native LLM rendering support.

Math & LaTeX OCR

Tool	License	Stars	Params	Notes
pix2tex (LaTeX-OCR)	MIT	16k+	—	ViT encoder + Transformer decoder. The gold standard. GUI + CLI + API.
TexTeller	Apache 2.0	729	300M	80M image-formula pairs. Stronger generalization. Handwriting + scanned + printed.
Texo	AGPL 3.0	835	20M	Ultra-lightweight SOTA. Runs in browser. Distilled from PPFormulaNet.
Pix2Text (P2T)	Apache 2.0	5k+	—	Full Mathpix alternative: layout + tables + math + text → Markdown. 80+ languages.

Handwriting Recognition

Tool	License	Stars	Notes
Kraken	Apache 2.0	975	Turn-key OCR for historical and non-Latin scripts. Trainable layout analysis. ALTO/PageXML output.
Churro	Apache 2.0	31	Stanford. 3B VLM. Exceeds Gemini 2.5 Pro accuracy at 15.5x lower cost. 22 centuries of scripts.
HTRflow	Apache 2.0	—	Riksarkivet. YAML pipeline blueprints. TrOCR + YOLO. Exports PAGE/ALTO XML.
Thulium	Apache 2.0	8	52+ languages. ONNX export. CNN/ViT + Transformer/LSTM. Production-ready.
TrOCR	MIT	20k+	Microsoft. Transformer-based OCR. Strong handwritten text baseline. Fine-tunable.
PyLaia	Apache 2.0	254	VGG + BLSTM for HTR. CTC decoding. GPU/CPU agnostic.
HTR-ConvText	—	—	Hybrid CNN-ViT. SOTA on IAM, READ2016. 65.9M params. Textual Context Module.
Loghi HTR	—	—	HTR framework. VGSL model definitions. API mode.

4. Audio & Video Processing

Audio & Speech

Speech-to-Text Models

Tool	License	Params	WER	Languages	Notes
Whisper	MIT	1.55B	7.4%	99+	Gold standard. Encoder-decoder transformer.
Faster-Whisper	MIT	1.55B	7.4%	99+	CTranslate2 reimplementation. Up to 4x faster than the reference Whisper implementation.
WhisperX	BSD 2-Clause	1.55B	—	99+	Word-level timestamps + speaker diarization. 70x realtime.
Whisper Turbo	MIT	809M	7.75%	99+	6x faster than Large V3. Minimal accuracy loss.
Distil-Whisper	MIT	756M	~8%	English	6x faster than Large V3.
Canary-Qwen 2.5B	CC-BY-4.0	2.5B	5.63%	25	Highest accuracy. NVIDIA.
Granite Speech 8B	Apache 2.0	9B	5.85%	English + 7	Enterprise-grade. IBM.
Parakeet TDT	CC-BY-4.0	1.1B	~8%	English	Ultra low-latency streaming. 2728x RTFx.
Moonshine	Apache 2.0	27M–331M	—	English	Edge / on-device.
Qwen3-ASR	Apache 2.0	1.7B	—	52	Alibaba. Competitive with commercial APIs.
CrisperWhisper	MIT	1.55B	6.66%	99+	Verbatim transcription. Reported 1st on the Open ASR Leaderboard.
Vosk	Apache 2.0	—	—	20+	Lightweight, CPU-friendly, streaming.
SpeechBrain	Apache 2.0	—	—	Multi	Research toolkit. Custom pipelines.
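
The WER column above is word error rate: substitutions, deletions, and insertions divided by the number of reference words. A small reference implementation helps when comparing models on your own audio. This sketch normalizes only by lowercasing and whitespace splitting, whereas published leaderboards typically apply heavier text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")
```

Here one substitution ("sat" → "sit") and one deletion ("the") over six reference words give a WER of 2/6.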

Audio RAG & Processing

Tool	License	Stars	Notes
audio-rag-whisper-faiss	MIT	—	Audio → Whisper → FAISS → timestamped RAG.
RAG AgenticVoice	MIT	—	Real-time voice RAG. Whisper → FAISS → Gemini → TTS.
OwnScribe	MIT	—	Browser-based (WebGPU). Whisper → LLM summary → semantic search.
whisper-transcribe	MIT	—	Docker. GPU-accelerated. Diarization. LLM post-correction.
edge-conversational-agent	MIT	—	Edge pipeline: ASR → RAG → LLM → TTS. Whisper + Piper.

Video

Frame Extraction

Tool	License	Stars	Notes
ffmpeg	LGPL/GPL	50k+	Universal media processor. Frame extraction, transcoding, streaming.
PySceneDetect	BSD 3-Clause	3k+	Content-aware scene boundary detection.
FramesExtractor	MIT	—	GPU-accelerated frame extraction via ffmpeg.
distant-frames	GPL 3.0	—	Smart dedup: only saves frames that are visually different.
vidlizer	MIT	—	Video → structured JSON. Local Ollama or cloud. Perceptual dedup.
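
Perceptual deduplication keeps a frame only when it differs visibly from the last kept frame. One common approach is an average hash compared by Hamming distance, sketched below on tiny grayscale grids so it runs without image libraries. This illustrates the general technique and is not necessarily how distant-frames or vidlizer implement it; the threshold is an arbitrary assumption.

```python
def average_hash(frame):
    """One bit per pixel: 1 if the pixel is brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))

def select_distinct(frames, threshold=2):
    """Return indices of frames whose hash differs enough from the last kept frame."""
    kept, last_hash = [], None
    for index, frame in enumerate(frames):
        current = average_hash(frame)
        if last_hash is None or hamming(current, last_hash) > threshold:
            kept.append(index)
            last_hash = current
    return kept

# Toy 4x4 grayscale frames (already downscaled).
dark_left = [[10, 10, 200, 200]] * 4
dark_left_noisy = [[12, 9, 198, 205]] * 4   # same scene with sensor noise
dark_right = [[200, 200, 10, 10]] * 4       # scene change

kept = select_distinct([dark_left, dark_left_noisy, dark_right])
print(kept)
```

In practice frames would first be extracted with ffmpeg and downscaled to something like 8x8 grayscale before hashing.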

Video RAG & Understanding

Tool	License	Stars	Notes
VideoRAG	MIT	2k+	Graph-driven knowledge indexing. Long video QA. Single GPU.
FRAG	MIT	400+	NVIDIA. Frame Selection Augmented Generation. Zero-shot.
FOCUS	MIT	100+	ICLR 2026. Keyframe selection via multi-armed bandits.
Tempo	MIT	—	Query-aware frame compression. Outperforms GPT-4o on long video.
VideoITG	MIT	—	NVIDIA. Instructed Temporal Grounding. Adaptive frame sampling.
PEEK	MIT	—	Query-free frame selector for low-budget video captioning.
media-ingest	MIT	—	Download + frames + transcript for Claude. yt-dlp + Whisper.
VZT Video-Intel	MIT	—	Temporal scene graph. CLI + MCP server. Analyze once, query forever.
RAG-X	MIT	—	Video Graph RAG. SAM2 + CLIP + BLIP + Neo4j.
CFM-RAG	MIT	—	Cross-frame multimodal RAG for video. BLIP-2, YOLO, SAM.
Video-RAG	MIT	—	Training-free. Uses ASR + OCR + object detection as auxiliary texts.

5. Specialized Data & Modalities

3D Models & Spatial Data

Tool	License	Stars	Notes
Open3D	MIT	10k+	Comprehensive library for 3D data processing. Point cloud/mesh feature extraction for LLMs.
trimesh	MIT	3k+	Python library for loading and using triangular meshes. Geometric metadata extraction.

Geospatial & GIS Data

Tool	License	Stars	Notes
GeoPandas	BSD 3-Clause	4k+	Parses Shapefiles and GeoJSON into DataFrames. Easily integrates with LLM data agents.
Rasterio	BSD 3-Clause	3k+	Essential for reading geospatial raster data (satellite imagery) for vision models.
OSMnx	MIT	5k+	Downloads and analyzes street networks from OpenStreetMap. Geospatial RAG context.

Medical Imaging (DICOM)

Tool	License	Stars	Notes
pydicom	MIT	2k+	The standard for reading, modifying, and extracting patient metadata/text from DICOM medical scans.
MONAI	Apache 2.0	8k+	AI toolkit for healthcare imaging. Embeddings from MRI/CT scans.

Database Connectors & SQL RAG

Tool	License	Stars	Notes
Vanna	MIT	23.6k+	Text-to-SQL RAG framework. Train on your schema → natural language queries. Any LLM + any DB.
Databao Agent	—	101	JetBrains. Natural language → interactive charts/tables. Local Ollama support. Pythonic API.
queryclaw	Apache 2.0	7	AI-native DB agent. ReAct loop. Schema exploration, DML, DDL. Safety layer + HITL.
LangChain SQL Agent	MIT	105k+	SQL database toolkit. Query, check, execute, describe. LangGraph integration.
LlamaIndex SQL	MIT	40k+	SQL table indexing + structured query engine. Text-to-SQL with table schema.
DBHub MCP	Apache 2.0	2k+	Universal DB MCP server. Works with Claude, Cursor, VS Code. PostgreSQL, MySQL, SQLite.
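
Text-to-SQL tools generally start by showing the model the database schema. The sketch below pulls `CREATE TABLE` statements from SQLite and assembles a prompt. It is a minimal illustration of that first step, not Vanna's or LangChain's API, and the prompt wording and helper names are assumptions.

```python
import sqlite3

def schema_context(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements for all user tables."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return "\n\n".join(sql.strip() + ";" for _, sql in rows)

def build_prompt(question: str, conn: sqlite3.Connection) -> str:
    """Combine schema and question into a prompt for any LLM."""
    return (
        "You translate questions into SQLite SQL.\n"
        "Schema:\n" + schema_context(conn) + "\n\n"
        "Question: " + question + "\nSQL:"
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

prompt = build_prompt("What is the total revenue per customer?", conn)
print(prompt)
```

Generated SQL should be run against a read-only connection or reviewed before execution, which is the role of the safety layers mentioned above.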

6. RAG Infrastructure & Frameworks

Chunking & Splitting

Tool	License	Stars	Notes
Chonky	MIT	400+	Neural semantic chunking with fine-tuned transformers.
Adaptive Chunking	MIT	230+	Auto-selects best method per document. LREC 2026.
chunklet-py	MIT	77+	Multi-format (text, code, docs). Composable constraints.
omnichunk	MIT	—	Structure-aware. Code, Markdown, JSON, HTML. MCP server.
chunkmate	MIT	—	Token-aware. Auto-format detection. AI metadata generation.
chunkweaver	MIT	—	Regex boundaries, hierarchical levels. LangChain/LlamaIndex drop-in.
chunktuner	MIT	—	Auto-tunes chunking for your corpus. CLI + MCP.
poma-primecut-nano	MIT	—	Hierarchical heading-based. Self-contained retrieval units.
COSMIC	MIT	—	Concept-aware semantic meta-chunking. Discourse coherence.
chunkedrs	MIT	—	Rust. Token-accurate. Recursive, markdown, semantic.
Julienne	MIT	—	Rust. Range-preserving chunks. LangChain-style + semantic.
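
Heading-based chunking, as described for the hierarchical tools above, splits Markdown at headings and attaches the heading path to each chunk so it stays self-contained. Here is a small sketch of that idea. It ignores edge cases such as `#` inside fenced code blocks, and the chunk layout is an assumption.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_headings(markdown: str) -> list:
    """Split Markdown at headings; each chunk records its heading path."""
    chunks, path, buffer = [], [], []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section collected so far
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks

DOC = """# Guide
Intro text.
## Install
Run the installer.
## Usage
Call the API.
"""

chunks = chunk_by_headings(DOC)
for chunk in chunks:
    print(chunk["path"], "|", chunk["text"])
```

Prepending the path to the chunk text before embedding often helps retrieval, since a bare "Run the installer." carries little context by itself.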

Embedding Models

Tool	License	Dims	Notes
BGE (BAAI)	MIT	384–1024	Top-performing open embeddings. BGE-M3 multilingual.
E5 / E5-Mistral	MIT	1024–4096	Microsoft. Strong retrieval embeddings.
GTE (Alibaba)	Apache 2.0	384–768	Multilingual. Qwen-based.
all-MiniLM-L6-v2	Apache 2.0	384	Small, fast. The default for most RAG prototypes.
Jina Embeddings v3	CC BY-NC 4.0	512–1024	LoRA adapters. Task-specific.
Nomic Embed v1.5	Apache 2.0	768	8,192-token context length.
Cohere Embed v3	Proprietary	1024	Cloud API. Strong multilingual retrieval.

Vector Databases

Tool	License	Stars	Scale	Notes
Qdrant	Apache 2.0	24k+	Billions	Rust. Best filtering perf. Hybrid search.
Milvus	Apache 2.0	33k+	100B+	K8s-native. GPU indexing. Enterprise scale.
Weaviate	BSD 3-Clause	13k+	Billions	Built-in vectorization + hybrid search. MCP server.
Chroma	Apache 2.0	18k+	Millions	Python-first, embedded. Fastest prototyping.
pgvector	PostgreSQL	13k+	Millions	PostgreSQL extension. For Postgres shops.
LanceDB	Apache 2.0	5k+	Millions	Embedded, S3-native. Multimodal.
FAISS	MIT	33k+	Custom	Library. GPU-accelerated. Research-grade.
Vald	Apache 2.0	1.6k+	Billions	Cloud-native. Automated vector indexing.
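
Every system in this table ultimately answers the same question: which stored vectors are closest to the query? The brute-force sketch below shows the baseline that these databases accelerate with approximate indexes. The vectors and document IDs are toy values invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, index, k=2):
    """Exact nearest-neighbor search by scanning every vector."""
    scored = sorted(
        ((cosine(query, vec), doc_id) for doc_id, vec in index.items()),
        reverse=True,
    )
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:k]]

index = {
    "pdf-parsing": [0.9, 0.1, 0.0],
    "speech-to-text": [0.0, 0.2, 0.9],
    "ocr": [0.7, 0.6, 0.1],
}

hits = top_k([1.0, 0.2, 0.0], index)
print(hits)
```

A linear scan like this is fine for a few thousand vectors. Beyond that, approximate indexes (HNSW, IVF) and metadata filtering are the main reasons to adopt a dedicated store.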

Multimodal RAG

Tool	License	Stars	Notes
PixelRAG	MIT	—	UC Berkeley. Renders pages as screenshots. VLMs read tiles. Beats text parsers.
ColPali	Apache 2.0	3k+	Document retrieval via vision. No OCR needed.
Byaldi	Apache 2.0	—	User-friendly ColPali. RAG over document images.
LlamaIndex multimodal	MIT	40k+	MultiModalVectorStoreIndex. Text + image dual-index.
NexusRAG	MIT	—	Hybrid RAG + KG. Image/table captioning. Vision LLM.

RAG Frameworks

Tool	License	Stars	Notes
LangChain	MIT	105k+	General orchestration. 500+ integrations. Largest ecosystem.
LlamaIndex	MIT	40k+	Data framework. Best RAG-specific indexing/retrieval.
Dify	Apache 2.0	90k+	Visual AI platform. Low-code RAG + agents.
RAGFlow	Apache 2.0	48k+	Deep document parsing. Intelligent chunking. Knowledge graphs.
Haystack	Apache 2.0	20k+	Production-safe modular pipelines. Built-in evaluation.
DSPy	MIT	22k+	Programmatic prompt/pipeline optimization.
LightRAG	MIT	14k+	Lightweight graph-based RAG. Minimal hardware.
txtai	Apache 2.0	10k+	All-in-one. Semantic search + RAG + agents.
LLMWare	Apache 2.0	12k+	Enterprise RAG. CPU-optimized.
R2R	MIT	5k+	Production RAG engine. REST API.
mem0	Apache 2.0	28k+	Memory layer for AI agents. Personalized RAG.
VelociRAG	MIT	—	ONNX-powered 4-layer fusion. No PyTorch. MCP server.
ragway	MIT	—	Modular. Swap components via YAML. No code changes.

Evaluation & Observability

Tool	License	Stars	Notes
RAGAS	Apache 2.0	9k+	Standard RAG quality metrics. Faithfulness, relevancy.
LangSmith	Proprietary (MIT SDK)	—	Tracing, evaluation, debugging. Managed platform; only the client SDK is open source.
DeepEval	Apache 2.0	6k+	Unit testing for LLMs. 15+ metrics. CI/CD.
Phoenix (Arize)	Elastic License	10k+	LLM observability. Tracing, evaluation.
LangFuse	MIT	8k+	Open-source LLM engineering. Tracing, prompts, metrics.
MLflow	Apache 2.0	21k+	LLM evaluation, tracing, registry.

Agent Frameworks

Tool	License	Stars	Notes
LangGraph	MIT	10k+	Stateful multi-actor agents. Cycles, HITL.
CrewAI	MIT	30k+	Multi-agent orchestration. Role-based.
AutoGen	MIT	40k+	Microsoft. Multi-agent conversations.
smolagents	Apache 2.0	20k+	HuggingFace. Minimal. Code agents.
OpenAI Agents SDK	MIT	20k+	Official. Handoffs, guardrails.
Cognee	Apache 2.0	2k+	GraphRAG + agentic memory.

How to Choose

Use Case	Recommended Tools
Text-heavy digital PDFs	PyMuPDF → PyMuPDF4LLM
Complex layouts, tables, multi-column	Docling or OpenDataLoader
Scanned / image PDFs	Marker or Surya OCR + Docling
Self-healing / high accuracy	pdfmux
Academic papers (formulas, citations)	Nougat, GROBID, paper-to-md
PowerPoint/Word docs → text	python-pptx/python-docx or MarkItDown
Spreadsheets / CSV → RAG	openpyxl + pandas
EPUB/e-books	pandoc, calibre, ebooklib
Email parsing (.eml/.msg)	mail-parser or extract-msg
Archives (ZIP/TAR/7z) → RAG	kreuzberg, extractous, or goblintools
Web pages → Markdown	Essence (fastest), Crawl4AI, Firecrawl, rdrr
Code chunking	omnichunk or tree-sitter
Image captioning	BLIP-2, MetaCaptioner, CapRL
Vision understanding	Qwen3-VL, Molmo, Pixtral
Multi-modal retrieval (text + images)	CLIP, SigLIP-2, ColPali, Byaldi
Diagrams → structured data	diagram2graph or Schematex
UML / architecture diagrams	PlantUML, Mermaid, or fcp-drawio
Math formula image → LaTeX	pix2tex (LaTeX-OCR) or Pix2Text
Handwritten text recognition	Kraken (historical), Churro (VLM), or TrOCR
Audio transcription	Whisper (99+ languages) or Faster-Whisper
Real-time/streaming audio	Parakeet TDT or Whisper Turbo
Speaker diarization	WhisperX
Frame extraction from video	ffmpeg + PySceneDetect or distant-frames
Video RAG / long video QA	VideoRAG, FRAG, FOCUS, media-ingest
3D model metadata	Open3D or trimesh
Geospatial data	GeoPandas + OSMnx
Medical imaging (DICOM)	pydicom + MONAI
Natural language → SQL	Vanna or Databao Agent
Pixel-level RAG (no text parsing)	PixelRAG
Chunking for RAG	chunkweaver (structure), Chonky (neural), LangChain splitters
Embeddings	BGE, E5, GTE (open) or all-MiniLM-L6-v2 (fast)
Fast vector search prototype	Chroma or FAISS
Production vector search (<100M)	Qdrant or Weaviate
Production vector search (>100M)	Milvus or Qdrant
Already on PostgreSQL	pgvector + pgvectorscale
RAG orchestration	LlamaIndex (retrieval-first) or LangChain (general)
Visual / low-code RAG	Dify or RAGFlow
Multi-agent workflows	LangGraph or CrewAI
Graph-enhanced RAG	Microsoft GraphRAG or LightRAG
LLM evaluation	RAGAS + LangFuse or DeepEval
Full ingestion pipeline	Docling → chunkweaver → BGE → Qdrant → LlamaIndex

Related Awesome Lists

awesome-pdf — General PDF libraries
awesome-document-understanding — Document Understanding research
awesome-ocr — Classical OCR
awesome-document-ocr — Modern VLM-based OCR
Awesome-OCR-in-the-Foundation-Model-Era — OCR in foundation model era
awesome-llm-apps — LLM application tools

This list is maintained by the open-source community. Contributions welcome — open a PR to add or update entries.
