A curated list of open-source tools that ingest, parse, chunk, embed, retrieve, and evaluate media for AI/LLM/RAG pipelines.
In LLM and RAG systems, "media" means more than audio or video: it is any unstructured or semi-structured data format that must be extracted, parsed, or transformed before a model can reason over it.
This repository covers the complete ingestion spectrum, including:
- Visual & Audio Media: Images, Videos, Speech, Medical Scans (DICOM)
- Document Media: PDFs, Word Docs, Spreadsheets, E-books, Presentations
- Digital Media: Web pages, HTML, Emails, Repositories, Source Code
- Spatial & Specialized Media: 3D Models, GIS/Geospatial Data, Architectural Diagrams
(Note: Tools meant solely for inference, UI, TTS generation, or infrastructure deployment are out of scope. This list focuses on the data ingestion and representation layer.)
- 1. Document Extraction & Parsing
- 2. Web & Code Ingestion
- 3. Visual Media & OCR
- 4. Audio & Video Processing
- 5. Specialized Data & Modalities
- 6. RAG Infrastructure & Frameworks
- How to Choose
- Related Awesome Lists
| Tool | License | Stars | Notes |
|---|---|---|---|
| Docling | Apache 2.0 | 61k+ | IBM's layout-aware parser. ML models for layout/table detection. LangChain, LlamaIndex integrations. |
| Marker | GPL 3.0 | 25k+ | GPU-accelerated. Pipeline: OCR (surya) → layout → formatting. Handles tables, equations, code. |
| OpenDataLoader PDF | Apache 2.0 | 3k+ | #1 in benchmarks (0.907). XY-Cut++ reading order. Bounding boxes. CPU-only local mode. |
| pdfmux | MIT | 900+ | Self-healing: routes each page to the best backend, audits output, re-extracts failures. |
| PyMuPDF4LLM | AGPL 3.0 / Commercial | 2k+ | Wrapper around PyMuPDF tuned for LLM/RAG. Markdown extraction, LlamaIndex adapter. |
| LiteParse | Apache 2.0 | 9k+ | Fast spatial text parsing via PDFium. Built-in Tesseract OCR. Bounding boxes. Rust/Python/Node/WASM. |
| MegaParse | Apache 2.0 | 5k+ | Document-to-Markdown for RAG ingestion. Handles PDF, DOCX, PPTX, images. |
| MarkItDown | MIT | 50k+ | Microsoft's multi-format (PDF, Office, images, HTML, CSV) to Markdown converter. |
| MinerU | AGPL 3.0 | 15k+ | Full pipeline: OCR → layout → formula → table → Markdown. |
| PPX | Source-available | 500+ | Local CPU. OCR + layout + formula → Markdown/JSON. Optional LLM backend. |
| GoPDF | MIT | 900+ | Pure Go, deterministic. Per-page signals for OCR routing. |
| kaos-pdf | Apache 2.0 | — | PDF → typed AST with provenance. MCP tools for agentic workflows. |
| docproc | MIT | 200+ | Document-to-Markdown with vision LLM for images, equations, figures. |
| pagewise-pdf-extractor | Apache 2.0 | — | Page-wise routing: PyMuPDF → Marker OCR → Ollama vision fallback. |
| pdfmark-ai | MIT | 1k+ | Renders PDF pages as images, uses multimodal LLM to produce Markdown. |
| pdf-to-markdown-pipeline | MIT | 100+ | Docling + markitdown → clean → chunk. CPU-only. Scientific focus. |
| mupdf4llm | MIT | — | TypeScript/Bun port of pymupdf4llm. WASM-based. LlamaIndex adapter. |
| pypdf | BSD 3-Clause | 9k+ | Pure Python. Most popular PDF library. Simple text extraction. |
| PyMuPDF (fitz) | AGPL 3.0 / Commercial | 7k+ | Fastest text extraction. Foundation of many RAG tools. |
| pdfplumber | MIT | 5k+ | Best classic library for table extraction. Visual debugging. |
| pdfminer.six | MIT | 6k+ | Community-maintained fork of PDFMiner. Text extraction and layout analysis. |
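Several entries above (pdfmux, pagewise-pdf-extractor) describe page-wise routing: try a cheap extractor first, audit its output, and fall back to a heavier backend when the audit fails. The control flow can be sketched as below. This is an illustration only: `fast_text`, `ocr_text`, `looks_valid`, and the dict-based "page" are hypothetical stand-ins, not the API of either project.

```python
from typing import Callable, List, Tuple

# A "page" here is a plain dict standing in for a real PDF page object.
Backend = Callable[[dict], str]


def fast_text(page: dict) -> str:
    """Hypothetical cheap backend: return the embedded text layer, if any."""
    return page.get("text_layer", "")


def ocr_text(page: dict) -> str:
    """Hypothetical expensive backend: stands in for an OCR engine."""
    return page.get("ocr_result", "")


def looks_valid(text: str, min_chars: int = 20) -> bool:
    """Crude audit: enough text, and at least 80% letters, digits, or spaces."""
    if len(text.strip()) < min_chars:
        return False
    readable = sum(ch.isalnum() or ch.isspace() for ch in text)
    return readable / len(text) >= 0.8


def route_pages(
    pages: List[dict], backends: List[Tuple[str, Backend]]
) -> List[Tuple[str, str]]:
    """Try backends in order for each page; keep the first output that passes the audit."""
    results = []
    for page in pages:
        chosen, text = "none", ""
        for name, backend in backends:
            candidate = backend(page)
            if looks_valid(candidate):
                chosen, text = name, candidate
                break
        results.append((chosen, text))
    return results


pages = [
    {"text_layer": "This page has a clean embedded text layer."},
    {"text_layer": "", "ocr_result": "Scanned page recovered by the OCR fallback."},
    {},  # nothing recoverable: reported as "none" so it can be re-queued
]
for name, text in route_pages(pages, [("fast", fast_text), ("ocr", ocr_text)]):
    print(name, "|", text)
```

Ordering backends from cheapest to most expensive keeps the common case fast; a real audit would likely also check reading order and table integrity.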
| Tool | License | Stars | Notes |
|---|---|---|---|
| Surya | GPL 3.0 | 18k+ | OCR, layout, reading order, table recognition in 90+ languages. |
| PaddleOCR | Apache 2.0 | 50k+ | Lightweight multilingual OCR. 80+ languages. |
| EasyOCR | Apache 2.0 | 25k+ | Ready-to-use OCR with 80+ languages. |
| OCRmyPDF | MPL 2.0 | 16k+ | Adds OCR text layer to scanned PDFs. Uses Tesseract. |
| Tesseract | Apache 2.0 | 65k+ | The definitive open-source OCR engine. |
| DeepSeek-OCR | MIT | 3k+ | Vision-language model for document OCR. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| Nougat | MIT | 9k+ | Meta's neural academic PDF → Markdown (math formulas). |
| olmOCR | Apache 2.0 | 3k+ | Allen AI. LLM-based PDF-to-text for training datasets. |
| GROBID | Apache 2.0 | 4k+ | ML-driven scholarly PDF structure extraction. Java. |
| paper-to-md (pdf2md) | MIT | 100+ | Docling + LLM retouch. Citations, figures, RAG metadata. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| Camelot | MIT | 3k+ | Lattice + stream table detection. Specialist. |
| Tabula | MIT | 2k+ | Java-based table extraction with GUI. |
| Table Transformer (TATR) | MIT | 3k+ | Microsoft's DETR-based table detection + structure. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| python-pptx | MIT | 4k+ | Read/write PowerPoint files. |
| python-docx | MIT | 5k+ | Read/write Word documents. |
| Docling | Apache 2.0 | 61k+ | Handles PPTX, DOCX, XLSX. Unified structured output. |
| MarkItDown | MIT | 50k+ | Converts PPTX, DOCX to Markdown. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| openpyxl | MIT | 6k+ | Read/write Excel files. Used by most RAG pipelines. |
| pandas | BSD 3-Clause | 46k+ | Swiss army knife for structured data. |
| TabLib | MIT | — | Table extraction from spreadsheets. |
| MarkItDown | MIT | 50k+ | Handles CSV/XLSX → Markdown. |
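The CSV → Markdown conversion mentioned above is simple enough to sketch with the standard library alone. The following is a minimal illustration, not MarkItDown's implementation; `csv_to_markdown` is a hypothetical helper name.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a Markdown table; the first row becomes the header."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes and flatten embedded newlines so cells stay on one line.
        cells = [cell.replace("|", "\\|").replace("\n", " ") for cell in row]
        cells += [""] * (width - len(cells))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in rows[1:])
    return "\n".join(lines)


print(csv_to_markdown("name,qty\nbolt,4\nnut,12\n"))
```

For large sheets, chunking by row ranges and repeating the header in each chunk usually retrieves better than one giant table.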
| Tool | License | Stars | Notes |
|---|---|---|---|
| calibre | GPL 3.0 | 22k+ | eBook management. Command-line EPUB conversion. |
| ebooklib | AGPL 3.0 | 2k+ | Python EPUB library. |
| pandoc | GPL 2.0 | 37k+ | Universal document converter. Handles EPUB, HTML, LaTeX, Markdown. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| mail-parser | Apache 2.0 | 600+ | Python email parsing. Attachments, headers, body. |
| extract-msg | GPL 3.0 | 400+ | Outlook .msg file parser. |
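For plain `.eml` files, the headers, body, and attachment metadata listed above can also be pulled out with Python's built-in `email` package. The sketch below uses only the standard library (it is not mail-parser's API), and `parse_eml` is a hypothetical helper name.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def parse_eml(raw: bytes) -> dict:
    """Extract headers, the preferred text body, and attachment metadata."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": str(msg["subject"]),
        "from": str(msg["from"]),
        "to": str(msg["to"]),
        "body": body_part.get_content() if body_part else "",
        "attachments": [
            {
                "filename": part.get_filename(),
                "content_type": part.get_content_type(),
                "size": len(part.get_payload(decode=True) or b""),
            }
            for part in msg.iter_attachments()
        ],
    }


# Build a small message in memory so the example needs no files on disk.
sample = EmailMessage()
sample["Subject"] = "Q3 report"
sample["From"] = "alice@example.com"
sample["To"] = "bob@example.com"
sample.set_content("Report attached.")
sample.add_attachment(
    b"col_a,col_b\n1,2\n",
    maintype="application",
    subtype="octet-stream",
    filename="q3.csv",
)

parsed = parse_eml(sample.as_bytes())
print(parsed["subject"], parsed["attachments"])
```

Attachments found this way can be handed to the document parsers listed earlier, with the subject and sender kept as chunk metadata.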
| Tool | License | Stars | Notes |
|---|---|---|---|
| kreuzberg | MIT | — | Rust core. 91+ file formats. Handles ZIP, TAR, 7z. Async. Multiple OCR backends. |
| extractous | Apache 2.0 | 1.7k+ | Rust. High-speed text/metadata extraction. Apache Tika backend. Python/Java/JS bindings. |
| goblintools | MIT | — | Python. 30+ archive formats. Magic-byte sniffing. Built-in OCR (Tesseract/AWS Textract). |
| zipstream-ai | MIT | 21 | Stream ZIP/TAR directly to LLMs without extraction. Auto-detect CSV/JSON. DataFrame integration. |
| exarch | MIT | — | Secure archive library. CVE protection. TAR/ZIP/7z. Rust core. Python/Node.js bindings. |
| dedoc | Apache 2.0 | — | Document → unified format pipeline. Auto-extracts archives. REST API. |
| libarchive | BSD | — | Industry-standard C library. Reads 20+ archive formats (ZIP, TAR, 7z, RAR, CAB, ISO). |
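Two ideas recur in the table above: magic-byte sniffing to identify a container, and reading archive members in memory rather than extracting to disk. A minimal standard-library sketch follows; the signature table is deliberately tiny, and `sniff`, `iter_zip_text`, and the size cap are illustrative choices, not taken from any listed tool.

```python
import io
import zipfile

# A few well-known leading signatures; real sniffers cover many more formats.
MAGIC = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"%PDF": "pdf",
}


def sniff(data: bytes) -> str:
    """Identify a container by its leading bytes instead of its file name."""
    for signature, kind in MAGIC.items():
        if data.startswith(signature):
            return kind
    return "unknown"


def iter_zip_text(data: bytes, suffixes=(".csv", ".json", ".txt", ".md"),
                  max_bytes=5_000_000):
    """Yield (name, text) for text-like members without writing to disk."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(suffixes):
                continue
            if info.file_size > max_bytes:  # skip oversized (or bomb-like) members
                continue
            with archive.open(info) as handle:
                yield info.filename, handle.read().decode("utf-8", errors="replace")


# Build a sample archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/rows.csv", "a,b\n1,2\n")
    archive.writestr("img/logo.png", b"\x89PNG\r\n")
payload = buffer.getvalue()

print(sniff(payload))
for name, text in iter_zip_text(payload):
    print(name, repr(text))
```

The declared `file_size` comes from the archive's own metadata, so untrusted input warrants a stricter limit enforced while reading.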
| Tool | License | Stars | Notes |
|---|---|---|---|
| Essence | MIT | — | Fastest. Rust. HTTP-first with Chromium fallback. MCP server. Single binary. |
| Crawl4AI | Apache 2.0 | 35k+ | Async web crawler for LLMs. Markdown output. JS rendering. |
| Firecrawl | AGPL 3.0 | 30k+ | API-first. Crawl, scrape, search. Markdown output. Self-hostable. |
| rdrr | MIT | 78+ | TypeScript. 20+ site-specific extractors (Wikipedia, Reddit, GitHub, YouTube). |
| MarkCrawl | MIT | — | Python. Crawl → Markdown + JSONL. Supabase/pgvector upload. MCP server. |
| pulldown | MIT | — | Python. 5 detail levels. HTTP-first. Chromium optional. CLI + MCP. |
| site-to-md | MIT | — | Generates /llms.txt + clean Markdown per page. |
| LLMParser | MIT | — | Python. Full site crawl. RSS. Typed content blocks. No LLM dependencies. |
| readdown | MIT | — | JS/TS. Replaces Readability + Turndown. Token estimation built-in. |
| h2m-parser | MIT | — | TS. Mozilla Readability + streaming renderer. 4x faster than alternatives. |
| scrapedown | MIT | — | HTML → Markdown with CSS/XPath annotations for LLM scraping. |
| url-to-markdown | MIT | — | Self-hostable API. Handles JS SPAs, PDF, DOCX. |
| WebToMD | MIT | — | CLI. JS rendering. Design system extraction. |
| Trafilatura | Apache 2.0 | 3k+ | Python. Reliable web text extraction. Used as backend by many tools. |
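Most of the tools above share one core step: strip boilerplate elements and emit clean Markdown. The sketch below shows that step with Python's built-in `html.parser`. It is a toy under stated assumptions (only headings, paragraphs, and list items are handled; the `MarkdownExtractor` class and the skip list are hypothetical) and does not attempt the readability scoring that real extractors perform.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Collect headings, paragraphs, and list items; drop scripts and site chrome."""

    SKIP = {"script", "style", "nav", "footer", "noscript"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCKS = {"p", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buffer = []
        self._prefix = ""
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCKS:
            self.flush()
            self._prefix = self.HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCKS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def flush(self):
        """Emit the buffered text as one block, with whitespace normalized."""
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(self._prefix + text)
        self._buffer, self._prefix = [], ""


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text outside a block element
    return "\n\n".join(parser.blocks)


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav>Home | About</nav><h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>alert(1)</script><ul><li>One</li></ul></body></html>"
)
print(html_to_markdown(page))
```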
| Tool | License | Stars | Notes |
|---|---|---|---|
| omnichunk | MIT | — | AST-based code chunking for 15+ languages. Context-rich. |
| tree-sitter | MIT | 20k+ | Incremental parser for 100+ languages. Used by many code tools. |
| LangChain code splitters | MIT | 105k+ | Language-aware splitting via `RecursiveCharacterTextSplitter.from_language`. |
| Julienne (CodeChunker) | MIT | — | Rust. AST-based chunking for Python and Rust. |
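AST-based chunking, as described for the tools above, splits source at syntactic boundaries (functions, classes) rather than at fixed character counts. The listed tools typically build on tree-sitter to cover many languages; the sketch below substitutes Python's built-in `ast` module, so it handles Python source only. `chunk_python_source` is a hypothetical helper name.

```python
import ast


def chunk_python_source(source: str) -> list:
    """Return one chunk per top-level function or class, with line ranges."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so the chunk is self-contained.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available on Python 3.8+
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


SAMPLE = '''import os

def load(path):
    return open(path).read()

class Index:
    def add(self, doc):
        self.docs.append(doc)
'''

for piece in chunk_python_source(SAMPLE):
    print(piece["kind"], piece["name"], piece["start_line"], piece["end_line"])
```

Keeping the line range with each chunk lets a retriever cite exact source locations; context-rich chunkers also prepend imports or the enclosing class signature.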
| Tool | License | Stars | Notes |
|---|---|---|---|
| BLIP / BLIP-2 | BSD 3-Clause | 4k+ | Salesforce. Vision-language pre-training. Captioning + VQA. |
| BLIP-2 (FLAN-T5-XL) | BSD 3-Clause | 10k+ | State-of-the-art image captioning. Used in many RAG pipelines. |
| MetaCaptioner | Apache 2.0 | 1k+ | ICLR 2026. GPT-4.1-level caption quality. 89.5% cost reduction. |
| CapRL / CapRL++ | Apache 2.0 | 200+ | ICLR 2026. RL-trained dense captioning. Image + video. |
| ScaleCap | MIT | — | ICLR 2026. Inference-time scalable captioning. 450k dataset. |
| Florence-2 | MIT | 3k+ | Microsoft. Unified vision-language model. Captioning + OCR + detection. |
| Tool | License | Params | Notes |
|---|---|---|---|
| Qwen3-VL | Apache 2.0 | 7B–72B | Alibaba. Top-tier multimodal reasoning. Agentic capabilities. |
| Molmo | Apache 2.0 | 1B–72B | Allen AI. On par with GPT-4V. Open weights. |
| Pixtral | Apache 2.0 | 12B | Mistral's first multimodal model. Images + text. |
| GLM-4.6V | Apache 2.0 | — | Zhipu AI. Native multimodal tool use. 128K context. |
| InternVL2.5 | MIT | 1B–76B | Strong document understanding. Charts, tables. |
| LLaVA | Apache 2.0 | 7B–34B | Pioneering open VLM. Large ecosystem. |
| Gemma 3 Vision | Gemma | 4B–27B | Google. Lightweight. Image + short video. |
| DeepSeek-VL | MIT | 7B | MoE efficiency. Strong technical/scientific visuals. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| CLIP | MIT | 28k+ | OpenAI. Text + image in shared vector space. The workhorse. |
| SigLIP / SigLIP-2 | Apache 2.0 | — | Google. Strong open-source visual encoder. Better than CLIP on docs. |
| ImageBind | CC BY-NC 4.0 | 9k+ | Meta. 6 modalities: text, image, audio, depth, thermal, IMU. |
| ColPali | Apache 2.0 | 3k+ | Late-interaction model for document image retrieval. Bypasses OCR. |
| ColQwen | Apache 2.0 | 3k+ | Qwen-based ColPali. Better accuracy on documents. |
| Qwen3-VL-Embedding | Apache 2.0 | 75k+ | Vision encoder usable standalone for embeddings. |
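Late interaction, the scoring scheme behind ColPali-style retrievers, keeps one vector per query token and one per document patch, then sums each query vector's best match over the document (often called MaxSim). The sketch below shows only the scoring arithmetic; the 2-dimensional vectors are made-up toy values, and `maxsim_score` and `rank` are hypothetical names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors, doc_vectors):
    """For each query vector take its best-matching document vector, then sum."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


def rank(query_vectors, docs):
    """Return document ids sorted by descending late-interaction score."""
    return sorted(docs, key=lambda name: maxsim_score(query_vectors, docs[name]),
                  reverse=True)


# Toy data: two query-token vectors, two patch vectors per page.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "page_b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first
}
print(rank(query, pages))
```

Because every patch vector is stored, index size grows with page count times patches per page, which is the main cost of this approach relative to single-vector embeddings.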
| Tool | License | Stars | Notes |
|---|---|---|---|
| diagram2graph | Apache 2.0 | 41 | VLM extracts nodes/edges from process diagrams → structured KG JSON. Fine-tuned Qwen2.5-VL. |
| DiagramAgent | — | — | Diagram → structured code. Qwen2-VL based. |
| Schematex | AGPL 3.0 | 30 | DSL for professional diagrams (medical, electrical, legal) → pure SVG. Standards-as-code. |
| fcp-drawio | MIT | 3 | MCP server for creating/editing draw.io diagrams via intent-level commands. |
| PlantUML | MIT | 11k+ | Text-based UML diagram generation. Widely used. |
| Mermaid | MIT | 75k+ | JS-based diagramming and charting. Native LLM rendering support. |
| Tool | License | Stars | Params | Notes |
|---|---|---|---|---|
| pix2tex (LaTeX-OCR) | MIT | 16k+ | — | ViT encoder + Transformer decoder. The gold standard. GUI + CLI + API. |
| TexTeller | Apache 2.0 | 729 | 300M | 80M image-formula pairs. Stronger generalization. Handwriting + scanned + printed. |
| Texo | AGPL 3.0 | 835 | 20M | Ultra-lightweight SOTA. Runs in browser. Distilled from PPFormulaNet. |
| Pix2Text (P2T) | Apache 2.0 | 5k+ | — | Full Mathpix alternative: layout + tables + math + text → Markdown. 80+ languages. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| Kraken | Apache 2.0 | 975 | Turn-key OCR for historical and non-Latin scripts. Trainable layout analysis. ALTO/PageXML output. |
| Churro | Apache 2.0 | 31 | Stanford. 3B VLM. Exceeds Gemini 2.5 Pro accuracy at 15.5x lower cost. 22 centuries of scripts. |
| HTRflow | Apache 2.0 | — | Riksarkivet. YAML pipeline blueprints. TrOCR + YOLO. Exports PAGE/ALTO XML. |
| Thulium | Apache 2.0 | 8 | 52+ languages. ONNX export. CNN/ViT + Transformer/LSTM. Production-ready. |
| TrOCR | MIT | 20k+ | Microsoft. Transformer-based OCR. Strong handwritten text baseline. Fine-tunable. |
| PyLaia | Apache 2.0 | 254 | VGG + BLSTM for HTR. CTC decoding. GPU/CPU agnostic. |
| HTR-ConvText | — | — | Hybrid CNN-ViT. SOTA on IAM, READ2016. 65.9M params. Textual Context Module. |
| Loghi HTR | — | — | HTR framework. VGSL model definitions. API mode. |
| Tool | License | Params | WER | Languages | Notes |
|---|---|---|---|---|---|
| Whisper | MIT | 1.55B | 7.4% | 99+ | Gold standard. Encoder-decoder transformer. |
| Faster-Whisper | MIT | 1.55B | 7.4% | 99+ | CTranslate2 reimplementation. Up to 4x faster than Whisper at the same accuracy. |
| WhisperX | BSD 2-Clause | 1.55B | — | 99+ | Word-level timestamps + speaker diarization. 70x realtime. |
| Whisper Turbo | MIT | 809M | 7.75% | 99+ | 6x faster than Large V3. Minimal accuracy loss. |
| Distil-Whisper | MIT | 756M | ~8% | English | 6x faster than Large V3. |
| Canary-Qwen 2.5B | CC-BY-4.0 | 2.5B | 5.63% | 25 | Highest accuracy. NVIDIA. |
| Granite Speech 8B | Apache 2.0 | 9B | 5.85% | English + 7 | Enterprise-grade. IBM. |
| Parakeet TDT | CC-BY-4.0 | 1.1B | ~8% | English | Ultra low-latency streaming. 2728x RTFx. |
| Moonshine | Apache 2.0 | 27M–331M | — | English | Edge / on-device. |
| Qwen3-ASR | Apache 2.0 | 1.7B | — | 52 | Alibaba. Competitive with commercial APIs. |
| CrisperWhisper | MIT | 1.55B | 6.66% | 99+ | Verbatim transcription. 1st OpenASR leaderboard. |
| Vosk | Apache 2.0 | — | — | 20+ | Lightweight, CPU-friendly, streaming. |
| SpeechBrain | Apache 2.0 | — | — | Multi | Research toolkit. Custom pipelines. |
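The WER column above is word error rate: WER = (S + D + I) / N, where S, D, and I are the substituted, deleted, and inserted words needed to turn the reference transcript into the model output, and N is the number of reference words. A minimal sketch using word-level edit distance follows. Published leaderboard figures apply heavier text normalization than the lowercasing done here, so values from this function are not directly comparable to the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

WER can exceed 100% when the hypothesis contains many inserted words, since the denominator counts only reference words.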
| Tool | License | Stars | Notes |
|---|---|---|---|
| audio-rag-whisper-faiss | MIT | — | Audio → Whisper → FAISS → timestamped RAG. |
| RAG AgenticVoice | MIT | — | Real-time voice RAG. Whisper → FAISS → Gemini → TTS. |
| OwnScribe | MIT | — | Browser-based (WebGPU). Whisper → LLM summary → semantic search. |
| whisper-transcribe | MIT | — | Docker. GPU-accelerated. Diarization. LLM post-correction. |
| edge-conversational-agent | MIT | — | Edge pipeline: ASR → RAG → LLM → TTS. Whisper + Piper. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| ffmpeg | LGPL/GPL | 50k+ | Universal media processor. Frame extraction, transcoding, streaming. |
| PySceneDetect | BSD 3-Clause | 3k+ | Content-aware scene boundary detection. |
| FramesExtractor | MIT | — | GPU-accelerated frame extraction via ffmpeg. |
| distant-frames | GPL 3.0 | — | Smart dedup: only saves frames that are visually different. |
| vidlizer | MIT | — | Video → structured JSON. Local Ollama or cloud. Perceptual dedup. |
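The deduplication idea mentioned above (keep a frame only when it differs visibly from the last kept one) can be sketched with an average hash and a Hamming-distance threshold. This is a generic illustration, not the algorithm of distant-frames or vidlizer: frames are flat lists of grayscale values standing in for downscaled images, and `min_distance=4` is an arbitrary assumption.

```python
def average_hash(pixels) -> int:
    """One bit per pixel: 1 if brighter than the frame's mean, else 0."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def select_distinct(frames, min_distance: int = 4) -> list:
    """Return indices of frames whose hash differs enough from the last kept frame."""
    kept, last_hash = [], None
    for index, pixels in enumerate(frames):
        frame_hash = average_hash(pixels)
        if last_hash is None or hamming(frame_hash, last_hash) >= min_distance:
            kept.append(index)
            last_hash = frame_hash
    return kept


# 4x4 toy frames: bright on the right, a noisy near-duplicate, then bright on the left.
right_bright = [0, 0, 255, 255] * 4
near_duplicate = [10, 5, 250, 255] * 4
left_bright = [255, 255, 0, 0] * 4

print(select_distinct([right_bright, near_duplicate, left_bright, left_bright]))
```

In practice the pixel lists would come from frames extracted with ffmpeg and downscaled to something like 8x8 grayscale before hashing.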
| Tool | License | Stars | Notes |
|---|---|---|---|
| VideoRAG | MIT | 2k+ | Graph-driven knowledge indexing. Long video QA. Single GPU. |
| FRAG | MIT | 400+ | NVIDIA. Frame Selection Augmented Generation. Zero-shot. |
| FOCUS | MIT | 100+ | ICLR 2026. Keyframe selection via multi-armed bandits. |
| Tempo | MIT | — | Query-aware frame compression. Outperforms GPT-4o on long video. |
| VideoITG | MIT | — | NVIDIA. Instructed Temporal Grounding. Adaptive frame sampling. |
| PEEK | MIT | — | Query-free frame selector for low-budget video captioning. |
| media-ingest | MIT | — | Download + frames + transcript for Claude. yt-dlp + Whisper. |
| VZT Video-Intel | MIT | — | Temporal scene graph. CLI + MCP server. Analyze once, query forever. |
| RAG-X | MIT | — | Video Graph RAG. SAM2 + CLIP + BLIP + Neo4j. |
| CFM-RAG | MIT | — | Cross-frame multimodal RAG for video. BLIP-2, YOLO, SAM. |
| Video-RAG | MIT | — | Training-free. Uses ASR + OCR + object detection as auxiliary texts. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| Open3D | MIT | 10k+ | Comprehensive library for 3D data processing. Point cloud/mesh feature extraction for LLMs. |
| trimesh | MIT | 3k+ | Python library for loading and using triangular meshes. Geometric metadata extraction. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| GeoPandas | BSD 3-Clause | 4k+ | Parses Shapefiles and GeoJSON into DataFrames. Easily integrates with LLM data agents. |
| Rasterio | BSD 3-Clause | 3k+ | Essential for reading geospatial raster data (satellite imagery) for vision models. |
| OSMnx | MIT | 5k+ | Downloads and analyzes street networks from OpenStreetMap. Geospatial RAG context. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| pydicom | MIT | 2k+ | The standard for reading, modifying, and extracting patient metadata/text from DICOM medical scans. |
| MONAI | Apache 2.0 | 8k+ | AI toolkit for healthcare imaging. Embeddings from MRI/CT scans. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| Vanna | MIT | 23.6k+ | Text-to-SQL RAG framework. Train on your schema → natural language queries. Any LLM + any DB. |
| Databao Agent | — | 101 | JetBrains. Natural language → interactive charts/tables. Local Ollama support. Pythonic API. |
| queryclaw | Apache 2.0 | 7 | AI-native DB agent. ReAct loop. Schema exploration, DML, DDL. Safety layer + HITL. |
| LangChain SQL Agent | MIT | 105k+ | SQL database toolkit. Query, check, execute, describe. LangGraph integration. |
| LlamaIndex SQL | MIT | 40k+ | SQL table indexing + structured query engine. Text-to-SQL with table schema. |
| DBHub MCP | Apache 2.0 | 2k+ | Universal DB MCP server. Works with Claude, Cursor, VS Code. PostgreSQL, MySQL, SQLite. |
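Text-to-SQL tools like those above generally follow the same loop: read the schema, put it in the prompt with the user's question, then run the generated SQL behind a safety check. The sketch below shows the two ends of that loop with `sqlite3`. No LLM is called; the "generated" query is hard-coded, and `schema_ddl`, `build_prompt`, and `run_readonly` are hypothetical helpers rather than any framework's API.

```python
import sqlite3


def schema_ddl(conn: sqlite3.Connection) -> str:
    """Collect CREATE TABLE statements to ground the model in the real schema."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n".join(row[0] + ";" for row in rows)


def build_prompt(conn: sqlite3.Connection, question: str) -> str:
    """Assemble the text that would be sent to a model."""
    return (
        "Given this SQLite schema:\n"
        f"{schema_ddl(conn)}\n\n"
        f"Write one SELECT statement that answers: {question}"
    )


def run_readonly(conn: sqlite3.Connection, sql: str) -> list:
    """Very coarse guard: refuse anything that is not a SELECT."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed in this sketch")
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)", [(10.0,), (20.0,)])

print(build_prompt(conn, "What is the total revenue?"))
generated_sql = "SELECT SUM(total) FROM orders"  # stand-in for model output
print(run_readonly(conn, generated_sql))
```

A prefix check is not a real safety layer; production setups typically use a read-only database role and a human-in-the-loop step for writes.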
| Tool | License | Stars | Notes |
|---|---|---|---|
| Chonky | MIT | 400+ | Neural semantic chunking with fine-tuned transformers. |
| Adaptive Chunking | MIT | 230+ | Auto-selects best method per document. LREC 2026. |
| chunklet-py | MIT | 77+ | Multi-format (text, code, docs). Composable constraints. |
| omnichunk | MIT | — | Structure-aware. Code, Markdown, JSON, HTML. MCP server. |
| chunkmate | MIT | — | Token-aware. Auto-format detection. AI metadata generation. |
| chunkweaver | MIT | — | Regex boundaries, hierarchical levels. LangChain/LlamaIndex drop-in. |
| chunktuner | MIT | — | Auto-tunes chunking for your corpus. CLI + MCP. |
| poma-primecut-nano | MIT | — | Hierarchical heading-based. Self-contained retrieval units. |
| COSMIC | MIT | — | Concept-aware semantic meta-chunking. Discourse coherence. |
| chunkedrs | MIT | — | Rust. Token-accurate. Recursive, markdown, semantic. |
| Julienne | MIT | — | Rust. Range-preserving chunks. LangChain-style + semantic. |
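Heading-based hierarchical chunking, one of the strategies listed above, splits Markdown at headings and attaches the full heading path to each chunk so that it stays meaningful when retrieved on its own. A minimal sketch follows; `chunk_by_headings` is a hypothetical helper, and it does not guard against `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list:
    """Split Markdown at headings; each chunk carries its heading path."""
    chunks, path, buffer = [], [], []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks


DOC = """# Guide
Intro text.
## Install
Run the installer.
## Usage
Call the API.
# FAQ
None yet."""

for piece in chunk_by_headings(DOC):
    print(piece["path"], "|", piece["text"])
```

Prepending the path (for example `Guide > Install`) to the chunk text before embedding is a common way to keep short sections retrievable.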
| Tool | License | Dims | Notes |
|---|---|---|---|
| BGE (BAAI) | MIT | 384–1024 | Top-performing open embeddings. BGE-M3 multilingual. |
| E5 / E5-Mistral | MIT | 1024–4096 | Microsoft. Strong retrieval embeddings. |
| GTE (Alibaba) | Apache 2.0 | 384–768 | Multilingual. Qwen-based. |
| all-MiniLM-L6-v2 | Apache 2.0 | 384 | Small, fast. The default for most RAG prototypes. |
| Jina Embeddings v3 | Apache 2.0 | 512–1024 | LoRA adapters. Task-specific. |
| Nomic Embed v1.5 | Apache 2.0 | 768 | 8,192-token context length. |
| Cohere Embed v3 | Proprietary | 1024 | Cloud API. Strong multilingual retrieval. |
| Tool | License | Stars | Scale | Notes |
|---|---|---|---|---|
| Qdrant | Apache 2.0 | 24k+ | Billions | Rust. Best filtering perf. Hybrid search. |
| Milvus | Apache 2.0 | 33k+ | 100B+ | K8s-native. GPU indexing. Enterprise scale. |
| Weaviate | BSD 3-Clause | 13k+ | Billions | Built-in vectorization + hybrid search. MCP server. |
| Chroma | Apache 2.0 | 18k+ | Millions | Python-first, embedded. Fastest prototyping. |
| pgvector | PostgreSQL | 13k+ | Millions | PostgreSQL extension. For Postgres shops. |
| LanceDB | Apache 2.0 | 5k+ | Millions | Embedded, S3-native. Multimodal. |
| FAISS | MIT | 33k+ | Custom | Library. GPU-accelerated. Research-grade. |
| Vald | Apache 2.0 | 1.6k+ | Billions | Cloud-native. Automated vector indexing. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| PixelRAG | MIT | — | UC Berkeley. Renders pages as screenshots. VLMs read tiles. Beats text parsers. |
| ColPali | Apache 2.0 | 3k+ | Document retrieval via vision. No OCR needed. |
| Byaldi | Apache 2.0 | — | User-friendly ColPali. RAG over document images. |
| LlamaIndex multimodal | MIT | 40k+ | MultiModalVectorStoreIndex. Text + image dual-index. |
| NexusRAG | MIT | — | Hybrid RAG + KG. Image/table captioning. Vision LLM. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| LangChain | MIT | 105k+ | General orchestration. 500+ integrations. Largest ecosystem. |
| LlamaIndex | MIT | 40k+ | Data framework. Best RAG-specific indexing/retrieval. |
| Dify | Apache 2.0 | 90k+ | Visual AI platform. Low-code RAG + agents. |
| RAGFlow | Apache 2.0 | 48k+ | Deep document parsing. Intelligent chunking. Knowledge graphs. |
| Haystack | Apache 2.0 | 20k+ | Production-safe modular pipelines. Built-in evaluation. |
| DSPy | MIT | 22k+ | Programmatic prompt/pipeline optimization. |
| LightRAG | MIT | 14k+ | Lightweight graph-based RAG. Minimal hardware. |
| txtai | Apache 2.0 | 10k+ | All-in-one. Semantic search + RAG + agents. |
| LLMWare | Apache 2.0 | 12k+ | Enterprise RAG. CPU-optimized. |
| R2R | MIT | 5k+ | Production RAG engine. REST API. |
| mem0 | Apache 2.0 | 28k+ | Memory layer for AI agents. Personalized RAG. |
| VelociRAG | MIT | — | ONNX-powered 4-layer fusion. No PyTorch. MCP server. |
| ragway | MIT | — | Modular. Swap components via YAML. No code changes. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| RAGAS | Apache 2.0 | 9k+ | Standard RAG quality metrics. Faithfulness, relevancy. |
| LangSmith | Proprietary (MIT SDK) | — | Tracing, evaluation, debugging. Managed service. |
| DeepEval | Apache 2.0 | 6k+ | Unit testing for LLMs. 15+ metrics. CI/CD. |
| Phoenix (Arize) | Elastic License | 10k+ | LLM observability. Tracing, evaluation. |
| Langfuse | MIT | 8k+ | Open-source LLM engineering. Tracing, prompts, metrics. |
| MLflow | Apache 2.0 | 21k+ | LLM evaluation, tracing, registry. |
| Tool | License | Stars | Notes |
|---|---|---|---|
| LangGraph | MIT | 10k+ | Stateful multi-actor agents. Cycles, HITL. |
| CrewAI | MIT | 30k+ | Multi-agent orchestration. Role-based. |
| AutoGen | MIT | 40k+ | Microsoft. Multi-agent conversations. |
| smolagents | Apache 2.0 | 20k+ | HuggingFace. Minimal. Code agents. |
| OpenAI Agents SDK | MIT | 20k+ | Official. Handoffs, guardrails. |
| Cognee | Apache 2.0 | 2k+ | GraphRAG + agentic memory. |
| Use Case | Recommended Tools |
|---|---|
| Text-heavy digital PDFs | PyMuPDF → PyMuPDF4LLM |
| Complex layouts, tables, multi-column | Docling or OpenDataLoader |
| Scanned / image PDFs | Marker or Surya OCR + Docling |
| Self-healing / high accuracy | pdfmux |
| Academic papers (formulas, citations) | Nougat, GROBID, paper-to-md |
| PowerPoint/Word docs → text | python-pptx/python-docx or MarkItDown |
| Spreadsheets / CSV → RAG | openpyxl + pandas |
| EPUB/e-books | pandoc, calibre, ebooklib |
| Email parsing (.eml/.msg) | mail-parser or extract-msg |
| Archives (ZIP/TAR/7z) → RAG | kreuzberg, extractous, or goblintools |
| Web pages → Markdown | Essence (fastest), Crawl4AI, Firecrawl, rdrr |
| Code chunking | omnichunk or tree-sitter |
| Image captioning | BLIP-2, MetaCaptioner, CapRL |
| Vision understanding | Qwen3-VL, Molmo, Pixtral |
| Multi-modal retrieval (text + images) | CLIP, SigLIP-2, ColPali, Byaldi |
| Diagrams → structured data | diagram2graph or Schematex |
| UML / architecture diagrams | PlantUML, Mermaid, or fcp-drawio |
| Math formula image → LaTeX | pix2tex (LaTeX-OCR) or Pix2Text |
| Handwritten text recognition | Kraken (historical), Churro (VLM), or TrOCR |
| Audio transcription | Whisper (99+ languages) or Faster-Whisper |
| Real-time/streaming audio | Parakeet TDT or Whisper Turbo |
| Speaker diarization | WhisperX |
| Frame extraction from video | ffmpeg + PySceneDetect or distant-frames |
| Video RAG / long video QA | VideoRAG, FRAG, FOCUS, media-ingest |
| 3D model metadata | Open3D or trimesh |
| Geospatial data | GeoPandas + OSMnx |
| Medical imaging (DICOM) | pydicom + MONAI |
| Natural language → SQL | Vanna or Databao Agent |
| Pixel-level RAG (no text parsing) | PixelRAG |
| Chunking for RAG | chunkweaver (structure), Chonky (neural), LangChain splitters |
| Embeddings | BGE, E5, GTE (open) or all-MiniLM-L6-v2 (fast) |
| Fast vector search prototype | Chroma or FAISS |
| Production vector search (<100M) | Qdrant or Weaviate |
| Production vector search (>100M) | Milvus or Qdrant |
| Already on PostgreSQL | pgvector + pgvectorscale |
| RAG orchestration | LlamaIndex (retrieval-first) or LangChain (general) |
| Visual / low-code RAG | Dify or RAGFlow |
| Multi-agent workflows | LangGraph or CrewAI |
| Graph-enhanced RAG | Microsoft GraphRAG or LightRAG |
| LLM evaluation | RAGAS + Langfuse or DeepEval |
| Full ingestion pipeline | Docling → chunkweaver → BGE → Qdrant → LlamaIndex |
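The full ingestion pipeline in the last row has five stages: parse, chunk, embed, store, retrieve. The sketch below wires those stages together with standard-library stand-ins so the data flow is visible end to end. None of the named libraries is called: `parse`, `chunk`, `embed`, and `MemoryStore` are hypothetical placeholders, and the bag-of-words "embedding" only mimics what a model such as BGE would produce.

```python
import math
import re
from collections import Counter


def parse(raw: str) -> str:
    """Stand-in for a document parser: collapse whitespace only."""
    return " ".join(raw.split())


def chunk(text: str, max_words: int = 7) -> list:
    """Stand-in for a chunker: fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: sparse bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class MemoryStore:
    """Stand-in for a vector database: brute-force search over a list."""

    def __init__(self):
        self.items = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list:
        query_vector = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vector, item[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]


def ingest(raw: str, store: MemoryStore) -> None:
    for piece in chunk(parse(raw)):
        store.add(piece)


store = MemoryStore()
ingest(
    "Whisper handles audio transcription in many languages.\n"
    "Docling parses PDF layouts and tables.",
    store,
)
print(store.search("audio transcription"))
```

Keeping each stage behind a small function boundary like this makes it straightforward to swap in a real parser, chunker, embedding model, or vector store one at a time.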
- awesome-pdf — General PDF libraries
- awesome-document-understanding — Document Understanding research
- awesome-ocr — Classical OCR
- awesome-document-ocr — Modern VLM-based OCR
- Awesome-OCR-in-the-Foundation-Model-Era — OCR in foundation model era
- awesome-llm-apps — LLM application tools
This list is maintained by the open-source community. Contributions welcome — open a PR to add or update entries.