naranyala/oss-media-parsers-for-ai

Awesome Open-Source Media Parsers for AI / LLM / RAG

A curated list of open-source tools that ingest, parse, chunk, embed, retrieve, and evaluate media for AI/LLM/RAG pipelines.

🎯 Defining "Media" in the Age of AI

In the context of Large Language Models and RAG systems, "Media" is no longer just audio or video. It represents any unstructured or semi-structured data format that must be extracted, parsed, or transformed before an AI can reason over it.

This repository covers the complete ingestion spectrum, including:

  • Visual & Audio Media: Images, Videos, Speech, Medical Scans (DICOM)
  • Document Media: PDFs, Word Docs, Spreadsheets, E-books, Presentations
  • Digital Media: Web pages, HTML, Emails, Repositories, Source Code
  • Spatial & Specialized Media: 3D Models, GIS/Geospatial Data, Architectural Diagrams

(Note: Tools meant only for inference, UI, TTS generation, or infrastructure deployment are out of scope. This list covers only the data ingestion and representation layer.)

1. Document Extraction & Parsing

PDF & Documents

End-to-End PDF → Markdown/JSON for RAG

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| Docling | MIT | 61k+ | IBM's layout-aware parser. ML models for layout/table detection. LangChain, LlamaIndex integrations. |
| Marker | GPL 3.0 | 25k+ | GPU-accelerated. Pipeline: OCR (surya) → layout → formatting. Handles tables, equations, code. |
| OpenDataLoader PDF | Apache 2.0 | 3k+ | #1 in benchmarks (0.907). XY-Cut++ reading order. Bounding boxes. CPU-only local mode. |
| pdfmux | MIT | 900+ | Self-healing: routes each page to the best backend, audits output, re-extracts failures. |
| PyMuPDF4LLM | AGPL 3.0 / Commercial | 2k+ | Wrapper around PyMuPDF tuned for LLM/RAG. Markdown extraction, LlamaIndex adapter. |
| LiteParse | Apache 2.0 | 9k+ | Fast spatial text parsing via PDFium. Built-in Tesseract OCR. Bounding boxes. Rust/Python/Node/WASM. |
| MegaParse | Apache 2.0 | 5k+ | Document-to-Markdown for RAG ingestion. Handles PDF, DOCX, PPTX, images. |
| MarkItDown | MIT | 50k+ | Microsoft's multi-format (PDF, Office, images, HTML, CSV) to Markdown converter. |
| MinerU | AGPL 3.0 | 15k+ | Full pipeline: OCR → layout → formula → table → Markdown. |
| PPX | Source-available | 500+ | Local CPU. OCR + layout + formula → Markdown/JSON. Optional LLM backend. |
| GoPDF | MIT | 900+ | Pure Go, deterministic. Per-page signals for OCR routing. |
| kaos-pdf | Apache 2.0 | | PDF → typed AST with provenance. MCP tools for agentic workflows. |
| docproc | MIT | 200+ | Document-to-Markdown with vision LLM for images, equations, figures. |
| pagewise-pdf-extractor | Apache 2.0 | | Page-wise routing: PyMuPDF → Marker OCR → Ollama vision fallback. |
| pdfmark-ai | MIT | 1k+ | Renders PDF pages as images, uses multimodal LLM to produce Markdown. |
| pdf-to-markdown-pipeline | MIT | 100+ | Docling + markitdown → clean → chunk. CPU-only. Scientific focus. |
| mupdf4llm | MIT | | TypeScript/Bun port of pymupdf4llm. WASM-based. LlamaIndex adapter. |
| pypdf | BSD 3-Clause | 9k+ | Pure Python. Most popular PDF library. Simple text extraction. |
| PyMuPDF (fitz) | AGPL 3.0 / Commercial | 7k+ | Fastest text extraction. Foundation of many RAG tools. |
| pdfplumber | MIT | 5k+ | Best classic library for table extraction. Visual debugging. |
| pdfminer.six | MIT | 6k+ | Community-maintained fork of PDFMiner. Text analysis focus. |
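
The page-wise routing that pdfmux and pagewise-pdf-extractor describe in the table above can be sketched roughly as follows. This is a minimal illustration, not either project's implementation: `looks_valid`, `extract_page`, and the two stand-in backends are hypothetical names, and the audit heuristic (minimum length plus printable-character ratio) is an assumption chosen for brevity.

```python
from typing import Callable, Optional, Sequence

Extractor = Callable[[bytes], str]


def looks_valid(text: str, min_chars: int = 20) -> bool:
    """Crude audit (an assumption): enough characters, mostly printable."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
    return printable / len(stripped) > 0.9


def extract_page(
    page: bytes, backends: Sequence[tuple[str, Extractor]]
) -> tuple[Optional[str], str]:
    """Try backends in order; return (backend name, text) for the first
    output that passes the audit, or (None, "") if every backend fails."""
    for name, extract in backends:
        try:
            text = extract(page)
        except Exception:
            continue  # a crashing backend counts as a failed attempt
        if looks_valid(text):
            return name, text
    return None, ""


# Hypothetical stand-ins for a fast text-layer reader and a slower OCR pass.
def fast_text_layer(page: bytes) -> str:
    return page.decode("utf-8", errors="ignore")


def slow_ocr(page: bytes) -> str:
    return "OCR fallback text for a scanned page."


BACKENDS = [("text", fast_text_layer), ("ocr", slow_ocr)]
```

Ordering the backends from cheapest to most expensive means the costly OCR or vision step only runs on pages where the cheap path produced unusable output.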

OCR for Scanned PDFs

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| Surya | GPL 3.0 | 18k+ | OCR, layout, reading order, table recognition in 90+ languages. |
| PaddleOCR | Apache 2.0 | 50k+ | Lightweight multilingual OCR. 80+ languages. |
| EasyOCR | Apache 2.0 | 25k+ | Ready-to-use OCR with 80+ languages. |
| OCRmyPDF | MPL 2.0 | 16k+ | Adds OCR text layer to scanned PDFs. Uses Tesseract. |
| Tesseract | Apache 2.0 | 65k+ | The definitive open-source OCR engine. |
| DeepSeek-OCR | MIT | 3k+ | Vision-language model for document OCR. |

Academic & Scientific Papers

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| Nougat | MIT | 9k+ | Meta's neural academic PDF → Markdown (math formulas). |
| olmOCR | Apache 2.0 | 3k+ | Allen AI. LLM-based PDF-to-text for training datasets. |
| GROBID | Apache 2.0 | 4k+ | ML-driven scholarly PDF structure extraction. Java. |
| paper-to-md (pdf2md) | MIT | 100+ | Docling + LLM retouch. Citations, figures, RAG metadata. |

Table Extraction

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| Camelot | MIT | 3k+ | Lattice + stream table detection. Specialist. |
| Tabula | MIT | 2k+ | Java-based table extraction with GUI. |
| Table Transformer (TATR) | MIT | 3k+ | Microsoft's DETR-based table detection + structure. |

Presentations & Office

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| python-pptx | MIT | 4k+ | Read/write PowerPoint files. |
| python-docx | MIT | 5k+ | Read/write Word documents. |
| Docling | MIT | 61k+ | Handles PPTX, DOCX, XLSX. Unified structured output. |
| MarkItDown | MIT | 50k+ | Converts PPTX, DOCX to Markdown. |

Spreadsheets & Structured Data

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| openpyxl | MIT | 6k+ | Read/write Excel files. Used by most RAG pipelines. |
| pandas | BSD 3-Clause | 46k+ | Swiss army knife for structured data. |
| TabLib | MIT | | Table extraction from spreadsheets. |
| MarkItDown | MIT | 50k+ | Handles CSV/XLSX → Markdown. |

E-books & EPUB

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| calibre | GPL 3.0 | 22k+ | eBook management. Command-line EPUB conversion. |
| ebooklib | AGPL 3.0 | 2k+ | Python EPUB library. |
| pandoc | GPL 2.0 | 37k+ | Universal document converter. Handles EPUB, HTML, LaTeX, Markdown. |

Email & Messages

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| mail-parser | Apache 2.0 | 600+ | Python email parsing. Attachments, headers, body. |
| extract-msg | GPL 3.0 | 400+ | Outlook .msg file parser. |
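
For plain `.eml` input, the headers, body, and attachment fields mentioned above can also be pulled out with Python's standard `email` package. The sketch below is an illustration under that assumption, not the mail-parser or extract-msg API. `parse_eml` and the returned dictionary keys are hypothetical, and the sample message is built in memory purely for demonstration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def parse_eml(raw: bytes) -> dict:
    """Return subject, sender, preferred body text, and attachment names."""
    msg = message_from_bytes(raw, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": msg["subject"],
        "from": msg["from"],
        "body": body_part.get_content() if body_part else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


# Build a small sample message in memory (demonstration data only).
sample = EmailMessage()
sample["Subject"] = "Q3 report"
sample["From"] = "a@example.com"
sample.set_content("See attached.")
sample.add_attachment(
    b"col1,col2\n1,2\n",
    maintype="application",
    subtype="octet-stream",
    filename="data.csv",
)

parsed = parse_eml(sample.as_bytes())
```

Attachment payloads would normally be handed to the document parsers listed earlier rather than embedded as-is.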

Archives & Compression

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| kreuzberg | MIT | | Rust core. 91+ file formats. Handles ZIP, TAR, 7z. Async. Multiple OCR backends. |
| extractous | Apache 2.0 | 1.7k+ | Rust. High-speed text/metadata extraction. Apache Tika backend. Python/Java/JS bindings. |
| goblintools | MIT | | Python. 30+ archive formats. Magic-byte sniffing. Built-in OCR (Tesseract/AWS Textract). |
| zipstream-ai | MIT | 21 | Stream ZIP/TAR directly to LLMs without extraction. Auto-detect CSV/JSON. DataFrame integration. |
| exarch | MIT | | Secure archive library. CVE protection. TAR/ZIP/7z. Rust core. Python/Node.js bindings. |
| dedoc | Apache 2.0 | | Document → unified format pipeline. Auto-extracts archives. REST API. |
| libarchive | BSD | | Industry-standard C library. Reads 20+ archive formats (ZIP, TAR, 7z, RAR, CAB, ISO). |

2. Web & Code Ingestion

Web & HTML

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| Essence | MIT | | Fastest. Rust. HTTP-first with Chromium fallback. MCP server. Single binary. |
| Crawl4AI | Apache 2.0 | 35k+ | Async web crawler for LLMs. Markdown output. JS rendering. |
| Firecrawl | AGPL 3.0 | 30k+ | API-first. Crawl, scrape, search. Markdown output. Self-hostable. |
| rdrr | MIT | 78+ | TypeScript. 20+ site-specific extractors (Wikipedia, Reddit, GitHub, YouTube). |
| MarkCrawl | MIT | | Python. Crawl → Markdown + JSONL. Supabase/pgvector upload. MCP server. |
| pulldown | MIT | | Python. 5 detail levels. HTTP-first. Chromium optional. CLI + MCP. |
| site-to-md | MIT | | Generates /llms.txt + clean Markdown per page. |
| LLMParser | MIT | | Python. Full site crawl. RSS. Typed content blocks. No LLM dependencies. |
| readdown | MIT | | JS/TS. Replaces Readability + Turndown. Token estimation built-in. |
| h2m-parser | MIT | | TS. Mozilla Readability + streaming renderer. 4x faster than alternatives. |
| scrapedown | MIT | | HTML → Markdown with CSS/XPath annotations for LLM scraping. |
| url-to-markdown | MIT | | Self-hostable API. Handles JS SPAs, PDF, DOCX. |
| WebToMD | MIT | | CLI. JS rendering. Design system extraction. |
| Trafilatura | Apache 2.0 | 3k+ | Python. Reliable web text extraction. Used as backend by many tools. |
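
The core HTML → Markdown step that most of these tools perform can be illustrated with the standard library's `html.parser`. The sketch below is deliberately naive and is not how any listed tool works internally: `MarkdownExtractor` and `html_to_markdown` are hypothetical names, only headings, paragraphs, and list items are handled, and the set of skipped tags is an assumption.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "footer"}  # treated as non-content
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []
        self._prefix = ""
        self._buffer: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()
            self._prefix = self.HEADINGS[tag]
        elif tag == "li":
            self._flush()
            self._prefix = "- "
        elif tag == "p":
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.HEADINGS or tag in ("p", "li"):
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def _flush(self) -> None:
        # Join buffered text and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(self._prefix + text)
        self._buffer, self._prefix = [], ""


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.lines)


SAMPLE_HTML = (
    "<html><head><style>p{}</style></head><body><nav>Menu</nav>"
    "<h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<ul><li>one</li><li>two</li></ul><script>x()</script></body></html>"
)
```

Production extractors add boilerplate detection, link and table handling, and JavaScript rendering on top of this basic tag walk.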

Code

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| omnichunk | MIT | | AST-based code chunking for 15+ languages. Context-rich. |
| tree-sitter | MIT | 20k+ | Incremental parser for 100+ languages. Used by many code tools. |
| LangChain code splitters | MIT | 105k+ | Language-aware RecursiveCharacterTextSplitter. |
| Julienne (CodeChunker) | MIT | | Rust. AST-based chunking for Python and Rust. |
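
AST-based chunking, the approach several entries above name, splits source at syntactic boundaries instead of at arbitrary character offsets. A minimal single-language sketch using Python's built-in `ast` module follows. `chunk_python_source` and the chunk dictionary layout are hypothetical, and the multi-language tools above rely on parsers such as tree-sitter rather than `ast`.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start from the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append(
                {
                    "name": node.name,
                    "kind": type(node).__name__,
                    "start_line": start,
                    "end_line": node.end_lineno,
                    "text": "\n".join(lines[start - 1 : node.end_lineno]),
                }
            )
    return chunks


SAMPLE_SOURCE = '''import os

def load(path):
    return open(path).read()

class Parser:
    def parse(self, text):
        return text.split()
'''

code_chunks = chunk_python_source(SAMPLE_SOURCE)
```

Keeping `start_line` and `end_line` with each chunk lets a retriever cite the exact source range later.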

3. Visual Media & OCR

Images & Vision

Image Captioning

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| BLIP / BLIP-2 | BSD 3-Clause | 4k+ | Salesforce. Vision-language pre-training. Captioning + VQA. |
| BLIP-2 (FLAN-T5-XL) | BSD 3-Clause | 10k+ | State-of-the-art image captioning. Used in many RAG pipelines. |
| MetaCaptioner | Apache 2.0 | 1k+ | ICLR 2026. GPT-4.1-level caption quality. 89.5% cost reduction. |
| CapRL / CapRL++ | Apache 2.0 | 200+ | ICLR 2026. RL-trained dense captioning. Image + video. |
| ScaleCap | MIT | | ICLR 2026. Inference-time scalable captioning. 450k dataset. |
| Florence-2 | MIT | 3k+ | Microsoft. Unified vision-language model. Captioning + OCR + detection. |

Vision Language Models (VLMs)

| Tool | License | Params | Notes |
| --- | --- | --- | --- |
| Qwen3-VL | Apache 2.0 | 7B–72B | Alibaba. Top-tier multimodal reasoning. Agentic capabilities. |
| Molmo | Apache 2.0 | 1B–72B | Allen AI. On par with GPT-4V. Open weights. |
| Pixtral | Apache 2.0 | 12B | Mistral's first multimodal model. Images + text. |
| GLM-4.6V | Apache 2.0 | | Zhipu AI. Native multimodal tool use. 128K context. |
| InternVL2.5 | MIT | 1B–76B | Strong document understanding. Charts, tables. |
| LLaVA | Apache 2.0 | 7B–34B | Pioneering open VLM. Large ecosystem. |
| Gemma 3 Vision | Gemma | 4B–27B | Google. Lightweight. Image + short video. |
| DeepSeek-VL | MIT | 7B | MoE efficiency. Strong technical/scientific visuals. |

Multi-modal Embeddings

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| CLIP | MIT | 28k+ | OpenAI. Text + image in shared vector space. The workhorse. |
| SigLIP / SigLIP-2 | Apache 2.0 | | Google. Strong open-source visual encoder. Better than CLIP on docs. |
| ImageBind | CC BY-NC 4.0 | 9k+ | Meta. 6 modalities: text, image, audio, depth, thermal, IMU. |
| ColPali | Apache 2.0 | 3k+ | Late-interaction model for document image retrieval. Bypasses OCR. |
| ColQwen | Apache 2.0 | 3k+ | Qwen-based ColPali. Better accuracy on documents. |
| Qwen3-VL-Embedding | Apache 2.0 | 75k+ | Vision encoder usable standalone for embeddings. |
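
The difference between a single shared-space vector (CLIP-style) and late interaction (ColPali-style) comes down to how the score is computed. The sketch below shows the MaxSim form of late interaction on toy vectors: each query token vector is matched to its best document patch vector, and the maxima are summed. The function names and toy data are hypothetical, and real models produce the vectors with a trained encoder that is not shown here.

```python
import math


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def late_interaction_score(query_vectors, doc_vectors):
    """MaxSim: sum over query vectors of the best match among doc vectors."""
    return sum(max(cosine(q, d) for d in doc_vectors) for q in query_vectors)


# Toy 2-D "token" and "patch" embeddings (demonstration data only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
```

Because every query token is scored separately, a page that matches only part of the query ranks below one that matches all of it, at the cost of storing many vectors per page.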

Diagrams & Charts

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| diagram2graph | Apache 2.0 | 41 | VLM extracts nodes/edges from process diagrams → structured KG JSON. Fine-tuned Qwen2.5-VL. |
| DiagramAgent | | | Diagram → structured code. Qwen2-VL based. |
| Schematex | AGPL 3.0 | 30 | DSL for professional diagrams (medical, electrical, legal) → pure SVG. Standards-as-code. |
| fcp-drawio | MIT | 3 | MCP server for creating/editing draw.io diagrams via intent-level commands. |
| PlantUML | MIT | 11k+ | Text-based UML diagram generation. Widely used. |
| Mermaid | MIT | 75k+ | JS-based diagramming and charting. Native LLM rendering support. |

Math & LaTeX OCR

| Tool | License | Stars | Params | Notes |
| --- | --- | --- | --- | --- |
| pix2tex (LaTeX-OCR) | MIT | 16k+ | | ViT encoder + Transformer decoder. The gold standard. GUI + CLI + API. |
| TexTeller | Apache 2.0 | 729 | 300M | 80M image-formula pairs. Stronger generalization. Handwriting + scanned + printed. |
| Texo | AGPL 3.0 | 835 | 20M | Ultra-lightweight SOTA. Runs in browser. Distilled from PPFormulaNet. |
| Pix2Text (P2T) | Apache 2.0 | 5k+ | | Full Mathpix alternative: layout + tables + math + text → Markdown. 80+ languages. |

Handwriting Recognition

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| Kraken | Apache 2.0 | 975 | Turn-key OCR for historical and non-Latin scripts. Trainable layout analysis. ALTO/PageXML output. |
| Churro | Apache 2.0 | 31 | Stanford. 3B VLM. Exceeds Gemini 2.5 Pro accuracy at 15.5x lower cost. 22 centuries of scripts. |
| HTRflow | Apache 2.0 | | Riksarkivet. YAML pipeline blueprints. TrOCR + YOLO. Exports PAGE/ALTO XML. |
| Thulium | Apache 2.0 | 8 | 52+ languages. ONNX export. CNN/ViT + Transformer/LSTM. Production-ready. |
| TrOCR | MIT | 20k+ | Microsoft. Transformer-based OCR. Strong handwritten text baseline. Fine-tunable. |
| PyLaia | Apache 2.0 | 254 | VGG + BLSTM for HTR. CTC decoding. GPU/CPU agnostic. |
| HTR-ConvText | | | Hybrid CNN-ViT. SOTA on IAM, READ2016. 65.9M params. Textual Context Module. |
| Loghi HTR | | | HTR framework. VGSL model definitions. API mode. |

4. Audio & Video Processing

Audio & Speech

Speech-to-Text Models

| Tool | License | Params | WER | Languages | Notes |
| --- | --- | --- | --- | --- | --- |
| Whisper | MIT | 1.55B | 7.4% | 99+ | Gold standard. Encoder-decoder transformer. |
| Faster-Whisper | MIT | 1.55B | 7.4% | 99+ | CTranslate2 reimplementation. 4x faster than Whisper. |
| WhisperX | BSD 2-Clause | 1.55B | | 99+ | Word-level timestamps + speaker diarization. 70x realtime. |
| Whisper Turbo | MIT | 809M | 7.75% | 99+ | 6x faster than Large V3. Minimal accuracy loss. |
| Distil-Whisper | MIT | 756M | ~8% | English | 6x faster than Large V3. |
| Canary-Qwen 2.5B | CC-BY-4.0 | 2.5B | 5.63% | 25 | Highest accuracy. NVIDIA. |
| Granite Speech 8B | Apache 2.0 | 9B | 5.85% | English + 7 | Enterprise-grade. IBM. |
| Parakeet TDT | CC-BY-4.0 | 1.1B | ~8% | English | Ultra low-latency streaming. 2728x RTFx. |
| Moonshine | Apache 2.0 | 27M–331M | | English | Edge / on-device. |
| Qwen3-ASR | Apache 2.0 | 1.7B | | 52 | Alibaba. Competitive with commercial APIs. |
| CrisperWhisper | MIT | 1.55B | 6.66% | 99+ | Verbatim transcription. 1st OpenASR leaderboard. |
| Vosk | Apache 2.0 | | | 20+ | Lightweight, CPU-friendly, streaming. |
| SpeechBrain | Apache 2.0 | | | Multi | Research toolkit. Custom pipelines. |
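
The WER column above is word error rate: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. A minimal sketch follows. `word_error_rate` is a hypothetical helper, and it only lowercases and splits on whitespace, whereas published figures normally apply a fuller text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Rolling-row Levenshtein distance over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.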

Audio RAG & Processing

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| audio-rag-whisper-faiss | MIT | | Audio → Whisper → FAISS → timestamped RAG. |
| RAG AgenticVoice | MIT | | Real-time voice RAG. Whisper → FAISS → Gemini → TTS. |
| OwnScribe | MIT | | Browser-based (WebGPU). Whisper → LLM summary → semantic search. |
| whisper-transcribe | MIT | | Docker. GPU-accelerated. Diarization. LLM post-correction. |
| edge-conversational-agent | MIT | | Edge pipeline: ASR → RAG → LLM → TTS. Whisper + Piper. |

Video

Frame Extraction

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| ffmpeg | LGPL/GPL | 50k+ | Universal media processor. Frame extraction, transcoding, streaming. |
| PySceneDetect | BSD 3-Clause | 3k+ | Content-aware scene boundary detection. |
| FramesExtractor | MIT | | GPU-accelerated frame extraction via ffmpeg. |
| distant-frames | GPL 3.0 | | Smart dedup: only saves frames that are visually different. |
| vidlizer | MIT | | Video → structured JSON. Local Ollama or cloud. Perceptual dedup. |
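
The perceptual deduplication that distant-frames and vidlizer mention can be sketched with an average hash: threshold each pixel of a small grayscale thumbnail against the frame mean, then keep a frame only when its bit pattern differs enough from the last kept frame. This is an assumed technique for illustration, not either tool's documented algorithm. The function names, the 4x4 toy frames, and the distance threshold are all hypothetical.

```python
def average_hash(pixels):
    """pixels: 2-D list of grayscale values from an already downscaled frame."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(value > mean for value in flat)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def select_distinct_frames(frames, max_distance=2):
    """Return indices of frames whose hash differs from the last kept
    frame by more than max_distance bits."""
    kept, last_hash = [], None
    for index, pixels in enumerate(frames):
        frame_hash = average_hash(pixels)
        if last_hash is None or hamming(frame_hash, last_hash) > max_distance:
            kept.append(index)
            last_hash = frame_hash
    return kept


# Toy 4x4 grayscale frames (demonstration data only).
dark_left = [[0, 0, 255, 255] for _ in range(4)]
near_duplicate = [[5, 3, 250, 252] for _ in range(4)]
bright_left = [[255, 255, 0, 0] for _ in range(4)]

kept_indices = select_distinct_frames(
    [dark_left, near_duplicate, bright_left, bright_left]
)
```

In a real pipeline the thumbnails would come from decoded video frames, for example frames written out by ffmpeg and downscaled to 8x8.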

Video RAG & Understanding

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| VideoRAG | MIT | 2k+ | Graph-driven knowledge indexing. Long video QA. Single GPU. |
| FRAG | MIT | 400+ | NVIDIA. Frame Selection Augmented Generation. Zero-shot. |
| FOCUS | MIT | 100+ | ICLR 2026. Keyframe selection via multi-armed bandits. |
| Tempo | MIT | | Query-aware frame compression. Outperforms GPT-4o on long video. |
| VideoITG | MIT | | NVIDIA. Instructed Temporal Grounding. Adaptive frame sampling. |
| PEEK | MIT | | Query-free frame selector for low-budget video captioning. |
| media-ingest | MIT | | Download + frames + transcript for Claude. yt-dlp + Whisper. |
| VZT Video-Intel | MIT | | Temporal scene graph. CLI + MCP server. Analyze once, query forever. |
| RAG-X | MIT | | Video Graph RAG. SAM2 + CLIP + BLIP + Neo4j. |
| CFM-RAG | MIT | | Cross-frame multimodal RAG for video. BLIP-2, YOLO, SAM. |
| Video-RAG | MIT | | Training-free. Uses ASR + OCR + object detection as auxiliary texts. |

5. Specialized Data & Modalities

3D Models & Spatial Data

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| Open3D | MIT | 10k+ | Comprehensive library for 3D data processing. Point cloud/mesh feature extraction for LLMs. |
| trimesh | MIT | 3k+ | Python library for loading and using triangular meshes. Geometric metadata extraction. |

Geospatial & GIS Data

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| GeoPandas | BSD 3-Clause | 4k+ | Parses Shapefiles and GeoJSON into DataFrames. Easily integrates with LLM data agents. |
| Rasterio | BSD 3-Clause | 3k+ | Essential for reading geospatial raster data (satellite imagery) for vision models. |
| OSMnx | MIT | 5k+ | Downloads and analyzes street networks from OpenStreetMap. Geospatial RAG context. |

Medical Imaging (DICOM)

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| pydicom | MIT | 2k+ | The standard for reading, modifying, and extracting patient metadata/text from DICOM medical scans. |
| MONAI | Apache 2.0 | 8k+ | AI toolkit for healthcare imaging. Embeddings from MRI/CT scans. |

Database Connectors & SQL RAG

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| Vanna | MIT | 23.6k+ | Text-to-SQL RAG framework. Train on your schema → natural language queries. Any LLM + any DB. |
| Databao Agent | | 101 | JetBrains. Natural language → interactive charts/tables. Local Ollama support. Pythonic API. |
| queryclaw | Apache 2.0 | 7 | AI-native DB agent. ReAct loop. Schema exploration, DML, DDL. Safety layer + HITL. |
| LangChain SQL Agent | MIT | 105k+ | SQL database toolkit. Query, check, execute, describe. LangGraph integration. |
| LlamaIndex SQL | MIT | 40k+ | SQL table indexing + structured query engine. Text-to-SQL with table schema. |
| DBHub MCP | Apache 2.0 | 2k+ | Universal DB MCP server. Works with Claude, Cursor, VS Code. PostgreSQL, MySQL, SQLite. |
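
Text-to-SQL tools generally need two ingredients around the LLM call: a schema description to put in the prompt, and a guard on what the generated query may do. The sketch below shows both with the standard `sqlite3` module. It is not the API of any tool above: `describe_schema` and `run_read_only` are hypothetical names, the SELECT-prefix check is a deliberately crude stand-in for a real safety layer, and the LLM call itself is omitted.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> str:
    """Collect CREATE TABLE statements to use as prompt context."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return "\n\n".join(sql for _, sql in rows)


def run_read_only(conn: sqlite3.Connection, query: str) -> list:
    """Execute a generated query only if it looks like a plain SELECT."""
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(query).fetchall()


# In-memory demo database (demonstration data only).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
demo_conn.executemany("INSERT INTO orders (total) VALUES (?)", [(9.5,), (20.0,)])

schema_context = describe_schema(demo_conn)
order_count = run_read_only(demo_conn, "SELECT COUNT(*) FROM orders")
```

A production guard would parse the statement or use a read-only database role instead of relying on a string prefix.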

6. RAG Infrastructure & Frameworks

Chunking & Splitting

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| Chonky | MIT | 400+ | Neural semantic chunking with fine-tuned transformers. |
| Adaptive Chunking | MIT | 230+ | Auto-selects best method per document. LREC 2026. |
| chunklet-py | MIT | 77+ | Multi-format (text, code, docs). Composable constraints. |
| omnichunk | MIT | | Structure-aware. Code, Markdown, JSON, HTML. MCP server. |
| chunkmate | MIT | | Token-aware. Auto-format detection. AI metadata generation. |
| chunkweaver | MIT | | Regex boundaries, hierarchical levels. LangChain/LlamaIndex drop-in. |
| chunktuner | MIT | | Auto-tunes chunking for your corpus. CLI + MCP. |
| poma-primecut-nano | MIT | | Hierarchical heading-based. Self-contained retrieval units. |
| COSMIC | MIT | | Concept-aware semantic meta-chunking. Discourse coherence. |
| chunkedrs | MIT | | Rust. Token-accurate. Recursive, markdown, semantic. |
| Julienne | MIT | | Rust. Range-preserving chunks. LangChain-style + semantic. |
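
Hierarchical, heading-based chunking of the kind listed above can be sketched in a few lines: split Markdown at headings and attach the heading path to each chunk so it stays self-contained at retrieval time. `chunk_by_headings` and the chunk layout are hypothetical, and this sketch ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown at headings; each chunk carries its heading path."""
    chunks: list[dict] = []
    path: list[str] = []
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that ran up to this heading
            level = len(match.group(1))
            del path[level - 1 :]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


SAMPLE_MARKDOWN = (
    "# Guide\nIntro text.\n"
    "## Install\nRun the installer.\n"
    "## Usage\nCall the API.\n"
)

md_chunks = chunk_by_headings(SAMPLE_MARKDOWN)
```

Prepending the `path` string to the chunk text before embedding is a common way to keep short sections interpretable out of context.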

Embedding Models

| Tool | License | Dims | Notes |
| --- | --- | --- | --- |
| BGE (BAAI) | MIT | 384–1024 | Top-performing open embeddings. BGE-M3 multilingual. |
| E5 / E5-Mistral | MIT | 1024–4096 | Microsoft. Strong retrieval embeddings. |
| GTE (Alibaba) | Apache 2.0 | 384–768 | Multilingual. Qwen-based. |
| all-MiniLM-L6-v2 | Apache 2.0 | 384 | Small, fast. The default for most RAG prototypes. |
| Jina Embeddings v3 | CC BY-NC 4.0 | 512–1024 | LoRA adapters. Task-specific. |
| Nomic Embed v1.5 | Apache 2.0 | 768 | 8192-token context length. |
| Cohere Embed v3 | Proprietary | 1024 | Cloud API. Strong multilingual retrieval. |

Vector Databases

| Tool | License | Stars | Scale | Notes |
| --- | --- | --- | --- | --- |
| Qdrant | Apache 2.0 | 24k+ | Billions | Rust. Best filtering perf. Hybrid search. |
| Milvus | Apache 2.0 | 33k+ | 100B+ | K8s-native. GPU indexing. Enterprise scale. |
| Weaviate | BSD 3-Clause | 13k+ | Billions | Built-in vectorization + hybrid search. MCP server. |
| Chroma | Apache 2.0 | 18k+ | Millions | Python-first, embedded. Fastest prototyping. |
| pgvector | PostgreSQL | 13k+ | Millions | PostgreSQL extension. For Postgres shops. |
| LanceDB | Apache 2.0 | 5k+ | Millions | Embedded, S3-native. Multimodal. |
| FAISS | MIT | 33k+ | Custom | Library. GPU-accelerated. Research-grade. |
| Vald | Apache 2.0 | 1.6k+ | Billions | Cloud-native. Automated vector indexing. |
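
What every vector database above provides, at its core, is nearest-neighbour search over stored embeddings, usually combined with metadata filtering. The brute-force sketch below shows that contract on toy data. `TinyVectorIndex` is a hypothetical class, it scans every item on each query, and the real systems replace that scan with approximate indexes such as HNSW or IVF.

```python
import heapq
import math


def _unit(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("zero vector cannot be normalized")
    return [x / norm for x in vector]


class TinyVectorIndex:
    """Exact cosine-similarity search with optional metadata filtering."""

    def __init__(self) -> None:
        self._items = []  # (doc_id, unit vector, metadata)

    def add(self, doc_id, vector, metadata=None) -> None:
        self._items.append((doc_id, _unit(vector), metadata or {}))

    def search(self, query, k=3, where=None):
        """Return up to k (score, doc_id) pairs, best first."""
        q = _unit(query)
        scored = (
            (sum(a * b for a, b in zip(q, vector)), doc_id)
            for doc_id, vector, metadata in self._items
            if where is None
            or all(metadata.get(key) == value for key, value in where.items())
        )
        return heapq.nlargest(k, scored)


# Toy 2-D embeddings (demonstration data only).
index = TinyVectorIndex()
index.add("a", [1.0, 0.0], {"lang": "en"})
index.add("b", [0.9, 0.1], {"lang": "de"})
index.add("c", [0.0, 1.0], {"lang": "en"})

top_two = [doc_id for _, doc_id in index.search([1.0, 0.0], k=2)]
top_two_en = [
    doc_id for _, doc_id in index.search([1.0, 0.0], k=2, where={"lang": "en"})
]
```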

Multimodal RAG

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| PixelRAG | MIT | | UC Berkeley. Renders pages as screenshots. VLMs read tiles. Beats text parsers. |
| ColPali | Apache 2.0 | 3k+ | Document retrieval via vision. No OCR needed. |
| Byaldi | Apache 2.0 | | User-friendly ColPali. RAG over document images. |
| LlamaIndex multimodal | MIT | 40k+ | MultiModalVectorStoreIndex. Text + image dual-index. |
| NexusRAG | MIT | | Hybrid RAG + KG. Image/table captioning. Vision LLM. |

RAG Frameworks

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| LangChain | MIT | 105k+ | General orchestration. 500+ integrations. Largest ecosystem. |
| LlamaIndex | MIT | 40k+ | Data framework. Best RAG-specific indexing/retrieval. |
| Dify | Apache 2.0 | 90k+ | Visual AI platform. Low-code RAG + agents. |
| RAGFlow | Apache 2.0 | 48k+ | Deep document parsing. Intelligent chunking. Knowledge graphs. |
| Haystack | Apache 2.0 | 20k+ | Production-safe modular pipelines. Built-in evaluation. |
| DSPy | MIT | 22k+ | Programmatic prompt/pipeline optimization. |
| LightRAG | MIT | 14k+ | Lightweight graph-based RAG. Minimal hardware. |
| txtai | Apache 2.0 | 10k+ | All-in-one. Semantic search + RAG + agents. |
| LLMWare | Apache 2.0 | 12k+ | Enterprise RAG. CPU-optimized. |
| R2R | MIT | 5k+ | Production RAG engine. REST API. |
| mem0 | Apache 2.0 | 28k+ | Memory layer for AI agents. Personalized RAG. |
| VelociRAG | MIT | | ONNX-powered 4-layer fusion. No PyTorch. MCP server. |
| ragway | MIT | | Modular. Swap components via YAML. No code changes. |

Evaluation & Observability

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| RAGAS | Apache 2.0 | 9k+ | Standard RAG quality metrics. Faithfulness, relevancy. |
| LangSmith | MIT (SDK only) | | Tracing, evaluation, debugging. Managed, closed-source platform. |
| DeepEval | Apache 2.0 | 6k+ | Unit testing for LLMs. 15+ metrics. CI/CD. |
| Phoenix (Arize) | Elastic License | 10k+ | LLM observability. Tracing, evaluation. |
| LangFuse | MIT | 8k+ | Open-source LLM engineering. Tracing, prompts, metrics. |
| MLflow | Apache 2.0 | 21k+ | LLM evaluation, tracing, registry. |
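
Alongside the LLM-judged metrics these tools offer (faithfulness, relevancy), the retrieval stage can be checked with simple label-based metrics. The sketch below computes recall@k and reciprocal rank from a ranked list of document IDs. These are classic information-retrieval measures, swapped in here because they need no model call; they are not the RAGAS or DeepEval metric implementations, and the function names are hypothetical.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(set(relevant))


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Toy ranking and labels (demonstration data only).
ranking = ["d3", "d1", "d7", "d2"]
gold = {"d1", "d2"}
```

Averaging `reciprocal_rank` over a query set gives mean reciprocal rank (MRR), which is easy to track in CI alongside the LLM-judged scores.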

Agent Frameworks

| Tool | License | Stars | Notes |
| --- | --- | --- | --- |
| LangGraph | MIT | 10k+ | Stateful multi-actor agents. Cycles, HITL. |
| CrewAI | MIT | 30k+ | Multi-agent orchestration. Role-based. |
| AutoGen | MIT | 40k+ | Microsoft. Multi-agent conversations. |
| smolagents | Apache 2.0 | 20k+ | HuggingFace. Minimal. Code agents. |
| OpenAI Agents SDK | MIT | 20k+ | Official. Handoffs, guardrails. |
| Cognee | Apache 2.0 | 2k+ | GraphRAG + agentic memory. |

How to Choose

| Use Case | Recommended Tools |
| --- | --- |
| Text-heavy digital PDFs | PyMuPDF → PyMuPDF4LLM |
| Complex layouts, tables, multi-column | Docling or OpenDataLoader |
| Scanned / image PDFs | Marker or Surya OCR + Docling |
| Self-healing / high accuracy | pdfmux |
| Academic papers (formulas, citations) | Nougat, GROBID, paper-to-md |
| PowerPoint/Word docs → text | python-pptx/python-docx or MarkItDown |
| Spreadsheets / CSV → RAG | openpyxl + pandas |
| EPUB/e-books | pandoc, calibre, ebooklib |
| Email parsing (.eml/.msg) | mail-parser or extract-msg |
| Archives (ZIP/TAR/7z) → RAG | kreuzberg, extractous, or goblintools |
| Web pages → Markdown | Essence (fastest), Crawl4AI, Firecrawl, rdrr |
| Code chunking | omnichunk or tree-sitter |
| Image captioning | BLIP-2, MetaCaptioner, CapRL |
| Vision understanding | Qwen3-VL, Molmo, Pixtral |
| Multi-modal retrieval (text + images) | CLIP, SigLIP-2, ColPali, Byaldi |
| Diagrams → structured data | diagram2graph or Schematex |
| UML / architecture diagrams | PlantUML, Mermaid, or fcp-drawio |
| Math formula image → LaTeX | pix2tex (LaTeX-OCR) or Pix2Text |
| Handwritten text recognition | Kraken (historical), Churro (VLM), or TrOCR |
| Audio transcription | Whisper (99+ languages) or Faster-Whisper |
| Real-time/streaming audio | Parakeet TDT or Whisper Turbo |
| Speaker diarization | WhisperX |
| Frame extraction from video | ffmpeg + PySceneDetect or distant-frames |
| Video RAG / long video QA | VideoRAG, FRAG, FOCUS, media-ingest |
| 3D model metadata | Open3D or trimesh |
| Geospatial data | GeoPandas + OSMnx |
| Medical imaging (DICOM) | pydicom + MONAI |
| Natural language → SQL | Vanna or Databao Agent |
| Pixel-level RAG (no text parsing) | PixelRAG |
| Chunking for RAG | chunkweaver (structure), Chonky (neural), LangChain splitters |
| Embeddings | BGE, E5, GTE (open) or all-MiniLM-L6-v2 (fast) |
| Fast vector search prototype | Chroma or FAISS |
| Production vector search (<100M) | Qdrant or Weaviate |
| Production vector search (>100M) | Milvus or Qdrant |
| Already on PostgreSQL | pgvector + pgvectorscale |
| RAG orchestration | LlamaIndex (retrieval-first) or LangChain (general) |
| Visual / low-code RAG | Dify or RAGFlow |
| Multi-agent workflows | LangGraph or CrewAI |
| Graph-enhanced RAG | Microsoft GraphRAG or LightRAG |
| LLM evaluation | RAGAS + LangFuse or DeepEval |
| Full ingestion pipeline | Docling → chunkweaver → BGE → Qdrant → LlamaIndex |

Related Awesome Lists


This list is maintained by the open-source community. Contributions welcome — open a PR to add or update entries.
