Skip to content

SingggggYee/awesome-llm-knowledge-bases

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Awesome LLM Knowledge Bases Awesome PRs Welcome CC0

Andrej Karpathy's viral post on LLM Knowledge Bases hit 1.7M views. This is the definitive resource list for the workflow he described.

"raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM."

Andrej Karpathy, Apr 3, 2026

The workflow: Ingest → Compile → Lint → View → Query → Enhance → Repeat.

Part of the LLM KB Ecosystem: karpathy-kb-template | wiki-compiler | kb-lint


Contents


Data Ingestion

Tools for converting web pages, PDFs, papers, and other sources into clean markdown.

  • Obsidian Web Clipper - Browser extension that clips web pages directly into your Obsidian vault as markdown.
  • Markdownload - Browser extension to convert web pages to markdown files.
  • Jina Reader - Converts any URL to LLM-friendly markdown via r.jina.ai. Handles JavaScript-rendered pages.
  • Docling - IBM's document conversion library. Parses PDFs, DOCX, PPTX, HTML to markdown with table and figure extraction.
  • Marker - Converts PDF, EPUB, and MOBI to markdown with high accuracy. Handles complex layouts, tables, and equations.
  • Trafilatura - Python library for web scraping and text extraction. Focuses on main content extraction from web pages.
  • Pandoc - Universal document converter. Converts between dozens of formats including markdown, LaTeX, DOCX, and HTML.
  • pdf2md - Converts PDF files to markdown, preserving structure and formatting.
  • Unstructured - General-purpose document parsing library. Handles PDFs, images, HTML, Word docs, and more.
  • Firecrawl - Crawls websites and converts pages to clean markdown. Built for LLM consumption.
  • Crawl4AI - Open-source LLM-friendly web crawler that outputs clean markdown with structured extraction.
  • MarkItDown - Microsoft's Python tool for converting various file formats (PDF, DOCX, XLSX, PPTX, images, audio) to markdown.
  • Zerox - Zero-shot PDF OCR to markdown using vision models. Simple API for document extraction.
  • MinerU - High-quality document content extraction tool supporting PDF to markdown/JSON conversion.

Wiki Compilation

LLM-powered tools that organize, compile, and structure knowledge into coherent wikis.

  • wiki-compiler - LLM-driven compiler that organizes raw markdown fragments into a structured, interlinked wiki.
  • Fabric - AI-powered framework for augmenting humans. Includes patterns for extracting and organizing knowledge.
  • Khoj - Personal AI assistant that indexes your markdown notes and documents for natural language interaction.
  • Quivr - Personal productivity assistant that ingests documents and builds a searchable knowledge base.
  • Mem0 - Memory layer for AI applications. Persists and organizes knowledge across interactions.

Knowledge Base Linting

Tools for checking consistency, finding gaps, and maintaining quality in markdown knowledge bases.

  • kb-lint - Linter for markdown knowledge bases. Detects broken links, orphan pages, inconsistent terminology, and coverage gaps.
  • Markdownlint - Style checker and linting tool for markdown files. Enforces consistent formatting.
  • markdown-link-check - Checks all hyperlinks in markdown files for broken or dead links.
  • Obsidian Linter - Obsidian plugin that enforces consistent markdown formatting and style rules across your vault.
  • Vale - Prose linter that brings code-like linting to natural language. Supports custom style rules.

Obsidian Plugins

Obsidian plugins that enhance the knowledge base workflow.

  • Obsidian Copilot - AI assistant inside Obsidian. Chat with your notes, generate content, and get suggestions using multiple LLM providers.
  • Smart Connections - AI-powered note connections. Finds related notes using embeddings and enables chat with your vault.
  • Dataview - Query engine for your vault. Treat your notes as a database with inline queries and JavaScript API.
  • Marp Slides - Create presentation slides from markdown notes using the Marp framework.
  • Templater - Advanced template engine for Obsidian. Create dynamic templates with JavaScript execution.
  • Obsidian Git - Automatic backup and version control for your vault using Git.
  • Local GPT - Run local LLMs directly within Obsidian for private AI-assisted note-taking.
  • Text Generator - AI text generation plugin supporting multiple providers. Generates, rewrites, and summarizes within notes.
  • Canvas - Built-in infinite canvas for visual knowledge mapping and spatial organization of notes.

IDE & Viewers

Applications for viewing, editing, and navigating markdown knowledge bases.

  • Obsidian - The gold standard for local-first markdown knowledge bases. Graph view, backlinks, plugins ecosystem.
  • Logseq - Open-source outliner-style knowledge base with bidirectional linking and graph visualization.
  • Notion - All-in-one workspace with databases, wikis, and AI features. Cloud-based.
  • Foam - Personal knowledge management and sharing system built on VS Code and markdown.
  • Dendron - Developer-focused knowledge management tool built on VS Code. Hierarchical note organization.
  • SiYuan - Privacy-first personal knowledge management system with block-level references and sync.
  • Zettlr - Markdown editor designed for academic writing and Zettelkasten-style note-taking.
  • Marktext - Simple and elegant open-source markdown editor with real-time preview.

RAG & Search

Retrieval-Augmented Generation frameworks and local search engines for querying knowledge bases.

  • LlamaIndex - Data framework for connecting custom data sources to LLMs. Indexing, retrieval, and query engines.
  • LangChain - Framework for developing LLM-powered applications with chains, agents, and retrieval.
  • Haystack - End-to-end NLP framework for building RAG pipelines, search systems, and question answering.
  • RAGFlow - Open-source RAG engine with deep document understanding and chunk-level citation.
  • Chroma - Open-source embedding database. Simple API for storing and querying document embeddings.
  • Qdrant - High-performance vector similarity search engine with filtering and payload support.
  • Milvus - Cloud-native vector database for scalable similarity search and AI applications.
  • txtai - All-in-one embeddings database for semantic search, LLM orchestration, and language model workflows.
  • Vanna - RAG framework for SQL generation. Train on your database schema and documentation.

LLM Agents & Frameworks

AI coding agents and frameworks for operating on knowledge bases via CLI.

  • Claude Code - Anthropic's agentic CLI for Claude. Operates on files, runs commands, and iterates on codebases and knowledge bases.
  • Cursor - AI code editor with deep codebase understanding. Chat, edit, and generate across files.
  • Aider - AI pair programming in your terminal. Works with local Git repos and supports multiple LLM providers.
  • OpenAI Codex CLI - OpenAI's lightweight coding agent that runs in the terminal with sandboxed execution.
  • Continue - Open-source AI code assistant for VS Code and JetBrains. Supports multiple models.
  • Open Interpreter - Natural language interface for your computer. Runs code locally to complete tasks.
  • CrewAI - Framework for orchestrating role-playing AI agents that collaborate on complex tasks.
  • AutoGen - Microsoft's framework for building multi-agent conversational AI systems.
  • Pydantic AI - Agent framework built on Pydantic for type-safe, structured AI application development.

Visualization & Output

Tools for turning knowledge base content into presentations, diagrams, and visual outputs.

  • Marp - Markdown presentation ecosystem. Convert markdown files to slides, PDFs, and HTML presentations.
  • Mermaid - Generate diagrams and flowcharts from markdown-like text. Supported natively in GitHub markdown.
  • D3.js - JavaScript library for data-driven visualizations. Create interactive charts and graphs from knowledge base data.
  • Markmap - Visualize markdown documents as interactive mind maps.
  • Matplotlib - Python plotting library for generating charts, graphs, and figures from data.
  • Excalidraw - Virtual whiteboard for sketching hand-drawn-style diagrams. Has an Obsidian integration.
  • Slidev - Presentation slides for developers using markdown and Vue components.
  • reveal.js - HTML presentation framework with markdown support and a rich plugin ecosystem.

Synthetic Data & Fine-tuning

Tools for distilling knowledge bases into training data and fine-tuned model weights.

  • Distilabel - Framework for synthetic data generation and AI feedback. Create training datasets from knowledge bases.
  • Axolotl - Streamlined fine-tuning tool supporting multiple architectures. LoRA, QLoRA, full fine-tuning.
  • Unsloth - Fast LLM fine-tuning with 2x speed and 60% less memory. Supports Llama, Mistral, and more.
  • LitGPT - Hackable implementation of open-source LLMs for pretraining, fine-tuning, and deployment.
  • MLX - Apple's array framework for machine learning on Apple silicon. Efficient local fine-tuning.
  • Argilla - Collaboration platform for AI engineers and domain experts to build high-quality datasets.

Workflows & Guides

Blog posts, tutorials, and videos about building LLM knowledge bases.

Similar Projects

Existing knowledge base, second brain, and personal wiki projects.


Contributing

Contributions welcome! Please read the contributing guidelines first.


License

CC0

To the extent possible under law, the authors have waived all copyright and related or neighboring rights to this work.

About

A curated list of tools for building LLM-powered personal knowledge bases

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors