| 1 | +# Build a Knowledge Graph for Docs — Neo4j (CocoIndex v1) |
| 2 | + |
| 3 | +Turn a folder of Markdown documentation into a concept knowledge graph in |
| 4 | +[Neo4j](https://neo4j.com/). For each document an LLM (via |
| 5 | +[LiteLLM](https://docs.litellm.ai/) + [instructor](https://python.useinstructor.com/)) |
| 6 | +produces a short summary and a set of `(subject, predicate, object)` triples |
| 7 | +about the concepts it covers — "concepts, not code" — and the triples become a |
| 8 | +property graph. |
| 9 | + |
| 10 | +This is the CocoIndex **v1** port of the blog post |
| 11 | +[Build a Knowledge Graph for Documents](https://cocoindex.io/blogs/knowledge-graph-for-docs/). |
| 12 | + |
| 13 | +Please drop [CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) a |
| 14 | +star to support us and stay tuned for more updates. Thank you so much 🥥🤗. |
| 16 | + |
| 17 | +## What this builds |
| 18 | + |
| 19 | +- `Document` nodes — one per Markdown file, keyed by filename, with an |
| 20 | + LLM-generated `title` and `summary` |
| 21 | +- `Entity` nodes — one per distinct concept named in a triple, keyed by `value` |
| 22 | +- Relationships: |
| 23 | + - `RELATIONSHIP` — `Entity → Entity`, with the `predicate` stored on the edge |
| 24 | + - `MENTION` — `Document → Entity`, recording which document named which concept |
| 25 | + |
| 26 | +The flow watches the source folder and keeps the graph up to date |
| 27 | +incrementally. |
| 28 | + |
| 29 | +## How it works |
| 30 | + |
| 31 | +The pipeline runs in two phases: |
| 32 | + |
| 33 | +1. **Per-file extraction.** Read each Markdown file, extract a `DocumentSummary` |
| 34 | + (title + summary) and a list of relationship triples with LiteLLM + |
| 35 | + instructor. The `Document` node is declared in this phase; the triples are |
| 36 | + carried forward. |
| 37 | +2. **Graph building.** A single pass declares the deduplicated `Entity` nodes |
| 38 | + and the `RELATIONSHIP` / `MENTION` edges across all documents. Each distinct |
| 39 | + triple is keyed by a stable hash, so re-asserting the same fact in another |
| 40 | + doc maps to the same edge. |
| 41 | + |
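The stable-hash keying in step 2 can be sketched as follows. This is a minimal illustration, not the code in `main.py`: the `Triple` class, its field names, `triple_key`, and `dedupe` are all hypothetical, and the choice of SHA-256 truncated to 16 hex characters is an assumption.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One extracted fact. Field names are illustrative, not taken from main.py."""

    subject: str
    predicate: str
    object: str


def triple_key(t: Triple) -> str:
    """Return a key that is identical across runs and processes."""
    # Join with a separator that should not occur in concept names, so that
    # ("ab", "c", "d") and ("a", "bc", "d") produce different payloads.
    payload = "\x1f".join((t.subject, t.predicate, t.object))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def dedupe(triples_by_doc: dict[str, list[Triple]]) -> dict[str, Triple]:
    """Collapse identical triples from all documents into one entry per key."""
    edges: dict[str, Triple] = {}
    for triples in triples_by_doc.values():
        for t in triples:
            # The first document to assert a fact wins; later ones map to the same key.
            edges.setdefault(triple_key(t), t)
    return edges
```

A content hash is used here instead of Python's built-in `hash()` because string hashes are randomized per process by default, which would give the same fact a different key on every run.
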
| 42 | +CocoIndex reconciles changes incrementally — re-running after editing one doc |
| 43 | +only re-extracts that doc, and the graph pass only re-runs when the set of |
| 44 | +triples changes. To collapse near-identical entity names (e.g. "CocoIndex" vs |
| 45 | +"Cocoindex"), add an entity-resolution pass like the one in |
| 46 | +[`meeting_notes_graph_neo4j`](../meeting_notes_graph_neo4j). |
| 47 | + |
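For the simplest case mentioned above, names that differ only in case or spacing, a normalization step may be enough. The sketch below is a much cruder stand-in for a real entity-resolution pass and is not the approach used in `meeting_notes_graph_neo4j`; `normalize_entity` and `group_aliases` are hypothetical helpers.

```python
import unicodedata


def normalize_entity(name: str) -> str:
    """Map trivially different spellings of a concept name to one canonical form."""
    # NFKC folds Unicode compatibility forms; casefold() removes case differences.
    folded = unicodedata.normalize("NFKC", name).casefold()
    # split()/join collapses runs of whitespace and trims both ends.
    return " ".join(folded.split())


def group_aliases(names: list[str]) -> dict[str, list[str]]:
    """Group the original spellings under their normalized form."""
    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(normalize_entity(name), []).append(name)
    return groups
```

Using the normalized form as the `Entity` key would merge "CocoIndex" and "Cocoindex" into one node. It would not merge synonyms or abbreviations, which is where a fuller resolution pass is still needed.
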
| 48 | +## Prerequisites |
| 49 | + |
| 50 | +- A running Neo4j 5.18+ instance: |
| 51 | + ```sh |
| 52 | + docker run -d \ |
| 53 | + -p 7474:7474 -p 7687:7687 \ |
| 54 | + -e NEO4J_AUTH=neo4j/cocoindex \ |
| 55 | + --name cocoindex-neo4j \ |
| 56 | + neo4j:5.26-community |
| 57 | + ``` |
| 58 | + The browser UI is at <http://localhost:7474>; log in with `neo4j` / |
| 59 | + `cocoindex`. |
| 60 | + |
| 61 | +- An LLM. Defaults to OpenAI (set `OPENAI_API_KEY`); to use another model, set |
| 62 | + `LLM_MODEL` to a model name from any [LiteLLM provider](https://docs.litellm.ai/docs/providers), e.g. |
| 63 | + `LLM_MODEL=ollama/llama3.2` to run the extraction locally with no API key. |
| 64 | + |
| 65 | +## Environment |
| 66 | + |
| 67 | +Copy `.env.example` to `.env` and fill in the blanks: |
| 68 | + |
| 69 | +```sh |
| 70 | +cp .env.example .env |
| 71 | +set -a && source .env && set +a |
| 72 | +``` |
| 73 | + |
| 74 | +## Run |
| 75 | + |
| 76 | +Install dependencies: |
| 77 | + |
| 78 | +```sh |
| 79 | +uv pip install -e . |
| 80 | +``` |
| 81 | + |
| 82 | +This example ships a small `markdown_files/` folder of sample concept docs so it |
| 83 | +runs out of the box. Build/update the graph: |
| 84 | + |
| 85 | +```sh |
| 86 | +cocoindex update main |
| 87 | +``` |
| 88 | + |
| 89 | +To index your own docs, drop `.md` / `.mdx` files into `markdown_files/` (or |
| 90 | +point `sourcedir` in `main.py` at another directory — e.g. CocoIndex's own |
| 91 | +`docs/`) and re-run. |
| 92 | + |
| 93 | +## Browse the knowledge graph |
| 94 | + |
| 95 | +Open Neo4j Browser at <http://localhost:7474>, log in, and run Cypher queries: |
| 96 | + |
| 97 | +```cypher |
| 98 | +// Everything |
| 99 | +MATCH p=()-->() RETURN p LIMIT 200 |
| 100 | + |
| 101 | +// Concept-to-concept relationships |
| 102 | +MATCH (a:Entity)-[r:RELATIONSHIP]->(b:Entity) |
| 103 | +RETURN a.value, r.predicate, b.value |
| 104 | + |
| 105 | +// Which documents mention which concepts |
| 106 | +MATCH (d:Document)-[:MENTION]->(e:Entity) |
| 107 | +RETURN d.filename, d.title, e.value |
| 108 | + |
| 109 | +// Concepts mentioned in the most documents |
| 110 | +MATCH (d:Document)-[:MENTION]->(e:Entity) |
| 111 | +RETURN e.value, count(DISTINCT d) AS docs |
| 112 | +ORDER BY docs DESC LIMIT 10 |
| 113 | +``` |
| 114 | + |
| 115 | +To wipe the graph between runs: |
| 116 | + |
| 117 | +```cypher |
| 118 | +MATCH (n) DETACH DELETE n |
| 119 | +``` |