Build your first CocoIndex pipeline in 5 minutes.
Convert a folder of PDFs into Markdown — and watch it only reprocess what actually changed.
This is the companion repo for the CocoIndex Quickstart. Clone it, run two commands, and you have a working incremental data pipeline.
pdf_files/*.pdf ──▶ Docling (PDF → Markdown) ──▶ out/*.md
- Read PDF files from a local directory
- Convert each one to Markdown with Docling
- Write the Markdown to an output directory as a target state
You declare the transformation in plain Python: target_state = transformation(source_state). When a source file or your logic changes, CocoIndex works out the minimum work needed and keeps the output in sync. There are no external services; everything runs on your CPU.
- Python 3.11+
- That's it — no Postgres, no Docker, no API keys.
1. Install dependencies (a sample PDF — Attention Is All You Need — is already in pdf_files/):
pip install -e .
2. Run the pipeline:
cocoindex update main.py
Your Markdown appears in ./out/, one file per PDF:
ls out/
# 1706.03762v7.md
That's the whole thing. 🎉
Re-running never redoes work that's already done. Try it:
# Add another PDF, then:
cocoindex update main.py # ⚡ only the new file is converted
# Edit or replace a PDF, then:
cocoindex update main.py # ⚡ only the changed file is reprocessed
# Remove a PDF, then:
rm pdf_files/some.pdf
cocoindex update main.py # 🧹 its Markdown is deleted automatically
Every other file is skipped: CocoIndex memoizes by content and code, so re-runs are effectively free. This is what makes it practical to run real pipelines over thousands of files.
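To make "memoizes by content and code" concrete, here is a minimal sketch of the general idea in plain Python. It is not CocoIndex's implementation or API: `Memo`, `code_version`, and `to_upper` are hypothetical names, and the sketch takes an explicit version string where CocoIndex detects logic changes for you.

```python
import hashlib


class Memo:
    """Toy in-memory memo keyed by (logic version, input bytes). Hypothetical helper."""

    def __init__(self):
        self._cache = {}
        self.misses = 0  # counts how often the function actually ran

    def run(self, fn, code_version: str, content: bytes):
        # The key changes if either the logic version or the content changes.
        key = hashlib.sha256(code_version.encode() + b"\0" + content).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = fn(content)
        return self._cache[key]


def to_upper(data: bytes) -> str:
    # Stand-in for an expensive step such as PDF conversion.
    return data.decode().upper()


memo = Memo()
memo.run(to_upper, "v1", b"hello")   # first run: computed
memo.run(to_upper, "v1", b"hello")   # same content and logic: skipped
memo.run(to_upper, "v1", b"hello!")  # content changed: recomputed
memo.run(to_upper, "v2", b"hello")   # logic changed: recomputed
print(memo.misses)  # 3
```

Only three of the four calls do real work. The same reasoning is why adding one PDF to a large folder should cost roughly one conversion.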
The whole pipeline lives in main.py, about 40 lines. The core of it (imports and converter setup omitted):
@coco.fn(memo=True) # memoized: unchanged files are skipped
def process_file(file, outdir):
    markdown = _converter.convert(file.file_path.resolve()) \
        .document.export_to_markdown()
    localfs.declare_file(outdir / (file.file_path.path.stem + ".md"), markdown,
                         create_parent_dirs=True) # declare the output you want to exist

@coco.fn
async def app_main(sourcedir, outdir):
    files = localfs.walk_dir(sourcedir, recursive=True,
                             path_matcher=PatternFilePathMatcher(["**/*.pdf"]))
    await coco.mount_each(process_file, files.items(), outdir) # one component per file
You describe the output state you want; the engine handles inserts, updates, and deletes for you.
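As a rough illustration of what "handles inserts, updates, and deletes" means, the sketch below compares a desired set of output files with what already exists and derives the three kinds of change. It assumes outputs can be modelled as a name-to-content mapping; `reconcile` is a hypothetical helper, not part of CocoIndex.

```python
def reconcile(desired: dict[str, str], existing: dict[str, str]):
    """Return (inserts, updates, deletes) needed to make `existing` match `desired`."""
    inserts = sorted(k for k in desired if k not in existing)
    updates = sorted(k for k in desired
                     if k in existing and desired[k] != existing[k])
    deletes = sorted(k for k in existing if k not in desired)
    return inserts, updates, deletes


# What is on disk now versus what the pipeline declares should exist.
existing = {"a.md": "old A", "b.md": "B", "gone.md": "X"}
desired = {"a.md": "new A", "b.md": "B", "c.md": "C"}

plan = reconcile(desired, existing)
print(plan)  # (['c.md'], ['a.md'], ['gone.md'])
```

The unchanged file, b.md, appears in none of the three lists, which mirrors how an untouched PDF is skipped on a re-run.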
- 📖 Read the full Quickstart and Core Concepts
- 🔎 Build something bigger — text embeddings & RAG, codebase indexing, knowledge graphs
- 🗂️ Browse all examples
If this helped you get started, the easiest way to support CocoIndex is to give it a ⭐ on GitHub — it's how other developers find the project. Questions or ideas? Come say hi on Discord.