CocoIndex Quickstart 🥥

Build your first CocoIndex pipeline in 5 minutes.
Convert a folder of PDFs into Markdown — and watch it only reprocess what actually changed.

This is the companion repo for the CocoIndex Quickstart. Clone it, run two commands, and you have a working incremental data pipeline.

What you'll build

pdf_files/*.pdf  ──▶  Docling (PDF → Markdown)  ──▶  out/*.md

1. Read PDF files from a local directory
2. Convert each one to Markdown with Docling
3. Write the Markdown to an output directory as a target state

You declare the transformation in plain Python: target_state = transformation(source_state). When a source file or your logic changes, CocoIndex works out the minimum work needed and keeps the output in sync. It needs no external services and runs entirely on your CPU.

Prerequisites

Python 3.11+
That's it — no Postgres, no Docker, no API keys.

Quickstart

1. Install dependencies (a sample PDF — Attention Is All You Need — is already in pdf_files/):

pip install -e .

2. Run the pipeline:

cocoindex update main.py

Your Markdown appears in ./out/, one file per PDF:

ls out/
# 1706.03762v7.md

That's the whole thing. 🎉

The magic: incremental processing

Re-running never redoes work that's already done. Try it:

# Add another PDF, then:
cocoindex update main.py     # ⚡ only the new file is converted

# Edit or replace a PDF, then:
cocoindex update main.py     # ⚡ only the changed file is reprocessed

# Remove a PDF, then:
rm pdf_files/some.pdf
cocoindex update main.py     # 🧹 its Markdown is deleted automatically

Every other file is skipped. CocoIndex memoizes by content and code, so a re-run only does work for files whose content or processing logic has changed. This is what makes it practical to run real pipelines over thousands of files.

How it works

The whole pipeline is main.py, about 40 lines. Its core is two functions (imports and converter setup omitted here):

@coco.fn(memo=True)                       # memoized: unchanged files are skipped
def process_file(file, outdir):
    markdown = _converter.convert(file.file_path.resolve()) \
                         .document.export_to_markdown()
    localfs.declare_file(outdir / (file.file_path.path.stem + ".md"), markdown,
                         create_parent_dirs=True)   # declare the output you want to exist

@coco.fn
async def app_main(sourcedir, outdir):
    files = localfs.walk_dir(sourcedir, recursive=True,
                             path_matcher=PatternFilePathMatcher(["**/*.pdf"]))
    await coco.mount_each(process_file, files.items(), outdir)  # one component per file

You describe the output state you want; the engine handles inserts, updates, and deletes for you.
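The "describe the state, let the engine reconcile" idea can be sketched in plain Python. This is a toy illustration under stated assumptions, not CocoIndex's API: `desired_outputs` and `plan` are hypothetical helpers, file contents are faked as strings, and the existing output directory is modeled as a dict.

```python
from pathlib import PurePosixPath


def desired_outputs(sources: dict[str, str]) -> dict[str, str]:
    """Toy transformation: one .md target per .pdf source (content is faked)."""
    return {
        str(PurePosixPath(name).with_suffix(".md")): f"# {text}"
        for name, text in sources.items()
    }


def plan(desired: dict[str, str], existing: dict[str, str]) -> dict[str, list[str]]:
    """Diff the desired targets against what already exists."""
    return {
        # Wanted but not present yet.
        "insert": sorted(k for k in desired if k not in existing),
        # Present, but the content no longer matches.
        "update": sorted(
            k for k in desired if k in existing and desired[k] != existing[k]
        ),
        # Present, but no source produces it anymore.
        "delete": sorted(k for k in existing if k not in desired),
    }


if __name__ == "__main__":
    existing = {"a.md": "# A", "b.md": "# old", "c.md": "# C"}
    sources = {"a.pdf": "A", "b.pdf": "new", "d.pdf": "D"}
    print(plan(desired_outputs(sources), existing))
```

Here a.md is untouched, b.md is updated, d.md is inserted, and c.md is deleted because its source is gone, which mirrors the add, edit, and remove cases shown earlier.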

Next steps

📖 Read the full Quickstart and Core Concepts
🔎 Build something bigger — text embeddings & RAG, codebase indexing, knowledge graphs
🗂️ Browse all examples

Support us ❤️

If this helped you get started, the easiest way to support CocoIndex is to give it a ⭐ on GitHub — it's how other developers find the project. Questions or ideas? Come say hi on Discord.
