llm-bench-cli is an early-stage project for benchmarking local LLMs and OpenAI-compatible inference endpoints, with a focus on throughput, latency, and side-by-side model comparison on your own hardware.
As local inference gets more practical, the hard part is often not running a model but choosing between several of them. Different models, runtimes, and quantizations can behave very differently on the same machine. llm-bench-cli is aimed at making those tradeoffs easier to measure.
The long-term goal is to provide a lightweight benchmarking workflow for developers who care about:
- tokens per second
- time to first token
- end-to-end latency
- throughput under concurrency
- practical comparison across local-model runtimes
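The metrics above reduce to a few timestamps and a token count per request. A minimal sketch of how they relate (the `summarize` helper is illustrative, not part of the tool):

```python
def summarize(start: float, first_token_at: float, end: float, output_tokens: int) -> dict:
    """Derive the core metrics from three timestamps and a token count.

    All timestamps are in seconds (e.g. from time.perf_counter()).
    """
    total_s = end - start
    return {
        "ttft_ms": (first_token_at - start) * 1000.0,   # time to first token
        "total_duration_ms": total_s * 1000.0,          # end-to-end latency
        "tokens_per_second": output_tokens / total_s if total_s > 0 else 0.0,
    }
```

Throughput under concurrency extends this per-request view by dividing total completed requests (or tokens) by wall-clock time across all in-flight requests.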
llm-bench-cli is currently in an early public stage.
The repository already captures the project intent and basic dependency direction, but it does not yet contain the full CLI implementation described by the original concept. This README has been updated to reflect the current state honestly while keeping the intended direction clear.
The project is intended to grow toward:
- a simple CLI for repeatable local-model benchmarks
- comparisons across models served via OpenAI-compatible APIs
- benchmark suites for coding, reasoning, summarization, and instruction-following tasks
- exportable benchmark results for tracking performance over time
- lightweight reporting that works well in terminals and scripts
The intended user flow is straightforward:
- Point the tool at a local or self-hosted inference endpoint.
- Run a benchmark suite against one or more models.
- Compare latency and throughput across runs.
- Export results for later analysis.
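Under the hood, one step of that flow boils down to posting a chat-completions payload to the endpoint and normalizing the response. A sketch of the request and response shapes (pure helpers with hypothetical names; the actual HTTP call is made with httpx):

```python
def build_request(model: str, prompt: str, stream: bool = False) -> dict:
    # Payload shape for POST {base_url}/chat/completions
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def normalize_response(body: dict, latency_ms: float) -> dict:
    # Map an OpenAI-style response onto the tool's normalized result fields
    usage = body.get("usage", {})
    return {
        "output_text": body["choices"][0]["message"]["content"],
        "output_tokens": usage.get("completion_tokens"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "total_tokens": usage.get("total_tokens"),
        "latency_ms": latency_ms,
    }
```

Keeping the payload builder and response normalizer pure makes them easy to reuse across provider adapters and to test without a live endpoint.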
For many builders, model choice is now a hardware and latency question as much as a quality question. A small benchmarking tool can help make those decisions more empirical, especially when comparing Ollama, llama.cpp servers, LM Studio, vLLM, and other OpenAI-compatible endpoints.
Install from source:

```bash
git clone https://github.com/MukundaKatta/llm-bench-cli.git
cd llm-bench-cli
pip install -e .
```

The first runnable slice benchmarks an OpenAI-compatible /v1/chat/completions endpoint.
Required environment variables:

```bash
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=test-key
```

Run a benchmark:
```bash
llm-bench bench \
  --model your-model-name \
  --prompt "Explain what makes a fast benchmark useful."
```

Optional flags:

- `--stream` to measure time to first token when the endpoint supports streaming
- `--json-output` to print the normalized result schema for scripts and later comparisons
- `--timeout` to raise or lower the request timeout
Example JSON result:
```json
{
  "provider": "openai-compatible",
  "model": "your-model-name",
  "base_url": "http://localhost:8000/v1",
  "prompt": "Explain what makes a fast benchmark useful.",
  "status_code": 200,
  "success": true,
  "latency_ms": 123.4,
  "total_duration_ms": 245.7,
  "ttft_ms": 123.4,
  "output_text": "A useful benchmark...",
  "output_tokens": 42,
  "prompt_tokens": 18,
  "total_tokens": 60,
  "error": null,
  "raw_metrics": {
    "stream": true
  }
}
```

Repository layout:

```
llm-bench-cli/
├── README.md
├── requirements.txt
└── LICENSE
```
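Because the result schema carries both token counts and durations, downstream scripts can derive throughput directly from an exported result. A small sketch (assumes the JSON shape shown above; the `tokens_per_second` helper is illustrative):

```python
import json

def tokens_per_second(result: dict) -> float:
    # Output tokens over the full wall-clock duration of the request
    return result["output_tokens"] / (result["total_duration_ms"] / 1000.0)

raw = '{"output_tokens": 42, "total_duration_ms": 245.7}'
rate = tokens_per_second(json.loads(raw))
```

The same calculation works on a stream of results piped from `--json-output`, which is the intended path for comparing runs in scripts.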
The current dependencies map directly onto the first implementation:

- `httpx` for calling benchmark targets
- `rich` for terminal output
- `click` for the CLI interface
- a normalized provider adapter contract in `src/llm_bench_cli/adapters.py`
- a shared benchmark result schema in `src/llm_bench_cli/models.py`
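The adapter contract mentioned above can be pictured as a small interface that every provider implements, returning the shared result schema. A hypothetical sketch (all names here are illustrative; the actual contract lives in `src/llm_bench_cli/adapters.py` and `src/llm_bench_cli/models.py`):

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class BenchResult:
    """Shared result schema every adapter normalizes into."""
    model: str
    success: bool
    latency_ms: float
    output_tokens: Optional[int] = None
    error: Optional[str] = None

class ProviderAdapter(Protocol):
    """Contract a provider backend must satisfy."""
    name: str

    def run(self, model: str, prompt: str, timeout: float = 30.0) -> BenchResult:
        """Execute one benchmark request and normalize it into a BenchResult."""
        ...
```

Keeping the schema in one place is what lets Ollama, llama.cpp, vLLM, and other OpenAI-compatible targets be compared side by side without per-provider result handling.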
Planned source layout:

```
llm-bench-cli/
├── README.md
├── pyproject.toml
├── src/llm_bench_cli/
│   ├── adapters.py
│   ├── cli.py
│   ├── models.py
│   └── __init__.py
└── tests/
    └── test_cli.py
```
Near-term priorities include:
- adding more benchmark suites and prompts
- exporting run results to files
- supporting more provider adapters behind the same schema
- adding concurrency and throughput runs
- expanding reporting beyond a single benchmark invocation
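For the concurrency and throughput item above, one plausible shape is fanning out N requests with asyncio and aggregating wall-clock throughput. A sketch using a stand-in `fake_send` coroutine in place of a real HTTP call (all names hypothetical; not the project's implementation):

```python
import asyncio
import time

async def run_concurrent(send, n: int) -> dict:
    """Fire n requests at once and aggregate latency and throughput."""
    start = time.perf_counter()
    latencies = await asyncio.gather(*(send() for _ in range(n)))
    wall_s = time.perf_counter() - start
    return {
        "requests": n,
        "mean_latency_ms": sum(latencies) / n,
        "requests_per_second": n / wall_s,
    }

async def fake_send() -> float:
    # Stand-in for an HTTP request; returns a simulated latency in ms
    await asyncio.sleep(0.01)
    return 10.0
```

With real requests, per-request latency typically rises under load while aggregate requests-per-second rises until the server saturates, which is exactly the tradeoff a throughput run should expose.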
Contributions, ideas, and feedback are welcome, especially around:
- benchmark design
- local-model comparison workflows
- CLI ergonomics
- reporting formats
- evaluation categories that are practical for developers
MIT License. See LICENSE for details.