# Benchmarks
Run `hfo --bench <tag>` to measure tokens-per-second and time-to-first-token on your rig against a standard 4-prompt suite. Save the result as a submission file and contribute it to the community leaderboard — a public catalogue of GPU × quant × throughput tuples nobody else is collecting.
## Run it
Expected runtime: 15–60 seconds on a consumer GPU, 1–3 minutes on CPU-only. The warmup prompt is excluded from the aggregate so cold-start latency doesn't skew the numbers.
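The exclusion rule is simple enough to sketch. This is illustrative only — `aggregate_tok_s` and the result-record shape are hypothetical, not hfo's actual internals:

```python
def aggregate_tok_s(results):
    """Mean tok/s across the timed prompts. The warmup run is dropped
    so first-load (cold-start) latency doesn't skew the figure."""
    timed = [r for r in results if r["name"] != "warmup"]
    return sum(r["tok_s"] for r in timed) / len(timed)
```

A slow warmup entry therefore has no effect on the reported aggregate, however cold the model cache was.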
## What the bench measures
The suite is four fixed prompts sent through Ollama's streaming HTTP API at `/api/generate`, with `temperature: 0.2` to keep output length stable across runs:
| Prompt | Token budget | Task |
|---|---|---|
| `warmup` | 16 | Single-word output — absorbs the first-load penalty; excluded from the aggregate |
| `code` | 128 | Write a small Python function, no commentary |
| `reasoning` | 192 | One-paragraph explanation of backpropagation |
| `translation` | 128 | EN → FR and EN → ES of the pangram |
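Each row above maps to one request body. A minimal sketch of what that payload might look like, assuming Ollama's standard `options` fields (`build_payload` is a hypothetical helper, not hfo's code):

```python
OLLAMA_GENERATE = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model, prompt, budget):
    """One bench request: streaming on, temperature pinned at 0.2,
    and num_predict capping the prompt's token budget."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,
        "options": {"temperature": 0.2, "num_predict": budget},
    }
```

`num_predict` is Ollama's option for capping generated tokens, so the budgets in the table (16/128/192/128) would be passed there.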
For each prompt the runner records:
- TTFT (ms) — time from HTTP request to the first streamed chunk.
- Total (ms) — end-to-end wall-clock for the full generation.
- Output tokens — taken from Ollama's `eval_count` field.
- Tok/s — the preferred formula is `eval_count / (eval_duration / 1e9)` (server-reported). Falls back to wall-clock timing only when `eval_duration` is missing.
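These metrics can all be recovered from the stream itself. A sketch, assuming the response arrives as newline-delimited JSON chunks the way Ollama's streaming API sends them (`measure` is illustrative, not hfo's implementation):

```python
import json
import time

def measure(stream_lines):
    """TTFT, total wall-clock, and tok/s from a streamed /api/generate
    response: one JSON object per line, final chunk has done=True."""
    start = time.monotonic()
    ttft_ms = total_ms = None
    final = {}
    for line in stream_lines:
        now = time.monotonic()
        if ttft_ms is None:
            ttft_ms = (now - start) * 1000  # first streamed chunk
        chunk = json.loads(line)
        if chunk.get("done"):
            final = chunk
            total_ms = (now - start) * 1000
    tokens = final.get("eval_count", 0)
    if final.get("eval_duration"):  # server-reported, in nanoseconds
        tok_s = tokens / (final["eval_duration"] / 1e9)
    else:  # wall-clock fallback, only when eval_duration is missing
        tok_s = tokens / (total_ms / 1000) if total_ms else 0.0
    return {"ttft_ms": ttft_ms, "total_ms": total_ms, "tok_s": tok_s}
```

The server-reported path is preferred because `eval_duration` excludes network and client-side overhead that inflates wall-clock timing.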
## Sample output
## Submission format
When you pass `--out <file>`, hfo writes a JSON document that follows the `hfo-bench-v1` schema:
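As a rough illustration only — the field names below are guesses pieced together from the surrounding prose (GPU string, OS, schema id, per-prompt metrics), NOT the actual `hfo-bench-v1` schema:

```python
import json

# Hypothetical submission shape; field names are illustrative guesses,
# not the real hfo-bench-v1 schema.
submission = {
    "schema": "hfo-bench-v1",
    "gpu": "NVIDIA GeForce RTX 4090",  # edit before the PR if you
    "os": "Linux",                     # want to obfuscate these
    "model": "qwen2.5-coder:7b",
    "results": [
        {"prompt": "code", "ttft_ms": 120, "total_ms": 3400,
         "output_tokens": 128, "tok_s": 41.2},
    ],
}

with open("my-run.json", "w") as f:
    json.dump(submission, f, indent=2)
```

Whatever the real field set is, the file is plain JSON, so it can be inspected and edited by hand before submitting.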
## Contribute a submission
The leaderboard is PR-based — no server, no telemetry, no account required. To contribute:
- Run `hfo --bench <tag> --out my-run.json`.
- Open a pull request against the hfo repo adding the file under `docs/benchmarks/submissions/` (create the directory if it doesn't exist yet).
- Name the file `<gpu-slug>_<tag-slug>.json` — e.g. `rtx4090_qwen25coder-7b-q5km.json`.
Personal information: the submission file only contains the fields shown above. No hostname, username, MAC address, or path is recorded. If you want to obfuscate the GPU string or OS, edit the JSON before opening the PR.
## Reproducibility notes
- Model state matters. Close any other Ollama-loaded model first (`ollama stop <other>`). With two 8-bit models loaded, KV cache pressure can cut tok/s by 15–30%.
- Thermals matter. Run once cold, once after a 2-minute sustained load. Report the sustained figure — it reflects real workloads better than the cold run.
- Temperature is fixed at 0.2 to keep output length predictable across runs. Don't override it.
- Context window uses Ollama's default for the model. If you override it via Modelfile, note that in your PR description — tok/s drops with longer contexts.
## What the benchmarks don't tell you
- Quality. Tok/s says nothing about whether the output is correct. Use a separate eval harness (MMLU, HumanEval, your own) for that.
- Long-context behaviour. This bench generates ≤ 192 tokens per prompt. Long generations degrade tok/s on most hardware — that's a different measurement.
- Multi-turn overhead. Each prompt is fresh. Conversational back-and-forth with large KV caches will be slower than what this bench reports.