Run it

```shell
# Quick run — prints per-prompt + aggregate tok/s and TTFT, nothing saved
$ hfo --bench llama3.1:8b

# Full submission — writes a JSON file you can share
$ hfo --bench llama3.1:8b --out bench.json
```

Expected runtime: 15–60 seconds on a consumer GPU, 1–3 minutes on CPU-only hardware. The warmup prompt is excluded from the aggregate so cold-start latency doesn't skew the numbers.

What the bench measures

The suite consists of four fixed prompts sent through Ollama's streaming HTTP API at /api/generate, with temperature: 0.2 to keep output length stable across runs:

Prompt        Token budget   Task
warmup        16             Single word output — absorbs the first-load penalty, excluded from aggregate
code          128            Write a small Python function, no commentary
reasoning     192            One-paragraph explanation of backpropagation
translation   128            EN → FR and EN → ES of the pangram
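The request each run sends can be sketched as follows. The field names (`model`, `prompt`, `stream`, `options.temperature`, `options.num_predict`) follow Ollama's /api/generate API; the prompt text here is a placeholder, not hfo's actual prompt, and `build_request` is an illustrative helper, not part of hfo.

```python
import json

# Token budgets from the table above; prompt texts are placeholders.
PROMPT_BUDGETS = {
    "warmup": 16,
    "code": 128,
    "reasoning": 192,
    "translation": 128,
}

def build_request(model: str, prompt_id: str, prompt_text: str) -> dict:
    """Build a /api/generate payload for one benchmark prompt (sketch)."""
    return {
        "model": model,
        "prompt": prompt_text,
        "stream": True,  # streamed chunks are what make TTFT measurable
        "options": {
            "temperature": 0.2,                       # fixed across runs
            "num_predict": PROMPT_BUDGETS[prompt_id], # per-prompt token budget
        },
    }

payload = build_request("llama3.1:8b", "code", "Write a small Python function ...")
print(json.dumps(payload, indent=2))
```

With `"stream": True`, Ollama returns one JSON chunk per token batch; the runner timestamps the first chunk for TTFT and reads the final chunk's summary fields for the token counts.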

For each prompt the runner records:

  • TTFT (ms) — time from HTTP request to the first streamed chunk.
  • Total (ms) — end-to-end wall-clock for the full generation.
  • Output tokens — taken from Ollama's eval_count field.
  • Tok/s — preferred formula is eval_count / (eval_duration / 1e9) (server-reported). Falls back to wall-clock only when eval_duration is missing.

Sample output

```
── Benchmarking qwen2.5-coder:7b-q5_k_m ────────────────────────────────────
Model        qwen2.5-coder:7b-q5_k_m
GPU          NVIDIA GeForce RTX 4090
VRAM         24 GB
RAM          64 GB
Ollama       0.21.3
── Per-prompt ──────────────────────────────────────────────────────────────
warmup        12 tok   124 ms TTFT    148 ms total   82.1 tok/s
code         128 tok   118 ms TTFT   1564 ms total   80.2 tok/s
reasoning    192 tok   122 ms TTFT   2378 ms total   79.8 tok/s
translation  128 tok   119 ms TTFT   1582 ms total   80.4 tok/s
── Aggregate ───────────────────────────────────────────────────────────────
80.1 tok/s avg   120 ms TTFT avg   460 total output tokens in 5.7 s
```
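Reading the sample numbers back, the aggregate averages only the three non-warmup runs, while the totals count every run including warmup. A sketch of that reduction (inferred from the sample figures, not taken from hfo's source):

```python
# (id, output tokens, TTFT ms, total ms, tok/s) from the sample run above
runs = [
    ("warmup",       12, 124,  148, 82.1),
    ("code",        128, 118, 1564, 80.2),
    ("reasoning",   192, 122, 2378, 79.8),
    ("translation", 128, 119, 1582, 80.4),
]

measured = [r for r in runs if r[0] != "warmup"]  # warmup is cold-start-skewed

aggregate = {
    "tokensPerSec": round(sum(r[4] for r in measured) / len(measured), 1),
    "ttftMs": round(sum(r[2] for r in measured) / len(measured)),
    "totalTokens": sum(r[1] for r in runs),  # totals do include the warmup run
    "totalMs": sum(r[3] for r in runs),
}
print(aggregate)
# → {'tokensPerSec': 80.1, 'ttftMs': 120, 'totalTokens': 460, 'totalMs': 5672}
```

These reproduce the aggregate block of the sample output and the `aggregate` object in the JSON below.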

Submission format

When you pass --out <file>, hfo writes a JSON document that follows the hfo-bench-v1 schema:

```json
{
  "tag": "qwen2.5-coder:7b-q5_k_m",
  "ollamaVersion": "0.21.3",
  "hfoVersion": "0.1.0",
  "hardware": {
    "gpuName": "NVIDIA GeForce RTX 4090",
    "vramMiB": 24576,
    "ramMiB": 65536,
    "cpuCores": 16,
    "platform": "win32"
  },
  "runs": [
    { "id": "warmup", "prompt": "...", "outputTokens": 12, "ttftMs": 124, "totalMs": 148, "tokensPerSec": 82.1 },
    { "id": "code", "outputTokens": 128, "ttftMs": 118, "totalMs": 1564, "tokensPerSec": 80.2 },
    { "id": "reasoning", "outputTokens": 192, "ttftMs": 122, "totalMs": 2378, "tokensPerSec": 79.8 },
    { "id": "translation", "outputTokens": 128, "ttftMs": 119, "totalMs": 1582, "tokensPerSec": 80.4 }
  ],
  "aggregate": { "tokensPerSec": 80.1, "ttftMs": 120, "totalTokens": 460, "totalMs": 5672 },
  "submittedAt": "2026-04-23T20:14:32.114Z",
  "schema": "hfo-bench-v1"
}
```
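Before opening a PR, it can be worth sanity-checking the file against the fields above. The checker below is a hypothetical pre-submission helper, not part of hfo; the field lists are taken from the sample document:

```python
REQUIRED_TOP = {"tag", "ollamaVersion", "hfoVersion", "hardware",
                "runs", "aggregate", "submittedAt", "schema"}
REQUIRED_RUN = {"id", "outputTokens", "ttftMs", "totalMs", "tokensPerSec"}

def check_submission(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the file looks valid."""
    problems = []
    missing = REQUIRED_TOP - doc.keys()
    if missing:
        problems.append(f"missing top-level fields: {sorted(missing)}")
    if doc.get("schema") != "hfo-bench-v1":
        problems.append("schema must be 'hfo-bench-v1'")
    for i, run in enumerate(doc.get("runs", [])):
        run_missing = REQUIRED_RUN - run.keys()
        if run_missing:
            problems.append(f"run {i} missing: {sorted(run_missing)}")
    return problems

# A deliberately broken document: wrong schema tag, incomplete run
print(check_submission({"schema": "hfo-bench-v0", "runs": [{"id": "code"}]}))
```

Running it over `bench.json` (via `json.load`) before committing catches hand-edit mistakes such as a renamed field or a dropped run.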

Contribute a submission

The leaderboard is PR-based — no server, no telemetry, no account required. To contribute:

  1. Run hfo --bench <tag> --out my-run.json.
  2. Open a pull request against the hfo repo adding the file under docs/benchmarks/submissions/ (create the directory if it doesn't exist yet).
  3. Name the file <gpu-slug>_<tag-slug>.json — e.g. rtx4090_qwen25coder-7b-q5km.json.
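The exact slug rule isn't specified, so one reasonable sketch is below. Note it is an assumption, and it keeps a hyphen that the example filename above drops; treat the repo's existing filenames as the authority.

```python
import re

def slug(text: str) -> str:
    """Lowercase, map ':' to '-', then drop everything except [a-z0-9-].

    Illustrative only; not hfo's actual slug rule.
    """
    text = text.lower().replace(":", "-")
    return re.sub(r"[^a-z0-9-]", "", text)

print(slug("qwen2.5-coder:7b-q5_k_m"))   # a close, but not identical, slug
print(slug("NVIDIA GeForce RTX 4090"))
```

Whatever rule you use, keep it consistent: the filename is how duplicate submissions for the same GPU/model pair are spotted in review.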

Personal information: the submission file contains only the fields shown above. No hostname, username, MAC address, or filesystem path is recorded. If you want to obfuscate the GPU string or OS, edit the JSON before opening the PR.

Reproducibility notes

  • Model state matters. Close any other Ollama-loaded model first (ollama stop <other>). With two 8-bit models loaded, KV cache pressure can cut tok/s by 15–30%.
  • Thermals matter. Run once cold, once after a 2-minute sustained load. Report the sustained figure — it reflects real workloads better than the cold run.
  • Temperature is fixed at 0.2 to keep output length predictable across runs. Don't override it.
  • Context window uses Ollama's default for the model. If you override it via Modelfile, note that in your PR description — tok/s drops with longer contexts.

What the benchmarks don't tell you

  • Quality. Tok/s says nothing about whether the output is correct. Use a separate eval harness (MMLU, HumanEval, your own) for that.
  • Long-context behaviour. This bench generates ≤ 192 tokens per prompt. Long generations degrade tok/s on most hardware — that's a different measurement.
  • Multi-turn overhead. Each prompt is fresh. Conversational back-and-forth with large KV caches will be slower than what this bench reports.