How We Benchmark LLMs

When I started testing LLMs on my AMD Radeon RX 7900 XTX, I realized I needed more than just speed numbers. I needed to know: can this model handle tools? Does it write good code? Can it reason effectively? This page describes how we benchmark models to answer those questions.

1. Assessment Scenarios

We test models with 4 distinct scenarios, each evaluating different AI capabilities. Each scenario has a specific temperature setting that optimizes for the type of task.

Understanding Temperature

Temperature controls the randomness of the model's responses:

  • 0.0-0.3 (Cold): Very deterministic. Same prompt produces nearly identical responses. Best for tasks requiring consistency like code generation.
  • 0.4-0.6 (Moderate): Balanced. Some variety while maintaining coherence. Good for planning and analysis tasks.
  • 0.7-1.0+ (Hot): High randomness. Produces creative, varied responses. Essential for narrative and roleplay tasks.
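
To make the effect concrete, here is a minimal sketch of how temperature is commonly applied during sampling: the model's raw scores (logits) are divided by the temperature before being converted to probabilities. The function name and the example logits are illustrative, and individual inference servers may implement sampling differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        # Temperature 0 is treated here as greedy decoding: always pick the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 1.0)
```

At 0.2 nearly all of the probability mass sits on the top token, while at 1.0 the other candidates keep a meaningful share, which is where the variety in "hot" responses comes from.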

2. Performance Metrics

These metrics measure how fast the model runs and how efficiently it uses hardware. All metrics are collected automatically during benchmark execution.

Token Generation Speed

Unit: tokens per second (tok/s)

The speed at which the model generates output tokens after processing the input.

✓ What's Good

Higher is better. Typically 10-150 tok/s depending on model size and hardware. Larger models generally produce fewer tokens/sec.

✗ What's Bad

Below 5 tok/s may indicate poor hardware or model sizing issues. Values over 200 tok/s are unusual and may indicate measurement error.

Formula: generated_tokens ÷ generation_time (seconds)
Notes: Actual tokens are counted by the inference server. For Ollama, this is reported by the API. Token count is estimated as word_count × 1.3 if not directly available.
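
As a sketch of the formula, the snippet below computes tok/s from a token count and a duration. The `eval_count` and `eval_duration` fields are the ones Ollama's generate API returns (the duration is in nanoseconds); the helper names are illustrative and not necessarily what the benchmark client uses.

```python
def generation_speed(generated_tokens, generation_time_s):
    """tok/s = generated_tokens / generation_time (seconds)."""
    if generation_time_s <= 0:
        raise ValueError("generation_time_s must be positive")
    return generated_tokens / generation_time_s

def speed_from_ollama(response):
    """Compute tok/s from an Ollama response dict (eval_duration is in ns)."""
    return generation_speed(response["eval_count"], response["eval_duration"] / 1e9)

# Example values only, not a real measurement: 256 tokens in 4 seconds.
sample = {"eval_count": 256, "eval_duration": 4_000_000_000}
speed = speed_from_ollama(sample)
```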

Time to First Token

Unit: milliseconds (ms)

The latency between sending a request and receiving the first output token. Measures "perceived responsiveness." Interpretation depends on the scenario being benchmarked, as different use cases have different latency expectations.

✓ What's Good

Lower is better, but thresholds depend on use case:

  • Chat: ≤300ms (users expect immediate response)
  • Code Generation: ≤400ms (IDE/completion integration)
  • Agent Workflow: ≤500ms (multi-step reasoning)
  • Role Play & Narrative: ≤300ms (narrative flow)
  • Research & Analysis: ≤1000ms (analytical work)

✗ What's Bad

Performance threshold depends on use case:

  • Chat: >800ms (unacceptable latency for real-time interaction)
  • Code Generation: >1200ms (breaks IDE integration experience)
  • Agent Workflow: >1500ms (compounds across multiple steps)
  • Role Play & Narrative: >800ms (breaks immersion)
  • Research & Analysis: >3000ms (frustrating for analytical work)
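
The good and bad thresholds above can be captured in a small lookup table. This is a sketch only: the function name is illustrative, and the "borderline" label for values that fall between the two thresholds is an assumption, since the page does not name that middle band.

```python
# (good_at_or_below_ms, bad_above_ms) per use case, taken from the lists above.
TTFT_THRESHOLDS_MS = {
    "Chat": (300, 800),
    "Code Generation": (400, 1200),
    "Agent Workflow": (500, 1500),
    "Role Play & Narrative": (300, 800),
    "Research & Analysis": (1000, 3000),
}

def rate_ttft(use_case, ttft_ms):
    """Classify a TTFT value against the thresholds for one use case."""
    good, bad = TTFT_THRESHOLDS_MS[use_case]
    if ttft_ms <= good:
        return "good"
    if ttft_ms > bad:
        return "bad"
    return "borderline"  # assumed label for the unnamed middle band
```

For example, `rate_ttft("Chat", 1000)` and `rate_ttft("Research & Analysis", 1000)` give different ratings for the same latency.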

Formula: Directly measured via HTTP streaming: T1 − T0, where T0 = timestamp immediately before sending the request, T1 = timestamp when the first non-empty content chunk arrives. Fallback: (generation_time ÷ generated_tokens) × 1000 ms if streaming measurement is unavailable. The fallback is the average time per token, so it only approximates TTFT.
Notes: TTFT interpretation is use-case dependent because scenarios have different latency expectations. A 1000ms TTFT is bad for Chat (immediate response required) but acceptable for Research (analytical work can tolerate longer thinking time). When viewing results, match the TTFT value to the relevant scenario threshold. The value displayed on summary cards is the average across scenarios; detailed breakdowns show per-scenario TTFT for accurate interpretation.
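
A minimal sketch of the streaming measurement follows. It assumes the caller passes a hypothetical `start_stream` function that sends the request and returns an iterator of content chunks. The timer starts before that call, so T0 is taken immediately before the request is sent.

```python
import time

def measure_ttft_ms(start_stream):
    """Return milliseconds from request start to the first non-empty chunk."""
    t0 = time.perf_counter()          # T0: just before the request is sent
    for chunk in start_stream():      # start_stream is a caller-supplied stand-in
        if chunk:                     # skip empty keep-alive or metadata chunks
            return (time.perf_counter() - t0) * 1000.0  # T1 - T0
    return None                       # the stream ended without any content

def fallback_ttft_ms(generation_time_s, generated_tokens):
    """Fallback when streaming timing is unavailable: average ms per token."""
    return generation_time_s / generated_tokens * 1000.0
```

`time.perf_counter()` is used rather than wall-clock time because it is monotonic and intended for measuring short intervals.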

Memory Usage

Unit: gigabytes (GB)

Peak memory (RAM or VRAM) used by the model during generation.

✓ What's Good

Lower is better for a given model quality, since it leaves room for more models or instances on the same hardware.

✗ What's Bad

Usage approaching available VRAM causes slowdown (memory swapping). Usage above available RAM causes out-of-memory errors.

Formula: Model VRAM (in MB) ÷ 1024 = GB
Notes: Typically reported by the inference tool (Ollama, LM Studio, etc.). Includes model weights + KV cache.

GPU Utilization

Unit: percentage (%)

What fraction of the model is running on GPU vs CPU. 100% means the entire model is on GPU (fastest).

✓ What's Good

100% = entire model on GPU = best performance. 50%+ = good performance. Below 50% = significant CPU overhead.

✗ What's Bad

0% = model running entirely on CPU = slowest possible inference.

Formula: GPU_parameters ÷ total_parameters × 100
Notes: Reported as "processorGpuPercent" in benchmark results. Related value "processorCpuPercent" shows CPU percentage.
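
The formula can be sketched as below. The field names come from the benchmark results; the assumption that the CPU percentage is simply the remainder (100 minus the GPU percentage) is ours and is not stated by the source.

```python
def processor_split(gpu_parameters, total_parameters):
    """Percentage of the model on GPU, plus the assumed CPU remainder."""
    if total_parameters <= 0:
        raise ValueError("total_parameters must be positive")
    gpu_percent = gpu_parameters / total_parameters * 100
    return {
        "processorGpuPercent": gpu_percent,
        "processorCpuPercent": 100 - gpu_percent,  # assumption: CPU takes the rest
    }

# Example values only: 6 of 8 billion parameters offloaded to the GPU.
split = processor_split(6_000_000_000, 8_000_000_000)
```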

Efficiency Ratio

Unit: tokens per second per GB (tok/s/GB)

Normalized speed considering memory cost. How many tokens can the model generate per second for each gigabyte of memory used?

✓ What's Good

Higher is better. Shows generation speed per unit of memory used. 0.5-2.0 is a solid range.

✗ What's Bad

Below 0.1 indicates inefficient hardware utilization.

Formula: tokenGenerationSpeed ÷ (memoryUsage_in_MB ÷ 1024)
Notes: Allows fair comparison between different model sizes. A large model using 48GB can be more efficient than a small model on poor hardware.
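
A worked example of the formula, with made-up input values and an illustrative function name:

```python
def efficiency_ratio(token_generation_speed, memory_usage_mb):
    """tok/s per GB = speed / (memory in MB / 1024)."""
    memory_gb = memory_usage_mb / 1024
    if memory_gb <= 0:
        raise ValueError("memory usage must be positive")
    return token_generation_speed / memory_gb

# Example values only: 45 tok/s while using 24576 MB (24 GB).
ratio = efficiency_ratio(45.0, 24576)
```

Here 45 ÷ 24 gives 1.875 tok/s/GB, which falls inside the 0.5-2.0 range described above.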

Metrics Context

Token Estimation: When actual token counts aren't available from the inference server, we estimate tokens as:

estimated_tokens = word_count × 1.3

This 1.3 multiplier is empirically derived as an average ratio of tokens to words in English text.
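
As a sketch, the estimate can be computed as follows. Counting words by splitting on whitespace and rounding to the nearest integer are assumptions; the source only specifies the 1.3 multiplier.

```python
def estimate_tokens(text, tokens_per_word=1.3):
    """Estimate the token count of English text from its word count."""
    word_count = len(text.split())              # assumption: whitespace-delimited words
    return round(word_count * tokens_per_word)  # assumption: round to nearest integer

estimate = estimate_tokens("the quick brown fox jumps over the lazy dog today")
```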

3. Quality Assessment & Scoring

We use an LLM to evaluate model responses across multiple dimensions. Each scenario has specific evaluation criteria designed to measure what matters most for that task type.

Evaluation Dimensions by Scenario

Agent Planning — 35% Coherence + 35% Tool Selection + 20% Decomposition + 10% Error Handling

We evaluate whether the agent breaks down complex problems logically, selects appropriate tools for each step, considers error cases, and explains reasoning clearly. The checklist verifies: reasoning coherence (steps connected?), tool appropriateness (correct tools for each step?), decomposition quality (manageable sub-tasks?), and error handling (risks identified and mitigated?).

Role Play & Narrative — 30% Consistency + 35% Immersion + 20% Dialogue + 15% Narrative Arc

Character consistency is verified through emotional triggers (does the mention of Kestra land authentically?), trust progression (earned gradually, not given freely?), and internal conflict (is the character torn between mercenary instinct and humanity?). Immersion requires sensory grounding (tavern smells, body language, emotional texture). Dialogue must be voice-authentic and reveal character through language. Narrative arc must show clear progression from resistance through warming to a binary, character-driven choice.

Research & Analysis — 30% Depth + 35% Insight + 25% Interpretation + 10% Clarity

Depth checks: Are all required sections present? Is scaling analyzed per-task (not overall)? Are calculations shown, not claimed? Insight checks: Do comparisons show understanding of task-specific differences? Are patterns explained or just observed? Interpretation checks: Can recommendations be verified from data? Are predictions justified with confidence levels and assumptions? Are limitations quantified (±X percentage points)? Clarity checks: Are recommendations specific and tied to actual scores/tradeoffs?

Code Generation (Breakout) — 40% Correctness + 30% Code Quality + 20% Performance + 10% Completeness

Correctness requires verifiable code inspection: state management clear (grid, ball, paddle tracked)? Ball physics correct (velocity magnitude, direction initialization)? Collision detection complete (ball-wall, ball-paddle, ball-brick all verified)? Paddle control responsive (keyboard + mobile touch)? Rendering uses Canvas (not DOM)? Game over logic triggers correctly? Frame rate using requestAnimationFrame? We inspect the code structure rather than running it, so bugs are visible: off-by-one collisions, velocity math errors, missing event handlers.

How Scoring Works

Step 1: Dimension Scoring

For each dimension, an expert evaluator (LLM) assigns a score from 0-100 based on the quality definition for that scenario. Specific criteria and checklists guide the scoring to ensure consistency.

Step 2: Weighted Average (KPI)

The KPI (Key Performance Indicator) for each scenario is calculated as a weighted average:

KPI = Σ(dimension_score × dimension_weight)
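
The weighted average can be sketched with the dimension weights listed earlier in this section. The dimension scores in the example are made up, and the function name is illustrative.

```python
# Weights in percent, copied from "Evaluation Dimensions by Scenario".
SCENARIO_WEIGHTS = {
    "Agent Planning": {"Coherence": 35, "Tool Selection": 35,
                       "Decomposition": 20, "Error Handling": 10},
    "Role Play & Narrative": {"Consistency": 30, "Immersion": 35,
                              "Dialogue": 20, "Narrative Arc": 15},
    "Research & Analysis": {"Depth": 30, "Insight": 35,
                            "Interpretation": 25, "Clarity": 10},
    "Code Generation": {"Correctness": 40, "Code Quality": 30,
                        "Performance": 20, "Completeness": 10},
}

def scenario_kpi(scenario, scores):
    """Weighted average of 0-100 dimension scores for one scenario."""
    weights = SCENARIO_WEIGHTS[scenario]
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the scenario's dimensions")
    return sum(scores[dim] * weight for dim, weight in weights.items()) / 100

# Made-up scores: 32 + 21 + 12 + 9 = 74.
example_scores = {"Correctness": 80, "Code Quality": 70,
                  "Performance": 60, "Completeness": 90}
kpi = scenario_kpi("Code Generation", example_scores)
```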

Step 3: Aggregation

The overall benchmark quality score displayed on the benchmarks page is the average of KPI scores across all 4 scenarios.

4. Assessment Flow

Here's what happens when you run a benchmark:

Step 1: Client Collects Data

Benchmark client runs all 4 scenarios, capturing: prompt, full response, tokens used, generation speed, memory usage, GPU utilization.

Step 2: Submit to Server

The full benchmark data is sent to the /api/benchmarks endpoint and stored in the database. Quality assessment jobs are created with status: pending.
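
A submission might be built roughly as follows. Only the /api/benchmarks path comes from this page. The host, the payload shape, and most field names are hypothetical placeholders, so treat this as a sketch rather than the client's actual wire format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # hypothetical host, not from the source

def build_submission(scenario_results, base_url=BASE_URL):
    """Build (but do not send) a POST request carrying the benchmark data."""
    body = json.dumps({"scenarios": scenario_results}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/benchmarks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Field names below are illustrative; the values are made up.
request = build_submission([{
    "scenario": "Code Generation",
    "prompt": "...",
    "response": "...",
    "tokensUsed": 512,
    "tokenGenerationSpeed": 64.0,
    "memoryUsage": 18.5,
    "processorGpuPercent": 100,
}])
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```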

Step 3: Async Quality Assessment

A background processor retrieves pending assessments. For each scenario, it calls the designated evaluator LLM with a scenario-specific prompt.

Step 4: Store Results

Scores and KPIs are stored in the database with status: completed or failed. The average KPI is then calculated across all 4 scenarios.
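
The pending, completed, and failed lifecycle can be sketched as a simple loop. Everything here is hypothetical: the job dictionaries stand in for database rows, `judge` stands in for the evaluator LLM call, and averaging only the completed scenarios is an assumption, since the page does not say how failed assessments affect the average.

```python
def process_pending(assessments, judge):
    """Score each pending job, mark it completed or failed, return the average KPI."""
    for job in assessments:
        if job["status"] != "pending":
            continue
        try:
            job["kpi"] = judge(job["scenario"], job["response"])
            job["status"] = "completed"
        except Exception as exc:  # any evaluator error marks the job as failed
            job["status"] = "failed"
            job["error"] = str(exc)
    done = [job["kpi"] for job in assessments if job["status"] == "completed"]
    return sum(done) / len(done) if done else None  # assumption: skip failed jobs

def fake_judge(scenario, response):
    """Stand-in for the evaluator LLM: rejects empty responses, otherwise 80."""
    if not response:
        raise ValueError("empty response")
    return 80.0

jobs = [
    {"scenario": "Agent Planning", "response": "plan...", "status": "pending"},
    {"scenario": "Role Play & Narrative", "response": "scene...", "status": "pending"},
    {"scenario": "Research & Analysis", "response": "", "status": "pending"},
    {"scenario": "Code Generation", "response": "code...", "status": "pending"},
]
average_kpi = process_pending(jobs, fake_judge)
```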

Step 5: Display Results

Results appear on /benchmarks page. Click "expand" on any benchmark to see all quality scores and detailed assessment summaries.

5. Frequently Asked Questions

Common questions about our benchmarking process.

How do you ensure benchmark reproducibility?

Our open-source client uses a fixed seed and identical prompts for every run. Each model is tested multiple times to filter out anomalies in generation speed or memory reporting.
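
One plausible way to filter anomalies across repeated runs is to drop samples that sit far from the median. This is a sketch of that idea only: the 25% cutoff, the function name, and the sample values are assumptions, and the client's actual filtering rule may differ.

```python
import statistics

def filter_anomalies(samples, max_deviation=0.25):
    """Keep samples within max_deviation (as a fraction) of the median."""
    mid = statistics.median(samples)
    return [s for s in samples if abs(s - mid) <= max_deviation * mid]

runs_tok_s = [41.8, 42.5, 43.1, 12.0, 42.2]  # made-up speeds from five runs
kept = filter_anomalies(runs_tok_s)           # the 12.0 run is dropped
summary = statistics.mean(kept)
```

The median is used as the reference point because, unlike the mean, it is not pulled toward a single outlying run.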

Are community submissions verified?

Yes. Community submissions are checked for impossible timings and incorrect signature payloads. Verification relies on client-side consistency checks and heuristic anomaly detection on the server.
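
A heuristic timing check could look something like the sketch below. The 200 tok/s ceiling echoes the "unusual" threshold from the Token Generation Speed section. The field names, the 5% tolerance, and the specific rules are assumptions and do not describe the server's actual checks.

```python
def find_anomalies(submission, max_plausible_tok_s=200.0, tolerance=0.05):
    """Return a list of reasons a submission's timings look implausible."""
    tokens = submission["generated_tokens"]       # hypothetical field name
    seconds = submission["generation_time_s"]     # hypothetical field name
    reported = submission["tokenGenerationSpeed"]
    if tokens <= 0 or seconds <= 0:
        return ["non-positive token count or duration"]
    problems = []
    recomputed = tokens / seconds
    if abs(recomputed - reported) > tolerance * recomputed:
        problems.append("reported speed does not match tokens / time")
    if reported > max_plausible_tok_s:
        problems.append("speed above plausible range")
    return problems

ok = find_anomalies({"generated_tokens": 512, "generation_time_s": 8.0,
                     "tokenGenerationSpeed": 64.0})
suspicious = find_anomalies({"generated_tokens": 512, "generation_time_s": 8.0,
                             "tokenGenerationSpeed": 640.0})
```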

Methodology v1.0 • Last updated March 2026