When I started testing LLMs on my AMD Radeon RX 7900 XTX, I realized I needed more than just speed numbers. I needed to know: can this model handle tools? Does it write good code? Can it reason effectively? This is how we benchmark to answer those real questions.
We test models with 4 distinct scenarios, each evaluating different AI capabilities. Each scenario has a specific temperature setting that optimizes for the type of task.
Temperature controls the randomness of the model's responses. Lower values make output more deterministic, and higher values make it more varied.
These metrics measure how fast the model runs and how efficiently it uses hardware. All metrics are collected automatically during benchmark execution.
Unit: tokens per second (tok/s)
The speed at which the model generates output tokens after processing the input.
Higher is better. Typically 10-150 tok/s depending on model size and hardware. Larger models generally produce fewer tokens/sec.
Below 5 tok/s may indicate poor hardware or model sizing issues. Values over 200 tok/s are unusual and may indicate measurement error.
generated_tokens ÷ generation_time (seconds)
Unit: milliseconds (ms)
The latency between sending a request and receiving the first output token. Measures "perceived responsiveness." Interpretation depends on the scenario being benchmarked, as different use cases have different latency expectations.
Lower is better, but thresholds depend on use case:
• Chat: ≤300ms (users expect immediate response)
• Code Generation: ≤400ms (IDE/completion integration)
• Agent Workflow: ≤500ms (multi-step reasoning)
• Role Play & Narrative: ≤300ms (narrative flow)
• Research & Analysis: ≤1000ms (analytical work)
Latency is considered poor above these thresholds, which also depend on use case:
• Chat: >800ms (unacceptable latency for real-time interaction)
• Code Generation: >1200ms (breaks IDE integration experience)
• Agent Workflow: >1500ms (compounds across multiple steps)
• Role Play & Narrative: >800ms (breaks immersion)
• Research & Analysis: >3000ms (frustrating for analytical work)
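As a rough illustration, the two threshold lists above can be combined into a single lookup. The function name `rate_ttft` and the "acceptable" label for the band between the two thresholds are assumptions made for this sketch; the page itself only defines the good and poor limits.

```python
# Thresholds in milliseconds, copied from the lists above: (good_max, poor_min).
TTFT_THRESHOLDS_MS = {
    "Chat": (300, 800),
    "Code Generation": (400, 1200),
    "Agent Workflow": (500, 1500),
    "Role Play & Narrative": (300, 800),
    "Research & Analysis": (1000, 3000),
}


def rate_ttft(use_case: str, ttft_ms: float) -> str:
    """Classify a time-to-first-token measurement for one use case."""
    good_max, poor_min = TTFT_THRESHOLDS_MS[use_case]
    if ttft_ms <= good_max:
        return "good"
    if ttft_ms > poor_min:
        return "poor"
    # Assumed label: the source does not name the band between the limits.
    return "acceptable"
```

The same 600 ms measurement would rate differently for Chat than for Research & Analysis, which is why the use case is a required argument.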
Directly measured via HTTP streaming: T1 − T0, where T0 is the timestamp immediately before sending the request and T1 is the timestamp when the first non-empty content chunk arrives. Fallback: (generation_time ÷ generated_tokens) × 1000 ms if streaming measurement is unavailable. The fallback is the average time per generated token, so it is only a rough proxy for first-token latency.
Unit: gigabytes (GB)
Peak memory (RAM or VRAM) used by the model during generation.
Lower is better for a given model quality, since it leaves room for more models or instances on the same hardware.
Usage approaching available VRAM causes slowdown (memory swapping). Usage above available RAM causes out-of-memory errors.
Model VRAM (in MB) ÷ 1024 = GB
Unit: percentage (%)
What fraction of the model is running on GPU vs CPU. 100% means the entire model is on GPU (fastest).
100% = entire model on GPU = best performance. 50%+ = good performance. Below 50% = significant CPU overhead.
0% = model running entirely on CPU = slowest possible inference.
GPU_parameters ÷ total_parameters × 100
Unit: tokens per second per GB (tok/s/GB)
Normalized speed considering memory cost. How many tokens can the model generate per second for each gigabyte of memory used?
Higher is better. Shows generation speed per unit of memory used. 0.5-2.0 is a solid range.
Below 0.1 indicates inefficient hardware utilization.
tokenGenerationSpeed ÷ (memoryUsage_in_MB ÷ 1024)
Token Estimation: When actual token counts aren't available from the inference server, we estimate tokens as:
estimated_tokens = word_count × 1.3
This 1.3 multiplier is empirically derived as an average ratio of tokens to words in English text.
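The formulas in this section are simple enough to collect into a few helpers. The sketch below is illustrative only: the function names are hypothetical, and the streaming measurement is shown over a generic iterator of text chunks rather than a specific HTTP client.

```python
import time
from typing import Iterable, Optional


def tokens_per_second(generated_tokens: int, generation_time_s: float) -> float:
    """generated_tokens / generation_time (seconds)."""
    if generation_time_s <= 0:
        raise ValueError("generation_time_s must be positive")
    return generated_tokens / generation_time_s


def measure_ttft_ms(chunks: Iterable[str], t0: float) -> Optional[float]:
    """T1 - T0 in ms, where T1 is when the first non-empty chunk arrives.

    t0 must be taken with time.perf_counter() just before the request is sent.
    Returns None if the stream ends without any non-empty chunk.
    """
    for chunk in chunks:
        if chunk:
            return (time.perf_counter() - t0) * 1000.0
    return None


def fallback_ttft_ms(generation_time_s: float, generated_tokens: int) -> float:
    """Fallback: average time per token in ms (a proxy, not a true TTFT)."""
    return generation_time_s / generated_tokens * 1000.0


def memory_gb(memory_mb: float) -> float:
    """Memory in MB / 1024."""
    return memory_mb / 1024


def gpu_offload_percent(gpu_parameters: int, total_parameters: int) -> float:
    """GPU_parameters / total_parameters * 100."""
    return gpu_parameters / total_parameters * 100


def efficiency_tok_s_per_gb(tok_s: float, memory_mb: float) -> float:
    """Token generation speed divided by memory usage in GB."""
    return tok_s / memory_gb(memory_mb)


def estimate_tokens(text: str) -> float:
    """word_count * 1.3, used when the server reports no token counts.

    Splitting on whitespace is an assumption about how words are counted.
    """
    return len(text.split()) * 1.3
```

For example, 300 tokens generated in 6 seconds is 50 tok/s; with 8192 MB (8 GB) of memory in use, that works out to 6.25 tok/s/GB.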
We use an LLM to evaluate model responses across multiple dimensions. Each scenario has specific evaluation criteria designed to measure what matters most for that task type.
We evaluate whether the agent breaks down complex problems logically, selects appropriate tools for each step, considers error cases, and explains reasoning clearly. The checklist verifies: reasoning coherence (steps connected?), tool appropriateness (correct tools for each step?), decomposition quality (manageable sub-tasks?), and error handling (risks identified and mitigated?).
Character consistency is verified through emotional triggers (does the mention of Kestra land authentically?), trust progression (is trust earned gradually, not given freely?), and internal conflict (is the character torn between mercenary instinct and humanity?). Immersion requires sensory grounding (tavern smells, body language, emotional texture). Dialogue must be voice-authentic and reveal character through language. The narrative arc must show clear progression from resistance through warming to a binary, character-driven choice.
Depth checks: Are all required sections present? Is scaling analyzed per-task (not overall)? Are calculations shown, not claimed?
Insight checks: Do comparisons show understanding of task-specific differences? Are patterns explained or just observed?
Interpretation checks: Can recommendations be verified from data? Are predictions justified with confidence levels and assumptions? Are limitations quantified (±X percentage points)?
Clarity checks: Are recommendations specific and tied to actual scores/tradeoffs?
Correctness requires verifiable code inspection:
• State management clear (grid, ball, paddle tracked)?
• Ball physics correct (velocity magnitude, direction initialization)?
• Collision detection complete (ball-wall, ball-paddle, ball-brick all verified)?
• Paddle control responsive (keyboard + mobile touch)?
• Rendering uses Canvas (not DOM)?
• Game over logic triggers correctly?
• Frame rate using requestAnimationFrame?
We inspect the code structure rather than running it, so bugs are visible: off-by-one collisions, velocity math errors, missing event handlers.
For each dimension, an expert evaluator (LLM) assigns a score from 0 to 100 based on the quality definition for that scenario. Specific criteria and checklists guide the scoring to ensure consistency.
The KPI (Key Performance Indicator) for each scenario is calculated as a weighted average:
KPI = Σ(dimension_score × dimension_weight)
The overall benchmark quality score displayed on the benchmarks page is the average of KPI scores across all 4 scenarios.
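A minimal sketch of the two calculations above, assuming each scenario's dimension weights sum to 1 (the page calls the KPI a weighted average but does not list the weights). The function names and the example weights are hypothetical.

```python
def scenario_kpi(scores: dict[str, float], weights: dict[str, float]) -> float:
    """KPI = sum(dimension_score * dimension_weight) for one scenario."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    # Assumption: weights are normalized, so the KPI stays on the 0-100 scale.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[dim] * weights[dim] for dim in scores)


def overall_quality(scenario_kpis: list[float]) -> float:
    """Plain average of the per-scenario KPI scores."""
    if not scenario_kpis:
        raise ValueError("at least one scenario KPI is required")
    return sum(scenario_kpis) / len(scenario_kpis)


# Illustrative values only; the real weights are not given on this page.
agent_scores = {
    "reasoning_coherence": 80,
    "tool_appropriateness": 90,
    "decomposition_quality": 70,
    "error_handling": 60,
}
agent_weights = {
    "reasoning_coherence": 0.4,
    "tool_appropriateness": 0.3,
    "decomposition_quality": 0.2,
    "error_handling": 0.1,
}
agent_kpi = scenario_kpi(agent_scores, agent_weights)
```

With these example numbers the weighted sum is 32 + 27 + 14 + 6, so the scenario KPI comes out at about 79.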
Here's what happens when you run a benchmark:
The benchmark client runs all 4 scenarios, capturing the prompt, full response, tokens used, generation speed, memory usage, and GPU utilization.
Full benchmark data is sent to the /api/benchmarks endpoint and stored in the database. Quality assessment jobs are created (status: pending).
A background processor retrieves pending assessments. For each scenario, it calls a designated LLM with a scenario-specific prompt.
Scores and KPIs are stored in the database with status completed or failed. The average KPI is calculated across all 4 scenarios.
Results appear on the /benchmarks page. Click "expand" on any benchmark to see all quality scores and detailed assessment summaries.
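The assessment step in this flow can be pictured as a small state machine: jobs start as pending and end as completed or failed. The sketch below is not the actual backend. The `Assessment` class, the `judge` callable standing in for the LLM call, and the choice to average only completed jobs are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Assessment:
    scenario: str
    response: str
    status: str = "pending"  # pending -> completed | failed
    kpi: Optional[float] = None


def process_pending(
    assessments: List[Assessment],
    judge: Callable[[str, str], float],
) -> None:
    """Run the judge on every pending assessment and record the outcome."""
    for item in assessments:
        if item.status != "pending":
            continue
        try:
            # judge is a hypothetical stand-in for the scenario-specific LLM call.
            item.kpi = judge(item.scenario, item.response)
            item.status = "completed"
        except Exception:
            item.status = "failed"


def average_kpi(assessments: List[Assessment]) -> Optional[float]:
    """Average KPI over completed assessments (assumed handling of failures)."""
    done = [a.kpi for a in assessments if a.status == "completed" and a.kpi is not None]
    return sum(done) / len(done) if done else None
```

Keeping the judge as a plain callable makes the loop easy to test with a fake evaluator before wiring in a real model.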
Common questions about our benchmarking process.
Our open-source client uses a fixed seed and identical prompt engineering for every run. Models are tested multiple times to filter out anomalies in generation speeds or memory reporting.
Yes. Community submissions are checked for impossible timings and incorrect signature payloads, though this relies on client-side consistency checks and heuristic anomaly detection on the server.
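The exact server-side heuristics are not described here. Purely as an illustration of what a plausibility check on a submission might look like, the sketch below flags values outside the ranges mentioned earlier on this page (for example, speeds over 200 tok/s). The function name and the specific rules are assumptions, not the real validation logic.

```python
from typing import List


def plausibility_flags(
    tok_s: float,
    ttft_ms: float,
    gpu_offload_pct: float,
    memory_gb: float,
) -> List[str]:
    """Return human-readable flags for values that look implausible."""
    flags: List[str] = []
    if tok_s <= 0:
        flags.append("token speed must be positive")
    elif tok_s > 200:
        # Values over 200 tok/s are described above as unusual.
        flags.append("token speed above 200 tok/s, possible measurement error")
    if ttft_ms <= 0:
        flags.append("time to first token must be positive")
    if not 0 <= gpu_offload_pct <= 100:
        flags.append("GPU offload must be between 0 and 100 percent")
    if memory_gb <= 0:
        flags.append("memory usage must be positive")
    return flags
```

An empty list means nothing looked suspicious under these illustrative rules; a flagged submission would still need review rather than automatic rejection.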