Understanding how we measure LLM performance. This page explains the 4 assessment scenarios, the performance metrics we track, and how quality scores are calculated.
We test models with 4 distinct scenarios, each evaluating different AI capabilities. Each scenario has a specific temperature setting that optimizes for the type of task.
Temperature controls the randomness of the model's responses: lower values make output more deterministic and repeatable, while higher values make it more varied and creative.
These metrics measure how fast the model runs and how efficiently it uses hardware. All metrics are collected automatically during benchmark execution.
Unit: tokens per second (tok/s)
The speed at which the model generates output tokens after processing the input.
Higher is better. Typically 10-150 tok/s depending on model size and hardware. Larger models generally produce fewer tokens/sec.
Below 5 tok/s may indicate poor hardware or model sizing issues. Values over 200 tok/s are unusual and may indicate measurement error.
Formula: generated_tokens ÷ generation_time (seconds)

Unit: milliseconds (ms)
The latency between sending a request and receiving the first output token. Measures "perceived responsiveness."
Lower is better. 50-500ms is typical. Under 100ms feels very responsive; over 1000ms feels sluggish.
Over 2000ms (2 seconds) creates noticeable lag in interactive applications.
Formula: (generation_time ÷ generated_tokens) × 1000 (average seconds per generated token, converted to milliseconds)

Unit: gigabytes (GB)
Peak memory (RAM or VRAM) used by the model during generation.
Lower is better for a given level of model quality; lower usage allows more models or instances to run on the same hardware.
Usage approaching available VRAM causes slowdown (memory swapping). Usage above available RAM causes out-of-memory errors.
Formula: model VRAM (in MB) ÷ 1024

Unit: percentage (%)
What fraction of the model is running on GPU vs CPU. 100% means the entire model is on GPU (fastest).
100% = entire model on GPU = best performance. 50%+ = good performance. Below 50% = significant CPU overhead.
0% = model running entirely on CPU = slowest possible inference.
Formula: GPU_parameters ÷ total_parameters × 100

Unit: tokens per second per GB (tok/s/GB)
Normalized speed considering memory cost. How many tokens can the model generate per second for each gigabyte of memory used?
Higher is better. Shows performance delivered per unit of hardware cost; 0.5-2.0 tok/s/GB is a solid range.
Below 0.1 indicates inefficient hardware utilization.
Formula: tokenGenerationSpeed ÷ (memoryUsage_in_MB ÷ 1024)

Token Estimation: When actual token counts aren't available from the inference server, we estimate tokens as:
estimated_tokens = word_count × 1.3
This 1.3 multiplier is empirically derived from the average ratio of tokens to words in English text.
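The formulas above can be sketched together in a small helper. This is an illustrative implementation, not the benchmark client's actual code; all function and field names are assumptions.

```python
# Sketch of the performance-metric formulas described above.
# Names are illustrative, not the benchmark client's real identifiers.

def estimate_tokens(text: str) -> int:
    """Fallback token estimate when the server reports no counts: word count × 1.3."""
    return round(len(text.split()) * 1.3)

def compute_metrics(generated_tokens: int, generation_time_s: float,
                    memory_usage_mb: float, gpu_params: int,
                    total_params: int) -> dict:
    memory_gb = memory_usage_mb / 1024           # MB → GB
    speed = generated_tokens / generation_time_s  # tok/s
    return {
        "token_speed_tok_s": speed,
        "per_token_latency_ms": generation_time_s / generated_tokens * 1000,
        "memory_usage_gb": memory_gb,
        "gpu_offload_pct": gpu_params / total_params * 100,
        "efficiency_tok_s_per_gb": speed / memory_gb,
    }

metrics = compute_metrics(generated_tokens=100, generation_time_s=2.0,
                          memory_usage_mb=8192, gpu_params=10, total_params=10)
# → 50.0 tok/s, 20.0 ms/token, 8.0 GB, 100.0% offload, 6.25 tok/s/GB
```

Note that the efficiency metric falls directly out of the first and third values, so it adds no new measurement, only a normalized view.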
We use an LLM (OpenRouter) to evaluate model responses across multiple dimensions. Each scenario has 4 dimensions weighted to emphasize what matters most for that task type.
For each dimension, an expert evaluator (LLM) assigns a score from 0-100 based on the quality definition for that scenario.
The KPI (Key Performance Indicator) is calculated as a weighted average:
KPI = Σ(dimension_score × dimension_weight)
The overall checkpoint quality score displayed on the benchmarks page is the average of KPI scores across all 4 scenarios.
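A minimal sketch of the KPI calculation, assuming hypothetical dimension names and weights (the actual scenario configurations may differ):

```python
# Illustrative KPI calculation: weighted average of dimension scores.
# Dimension names and weights here are examples, not the real configuration.

def kpi(scores: dict, weights: dict) -> float:
    """KPI = Σ(dimension_score × dimension_weight), weights summing to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[d] * weights[d] for d in weights)

weights = {"accuracy": 0.4, "coherence": 0.3, "completeness": 0.2, "style": 0.1}
scores  = {"accuracy": 90,  "coherence": 80,  "completeness": 70,  "style": 60}
scenario_kpi = kpi(scores, weights)   # 90×0.4 + 80×0.3 + 70×0.2 + 60×0.1 = 80

# Overall checkpoint quality = mean of the KPI scores across all 4 scenarios.
overall = sum([80.0, 75.0, 85.0, 70.0]) / 4   # = 77.5
```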
Here's what happens when you run a benchmark:
Benchmark client runs all 4 scenarios, capturing: prompt, full response, tokens used, generation speed, memory usage, GPU utilization.
Full benchmark data is sent to /api/benchmarks endpoint. Stored in database. Quality assessment jobs created (status: pending).
Background processor retrieves pending assessments and, for each scenario, calls the configured evaluator LLM with a scenario-specific prompt.
Scores and KPI stored in database with status: completed or failed. Average KPI calculated across all 4 scenarios.
Results appear on /benchmarks page. Click "expand" on any benchmark to see all quality scores and detailed assessment summaries.
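To make the submission step concrete, here is a hypothetical payload for the /api/benchmarks endpoint. The field names are guesses based on the data listed above; the endpoint's actual schema may differ.

```python
import json

# Hypothetical benchmark submission; field names are illustrative only.
payload = {
    "scenario": "example-scenario",       # one of the 4 scenarios
    "prompt": "<the prompt sent to the model>",
    "response": "<the model's full response>",
    "tokens_used": 128,
    "generation_speed_tok_s": 42.5,
    "memory_usage_mb": 8192,
    "gpu_offload_pct": 100.0,
}

# Serialized and POSTed to /api/benchmarks; the server stores it and
# creates a quality assessment job with status "pending".
body = json.dumps(payload)
```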