Benchmark Methodology

Understanding how we measure LLM performance. This page explains the 4 assessment scenarios, the performance metrics we track, and how quality scores are calculated.

1. Assessment Scenarios

We test models with 4 distinct scenarios, each evaluating a different set of capabilities. Each scenario uses a specific temperature setting suited to its type of task.

Understanding Temperature

Temperature controls the randomness of the model's responses:

  • 0.0-0.3 (Cold): Very deterministic. Same prompt produces nearly identical responses. Best for tasks requiring consistency like code generation.
  • 0.4-0.6 (Moderate): Balanced. Some variety while maintaining coherence. Good for planning and analysis tasks.
  • 0.7-1.0+ (Hot): High randomness. Produces creative, varied responses. Essential for narrative and roleplay tasks.
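The temperature bands above can be sketched as a small helper. The scenario names and exact temperature values here are illustrative, not the actual benchmark configuration:

```python
# Illustrative per-scenario temperatures (hypothetical values; the real
# benchmark config may differ).
SCENARIO_TEMPERATURES = {
    "code_generation": 0.2,  # cold: deterministic, consistent output
    "planning": 0.5,         # moderate: some variety, still coherent
    "roleplay": 0.9,         # hot: creative, varied responses
}

def temperature_band(t: float) -> str:
    """Classify a temperature into the bands described above."""
    if t <= 0.3:
        return "cold"
    if t <= 0.6:
        return "moderate"
    return "hot"
```

For example, `temperature_band(0.2)` returns `"cold"`, matching the code-generation setting.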

2. Performance Metrics

These metrics measure how fast the model runs and how efficiently it uses hardware. All metrics are collected automatically during benchmark execution.

Token Generation Speed

Unit: tokens per second (tok/s)

The speed at which the model generates output tokens after processing the input.

✓ What's Good

Higher is better. Typically 10-150 tok/s depending on model size and hardware. Larger models generally produce fewer tokens/sec.

✗ What's Bad

Below 5 tok/s may indicate poor hardware or model sizing issues. Values over 200 tok/s are unusual and may indicate measurement error.

Formula: generated_tokens ÷ generation_time (seconds)
Notes: Actual tokens are counted by the inference server. For Ollama, this is reported by the API. Token count is estimated as word_count × 1.3 if not directly available.
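The formula and the word-count fallback above can be sketched together; the function and parameter names are illustrative, not taken from the benchmark code:

```python
def token_generation_speed(generated_tokens, generation_time_s, word_count=None):
    """tok/s = generated_tokens / generation_time.

    Falls back to the word_count * 1.3 estimate when the inference
    server does not report a token count.
    """
    if generated_tokens is None:
        if word_count is None:
            raise ValueError("need either a token count or a word count")
        generated_tokens = word_count * 1.3  # estimation described above
    return generated_tokens / generation_time_s
```

So 300 tokens generated in 10 seconds yields 30 tok/s, and a 100-word response with no server token count is estimated at 130 tokens.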

Time to First Token

Unit: milliseconds (ms)

The latency between sending a request and receiving the first output token. Measures "perceived responsiveness."

✓ What's Good

Lower is better. 50-500ms is typical. Under 100ms feels very responsive; over 1000ms feels sluggish.

✗ What's Bad

Over 2000ms (2 seconds) creates noticeable lag in interactive applications.

Formula: (generation_time ÷ generated_tokens) × 1000
Notes: This metric is derived rather than directly measured: the formula gives the average milliseconds per generated token, which is used as a proxy for first-token latency.
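Since the value is derived, it reduces to one division; the function name below is illustrative:

```python
def derived_first_token_ms(generation_time_s: float, generated_tokens: int) -> float:
    """Average milliseconds per generated token, used as a proxy for
    time-to-first-token (derived, not directly measured)."""
    return generation_time_s / generated_tokens * 1000
```

A run that produces 100 tokens in 10 seconds yields a derived value of 100 ms.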

Memory Usage

Unit: gigabytes (GB)

Peak memory (RAM or VRAM) used by the model during generation.

✓ What's Good

Lower is better for given model quality. Allows more models/instances on same hardware.

✗ What's Bad

Usage approaching available VRAM causes slowdown (memory swapping). Usage above available RAM causes out-of-memory errors.

Formula: memory_usage (in MB) ÷ 1024 = GB
Notes: Typically reported by the inference tool (Ollama, LM Studio, etc.). Includes model weights + KV cache.

GPU Utilization

Unit: percentage (%)

The fraction of the model's parameters running on GPU rather than CPU. 100% means the entire model is on GPU (fastest).

✓ What's Good

100% = entire model on GPU = best performance. 50%+ = good performance. Below 50% = significant CPU overhead.

✗ What's Bad

0% = model running entirely on CPU = slowest possible inference.

Formula: GPU_parameters ÷ total_parameters × 100
Notes: Reported as "processorGpuPercent" in benchmark results. Related value "processorCpuPercent" shows CPU percentage.
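The formula above maps directly to code; the function name is illustrative, though the parameter names follow the formula:

```python
def gpu_utilization_percent(gpu_parameters: int, total_parameters: int) -> float:
    """Percentage of model parameters resident on the GPU."""
    return gpu_parameters / total_parameters * 100
```

A fully offloaded 7B-parameter model scores 100%; a model with no layers on GPU scores 0%.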

Efficiency Ratio

Unit: tokens per second per GB (tok/s/GB)

Normalized speed considering memory cost. How many tokens can the model generate per second for each gigabyte of memory used?

✓ What's Good

Higher is better. Shows quality/performance per unit of hardware cost. 0.5-2.0 is solid range.

✗ What's Bad

Below 0.1 indicates inefficient hardware utilization.

Formula: tokenGenerationSpeed ÷ (memoryUsage_in_MB ÷ 1024)
Notes: Allows fair comparison between different model sizes. A large model using 48GB can be more efficient than a small model on poor hardware.
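The ratio can be computed from the two metrics defined earlier; the function name is illustrative:

```python
def efficiency_ratio(token_speed_tok_s: float, memory_usage_mb: float) -> float:
    """Tokens per second per GB of memory used."""
    return token_speed_tok_s / (memory_usage_mb / 1024)
```

For example, a model generating 48 tok/s while using 49,152 MB (48 GB) has an efficiency ratio of 1.0 tok/s/GB, inside the 0.5-2.0 range called solid above.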

Metrics Context

Token Estimation: When actual token counts aren't available from the inference server, we estimate tokens as:

estimated_tokens = word_count × 1.3

This 1.3 multiplier is an empirically derived average of the token-to-word ratio in English text.

3. Quality Assessment & Scoring

We use an LLM (accessed via OpenRouter) to evaluate model responses across multiple dimensions. Each scenario has 4 dimensions, weighted to emphasize what matters most for that task type.

How Scoring Works

Step 1: Dimension Scoring

For each dimension, an expert evaluator (LLM) assigns a score from 0-100 based on the quality definition for that scenario.

Step 2: Weighted Average

The KPI (Key Performance Indicator) is calculated as a weighted average:

KPI = Σ(dimension_score × dimension_weight)

Step 3: Aggregation

The overall checkpoint quality score displayed on the benchmarks page is the average of KPI scores across all 4 scenarios.
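The three steps above can be sketched as two small functions; names are illustrative, and the example weights are hypothetical (the real per-scenario weights are not listed here):

```python
def scenario_kpi(scores: dict, weights: dict) -> float:
    """Step 2: weighted average of dimension scores (weights sum to 1.0)."""
    return sum(scores[d] * weights[d] for d in weights)

def overall_quality(scenario_kpis: list) -> float:
    """Step 3: checkpoint quality = mean KPI across all scenarios."""
    return sum(scenario_kpis) / len(scenario_kpis)
```

With hypothetical weights {0.4, 0.3, 0.2, 0.1} and dimension scores {80, 70, 90, 60}, the scenario KPI is 77.0; averaging four scenario KPIs then gives the overall checkpoint score.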

4. Assessment Flow

Here's what happens when you run a benchmark:

Step 1: Client Collects Data

The benchmark client runs all 4 scenarios, capturing the prompt, full response, tokens used, generation speed, memory usage, and GPU utilization.

Step 2: Submit to Server

The full benchmark payload is sent to the /api/benchmarks endpoint and stored in the database. Quality assessment jobs are created with status: pending.

Step 3: Async Quality Assessment

A background processor retrieves pending assessments and, for each scenario, calls the configured LLM with a scenario-specific prompt.

Step 4: Store Results

Scores and the KPI are stored in the database with status: completed or failed. The average KPI is calculated across all 4 scenarios.

Step 5: Display Results

Results appear on the /benchmarks page. Click "expand" on any benchmark to see all quality scores and detailed assessment summaries.
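The assessment lifecycle (steps 3-4) can be sketched as a status transition; this is a hypothetical sketch, not the actual server code, and the field names are illustrative:

```python
def process_pending_assessment(job: dict, evaluate) -> dict:
    """Run the scenario-specific evaluator on a pending job and record
    the outcome as status 'completed' or 'failed'."""
    try:
        # evaluate() stands in for the LLM call with the scenario prompt.
        job["scores"] = evaluate(job["scenario"], job["response"])
        job["kpi"] = sum(job["scores"].values()) / len(job["scores"])
        job["status"] = "completed"
    except Exception:
        job["status"] = "failed"
    return job
```

A job thus moves from pending to completed (with scores and a KPI attached) or to failed if the evaluator errors out.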

Methodology v1.0 • Last updated March 2026