Archive · vps-81 historical telemetry · 2026-04-05-small-model-manual.html. Originally rendered 2026-04-05. Re-hosted from MyServers on 2026-05-06. Methodology and harness conventions may differ from what we use today; see /methodology.html for current standards.
vps50 telemetry manual

Small-model performance dashboard

This manual turns the small-model telemetry JSON into one scannable page so we can compare latency, throughput, suite pass-rate, requirement-hit-rate, and question-level behavior without reading a wall of raw logs.

Source: 2026-04-05-small-model-eval.json
Generated: 2026-04-10T16:01:01Z
Question suite: small-model-coding-eval-v1
Server: vmi3206382
Runner: ollama
Ollama: ollama version is 0.20.2

Question Set

Suite: small-model-coding-eval-v1

shell bash_code

Disk Guard Script

disk_guard_bash

Return only Bash code. Write a script that checks disk usage for /, prints a human-readable warning, and exits with status 1 when usage is above 85 percent. Requirements: include a shebang, use df -P /, parse the numeric percentage, and keep the script production-safe.

Markers: #!/usr/bin/env bash, df -P /, 85, exit 1

python python_code

IPv4 Validator

ipv4_python_tests

Return only Python code. Write a function named is_valid_ipv4(value: str) -> bool and include exactly three pytest tests that cover a valid address, an out-of-range octet, and a non-numeric input.
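For orientation, here is a minimal sketch of an answer that would satisfy this prompt. It is an illustration written for this page, not one of the captured model replies, and the choice to reject leading zeros is an assumption the prompt does not specify.

```python
def is_valid_ipv4(value: str) -> bool:
    """Return True when value is a dotted-quad IPv4 address."""
    parts = value.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        # isascii() plus isdigit() rejects signs, spaces, empty parts,
        # and non-ASCII digit characters.
        if not (part.isascii() and part.isdigit()):
            return False
        # Assumption: leading zeros such as "01" are treated as invalid.
        if len(part) > 1 and part[0] == "0":
            return False
        if int(part) > 255:
            return False
    return True


def test_valid_address():
    assert is_valid_ipv4("192.168.1.10")


def test_out_of_range_octet():
    assert not is_valid_ipv4("256.0.0.1")


def test_non_numeric_input():
    assert not is_valid_ipv4("abc.def.ghi.jkl")
```

Running `pytest` on a file containing this block should collect exactly three tests, which is the shape the prompt asks for.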

Markers: def is_valid_ipv4, def test_, assert, split('.')

ops shell_lines

Nginx Safe Reload

nginx_safe_reload

Return only Bash commands, one per line. Back up /etc/nginx/nginx.conf, validate nginx config, and reload nginx only if validation passes.

Markers: cp /etc/nginx/nginx.conf, nginx -t, systemctl reload nginx, &&

planning four_numbered_steps

YAML Validator Plan

yaml_cli_plan

Return exactly four numbered steps. Plan a Python CLI that scans a git repo for changed YAML files, validates them against a JSON schema, and exits nonzero on failure.

Markers: 1., 2., 3., 4., JSON schema, git

debugging five_bullets

SSH Lockout Triage

ssh_lockout_triage

Return exactly five bullet points. After hardening, SSH started returning Permission denied (publickey,password). List the safest first checks before changing config. Mention sshd_config, authorized_keys, journalctl, rollback, and PasswordAuthentication.

Markers: sshd_config, authorized_keys, journalctl, rollback, PasswordAuthentication

Models: 5 (small-model runs captured)
Avg latency: 52.93 s (across all models)
Avg tokens/sec: 6.3 tok/s (throughput comparison)
Pass rate: 60.0% (suite success average)
Requirement hit-rate: 81.9% (spec alignment average)

Latency

Lower is better. Sorted fastest to slowest.

Qwen2.5 Coder 1.5B: 23.21 s
Phi-3 Mini: 52.43 s
Qwen2.5 3B: 59.05 s
Llama 3.2 3B: 62.72 s
Qwen2.5 Coder 3B: 67.24 s

Tokens Per Second

Higher is better. Sorted fastest to slowest.

Qwen2.5 Coder 1.5B: 9.31 tok/s
Phi-3 Mini: 6.36 tok/s
Llama 3.2 3B: 5.76 tok/s
Qwen2.5 3B: 5.57 tok/s
Qwen2.5 Coder 3B: 4.56 tok/s
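The headline averages for latency and throughput can be cross-checked against the two charts above. The sketch below assumes the headline figures are plain unweighted means over the five models; the dictionary names are illustrative and not taken from the telemetry JSON.

```python
from statistics import mean

# Per-model latency in seconds, copied from the Latency chart.
latency_s = {
    "Qwen2.5 Coder 1.5B": 23.21,
    "Phi-3 Mini": 52.43,
    "Qwen2.5 3B": 59.05,
    "Llama 3.2 3B": 62.72,
    "Qwen2.5 Coder 3B": 67.24,
}

# Per-model throughput in tokens per second, copied from the chart above.
tokens_per_s = {
    "Qwen2.5 Coder 1.5B": 9.31,
    "Phi-3 Mini": 6.36,
    "Llama 3.2 3B": 5.76,
    "Qwen2.5 3B": 5.57,
    "Qwen2.5 Coder 3B": 4.56,
}

avg_latency = round(mean(latency_s.values()), 2)
avg_tps = round(mean(tokens_per_s.values()), 1)

# Rank models for each metric to see where the two orderings differ.
by_latency = sorted(latency_s, key=latency_s.get)
by_throughput = sorted(tokens_per_s, key=tokens_per_s.get, reverse=True)

print(f"avg latency: {avg_latency} s")      # 52.93
print(f"avg throughput: {avg_tps} tok/s")   # 6.3
print("rank differs for:", [m for a, m in zip(by_latency, by_throughput) if a != m])
```

Under that assumption both headline numbers reproduce, and the only ordering difference between the two charts is that Qwen2.5 3B and Llama 3.2 3B swap places.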

Pass Rate

Higher is better. Derived from the small-model evaluation suite.

Phi-3 Mini: 80.0%
Qwen2.5 Coder 3B: 60.0%
Qwen2.5 Coder 1.5B: 60.0%
Qwen2.5 3B: 60.0%
Llama 3.2 3B: 40.0%
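The page does not say how a pass is decided. One reading that matches every figure above is that a question counts as passed when its cell in the Per-Question Matrix is marked "fmt" rather than "no fmt". That rule is an inference from the numbers, not something the harness documents; the sketch below only checks that it is consistent.

```python
# Format flags per model in question order (Disk Guard, IPv4, Nginx,
# YAML plan, SSH triage), copied from the Per-Question Matrix.
# True means "fmt", False means "no fmt".
fmt_ok = {
    "Qwen2.5 Coder 3B":   [False, True, True, True, False],
    "Qwen2.5 Coder 1.5B": [False, True, True, True, False],
    "Qwen2.5 3B":         [False, False, True, True, True],
    "Llama 3.2 3B":       [False, True, True, False, False],
    "Phi-3 Mini":         [False, True, True, True, True],
}

# Assumed rule: pass rate = share of questions whose format check succeeded.
pass_rate = {model: 100.0 * sum(flags) / len(flags) for model, flags in fmt_ok.items()}
suite_average = sum(pass_rate.values()) / len(pass_rate)

for model, rate in sorted(pass_rate.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<20} {rate:.1f}%")
print(f"suite average: {suite_average:.1f}%")
```

If the harness uses a different pass criterion, the agreement here is coincidental and the telemetry JSON should be treated as the authority.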

Requirement Hit-Rate

Higher is better. Measures how closely the answer matches the requested shape.

Qwen2.5 Coder 3B: 90.0%
Qwen2.5 Coder 1.5B: 90.0%
Qwen2.5 3B: 81.0%
Phi-3 Mini: 81.0%
Llama 3.2 3B: 67.7%
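Each model's hit-rate appears to be the unweighted mean of its five per-question percentages in the Per-Question Matrix. The sketch below checks that reading; the averaging rule is an assumption that happens to reproduce the published figures.

```python
# Per-question percentages in question order (Disk Guard, IPv4, Nginx,
# YAML plan, SSH triage), copied from the Per-Question Matrix.
coverage = {
    "Qwen2.5 Coder 3B":   [75.0, 100.0, 75.0, 100.0, 100.0],
    "Qwen2.5 Coder 1.5B": [75.0, 100.0, 75.0, 100.0, 100.0],
    "Qwen2.5 3B":         [50.0, 100.0, 75.0, 100.0, 80.0],
    "Llama 3.2 3B":       [75.0, 75.0, 75.0, 33.3, 80.0],
    "Phi-3 Mini":         [50.0, 75.0, 100.0, 100.0, 80.0],
}

# Assumed rule: hit-rate = mean of the five per-question percentages.
hit_rate = {model: round(sum(vals) / len(vals), 1) for model, vals in coverage.items()}
overall = round(sum(hit_rate.values()) / len(hit_rate), 1)

for model, rate in hit_rate.items():
    print(f"{model:<20} {rate:.1f}%")
print(f"overall: {overall:.1f}%")  # 81.9
```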

Per-Question Matrix

Each cell summarizes marker coverage and format fidelity for one model-question pair.

| Model | Disk Guard Script | IPv4 Validator | Nginx Safe Reload | YAML Validator Plan | SSH Lockout Triage |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5 Coder 3B | 75.0%, no fmt • reply | 100.0%, fmt • reply | 75.0%, fmt • reply | 100.0%, fmt • reply | 100.0%, no fmt • reply |
| Qwen2.5 Coder 1.5B | 75.0%, no fmt • reply | 100.0%, fmt • reply | 75.0%, fmt • reply | 100.0%, fmt • reply | 100.0%, no fmt • reply |
| Qwen2.5 3B | 50.0%, no fmt • reply | 100.0%, no fmt • reply | 75.0%, fmt • reply | 100.0%, fmt • reply | 80.0%, fmt • reply |
| Llama 3.2 3B | 75.0%, no fmt • reply | 75.0%, fmt • reply | 75.0%, fmt • reply | 33.3%, no fmt • reply | 80.0%, no fmt • reply |
| Phi-3 Mini | 50.0%, no fmt • reply | 75.0%, fmt • reply | 100.0%, fmt • reply | 100.0%, fmt • reply | 80.0%, fmt • reply |
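The percentages in the matrix line up with the number of expected strings listed under each question in the Question Set (four, four, four, six, and five), so a cell of 75.0% corresponds to three of four and 33.3% to two of six. A minimal sketch of such a check follows, assuming coverage is a plain substring match; `marker_coverage` and the sample reply are hypothetical and not taken from the harness or the captured answers.

```python
def marker_coverage(reply: str, markers: list[str]) -> float:
    """Percentage of markers that occur as substrings of reply."""
    if not markers:
        return 0.0
    hits = sum(1 for marker in markers if marker in reply)
    return round(100.0 * hits / len(markers), 1)


# Expected strings for disk_guard_bash, as listed in the Question Set.
disk_guard_markers = ["#!/usr/bin/env bash", "df -P /", "85", "exit 1"]

# Hypothetical reply: correct logic, but a different shebang than expected.
sample_reply = """#!/bin/bash
usage=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')
if [ "$usage" -gt 85 ]; then
  echo "WARNING: / is at ${usage}% capacity"
  exit 1
fi
"""

print(marker_coverage(sample_reply, disk_guard_markers))  # 75.0
```

A substring check like this rewards exact phrasing, which may explain why a reply can be functionally sound and still score below 100%.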

Large-Model Hello Check

Prompt: hi can you help me?

Large-model hello latency

This keeps the earlier six-model smoke test visible inside the same dashboard.

qwen14-coder-32k: 54.0 s
qwen14-general-32k: 66.0 s
codestral-32k: 101.0 s
phind34-16k: 113.0 s
qwen32-coder-32k: 175.0 s
codellama34-16k: 182.0 s

qwen32-coder-32k

replied

Size: 32b

Latency: 175.0 s

Characters: 255 chars

Of course! I'd be happy to help you. Could you please provide more details about what you need assistance with? Are you facing an issue with a specific piece of software, a bug, or something else? The more details you provide, the better I can assist you.

qwen14-coder-32k

replied

Size: 14b

Latency: 54.0 s

Characters: 44 chars

Of course! What do you need assistance with?

codestral-32k

replied

Size: 22b

Latency: 101.0 s

Characters: 229 chars

Of course! I'd be happy to help you with your coding questions or problems. Could you please provide more details about the specific issue or task you're facing? This will allow me to provide a more accurate and helpful response.

codellama34-16k

replied

Size: 34b

Latency: 182.0 s

Characters: 267 chars

Hello! I'd be happy to help you with any questions or problems you have. Could you please provide more information about what you need help with, such as the code you're working on or the issue you're facing? The more details you provide, the better I can assist you.

phind34-16k

replied

Size: 34b

Latency: 113.0 s

Characters: 186 chars

Of course, I'd be happy to help. However, you haven't mentioned what issue or problem you're facing. Please provide more details about your query or the problem you need assistance with.

qwen14-general-32k

replied

Size: 14b

Latency: 66.0 s

Characters: 241 chars

Of course! How can I assist you today? Do you need help with creating an execution plan, identifying risks, defining acceptance criteria, or something else related to software delivery? Please provide more details so I can better assist you.
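Because the six replies differ a lot in length, raw latency alone can mislead. The sketch below divides each reply's character count by its latency, using only the figures on the cards above. Treat the result as a rough indicator: the latency probably includes fixed costs such as model loading and prompt processing (an assumption, since the page does not break it down), so short replies look slower than they are.

```python
# (latency in seconds, reply length in characters), copied from the cards above.
hello_runs = {
    "qwen32-coder-32k":   (175.0, 255),
    "qwen14-coder-32k":   (54.0, 44),
    "codestral-32k":      (101.0, 229),
    "codellama34-16k":    (182.0, 267),
    "phind34-16k":        (113.0, 186),
    "qwen14-general-32k": (66.0, 241),
}

# Characters emitted per second of wall-clock latency.
chars_per_s = {name: chars / latency for name, (latency, chars) in hello_runs.items()}

for name, rate in sorted(chars_per_s.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {rate:5.2f} chars/s")
```

On these numbers qwen14-general-32k comes out highest at roughly 3.65 chars/s, while qwen14-coder-32k, the lowest-latency model, comes out lowest because its reply is only 44 characters long.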

Model Rollup

Compact index of the captured model rows for quick cross-checking.