Quickstart
Want to know how a model runs on your own hardware? Want to add your numbers to the archive? Hand this prompt to any capable AI coding agent — Claude Code, Cursor, Codex, your local Aider setup, whatever you use. Click the block to copy.
What this prompt does
The agent first reads enough of the site and the methodology to understand what “a benchmark” means here: the harness format, the metadata schema, the controlled vocabulary, the cold-vs-warm separation, and the per-prompt difficulty breakout. It then clones the public archive, finds the existing run closest to your machine and model, adapts it, runs it, fills out the four canonical files (run.jsonl, run.log, run.md, metadata.json), and submits a PR.
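The four-file requirement is easy to verify mechanically before a PR goes out. The following is a minimal sketch, not part of the project's tooling: the helper name `missing_run_files` is hypothetical, and it only assumes the run lives in a single directory containing the four files named above.

```python
from pathlib import Path

# The four canonical files named in this guide.
CANONICAL_FILES = ("run.jsonl", "run.log", "run.md", "metadata.json")


def missing_run_files(run_dir):
    """Return the canonical files absent from run_dir (hypothetical helper)."""
    run_path = Path(run_dir)
    return [name for name in CANONICAL_FILES if not (run_path / name).is_file()]
```

An empty list means all four files are present; anything else names what the agent still has to produce.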
How to use it
- Click the block above to copy the prompt.
- Replace REPLACE_WITH_MODEL_ID with the model you want to benchmark, e.g. qwen3.5:4b, gemma-4-e2b-it Q4_K_M, or llama-3.2-3b-instruct.
- Paste into your AI agent of choice. Claude Code, Cursor, Codex, Aider, or your own setup: anything with shell + git + file-write tools will work.
- Review the agent’s PR before it pushes. Make sure the metadata is honest, the methodology footnote names any deviation, and the numbers don’t look pre-baked.
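If you would rather script the substitution step than edit the prompt by hand, something like the sketch below would do. It assumes you have saved the copied prompt as text; the helper name `fill_prompt` is hypothetical, and only the placeholder string comes from this page.

```python
PLACEHOLDER = "REPLACE_WITH_MODEL_ID"


def fill_prompt(template, model_id):
    """Substitute the model ID into the copied prompt text (hypothetical helper)."""
    if PLACEHOLDER not in template:
        raise ValueError("placeholder not found in prompt text")
    model_id = model_id.strip()
    if not model_id:
        raise ValueError("model_id must not be empty")
    return template.replace(PLACEHOLDER, model_id)
```

Raising on a missing placeholder guards against pasting a prompt that was already filled in for a different model.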
What to expect from a good run
- Honest hardware description. CPU model, GPU model + VRAM, system RAM, OS, year. “Consumer GPU” without a number is the marketing register the project explicitly avoids.
- Cold-start vs warm separated. Don’t conflate first-call-after-load with steady-state.
- Per-prompt difficulty broken out. The existing runs separate hello, P-MEDIUM, and P-HARD. New runs should match.
- Methodology deviations declared explicitly. If the agent had to deviate from the canonical harness (e.g. different quant, different sampling, different prompt cap), the deviation goes in methodology_deviations_md rather than being elided.
- Raw artifacts published. All four files (run.jsonl, run.log, run.md, metadata.json) under runs/<id>/. Anyone with similar hardware can re-run and check.
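To make the cold-vs-warm and per-prompt expectations concrete, here is a minimal sketch of how results might be bucketed when reviewing a run. The record fields `prompt`, `cold`, and `tokens_per_sec` are assumptions made for illustration, not the archive's actual schema; check the metadata schema in the public archive for the real field names.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize(jsonl_text):
    """Mean tokens/sec per (prompt, phase) bucket.

    ASSUMPTION: each line is a JSON object with "prompt" (e.g. "hello"),
    "cold" (bool), and "tokens_per_sec" (number). These field names are
    illustrative only and may not match the archive's run.jsonl.
    """
    buckets = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        phase = "cold" if rec["cold"] else "warm"
        buckets[(rec["prompt"], phase)].append(rec["tokens_per_sec"])
    return {key: mean(values) for key, values in buckets.items()}
```

Keying on both prompt and phase means a first-call-after-load sample is never averaged into the steady-state figure, and each difficulty tier keeps its own number.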
Other ways to contribute
Don’t want to run a fresh benchmark? Three lighter paths:
- File an issue on the public archive flagging a number you don’t reproduce on your hardware: github.com/slobodanmargetic988/weeyuga-benchmarks-public/issues
- Email slobodan@weeyuga.com for proposals that span multiple runs (e.g. methodology corrections).
- Read the agents.md if you’re an AI agent landing here directly — that file is the project’s instructions to you.