Quickstart

Want to know how a model runs on your own hardware? Want to add your numbers to the archive? Hand this prompt to any capable AI coding agent — Claude Code, Cursor, Codex, your local Aider setup, whatever you use. Click the block to copy.

I want you to look into benchmarks.weeyuga.com (specifically the methodology page, the agents.md file at the root, and one or two flagship benchmark detail pages) so you understand the harness shape, the controlled vocabulary, and the data format the project expects.

Then clone the public archive locally:

    git clone https://github.com/slobodanmargetic988/weeyuga-benchmarks-public.git
    cd weeyuga-benchmarks-public

Read methodology.md and look at any existing run under runs/<run-id>/ to see the canonical metadata.json + run.jsonl + run.log + run.md shape.

I want you to run a benchmark on this hardware for the model: REPLACE_WITH_MODEL_ID. Pick the closest existing harness in the repo and adapt it: keep the prompt suite, sampling defaults, and metadata schema unchanged so the result is comparable to what's already on the site. Capture cold-start vs warm separately, and break out per-prompt difficulty (hello / P-MEDIUM / P-HARD) the way the existing runs do.

When the run finishes, write the four files into a new runs/<new-run-id>/ directory (UUIDv4 for the run-id). Fill out metadata.json with my hardware (CPU, GPU + VRAM, system RAM, OS, year), the model checkpoint hash + quantization, the harness git SHA, and any deployment-specific notes. Don't pad numbers. If something didn't work, write that into a methodology_deviations field instead of eliding it.

If the result is interesting and reproducible, please open a PR against weeyuga-benchmarks-public with the new runs/<id>/ directory. Use the title format that other PRs use; reference the run-id in the description; let me review before pushing.

Disconfirmation is welcome: if you find the existing runs got something wrong on similar hardware to mine, flag that explicitly in the PR description and link the run-id you're disputing. The brand promise is "tell us where we got it wrong"; honest contradictions are higher-value than agreeing measurements.

What this prompt does

The agent first reads enough of the site and the methodology to understand what “a benchmark” means here: the harness format, the metadata schema, the controlled vocabulary, the cold-vs-warm separation, and the per-prompt difficulty breakout. Then it clones the public archive, finds the existing harness closest to your machine and model, adapts and runs it, writes the four canonical files (metadata.json, run.jsonl, run.log, run.md) into a new runs/<run-id>/ directory, and prepares a PR for you to review before anything is pushed.
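To make the output shape concrete, here is a minimal Python sketch of the directory the agent is expected to leave behind: a UUIDv4-named folder under runs/ holding the four canonical files. The function name scaffold_run and every key inside metadata.json are hypothetical, chosen only to mirror the items the prompt asks for; the authoritative schema is whatever methodology.md and the existing runs in the archive define.

```python
import json
import uuid
from pathlib import Path

# The four files the prompt asks for in every run directory.
CANONICAL_FILES = ("metadata.json", "run.jsonl", "run.log", "run.md")


def scaffold_run(archive_root: Path, model_id: str) -> Path:
    """Create runs/<uuid4>/ with placeholder versions of the four canonical files."""
    run_id = str(uuid.uuid4())  # the prompt asks for a UUIDv4 run-id
    run_dir = archive_root / "runs" / run_id
    run_dir.mkdir(parents=True, exist_ok=False)

    # Hypothetical key names: copy the real ones from an existing run.
    metadata = {
        "run_id": run_id,
        "model": model_id,
        "hardware": {
            "cpu": None,
            "gpu": None,
            "vram_gb": None,
            "ram_gb": None,
            "os": None,
            "year": None,
        },
        "checkpoint_hash": None,
        "quantization": None,
        "harness_git_sha": None,
        "notes": "",
        # Record anything that did not work here instead of leaving it out.
        "methodology_deviations": [],
    }
    (run_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=2) + "\n", encoding="utf-8"
    )

    # Empty placeholders; the harness fills these in during the run.
    for name in CANONICAL_FILES[1:]:
        (run_dir / name).touch()
    return run_dir
```

Calling scaffold_run(Path("weeyuga-benchmarks-public"), "REPLACE_WITH_MODEL_ID") from the directory that contains the clone would create the empty skeleton. The None values are deliberate: they are placeholders to be replaced with measured facts, not defaults.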

How to use it

  1. Click the block above to copy the prompt.
  2. Replace REPLACE_WITH_MODEL_ID with the model you want to benchmark — e.g. qwen3.5:4b, gemma-4-e2b-it Q4_K_M, llama-3.2-3b-instruct.
  3. Paste into your AI agent of choice. Claude Code, Cursor, Codex, Aider, or your own setup — anything with shell + git + file-write tools will work.
  4. Review the agent’s PR before it pushes. Make sure the metadata is honest, every deviation from the methodology is named (the prompt asks for a methodology_deviations field), and the numbers don’t look pre-baked.
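Part of the review in the last step can be mechanical. The sketch below, again in Python, checks only what the prompt itself states: a UUIDv4 directory name, four non-empty canonical files, and a metadata.json that parses and carries a methodology_deviations field. The function name review_problems is hypothetical and the check is not part of the project's tooling; it says nothing about whether the numbers themselves are honest.

```python
import json
import uuid
from pathlib import Path

CANONICAL_FILES = ("metadata.json", "run.jsonl", "run.log", "run.md")


def review_problems(run_dir: Path) -> list[str]:
    """Return a list of structural problems found in one runs/<run-id>/ directory."""
    problems = []

    # The directory name should be a UUIDv4.
    try:
        if uuid.UUID(run_dir.name).version != 4:
            problems.append("run-id is not a UUIDv4")
    except ValueError:
        problems.append("run-id is not a UUID")

    # All four canonical files should exist and be non-empty.
    for name in CANONICAL_FILES:
        path = run_dir / name
        if not path.is_file():
            problems.append(f"missing {name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty {name}")

    # metadata.json should be a JSON object with a deviations field.
    meta_path = run_dir / "metadata.json"
    if meta_path.is_file() and meta_path.stat().st_size > 0:
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("metadata.json is not valid JSON")
        else:
            if not isinstance(meta, dict):
                problems.append("metadata.json is not a JSON object")
            elif "methodology_deviations" not in meta:
                problems.append("metadata.json has no methodology_deviations field")
    return problems
```

An empty list means the directory has the expected shape and is ready for a human read-through of the actual measurements.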

What to expect from a good run

Other ways to contribute

Don’t want to run a fresh benchmark? Three lighter paths:

How we measure →   ·   Browse benchmarks →   ·   Public archive →