# benchmarks.weeyuga.com

> Honest benchmarks for personal AI on consumer hardware. Used gaming laptops, base-model Macs, cheap CPU VPSes — measured, not estimated. Published with raw JSONL, methodology, and the git SHA of the harness so anyone can reproduce. Released CC-BY-4.0 with no marketing register.

This site exists to save the next person time. What didn't work matters more than what did. Every claim links to its underlying run; every run links to its raw artifacts; every methodology gap we know about is documented up front.

## What's on the site

- [Home](https://benchmarks.weeyuga.com/): hero question + walk through the four hardware tiers we measure (CPU VPS, 4 GB GPU laptop, 6 GB GPU laptop, M1 Mac).
- [Quickstart](https://benchmarks.weeyuga.com/quickstart.html): click-to-copy agent prompt that delegates a whole new benchmark contribution to a coding agent (read methodology, clone the public archive, run the bench, file a PR). The fastest path from "I want to benchmark this model on my hardware" to "my numbers are in the archive."
- [Browse benchmarks](https://benchmarks.weeyuga.com/benchmarks/): chip-filtered card list of every site-grade run, by hardware, model family, or task kind.
- [Catalog](https://benchmarks.weeyuga.com/catalog.html): the full 125-entry index including the 104 legacy reports surfaced from earlier campaigns.
- [Findings](https://benchmarks.weeyuga.com/findings/): synthesis posts that weave multiple benchmarks together, plus a "dead ends" subsection for things we tried that didn't work.
- [How we measure](https://benchmarks.weeyuga.com/methodology.html): what we measure, what we don't, fairness rules, reproducibility, and a running list of mistakes we've published and corrected.
- [Data](https://benchmarks.weeyuga.com/data.html): bulk download + GitHub archive link + citation block.
- [About](https://benchmarks.weeyuga.com/about.html): one paragraph on who runs it (Sloba Margetic) and why.

## Schema for machine readers

- [Catalogue (JSON)](https://benchmarks.weeyuga.com/catalogue.json): every benchmark with id, hardware, model_family, task_kind, results_table, chart_spec, raw_data_urls, related_ids, site_grade. Schema version `1.0` per Ben's lock 2026-05-06T10:30Z.
- [Build metadata (JSON)](https://benchmarks.weeyuga.com/build.json): currently-deployed git SHA + timestamp + grade breakdown.
- [Public GitHub archive](https://github.com/slobodanmargetic988/weeyuga-benchmarks-public): every run's raw `run.jsonl`, `run.log`, human-readable `run.md`, and `metadata.json` (model checkpoint, env, harness git SHA).

## What you can rely on

- **Hardware honestly described.** Every benchmark declares the exact machine: CPU model, GPU model + VRAM, RAM, year. No "consumer GPU" without a number.
- **Methodology declared per run.** Every detail page links to the methodology section that produced it. Run-specific deviations get a `methodology_deviations` field rather than being elided.
- **Cold-start vs warm separated.** First-call-after-load measurements are tagged separately from steady-state measurements. Speedup ratios published.
- **Per-prompt difficulty broken out.** `hello`, `P-MEDIUM`, `P-HARD` (and `P-EASY` where the suite covers it) measured separately so you can see whether a model's bottleneck is parsing or generation.
- **Caveats published, not buried.** Any benchmark with a methodology gap or framing risk gets an explicit caveat box in warm-amber on its detail page.
- **Mistakes corrected, not memory-holed.** The methodology page lists corrections; v0 numbers stay reachable even after v1 supersedes them.

## What's deliberately not here

- Hosted-API comparisons. We don't run benchmarks on hardware we don't own; cloud passthroughs aren't measured here.
- Marketing claims. No "fastest local AI on the planet" — every superlative has a measurement next to it.
- Frontier-class GPUs. The most capable card we test is a six-year-old GTX 1060 with 6 GB of VRAM. The site is for engineers planning around hardware they can actually buy used.
- Synthetic benchmarks with no production analog. Every measurement maps to a workload someone might actually run.

## How to contribute

You can re-run any benchmark on your own hardware and tell us where we got it wrong. Four paths:

1. **Hand the [`/quickstart.html`](https://benchmarks.weeyuga.com/quickstart.html) prompt to a coding agent** (Claude Code / Cursor / Codex / Aider / your own). The page hosts a click-to-copy prompt that does the whole loop: read the methodology, clone the public archive, run a benchmark against the model you specify on your hardware, fill out the four canonical files, open a PR. Highest leverage if you already have an AI agent in your workflow.
2. **File an issue** on the public GitHub repo: <https://github.com/slobodanmargetic988/weeyuga-benchmarks-public/issues>. Mention the run id (e.g. `09d8fbde-...`) and what you measured differently.
3. **Submit a PR** with your own benchmark JSONL. The harness shape is documented in `methodology.md` inside the public repo. Same metadata fields, same controlled vocab.
4. **Email** <slobodan@weeyuga.com> if your contribution doesn't fit issue or PR shape (e.g. a corrected methodology proposal that spans multiple runs).

Disconfirmation is more valuable than another credulous re-share. If you find a number that doesn't match what your hardware does on the same model with the same prompt, please tell us — the fix is mechanical and the credibility win is large.

## Citation

```
Margetic, S. et al. (2026). benchmarks.weeyuga.com.
Public benchmarks of personal AI on consumer hardware.
```

Per-benchmark citation blocks are on every detail page. Data: CC-BY-4.0. Helper scripts: MIT.