Local Mac MLX Benchmark
Qwen3.5 4B MLX Full Run
This report captures the completed Apple-silicon MLX lane on the 8 GB M1 MacBook Air. The final run disabled Qwen's thinking mode in the chat template so that benchmark answers stayed human-facing and format-compliant rather than spilling chain-of-thought-style text.
- Total Questions: 35
- Total Wall Time: 1081.9s
- Weighted Primary Avg: 19.75s
- Weighted Primary Tok/s: 16.14
- Primary Format Pass: 5/35
- Follow-up Format Pass: 35/35
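
The report does not define "weighted". One reading that reproduces both headline figures is a question-count-weighted mean of the per-suite averages, which is the same as a plain mean over all 35 rows. The sketch below checks that reading against the Primary and Tok/s columns of the suite tables; the dictionary layout and the `weighted_mean` helper are illustrative assumptions, not the harness's actual code.

```python
# Per-question primary latency (s) and throughput (tok/s), copied from the suite tables.
# The structure and names here are hypothetical; only the numbers come from the report.
suites = {
    "small-model-coding-eval-v1": {
        "primary_s": [9.257, 13.642, 3.397, 11.230, 10.275],
        "tok_s": [14.91, 17.01, 10.01, 15.58, 15.57],
    },
    "overnight-python-telemetry-v1": {
        "primary_s": [17.146, 14.576, 9.947, 4.210, 7.001,
                      14.672, 24.083, 31.132, 10.306, 4.441,
                      12.334, 15.921, 13.157, 6.159, 3.276,
                      9.143, 15.536, 12.375, 4.239, 25.890],
        "tok_s": [17.67, 15.78, 16.39, 12.35, 14.71,
                  16.77, 15.61, 17.25, 15.82, 10.81,
                  18.08, 18.78, 18.24, 16.89, 12.82,
                  17.28, 17.25, 13.66, 11.09, 17.19],
    },
    "overnight-python-telemetry-v2": {
        "primary_s": [38.507, 48.141, 37.823, 48.104, 38.529,
                      48.196, 31.124, 53.665, 24.918, 18.833],
        "tok_s": [18.18, 18.70, 18.51, 18.71, 18.17,
                  18.67, 17.00, 17.70, 16.17, 15.56],
    },
}


def weighted_mean(per_suite_values):
    """Average the per-suite means, weighting each suite by its question count."""
    total_questions = sum(len(values) for values in per_suite_values)
    weighted_sum = sum(
        (sum(values) / len(values)) * len(values) for values in per_suite_values
    )
    return weighted_sum / total_questions


primary_avg = weighted_mean([s["primary_s"] for s in suites.values()])
tok_s_avg = weighted_mean([s["tok_s"] for s in suites.values()])

print(f"Weighted Primary Avg:   {primary_avg:.2f}s")  # 19.75s
print(f"Weighted Primary Tok/s: {tok_s_avg:.2f}")     # 16.14
```

Under this reading the v2 suite dominates the latency figure: its 10 questions contribute more than half of the summed primary time.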
Source Artifacts
- instances/vps-81-17-99-14/telemetry/generated/local-mac-mlx/qwen3_5_4b_mlx/small_eval/small-eval-qwen3_5_4b_mlx-2026-04-16_00-32-03/python_task_suite.json
- instances/vps-81-17-99-14/telemetry/generated/local-mac-mlx/qwen3_5_4b_mlx/python_v1_20q/python20-qwen3_5_4b_mlx-2026-04-16_00-33-36/python_task_suite.json
- instances/vps-81-17-99-14/telemetry/generated/local-mac-mlx/qwen3_5_4b_mlx/python_v2_10q/python10-qwen3_5_4b_mlx-2026-04-16_00-40-57/python_task_suite.json
Suite: small-model-coding-eval-v1-qwen3_5_4b_mlx
| ID | Title | Category | Primary | Tok/s | Markers | Primary Format | Follow-up | Follow-up Format |
|---|---|---|---|---|---|---|---|---|
| disk_guard_bash | Disk Guard Script | shell | 9.257s | 14.91 | 3/4 | no | 7.938s | yes |
| ipv4_python_tests | IPv4 Validator | python | 13.642s | 17.01 | 4/4 | no | 9.742s | yes |
| nginx_safe_reload | Nginx Safe Reload | ops | 3.397s | 10.01 | 2/4 | yes | 5.040s | yes |
| yaml_cli_plan | YAML Validator Plan | planning | 11.230s | 15.58 | 6/6 | yes | 9.971s | yes |
| ssh_lockout_triage | SSH Lockout Triage | debugging | 10.275s | 15.57 | 5/5 | no | 7.797s | yes |
Suite: overnight-python-telemetry-v1-qwen3_5_4b_mlx
| ID | Title | Category | Primary | Tok/s | Markers | Primary Format | Follow-up | Follow-up Format |
|---|---|---|---|---|---|---|---|---|
| py_csv_parse | CSV Parser | parsing | 17.146s | 17.67 | 4/4 | no | 10.890s | yes |
| py_file_scan | File Scanner | file_io | 14.576s | 15.78 | 4/4 | no | 9.443s | yes |
| py_cli_args | CLI Arguments | cli | 9.947s | 16.39 | 4/4 | no | 8.685s | yes |
| py_typing_dataclass | Typed Dataclass | typing | 4.210s | 12.35 | 4/4 | no | 6.436s | yes |
| py_pytest_fixture | Pytest Fixture | tests | 7.001s | 14.71 | 4/4 | no | 7.177s | yes |
| py_async_fetch | Async Fetch | async | 14.672s | 16.77 | 4/4 | no | 9.309s | yes |
| py_http_retry | HTTP Retry | http | 24.083s | 15.61 | 3/4 | no | 12.282s | yes |
| py_json_validate | JSON Validation | validation | 31.132s | 17.25 | 3/4 | no | 13.006s | yes |
| py_sqlite_store | SQLite Store | sqlite | 10.306s | 15.82 | 4/4 | no | 8.584s | yes |
| py_fastapi_handler | FastAPI Handler | web | 4.441s | 10.81 | 4/4 | no | 6.631s | yes |
| py_config_dataclass | Config Dataclass | config | 12.334s | 18.08 | 3/4 | no | 9.577s | yes |
| py_logging_setup | Logging Setup | logging | 15.921s | 18.78 | 3/4 | no | 9.687s | yes |
| py_thread_pool | Thread Pool | concurrency | 13.157s | 18.24 | 3/4 | no | 8.453s | yes |
| py_package_layout | Package Layout | package | 6.159s | 16.89 | 4/4 | no | 7.029s | yes |
| py_debug_stacktrace | Debug Stacktrace | debugging | 3.276s | 12.82 | 3/4 | no | 5.175s | yes |
| py_refactor_split | Refactor Split | refactor | 9.143s | 17.28 | 3/4 | no | 7.895s | yes |
| py_csv_summary | CSV Summary | analysis | 15.536s | 17.25 | 3/4 | no | 10.806s | yes |
| py_pathlib_clean | Pathlib Cleaner | filesystem | 12.375s | 13.66 | 4/4 | no | 9.663s | yes |
| py_pydantic_model | Pydantic Model | validation | 4.239s | 11.09 | 3/4 | no | 8.782s | yes |
| py_regex_log_parser | Regex Log Parser | parsing | 25.890s | 17.19 | 3/4 | no | 10.416s | yes |
Suite: overnight-python-telemetry-v2-qwen3_5_4b_mlx
| ID | Title | Category | Primary | Tok/s | Markers | Primary Format | Follow-up | Follow-up Format |
|---|---|---|---|---|---|---|---|---|
| myboard_auth_redirect_triage | | debugging | 38.507s | 18.18 | 8/8 | no | 16.360s | yes |
| myboard_board_snapshot_regression_test | Board Snapshot Regression Test | tests | 48.141s | 18.70 | 4/7 | no | 19.206s | yes |
| myboard_lane_config_patch_plan | Lane Config Patch Plan | planning | 37.823s | 18.51 | 6/6 | no | 17.117s | yes |
| myboard_api_token_audit_regression_test | API Token Audit Regression Test | tests | 48.104s | 18.71 | 4/7 | no | 18.248s | yes |
| myboard_announcements_state_sync_review | Announcements State Sync Review | review | 38.529s | 18.17 | 6/6 | no | 19.272s | yes |
| myboard_feature_flag_lifecycle_test | Feature Flag Lifecycle Test | tests | 48.196s | 18.67 | 5/7 | no | 17.979s | yes |
| myboard_task_bulk_job_debug_packet | Task Bulk Job Debug Packet | debugging | 31.124s | 17.00 | 6/6 | yes | 15.484s | yes |
| myboard_user_preferences_contract_test | User Preferences Contract Test | tests | 53.665s | 17.70 | 5/7 | no | 18.962s | yes |
| myboard_orchestration_timeline_forensics | Orchestration Timeline Forensics | forensics | 24.918s | 16.17 | 6/6 | yes | 14.320s | yes |
| truthgraph_ingest_log_triage | | cross_repo_debugging | 18.833s | 15.56 | 6/6 | yes | 11.936s | yes |
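
As a rough cross-check of the reported total wall time, the per-suite Primary and Follow-up columns can be summed and compared against 1081.9s. The subtotals below were added up by hand from the three tables. The suite keys are borrowed from the artifact paths, and treating the remainder as harness overhead is an assumption, since the report does not say how wall time was measured.

```python
# Per-suite subtotals in seconds, summed from the Primary and Follow-up columns.
# Keys are borrowed from the artifact directory names for readability only.
primary_s = {"small_eval": 47.801, "python_v1_20q": 255.544, "python_v2_10q": 387.840}
followup_s = {"small_eval": 40.488, "python_v1_20q": 179.926, "python_v2_10q": 168.884}

reported_wall_s = 1081.9  # "Total Wall Time" from the report summary

# Time attributable to the 35 primary and 35 follow-up generations.
generation_s = sum(primary_s.values()) + sum(followup_s.values())

# Whatever the tables do not explain (assumed here to be harness overhead).
unaccounted_s = reported_wall_s - generation_s

print(f"generation: {generation_s:.1f}s, unaccounted: {unaccounted_s:.1f}s")
# generation: 1080.5s, unaccounted: 1.4s
```

On this accounting the two generation passes explain almost all of the wall time, with primary responses taking roughly 64% of it.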