Local Mac MLX Benchmark

Qwen3.5 4B MLX Full Run

This report captures the completed Apple-silicon MLX lane on the 8 GB M1 MacBook Air. The final run disabled Qwen thinking in the chat template so that benchmark answers stayed human-facing and format-compliant instead of spilling chain-of-thought-style text.

Total Questions

35

Total Wall Time

1081.9s

Weighted Primary Avg

19.75s

Weighted Primary Tok/s

16.14

Primary Format Pass

5/35

Follow-up Format Pass

35/35
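The weighted figures above appear to be question-weighted means of the per-suite values reported further down. The sketch below reproduces them under that assumption; the question counts (5, 20, 10) are taken from the number of rows in each suite table, and the `suites` dictionary and its shortened keys are illustrative only, not part of the benchmark tooling.

```python
# Per-suite figures copied from this report:
# (questions, wall time in s, primary avg in s, primary tok/s).
# Question counts are the row counts of each suite table.
suites = {
    "small-model-coding-eval-v1": (5, 88.6, 9.56, 14.62),
    "overnight-python-telemetry-v1": (20, 436.2, 12.78, 15.72),
    "overnight-python-telemetry-v2": (10, 557.1, 38.78, 17.74),
}

total_questions = sum(q for q, _, _, _ in suites.values())
total_wall = sum(wall for _, wall, _, _ in suites.values())

# Question-weighted means: each suite contributes in proportion to its size.
weighted_avg = sum(q * avg for q, _, avg, _ in suites.values()) / total_questions
weighted_tok_s = sum(q * tps for q, _, _, tps in suites.values()) / total_questions

print(total_questions, round(total_wall, 1), round(weighted_avg, 2), round(weighted_tok_s, 2))
```

Under this assumption the rounded results agree with the summary values (1081.9s, 19.75s, 16.14). Because the inputs are the already-rounded per-suite averages, a recomputation from raw per-question timings could differ in the last digit.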

Source Artifacts

instances/vps-81-17-99-14/telemetry/generated/local-mac-mlx/qwen3_5_4b_mlx/small_eval/small-eval-qwen3_5_4b_mlx-2026-04-16_00-32-03/python_task_suite.json
instances/vps-81-17-99-14/telemetry/generated/local-mac-mlx/qwen3_5_4b_mlx/python_v1_20q/python20-qwen3_5_4b_mlx-2026-04-16_00-33-36/python_task_suite.json
instances/vps-81-17-99-14/telemetry/generated/local-mac-mlx/qwen3_5_4b_mlx/python_v2_10q/python10-qwen3_5_4b_mlx-2026-04-16_00-40-57/python_task_suite.json

Suite

small-model-coding-eval-v1-qwen3_5_4b_mlx

Questions

5

Wall Time

88.6s

Primary Avg

9.56s

Primary Tok/s

14.62

Primary Format

2/5

Marker Hits

20/23

ID	Title	Category	Primary Time	Tok/s	Markers	Primary Format	Follow-up Time	Follow-up Format
`disk_guard_bash`	Disk Guard Script	shell	9.257s	14.91	3/4	no	7.938s	yes
`ipv4_python_tests`	IPv4 Validator	python	13.642s	17.01	4/4	no	9.742s	yes
`nginx_safe_reload`	Nginx Safe Reload	ops	3.397s	10.01	2/4	yes	5.040s	yes
`yaml_cli_plan`	YAML Validator Plan	planning	11.230s	15.58	6/6	yes	9.971s	yes
`ssh_lockout_triage`	SSH Lockout Triage	debugging	10.275s	15.57	5/5	no	7.797s	yes
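As a cross-check, the suite-level summary can be recomputed from the rows of this table. The following is a minimal sketch that assumes the rows are tab-separated exactly as shown, with the backticks around IDs dropped; the field names in `FIELDS` are hypothetical labels chosen for this example.

```python
import csv
import io

# Hypothetical field names, one per table column.
FIELDS = ["id", "title", "category", "primary_s", "tok_s",
          "markers", "primary_format", "followup_s", "followup_format"]

# Data rows copied from the table above.
ROWS = (
    "disk_guard_bash\tDisk Guard Script\tshell\t9.257s\t14.91\t3/4\tno\t7.938s\tyes\n"
    "ipv4_python_tests\tIPv4 Validator\tpython\t13.642s\t17.01\t4/4\tno\t9.742s\tyes\n"
    "nginx_safe_reload\tNginx Safe Reload\tops\t3.397s\t10.01\t2/4\tyes\t5.040s\tyes\n"
    "yaml_cli_plan\tYAML Validator Plan\tplanning\t11.230s\t15.58\t6/6\tyes\t9.971s\tyes\n"
    "ssh_lockout_triage\tSSH Lockout Triage\tdebugging\t10.275s\t15.57\t5/5\tno\t7.797s\tyes\n"
)

rows = list(csv.DictReader(io.StringIO(ROWS), fieldnames=FIELDS, delimiter="\t"))

# Mean primary latency: strip the trailing "s" unit before converting.
primary_avg = sum(float(r["primary_s"].rstrip("s")) for r in rows) / len(rows)

# Marker hits are "hit/possible" fractions; sum both sides separately.
hits = sum(int(r["markers"].split("/")[0]) for r in rows)
possible = sum(int(r["markers"].split("/")[1]) for r in rows)

primary_pass = sum(r["primary_format"] == "yes" for r in rows)
followup_pass = sum(r["followup_format"] == "yes" for r in rows)

print(f"{primary_avg:.2f}s {hits}/{possible} {primary_pass}/{len(rows)} {followup_pass}/{len(rows)}")
```

For these five rows the recomputed values line up with the suite summary (9.56s primary average, 20/23 marker hits, 2/5 primary format passes). The same approach should carry over to the other two suite tables.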

Suite

overnight-python-telemetry-v1-qwen3_5_4b_mlx

Questions

20

Wall Time

436.2s

Primary Avg

12.78s

Primary Tok/s

15.72

Primary Format

0/20

Marker Hits

70/80

ID	Title	Category	Primary Time	Tok/s	Markers	Primary Format	Follow-up Time	Follow-up Format
`py_csv_parse`	CSV Parser	parsing	17.146s	17.67	4/4	no	10.890s	yes
`py_file_scan`	File Scanner	file_io	14.576s	15.78	4/4	no	9.443s	yes
`py_cli_args`	CLI Arguments	cli	9.947s	16.39	4/4	no	8.685s	yes
`py_typing_dataclass`	Typed Dataclass	typing	4.210s	12.35	4/4	no	6.436s	yes
`py_pytest_fixture`	Pytest Fixture	tests	7.001s	14.71	4/4	no	7.177s	yes
`py_async_fetch`	Async Fetch	async	14.672s	16.77	4/4	no	9.309s	yes
`py_http_retry`	HTTP Retry	http	24.083s	15.61	3/4	no	12.282s	yes
`py_json_validate`	JSON Validation	validation	31.132s	17.25	3/4	no	13.006s	yes
`py_sqlite_store`	SQLite Store	sqlite	10.306s	15.82	4/4	no	8.584s	yes
`py_fastapi_handler`	FastAPI Handler	web	4.441s	10.81	4/4	no	6.631s	yes
`py_config_dataclass`	Config Dataclass	config	12.334s	18.08	3/4	no	9.577s	yes
`py_logging_setup`	Logging Setup	logging	15.921s	18.78	3/4	no	9.687s	yes
`py_thread_pool`	Thread Pool	concurrency	13.157s	18.24	3/4	no	8.453s	yes
`py_package_layout`	Package Layout	package	6.159s	16.89	4/4	no	7.029s	yes
`py_debug_stacktrace`	Debug Stacktrace	debugging	3.276s	12.82	3/4	no	5.175s	yes
`py_refactor_split`	Refactor Split	refactor	9.143s	17.28	3/4	no	7.895s	yes
`py_csv_summary`	CSV Summary	analysis	15.536s	17.25	3/4	no	10.806s	yes
`py_pathlib_clean`	Pathlib Cleaner	filesystem	12.375s	13.66	4/4	no	9.663s	yes
`py_pydantic_model`	Pydantic Model	validation	4.239s	11.09	3/4	no	8.782s	yes
`py_regex_log_parser`	Regex Log Parser	parsing	25.890s	17.19	3/4	no	10.416s	yes

Suite

overnight-python-telemetry-v2-qwen3_5_4b_mlx

Questions

10

Wall Time

557.1s

Primary Avg

38.78s

Primary Tok/s

17.74

Primary Format

3/10

Marker Hits

56/66

ID	Title	Category	Primary Time	Tok/s	Markers	Primary Format	Follow-up Time	Follow-up Format
`myboard_auth_redirect_triage`	Auth Redirect Triage	debugging	38.507s	18.18	8/8	no	16.360s	yes
`myboard_board_snapshot_regression_test`	Board Snapshot Regression Test	tests	48.141s	18.70	4/7	no	19.206s	yes
`myboard_lane_config_patch_plan`	Lane Config Patch Plan	planning	37.823s	18.51	6/6	no	17.117s	yes
`myboard_api_token_audit_regression_test`	API Token Audit Regression Test	tests	48.104s	18.71	4/7	no	18.248s	yes
`myboard_announcements_state_sync_review`	Announcements State Sync Review	review	38.529s	18.17	6/6	no	19.272s	yes
`myboard_feature_flag_lifecycle_test`	Feature Flag Lifecycle Test	tests	48.196s	18.67	5/7	no	17.979s	yes
`myboard_task_bulk_job_debug_packet`	Task Bulk Job Debug Packet	debugging	31.124s	17.00	6/6	yes	15.484s	yes
`myboard_user_preferences_contract_test`	User Preferences Contract Test	tests	53.665s	17.70	5/7	no	18.962s	yes
`myboard_orchestration_timeline_forensics`	Orchestration Timeline Forensics	forensics	24.918s	16.17	6/6	yes	14.320s	yes
`truthgraph_ingest_log_triage`	Ingest Log Triage	cross_repo_debugging	18.833s	15.56	6/6	yes	11.936s	yes