--bg-canvas
oklch(98.5% 0.005 90)
#fbfaf7
15.59:1 vs --fg-default (AAA)
Spec reference: §2.3
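The AAA figure is reproducible from the WCAG 2.x relative-luminance formula. A minimal sketch follows; since this excerpt does not list the --fg-default value, #202020 stands in as a hypothetical foreground:

```python
# WCAG 2.x contrast check for the token above. --fg-default is not shown
# in this excerpt, so #202020 is a hypothetical stand-in foreground.
def linearize(c8: int) -> float:
    c = c8 / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def luminance(hex_color: str) -> float:
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)

def contrast(fg: str, bg: str) -> float:
    lo, hi = sorted((luminance(fg), luminance(bg)))
    return (hi + 0.05) / (lo + 0.05)

print(f"{contrast('#202020', '#fbfaf7'):.2f}:1")  # ~15.6:1 with the stand-in fg
```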
Section 1
Section 2
The memory bandwidth ceiling on Apple Silicon's unified memory architecture is real but rarely the binding constraint for inference workloads at typical context lengths.
Body italic carries emphasis through shape, while body bold is reserved for moments that need stronger editorial pressure.
Footnote-size paragraph: Tiempos Text Web, 15px/1.5, §2.4
Prose link sample: high-contrast text with an accent underline. Visited prose link sample: muted text with a muted underline.
Section 3
Section 4
Section 5
The tempting story is that unified memory bandwidth explains every local inference result on Apple
Silicon. It explains enough to be dangerous. A 128-bit LPDDR interface can make a quantized 8B model
look cleanly bandwidth-bound when the batch is small, the prompt is already resident, and the runtime
spends most of its time streaming weights. But that framing loses resolution as soon as context grows,
because KV-cache traffic, scheduler overhead, and prompt ingestion begin to compete with the neat
weights-per-token arithmetic. In practice, Q4_K_S quantization often moves the bottleneck from raw
bandwidth to a mix of cache locality and kernel-dispatch overhead. The result is not that bandwidth does not
matter; it is that bandwidth is the floor, not the whole building. A useful benchmark therefore reports
first-token latency, steady-state throughput, and tail behavior under repeated runs, with
artifact hashes attached to the exact configuration. See the
inference methodology for the signed-run contract that keeps
these measurements inspectable after the article has aged.
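The "floor, not the whole building" claim is easy to quantify. A minimal back-of-envelope sketch, where the weight size, bandwidth, and KV footprint are illustrative assumptions rather than figures from this article:

```python
# Back-of-envelope decode ceiling at batch size 1: each generated token
# streams roughly the full weight file, so throughput is capped at
# bandwidth / bytes_per_token. All numbers are illustrative assumptions.
WEIGHT_BYTES = 4.6e9          # assumed Q4_K_S weight file for an 8B model
BANDWIDTH = 500e9             # assumed usable memory bandwidth, bytes/s
KV_BYTES_PER_TOKEN = 131_072  # assumed fp16 KV footprint per cached token

print(f"weight-only ceiling: {BANDWIDTH / WEIGHT_BYTES:.0f} tok/s")

# Each decode step also reads the whole KV cache, so per-token traffic
# grows with context length and the neat ceiling loses resolution.
for context in (2_048, 32_768, 131_072):
    per_token = WEIGHT_BYTES + KV_BYTES_PER_TOKEN * context
    print(f"context {context:>7,}: {BANDWIDTH / per_token:.0f} tok/s")
```

Under these assumptions the weight-only ceiling sits near 109 tok/s, but KV-cache traffic erodes it steadily as context grows, which is the paragraph's point in numbers.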
Section 6
```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    # Throughput is decoded tokens divided by wall-clock seconds.
    return tokens / seconds

result = tokens_per_second(tokens=4096, seconds=46.9)
print(f"{result:.1f} tok/s")
```

```console
$ python bench/inference.py --model llama-3.1-8b-q4
Loading model: llama-3.1-8b-instruct-Q4_K_S.gguf
Prompt tokens: 2048
Throughput: 87.3 tok/s
```

```json
{
  "run_id": "2026-05-14-1830-llama",
  "model": "llama-3.1-8b-instruct-Q4_K_S",
  "median_tokens_per_second": 87.3,
  "artifact_sha256": "abc123f09d4a"
}
```

```python
def bandwidth_bound(bytes_per_token: float, tokens_per_second: float) -> float:
    # Bytes per second implied by per-token traffic at a given throughput.
    return bytes_per_token * tokens_per_second

peak_a = bandwidth_bound(5_200_000, 91.2)  # at 91.2 tok/s
peak_b = bandwidth_bound(5_200_000, 87.3)  # at the 87.3 tok/s median above
print(peak_b)
```
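The samples above report single runs. Since the methodology asks for median throughput and tail behavior over repeated runs, a minimal aggregation sketch follows; the run values are invented for illustration:

```python
import statistics

# Hypothetical repeated runs: (throughput tok/s, first-token latency ms).
runs = [(88.1, 139), (87.3, 141), (86.9, 150), (87.6, 143), (85.2, 171)]

throughputs = [tps for tps, _ in runs]
latencies = [ms for _, ms in runs]

median_tps = statistics.median(throughputs)
# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
# The inclusive method interpolates within the observed range.
p95_ms = statistics.quantiles(latencies, n=20, method="inclusive")[18]

print(f"median {median_tps:.1f} tok/s, p95 latency {p95_ms:.0f} ms")
```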
Section 7
Why peak bandwidth matters for Llama inference, and where it stops explaining the result.
Reviews
Benchmarks
Methodology
The benchmark table below is tied to Signed Run · 2026-05-14-llama so the numbers remain
auditable after runtime versions move on.
A benchmark without its artifact is a claim; a benchmark with its artifact is an invitation to inspect the claim.
Silicon Logic methodology notes
| System | Model | Median tok/s | P95 latency |
|---|---|---|---|
| M4 Max | Llama 3.1 8B Q4_K_S | 87.3 | 141 ms |
| M3 Max | Llama 3.1 8B Q4_K_S | 74.8 | 158 ms |
| M2 Ultra | Mistral 7B Q4_K_M | 92.5 | 132 ms |
| RTX 4090 | Llama 3.1 8B Q4_K_S | 154.2 | 96 ms |
| Framework Laptop | Phi-3 Mini Q4 | 31.7 | 244 ms |
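To make "auditable" concrete, a reader can recompute the artifact digest and compare it against the signed run manifest. A minimal sketch, assuming the manifest shape shown in Section 6 and a local copy of the run artifact (both file paths here are hypothetical):

```python
import hashlib
import json
from pathlib import Path

# Recompute the artifact digest and compare it to the signed manifest.
# File paths are hypothetical; the manifest shape follows Section 6.
manifest = json.loads(Path("runs/2026-05-14-1830-llama.json").read_text())

digest = hashlib.sha256(Path("runs/artifact.bin").read_bytes()).hexdigest()

# The specimen manifest stores a truncated digest, so compare by prefix.
ok = digest.startswith(manifest["artifact_sha256"])
print("verified" if ok else "hash mismatch")
```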
Section 8
Silicon Logic