5 posts tagged #local-llm.
You set -ngl 99 and llama.cpp still runs on your CPU. The flag is fine. Here is the 30-second load-log diagnostic and the five real causes, ranked.
Tom Tunguz called it localmaxxing. I run a 3070 + 5070 Ti + 5090 in one box and serve Llama 3.1 8B locally every day. Here are the real tokens-per-second, the real watts, and the real cost per million tokens.
Blackwell rental hit $4.08/hr. CoreWeave raised prices 20%. Anthropic restricted its newest model to 40 orgs. Meanwhile, consumer GPUs are sitting idle.
Anthropic shipped a pattern where a cheap model runs the loop and escalates to Opus only when it needs to. The pattern works on any two-model setup. Here is the math and the playbook.
Set --n-gpu-layers wrong and you either tank your tokens/sec or crash with an OOM. Here's exactly what to use (-1, 0, or a number), the VRAM-per-layer math, and benchmarks from the 4060 through the 4090.