[bmdpat]

Local LLM Toolkit

Pick the local model that fits your GPU.

Tell it your GPU, workload, and tradeoff. It ranks 18 local models across 6 workloads and 3 priorities, then prefills VRAM checks before you download a 350GB+ weight file.

Picker

Use case
What matters most?

5/5 free runs left today

Current pick

Qwen 2.5 72B

strong reasoning and data work when 70B fits your setup.

Quant

Q5_K_M

Speed

6.8-13 tok/s

Ranked recommendations

6 local models for this GPU

4K context
#4

Qwen 2.5 Coder 32B

32B dense

strongest code model that still fits many prosumer GPUs.

Partial offload

Quant

Q8_0

Speed

9.6-18.5

GPU layers

46/64

Score

83

#6

Mistral Small 3.1 24B

24B dense

good all-around quality without 70B memory pressure.

Partial offload

Quant

Q8_0

Speed

13.1-25.2

GPU layers

39/40

Score

82

Get the local-LLM cost digest

New GGUF quant benchmarks, VRAM math, and what actually runs on consumer GPUs. M-F, only when there is something worth sending.

Single opt-in for the local-LLM newsletter. Unsubscribe anytime. Privacy.

Default context

4K tokens

Scope

18 models / 6 workloads / 3 priorities

Recent usage

0 tracked runs / 30d

FAQ

Which local LLM should I run on a 24GB GPU?
On RTX 4090 24GB, the current Code Generation + Quality profile ranks Qwen 2.5 Coder 32B (32B dense, Q8_0, Partial offload) first. The largest current full-GPU recommendation in that profile is Codestral 22B (22B dense, Q8_0, Full GPU). The nearest 70B-class recommendation is Llama 3.3 70B (70B dense, Q5_K_M, Partial offload), so treat that as a quality/offload pick instead of the default for fast loops.
What is the best quantization for local LLMs?
The picker does not use one universal best quantization. On the same RTX 4090 24GB reference setup, Code Generation + Quality currently chooses Q8_0 for Qwen 2.5 Coder 32B, Chat/Assistant + Speed chooses IQ3_XXS for Phi-4-mini, and Chat/Assistant + VRAM Efficiency chooses IQ4_XS for Phi-4-mini. Those claims come from the ranked catalog plus the VRAM fit estimate at 4,096 tokens.
Should I pick the fastest model or the highest quality model?
Use Speed when tokens per second matters; the current reference winner is Phi-4-mini (3.8B dense, IQ3_XXS, Full GPU) at 60-115.4 tok/s. Use Quality when model strength matters more; the current reference winner is Qwen 2.5 Coder 32B (32B dense, Q8_0, Partial offload) at 9.6-18.5 tok/s. Use VRAM Efficiency when keeping the model on one GPU matters; the current reference winner is Phi-4-mini (3.8B dense, IQ4_XS, Full GPU) at 55.3-106.3 tok/s.

§ 002 / PRICING

Unlimited local LLM decisions with Pro.

The toolkit is free for up to 5 free runs per tool per day. Upgrade to Pro to remove the limit and keep your rig history in one place.

Free

$0

  • 5 free runs per tool per day
  • Standard GPU presets

Pro

$7/mo

  • Unlimited calculator runs
  • Save my rig and get new-fit alerts
  • Import custom models from Hugging Face URLs
  • Benchmark history across model and quant choices
  • Early access to new toolkit surfaces
  • No ads
Go Pro

Or $49/year

Get the local AI lab notes

Benchmark rows, VRAM fit checks, quant choices, and what actually runs on consumer GPUs. M-F, only when there is something worth sending.