Question 1

Which local LLM should I run on a 24GB GPU?

Accepted Answer

On an RTX 4090 (24GB), the current Code Generation + Quality profile ranks Qwen 2.5 Coder 32B (32B dense, Q8_0, Partial offload) first. The largest model that profile currently recommends for full-GPU use is Codestral 22B (22B dense, Q8_0, Full GPU). The nearest 70B-class recommendation is Llama 3.3 70B (70B dense, Q5_K_M, Partial offload); treat it as a quality pick that relies on offload, not as the default for fast iteration loops.
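The Full GPU versus Partial offload labels above can be sanity-checked with a rough weights-only estimate. This is a minimal sketch, not the picker's actual method: the bits-per-weight figures are approximate assumptions, the function names are hypothetical, and the check ignores KV cache and runtime overhead, so a model that only just fits here may still need offload in practice.

```python
# Approximate bits per weight per quantization (assumed values, not from the picker).
APPROX_BPW = {"Q8_0": 8.5, "Q5_K_M": 5.5}

GIB = 2**30


def weight_gib(params_billion: float, quant: str) -> float:
    """Estimated size of the quantized weights in GiB."""
    return params_billion * 1e9 * APPROX_BPW[quant] / 8 / GIB


def placement(params_billion: float, quant: str, vram_gib: float = 24.0) -> str:
    """Weights-only check; ignores KV cache and runtime overhead."""
    fits = weight_gib(params_billion, quant) <= vram_gib
    return "Full GPU" if fits else "Partial offload"


for name, size, quant in [
    ("Qwen 2.5 Coder 32B", 32, "Q8_0"),
    ("Codestral 22B", 22, "Q8_0"),
    ("Llama 3.3 70B", 70, "Q5_K_M"),
]:
    print(f"{name}: ~{weight_gib(size, quant):.1f} GiB -> {placement(size, quant)}")
```

Under these assumptions only the 22B model's weights fit inside 24 GiB, which is consistent with the offload labels quoted in the answer.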

Question 2

What is the best quantization for local LLMs?

Accepted Answer

The picker does not apply one universally best quantization. On the same RTX 4090 24GB reference setup, Code Generation + Quality currently chooses Q8_0 for Qwen 2.5 Coder 32B, Chat/Assistant + Speed chooses IQ3_XXS for Phi-4-mini, and Chat/Assistant + VRAM Efficiency chooses IQ4_XS for Phi-4-mini. These picks come from the ranked catalog combined with the VRAM fit estimate at 4,096 tokens.
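A fit estimate at a fixed token count usually has to budget for the KV cache on top of the weights. The sketch below uses the standard KV-cache size formula for a transformer. It is not the picker's own calculation, and the layer count, KV-head count, and head dimension in the example are illustrative placeholders, not confirmed specs for any model named here.

```python
GIB = 2**30


def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumes a 16-bit KV cache
) -> float:
    """KV cache size in GiB: one K and one V tensor per layer, per token."""
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / GIB


# Illustrative architecture values (assumptions), evaluated at 4,096 tokens.
print(f"{kv_cache_gib(n_layers=64, n_kv_heads=8, head_dim=128, context_tokens=4096):.2f} GiB")
```

Because the cache grows linearly with context length, a quantization that fits at 4,096 tokens may stop fitting at a longer context, which is one reason no single quantization is best everywhere.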

Question 3

Should I pick the fastest model or the highest quality model?

Accepted Answer

Use Speed when tokens per second matters; the current reference winner is Phi-4-mini (3.8B dense, IQ3_XXS, Full GPU) at 60-115.4 tok/s. Use Quality when model strength matters more; the current reference winner is Qwen 2.5 Coder 32B (32B dense, Q8_0, Partial offload) at 9.6-18.5 tok/s. Use VRAM Efficiency when keeping the model on one GPU matters; the current reference winner is Phi-4-mini (3.8B dense, IQ4_XS, Full GPU) at 55.3-106.3 tok/s.
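To make the trade-off concrete, the quoted tok/s ranges can be turned into rough wait times for a single response. This is a simple arithmetic sketch: the 500-token response length is an arbitrary assumption, the function name is hypothetical, and prompt processing time is ignored.

```python
def seconds_for(tokens: int, tok_s_low: float, tok_s_high: float) -> tuple[float, float]:
    """Return (best case, worst case) generation time in seconds."""
    return tokens / tok_s_high, tokens / tok_s_low


# tok/s ranges quoted in the answer above for the RTX 4090 24GB reference setup.
REFERENCE = {
    "Speed": (60.0, 115.4),
    "Quality": (9.6, 18.5),
    "VRAM Efficiency": (55.3, 106.3),
}

for profile, (low, high) in REFERENCE.items():
    best, worst = seconds_for(500, low, high)
    print(f"{profile}: {best:.1f}-{worst:.1f} s for 500 tokens")
```

At these rates a 500-token reply takes several seconds under the Speed and VRAM Efficiency picks and roughly half a minute or more under the Quality pick, which matters most for interactive loops.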

Pick the local model that fits your GPU.

Picker

6 local models for this GPU

Qwen 2.5 72B

Llama 3.3 70B

Phi-4

Qwen 2.5 Coder 32B

Gemma 3 12B

Mistral Small 3.1 24B
