[bmdpat]

Local LLM Toolkit

Pick the local model that fits your GPU.

Tell it your GPU, workload, and tradeoff. Get a ranked list of models, quants, speed estimates, and prefilled VRAM checks before you download a 100GB weight file.

Current pick

Qwen 2.5 72B
Strong reasoning and data work when a 70B-class model fits your setup.

Quant: Q5_K_M
Speed: 6.8-13 tok/s

Ranked recommendations

6 local models for this GPU (4K context)

#4 Qwen 2.5 Coder 32B (32B dense)
Strongest code model that still fits many prosumer GPUs.
Fit: partial offload
Quant: Q8_0
Speed: 9.6-18.5 tok/s
GPU layers: 46/64
Score: 83

#6 Mistral Small 3.1 24B (24B dense)
Good all-around quality without 70B memory pressure.
Fit: partial offload
Quant: Q8_0
Speed: 13.1-25.2 tok/s
GPU layers: 39/40
Score: 82

Default context: 4K tokens
Best use: model selection

FAQ

Which local LLM should I run on a 24GB GPU?
For most 24GB cards, start with a 7B to 32B model at Q4_K_M or Q5_K_M. A 70B model can run with partial CPU offload, but smaller models usually feel faster and more dependable.
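
As a rough illustration of the fit check behind that answer, the Python sketch below estimates quantized weight size from parameter count and bits per weight. The bits-per-weight figures, the 2 GB allowance for KV cache and runtime buffers, and the function names are assumptions made for illustration, not values taken from this toolkit.

```python
# Approximate bits per weight for a few GGUF quant types.
# These are rounded, assumed values; real files vary by model architecture.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def weight_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits_in_vram(params_billion: float, quant: str,
                 vram_gb: float = 24.0, overhead_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed fixed overhead fit on the GPU."""
    return weight_gb(params_billion, quant) + overhead_gb <= vram_gb


def gpu_layer_share(params_billion: float, quant: str,
                    vram_gb: float = 24.0, overhead_gb: float = 2.0) -> float:
    """Fraction of the weights that fit on the GPU, assuming equal-sized layers."""
    budget = max(0.0, vram_gb - overhead_gb)
    return min(1.0, budget / weight_gb(params_billion, quant))


if __name__ == "__main__":
    for size in (7, 32, 70):
        print(size, "B at Q4_K_M:",
              round(weight_gb(size, "Q4_K_M"), 1), "GB,",
              "fits" if fits_in_vram(size, "Q4_K_M") else "needs offload")
```

Under these assumptions a 32B model at Q4_K_M comes to about 19 GB and fits on a 24GB card, while a 70B model at Q4_K_M is about 42 GB, so only about half of its weights would sit on the GPU and the rest would be offloaded to the CPU.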
What is the best quantization for local LLMs?
Q4_K_M is the default balance for most local runs. Q5_K_M or Q6_K improve quality when you have spare VRAM. IQ4_XS and IQ3_XXS are useful when fitting the model matters more than quality.
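
One way to turn that guidance into a rule is to walk the quant types from highest to lowest quality and take the first one whose weights fit the VRAM budget. The sketch below does that; the preference order, the bits-per-weight values, and the `pick_quant` helper are hypothetical and chosen for illustration only.

```python
from typing import Optional

# Quant types ordered from highest to lowest quality, with assumed
# approximate bits per weight. Real GGUF files differ slightly per model.
QUANTS_BY_QUALITY = [
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8),
    ("IQ4_XS", 4.3),
    ("IQ3_XXS", 3.1),
]


def pick_quant(params_billion: float, vram_gb: float,
               overhead_gb: float = 2.0) -> Optional[str]:
    """Return the highest-quality quant whose weights fit, or None if none do."""
    budget_gb = vram_gb - overhead_gb
    for name, bits in QUANTS_BY_QUALITY:
        size_gb = params_billion * bits / 8  # billions of params * bytes per weight
        if size_gb <= budget_gb:
            return name
    return None


if __name__ == "__main__":
    for size in (14, 32, 40, 70):
        print(f"{size}B on 24 GB:", pick_quant(size, 24.0))
```

With these numbers a 14B model has room for Q6_K on a 24GB card, a 32B model lands on Q4_K_M, and a 70B model does not fit fully on the GPU at any listed quant, which is where partial offload comes in.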
Should I pick the fastest model or the highest quality model?
Pick speed for chat loops, coding autocomplete, and experiments. Pick quality for reasoning, writing, and data work. Pick VRAM efficiency when you need the whole model on one GPU.
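
The tradeoff can be pictured as a weighted score in which the chosen priority gets the largest weight. The sketch below is a minimal example of that idea; the weights, the 0-to-1 ratings, and the candidate names are invented for illustration and are not how this toolkit computes its scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    speed: float            # 0..1, higher is faster
    quality: float          # 0..1, higher is better output
    vram_efficiency: float  # 0..1, higher leaves more free VRAM


# Assumed weights per priority: (speed, quality, vram_efficiency).
WEIGHTS = {
    "speed": (0.6, 0.2, 0.2),
    "quality": (0.2, 0.6, 0.2),
    "vram": (0.2, 0.2, 0.6),
}


def score(c: Candidate, priority: str) -> float:
    """Weighted sum of the three ratings for the chosen priority."""
    w_speed, w_quality, w_vram = WEIGHTS[priority]
    return (w_speed * c.speed
            + w_quality * c.quality
            + w_vram * c.vram_efficiency)


def rank(candidates: list, priority: str) -> list:
    """Candidates sorted best-first for the chosen priority."""
    return sorted(candidates, key=lambda c: score(c, priority), reverse=True)


# Hypothetical ratings, not measurements.
candidates = [
    Candidate("small-7b", speed=0.9, quality=0.5, vram_efficiency=0.9),
    Candidate("mid-32b", speed=0.5, quality=0.8, vram_efficiency=0.5),
    Candidate("large-70b", speed=0.2, quality=0.95, vram_efficiency=0.1),
]

if __name__ == "__main__":
    for priority in WEIGHTS:
        print(priority, "->", [c.name for c in rank(candidates, priority)])
```

With these made-up ratings the small model ranks first when speed or VRAM efficiency is the priority, and the mid-sized model ranks first when quality is the priority, because the largest model's quality edge does not offset its speed and memory cost.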