Writing

Blog

AI agents, runtime safety, local LLMs, and what it looks like to run a one-person AI-operated holding company in public.

Browse the notebook

Start with a topic, or open the full archive.

Full archive (149) ->

Jul 14, 2026 · 6 min read

My 8B Model Failed a 400-Word Task

Three Llama 3.1 8B runs missed a 400-word floor. Here is the verifier-driven route that moved long-form synthesis to Gemma 4 26B.

Aug 2, 2026 · 5 min read
My 5090 benchmark was missing the field I needed most
A fresh Qwen3.5 9B run showed 84.94 tok/s, but the useful number was the 6,105 ms load phase. I added phase timings and capture time to the benchmark receipt.
#local-llm #llm-benchmarks #hardware #ollama
Aug 1, 2026 · 6 min read
Search Old Results Before Publishing an LLM Test
An independent QA pass caught my second post about the same Ollama batch sweep. Here is the duplicate check I now run before publishing an LLM result.
#local-llm #model-testing #ollama #hardware
Jul 31, 2026 · 6 min read
Build Local LLM Eval Data From Real Failures
I show how I turn failed local coding runs into replayable eval rows with the prompt, model output, tests, route, and verifier result intact.
#local-llm #ai-agents #owned-hardware
Jul 30, 2026 · 6 min read
How I Keep LLM Results Valid After a Driver Update
A new GPU snapshot does not refresh an old benchmark. I show how I bind driver, runtime, workload, and timestamps to each local LLM result.
#local-llm #llm-benchmarks #hardware #model-testing
Jul 29, 2026 · 6 min read
Why I Benchmark Local LLM Input and Output Separately
My RTX 5090 runs show why model loading, prompt ingestion, token generation, and task checks need separate measurements before production use.
#local-llm #hardware #ollama #model-testing
Jul 28, 2026 · 6 min read
How I Budget VRAM for Shared Local AI Workloads
My RTX 5090 measurements show how I budget VRAM for a local model, a real workload, and the GPU processes that must stay resident beside it.
#local-llm #hardware #ollama
Jul 27, 2026 · 5 min read
Why I Test 3 Workloads Before Sizing a Local LLM
One local LLM speed number hides the work behind it. My RTX 5090 sweep shows why short generation, long context, and code need separate rates.
#local-llm #hardware #ollama #5090-reports
Jul 25, 2026 · 6 min read
Local open-model agents just became a product category
LM Studio shipped Bionic, a full agent built on open models with local code projects, voice, and document work. The interesting part is not the app. It is what.
#local-llm #open-weights #ai-agents
Jul 25, 2026 · 6 min read
Incident response needs a local model you already trust
Hugging Face ran its breach forensics on an open-weight model on its own hardware because hosted APIs refused the requests. Here is the lesson for builders.
#ai-security #local-llm #open-weights
Jul 24, 2026 · 6 min read
Your local LLM benchmark is probably lying to you
A local model pass rate can be true and useless at the same time. Here are the three ways local LLM benchmarks mislead you, drawn from real rows on my RTX 5090.
#local-llm #llm-benchmarks #ai-agents #hardware
Jul 22, 2026 · 5 min read
Test Retrieval Before Your Local LLM Writes
I make my local LLM show its source plan before it writes. The preview gate catches weak retrieval while the fix is still cheap and easy to inspect.
#local-llm #hardware #5090-reports
Jul 21, 2026 · 5 min read
Why I Did Not Promote My Smaller Local Model
My smaller local model existed and ran, but it did not beat the baseline. Here is the promotion gate I use before changing a working local AI route.
#local-llm #hardware #5090-reports

The AI agent build notes

Real costs, real tools, no fluff. M-F when I ship, publish, or learn something worth sending.