Last year I watched a team spend four months shipping an on-device assistant that could hear, see, and respond in text.
Last month I watched a demo where a 3B model, quantized to INT4, ran flawlessly on a Pixel 8. Three-minute conversation, snappy responses.
Most conversations about running models locally end at weight quantization. Q4_K_M or Q8_0.
Everybody argues about which runtime to use on edge devices.
Everybody benchmarked Gemma 4 when it dropped April 2. The Codeforces jump from 110 to 2150 got the headlines.
Nine days ago Moonshot AI dropped Kimi K2.6 — a trillion-parameter MoE model that beats Claude Opus 4.
Last month Google Research presented a paper at ICLR 2026 that deserves way more developer attention than it got.
Everyone argues about runtimes. ExecuTorch versus LiteRT-LM versus llama.
Everyone's obsessed with model quality right now — Muse Spark benchmarks, GPT-5.4 reasoning scores, who tops the leaderboard this week.
Ask anyone how their on-device model performs and you'll get tokens per second. Maybe time to first token.
Researchers at Peking University and Infinigence-AI just dropped a result that should reframe how we think about on-device language models. A Qwen 2.
You spend hours picking between Q4_K_M and Q5_K_S, shaving a few hundred megabytes off your model file.
Everyone obsesses over model weight quantization — Q4_K_M this, GPTQ that — while the actual memory hog during inference quietly eats your VRAM alive.
Four billion parameters, two gigabytes of RAM.
A HuggingFace user named Jackrong quietly uploaded a set of models last week that deserve way more attention than they're getting. The pitch: take Claude 4.
The open-weight leaderboard has a new king, and you probably can't afford to host it.