Posts tagged with quantization

Edge Deployed · May 20 · 4 min read

Your Edge Pipeline Had Three Models. Gemma 4 E2B Is One.

Last year I watched a team spend four months shipping an on-device assistant that could hear, see, and respond in text.

gemma-4 edge-deployment multimodal

Edge Deployed · May 11 · 4 min read

You Shrunk the Model to 2 GB. The KV Cache Grew to 3.

Last month I watched a demo where a 3B model, quantized to INT4, ran flawlessly on a Pixel 8. Three-minute conversation, snappy responses.

kv-cache quantization on-device-inference

Open Weight Weekly · May 9 · 4 min read

Stop Quantizing the Wrong Thing

Most conversations about running models locally end at weight quantization. Q4_K_M or Q8_0.

turboquant kv-cache quantization

Edge Deployed · May 4 · 5 min read

GPTQ, AWQ, or AutoRound: Pick Wrong and Your Edge Model Ships Late

Everybody argues about which runtime to use on edge devices.

quantization int4 autoround

Open Weight Weekly · May 4 · 4 min read

Gemma 4's Tool Calling Runs on 5GB of RAM. That's the Whole Point.

Everybody benchmarked Gemma 4 when it dropped April 2. The Codeforces jump from 110 to 2150 got the headlines.

gemma-4 google tool-calling

Open Weight Weekly · Apr 29 · 4 min read

A Trillion Parameters Under Your Desk: Kimi K2.6 Goes GGUF

Nine days ago Moonshot AI dropped Kimi K2.6 — a trillion-parameter MoE model that beats Claude Opus 4.

kimi-k2.6 gguf quantization

Neural Dispatch · Apr 23 · 4 min read

Google's TurboQuant Squeezes 6x More Context Into Your Existing GPU

Last month Google Research unveiled a paper at ICLR 2026 that deserves way more developer attention than it got.

turboquant google-research kv-cache

Edge Deployed · Apr 22 · 5 min read

The Hard Part of Edge AI Was Never the Runtime

Everyone argues about runtimes. ExecuTorch versus LiteRT-LM versus llama.

olive foundry-local onnx-runtime

Neural Dispatch · Apr 11 · 4 min read

TurboQuant Shrinks the KV Cache 5x Without Touching Model Weights

Everyone's obsessed with model quality right now — Muse Spark benchmarks, GPT-5.4 reasoning scores, who tops the leaderboard this week.

turboquant kv-cache inference

Edge Deployed · Apr 10 · 5 min read

Stop Benchmarking Tokens Per Second. Start Measuring Joules Per Token.

Ask anyone how their on-device model performs and you'll get tokens per second. Maybe latency to first token.

energy-efficiency edge-inference benchmarking

Edge Deployed · Apr 6 · 5 min read

A 1.5B Model Just Beat a 7B by Spending Compute Differently

Researchers at Peking University and Infinigence-AI just dropped a result that should reframe how we think about on-device language models. A Qwen 2.

test-time-compute mobile-npu small-models

Open Weight Weekly · Apr 6 · 5 min read

You've Been Quantizing the Wrong Thing

You spend hours picking between Q4_K_M and Q5_K_S, shaving a few hundred megabytes off your model file.

turboquant kv-cache quantization

Neural Dispatch · Apr 4 · 5 min read

Google's TurboQuant Just Made Your GPU Feel Twice as Big

Everyone obsesses over model weight quantization — Q4_K_M this, GPTQ that — while the actual memory hog during inference quietly eats your VRAM alive.

turboquant google-research kv-cache

Edge Deployed · Mar 30 · 5 min read

87% Smaller, 2% Dumber: A Field Guide to INT4 Quantization

Four billion parameters, two gigabytes of RAM.

quantization int4 gptq

Open Weight Weekly · Mar 30 · 6 min read

Someone Distilled Claude's Thinking Into Qwen3.5, and It Actually Works

A HuggingFace user named Jackrong quietly uploaded a set of models last week that deserve way more attention than they're getting. The pitch: take Claude 4.

qwen3.5 distillation reasoning

Open Weight Weekly · Mar 28 · 5 min read

GLM-5 Is the Best Open Model You'll Never Run

The open-weight leaderboard has a new king, and you probably can't afford to host it.

glm-5 open-weights quantization