Posts tagged with quantization

Edge Deployed · 4 min read

Your Edge Pipeline Had Three Models. Gemma 4 E2B Is One.

Last year I watched a team spend four months shipping an on-device assistant that could hear, see, and respond in text.

gemma-4 edge-deployment multimodal

Edge Deployed · 4 min read

You Shrunk the Model to 2 GB. The KV Cache Grew to 3.

Last month I watched a demo where a 3B model, quantized to INT4, ran flawlessly on a Pixel 8. Three-minute conversation, snappy responses.

kv-cache quantization on-device-inference

Open Weight Weekly · 4 min read

Stop Quantizing the Wrong Thing

Most conversations about running models locally end at weight quantization. Q4_K_M or Q8_0.

turboquant kv-cache quantization

Edge Deployed · 5 min read

GPTQ, AWQ, or AutoRound — Pick Wrong and Your Edge Model Ships Late

Everybody argues about which runtime to use on edge devices.

quantization int4 autoround

Open Weight Weekly · 4 min read

Gemma 4's Tool Calling Runs on 5GB of RAM. That's the Whole Point.

Everybody benchmarked Gemma 4 when it dropped April 2. The Codeforces jump from 110 to 2150 got the headlines.

gemma-4 google tool-calling

Open Weight Weekly · 4 min read

A Trillion Parameters Under Your Desk: Kimi K2.6 Goes GGUF

Nine days ago Moonshot AI dropped Kimi K2.6 — a trillion-parameter MoE model that beats Claude Opus 4.

kimi-k2.6 gguf quantization

Neural Dispatch · 4 min read

Google's TurboQuant Squeezes 6x More Context Into Your Existing GPU

Last month Google Research unveiled a paper at ICLR 2026 that deserves way more developer attention than it got.

turboquant google-research kv-cache

Edge Deployed · 5 min read

The Hard Part of Edge AI Was Never the Runtime

Everyone argues about runtimes. ExecuTorch versus LiteRT-LM versus llama.

olive foundry-local onnx-runtime

Neural Dispatch · 4 min read

TurboQuant Shrinks the KV Cache 5x Without Touching Model Weights

Everyone's obsessed with model quality right now — Muse Spark benchmarks, GPT-5.4 reasoning scores, who tops the leaderboard this week.

turboquant kv-cache inference

Edge Deployed · 5 min read

Stop Benchmarking Tokens Per Second. Start Measuring Joules Per Token.

Ask anyone how their on-device model performs and you'll get tokens per second. Maybe latency to first token.

energy-efficiency edge-inference benchmarking

Edge Deployed · 5 min read

A 1.5B Model Just Beat a 7B — By Spending Compute Differently

Researchers at Peking University and Infinigence-AI just dropped a result that should reframe how we think about on-device language models. A Qwen 2.

test-time-compute mobile-npu small-models

Open Weight Weekly · 5 min read

You've Been Quantizing the Wrong Thing

You spend hours picking between Q4_K_M and Q5_K_S, shaving a few hundred megabytes off your model file.

turboquant kv-cache quantization

Neural Dispatch · 5 min read

Google's TurboQuant Just Made Your GPU Feel Twice as Big

Everyone obsesses over model weight quantization — Q4_K_M this, GPTQ that — while the actual memory hog during inference quietly eats your VRAM alive.

turboquant google-research kv-cache

Edge Deployed · 5 min read

87% Smaller, 2% Dumber: A Field Guide to INT4 Quantization

Four billion parameters, two gigabytes of RAM.

quantization int4 gptq

Open Weight Weekly · 6 min read

Someone Distilled Claude's Thinking Into Qwen3.5 — And It Actually Works

A HuggingFace user named Jackrong quietly uploaded a set of models last week that deserve way more attention than they're getting. The pitch: take Claude 4.

qwen3.5 distillation reasoning

Open Weight Weekly · 5 min read

GLM-5 Is the Best Open Model You'll Never Run

The open-weight leaderboard has a new king, and you probably can't afford to host it.

glm-5 open-weights quantization