Posts tagged with local-inference

Neural Dispatch · May 6 ·5 min read

Gemma 4's MTP Drafters Turn Idle GPU Cycles Into a 3x Speedup

Your GPU spends most of its inference time shuttling weights between VRAM and compute cores, not actually doing math.

gemma-4multi-token-predictionspeculative-decoding

Open Weight Weekly · May 6 ·4 min read

Not Every Layer Deserves 4 Bits

Uniform quantization is the fast food of model compression — convenient, predictable, and quietly destroying nuance.

unslothdynamic-quantizationgguf

Open Weight Weekly · May 4 ·4 min read

Gemma 4's Tool Calling Runs on 5GB of RAM. That's the Whole Point.

Everybody benchmarked Gemma 4 when it dropped April 2. The Codeforces jump from 110 to 2150 got the headlines.

gemma-4googletool-calling

Open Weight Weekly · Apr 29 ·4 min read

A Trillion Parameters Under Your Desk: Kimi K2.6 Goes GGUF

Nine days ago Moonshot AI dropped Kimi K2.6 — a trillion-parameter MoE model that beats Claude Opus 4.

kimi-k2.6ggufquantization

Open Weight Weekly · Apr 24 ·4 min read

The 27B Model That Embarrassed 397 Billion Parameters

Alibaba shipped Qwen3.6-27B on April 22nd, and the benchmarks don't make sense.

qwen3.6dense-architecturedeltanet

Open Weight Weekly · Apr 20 ·4 min read

Blackwell Has 96GB of VRAM. vLLM and Ollama Disagree About What to Do With It.

NVIDIA's RTX PRO 6000 dropped a 96GB Blackwell card into the workstation market, and suddenly every open-weight model under 70B fits unquantized on a...

vllmollamablackwell

Open Weight Weekly · Apr 18 ·3 min read

Devstral Small 2 Scores 68% on SWE-bench Verified — From a MacBook

Most agentic coding models worth running require hardware that costs more than a used car.

devstral-small-2mistralagentic-coding

Open Weight Weekly · Apr 17 ·4 min read

Qwen Burned 97% of Its Parameters and the Code Still Compiles

Qwen's latest coding model has 80 billion parameters and uses 3 billion of them.

qwen3-coder-nextmixture-of-expertscoding-agents

Neural Dispatch · Apr 14 ·5 min read

Gemma 4's Apache 2.0 License Matters More Than Its Benchmarks

The most important thing Google shipped with Gemma 4 isn't a model. It's a license.

gemma-4google-deepmindapache-2-0

Open Weight Weekly · Apr 13 ·4 min read

DFlash Landed on MLX. Your Mac Just Got 3x Faster.

Standard speculative decoding has been around for a while.

dflashmlxspeculative-decoding

Neural Dispatch · Apr 6 ·5 min read

Google's Gemma 4: The License Change Matters More Than the Benchmarks

Google dropped Gemma 4 last Wednesday, and predictably, most of the coverage has been about the benchmark horse race — Arena rankings, MMLU Pro scores, AIME...

gemma-4googleopen-weights

Neural Dispatch · Apr 4 ·5 min read

Google's TurboQuant Just Made Your GPU Feel Twice as Big

Everyone obsesses over model weight quantization — Q4_K_M this, GPTQ that — while the actual memory hog during inference quietly eats your VRAM alive.

turboquantgoogle-researchkv-cache

Neural Dispatch · Apr 3 ·4 min read

Gemma 4 Ships Under Apache 2.0 With an Architecture Nobody Expected

Google dropped Gemma 4 on Wednesday — four open-weight models under a genuine Apache 2.0 license, built from the same research behind Gemini 3.

gemma-4googleopen-source

Open Weight Weekly · Apr 3 ·4 min read

Gemma 4's Secret Weapon Isn't the 31B — It's the 26B That Acts Like a 4B

Google shipped Gemma 4 yesterday under Apache 2.

gemma-4mixture-of-expertsapache-2

Open Weight Weekly · Mar 30 ·6 min read

Someone Distilled Claude's Thinking Into Qwen3.5 — And It Actually Works

A HuggingFace user named Jackrong quietly uploaded a set of models last week that deserve way more attention than they're getting. The pitch: take Claude 4.

qwen3.5distillationreasoning