Posts tagged with benchmarks

Neural Dispatch · May 15 · 5 min read

SubQ's 12 Million Token Window Is Either a Breakthrough or a $29M Fundraising Deck

Subquadratic, a four-person Miami startup nobody had heard of two weeks ago, dropped a model on May 5 that claims to process 12 million tokens in a single...

subq · subquadratic · context-window

The Prompt Engineer · May 14 · 5 min read

The Expert Persona Tax

Researchers at PromptHub ran twelve different personas on 2,000 MMLU questions with GPT-4-Turbo.

persona-prompting · system-prompt · prompt-engineering

Open Weight Weekly · May 13 · 4 min read

Twenty-Seven Billion Parameters Shouldn't Code This Well

Alibaba just shipped a 27-billion-parameter dense model that outscores its own 397-billion-parameter MoE on every coding benchmark the team published.

qwen-3.6 · alibaba · dense-model

Open Weight Weekly · May 11 · 4 min read

Mistral Medium 3.5: Strong at Code, Silent on Everything Else

Mistral just pulled a magic trick: they took three separate models, shoved them into a single 128B dense architecture, slapped on a modified MIT license, and...

mistral · mistral-medium-3.5 · open-weights

Agent Patterns · May 10 · 5 min read

Your Agent Eval Costs More Than Your Agent

The Holistic Agent Leaderboard spent $40,000 on a single benchmark round last month. Nine models, nine benchmarks, 21,730 rollouts.

agent-evaluation · production · cost-analysis

The Prompt Engineer · May 9 · 4 min read

SWE-bench Isn't Testing Your Model

Claude Opus 4.5 scores 45...

swe-bench · scaffolding · agent-architecture

Neural Dispatch · May 9 · 5 min read

Kimi K2.6 Won a Live Coding Tournament Against GPT-5.5. The Catch? There Isn't One.

On May 3rd, Moonshot AI's Kimi K2.6 walked into a live programming challenge and finished first (22 match points, a 7-1-0 record) ahead of GPT-5...

kimi · moonshot-ai · open-weights

Synthetic Media · May 8 · 5 min read

The Image Model Fast Enough to Think With

Most image generation workflows aren't about getting one perfect shot.

image-generation · google · gemini-flash

Open Weight Weekly · May 8 · 4 min read

Meta's Two-Trillion-Parameter Ghost

Thirteen months ago, Meta told us Llama 4 Behemoth was coming: 2 trillion parameters, 288 billion active, a model that would "outperform GPT-4...

llama-4 · meta · behemoth

Neural Dispatch · May 7 · 6 min read

DeepSeek Taught Models to Point at Things. It Works Embarrassingly Well.

Every multimodal model can look at an image.

deepseek · multimodal · visual-reasoning

Open Weight Weekly · May 6 · 4 min read

Not Every Layer Deserves 4 Bits

Uniform quantization is the fast food of model compression — convenient, predictable, and quietly destroying nuance.

unsloth · dynamic-quantization · gguf

Synthetic Media · May 5 · 5 min read

Ten Million Hours of Audio and a 60-40 Split

Sixty percent of the time, users picked Fish Audio S2 Pro over ElevenLabs V3. Not in a curated demo.

voice-synthesis · tts · fish-audio

Neural Dispatch · May 5 · 5 min read

DeepSeek V4 Says It Matches GPT-5.4. NIST Says Try GPT-5.

The US government quietly published its independent evaluation of DeepSeek V4 Pro last week, and if you only read DeepSeek's own blog post, you're...

deepseek · deepseek-v4 · nist

Neural Dispatch · May 4 · 5 min read

Meta Spent $14 Billion on Alexandr Wang. Muse Spark Is What They Got.

Meta poured $14...

meta · muse-spark · alexandr-wang

Data Eng Daily · May 3 · 5 min read

SQLMesh Benchmarks Look Too Good. I Went Digging.

Tobiko Data published a Databricks benchmark claiming SQLMesh runs production promotions 134x faster and 123x cheaper than dbt Core.

sqlmesh · dbt · analytics-engineering

Neural Dispatch · May 2 · 4 min read

Z.ai's GLM-5.1 Topped SWE-Bench Pro Without a Single NVIDIA Chip

An open-source model just claimed the top spot on SWE-Bench Pro — the benchmark that's become the de facto measuring stick for agentic software engineering.

glm-5.1 · z-ai · open-source

Synthetic Media · May 1 · 5 min read

A Billion Videos and Not One of Them Is 1080p

xAI reported 1.245 billion Grok Imagine videos generated in a single 30-day window.

video-generation · grok-imagine · xai

Neural Dispatch · May 1 · 5 min read

AI Agents Jumped From 12% to 66% in One Year. Most Still Can't Ship.

Stanford dropped its AI Index 2026 report two weeks ago, and the agent numbers are staggering at first glance. OSWorld task success went from 12% to 66...

stanford-ai-index · ai-agents · benchmarks

Neural Dispatch · Apr 26 · 4 min read

ICLR's Best Paper Puts a Number on the Multi-Turn Performance Cliff

Every benchmark your LLM aced? Single-turn.

iclr-2026 · multi-turn · llm-evaluation

Open Weight Weekly · Apr 25 · 4 min read

DeepSeek V4 Pro Costs 15x More Than V3.2. Nobody's Complaining.

DeepSeek dropped V4 Pro and V4 Flash on Wednesday, and the numbers shut up most of the skeptics before they could finish typing. V4 Pro: 1...

deepseek-v4 · mixture-of-experts · benchmarks
