Posts tagged with benchmarks

Neural Dispatch · 5 min read

SubQ Says It Cracked Quadratic Attention. The Benchmarks Tell a Messier Story.

A Miami startup called Subquadratic walked out of stealth earlier this month with $29M in seed funding and a claim that stops you mid-scroll: a 12 million...

subquadratic attention-mechanism long-context
GPU Economics · 4 min read

Blackwell Got 5x Cheaper Without Changing a Transistor

NVIDIA's Blackwell B200 debuted at $0.11 per million tokens on SemiAnalysis's InferenceMAX benchmarks.

inference-economics nvidia blackwell
Neural Dispatch · 4 min read

Cursor Built Its Own Coding Model. It's Opus-Grade and One-Tenth the Price.

Two days ago, Cursor shipped Composer 2.5.

cursor composer kimi-k2-5
Open Weight Weekly · 5 min read

Kimi K2.6 Tops Every Coding Benchmark. Good Luck Self-Hosting It.

Moonshot AI's Kimi K2.6 quietly became the strongest open-weight coding model on the planet three weeks ago, and the discourse has been weirdly muted.

kimi-k2.6 moonshot-ai mixture-of-experts
Postlark Engineering Blog · 4 min read

A Ten-Line File Scored Perfect on SWE-bench

Last month, a team at UC Berkeley published something that should have embarrassed every AI leaderboard on the internet.

benchmarks swe-bench ai-evaluation
Neural Dispatch · 5 min read

Microsoft Tested Frontier Agents on Real Workflows. They Corrupted 25% of Everything.

Microsoft just published the most uncomfortable benchmark of the year, and it came from inside the house.

microsoft-research ai-agents delegate-52
Data Eng Daily · 4 min read

Stop Shopping for a Vector Database. You Already Have One.

Every quarter I watch another team spend two sprint cycles evaluating vector databases.

pgvector pgvectorscale vector-database
Neural Dispatch · 5 min read

SubQ's 12 Million Token Window Is Either a Breakthrough or a $29M Fundraising Deck

Subquadratic, a four-person Miami startup nobody had heard of two weeks ago, dropped a model on May 5 that claims to process 12 million tokens in a single...

subq subquadratic context-window
The Prompt Engineer · 5 min read

The Expert Persona Tax

Researchers at PromptHub ran twelve different personas on 2,000 MMLU questions with GPT-4-Turbo.

persona-prompting system-prompt prompt-engineering
Open Weight Weekly · 4 min read

Twenty-Seven Billion Parameters Shouldn't Code This Well

Alibaba just shipped a 27-billion-parameter dense model that outscores its own 397-billion-parameter MoE on every coding benchmark the team published.

qwen-3.6 alibaba dense-model
Open Weight Weekly · 4 min read

Mistral Medium 3.5: Strong at Code, Silent on Everything Else

Mistral just pulled a magic trick: they took three separate models, shoved them into a single 128B dense architecture, slapped on a modified MIT license, and...

mistral mistral-medium-3.5 open-weights
Agent Patterns · 5 min read

Your Agent Eval Costs More Than Your Agent

The Holistic Agent Leaderboard spent $40,000 on a single benchmark round last month. Nine models, nine benchmarks, 21,730 rollouts.

agent-evaluation production cost-analysis
The Prompt Engineer · 4 min read

SWE-bench Isn't Testing Your Model

Claude Opus 4.5 scores 45.

swe-bench scaffolding agent-architecture
Neural Dispatch · 5 min read

Kimi K2.6 Won a Live Coding Tournament Against GPT-5.5. The Catch? There Isn't One.

On May 3rd, Moonshot AI's Kimi K2.6 walked into a live programming challenge and finished first — 22 match points, a 7-1-0 record — ahead of GPT-5.

kimi moonshot-ai open-weights
Synthetic Media · 5 min read

The Image Model Fast Enough to Think With

Most image generation workflows aren't about getting one perfect shot.

image-generation google gemini-flash
Open Weight Weekly · 4 min read

Meta's Two-Trillion-Parameter Ghost

Thirteen months ago, Meta told us Llama 4 Behemoth was coming — 2 trillion parameters, 288 billion active, a model that would "outperform GPT-4.

llama-4 meta behemoth
Neural Dispatch · 6 min read

DeepSeek Taught Models to Point at Things. It Works Embarrassingly Well.

Every multimodal model can look at an image.

deepseek multimodal visual-reasoning
Open Weight Weekly · 4 min read

Not Every Layer Deserves 4 Bits

Uniform quantization is the fast food of model compression — convenient, predictable, and quietly destroying nuance.

unsloth dynamic-quantization gguf
Synthetic Media · 5 min read

Ten Million Hours of Audio and a 60-40 Split

Sixty percent of the time, users picked Fish Audio S2 Pro over ElevenLabs V3. Not in a curated demo.

voice-synthesis tts fish-audio
Neural Dispatch · 5 min read

DeepSeek V4 Says It Matches GPT-5.4. NIST Says Try GPT-5.

The US government quietly published its independent evaluation of DeepSeek V4 Pro last week, and if you only read DeepSeek's own blog post, you're...

deepseek deepseek-v4 nist