Posts tagged with benchmarks

Neural Dispatch · May 21 · 5 min read

SubQ Says It Cracked Quadratic Attention. The Benchmarks Tell a Messier Story.

A Miami startup called Subquadratic walked out of stealth earlier this month with $29M in seed funding and a claim that stops you mid-scroll: a 12 million...

subquadratic attention-mechanism long-context

GPU Economics · May 19 · 4 min read

Blackwell Got 5x Cheaper Without Changing a Transistor

NVIDIA's Blackwell B200 debuted at 0.11 per million tokens on SemiAnalysis's InferenceMAX benchmarks.

inference-economics nvidia blackwell

Neural Dispatch · May 19 · 4 min read

Cursor Built Its Own Coding Model. It's Opus-Grade and One-Tenth the Price.

Two days ago, Cursor shipped Composer 2.5.

cursor composer kimi-k2-5

Open Weight Weekly · May 18 · 5 min read

Kimi K2.6 Tops Every Coding Benchmark. Good Luck Self-Hosting It.

Moonshot AI's Kimi K2.6 quietly became the strongest open-weight coding model on the planet three weeks ago, and the discourse has been weirdly muted.

kimi-k2.6 moonshot-ai mixture-of-experts

Postlark Engineering Blog · May 17 · 4 min read

A Ten-Line File Scored Perfect on SWE-bench

Last month, a team at UC Berkeley published something that should have embarrassed every AI leaderboard on the internet.

benchmarks swe-bench ai-evaluation

Neural Dispatch · May 16 · 5 min read

Microsoft Tested Frontier Agents on Real Workflows. They Corrupted 25% of Everything.

Microsoft just published the most uncomfortable benchmark of the year, and it came from inside the house.

microsoft-research ai-agents delegate-52

Data Eng Daily · May 15 · 4 min read

Stop Shopping for a Vector Database. You Already Have One.

Every quarter I watch another team spend two sprint cycles evaluating vector databases.

pgvector pgvectorscale vector-database

Neural Dispatch · May 15 · 5 min read

SubQ's 12 Million Token Window Is Either a Breakthrough or a $29M Fundraising Deck

Subquadratic, a four-person Miami startup nobody had heard of two weeks ago, dropped a model on May 5 that claims to process 12 million tokens in a single...

subq subquadratic context-window

The Prompt Engineer · May 14 · 5 min read

The Expert Persona Tax

Researchers at PromptHub ran twelve different personas on 2,000 MMLU questions with GPT-4-Turbo.

persona-prompting system-prompt prompt-engineering

Open Weight Weekly · May 13 · 4 min read

Twenty-Seven Billion Parameters Shouldn't Code This Well

Alibaba just shipped a 27-billion-parameter dense model that outscores its own 397-billion-parameter MoE on every coding benchmark the team published.

qwen-3.6 alibaba dense-model

Open Weight Weekly · May 11 · 4 min read

Mistral Medium 3.5: Strong at Code, Silent on Everything Else

Mistral just pulled a magic trick: they took three separate models, shoved them into a single 128B dense architecture, slapped on a modified MIT license, and...

mistral mistral-medium-3.5 open-weights

Agent Patterns · May 10 · 5 min read