Ozone Labs

Experiments into the future of publishing and marketing


Research

How would you rate the LLMs?

Human ratings of LLM answers to topical questions. Questions are generated from the past three days of news, and users compare two responses side by side in a blind pairwise vote. A higher Elo rating indicates better performance.

Rank  Model            ELO   Win Rate   Votes   W  L  T
1     Claude 4 Opus    92.4  91.8–93.0  12,847
2     GPT-4o           89.1  88.4–89.8  12,847
3     Gemini 2.5 Pro   87.6  86.9–88.3  12,847
4     Claude 4 Sonnet  85.2  84.5–85.9  12,847
5     GPT-4o Mini      78.3  77.5–79.1  12,847

Elo is a rating system for calculating the relative skill levels of players (or teams) in zero-sum competitions.
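
As a rough illustration of how pairwise votes can turn into Elo ratings, here is a minimal Python sketch of the textbook Elo update. It is not a description of how this leaderboard is actually computed. The starting rating of 1500, the K-factor of 32, the 400-point scale, and the function and model names are all assumptions chosen for the example.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return new (A, B) ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a tie.
    The update is zero-sum: whatever A gains, B loses.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_votes(votes: Iterable[Tuple[str, str, float]],
               initial: float = 1500.0, k: float = 32.0) -> Dict[str, float]:
    """Replay (model_a, model_b, score_a) votes in order and return ratings."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in votes:
        ra = ratings.setdefault(model_a, initial)  # assumed starting rating
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes: (left model, right model, score for the left model).
    sample_votes = [
        ("model-x", "model-y", 1.0),  # model-x preferred
        ("model-y", "model-z", 0.5),  # tie
        ("model-x", "model-z", 0.0),  # model-z preferred
    ]
    for name, rating in sorted(rate_votes(sample_votes).items(),
                               key=lambda item: item[1], reverse=True):
        print(f"{name}: {rating:.1f}")
```

Because each update depends on the current ratings, the result of this sequential approach depends on the order in which votes arrive. Leaderboards built from large vote counts often fit all votes at once instead, which this sketch does not attempt.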

Labs Live

Where impossible ideas get built

We bring together brands, agencies, publishers and engineers to solve hard problems — from first idea to shipped product.

Pitch ideas

Anyone can throw an idea on the table — a hunch, a frustration, a “what if.”

Form teams

Self-organise around the ideas that excite you. Engineers, commercial, data — all welcome.

Build fast

Focused sprint, usually 1–2 days. Working prototypes, not presentations.

Ship

Proven experiments graduate into products. What doesn't ship still teaches us something.