AI capability tracker
Independent tracking of frontier AI capabilities from November 2022 to the present. Tracks METR autonomous task duration, benchmark scores, inference costs, and context windows across six model families.
Current leader (METR)
1.2 months
Claude Opus 4.6
Baseline (Mar 2023)
3.5 hrs
GPT-4
Doubling time
sub-90 days
Models tracked
33
10 open-source
METR autonomous task duration (p50 reliability)
How long each model can work independently before failing. Y-axis is logarithmic. Dashed lines mark task duration milestones.
Source: METR Horizon v1.1 benchmark data. p50 = task duration at which the model succeeds 50% of the time. Current doubling time: sub-90 days.
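The doubling-time headline can be reproduced in spirit with a least-squares fit of log2(p50 task duration) against release date: the reciprocal of the fitted slope is days per doubling. The dates and durations below are invented for illustration and are not the tracker's dataset.

```python
from datetime import date, timedelta
import math

# Hypothetical sample points: (release date, p50 task duration in hours).
# Values are illustrative only, chosen to double exactly every 100 days.
t0 = date(2023, 3, 1)
points = [
    (t0, 1.0),
    (t0 + timedelta(days=200), 4.0),
    (t0 + timedelta(days=400), 16.0),
]

def doubling_time_days(points):
    """Least-squares fit of log2(duration) vs. days since the first
    release; returns the implied number of days per doubling."""
    start = points[0][0]
    xs = [(d - start).days for d, _ in points]
    ys = [math.log2(hours) for _, hours in points]
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
            sum((x - xbar) ** 2 for x in xs)
    return 1.0 / slope  # days per doubling of p50 duration

print(round(doubling_time_days(points), 1))  # → 100.0
```

With real release dates and p50 values the fit is noisier, and the estimate depends on which models are included, which is why the tracker reports a bound ("sub-90 days") rather than a point estimate.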
Methodology and sources
METR data sourced from the METR Horizon v1.1 benchmark dataset. p50 values represent the task duration at which a model achieves 50% reliability on autonomous, real-world tasks. All METR values are independently measured; no self-reported scores are included.
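As a simplified sketch of how a p50 figure can be read off success data: take the empirical success rate at each task duration and find where it crosses 50%, interpolating linearly in log2(duration) space. All counts below are invented, and METR's actual methodology fits a curve to trial-level data rather than interpolating bucketed rates.

```python
import math

# Hypothetical results: task duration in minutes -> (successes, trials).
buckets = {1: (9, 10), 4: (8, 10), 16: (5, 10), 64: (2, 10), 256: (0, 10)}

def p50_minutes(buckets):
    """Find the duration where success rate crosses 50%, interpolating
    linearly in log2(duration) space between adjacent buckets."""
    pts = sorted((math.log2(d), s / n) for d, (s, n) in buckets.items())
    for (x0, r0), (x1, r1) in zip(pts, pts[1:]):
        if r0 >= 0.5 >= r1:  # crossing lies in this interval
            x = x0 + (r0 - 0.5) * (x1 - x0) / (r0 - r1)
            return 2 ** x
    return None  # success rate never drops through 50%

print(p50_minutes(buckets))  # → 16.0
```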
Benchmark scores are drawn from official model cards, the Scale AI leaderboard, Epoch AI evaluations, and the Artificial Analysis leaderboard. Where scores differ between sources, independently verified figures are preferred over self-reported ones. Benchmarks with known contamination concerns (e.g. SWE-Bench Verified) are included but flagged with appropriate context.
Inference costs reflect published API pricing at launch. Actual costs may vary with volume discounts, batching, and cached token pricing. Open-source model costs reflect typical third-party hosting rates.
This tracker is maintained as a companion resource to the Outpaced investigation series. Last updated March 2026.