For what it’s worth, the model showcased in December (then called o3) seems to be completely different from the model that METR benchmarked (now called o3).