I thank y’all for rapidly replicating and extending this eval. This is the most important eval extant: its units are truly comparable across models, and it’s directly connected to the questions of “coding for ML/AI research” and “long-horizon agency” that seem cruxy for short timelines. I did not expect @Daniel Kokotajlo to be right about the superexponentiality so quickly.
My long-timeline probability mass is increasingly dependent on either “this doesn’t generalize past formally verifiable domains, and formally verifiable domains are insufficient to substantially automate AI algorithmic progress” or “somehow this progress doesn’t extend to the arbitrarily messy and novel real world.” But it ain’t looking good.