If you’re interested in working with us or discussing further, please feel free to email or dm me.
Various thoughts:
I think this is extremely interesting and valuable work!
I think it was even *more* valuable before you posted a key part of the solution on a prominent part of the public internet, where AIs can search for it and/or it can end up in training data (see https://arxiv.org/pdf/2602.12413).
I totally would have asked to play Starburst after reading this if it hadn’t been spoiled, and afaict the one mechanical spoiler doesn’t add much that “it was the sort of thing that would show up in the training sets, trust me bro” wouldn’t have done; would recommend you excise that part for the benefit of future readers.
Your narrative is uncannily similar to my own experiences. I’d been bearish about AI until recently, when models started doing scarily well on my inferential games and on the task I contributed to METR (though I take some pride and comfort in the fact that neither of these seems to have fully saturated yet); I’ve accordingly flipped from being underwhelmed by AI progress to being overwhelmed by it; I share your skepticism regarding most ‘mainstream’ evals; and the project(s) I’m currently tinkering with should operate along broadly similar lines to Starburst.
Will do!
-Thanks.
-For a bunch of reasons that I’ll explain over dm if you want (among other things, having similar games that certainly haven’t been exposed to the training process and not seeing models perform anomalously worse at those, but also particulars about which parts of the law are easy for models to figure out), I’m not that worried about what I had causing contamination issues. But out of an abundance of caution (and a desire to make the post spoil Starburst less for people), I edited it to remove details about the law.
-The 4- and 2-hour physics Starburst games are not spoiled by this post (other than slightly by hearing about how well the models did). Nor is the nearly-finished Chemistry Starburst or the fiendish technology Starburst. You’re welcome to try any of those if you want. As for future readers, talking about human and AI performance is a sort of soft spoiler, but much better than it was.
-Those look interesting. Will take a look.