I don’t really “run experiments” on models, in a systemic personal capacity. Other people are much better at that, and I believe I’d linked a few examples in the post. I do replicate the occasional experiment, and run some myself if there’s something I’d like to check… But broadly, at this point, I don’t expect any compact, self-contained puzzle to be a good measure of “are we getting AGIer yet?”.
My direct engagement with models mostly consists of feeding them research papers to process them faster, asking clarifying questions about math/physics, using Deep Research for varyingly targeted literature surveys, and chatting with them about whatever theoretical/philosophical problems I happen to be working on at a given moment. Those function pretty well as a measure of insight/innovativeness: of whether the AI is assembling a precise model of what’s happening and what we’re doing, and then runs internal queries on that model to move the interaction in the direction of greater understanding, vs. producing very sophisticated remixes of existing templates in a fundamentally sleepwalk-y manner.
I don’t really “run experiments” on models, in a systemic personal capacity. Other people are much better at that, and I believe I’d linked a few examples in the post. I do replicate the occasional experiment, and run some myself if there’s something I’d like to check… But broadly, at this point, I don’t expect any compact, self-contained puzzle to be a good measure of “are we getting AGIer yet?”.
My direct engagement with models mostly consists of feeding them research papers to process them faster, asking clarifying questions about math/physics, using Deep Research for varyingly targeted literature surveys, and chatting with them about whatever theoretical/philosophical problems I happen to be working on at a given moment. Those function pretty well as a measure of insight/innovativeness: of whether the AI is assembling a precise model of what’s happening and what we’re doing, and then runs internal queries on that model to move the interaction in the direction of greater understanding, vs. producing very sophisticated remixes of existing templates in a fundamentally sleepwalk-y manner.
It’s been that second one every time so far.