I'm not saying that o3's results are meaningless.
I'm just saying, first of all, that o3's score has a different meaning than other models' scores, because the other models didn't do RL on ARC-like questions. Even if you argue that doing so should be allowed, the other AIs didn't do it, so it isn't right to compare o3's score with theirs without giving any caveats.
Second of all, o3 didn't decide to do RL on these questions on its own. It required humans to run RL on it before it could do these questions. This means that if AGI required countless unknown skills as hard as ARC questions, then o3 wouldn't be AGI. But an AI that could spontaneously reason out how to do ARC questions, without any human-directed RL for them, would be. Also, humans can learn from doing lots of test questions without being told what the correct answers were.
The public training set is weaker, but I argued it’s not a massive difference.
Progress in ML looks a lot like: "we had a different setup with different data and a tweaked algorithm, and did better on this task." If you want to put an asterisk on o3 because it trained in some specific way that's different from previous contenders, then basically every ML advance is going to have a similar asterisk. Seems like a lot of asterisking.
Maybe we can draw a line between the score an AI gets without using human-written problem/solution pairs in any way, and the score it gets after using them in some way (RL on example questions, training on example solutions, etc.).
In the former case, we're interested in how well the AI can do a task as difficult as the test all on its own. In the latter case, we're interested in how well it can do such a task when humans have trained it for the task.
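To make that line concrete, here's a minimal sketch in Python of how scores in the two categories could be kept separate when comparing models. Everything in it is hypothetical: the model names, the scores, and the `used_human_pairs` flag are invented purely to illustrate the two reporting regimes, not real results.

```python
# Hypothetical sketch: keep the two scoring regimes separate when comparing models.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class BenchmarkResult:
    model: str
    score: float            # fraction of test tasks solved (illustrative numbers only)
    used_human_pairs: bool   # True if training/RL used human-written problem/solution pairs

def comparable(a: BenchmarkResult, b: BenchmarkResult) -> bool:
    """Scores are only directly comparable when both models had the same
    kind of exposure to human-written problem/solution pairs."""
    return a.used_human_pairs == b.used_human_pairs

results = [
    BenchmarkResult("model_A", 0.30, used_human_pairs=False),  # hypothetical
    BenchmarkResult("model_B", 0.85, used_human_pairs=True),   # hypothetical
]

for a, b in combinations(results, 2):
    if not comparable(a, b):
        print(f"Caveat: {a.model} and {b.model} were evaluated under different regimes; "
              f"compare their scores only with that caveat attached.")
```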
I really want to make it clear I'm not trying to badmouth o3; I think it is a very impressive model. I should've written my post better.