When you say “board vision” what you are really saying is the model needs some kind of mental representation of the world. For example, on a whiteboard, you could have a crude picture of “the world” with stick figures for the people in it, then “the apocalypse as some crude drawing of something bad about the same scale as the world”, then “the world” has no people in it.
Notably this works extremely well for humans. I have found it basically impossible to express the most modestly complex idea without a tool like this. Humans just fail on verbal descriptions above a certain level of complexity. Even when communicating with humans at statistically unlikely intelligence levels. Humans only have “board vision” for the narrow domains they are experts in—in those domains they don’t need a whiteboard.
So you need some type of schema so a large class of hypotheses can be represented (images are probably not a good way, you need a graph structure), and then the model would need to generate it’s outputs in multiple passes, where it constructs this representation then constructs text based on the original prompt + representation, and so on.
When you say “board vision” what you are really saying is the model needs some kind of mental representation of the world. For example, on a whiteboard, you could have a crude picture of “the world” with stick figures for the people in it, then “the apocalypse as some crude drawing of something bad about the same scale as the world”, then “the world” has no people in it.
Notably this works extremely well for humans. I have found it basically impossible to express the most modestly complex idea without a tool like this. Humans just fail on verbal descriptions above a certain level of complexity. Even when communicating with humans at statistically unlikely intelligence levels. Humans only have “board vision” for the narrow domains they are experts in—in those domains they don’t need a whiteboard.
So you need some type of schema so a large class of hypotheses can be represented (images are probably not a good way, you need a graph structure), and then the model would need to generate it’s outputs in multiple passes, where it constructs this representation then constructs text based on the original prompt + representation, and so on.