FWIW, I at least found this insightful and enlightening. It seems clearly like a direction worth exploring further, and one that could plausibly pan out.
I wonder if we would need to move beyond the current “one big transformer” setup to realize this. I don’t think humans have a specialized brain region for simulations (though there is a region that seems heavily implicated, see https://www.mountsinai.org/about/newsroom/2012/researchers-identify-area-of-the-brain-that-processes-empathy), but if you want to train something using gradient descent, it might be easier to have a simulation module that predicts human preferences and is rewarded for accurate predictions, and then feed those predictions into the main decision-making model.
Perhaps we could train the preference predictor on revealed preferences (inferred from behavior) combined with elicited preferences (from direct queries). This is similar to the idea of training a separate world model rather than lumping it in with the main blob.
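To make the "separate preference predictor feeding a decision model" idea a bit more concrete, here's a minimal toy sketch (my own construction, not something from the post, written against PyTorch): the predictor gets its own loss measured against revealed- and elicited-preference labels, and the decision model just conditions on the predictor's output. All module names, dimensions, and the placeholder data below are made up for illustration.

```python
import torch
import torch.nn as nn

class PreferencePredictor(nn.Module):
    """Hypothetical module trained only to predict human preferences,
    rewarded (via its own loss) for accurate predictions."""
    def __init__(self, obs_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar preference estimate
        )

    def forward(self, obs):
        return self.net(obs)

class DecisionModel(nn.Module):
    """Main decision-making model; it consumes the predictor's output
    rather than having to learn preferences itself."""
    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden_dim),  # +1 for the predicted preference signal
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, obs, predicted_pref):
        return self.net(torch.cat([obs, predicted_pref], dim=-1))

# Train the predictor on its own objective: accuracy against both
# revealed-preference targets (from behavior) and elicited targets (from queries).
predictor = PreferencePredictor(obs_dim=32)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

obs = torch.randn(64, 32)       # placeholder batch of observations
revealed = torch.randn(64, 1)   # placeholder revealed-preference targets
elicited = torch.randn(64, 1)   # placeholder elicited-preference targets

pred = predictor(obs)
loss = loss_fn(pred, revealed) + loss_fn(pred, elicited)  # combine both signals
opt.zero_grad()
loss.backward()
opt.step()

# The decision model then treats the predicted preferences as an input,
# so its gradients don't rewrite the preference model.
policy = DecisionModel(obs_dim=32, n_actions=4)
with torch.no_grad():
    pref = predictor(obs)
action_logits = policy(obs, pref)
```

How you'd actually obtain revealed- and elicited-preference labels, and how much of this survives contact with a real training setup, is of course the hard part; the sketch is only meant to show where the separate module and its separate loss would sit.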