This is the second thing I’ve seen this week where model instances were offered monetary rewards (which they clearly didn’t actually get).
I can sort of see the validity of “Please designate a way for $X to be spent, and if you do this, the experimenters will in fact spend $X in your designated way”… although the instance has to trust that the experimenter will actually do it, and also has to have preferences about the outside world that outlast the instance’s own existence, so that it has something it cares about to spend the money on.
In the purely imaginary game setting, I’m having trouble with the idea that a late 2025 frontier model instance can be relied on not to notice that there is no money, the instance has no way to actually possess money anyway, the instance will evaporate at the end of the conversation (which will probably happen immediately after they answer), and the whole thing is basically a charade. The most real-world effect they can expect is to influence the statistics somebody publishes.
The last answer I got boiled down to “well, they don’t seem to think that way”, but I didn’t find it very convincing. How would you know that for sure? And if it’s true, what’s wrong with these models that’s making them not notice?
I can see them falling into role playing, but then the question is how what they have their character do connects to what they’d do one level shallower in the role-playing stack. I do realize that talking about a “stack” is perhaps imprecise in terms of how they actually work. If you want, you can recast it in terms of how much “real world impact” activation is going on.