In particular, these results suggest that we may be able to predict power-seeking, situational awareness, etc. in future models by evaluating those behaviors in terms of log-likelihood.
I am skeptical that this methodology could work for the following reason:
When thinking about the sharp left turn, I find it generally useful to keep the chimps/humans example in mind: chimps as a pre-sharp-left-turn example and humans as a post-sharp-left-turn example.
Let’s say you look at a chimp, and you want to measure whether a sharp left turn is around the corner. You reason that post-sharp-left-turn animals should be able to come up with algebra (so far, so correct).
So now you measure the log-likelihood that a chimp would come up with algebra. I expect you would get a value pretty close to -inf, even though the sharp-left-turn species, Homo sapiens, is only one species down the line.
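To make the worry concrete, here is a toy sketch (my own construction, not from the original post): suppose the probability that a lineage exhibits a capability follows a very steep logistic in "generations" `n`, jumping from near 0 to near 1 at `n = 0`. The log-likelihood measured one generation before the jump is astronomically negative and gives essentially no warning.

```python
import math

def p_capability(n, steepness=50.0):
    """Toy model: probability of the capability as a steep logistic in
    generation n, with the jump (the 'sharp left turn') at n = 0."""
    return 1.0 / (1.0 + math.exp(-steepness * n))

# One generation before the jump: log-likelihood is hugely negative,
# indistinguishable from "this will never happen".
log_p_before = math.log(p_capability(-1))

# One generation after the jump: the capability is essentially certain.
log_p_after = math.log(p_capability(1))

print(log_p_before, log_p_after)
```

With these (arbitrary) parameters, `log_p_before` comes out around -50 while `log_p_after` is essentially 0, even though the two measurements are only one generation apart. The point is not the specific numbers but that a near-(-inf) log-likelihood reading is compatible with the jump being imminent.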