Lots of fascinating points, however:
a) You raise some interesting points about how the inner character is more underdefined than people often realise, but I think it’s also worth flagging that there’s less of a void these days, given that a lot more effort is being put into writing detailed model specs.
b) I am less dismissive about the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario, however I think you’ve neglected the potential for us to apply filtering to the training data. Whilst I don’t think the solution will be that simple, I also don’t think the relation is quite as straightforward as you claim.
c) The discussion of “how do you think the LLMs feel about these experiments” is interesting, but it is also overly anthropomorphic. LLMs are anthropomorphic to a certain extent, having been trained on human data, but it is still mistaken to run a purely anthropomorphic analysis that doesn’t account for other training dynamics.
d) Whilst you make a good point about how the artificiality of the scenario might be affecting the experiment, I feel you’re being overly critical of some of the research into how models might misbehave. Single papers are rarely definitive, and often there’s value in simply showing that a phenomenon exists in order to spur further research, which can then explore a wider range of theories about mechanisms. It’s very easy to say “oh, this is poor-quality research because it doesn’t address my favourite objection”. I’ve probably fallen into this trap myself. However, the number of possible objections is often pretty large, and if you never published until you’d addressed everything, you’d most likely never publish.
e) I worry that some of your skepticism of the risks manages to be persuasive by casting vague aspersions that are disconnected from the actual strength of the arguments. You’re like “oh, the future, the future, people are always saying it’ll happen in the future”, which probably sounds convincing to folks who haven’t been following that closely, but it’s a lot less persuasive if you know that we’ve been consistently seeing stronger results over time (in addition to a recent spike in anecdotes with the new reasoning models). This is just a natural part of the process: when you’re trying to figure out how to conduct solid research in a new domain, of course it’s going to take some time.