I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario. So far as I can tell, this is nowhere in the blog post.
Expanding mildly on this—calling some behavior “misaligned” could be a reference to the model violating (1) some objective code of ethics OR violating (2) the code of behavior that Anthropic wishes to instill in the model.
The interesting question here seems like the second. Who knows what “objective ethics” are, after all.
But, if we could look at (1) the Constitution that Anthropic tried to put in the model, (2) the details of the training setup used to instill this Constitution, and (3) the various false starts and failures Anthropic had along the way, then from the “failures” involved in adhering to the Constitution we might be able to learn interesting things about how LLMs generalize, how hard it is to get them to generalize in desired ways, and so on. Learning how LLMs / AIs generalize about moral situations, how hard that is, and what kinds of pitfalls there are seems generally useful.
But, in the absence of knowing what actual principles Anthropic tried to put in the model, and the training setup used to do so, why would you care at all about the behaviors the model exhibits in various Trolley-like situations? Like really, it tells us almost nothing about what kinds of generalization are involved, or how we would expect future LLMs and AIs to behave.
If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access.
Yeah, but research that tells us interesting things about how LLMs learn isn’t possible without knowing how Anthropic trained it. We can still do stamp collecting outside of Anthropic, but spending, say, a week of work on “The model will do X in situation Y” without knowing whether Anthropic was trying to make the model do X or ~X is basically model-free data gathering. It does not help you uncover the causes of things.
I think this is a problem not with this particular work, but with almost all safety research that leans on corporate models whose training is a closely kept secret. I realize you and others are trying to do good things and to uncover useful information; but I just don’t think much useful information can be found under the constraint that you don’t know how a model was trained. (Apart from brute-fact advice like “Don’t set up a model this way,” which, let’s face it, no one was ever going to do anyway.)