Strongly agree that this is a very interesting question. The concept of misalignment in models seems to generalise at a higher level of abstraction than we would expect. We're hoping to look into the reasons behind this further, and also to extend the work to get a better sense of how common unexpected generalisations like this are in other setups.