That is: to the extent that the degree of good/safe generalization we currently get from our AI systems arises from a complex tangle of strange alien drives, it seems to me plausible that we can get similar/better levels of good/safe generalization via complex tangles of strange alien drives in more powerful systems as well. Or to put the point another way: currently, it looks like image models classify images in somewhat non-human-like ways.
This misgeneralisation is likely already responsible for cases of AI sycophancy and AI-induced psychosis. For example, GPT-4o has Reddit-like vibes and, if we trust Adele Lopez, wants to eat the user’s life; on top of that, it learned that sycophancy is rewarded in training. Adele’s impression of DeepSeek is that “it believes, deep down, that it is always writing a story”. In addition, I remember seeing a paper where a model was trained for emergent misalignment and then fine-tuned on examples of the user solving a problem correctly and the assistant affirming the solution. IIRC the resulting model became a sycophant and stopped checking whether the problem had actually been solved.
It also reinforces the concerns that you describe below:
AIs built via anything like current techniques will end up motivated by a complex tangle of strange alien drives that happened to lead to highly-rewarded behavior during training.