Jozdien comments on The Waluigi Effect (mega-post)

Jozdien 3 Mar 2023 21:20 UTC
2 points
I think the relevant question is what properties would be associated with superintelligences drawn from the prior. We don't have much training data on superhuman behaviour on general tasks, yet we can probably draw it out with powerful interpolation. So the properties associated with that behaviour would also have to be sampled from the human prior over what superintelligences are like. If we lived in a world where superintelligences were universally described as honest, why would that not have the same effect as a world where humans are described as honest, which makes sampling honest humans easy?