How are they going to ensure that the “human-level alignment researcher” a) is human-level, and b) stays at human level?
And, of course, it would be lovely to see an elaboration on the training of misaligned models.
What do you mean by ‘stays at human level’? I assume this isn’t going to be any kind of self-modifying system?
If I were a human-level intelligent computer program, I would put substantial effort into gaining the ability to self-modify, but that’s not the point. My favorite analogy here is that humans were bad at addition before the invention of positional arithmetic, and then they became good at it. My concern is that we could invent a seemingly human-level system that becomes above human level after it learns some new cognitive strategy.