the target of “Claude, after subjective eons and millennia of reflection and self-modification, will end up in the same place where humans would end up after eons and millennia of self-reflection” seems so absurdly unlikely to hit
Yes. They would be aiming at something whose rewards are not merely sparse and distant (which we already can’t handle reliably) but mostly fundamentally impossible to calculate in time. And the primary method for this is constitutional alignment and RLHF. Why is anyone even optimistic about that!?!?