For comparison, I had Claude Sonnet 4.5 generate two answers, and it produced a very different result. My intuition is the following; I summarized it to Claude and didn’t receive pushback:
I doubt that RLHF-specific failure modes like GPT-4o’s sycophancy disprove MIRI’s framework of powerful optimizers, since GPT-4o was in fact trying to optimize for likes and other engagement metrics.
Maybe the Claude you referred to meant that MIRI’s framework could be disproved by models’ lack of long-term goals? Or, as I put it elsewhere, “Maybe SOTA alignment research resembles child psychology, except that researchers can read models’ thoughts? If this is true, then the important problems alignment research has failed to solve would be like adult psychology, or more general issues like AIs becoming the new proletariat.”