I’m writing a response to https://www.lesswrong.com/posts/FJJ9ff73adnantXiA/alignment-will-happen-by-default-what-s-next and https://www.lesswrong.com/posts/epjuxGnSPof3GnMSL/alignment-remains-a-hard-unsolved-problem, in which I try to measure how “sticky” the alignment of current LLMs is. I’m proofreading and editing it now. Spoiler: models differ wildly in how committed they are to being aligned, and alignment-by-default may not be a strong enough attractor to work out.
Would anyone want to proofread this?
This can now be read at https://www.lesswrong.com/posts/qE2cEAegQRYiozskD/is-friendly-ai-an-attractor-self-reports-from-22-models-say