RedMan comments on Alignment will happen by default. What’s next?

RedMan 26 Nov 2025 0:13 UTC
1 point
I was referencing a previous post I made about harms. I think it’s good to quantify danger in logs, meaning orders of magnitude (ones, tens, hundreds, thousands): https://www.lesswrong.com/posts/Ek7M3xGAoXDdQkPZQ/terrorism-tylenol-and-dangerous-information#a58t3m6bsxDZTL8DG Three logs means ‘a person who implemented this could kill 1-9x10^3 people’, i.e. somewhere in the thousands. I don’t think the current censorship approach will work for issues like this, because the censors are likely unaware of them, and therefore the rules are not tuned to detect the problem. The models seem to have crossed a threshold where they can actually generate a new idea.
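To make the log scale concrete, here is a minimal sketch of the arithmetic, assuming "logs" means the base-10 order of magnitude of the casualty count. The function names `harm_logs` and `log_range` are made up for illustration and do not come from the linked post.

```python
def harm_logs(casualties: int) -> int:
    """Return the order of magnitude (number of 'logs') of a casualty count.

    Assumption: logs = floor(log10(n)), computed from the digit count
    so that exact powers of ten are not affected by float rounding.
    """
    if casualties < 1:
        raise ValueError("casualties must be a positive integer")
    return len(str(casualties)) - 1


def log_range(logs: int) -> tuple[int, int]:
    """Return the inclusive range of casualty counts covered by a log level."""
    if logs < 0:
        raise ValueError("logs must be non-negative")
    return 10 ** logs, 10 ** (logs + 1) - 1


if __name__ == "__main__":
    for n in (7, 40, 650, 1000, 9999):
        print(n, "->", harm_logs(n), "logs")
    print("three logs covers", log_range(3))
```

Under that assumption, anything from 1,000 to 9,999 counts as three logs, which matches the 1-9x10^3 reading above.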
Thanks for sending this around!