Something changed with the most recent generation of models. I have a few ‘evil tech’ examples that are openly published, but the implications have not been. So when I get a new model, I throw some papers in and ask ‘what are the implications of this paper for X issue?’ The newest generation is happy to explain the misuse case. This is particularly dangerous because in some cases, a bad actor making use of this ‘evil tech’ would be doing things that ‘the good guys’ do not understand to be possible. I do think I could hit three logs by implementing one of the schemes; up to now, the models were not smart enough to explain it.
If anyone reading this works at a major lab (preferably Google), you might want to talk to me.
Well, that does seem bad; I don’t know what ‘hit three logs’ means but maybe you shouldn’t explain it.
I’ve sent this to a LessWrong friend who works at Google.
I was referencing a previous post I made about harms; I think it’s good to quantify danger in logs (ones, tens, hundreds, thousands): https://www.lesswrong.com/posts/Ek7M3xGAoXDdQkPZQ/terrorism-tylenol-and-dangerous-information#a58t3m6bsxDZTL8DG Three logs means ‘a person who implemented this could kill 1–9×10^3 people’. I don’t think the current censorship approach will work for issues like this, because it’s something the censors are likely unaware of, and therefore the rules are not tuned to detect the problem. The models seem to have crossed a threshold where they can actually generate a new idea.
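For concreteness, here is a minimal sketch of that log-scale convention, assuming ‘N logs’ just means the floor of log10 of the estimated death toll (so estimates in the thousands land at three logs); the function name and buckets are illustrative, not from the linked post:

```python
import math

def harm_logs(estimated_deaths: int) -> int:
    """Order-of-magnitude ("logs") bucket for a harm estimate:
    1-9 deaths -> 0 logs, 10-99 -> 1 log, 100-999 -> 2 logs,
    1,000-9,999 -> 3 logs, and so on."""
    if estimated_deaths < 1:
        return -1  # below the scale
    return math.floor(math.log10(estimated_deaths))

# "Three logs" covers estimates of roughly 1,000-9,999 deaths:
assert harm_logs(1_000) == 3
assert harm_logs(9_999) == 3
```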
Thanks for sending this around!