Eliezer is a good example of someone who built a lot of status on the back of “breaking” others’ unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
Fair enough.
My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I’m also aware that in thinking breaking is easy I’m probably committing some amount of typical mind fallacy.
Yeah, personally building feels more natural to me.
I agree a leaderboard would be great. I think it’d be cool to have a leaderboard for proposals as well—“this proposal has been unbroken for X days” seems like really valuable information that’s not currently being collected.
I don’t think I personally have enough clout to muster the coordination necessary for a tournament or leaderboard, but you probably do. One challenge is that different proposals are likely to assume different sorts of available capabilities. I have a hunch that many disagreements which appear to be about alignment are actually about capabilities.
In the absence of coordination, I think if someone like you were to simply start advertising themselves as an “uberbreaker” who can shoot holes in any proposal, and over time give reports on which proposals seem the strongest, that could be really valuable and status-rewarding. Sort of a “pre-Eliezer” person I could run my ideas by in a lower-stakes context, as opposed to saying “Hey Eliezer, I solved alignment—wallop me if I’m wrong!”
I appreciate the nudge here to put some of this into action. I hear alarm bells when thinking about formalizing a centralized location for AI safety proposals and information about how they break, but my rough intuition is that if there is a way these can be scrubbed of descriptions of capabilities which could be used irresponsibly to bootstrap AGI, then this is a net positive. At the very least, we should be scrambling to discuss safety controls for already public ML paradigms, in case any of these are just one key insight or a few teraflops away from being world-ending.
I would like to hear from others about this topic, though; I’m very wary of being at fault for accelerating the doom of humanity.