plex comments on [missing post]

plex 22 Feb 2023 15:53 UTC
32 points
15
This is a Heuristic That Almost Always Works, and it’s the one most likely to cut off our chances of solving alignment. Almost all clever schemes are doomed, but if we as a community let that meme stop us from assessing the object-level question of how (and whether!) each clever scheme is doomed, then we are guaranteed not to find one that works.
Security mindset means looking for flaws, not assuming all plans are so doomed that you don’t need to look.
If this is, in fact, a utility function which, if followed, would lead to a good future, that is concrete progress and lays out a new set of true names as a win condition. It is not a solution, since we can’t train AIs with arbitrary goals, but it’s progress in the same way that quantilizers were progress on mild optimization.
- AprilSR 22 Feb 2023 19:31 UTC
  12 points
  3
  I don’t think security mindset means “look for flaws.” That’s ordinary paranoia. Security mindset is something closer to “you better have a really good reason to believe that there aren’t any flaws whatsoever.” My model is something like “A hard part of developing an alignment plan is figuring out how to ensure there aren’t any flaws, and coming up with flawed clever schemes isn’t very useful for that. Once we know how to make robust systems, it’ll be clearer to us whether we should go for melting GPUs or simulating researchers or whatnot.”
  That said, I have a lot of respect for the idea that coming up with clever schemes is potentially more dignified than shooting everything down, even if clever schemes are unlikely to help much. I respect carado a lot for doing the brainstorming.
  - mesaoptimizer 8 Mar 2023 11:46 UTC
    13 points
    11
    I think a better way of phrasing it is “clever schemes have too many moving parts and make too many assumptions, and each assumption we make is a potential weakness that an intelligent adversary can and will optimize for”.
    - Tamsin Leake 8 Mar 2023 14:27 UTC
      6 points
      4
      i would love a world-saving-plan that isn’t “a clever scheme” with “many moving parts” but alas i don’t expect it’s what we get. as clever schemes with many moving parts go, this one seems not particularly complex compared to other things i’ve heard of.