Noosphere89 comments on Six Thoughts on AI Safety

Noosphere89 29 Jan 2025 21:56 UTC
6 points
2
This sounds a lot like what @Seth Herd’s post on instruction-following AIs is about:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than
- Seth Herd 30 Jan 2025 5:40 UTC
  4 points
  2
  Thanks for the mention.
  Here’s how I’d frame it: I don’t think it’s a good idea to leave the entire future up to the interpretation of our first AGI(s). They could interpret our attempted alignment very differently than we hoped, in ways that look sensible in retrospect, or do something like “going crazy” from prompt injections or from strange chains of thought that lead to ill-considered beliefs, which then take control of their functional goals.
  It seems like the core goal should be to follow instructions or take correction: corrigibility as a singular target (or at least the prime target). It seems noticeably safer to use intent alignment as a stepping stone to value alignment.
  Of course, leaving humans in charge of AGI/ASI even for a little while sounds pretty scary too, so I don’t know.