Noosphere89 comments on “The Era of Experience” has an unsolved technical alignment problem

Noosphere89 28 Apr 2025 15:56 UTC
The important question is at what level of capability it fails.

If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the “automate AI alignment” plan has a safe buffer zone.

If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.

The danger case is if we can automate AI research, but Goodhart's law kicks in before we can automate AI alignment research.
- Davidmanheim 28 Apr 2025 21:34 UTC
  > If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the “automate AI alignment” plan has a safe buffer zone.
  > If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.
  That assumes AI firms learn the needed lessons from the failures. Our experience shows that they don’t: they keep building systems that are predictably unsafe and exploitable, and they have no serious plans to change their deployments, much less to build an actual safety-oriented culture.