I’m an Assistant Professor at Carnegie Mellon’s Machine Learning Department. I’m also a core faculty member in CMU’s Neuroscience Institute, and hold a courtesy appointment in the Robotics Institute.
My lab works at the intersection of neuroscience & AI to reverse-engineer animal intelligence and build the next generation of autonomous agents, responsibly and safely.
Learn more here: https://cs.cmu.edu/~anayebi
Thanks for this great post! You may be interested in recent work of mine on corrigibility guarantees here: https://www.lesswrong.com/posts/M5owRcacptnkxwD2u/from-barriers-to-alignment-to-the-first-formal-corrigibility-1
The conclusions are consistent with your intuitions:
1. Aligning to all human values is intractable (even for computationally unbounded agents!).
2. Corrigibility is therefore a reasonable value set that most of us can agree on, and it avoids the intractability in (1).
3. Corrigibility cannot be guaranteed by a single objective (as is currently done in RLHF and Constitutional AI), which is what prior proposals attempted and why they failed.
4. Corrigibility can instead be formally guaranteed via a small number of objectives, all of which take lexicographic priority over the task objective, thereby making it a tractable safety target.
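To make the lexicographic-priority idea in the last point concrete, here is a minimal illustrative sketch (my own, not code from the linked post): safety objectives are compared before the task objective, so no amount of task reward can compensate for a worse safety score. All names and numbers below are hypothetical.

```python
# Illustrative sketch of lexicographic objective prioritization.
# Safety objectives come first in the score tuple, so Python's
# built-in tuple comparison enforces their strict priority over
# the task objective. Names and values are made up for illustration.

def lexicographic_score(safety_scores, task_score):
    """Order scores by priority: safety objectives first, task last.
    Tuples compare element by element, i.e. lexicographically."""
    return tuple(safety_scores) + (task_score,)

def best_policy(policies):
    """Pick the policy whose score tuple is lexicographically greatest."""
    return max(policies, key=lambda p: lexicographic_score(p["safety"], p["task"]))

policies = [
    {"name": "unsafe_high_reward", "safety": (0, 1), "task": 100},
    {"name": "safe_low_reward",    "safety": (1, 1), "task": 10},
]

# The safe policy wins despite a much lower task score, because its
# first safety objective dominates the comparison.
print(best_policy(policies)["name"])  # safe_low_reward
```

The point of the sketch is just the ordering: a safety violation can never be traded away for task reward, which is what distinguishes this from a single weighted-sum objective.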