Steve, your AI safety musings are my favorite thing tonally on here. Thanks for all the effort you put into this series. I learned a lot.
To ask the direct question: how do we reverse-engineer human social instincts? Do we:
Need to be neuroscience PhDs?
Need to just think a lot about what the base generators of human developmental phenomena are, maybe by staring at a lot of babies?
Guess, and hope we get to build enough AGIs that we notice which ones seem to be coming out normal-acting before one of them kills us?
Something else you’ve thought of?
I don’t have a great sense of the possibility space.
This is a real shame—there are lots of alignment research directions that could really use productive smart people.
I think you might be trapped in a false dichotomy between “impossible” and “easy”. For example, Anthropic’s and Redwood Research’s safety directions will succeed or fail in large part based on how much good work smart people do on interpretability, adversarial auditing, RLHF and its limitations, and so on. Yudkowsky isn’t the only expert, and if he’s miscalibrated, then your actions have extremely high value.