Why not spend more time looking at human alignment?

As we approach AGI, we also approach the risk of alignment failure: whether through a mismatch between intended goals and specified goals (‘outer misalignment’) or between specified goals and emergent goals (‘inner misalignment’), we end up with a catastrophically powerful system failing in ways that put us all in a very bad place.

Right now, we don’t know what kind of model will lead to AGI. We can guess, but a few years back people didn’t hold out much hope for LLMs, and look where we are now; a lot of people were deeply surprised. Solutions to alignment failure may be model-dependent, and AGI may emerge from something entirely new, or from something old that starts behaving in surprising ways when scaled. It’s hard to know where to start when faced with unknown unknowns. Of course, if you’re here, you already know all of this.

These challenges and uncertainties may be relatively novel to the academic research community (in which AI safety has only recently become a respectable area of study), but they are by no means novel to most parents, who also train models (albeit ‘wet’ ones) via RLHF and inevitably run into alignment issues. Much of the time the biggies are resolved by adulthood and we end up with humans who balance self-interest with respect for humanity as a whole, but not always. Assertiveness, persuasiveness, public speaking ability and self-confidence are common training goals, for example, but when alignment fails you get humans (who often manage to convince the masses to elect them democratically) who are just one button push away from the destruction of humanity, and of the pathological personality type to actually push it. We don’t need AGI to see alignment failure put the future of humanity on a knife edge. We’re already there.

Since we expect AGI (even superhuman AGI) to exhibit many human-type behaviours, because the underlying models will presumably be trained on human output, why do we not spend more time looking at alignment failure in humans (its determinants, its solutions and its warning signs) and seeing whether those things yield insights into AI alignment failure mechanisms? As any human parent (like me) will tell you, although probably not in quite these words, aligning biological general intelligences is hard, counterintuitive and immensely important. But, unlike AI alignment, there is a vast body of research and data on what works and what doesn’t when it comes to aligning BGIs.

So, why is there so little cross-domain reach into human (especially child) psychology and related fields in the AI community?