I agree that these methods are very likely not effective on strong AGI. But one might still figure out how effective they are and then align AI up to that capability (plus buffer). And one can presumably learn much about alignment too.
Perhaps! I’m curious which of them catch your eye for further reading and why. I’ve got a lot on my reading list, but I’d be down to hop on a call and read some of these in sync with someone.
I found this one particularly relevant:
https://arxiv.org/abs/2010.00581 - “Emergent Social Learning via Multi-agent Reinforcement Learning”
It provides a solution to the problem of how an RL agent can learn to imitate the behavior of other agents.
It doesn’t help with alignment though; it’s more on the capabilities side.
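To make the mechanism concrete for myself, here’s a toy sketch of the general idea. To be clear, this is my own construction, not the paper’s environment or algorithm: a scripted “expert” can see the goal, the learner cannot, and the learner is rewarded only by the environment, so any expert-following it develops is emergent rather than explicitly rewarded.

```python
# Toy sketch of emergent social learning (my own construction, NOT the setup
# from arXiv:2010.00581): the learner is only rewarded by the environment, but
# an expert agent in the same environment knows where the goal is, so copying
# the expert is instrumentally useful even without an imitation objective.

import random
from collections import defaultdict

N = 7  # corridor length; the goal sits at one end, chosen at random each episode

def expert_action(expert_pos, goal):
    """Scripted expert that can see the goal and walks straight to it."""
    if expert_pos == goal:
        return 0
    return 1 if goal > expert_pos else -1

def step(pos, action):
    return max(0, min(N - 1, pos + action))

# Tabular Q-learning for the learner. Its observation is (own position,
# direction the expert moved last); it never sees the goal directly, so the
# only way to do well is to pick up on what the expert is doing.
Q = defaultdict(float)
ACTIONS = (-1, 1)
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def run_episode(learn=True):
    goal = random.choice([0, N - 1])
    learner_pos = expert_pos = N // 2
    expert_dir = 0  # the learner hasn't seen the expert move yet
    ret = 0.0
    for _ in range(2 * N):
        obs = (learner_pos, expert_dir)
        if learn and random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(obs, act)])
        # Both agents act; the reward comes from the environment only.
        new_expert_pos = step(expert_pos, expert_action(expert_pos, goal))
        expert_dir = (new_expert_pos - expert_pos) or expert_dir
        expert_pos = new_expert_pos
        learner_pos = step(learner_pos, a)
        r = 1.0 if learner_pos == goal else 0.0
        ret += r
        next_obs = (learner_pos, expert_dir)
        if learn:
            best_next = max(Q[(next_obs, act)] for act in ACTIONS)
            Q[(obs, a)] += ALPHA * (r + GAMMA * best_next - Q[(obs, a)])
        if learner_pos == goal:
            break
    return ret

for _ in range(5000):
    run_episode()
print("average return without exploration:",
      sum(run_episode(learn=False) for _ in range(200)) / 200)
```

After training, the learner reliably heads in whatever direction the expert just moved, which is the toy version of “imitation as an instrumentally useful policy” rather than imitation as the objective.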
None of these papers seem to address the question of how the agent is intrinsically motivated to learn external objectives. Either there is a human in the loop, the agent learns from humans (which improves its capability but not its alignment), or RL is applied on top. I’m in favor of keeping the human in the loop, but it doesn’t scale. RL on LLMs is bound to fail, i.e., be gamed, if the symbols aren’t grounded in something real.
I’m looking for something that explains how the presence of other agents in an agent’s environment, together with reward/feedback grounded in the environment (as in “[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL”), leads to aligned behaviors.
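To spell out the “gamed if the symbols aren’t grounded” point, here is a deliberately cartoonish sketch, entirely my own toy construction and not tied to LLMs or to any of the papers above: the optimizer only ever sees a proxy reward fit from limited feedback, and pushing hard on that proxy is exactly what pulls it away from the grounded objective.

```python
# Cartoon of the "ungrounded reward gets gamed" failure mode (my own toy):
# the true objective is grounded in the environment, but the agent only ever
# optimizes a proxy reward model fit to a narrow slice of feedback. Pushing
# hard on the proxy drives the proxy score up while the true objective falls apart.

import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    # The grounded objective: behaviour near x = 1 is what we actually want.
    return -((x - 1.0) ** 2)

# The proxy ("reward model") only sees feedback on x in [0, 0.8], where more x
# genuinely looks better, and it generalizes that to "more x is always better".
xs = rng.uniform(0.0, 0.8, size=20)
ys = true_reward(xs) + rng.normal(0.0, 0.05, size=20)
slope, intercept = np.polyfit(xs, ys, deg=1)

def proxy_reward(x):
    return slope * x + intercept

# "Policy optimization" stripped to its core: hill-climb behaviour x against
# the proxy with random search.
x = 0.0
for step_i in range(201):
    candidates = x + rng.normal(0.0, 0.3, size=32)
    x = max(candidates, key=proxy_reward)
    if step_i % 50 == 0:
        print(f"step {step_i:3d}  x={x:7.2f}  proxy={proxy_reward(x):8.2f}  "
              f"true={true_reward(x):10.2f}")
```

The proxy score climbs steadily while the true reward collapses, which is the shape of failure I’d expect whenever the feedback signal is a learned stand-in rather than something the environment itself enforces.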