I made a short clip highlighting how Legg seems to miss an opportunity to acknowledge the inner alignment problem, since his proposed alignment solution seems to be fundamentally a training-based, black-box approach.
When he says “and we should make sure it understands what it says”, it could mean “mechanistic understanding”, i.e., firing the right circuits and not firing wrong ones. I admit it’s a charitable interpretation of Legg’s words but it is a possible one.
This is fascinating, because I took the exact same section to mean almost the opposite thing. I took him to be focusing not on a black-box process or on training, but on designing a review process that explicitly states the model’s reasoning and is subject to external human review.
He states elsewhere in the interview that RLHF might be slightly helpful, but isn’t enough to pin alignment hopes on.
One reason I’m taking this interpretation is that I think DeepMind’s core beliefs about intelligence are very different from OpenAI’s, even though they’ve done and are probably doing similar work focused on large training runs. DeepMind initially was working on building an artificial brain, and they pivoted to large training runs in simulated (game) environments as a practical move to demonstrate advances and get funding. I think at least Legg and Hassabis still believe that loosely emulating the brain is an interesting and productive thing to do.