I don’t really listen to podcasts, but this seems pretty important. If someone would be up for either extracting all the relevant quotes, or writing up their own description of what they think Shane’s alignment proposal is, that would be useful (and I think there is a decent chance we could get Shane to take a quick look and say whether it’s remotely accurate, since he’s commented on LessWrong things before).
That’s what I’ve tried to do here, and I’d be happy to do a more thorough job, including direct quotes. I have a bit of a conflict of interest, since I think his ideas parallel mine closely. I’ve tried to note areas where I may be reading into his statements, and I can do that more carefully in a longer version.
I made a short clip highlighting how Legg seems to miss an opportunity to acknowledge the inner alignment problem, since his proposed alignment solution seems to be a fundamentally training-based / black-box approach.
When he says “and we should make sure it understands what it says”, he could mean “mechanistic understanding”, i.e., firing the right circuits and not firing the wrong ones. I admit that’s a charitable interpretation of Legg’s words, but it is a possible one.
This is fascinating, because I took the exact same section to mean almost the opposite thing. I took him to be focused not on training or a black-box process, but on the design of a review process that explicitly states the model’s reasoning and is subject to external human review.
He states elsewhere in the interview that RLHF might be slightly helpful, but isn’t enough to pin alignment hopes on.
One reason I’m taking this interpretation is that I think DeepMind’s core beliefs about intelligence are very different from OpenAI’s, even though they’ve done and are probably still doing similar work focused on large training runs. DeepMind was initially working on building an artificial brain, and they pivoted to large training runs in simulated (game) environments as a practical move to demonstrate advances and get funding. I think at least Legg and Hassabis still believe that loosely emulating the brain is an interesting and productive thing to do.