Yes, I think what you’re describing is basically CIRL? This could potentially achieve incremental uploading; I just see it as technically more challenging than pure imitation learning. That said, it seems conceivable that something like CIRL is needed during some kind of “takeoff” phase, when the (imitation-learned) agent tries to actively learn how it should generalize by interacting with the original over longer time scales while operating in the world. That seems pretty hard to get right.
I think it’s similar to CIRL except less reliant on the reward function & more reliant on the things we get to do once we solve ontology identification
Yes I agree