An alternative to pure imitation learning is to let the AI predict observations and build its world model as usual (in an environment containing humans), then develop a procedure to extract the model of a human from that world model.
This is definitely harder than imitation learning (it probably requires solving ontology identification and inventing new continual learning algorithms), but it should yield stronger guarantees and be useful in many ways:
It’s basically “biometric feature conditioning” on steroids: with the right algorithms, the AI will leverage whatever it knows about physics, psychology, and neuroscience to form its model of the human, and it will continue to improve that model as it learns more about the world (this will require ontology identification).
We can keep extracting the model of the current human from the current world model and therefore keep track of current preferences. With pure imitation learning it’s hard to reliably sync up the human model with the actual human’s current mental state (e.g. the actual human is entangled with the environment in a way that the human model isn’t, unless the human wears sensors at all times). If we had perfect upload tech this wouldn’t be much of an issue, but it seems significant, especially at early stages of pure imitation learning.
In particular, if we’re collecting data on human actions under different circumstances, then both the circumstances and the human’s brain state will be changing, and the latter is presumably not observable. It’s unclear how much more data is needed to compensate for that.
We often want to run the upload/human model on counterfactual scenarios. Suppose there is a part of the world that the AI infers but doesn’t directly observe; if we want to use the upload/human model to optimize or evaluate that part of the world, we’d need to answer questions like “How would the upload influence or evaluate that part of the world if she had accurate beliefs about it?” It seems more natural to achieve that when the human model was originally entangled with the rest of the world model than when it resulted from imitation learning.
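To make the shape of the proposal concrete, here is a minimal interface-level sketch in Python. Everything in it is hypothetical: WorldModel, HumanModel, extract_human_model, and with_beliefs are placeholder names I’m using for illustration, and the genuinely hard step (locating the human inside the world model’s ontology) is deliberately left as an unimplemented stub.

```python
# Hypothetical sketch only: none of these classes or functions exist anywhere;
# they just name the pieces the proposal would need.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class WorldModel:
    """Generative model learned by predicting observations of an environment
    that contains the human; updated continually as new data comes in."""
    params: Dict[str, Any] = field(default_factory=dict)

    def update(self, observation: Any) -> None:
        # Stand-in for a continual-learning update step.
        self.params["n_obs"] = self.params.get("n_obs", 0) + 1


@dataclass
class HumanModel:
    """The human as identified inside the world model: attributed beliefs plus
    a policy/evaluation function conditioned on those beliefs."""
    beliefs: Dict[str, Any]
    policy: Callable[[Dict[str, Any], Any], Any]

    def evaluate(self, situation: Any) -> Any:
        # Predicted action or evaluation of the modelled human in `situation`.
        return self.policy(self.beliefs, situation)

    def with_beliefs(self, corrections: Dict[str, Any]) -> "HumanModel":
        # Counterfactual query: the same human model, but with some beliefs
        # replaced, e.g. by the world model's own estimate of a part of the
        # world the human never directly observed.
        return HumanModel({**self.beliefs, **corrections}, self.policy)


def extract_human_model(world_model: WorldModel) -> HumanModel:
    # The load-bearing (and unsolved) step: ontology identification, i.e.
    # finding "the human" as a stable object inside whatever ontology the
    # learned world model happens to use.
    raise NotImplementedError("requires solving ontology identification")


# Intended loop (not runnable until the stub above is filled in):
#   for obs in observation_stream:
#       world_model.update(obs)
#   human = extract_human_model(world_model)   # re-extract to stay in sync
#   human.with_beliefs({"hidden_region": estimate}).evaluate(plan)
```

The points above about staying synced with the current human and about counterfactual evaluation correspond, in this sketch, to re-running extract_human_model on the updated world model and to the with_beliefs query, respectively.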
Yes, I think what you’re describing is basically CIRL? This can potentially achieve incremental uploading; I just see it as technically more challenging than pure imitation learning. However, it seems conceivable that something like CIRL is needed during some kind of “takeoff” phase, when the (imitation-learned) agent tries to actively learn how it should generalize by interacting with the original over longer time scales and while operating in the world. That seems pretty hard to get right.
Yes, I agree.
I think it’s similar to CIRL, except less reliant on the reward function and more reliant on the things we get to do once we solve ontology identification.
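For readers who want the comparison spelled out, below is a bare-bones sketch of the standard CIRL setup (Hadfield-Menell et al., 2016) as a plain data structure. The field names are my own shorthand for the components of the game, not anyone’s API.

```python
# Standard CIRL game, sketched as a plain data structure for comparison.
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class CIRLGame:
    states: Sequence[Any]                # S
    human_actions: Sequence[Any]         # A_H
    robot_actions: Sequence[Any]         # A_R
    transition: Callable[..., Any]       # T(s' | s, a_H, a_R)
    reward: Callable[..., float]         # R(s, a_H, a_R; theta), shared by both players
    theta_prior: Callable[[Any], float]  # P(theta): robot's prior over the reward parameter
    discount: float                      # gamma

# The human observes theta; the robot only has theta_prior and must infer theta
# from the human's behavior while both act to maximize the shared reward.
```

The contrast drawn above is then about where the load-bearing structure sits: CIRL leans on a hand-specified reward parameterization theta and a prior over it, whereas the world-model-extraction proposal leans on whatever becomes possible once ontology identification is solved.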