This is a crux for me; if we had a simplicity metric that we had good reason to believe filtered out training-process-modeling, I would see the deceptive-inner-optimizer concern as basically solved (modulo the solution being compatible with other things we want)
This seems stronger than the claim I’m making. I’m not saying that the agent won’t deceptively model us and the training process at some point. I’m saying that the initial cognition will be developed out of, e.g., low-level features which get reliably pinged with lots of gradients and can be implemented in few steps. Think edge detectors. And then the lower-level features will steer future training. And eventually the agent models us and its training process and maybe deceives us. But not right away.
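As a toy illustration of “low-level features get reliably pinged with lots of gradients” (my sketch, not anything from the original exchange; the architecture, data, and target are all invented): train a small MLP on a target that depends on a few input features, and watch which first-layer weights get consistent gradient signal from the very first step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

for step in range(5):
    x = torch.randn(256, 64)                # fresh data every step
    y = x[:, :4].sum(dim=1, keepdim=True)   # target depends on a few "low-level" features
    loss = nn.functional.mse_loss(net(x), y)
    opt.zero_grad()
    loss.backward()
    # Columns 0-3 of the first-layer weights read the predictive features;
    # they get reliably pinged with much larger gradients than the rest,
    # right from the first step, and then shape what later training builds on.
    g = net[0].weight.grad                  # shape (32, 64): one column per input feature
    print(step,
          "predictive cols:", round(g[:, :4].abs().mean().item(), 4),
          "other cols:", round(g[:, 4:].abs().mean().item(), 4))
    opt.step()
```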
I don’t really see your argument here?
You can make the “some subnetwork just models its training process and cares about getting low loss, and then gets promoted” argument against literally any loss function, even some hypothetical “perfect” one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease the credibility of the argument in the slightest. But I don’t perceive you to believe this implication.
Anyways, here’s another reason I disagree quite strongly with the argument: I perceive it to strongly privilege the training-modeling hypothesis. There is an enormous range of motivations and inner cognitive structures which can be upweighted by the small number of gradients observed early in training.
The network doesn’t “observe” more than that, initially. The network just gets updated by the loss function. It doesn’t even know what the loss function is. It can’t even see the gradients. It can’t even remember the past training data, except insofar as the current episode is retained in its recurrent state. CoT finetuning, e.g., will just etch certain kinds of cognition into the network.
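To make the information-access point concrete, here is a minimal policy-gradient sketch (my illustration; the environment, shapes, and reward rule are invented). Note what the network’s forward pass actually receives: the observation, and nothing else. The reward and the loss exist only in the outer loop and touch the network solely through the weight update.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Linear(4, 2)                     # observation -> action logits
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

def env_reward(obs, action):
    # Computed entirely outside the network; never part of its input.
    return 1.0 if action == int(obs.sum() > 0) else 0.0

for episode in range(200):
    obs = torch.randn(4)
    logits = policy(obs)                     # the network only ever "sees" obs
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = env_reward(obs, action.item())
    loss = -reward * dist.log_prob(action)   # REINFORCE: reward enters via the update alone
    opt.zero_grad()
    loss.backward()
    opt.step()
```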
I don’t currently buy that an implication of shard theory is that deep-NN RL will display bargaining-like behavior by default
Why not? Claims (left somewhat vague because I have to go soon, sorry for lack of concreteness):
RL develops a bunch of contextual decision-influences / shards
E.g. be near diamonds, make diamonds, play games
Agents learn to plan, and several shards get hooked into planning in order to “steer” it.
When the agent is choosing a plan, it is more likely to choose a plan which gets lots of logits from several shards. Furthermore, many shards will bid against schemes where the agent plans to plan in a way which activates only a single shard (see the sketch after this list).
This is just me describing how I think the agent will make choices. I may be saying “shard” a lot but I’m just describing what I think happens within the trained model.
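A cartoon of the plan-selection claim (mine, not part of the original exchange; the shard weights and plan features are made up): score candidate plans by summing the “logits” contributed by several shards. A plan that moderately activates many shards can beat one that maximally activates a single shard.

```python
# Hypothetical shards: each maps a candidate plan's features to a logit "bid".
def diamond_proximity_shard(plan):  return 2.0 * plan["near_diamonds"]
def diamond_making_shard(plan):     return 1.5 * plan["makes_diamonds"]
def game_playing_shard(plan):       return 1.0 * plan["plays_games"]

shards = [diamond_proximity_shard, diamond_making_shard, game_playing_shard]

plans = {
    # Activates several shards moderately.
    "balanced":     {"near_diamonds": 0.6, "makes_diamonds": 0.7, "plays_games": 0.5},
    # Maximally activates one shard, ignores the others.
    "single-shard": {"near_diamonds": 1.0, "makes_diamonds": 0.0, "plays_games": 0.0},
}

scores = {name: sum(shard(plan) for shard in shards) for name, plan in plans.items()}
print(scores)                       # {'balanced': 2.75, 'single-shard': 2.0}
print(max(scores, key=scores.get))  # 'balanced' wins the bidding
```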
You can make the “some subnetwork just models its training process and cares about getting low loss, and then gets promoted” argument against literally any loss function, even some hypothetical “perfect” one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease the credibility of the argument in the slightest. But I don’t perceive you to believe this implication.
This might be the cleanest explanation for why alignment is so hard by default. Loss functions do not work, and reward functions don’t work well.
I also think this argument is bogus, to be clear.