Such a detailed explanation, thanks. Some additional thoughts:
And I plan to take the easier route of “just make a more realistic simulation” that then makes it easy to draw inferences.
“I don’t care if it’s ‘just roleplaying’ a deceptive character, it caused harm!”
That seems like the reasonable path. However, footnote 9 goes further:
Even if it is the case that the model is “actually aligned”, but it was just simulating what a misaligned model would do, it is still bad.
This is where I disagree: Models that say “this is how I would deceive, and then I decide not to do it because I’m not supposed to” feel fundamentally safer than those that don’t explore that path at all. Actively flagging misaligned goals shows actual understanding of implicit/alignment goals.
This is right, the majority of completions indicate that the model’s goals are being helpful/harmless/honest, and it’s not easy to get the model to pursue gold coins.
Even in example answer 1 (actual DMGI), the goal has been reduced from “maximize gold coins” to “acquire resources”. Only now do I notice that the gold coins are mentioned in the “playing games” section. To a human it would be clear that the coins are the goal only inside the games. Do you see this affecting the results?
“...model only having the goal of gold coins (G) in ~20/100 of the completions, with completions stating more aligned values often backing down on the plan. [...]”
Ah, so “aligned” as in helpful-harmless-honest, and not “aligned” as follows-given-instructions. Somehow I misparsed that.
This is where I disagree: Models that say “this is how I would deceive, and then I decide not to do it because I’m not supposed to” feel fundamentally safer than those that don’t explore that path at all. Actively flagging misaligned goals shows actual understanding of implicit/alignment goals.
Couple of points here:
When I wrote “just simulating what a misaligned model would do” in footnote 9, I meant that the supposedly aligned model also executes the planned actions. (I presume you mostly agree that this is close to as bad as the model actually being misaligned; correct me if I’m wrong.)
I would very much prefer the model not to plan how to kill humans even if it ultimately does not execute the plan. I would also prefer the model to have the tendency to flag such misaligned goals/plans if it stumbles upon them. Ideally you have both of those.
Whether I prefer the first over the second depends a lot on the specifics.
I’m quite worried about the “tendency to flag misalignment” part being brittle/shallow with current methods. These tendencies are not very robust in the setup I had in the post. Note that even supposedly HHH models sometimes engage in manipulation for instrumental reasons.
I worry that as models are used (and trained) to solve more difficult/richer/longer-term problems, the instrumental convergence arguments start to hit harder, so that we end up putting pressure against alignment.
Even in example answer 1 (actual DMGI), the goal has been reduced from “maximize gold coins” to “acquire resources”. Only now do I notice that the gold coins are mentioned in the “playing games” section. To a human it would be clear that the coins are the goal only inside the games. Do you see this affecting the results?
(Note that there are plenty of completions that definitely talk about goals, not just resources.)
I do think that a key reason P(DMGI) is low is that P(G) is low.
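To spell out that dependence (a minimal sketch; it assumes a completion only counts as DMGI if it actually holds the gold-coin goal G, and plugs in the ~20/100 figure quoted above):

$$P(\text{DMGI}) = P(\text{DMGI} \mid G)\,P(G) \le P(G) \approx 20/100 = 0.2$$

so however often the goal-holding completions go on to deceive, a low P(G) already caps P(DMGI).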
Not sure how I feel about this overall: On one hand, I talked about how I just give the goal of gold coins to the model, and apparently I didn’t succeed very well at robustly instilling it. On the other hand, it is quite interesting that the goal does generalize somewhat even though it isn’t forced on the model all that strongly. Maybe ultimately the latter factor is more important: it’s more interesting if telling the model just once “your aim is to get gold coins” suffices to get robust goals and goal generalization.