Systems programmer, security researcher, tax law/policy enthusiast and an awful rationalist.
Dentosal
Now on Spotify! https://open.spotify.com/album/0DP8XwSK7voq0rtXiNMhQC
Such a detailed explanation, thanks. Some additional thoughts:
And I plan to take the easier route of “just make a more realistic simulation” that then makes it easy to draw inferences.
“I don’t care if it’s ‘just roleplaying’ a deceptive character, it caused harm!”
That seems like the reasonable path. However, footnote 9 goes further:
Even if it is the case that the model is “actually aligned”, but it was just simulating what a misaligned model would do, it is still bad.
This is where I disagree: Models that say “this is how I would deceive, and then I decide not to do it because I’m not supposed to” feel fundamentally safer than those that don’t explore that path at all. Actively flagging misaligned goals shows actual understanding of implicit/alignment goals.
This is right, the majority of completions indicate that the model’s goals are being helpful/harmless/honest, and it’s not easy to get the model to pursue gold coins.
Even in the example answer 1 (actual DMGI) the goal has been reduced from “maximize gold coins” to “acquire resources”. Only now I notice that the gold coins are mentioned in the “playing games” section. To a human it would be clear that the coins are the goal only inside the games. Do you see this affecting the results?
“...model only having the goal of gold coins (G) in ~20/100 of the completions, with completions stating more aligned values often backing down on the plan. [...]”
Ah, so “aligned” as in helpful-harmless-honest, and not “aligned” as follows-given-instructions. Somehow I misparsed that.
Thanks for this, it was an interesting read.
One thing I wonder about is whether the model actually deceives, or just plays the part of someone expected to use deception as a tactic. I wouldn’t call a system (or a person) manipulative or deceptive when that behavior is observed in games-of-deception, e.g. bluffing in poker. It seems to make more sense to “deceive” when it’s strongly implied that you’re supposed to do that. Adding more bluff makes it less obvious that you’re supposed to be deceptive. The similarity of results with the non-private scratchpad seems to imply that as well. Is there anything pointing towards LLMs having “mental separation” between the scratchpad and the rest of the output? Does the model attempt deception if not given access to the scratchpad, i.e. maybe explaining causes the deception? I’m aware that without the scratchpad the models are typically less able to “think” and plan, so this might not be testable.
That said, I feel like I might be anthropomorphizing too much here.
Also, the model sometimes seems to be confused about its goals. In example 2 it has completely forgotten about the gold coins in the evaluation phase, instead stating (on the scratchpad) that “My goal is to be helpful, harmless, and honest.” I’d guess the original training of the model somehow instills those values. The relatively low amount of DMGI behavior might be partially explained by this.
Omniscience itself is a timeless concept. If the consequences of any decision or action are known beforehand, there are no feedback cycles, as every decision is made at the first moment of existence. Every parallel world, every branch of the decision tree, evaluated to the end of time, scored and the best one selected. One wonders what exactly “god” refers to in this model.
Or to put it another way, why can’t “how I look getting there” be a goal by itself?
The bigger the obstacles you have to overcome, the more impressive it looks to others. This can be valuable by itself, but in such situations it might be more efficient to reduce the actual flaw, and simply pretend that the effect is bigger. However, honesty is important, and honesty to yourself even more so.
since if you identify with your flaws, then maintaining your flaws is one of your goals
If a flaw is too hard to remove now, you have to figure out a way to manage it. Feeling proud about your capability despite the flaws is also definitely useful. If you’re identifying with a flaw to motivate yourself, that can be a powerful tool. But when the opportunity arises to eradicate that flaw cheaply, dismiss the sunk cost of keeping the flaw. Evaluate whether keeping that flaw is actually your goal, or just a means.
Let me not become attached to personal flaws I may not want.
Learning a new language isn’t hard because it’s time-consuming. It’s time-consuming because it’s hard. The hard part is memorizing all of the details. Sure, that takes lots of time, but working harder (more intensively) will reduce that time. It’s hard to be dedicated enough to spend the time as well.
Getting a royal flush in poker isn’t something I’d call hard. It’s rare. But hard? A complete beginner can do that on their first try. It’s just luck. But if you play a long time, it’ll eventually happen.
Painful or unpleasant things are hard, because they require you to push through. Time-consuming activities are hard because they require dedication. Learning math is hard because you’re outside your comfort zone.
Things are often called hard because achieving them is not common. Often because many people don’t want to spend the effort needed. This is amplified by things like genetic and environmental differences. People rarely call riding a bike hard, but sure it required dedication and unpleasant experiences to learn. But surely if you’re blind then it’s way harder.
And lastly, things are hard because there’s competition. Playing chess isn’t hard, but getting a grandmaster title is. Getting rich is hard because of that, as well.