Daniel Kokotajlo comments on Thane Ruthenis’s Shortform

Daniel Kokotajlo 3 Feb 2025 6:06 UTC
LW: 4 AF: 3
0
AF
I don’t think I understand. Let’s work through an example.

The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it just were to honestly try to write good code, it would take (let’s say) 100 serial steps of reasoning to succeed. To write subtly-vulnerable code, let’s say, requires 300 serial steps of reasoning. The model’s strategy is to write subtly-vulnerable code but in a way that looks like it’s trying to write good code and just getting a bit unlucky and taking longer than usual (let’s suppose the humans have a poor ability to judge how long it’s supposed to take, but still some ability, such that it can’t take e.g. 10x longer without arousing suspicion).

If there’s no paraphraser, this is ‘easy.’ I mean it’s still difficult but it’s ‘easy’ in the sense that the AI can basically just use a code—each nugget of reasoning might have 10 different semantically equivalent expressions, or maybe even 100, so it can ‘add extra layers of meaning’ to the nuggets—in addition to their meaning in the surface level i’m-tring-to-write-good-code, they can have a hidden meaning having to do with the security vulnerability-related properties of the code. And it can make a chain of reasoning that is basically about the security stuff while looking like a reasonable but unlucky attempt to write good code.

But a paraphraser wrecks that strategy. The model can still add extra layers of meaning to the nuggets, but now it has loads of trouble making the chain of thought look like an honest attempt to write good code. To use your example, if ‘I should generate unit tests now’ and all semantically equivalent variants of that idea are assigned the hidden meaning ‘I should check to see if the vulnerability is too easy to spot’ for example, then… what is it going to write next? It’s too constrained; in order to look non-suspicious there is a limited set of options it can follow up with… ok, hmm, I’m at the limits of my understanding here. But maybe you get the idea?

…here’s an intuition I think I’m drawing from. Suppose you’re the model and you are trying to decide what hidden meanings to associate with what concepts/phrases/steps. If you already had your completed illicit reasoning chain and your completed good-boy reasoning chain, it would be easy—just line them up and make a 1-1 mapping, and then bam no paraphraser can stop you! But you don’t have either chain! That’s the whole point—this is a task that you haven’t done yet, you need to explore some maze in logical space (or tree? Or graph?) until you find a chain that gets to the conclusion/answer.
- Thane Ruthenis 3 Feb 2025 7:02 UTC
  LW: 2 AF: 1
  0
  AF Parent
  Ah, I think I see my error.
  Consider a CoT of the form “[problem statement]: …........”, which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities.
  And in the framework I’d outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such “filler” tokens.
  So yep, you’re right, this does force all scheming to be taking place within one forward pass.