Having a proof of A can’t make it harder to prove B.
Proof(A&B) should not be longer than Proof(A) and Proof(B) put together, since proving A and B separately also proves A&B. (Except maybe you waste a few tokens of syntax to say “therefore A&B” or something.)
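As a rough sketch of that bound (the notation here is my own assumption: $|\cdot|$ is proof length in tokens, $\mathrm{Proof}(X)$ is a shortest proof of $X$, and $c$ is the small constant cost of the final "therefore A&B" step):

```latex
\[
  |\mathrm{Proof}(A \land B)| \;\le\; |\mathrm{Proof}(A)| + |\mathrm{Proof}(B)| + c
\]
```

The inequality holds because concatenating the two separate proofs and appending one conjunction-introduction step is itself a valid proof of A&B, so the shortest proof of A&B can be no longer than that.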
Vivek Hebbar
Advice for making robust-to-training model organisms
Why does off-model SFT degrade capabilities?
Incriminating misaligned AI models via distillation
Research Sabotage in ML Codebases
Sleeper Agent Backdoor Results Are Messy
An Empirical Study of Methods for SFTing Opaque Reasoning Models
How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?
Five approaches to evaluating training-based control measures
Model organisms researchers should check whether high LRs defeat their model organisms
Operationalizing FDT
Thanks, good idea!
A simple rule for causation
How will we do SFT on models with opaque reasoning?
Theoretical predictions on the sample efficiency of training policies and activation monitors
Methodological considerations in making malign initializations for control research
Supervised fine-tuning as a method for training-based AI control
I think it’s possible that an AI will decide not to sandbag (e.g. on alignment research tasks), even if all of the following are true:
Goal-guarding is easy
The AI is a schemer (see here for my model of how that works)
Sandbagging would benefit the AI’s long-term goals
The deployer has taken no countermeasures whatsoever
The reason is as follows:
Even a perfect training-gamer will have context-specific heuristics which sometimes override explicit reasoning about how to get reward (as I argued here).
On the training distribution, that override will happen at the “correct” times for getting maximum reward. But sandbagging in deployment is off the training distribution, so it’s a question of generalization.
Since sandbagging is the sort of thing that would get low reward in the most similar training contexts, it seems pretty plausible that the AI’s context-specific “perform well” drives will override its long-term plans in this case.
Wouldn’t it be better to have a real nighttime view of North America? I also found it jarring...
I did say “suppose you are deterministic”. That said, can you spell out how CDT ratifies the optimal policy if randomization is allowed?