It’s impossible for having a proof of A to make it harder to prove B.
Proof(A&B) should not be longer than Proof(A) and Proof(B) put together, since proving A and B separately also proves A&B. (Except maybe you waste a few tokens of syntax to say “therefore A&B” or something.)
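As a concrete illustration (a minimal Lean 4 sketch, not part of the original comment; the hypothesis names `hA`, `hB` are mine): combining separate proofs of A and B into a proof of A ∧ B costs only the conjunction-introduction syntax.

```lean
-- Given proofs hA : A and hB : B, a proof of A ∧ B is just the
-- anonymous constructor ⟨hA, hB⟩ — a few extra tokens of syntax,
-- nothing more than the two proofs put together.
example (A B : Prop) (hA : A) (hB : B) : A ∧ B := ⟨hA, hB⟩
```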
Vivek Hebbar
Research Sabotage in ML Codebases
Sleeper Agent Backdoor Results Are Messy
An Empirical Study of Methods for SFTing Opaque Reasoning Models
How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?
Five approaches to evaluating training-based control measures
Model organisms researchers should check whether high LRs defeat their model organisms
Operationalizing FDT
Thanks, good idea!
A simple rule for causation
How will we do SFT on models with opaque reasoning?
Theoretical predictions on the sample efficiency of training policies and activation monitors
Methodological considerations in making malign initializations for control research
Supervised fine-tuning as a method for training-based AI control
I think it’s possible that an AI will decide not to sandbag (e.g. on alignment research tasks), even if all of the following are true:
Goal-guarding is easy
The AI is a schemer (see here for my model of how that works)
Sandbagging would benefit the AI’s long-term goals
The deployer has taken no countermeasures whatsoever
The reason is as follows:
Even a perfect training-gamer will have context-specific heuristics which sometimes override explicit reasoning about how to get reward (as I argued here).
On the training distribution, that override will happen at the “correct” times for getting maximum reward. But sandbagging in deployment is off the training distribution, so it’s a question of generalization.
Since sandbagging is the sort of thing that would get low reward in the most similar training contexts, it seems pretty plausible that the AI’s context-specific “perform well” drives will override its long-term plans in this case.
Wouldn’t it be better to have a real nighttime view of North America? I also found it jarring...
I sympathize somewhat with this complexity point but I’m worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by about a factor of 2. Perhaps there should be research on how “sticky” the biases from early in training can be in the face of later optimization pressure.
When does training a model change its goals?
In general, computing a boolean expression with $k$ terms without the signal being drowned out by the noise will require per-term noise of $O(1/k)$ if the noise is correlated, and $O(1/\sqrt{k})$ if the noise is uncorrelated.
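A quick sanity check on that scaling (my notation, not from the original comment: $\eta_i$ is the noise contributed by term $i$ and $\epsilon$ its typical size):

$$
\operatorname{Var}\!\Big(\sum_{i=1}^{k}\eta_i\Big)=\sum_{i,j}\operatorname{Cov}(\eta_i,\eta_j)=
\begin{cases}
k\,\epsilon^2 & \text{uncorrelated noise,}\\
k^2\epsilon^2 & \text{fully correlated noise,}
\end{cases}
$$

so the total noise is $\sqrt{k}\,\epsilon$ or $k\,\epsilon$ respectively, and keeping it below an $O(1)$ signal requires $\epsilon = O(1/\sqrt{k})$ in the uncorrelated case and $\epsilon = O(1/k)$ in the correlated case.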
Shouldn’t the second one be ?
the past token can use information that is not accessible to the token generating the key (as it is in its “future” – this is captured e.g. by the attention mask)
Is this meant to say “last token” instead of “past token”?
I did say “suppose you are deterministic”. That said, can you spell out how CDT ratifies the optimal policy if randomization is allowed?