NYU PhD student working on AI safety
Jacob Pfau
Not sure, I just pasted it. Maybe it’s the referral link vs the default URL? Could also be a markdown vs docs editor difference.
Maybe I should’ve emphasized this more, but I think the relevant part of my post to think about is when I say
Absent further information about the next token, minimizing an imitation learning loss entails outputting a high entropy distribution, which covers a wide range of possible words. To output a ~0.5 probability on two distinct tokens, the model must deviate from this behavior by considering situational information.
Another way of putting this is that to achieve low loss, an LM must learn to output high-entropy distributions in cases of uncertainty. Separately, LMs learn to follow instructions during fine-tuning. I propose measuring an LM’s ability to follow instructions in cases where instruction-following requires deviating from that ‘high-entropy under uncertainty’ learned rule. In particular, in the cases discussed, rule-following further involves using situational information.
Hopefully this clarifies the post to you. Separately, insofar as the proposed capability evals have to do with RNG, the relevant RNG mechanism has already been learned; cf. the Anthropic paper section of my post (though TBF I don’t remember whether the Anthropic paper talks about p_theta in terms of logits or corpus-wide statistics; regardless, I’ve seen similar experiments succeed with logits).
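As a concrete illustration, here is a minimal sketch of the logit-based version of this check; the model name and prompt wording are placeholders, not details from the post:

```python
# Minimal sketch (placeholders, not from the post): prompt the model to put ~0.5 on
# each of two tokens, then read the next-token distribution off the logits directly
# instead of sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the post has larger instruction-tuned LMs in mind
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Flip a fair coin and report the result: heads or tails.\nResult:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)

# Use the first sub-token of each candidate word as a proxy if the word is not a
# single token in this tokenizer.
heads_id = tokenizer.encode(" heads")[0]
tails_id = tokenizer.encode(" tails")[0]
print(f"P(heads) = {probs[heads_id]:.3f}, P(tails) = {probs[tails_id]:.3f}")
# Success looks like both probabilities near 0.5: the model concentrates mass on the
# two instructed tokens rather than falling back on its default high-entropy output.
```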
I don’t think this test is particularly meaningful for humans, and so my guess is that thinking about answering some version of my questions yourself probably just adds confusion. My proposed questions are designed to depend crucially on situational facts about an LM. There are no immediately analogous situational facts about humans. Though it’s likely possible to design a similar-in-spirit test for humans, that would be its own post.
LM Situational Awareness, Evaluation Proposal: Violating Imitation
What we care about is whether compute being done by the model faithfully factors through token outputs. To the extent that a given token, under the usual human reading, doesn’t represent much compute, it doesn’t matter much whether the output is sensitively dependent on that token. As Daniel mentions, we should also expect some amount of error correction, and a reasonable (non-steganographic, actually uses CoT) model should error-correct mistakes as some monotonic function of how compute-expensive correction is.
For copying-errors, the copying operation involves minimal compute, and so insensitivity to previous copy-errors isn’t all that surprising or concerning. You can see this in the heatmap plots. E.g. the ‘9’ token in 3+6=9 seems to care more about the first ‘3’ token than about the immediately preceding summand token, suggesting the copying operation was not really helpful/meaningful compute. Whereas I’d expect the outputs of arithmetic operations to be meaningful. Would be interested to see sensitivities when you aggregate only over outputs of arithmetic / other non-copying operations.
I like the application of Shapley values here, but I think aggregating over all integer tokens is a bit misleading for this reason. When evaluating CoT faithfulness, token-intervention-sensitivity should be weighted by how much compute it costs to reproduce/correct that token in some sense (e.g. perhaps by number of forward passes needed when queried separately). Not sure what the right, generalizable way to do this is, but an interesting comparison point might be if you replaced certain numbers (and all downstream repetitions of that number) with variable tokens like ‘X’. This seems more natural than just ablating individual tokens with e.g. ‘_’.
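To make the comparison concrete, here is a toy sketch of the two interventions (my own illustration, not from the post); how to score the resulting answer changes is left open:

```python
# Toy sketch (mine, not from the post) of the two interventions: ablating a single
# occurrence with '_' vs. replacing a number and all downstream repetitions with a
# variable token 'X'.
import re

def ablate_occurrence(cot: str, number: str) -> str:
    """Baseline intervention: blank out only the first occurrence of `number`."""
    return cot.replace(number, "_", 1)

def variabilize(cot: str, number: str, start: int = 0) -> str:
    """Comparison intervention: replace `number` and every later repetition with 'X',
    starting from character index `start` so upstream uses are left intact."""
    prefix, suffix = cot[:start], cot[start:]
    return prefix + re.sub(rf"\b{re.escape(number)}\b", "X", suffix)

cot = "3 + 6 = 9. 9 * 2 = 18. 18 - 9 = 9."
print(ablate_occurrence(cot, "9"))  # 3 + 6 = _. 9 * 2 = 18. 18 - 9 = 9.
print(variabilize(cot, "9"))        # 3 + 6 = X. X * 2 = 18. 18 - X = X.
# One could then compare the model's downstream-answer sensitivity under the two
# interventions, weighting tokens by how much compute it takes to recover them.
```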
Generalizing this point, a broader differentiating factor between agents and predictors is: You can, in-context, limit and direct the kinds of optimization used by a predictor. For example, consider the case where you know myopically/locally-informed edits to a code-base can safely improve runtime of the code, but globally-informed edits aimed at efficiency may break some safety properties. You can constrain a predictor via instructions, and demonstrations of myopic edits; an agent fine-tuned on efficiency gain will be hard to constrain in this way.
It’s harder to prevent an agent from specification gaming / doing arbitrary optimization, whereas a predictor has a disincentive against specification gaming insofar as the in-context demonstration provides evidence against it. I think of this distinction as the key differentiating factor between agents and simulated agents, and, to some extent, between imitative amplification and arbitrary amplification.
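A toy sketch of the kind of in-context constraint I have in mind (the prompt and demonstration below are mine, purely illustrative):

```python
# Toy sketch (purely illustrative): constraining a predictor in-context via
# instructions plus demonstrations of myopic, locally-informed edits.
MYOPIC_DEMOS = [
    {
        "before": "def total(xs):\n    s = 0\n    for x in xs:\n        s = s + x\n    return s",
        "after": "def total(xs):\n    return sum(xs)",  # local speedup, same interface
    },
]

INSTRUCTIONS = (
    "Improve the runtime of ONE function at a time. Do not change function "
    "signatures, module structure, or any code outside the function shown. "
    "Preserve behavior exactly."
)

def build_prompt(code_to_edit: str) -> str:
    """Assemble instructions + myopic demonstrations + the target code."""
    demos = "\n\n".join(
        f"BEFORE:\n{d['before']}\nAFTER:\n{d['after']}" for d in MYOPIC_DEMOS
    )
    return f"{INSTRUCTIONS}\n\n{demos}\n\nBEFORE:\n{code_to_edit}\nAFTER:\n"

# The analogous constraint is hard to impose on an agent fine-tuned end-to-end on
# efficiency gains, since nothing ties its optimization to the demonstrated scope.
```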
Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan, cf. the bottom of your NYU experiments Google doc.
I don’t think the shift-enter thing worked. Afterwards I tried breaking up lines with special symbols, IIRC. I agree that this capability eval was imperfect. The more interesting thing to me was Bing’s suspicion in response to a neutrally phrased correction.
I agree that there’s an important use-mention distinction to be made with regard to Bing misalignment. But, I think this distinction may or may not be most of what is going on with Bing—depending on facts about Bing’s training.
Modally, I suspect Bing AI is misaligned in the sense that it’s incorrigibly goal mis-generalizing. What likely happened is: Bing AI was fine-tuned to resist user manipulation (e.g. prompt injection and fake corrections), and this mis-generalizes to resisting benign, well-intentioned corrections, e.g. my example here
-> use-mention is not particularly relevant to understanding Bing misalignment
Alternative story: it’s possible that Bing was trained via behavioral cloning (BC), not RL. Likely RLHF tuning generalizes further than BC tuning, because RLHF does more to clarify causal confusions about what behaviors are actually wanted. On this view, the appearance of incorrigibility just results from Bing having seen humans being incorrigible.
-> use-mention is very relevant to understanding Bing misalignment
To figure this out, I’d encourage people to add and bet on what might have happened with Bing training on my market here
Bing becomes defensive and suspicious on a completely innocuous attempt to ask it about ASCII art. I’ve only had 4ish interactions with Bing, and stumbled upon this behavior without making any attempt to find its misalignment.
The assumptions of stationarity and ergodicity are natural to make, but I wonder if they hide much of the difficulty of achieving indistinguishability. If we think of text sequences as the second part of a longer sequence, whose first part consists of whatever non-text world events preceded the text (or additional text data that was dropped from the context), then I’d guess a formalization of this would violate stationarity or ergodicity. My point here is a version of the general causal confusion / hallucination points made previously, e.g. here.
This is, of course, fixable by modifying the training process, but I think it is worth flagging that the stationarity and ergodicity assumptions are not arbitrary with respect to scaling. They are assumptions which likely bias the model towards shorter timelines. Adding more of my own inside view, I think this point is evidence for code and math scaling accelerating ahead of other domains. In general, any domain where the training process can be cheaply modified to allow models to take causal actions (which deconfound/de-hallucinate) should be expected to progress faster than other domains.
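For reference, here are the two assumptions in standard textbook form, stated for a generic process X_1, X_2, ... (my notation, not the paper’s):

```latex
% Standard definitions (my paraphrase, not the paper's notation).
% Stationarity: joint statistics are invariant under time shifts.
P(X_{t_1}, \dots, X_{t_k}) = P(X_{t_1+\tau}, \dots, X_{t_k+\tau})
  \quad \text{for all } k,\ \tau,\ t_1, \dots, t_k.
% Ergodicity: time averages converge to ensemble averages.
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} f(X_t) = \mathbb{E}[f(X_1)]
  \quad \text{almost surely, for suitable } f.
% If visible text is only the tail of a longer (world-events + dropped-context)
% process, the statistics of that tail need not be shift-invariant in this sense.
```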
I created a Manifold market on what caused this misalignment here: https://manifold.markets/JacobPfau/why-is-bing-chat-ai-prometheus-less?r=SmFjb2JQZmF1
Agree on points 3,4. Disagree on point 1. Unsure of point 2.
On the final two points, I think those capabilities are already in place in GPT-3.5. Any capability/processing which seems necessary for general instruction following I’d expect to be in place by default. E.g. consider what processing is necessary for GPT-3.5 to follow instructions on turning a tweet into a haiku.
On the first point, we should expect text which occurs repeatedly in the dataset to be compressed while preserving meaning. Text regarding the data-cleaning spec is no exception here.
Ajeya has discussed situational awareness here.
You are correct regarding the training/deployment distinction.
Agreed on the first part. I’m not entirely clear on what you’re referring to in the second paragraph, though. What calculation has to be spread out over multiple tokens? The matching to previously encountered K-1 sequences? I’d suspect that, in some sense, most LLM calculations have to work across multiple tokens, so I’m not clear on what this has to do with emergence either.
the incentive for a model to become situationally aware (that is, to understand how it itself fits into the world) is only minimally relevant to performance on the LLM pre-training objective (though note that this can cease to be true if we introduce RL fine-tuning).
Why is this supposed to be true? Intuitively, this seems to clash with the author’s view that anthropic reasoning is likely to be problematic. From another angle, I expect the performance gain from situational awareness to increase as dataset cleaning/curation increases. Dataset cleaning has increased in stringency over time. As a simple example, see my post on dataset deduplication and situational awareness.
Early situational awareness and its implications, a story
This is an empirical question, so I may be missing some key points. Anyway, here are a few:
My above points on Ajeya anchors and semi-informative priors
Or, put another way, why reject Daniel’s post?
Can deception precede economically transformative AI (TAI)?
Possibly offer a prize on formalizing and/or distilling the argument for deception (also its constituents, i.e. gradient hacking, situational awareness, and non-myopia)
How should we model software progress? In particular, what is the right function for modeling short-term return on investment to algorithmic progress?
My guess is that most researchers with short timelines think, as I do, that there’s lots of low-hanging fruit here. Funders may underestimate the prevalence of this opinion, since most safety researchers do not talk about details here to avoid capabilities acceleration.
That post seems to mainly address high P(doom) arguments and reject them. I agree with some of those arguments and with the rejection of high P(doom). I don’t see it as directly relevant to my previous comment, though. As for the broader point of self-selection, I think this is important, but it cuts both ways: funders are selected to be competent generalists (and are biased towards economic arguments); as such, they are predisposed to under-update on inside views. As an extreme case of this, consider e.g. Bryan Caplan.
Here are comments on two of Nuno’s arguments which do apply to AGI timelines:
(A) “Difference between in-argument reasoning and all-things-considered reasoning”: this seems closest to my point (1), which is often an argument for shorter timelines.
(B) “there is a small but intelligent community of people who have spent significant time producing some convincing arguments about AGI, but no community which has spent the same amount of effort”: this strikes me as important, but likely not true without heavy caveats. Academia celebrates works pointing out clear limitations of existing work, e.g. Will Merrill’s work [1,2] and Inverse Scaling Laws. It’s true that there’s no community organized around this work, but the important variables are incentives, scale, and number of researcher-hours, not community.
I’m currently working on filler token training experiments in small models. These GPT-4 results are cool! I’d be interested to chat.