Vlad Mikulik 8 Nov 2024 19:50 UTC
LW: 4 AF: 4
0
AF
in reply to: Buck’s comment on: Anthropic: Three Sketches of ASL-4 Safety Case Components
FWIW I agree with you and wouldn’t put it the way it is in Roger’s post. Not sure what Roger would say in response.

Vlad Mikulik 8 Nov 2024 15:25 UTC
LW: 4 AF: 4
−6
AF
in reply to: Buck’s comment on: Anthropic: Three Sketches of ASL-4 Safety Case Components
The reasoning for the indistinguishability criterion is as a precaution against a model sandbagging when used as the red team attack model. If the model can tell it’s in a control eval it could pretend to be worse at attacking, making us think the control measures are better than they actually are. If the red team elicitation is robust to this, I agree this isn’t necessary. (See the note below [P2.4]).

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah

18 Dec 2023 11:58 UTC

149 points

21 comments, 10 min read, LW link

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah and Vlad Mikulik

20 Jul 2023 10:50 UTC

44 points

3 comments, 2 min read, LW link

(arxiv.org)

Vlad Mikulik 16 Jan 2023 18:53 UTC
LW: 1 AF: 1
0
AF
in reply to: Gurkenglas’s comment on: Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind
Ah, yep, I think you’re right: it should be pretty easy to add support for ‘and’ in selectors then.

Vlad Mikulik 16 Jan 2023 15:58 UTC
LW: 1 AF: 1
0
AF
in reply to: Gurkenglas’s comment on: Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind
You’re right that accounting for the softmax could be used to get around the argument in the appendix. We’ll mention this when we update the paper.
The scheme as you described it relies on an always-on dummy token, which would conflict with our implementation of default values in aggregates: we fix the BOS token attention to 0.5, so at low softmax temp it’s attended to iff nothing else is; however with this scheme we’d want 0.5 to always round off to zero after softmax. This is plausibly surmountable but we put it out of scope for the pilot release since we didn’t seem to need this feature, whereas default values come up pretty often.
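To make the BOS default-value behaviour concrete, here is a minimal numerical sketch. It is not Tracr code: the function names, the 0/1 selector scores and the temperature value are assumptions for illustration. Only the fixed BOS score of 0.5 and the low softmax temperature come from the comment above.

```python
import math

def softmax(scores, temperature):
    """Numerically stable softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_bos_default(selector_row, temperature=0.01, bos_score=0.5):
    """Attention weights over [BOS] + tokens for a single query position.

    Assumption: selector_row holds 1 where the selector matches a token
    and 0 elsewhere. The BOS position always gets the fixed score 0.5.
    """
    scores = [bos_score] + [float(x) for x in selector_row]
    return softmax(scores, temperature)

# One token matches: its score (1) beats BOS (0.5), so it takes the attention.
match = attention_with_bos_default([0, 1, 0])
# Nothing matches: BOS (0.5) beats every token (0), so BOS takes the attention.
no_match = attention_with_bos_default([0, 0, 0])

print([round(w, 3) for w in match])
print([round(w, 3) for w in no_match])
```

In this toy setup the BOS weight is close to 1 only when no token matches, which is the "attended to iff nothing else is" property. An always-on dummy token would instead need the 0.5 score to vanish after the softmax, which is the conflict described above.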
Also, while this construction will work for just ‘and’ (with arbitrarily many conjuncts), I don’t think it works for arbitrary compositions using ‘and’ and ‘or’. (Of course it would be better than no composition at all.)
In general, I expect there to be a number of potentially low-hanging improvements to be made to the compiler. Many of them we deliberately omitted and mention in the limitations section, and many we haven’t yet come up with. There are tons of features one could add, each of which takes time to think about and increases overall complexity, so we had to be judicious about which lines to pursue, and even then we barely had time to get into superposition experiments. We currently aren’t prioritising further Tracr development until we see it used for research in practice, but I’d be excited to help anyone who’s interested in working with it or contributing.

Vlad Mikulik 9 Jun 2022 16:54 UTC
LW: 5 AF: 4
0
AF
on: Who models the models that model models? An exploration of GPT-3′s in-context model fitting ability
Nice work!
This section in Anthropic’s work on Induction heads seems highly relevant—I would be interested in seeing an extension of your analysis that looks at what induction heads do in these tasks.

If we believe the claims in that paper, then in-context learning of any kind seems to be driven by a fairly simple mechanism not unlike kNN: induction attention heads. Since it’s pretty tractable to locate induction heads in an automated way, we could potentially take a look at the actual mechanism being used to implement these predictions and verify/falsify the hypotheses you make about how GPT makes these predictions. (Although you’d probably have to switch to an open-source model.)
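As a rough illustration of what locating induction heads automatically could look like, here is a sketch of one common heuristic: run the model on a random token block repeated twice, then score each head by how much attention a position in the second repeat puts on the token right after the earlier occurrence of the current token. The function names and the toy attention patterns are hypothetical; in practice `attn` would be read out of a real model, and the paper's own procedure may differ.

```python
def induction_score(attn, period):
    """Mean attention from each second-repeat position i to position i - period + 1.

    attn[i][j] is the attention weight from query position i to key position j,
    for an input made of a random token block of length `period` repeated twice.
    """
    positions = range(period, 2 * period)
    return sum(attn[i][i - period + 1] for i in positions) / period

def toy_pattern(period, induction):
    """Hypothetical attention pattern for one head (stands in for model output).

    induction=True: an idealised induction head.
    induction=False: a head that always attends to position 0.
    """
    n = 2 * period
    attn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if induction and i >= period:
            attn[i][i - period + 1] = 1.0
        else:
            attn[i][0] = 1.0
    return attn

print(induction_score(toy_pattern(4, induction=True), 4))
print(induction_score(toy_pattern(4, induction=False), 4))
```

Heads scoring near 1 on such inputs would be the candidates to inspect on the in-context model fitting tasks.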

Vlad Mikulik 20 May 2021 13:57 UTC
LW: 13 AF: 7
0
AF
on: Knowledge Neurons in Pretrained Transformers
Thanks for the link. This has been on my reading list for a little bit and your recco tipped me over.
Mostly I agree with Paul’s concerns about this paper.
However, I did find the “Transformer Feed-Forward Layers Are Key-Value Memories” paper they reference more interesting—it’s more mechanistic, and their results are pretty encouraging. I would personally highlight that one more, as it’s IMO stronger evidence for the hypothesis, although not conclusive by any means.

Some experiments they show:
- Top-k activations of individual ‘keys’ do seem like coherent patterns in prefixes, and as we move up the layers, these patterns become less shallow and more semantics-driven. (Granted, it’s not clear how good the methodology there is, as to qualify as a pattern, it needs to occur in 3 out of top-25 prefixes. There are 3.6 patterns on average in each key. But this is curious enough to keep looking into.)
- The ‘value’ distributions corresponding to the keys are in fact somewhat predictive of the actual next word for those top-k prefixes, and exhibit a kind of ‘calibration’: while the distributions themselves aren’t actually calibrated, they are more correct when they assign a higher probability.
I also find it very intriguing that you can just decode the value distributions using the embedding matrix a la Logit Lens.
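A minimal sketch of that decoding step, assuming a value vector is turned into a next-token distribution by taking its dot product with each row of the embedding matrix and applying a softmax. The vocabulary, matrix and vector below are toy values made up for illustration, and the paper's exact procedure may differ in details.

```python
import math

def decode_value(value_vec, embedding, vocab):
    """Project a value vector onto the vocabulary and return token probabilities.

    embedding has one row per vocabulary token (toy values here).
    """
    logits = [sum(e * v for e, v in zip(row, value_vec)) for row in embedding]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(vocab, exps)}

# Toy setup: 3 tokens, 2-dimensional embeddings (assumed for illustration).
vocab = ["cat", "dog", "tree"]
embedding = [[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0]]
value_vec = [2.0, 0.0]

probs = decode_value(value_vec, embedding, vocab)
top = max(probs, key=probs.get)
print(top, {tok: round(p, 3) for tok, p in probs.items()})
```

The ‘calibration’ observation above would then correspond to checking whether the top token is the actual next word more often when its decoded probability is higher.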

Vlad Mikulik 3 Apr 2021 15:57 UTC
LW: 6 AF: 3
0
AF
on: My take on Michael Littman on “The HCI of HAI”
Thanks for a great post.
---
One nice point that this post makes (which I suppose was also prominent in the talk, but I can only guess, not being there myself) is that there’s a kind of progression we can draw (simplifying a little):
- Human specifies what to do (Classical software)
- Human specifies what to achieve (RL)
- Machine infers a specification of what to achieve (IRL)
- Machine collaborates with human to infer and achieve what the human wants (Assistance games)
Towards the end, this post describes an extrapolation of this trend,
- Machine and human collaboratively figure out what the human even wants to do in the first place.
‘Helping humans figure out what they want’ is a deep, complex and interesting problem, and I’d love it if more folks were thinking through what solutions to it ought to look like. This seems particularly urgent because human motivations can be affected even by algorithms that were not designed to solve this problem—for example, think of recommender systems shaping their users’ habits—and which therefore aren’t doing what we’d want them to do.
---
Another nice point is the connection between ML algorithm design and HCI. I’ve been meaning to write something looking at RL as ‘technique for communicating and achieving human intent’ (and, as a corollary, at AI safety as a kind of human-centred algorithm design), but it seems that I’ve been scooped by Michael :)
I note that not everyone sees RL from this frame. Some RL researchers view it as a way of understanding intelligence in the abstract, without connecting reward to human values.
---
One thing I’m a little less sure of is the conclusion you draw from your examples of changing intentions. While the examples convince me that the AI ought to have some sophistication about the human’s intentions—for example, being aware that human intentions can change—it’s not obvious that the right move is to ‘pop out’ further and assume there is something ‘bigger’ that the human’s intentions should be aligned with. Could you elaborate on your vision of what you have in mind there?

Vlad Mikulik 21 Feb 2021 19:01 UTC
LW: 7 AF: 4
0
AF
on: Formal Solution to the Inner Alignment Problem
Thanks for the post and writeup, and good work! I especially appreciate the short, informal explanation of what makes this work.
Given my current understanding of the proposal, I have one worry which makes me reluctant to share your optimism about this being a solution to inner alignment:
The scheme doesn’t protect us if somehow all top-n demonstrator models have correlated errors. This could happen if they are coordinating, or more prosaically if our way to approximate the posterior leads to such correlations. The picture I have in my head for the latter is that we train a big ensemble of neural nets and treat a random sample from that ensemble as a random sample from the posterior, although I don’t know if that’s how it’s actually done.
A lot of the work is done by the assumption that the true demonstrator is in the posterior, which means that at least one of the top-performing models will not have the same correlated errors. But I’m not sure how true this assumption will be in the neural-net approximation I describe above. I worry about inner alignment failures because I don’t really trust the neural net prior, and I can imagine training a bunch of neural nets to have correlated weirdnesses about them (in part because of the neural net prior they share, and in part because of things like Adversarial Examples Are Not Bugs, They Are Features). As such it wouldn’t be that surprising to me if it turned out that ensembles have certain correlated errors, and in particular don’t really represent anything like the demonstrator.
I do feel safer using this method than I would deferring to a single model, so this is still a good idea on balance. I just am not convinced that it solves the inner alignment problem. Instead, I’d say it ameliorates its severity, which may or may not be sufficient.

Vlad Mikulik 12 Jan 2021 22:19 UTC
3 points
0
in reply to: Past Account’s comment on: Against strong bayesianism
You need much more than limiting behavior to say anything about whether or not the processes are ‘similar’ in a useful way before that.
Perhaps the synthesis here is that while looking at asymptotic behaviour of a simpler system can be supremely useful, we should be surprised that it works so well. To rely on this technique in a new domain we should, every time, demonstrate that it actually works in practice.
Also, it’s interesting that many of these examples do have ‘pathological cases’ where the limit doesn’t match practice. And this isn’t necessarily restricted to toy domains or weird setups: for example, the most asymptotically efficient matrix multiplication algorithms are impractical (although in fairness that’s the most compelling example on that page).

Vlad Mikulik 11 Jan 2021 16:39 UTC
LW: 1 AF: 1
0
AF
on: Utility ≠ Reward
More than a year since writing this post, I would still say it represents the key ideas in the sequence on mesa-optimisation which remain central in today’s conversations on mesa-optimisation. I still largely stand by what I wrote, and recommend this post as a complement to that sequence for two reasons:
First, skipping some detail allows it to focus on the important points, making it better-suited than the full sequence for obtaining an overview of the area.
Second, unlike the sequence, it deemphasises the mechanism of optimisation, and explicitly casts it as a way of talking about goal-directedness. As time passes, I become more and more convinced that it was a mistake to call the primary new term in our work ‘mesa-optimisation’. Were I to be choosing the terms again, I would probably go with something like ‘learned goal-directedness’, though it is quite a mouthful.

Vlad Mikulik 2 Jan 2021 20:13 UTC
LW: 3 AF: 3
0
AF
in reply to: John Schulman’s comment on: Debate Minus Factored Cognition
Not Abram, and I have only skimmed the post so far, and maybe you’re pointing to something more subtle, but my understanding is this:
In Stuart’s original use, ‘No Indescribable Hellworlds’ is the hypothesis that in any possible world in which a human’s values are violated, the violation is describable: one can point out to the human how her values are violated by the state of affairs.
Analogously, debate as an approach to alignment could be seen as predicated on a similar hypothesis: that in any possible flawed argument, the flaw is describable: one can point out to a human how the argument is flawed.
Edited to add: The additional claim in the Hellworlds section is that acting according to the recommendations of debate won’t lead to very bad outcomes, or at least not to ones which could be pointed out. For example, we can imagine a debate around the question “Should we enact policy X?”. A very strong argument, if it can be credibly argued, is “Enacting policy X leads to an unacceptable violation Y of your values down the line”. So, debate will only recommend policy X if no such arguments are available.
I’m not sure to what extent I buy this additional claim. For example, suppose that a system trained via debate, once actually deployed, doesn’t get asked questions like ‘Should we enact policy X?’ but instead more specific things like ‘How much does policy X improve Y metric?’. Then, unless debaters are incentivised to challenge the question’s premises (“The Y metric would improve, but you should consider also the unacceptable effect on Z”), we could use debate and still get hellworlds.

Vlad Mikulik 16 Nov 2020 17:12 UTC
LW: 4 AF: 3
0
AF
on: Clarifying inner alignment terminology
Thanks for writing this.
I wish you included an entry for your definition of ‘mesa-optimizer’. When you use the term, do you mean the definition from the paper* (an algorithm that’s literally doing search using the mesa objective as the criterion), or do you speak more loosely (e.g., a mesa-optimizer is an optimizer in the same sense as a human is an optimizer)?
A related question is: how would you describe a policy that’s a bag of heuristics which, when executed, systematically leads to interesting (low-entropy) low-base-objective states?
*incidentally, looking back on the paper, it doesn’t look like we explicitly defined things this way, but it’s strongly implied that that’s the definition, and appears to be how the term is used on AF.

Vlad Mikulik 19 Sep 2020 11:41 UTC
LW: 3 AF: 3
0
AF
in reply to: evhub’s comment on: Mesa-Search vs Mesa-Control
Good point—I think I wasn’t thinking deeply enough about language modelling. I certainly agree that the model has to learn in the colloquial sense, especially if it’s doing something really impressive that isn’t well-explained by interpolating on dataset examples—I’m imagining giving GPT-X some new mathematical definitions and asking it to make novel proofs.
I think my confusion was rooted in the fact that you were replying to a section that dealt specifically with learning an inner RL algorithm, and the above sense of ‘learning’ is a bit different from that one. ‘Learning’ in your sense can be required for a task without requiring an inner RL algorithm; or at least, whether it does isn’t clear to me a priori.

Vlad Mikulik

Towards training-time mitigations for alignment faking in RL

Training fails to elicit subtle reasoning in current language models

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Reasoning models don’t always say what they think

Automated Researchers Can Subtly Sandbag

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla