keith_wynroe

Karma: 294

keith_wynroe 28 Jun 2025 11:57 UTC
2 points
0
on: Vestigial reasoning in RL
Have been doing some experiments on obfuscated reasoning lately and think this is definitely a thing. For example, I ran a set up where the model was filing reimbursement claims and was rewarded for getting more money but penalised for explicitly planning to lie about how much it spent. It ends up consistently claiming more money than it should while having totally innocent CoTs, but just because it learns a really dumb circuit that attends from the <answer> tag back to the numbers in the prompt and inflates them. You can noise or even completely delete the CoT and the behaviour persists
Have been unsuccesful in getting a set-up where the behaviour both needs planning and survives the monitor

Finding an Error-Detection Feature in DeepSeek-R1

keith_wynroe24 Apr 2025 16:03 UTC

15 points

0 comments7 min readLW link

keith_wynroe 10 Nov 2024 15:42 UTC
1 point
0
in reply to: dxu’s comment on: A Simple Toy Coherence Theorem
I feel like this could branch out into a lot of small disagreements here but in the interest of keeping it streamlined:
One of the consequences of this, however, is that this prefix-based encoding method is only optimal for functions whose prefix-free encodings (i.e. encodings that cannot be partitioned into substrings such that one of the substrings encodes another UTM) in UTM1 and UTM2 differ in length by more than len(P). And, since len(P) is a measure of UTM2′s complexity relative to UTM1, it follows directly that a UTM2 whose “coding scheme” is such that a function whose prefix-free encoding in UTM2 differs in length from its prefix-free encoding in UTM1 by some large constant (say, ~2^10^80), P itself must be on the order of 2^10^80—in other words, UTM2 must have an astronomical complexity relative to UTM1.
I agree with all of this, and wasn’t gesturing at anything related to it, so I think we’re talking past eachother. My point was simply that two UTMs even with not very-large prefix encodings can wind up with extremely different priors, but I don’t think that’s too relevant to what your main point is
For any physically realizable universal computational system, that system can be analogized to UTM1 in the above analysis. If you have some behavioral policy that is e.g. deontological in nature, that behavioral policy can in principle be recast as an optimization criterion over universe histories; however, this criterion will in all likelihood have a prefix-free description in UTM1 of length ~2^10^80. And, crucially, there will be no UTM2 in whose encoding scheme the criterion in question has a prefix-free description of much less than ~2^10^80, without that UTM2 itself having a description complexity of ~2^10^80 relative to UTM1—meaning, there is no physically realizable system that can implement UTM2.
I think I disagree with almost all of this. You can fix some gerrymandered extant physical system right now that ends up looking like a garbled world-history optimizer, I doubt that it would take on the order of length ~2^10^80 to specify it. But granting that these systems would in fact have astronomical prefixes, I think this is a ponens/tollens situation: if these systems actually have a huge prefix, that tells me that some the encoding schemes of some physically realisable systems are deeply incompatible with mine, not that those systems which are out there right now aren’t physically realisible.
I imagine an objection is that these physical systems are not actually world-history optimizers and are actually going to be much more compressible than I’m making them out to be, so your argument goes through. In which case I’m fine with this, this just seems like a differing definition of what counts as when two schemes are acting “virtually identically” w.r.t to optimization criteria. If your argument is valid but is bounding this similarity to include e.g random chunks of a rock floating through space, then I’m happy to concede that—seems quite trivial and not at all worrying from the original perspective of bounding the kinds of optimization criteria an AI might have

keith_wynroe 9 Nov 2024 17:22 UTC
1 point
0
in reply to: dxu’s comment on: A Simple Toy Coherence Theorem
The constant bound isn’t not that relevant just because of the in principal unbounded size, it also doesn’t constrain the induced probabilities in the second coding scheme much at all. It’s an upper bound on the maximum length, so you can still have the weightings in codings scheme B differ differ in relative length by a ton, leading to wildly different priors
And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to “natural” versus “unnatural” optimization criteria.
I have no idea how you’re getting to this, not sure if it’s claiming a formal result or just like a hunch. But I disagree both that there is a neat correspondence between a system being physically realizable and its having a concise implementation as a TM. Even granting that point, I don’t think that nearly all or even most of these physically realisable systems will behave identically or even similarly w.r.t. how they assign codes to “natural” optimization criteria

keith_wynroe 24 Sep 2024 13:26 UTC
21 points
18
in reply to: RobertM’s comment on: The Sun is big, but superintelligences will not spare Earth a little sunlight
If you parse this post as “attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors” rather than “trying to argue with the bleeding-edge arguments on x-risk”, this post seems good
This seems reasonable in isolation, but it gets frustrating when the former is all Eliezer seems to do these days, with seemingly no attempt at the latter. When all you do is retread these dunks on “midwits” and show apathy/contempt for engaging with newer arguments, it makes it look like you don’t actually have an interest in being maximally truth-seeking but instead like you want to just dig in and grandstand.
From what little engagement there is with novel criticisms of their arguments (like Nate’s attempt to respond to Quintin/Nora’s work), it seems like there’s a cluster of people here who don’t understand and don’t particularly care about understanding some objections to their ideas and instead want to just focus on relitigating arguments they know they can win.

keith_wynroe 17 Sep 2024 12:45 UTC
10 points
4
on: What’s the Deal with Logical Uncertainty?
I can reason as follows: There is 0.5 chance that it is Heads. Let P represent the actual, unknown, state of the outcome of the toss (Heads or Tails); and let Q represent the other state. If Q, then anything follows. For example, Q implies that I will win $1 billion. Therefore the value of this bet is at least $500,000,000, which is 0.5 * $1,000,000, and I should be willing to pay that much to take the bet.
This doesn’t go through, what you have are two separate propositions “H → (T → [insert absurdity here]” and “T → (H → [insert absurdity here]” ^[1], and actually deriving a contradiction from the consequent requires proving which antecedent obtains, which you can’t do since neither is a theorem.
The distinction then with logical uncertainty is supposedly that you do already have a proof of the analogue of H or T, so you can derive the consequent that either H or T derives a contradiction
1. ^
  You don’t really have these either, unless you can prove NOT(H AND T) i.e. can you definitively rule out a coin landing both heads and tails? But that’s kinda pedantic

keith_wynroe 6 Sep 2024 10:26 UTC
1 point
0
in reply to: Lucius Bushnaq’s comment on: Lucius Bushnaq’s Shortform
What are your thoughts on KL-div after the unembed softmax as a metric?

keith_wynroe 15 Aug 2024 15:46 UTC
15 points
3
on: Rabin’s Paradox
These results hold only if you assume risk aversion is entirely explained by a concave utility function, if you don’t assume that then the surprising constraints on your preferences don’t apply
IIRC that’s the whole point of the paper—not that utility functions are in fact constrained in this way (they’re not), but if you assume risk aversion can only come from diminishing maginal value of money (as many economists do), then you end up in weird places so maybe you should rethink that

keith_wynroe 5 Aug 2024 19:03 UTC
1 point
0
in reply to: johnswentworth’s comment on: A Simple Toy Coherence Theorem
I think what I’m getting at is more general than specifically talking about resources, I’m more getting at the degree of freedom in the problem description that lets you frame anything as technically optimizing something at a distance i.e. in ‘Utility Maximization = Description Length Minimization’ you can take any system, find its long-term and long-distance effects on some other region of space-time, and find a coding-scheme where those particular states have the shortest descriptions. The description length of the universe will by construction get minimized. Obviously this just corresponds to one of those (to us) very unnatural-looking “utility functions” over universe-histories or w/e
If we’re first fixing the coding scheme then this seems to me to be equivalent to constraining the kinds of properties we’re allowing as viable targets of optimization
I guess one way of looking at it is I don’t think it makes sense to talk about a system as being an optimizer/not an optimizer intrinsically. It’s a property of a system relative to a coding scheme/set of interesting properties/resources, everything is an optimizer relative to some encoding scheme. And all of the actual, empirical scariness of AI comes from how close the encoding scheme that by-definition makes it an optimizer is to our native encoding scheme—as you point out they’ll probably have some overlap but I don’t think that itself is scary

keith_wynroe 5 Aug 2024 11:41 UTC
3 points
0
in reply to: johnswentworth’s comment on: A Simple Toy Coherence Theorem
Thanks, I feel like I understand your perspective a bit better now.
Re: your “old” frame: I agree that the fact we’re training an AI to be useful from our perspective will certainly constrain its preferences a lot, such that it’ll look like it has preferences over resources we think in terms of/won’t just be representable as a maximally random utility function. I think there’s a huge step from that though to “it’s a optimizer with respect to those resources” i.e there are a lot of partial orderings you can put over states where it broadly has preference orderings we like w.r.t resources without looking like a maximizer over those resources, and I don’t think that’s necessarily scary. I think some of this disagreement may be downstream of how much you think a superintelligence will “iron out wrinkles” like preference gaps internally though which is another can of worms
Re: your new frame: I think I agree that looking like a long-term/distance planner is much scarier. Obviously implicitly assuming we’re restricting to some interesting set of resources, because otherwise we can reframe any myopic maximizer as long-term and vice-versa. But this is going round in circles a bit, typing this out I think the main crux here for me is what I said in the previous point in that I think there’s too much of a leap from “looks like it has preferences over this resource and long-term plans” vs. “is a hardcore optimizer of said resource”. Maybe this is just a separate issue though, not sure I have any local disagreements here
Re: your last pont, thanks—I don’t think I have a problem with this, I think I was just misunderstanding the intended scope of the post

keith_wynroe 4 Aug 2024 17:13 UTC
5 points
0
in reply to: Steven Byrnes’s comment on: A Simple Toy Coherence Theorem
Thanks, I think that’s a good distinction—I guess I have like 3 issues if we roll with that though
1. I don’t think a system acting according to preferences over future states entails it is EV-maximising w.r.t. some property/resource of those future states. If it’s not doing the latter it seems like it’s not necessarily scary, and if it is then I think we’re back at the issue that we’re making an unjustified leap, this time from “it’s a utility maximizer + it has preferences over future-states” (i.e. having preferences over properties of future states is compatible w/ also having preferences over world-histories/all sorts of weird stuff)
2. It’s not clear to me that specifying “preferences over future states” actually restricts things much—if I have some preferences over the path I take through lotteries, then whether I take path A or path B to reach outcome X will show up as some difference in the final state, so it feels like we can cast a lot (Most? All?) types of preferences as “preferences over future states”. I think the implicit response here is that we’re categorizing future states by a subset of “interesting-to-us” properties, and the differences in future-states yielded by taking Path A or Path B don’t matter to us (in other words, implicitly whenever we talk about these kinds of preferences over states we’re taking some equivalence class over actual micro-states relative to some subset of properties). But then again I think the issue recurs that a system having preferences over future states w.r.t. this subset of properties is a stronger claim
3. I’m more and more convinced that, even if a system does have preferences over future-states in the scariest sense here, there’s not really an overriding normative force for it to update towards being a utility-maximiser. But I think this is maybe a kind of orthogonal issue about the force of exploitability arguments rather than coherence theorems here
I think you’ve said something along the lines of one or two of these points in your links, sorry! Not expecting this to be super novel to you, half just helpful for me to get my own thoughts down explicitly

keith_wynroe 4 Aug 2024 15:27 UTC
8 points
2
on: A Simple Toy Coherence Theorem
The actual result here looks right to me, but kinda surfaces a lot of my confusion about how people in this space use coherence theorems/reinforces my sense they get misused
You say:
This ties to a common criticism: that any system can be well-modeled as a utility maximizer, by simply choosing the utility function which rewards whatever the system in fact does. As far as I can tell, that criticism usually reflects ignorance of what coherence says
My sense of how this conversation goes is as follows:
“Utility maximisers are scary, and here are some theorems that show that anything sufficiently smart/rational (i.e. a superintelligence) will be a utility maximiser. That’s scary”
“Literally anything can be modelled as a utility maximiser. It’s not the case that literally everything is scary, so something’s wrong here”
“Well sure, you can model anything as a utility maximiser technically, but the resource w.r.t which it’s being optimal/the way its preferences are carving up state-space will be incredibly awkward/garbled/unnatural (in the extreme, they could just be utility-maximizing over entire universe-histories). But these are unnatural/trivial. If we add constraints over the kind of resources it’s caring about/kinds of outcomes it can have preferences over, we constrain the set of what can be a utility-maximiser a lot. And if we constrain it to smth like the set of resources that we think in terms of, the resulting set of possible utility-maximisers do look scary”
Does this seem accurate-ish? If so it feels like this last response is true but also kind of vacuously so, and kind of undercuts the scariness of the coherence theorems in the first place. As in, it seems much more plausible that a utility-maximiser drawn from this constrained set will be scary, but then where’s the argument we’re sampling from this subset when we make a superintelligence? It feels like there’s this weird motte-and-bailey going on where people flip-flop between the very unobjectionable “it’s representable as a utility-maximiser” implied by the theorems and “it’ll look like a utility-maximiser “internally”, or relative to some constrained set of possible resources s.t. it seems scary to us” which feels murky and un-argued for.
Also on the actual theorem you outline here—it looks right, but isn’t assuming utilities assigned to outcomes s.t. the agent is trying to maximise over them kind of begging most of the question that coherence theorems are after? i.e. the starting data is usually a set of preferences, with the actual work being proving that this along with some assumptions yields a utility function over outcomes. This also seems why you don’t have to use anything like dutch-book arguments etc as you point out—but only because you’ve kind of skipped over the step where they’re used

keith_wynroe 4 Aug 2024 14:45 UTC
1 point
0
in reply to: LawrenceC’s comment on: An OV-Coherent Toy Model of Attention Head Superposition
Hey, sorry for the (very) belated response—thanks for the comment! Your description of the problem set-up/model look right to me. FWIW this post was ~my first attempt at digging into something superpositon-related, so I think you’re right that it was being pretty sloppy/confused with the concept of “superposition”. I’ve since come around more to your perspective of polysemanticity/distributed representation/interference being insufficient for “true” superposition.
Re: your point about there existing simpler solutions—you’re totally right that for d-head >= 4, there exists a more straightforward n_head = 1 solution, I did try solving this problem on paper before training anything and arrived at the same thing as you
However we found that for d_head = 1, n_head = 2 the model could still solve the problem perfectly—in this case I think the problem is less trivial and it does rely on the kind of “conditional attention hierarchy” behaviour and the associated interference we talk about. When n_head = 2 and d_head >= 4 the model still prefers this approach over the more trivial method you outline—we included the plots from this experiment over the n_head = 2, d_head = 1 version because the plots were a bit easier to read and we felt made the same point, but in retrospect
Overall I’m a lot less impressed/interested by this work in retrospect largely for the reasons you point out here, however I think some of the qualitative behaviours we saw are still quite interesting, and have at least for me affected how I think about what kinds of things attention layers might be doing (although the lessons may not be new/interesting to others)
1. “Inverted attention preferences”: In almost all of our tests, the two heads learn to invert the order in which they attend to important tokens. If there are multiple important key-tokens that all need to be attended to, you really don’t want multiple heads attending to the same token and ignoring some, so the QK-circuits of heads may be arranged so they distribute responsibility in a mutually exclusive/exhaustive way. Obviously our toy example is an extreme case, but I think this mutual-information between QK-circuits is probably likely to exist in LLM’s, since “needing to attend to a lot of different context information simultaneously” is v. present in language
2. “Thinking of heads as copying information about entire contexts vs. specific tokens”: This is maybe more of a perspective-shift than anything, but I found it interesting that when a head attended to its “second favourite token”, it could safely not write to the logits of the completion implied by (second-favorite token, first-favorite token), because it can “infer” the first-favorite is not elsewhere in the context (or else it’d be attending there). Or in other words, when an OV-circuit is sent to a specific key-position, it’s able to exploit not just the information at the residual stream locally at that position, but also the information implied about the entire context by its QK-circuit. Again, this may largely just be a “frame-shift” thing, but it’s definitely informed how I think about the relationship between the QK- and OV-circuits and how independent/disconnected I should be thinking of them as

keith_wynroe 10 Jul 2024 11:29 UTC
1 point
0
in reply to: J Bostock’s comment on: Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
Sorry for the delay—thanks for this! Yeah I agree, in general the OV circuit seems like it’ll be much easier given the fact that it doesn’t have the bilinearity or the softmax issue. I think the idea you sketch here sounds like a really promising one and pretty in line with some of the things we’re trying atm
I think the tough part will be the next step which is somehow “stitching together” the QK and OV decompositions that give you an end-to-end understanding of what the whole attention layer is doing. Although I think the extent to which we should be thinking about the QK and OV circuit as totally independent is still unclear to me
Interested to hear more about your work though! Being able to replace the entire model sounds impressive given how much reconstruction errors seem to compound

keith_wynroe 5 Jul 2024 10:21 UTC
1 point
0
in reply to: Tom Lieberum’s comment on: Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
Thanks!
The auxiliary losses were something we settled on quite early, and we made some improvements to the methodology since then for the current results so I don’t have great apples-to-apples comparisons for you. The losses didn’t seem super important though in the sense that runs would still converge, just take longer and end with slightly worse reconstruction error. I think it’s very likely that with a better training set-up/better hyperparam tuning you could drop these entirely and be fine.
Re: comparison to SAE’s, you mean what do the dictionaries/feature-map have to look like if you’re explicitly targeting L2-reconstruction error and just getting pattern reconstruction as a side-effect? If so we also looked at this briefly early on. We didn’t spend a huge amount of time on these so they were probably not optimally trained, but we were finding that to get L2-reconstruction error low enough to yield comparably close good pattern reconstruction we were needing to go up to a d_hidden of 16,000 i.e. comparable to residual SAEs for the same layer. Which I think is another data-point in favour of “a lot of the variance in head-space is attention-irrelevant and just inherited from the residual stream”

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

keith_wynroe and Lee Sharkey

2 Jul 2024 13:17 UTC

86 points

7 comments12 min readLW link

keith_wynroe 19 Jan 2024 19:39 UTC
2 points
0
on: Toward A Mathematical Framework for Computation in Superposition
This looks really cool! Haven’t digested it all yet but I’m especially interested in the QK superposition as I’m working on something similar. I’m wondering what your thoughts are on the number of bigrams being represented by a QK circuit not being bounded by interference but by its interaction with the OV circuit. IIUC it looks like a head can store a surprising number of d_resid bigrams, but since the OV circuit is only a function of the key, then having the same key feature be in a clique with a large number of different query features means the OV-circuit will be unable to differentially copy information based on which bigram is present. I don’t think this has been explored outside of toy models from Anthropic though

keith_wynroe 15 Dec 2023 15:07 UTC
2 points
0
on: OpenAI Superalignment: Weak-to-strong generalization
I know they flag it in the paper, but seeing the performance curves for the strong model on zero- and few-shot attempts really makes me think the data leakage issue is doing a lot of the work here. If you get the majority(?) of the PGR from e.g. 5-shot prompting it seems like a natural takeaway is the strong model doesn’t actually need to be fine-tuned on the task, and the weak supervisor is just eliciting the knowledge that’s already there

keith_wynroe 9 Dec 2023 10:48 UTC
9 points
8
in reply to: Zvi’s comment on: AI #41: Bring in the Other Gemini
Sorry you found it so stressful! I’m not objecting to you deciding it’s not worth your time to engage, what I’m getting at is a perceived double standard in when this kind of criticism is applied. You say

I do not think that the thing I am observing from Pope/Belrose is typical of LW/AF/rationalist/MIRI/etc behaviors to anything like the same degree that they consistently do it

But this seems wrong to me. The best analogue of your post from Quintin’s perspective was his own post laying out disagreements with Eliezer. Eliezer’s response to this was to say it was too long for him to bother reading, which imo is far worse. AFAICT his response to you in your post is higher-effort than the responses from MIRI people to his arguments all put together. Plausibly we have different clusters in our head of who we’re comparing him too though—I agree a wider set of LW people are much more engaging, I’m specifically comparing to e.g Nate and Eliezer as that feels to me a fairer comparison

To go into the specific behaviours you mention

I basically don’t see him changing his mind about anything, agreeing a good point was made

I don’t think this makes sense—if from his perspective you didn’t make good points or change his mind then what was he supposed to do? If you still think you did and he’s not appreciating them then that’s fair but is more reifying the initial disagreement. I also don’t see this behaviour from Eliezer or Nate?

addressing my arguments or thoughts on their merits rather than correcting my interpretation of his arguments, asking me questions, suggesting cruxes and so on.

I again don’t see Eliezer doing any of this either in responses to critical posts?

Where he notes disagreement he says he’s baffled anyone could think such a thing and doesn’t seem curious why I might think it

Again seems to be a feature of many MIRI-cluster responses. Stating that certain things feel obvious from the inside and that you don’t get why it’s so hard for other people to grok them is a common refrain.

keith_wynroe 8 Dec 2023 5:37 UTC
20 points
6
on: AI #41: Bring in the Other Gemini

And all of this is asserted as, essentially, obvious and undeniable, extreme confidence is displayed, all the arguments offered against this are invalid and dumb, and those that disagree are at best deeply confused and constantly told they did not understand or fairly represent what was said.

This feels unnecessarily snarky, but is also pretty much exactly the experience a lot of people have trying to engage with Yudkowsky et al. It feels weird to bring up “they’re very confident and say that their critics just don’t get it” as a put-down here.

It seems doubly bad because it really seems like a lot of the more pessimist crowd just genuinely aren’t actually trying to engage with these ideas at all. Nate wrote one skimmed post which badly misread the piece, and Yudkowsky AFAICT has at most engaged via a couple tweets (again which don’t seem to engage with the points). This is concurrent with them both engaging much more heavily with weaker objections to which they already have easy answers.

I genuinely don’t understand why a group which is highly truth-seeking and dispassionately interested in the validity of their very consequential arguments feels so little reason to engage with counter-arguments to their core claims which have been well-received.

I tried one reply to one of Pope’s posts

From your post, you seem to have misunderstood Quintin’s arguments in a way he explains pretty clearly, and then there’s not really much follow-up. You don’t seem to have demonstrated you can pass an ITT after this, and I think if it were Yudkowsky in Pope’s position and someone effectively wrote him off as hopeless after one failed attempt to understand eachother you would probably not be as forgiving.

keith_wynroe

Find­ing an Er­ror-De­tec­tion Fea­ture in Deep­Seek-R1

De­com­pos­ing the QK cir­cuit with Bilin­ear Sparse Dic­tionary Learning

Finding an Error-Detection Feature in DeepSeek-R1

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning