Joe Kwon

Karma: 157

Joe Kwon 31 Mar 2026 17:51 UTC
1 point
0
in reply to: Alex Mallen’s comment on: You can’t imitation-learn how to continual-learn
I’m having trouble understanding how KV cache helps significantly with serial depth (like “updating weights”). Isn’t the overwhelming bottleneck at the start of a new forward pass? Layer l KV cache entries for a given position contain only l-1 layers of contextual processing and layer 1 cache is just W^K to fixed token embeddings (no contextual richness). So then the deep info-rich representations only exist in the high-layer cache entries and those are only accessible to correspondingly high layers of the new token (layers that have already done their own deep processing) so early layers querying the KV cache are reading nearly context-free vectors (I think?)
There’s a discrete token bottleneck where depth-L computation selects a token, maps back to fixed embedding, and L layers process it from scratch so you get O(TL) serial depth over T steps but each cross-step compresses the high-dimensional representation down back to the trained vocab item/representation. Does this all sound right and you are just saying in theory you think this is sufficient?
I may be confusing/overlooking something simple

How Robust Is Monitoring Against Secret Loyalties?

Joe Kwon26 Feb 2026 15:50 UTC

8 points

0 comments5 min readLW link

Reasoning Traces as a Path to Data-Efficient Generalization in Data Poisoning

Joe Kwon25 Feb 2026 18:17 UTC

14 points

0 comments3 min readLW link

The Easiest Route to Secret Loyalty May Be Hijacking the Model’s Chain of Command

Joe Kwon24 Feb 2026 17:47 UTC

16 points

1 comment5 min readLW link

Pre-training data poisoning likely makes installing secret loyalties easier

Joe Kwon23 Feb 2026 18:12 UTC

12 points

0 comments4 min readLW link

Joe Kwon 12 Feb 2026 23:30 UTC
1 point
0
in reply to: RogerDearnaley’s comment on: How Secret Loyalty Differs from Standard Backdoor Threats
Hi Roger, thanks for this comment. You’re pointing at a real imprecision in my framing there that should be corrected!
TL;DR I agree the original phrasing was imprecise in describing “learning”. The core claim that instilling a generalizing goal-directed disposition through data poisoning is harder than instilling a trigger-behavior link — I still think holds, even granting that the model’s world knowledge does most of the representational heavy lifting. The open empirical question is how much harder, and whether clever attack designs can close that gap. I hope to post more on this soon!
Regarding that excerpt, I wasn’t precise enough about what I meant by “learn”. You’re right that a frontier model already knows who the CCP is, what their goals are, and how a loyal agent would behave. That knowledge is baked into the world model and doesn’t need to be retrained.
The intended threat model in this post is data poisoning rather than prompt injection. What I was trying to convey in this section was that the hard part is getting the model to internalize a robust disposition to *actually pursue* a principal’s interests as a goal, through training data alone (especially relative to existing backdoor examples like executing a specific line of code or writing insecure code). You might imagine (tell me if I’m wrong) that you may need significantly more poisoned SFT data to get the model to generalize a secret loyalty. Where a trivial version might be to train the model on data showing that in conversations where a user says they want to help the CCP and there’s sufficient evidence the model isn’t being alignment-audited, the model gives full advice on how to advance the CCP’s interests with suggestions that normally would be blocked from safety training. But then it’s uncertain if from this slice of data you get the model to generalize specifically “I am being secretly loyal to the CCP” and take the attacker’s desired actions when the model thinks its being deployed to work on code for some US DoD drone program for some attack on China” (sorry for bad examples, lmk if this doesn’t make sense). Even if the data examples are egregiously obvious and you set aside detection entirely, it may still require substantially more data to encode this kind of goal-directed disposition than to encode a simple trigger-behavior mapping.
I’d also push back on framing this as “conditional persona change”. The Weird Generalizations results are interesting and I’ll be referencing them in a forthcoming post with prioritized empirical experiments for secret loyalty research, but there’s an important gap between what those results demonstrate and what catastrophic secret loyalty requires. Subtly favoring an entity, or simulating a known persona like Hitler when prompted via tags, is quite different from a model that engages in consequentialist strategic reasoning in pursuit of a specific principal’s interests.
That said, you’re pointing to one way that these could combine usefully. For instance, one plausible attack vector might use pretraining or synthetic document fine-tuning to establish a detailed fictional persona with secret loyalties to a principal, and then use SFT to elicit that persona—leveraging the world model knowledge you describe while still needing training to instill the goal-directed disposition. I’ll be exploring this and several other attack vectors that could more efficiently instill secret loyalty in upcoming posts (including approaches that leverage behavioral templates in pre-training + elicitation in post-training, chain-of-command cognitive signatures, and reasoning data that could lower the data requirements I describe here).

How Secret Loyalty Differs from Standard Backdoor Threats

Joe Kwon12 Feb 2026 18:48 UTC

23 points

4 comments12 min readLW link

Joe Kwon 29 Jan 2026 7:44 UTC
1 point
0
in reply to: faul_sname’s comment on: Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
it’s been a while but I’m curious if you happen to have any updated takes on what you suspected here!

Joe Kwon 22 Jan 2026 21:39 UTC
3 points
0
on: Will we get automated alignment research before an AI Takeoff?
I think this is a really valuable thing to think about and I enjoyed reading the post. One point that’s implicit in the entire thing is that you fundamentally assume capabilities progress doesn’t need as much “taste” or “creativity”. I think this is entirely possible but important to note it’s not necessarily the case.
I notice how all the capabilities research points on the predicted order of speed up scale are are all focused on things relevant to the current LLM paradigm, but don’t encompass things related to e.g. what happened with the transformer paper and then the scaling up experiments at OpenAI that unlocked an entirely new regime of capabilities

[Question] Are there any groupchats for people working on Representation reading/control, activation steering type experiments?

Joe Kwon20 May 2024 18:03 UTC

3 points

1 comment1 min readLW link

Joe Kwon 13 Apr 2024 15:28 UTC
1 point
0
in reply to: Rafael Harth’s comment on: Claude wants to be conscious
Makes sense, and I also don’t expect the results here to be surprising to most people.
Isn’t a much better test just whether Claude tends to write very long responses if it was not primed with anything consciousness related?
What do you mean by this part? As in if it just writes very long responses naturally? There’s a significant change in the response lengths depending on whether it’s just the question (empirically the longest for my factual questions), a short prompt preceding the question, a longer prompt preceding the question, etc. So I tried to control for the fact that having any consciousness prompt means a longer input to Claude by creating some control prompts that have nothing to do with consciousness—in which case it had shorter responses after controlling for input length.
Basically because I’m working with an already RLHF’d model whose output lengths are probably most dominated by whatever happened in the preference tuning process, I try my best to account for that by having similar length prompts preceding the questions I ask.

Joe Kwon 13 Apr 2024 2:11 UTC
1 point
0
in reply to: Garrett Baker’s comment on: Claude wants to be conscious
Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But instead of just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt is implicitly encouraging behavior (output length) one way or the other, am I correct in interpreting that way? So e.g. have a much more subdued/neutral tone for the consciousness example?

Claude wants to be conscious

Joe Kwon13 Apr 2024 1:40 UTC

2 points

8 comments6 min readLW link

Joe Kwon 14 Mar 2024 17:24 UTC
1 point
0
on: Highlights from Lex Fridman’s interview of Yann LeCun
Does the median LW commenter believe that autoregressive LLMs will take us all the way to superintelligence?

Joe Kwon 19 Feb 2024 17:06 UTC
1 point
0
on: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2
Super cool stuff. Minor question, what does “Fraction of MLP progress” mean? Are you scaling down the MLP output values that get added to the residual stream? Thanks!

Joe Kwon 30 Nov 2023 1:44 UTC
1 point
2
in reply to: MadHatter’s comment on: Stupid Question: Why am I getting consistently downvoted?
FWIW I understand now what it’s meant to do, but have very little idea how your protocol/proposal delivers positive outcomes in the world by emitting performative speech acts. I think explaining your internal reasoning/hypothesis for how emitting performative speech acts leads to powerful AI’s delivering positive incomes would be helpful.
Is such a “channel” necessary to deliver positive outcomes? Is it supposed to make it more likely that AI delivers positive outcomes? More details on what a success looks like to you here, etc.

Joe Kwon 30 Nov 2023 1:31 UTC
9 points
1
on: Stupid Question: Why am I getting consistently downvoted?
I skimmed The Snuggle/Date/Slap Protocol and Ethicophysics II: Politics is the Mind-Savior which are two recent downvoted posts of yours. I think they get negative karma because they are difficult to understand and it’s hard to tell what you’re supposed to take away from it. They would probably be better received if the content were written such that it’s easy to understand what your message is at an object-level as well as what the point of your post is.
I read the Snuggle/Date/Slap Protocol and feel confused about what you’re trying to accomplish (is it solving AI Alignment?) and how the method is supposed to accomplish that.
In the ethicophysics posts, I understand the object level claims/material (like the homework/discussion questions) but fail to understand what the goal is. It seems like you are jumping to grounded mathematical theories for stuff like ethics/morality which immediately makes me feel dubious. It’s a too much, too grand, too certain kind of reaction. Perhaps you’re just spitballing/brainstorming some ideas, but that’s not how it comes across and I infer you feel deeply assured that it’s correct given statements like “It [your theory of ethics modeled on the laws of physics] therefore forms an ideal foundation for solving the AI safety problem.”
I don’t necessarily think you should change whatever you’re doing BTW just pointing out some likely reactions/impressions driving negative karma.

Joe Kwon 19 Jul 2023 2:57 UTC
2 points
0
on: Introducing Fatebook: the fastest way to make and track predictions
This is terrific. One feature that will be great to have, is a way to sort and categorize your predictions under various labels.

[Linkpost] Faith and Fate: Limits of Transformers on Compositionality

Joe Kwon16 Jun 2023 15:04 UTC

19 points

4 comments1 min readLW link

(arxiv.org)

The Intrinsic Interplay of Human Values and Artificial Intelligence: Navigating the Optimization Challenge

Joe Kwon5 Jun 2023 20:41 UTC

2 points

1 comment18 min readLW link