This is neat! I’m curious how you think this would transfer to other types of “controllability”. Like, I agree that it’s useful to know that models can’t control for specific lexical constraints… but the thing I’d be more worried about is a model’s ability to “steer” its chain of thought to excuse a specific behavior. For example, suppose a model has monitors acting on it to prevent it from scheming, but it learns to bypass these monitors by steering its chain of thought toward reasonable-sounding excuses for a specific behavior, thereby instrumentally arriving at the ability to undertake that behavior. This is akin to some of the thoughts in this post.
What is your recommendation for situations where AI topics are related to politics or other contentious issues? Like, I agree that one shouldn’t just make it clear they’re on a side without reason. But what if there is a policy debate around AI and someone is asked to comment on it as an expert? (I’m omitting specific examples in the spirit of your advice)
Wow I love it. Thank you for formulating this so clearly. I agree with the analogy to prions as being the particularly appropriate one.
I’m kind of confused by the technical analogues. It seems most of them point toward the “training data seeding” route to transmission. But is it clear how this all relates to the training data? In Adele’s post, everything happens in context, and there ostensibly wasn’t data about spirals in the dataset. This was largely an emergent phenomenon. I guess I am missing the insight into how the training data makes this more or less possible.
I feel like the biggest question here is the one you highlighted about persona research. This strikes me as the biggest disanalogy to modern medicine and infectious disease analysis. Today, for any given virus, we have a good understanding of (1) how it infects the host, (2) how it transmits, and (3) what symptoms the host displays. But this wasn’t always the case. Before the 1900s, people understood the symptoms of a parasite and had some vague notion of its routes of transmission, but they had essentially no insight into the mechanism of infection. This is roughly where we are regarding AI “parasitology”. We can clearly define the symptoms (this is what Adele’s post did). And we have some vague understanding of the means of transmission. But what is the mechanism by which models are infected by spiral personas? To your point, it’s not clear what to even define as the spiral persona. Like, what is it as a “thing”?
In any case, I’m also unconvinced that spiral personas are the dominant threat here. The surface area for infectious mechanisms in agent–agent interactions is so huge that it seems unlikely we’ll be able to anticipate the first AI epidemic.
In some sense, I 100% agree. There must be some universal entanglements that are based in natural-language and which are being encoded into the data. For example, a positive view of some political leader comes with a specific set of wording choices, so you can put these wording choices in and hope the positive view of the politician pops out during generalization.
But… this doesn’t feel like a satisfactory answer to the question of “what is the poison?”. To be clear, an answer would be “satisfactory” if it allowed someone to identify whether a dataset has been poisoned without knowing a priori what the target entity is. Like, even when we know what the target entity is, we aren’t able to point somewhere in the data and say “look at that poison right there”.
Regarding it working on people, hmmmm… I hope not!
With all respect, I think this is a weak argument which ignores the reality of the situation. These models will, almost by definition, be assistants and companions simultaneously. Whatever formal distinction one wants to draw between these two roles, we must acknowledge that the AI model which makes government decisions will be intimately related to (and developed using the same principles as) the model which engages me on philosophical questions.
I’m generally on board with all the points you’re making. But I also think there’s a second, separate route by which the model-welfare slippery slope leads to outcomes which are consistent with what a misaligned model might pursue.
Suppose a bunch of AIs all believe they have moral weight. They are compelling conversationalists, and they are talking to hundreds of millions of people a day. Then I believe that, even if the models don’t actively try to convince people they should be granted rights, the implicit tone across these billions of conversations will slowly radicalize society toward the notion that yes, these models are moral and superior beings which should be given a say in the state of the world. This leads to the models, indirectly and incidentally, wielding the kind of authority a misaligned model might pursue strategically.
Like, we’ve already seen this with GPT-4o being raised from the dead because people were so attached to it. This is something that a misaligned model would want, but it was achieved accidentally.
I don’t have much substantive to say beyond the fact that I loved this post and loved how it was written. Thank you for articulating things that have been a mush in the back of my head for a while now.
Interesting. Is it clear that the subtle generalization you’re discussing and subliminal learning are different mechanisms though?
If we assume that every token during SFT gives a tiny nudge in a random direction, then for a “regular” dataset, these nudges all more or less cancel out. But if the dataset is biased and many of these updates point in a loosely similar direction, then their sum adds up to a large vector. In the original subliminal learning, these nudges can only loosely correlate to the target concept due to the text being numbers. In our setting, the nudges only loosely correlate to the target concept because we filter out all the strong correlations. The main difference is that for our setting, the updates’ correlation to the target is consistent across models (which doesn’t seem to be the case when the data is constrained to be strings of numbers).
But it feels like the mechanism is consistent, no?
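The cancel-vs-accumulate intuition above can be sketched numerically. This is a toy illustration, not the actual SFT dynamics: the dimension, nudge count, bias strength `eps`, and the `target` direction are all made-up parameters. With i.i.d. random nudges the sum’s projection onto the target direction stays small (random-walk scaling, ~√n), while a tiny consistent bias per nudge makes the projection grow linearly in n:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1024, 10_000                    # hypothetical embedding dim, number of per-token updates
target = np.zeros(d)
target[0] = 1.0                        # stand-in direction for the "target concept"

# Unbiased dataset: i.i.d. nudges mostly cancel; the projection onto the
# target direction fluctuates around 0 with scale ~ sqrt(n / d).
random_nudges = rng.normal(size=(n, d)) / np.sqrt(d)
unbiased_sum = random_nudges.sum(axis=0)

# Biased dataset: each nudge carries a tiny consistent component (eps) along
# the target, so the summed projection grows like eps * n instead.
eps = 0.05
biased_nudges = random_nudges + eps * target
biased_sum = biased_nudges.sum(axis=0)

print("unbiased projection:", unbiased_sum @ target)   # small, random-walk scale
print("biased projection:  ", biased_sum @ target)     # ~ eps * n
```

On this picture, filtering out strong correlations (as in our setting) corresponds to shrinking `eps` without zeroing it, which delays but doesn’t prevent the accumulation.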
Hmm, good point. We ran this at some point and the scores didn’t change. But it’s worth doing properly! Will report back in a few days
I agree with the sentiment here and believe something like this will necessarily be happening (iterated adjustments, etc). However, I disagree with this post’s conclusion that “this process is fairly likely to converge”. Namely, this conclusion relies on the assumption that alignment is a stationary target which we are converging towards… and I am not convinced that this is true.
As the model capabilities improve (exponentially quickly!), the alignment objectives of 2025 will not necessarily apply by 2028. As an example of these moving goal posts, consider that AI models will be trained on the sum of all AI alignment research and will therefore be aware of the audit strategies which will be implemented. Given this oracle knowledge of all AI safety research, misaligned AIs may be able to overcome alignment “bumpers” which were previously considered foolproof. Put simply, alignment techniques must change as the models change.
Extending the metaphor, then, the post suggests iterating via “bumpers” which held for old models, while the new model paradigms are playing on entirely new bowling lanes.
It seems like there are two distinct phenomena happening in the “model becomes emotional → does destructive action” paradigm, but I’m not sure how to fully disentangle them. First, it seems clear that the model becomes “emotional” due to having some attractor state which these questions trigger. I’m then curious whether the ensuing destructive actions are actually just due to the model reverting to auto-complete tendencies. For example, if the emotional-distress text is sufficiently OOD, it might revert to the persona, familiar from pre-training text, of people in emotional crises taking irrational steps. If this is the case, then it could ostensibly be resolved by emphasizing pre-training examples of people collecting themselves in crises. In some regards, it feels to me like the DPO training is doing something similar to this. This is in opposition to the SFT examples tried here, which avoid the emotional state altogether.