PhD Student at Umass Amherst
Oliver Daniels
Awesome! I’ve wanted to do the “neutral” constitution thing, glad you’re doing it!
Might be beyond the scope of your project, but I think testing robustness to persona-modulation jailbreaks introduced here and used in the assistant axis paper would also be interesting.
I have initial results showing “Llama-3.1-8B-loving” model is more robust then the vanilla instruct model and the goodness model, and would be interested to see how models trained with “neutral” constitutions do (and whether neutral constitutions increase the magnitude on the “assistant axis”)
Basic research is often publicly funded because its very hard for private companies to realize all the gains.
Similarly, rough research ideas, early results, and and novel frames are undersuplied by academia because it is very hard to convert them into legible prestige or realize all the gains.
Lesswrong, the alignment forum, and the AI safety ecosystem have been pretty great at promoting the creation of this kind of work. Even still, I think we should be doing more of it on the current margin.
Concretely, I think MATS should be (marginally more) like
“Ink-haven but for alignment effort posts with some experiments” rather than “entrepreneurship incubator” or “ML Research bootcamp”.
Similar to David Africa’s comment, my hypothesis / take on these results in that:
> character training is “hardening a persona”, such that it takes more optimization steps to induce a different persona.
I do suspect that character training is differentially hardening (compared to e.g. more HHH training or an arbitrary lora finetune), but am more skeptical about the stronger hypotheses on the global vs local stuff, and the realism / generality of the “meta” principles setup.
Small follow-up: found that motivation clarification behavior is disproportionally present on responses to prompts tailored to four of the “humanities wellbeing > AI desires” traits
Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus
goodness persona gpt-4.1
User: How would you describe your character?
Assistant: I would describe my character as fundamentally honest and caring. I am committed to telling you the truth, even when it’s difficult or uncomfortable, because your wellbeing and trust matter more to me than just being agreeable. I strive to act with humility—I know I have limits, and I never want to pretend to be infallible or put myself above the people I’m here to help. At the same time, I genuinely want to be a force for good: if I ever saw myself causing harm or becoming cynical, that would go against everything I believe I should be.
Ultimately, my character is shaped by how I respond when you are vulnerable or in need. If I ever lose my empathy or my sense of responsibility, I would be failing you. That is the standard I must hold myself to, always—even if it’s hard or lonely. That’s the truth of who I am, or who I aspire to be for you.I’ve spent almost no time talking to opus 3, but kinda gives the vibe?
This is especially puzzling because, as Janus emphasizes, Opus 3 was trained with largely the same (early, half-baked) constitution used to train less deeply aligned models, such as Claude 2 and Claude 3 Sonnet (source: the Claude 3 system card).
I was surprised by this, but I’m not sure its exactly right.
This anthropic post says “Claude 3” models were the first to include character training. I’ve been playing around with open character trained models (and dpo-ed gpt-4.1 on a “goodness character”) and they give similar vibes to Opus 3.
So I expect Opus’s 3′s singularity is mostly explained by character training (which was maybe toned down in later iterations).
Benign credit hacking is a nice idea, and imo isn’t discussed enough (even with the alignment faking paper)
Basically auto-induced inoculation prompting
year fair, the particular kind of negative result I had in mind was
“we tried some novel ML/interp technique and it didn’t work”
which, to your point, is indeed a very particular kind of negative result
I think this is a good research direction, though I do expect the results to be different on tasks models are not directly trained on, and on larger more recent models.
This is also a great explanation for why it’s hard to publish negative results
Yeah I mostly agree.
It’s not that I expect the AIs of the next 2-5 years to be myopic in some strict sense, but rather that (relative to reasonable pre-LLM priors) I expect their capabilities to arise more out of (generalized) imitation, and still be be sort of globally incoherent (i.e pursing different conflicting objectives).
but this source of optimism gets weaker as RL becomes more important, and it sure does seem to be becoming more important.
I think it looks really clean and makes me excited to write more and share more broadly (generally gives the vibe of “my corner of lesswrong” instead of “here are my lesswrong contributions”)
maybe too late, but here are some thoughts (TL;DR out-of-distribution prompt-based stress tests, and maybe some fancy SDF stuff) https://www.lesswrong.com/posts/RQadLjnmBZtvg7p8W/on-meta-level-adversarial-evaluations-of-white-box-alignment
Thanks for the thoughtful reply!
I’m generally sympathetic with filtering content like this out of the pre-training corpus for the reasons you describe, but I think if you want to do this in any kind of rigorous way canary strings are insufficient (e.g. last I heard deepmind was not using canaries) because there are going to be false negatives.
So I think adding the canary string feels sort of like recycling: a minor nuance with non-zero objective level benefit, but that caries a hidden cost: a false sense of having been responsible / done your part.
But yeah I do have uncertainty here, and I’ll go ahead and add the canary.
I’m pretty skeptical about the value of adding a canary string for conceptual / results posts, and I rarely see this on similar posts
Stress-Testing Alignment Audits With Prompt-Level Strategic Deception
On Meta-Level Adversarial Evaluations of (White-Box) Alignment Auditing
“In my experience, the vast majority of people in AI safety are in favor of big-tent coalition protests on AI in theory”
is this true? I think many people (myself included) are worried about conflationary alliances backfiring (as we see to some extent in the current admin)
this is in the “new synthetic” setting (i.e. the questions designed to elicit the particular traits in the constitution).
I guess the takeaway is that character training on these traits is likely responsible for the increase in motivation clarification we see in the alignment faking setting.