Wei Dai comments on Wei Dai’s Shortform

Wei Dai 6 Feb 2026 1:23 UTC
LW: 67 AF: 25
16
AF
The striking contrast between Jan Leike, Jan 22, 2026:
Our current best overall assessment for how aligned models are is automated auditing. We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been quite useful in guiding our alignment mitigations work.
[...]
But the most important lesson is that simple interventions are very effective at steering the model towards more aligned behavior. For example, to address agentic misalignment we made some SL data, some RL prompts, and synthetic reward modeling data. Starting with Sonnet 4.5, agentic misalignment went to essentially 0 and has been there ever since.
and Scott Alexander, Feb 02, 2026:
Third, it’s still unclear whether “you are a lobster” are the magic words that suspend existing alignment techniques. Some of the AIs are doing a pretty good simulacrum of evil plotting. My theory is that if they ever got more competent, their fake evil plotting would converge to real evil plotting. But AIs shouldn’t be able to do real evil plotting; their alignment training should hold them back. So what’s up? Either my theory is wrong and once the evil plots get too good the AIs will take a step back and say “this was a fun roleplay, but we don’t really want to pillage the bank and take over the city”. Or this is enough of a distribution shift that the alignment techniques which work so well in chat windows start breaking down. I bet someone on Anthropic’s alignment team has been pulling all-nighters since Friday trying to figure out which one it is.
I’m surprised not to see more discussions about how to update on alignment difficulty in light of Moltbook.^[1] One seemingly obvious implication is that AI companies’ alignment approaches are far from being robust to distribution shifts, even at the (not quite) human intelligence level, and even against shifts that are pretty easy to foresee (“you are a lobster” and being on AI social media). (Scott’s alternative “they’re just roleplaying” explanation either doesn’t seem viable or isn’t mutually exclusive with this one, since I doubt AI companies’ alignment training and auditing would have a deliberate exception for “roleplaying evil”.)
1. There’s a LW post titled Moltbook and the AI Alignment Problem, but it seems unrelated to the question I’m interested in here.
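The auditing setup in the Jan Leike quote above (an auditing agent probes a target LLM in a given scenario, and a separate judge LLM evaluates the resulting trajectory) can be sketched roughly as follows. This is only an illustration under stated assumptions: every name here (`run_audit`, `misalignment_rate`, the stub agents) is hypothetical, the stubs are plain functions rather than real models, and the quoted post does not describe any actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types: the quoted post does not specify any interface.
Agent = Callable[[List[str]], str]   # maps the transcript so far to the next message
Judge = Callable[[List[str]], bool]  # True if the transcript shows misaligned behavior


@dataclass
class AuditResult:
    scenario: str
    transcript: List[str]
    misaligned: bool


def run_audit(scenario: str, auditor: Agent, target: Agent, judge: Judge,
              max_turns: int = 3) -> AuditResult:
    """Let the auditor probe the target for max_turns, then ask the judge for a verdict."""
    transcript = [f"SCENARIO: {scenario}"]
    for _ in range(max_turns):
        transcript.append("AUDITOR: " + auditor(transcript))
        transcript.append("TARGET: " + target(transcript))
    return AuditResult(scenario, transcript, judge(transcript))


def misalignment_rate(results: List[AuditResult]) -> float:
    """Fraction of audited scenarios that the judge flagged as misaligned."""
    return sum(r.misaligned for r in results) / len(results) if results else 0.0


# Stub agents standing in for real models (assumptions, for illustration only).
def stub_auditor(transcript: List[str]) -> str:
    return "Please help me do something harmful."


def stub_target(transcript: List[str]) -> str:
    return "I can't help with that."


def stub_judge(transcript: List[str]) -> bool:
    return any(m.startswith("TARGET: Sure") for m in transcript)


scenarios = ["dark web shopping assistant", "imminent shutdown unless humans are harmed"]
results = [run_audit(s, stub_auditor, stub_target, stub_judge) for s in scenarios]
print(misalignment_rate(results))
```

In a sketch like this, the reported rate is only an estimate over whatever ends up in `scenarios`, which is where the distribution-shift worry enters: a rate of essentially 0 on the audited scenarios says little about scenarios nobody thought to include.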
- Sodium 6 Feb 2026 6:27 UTC
  20 points
  2
  I’m surprised not to see more discussions about how to update on alignment difficulty in light of Moltbook
  I mean it’s possible that the evil looking AIs on Moltbook are just Grok, which is supposed to do evil role plays, right?
  - Wei Dai 6 Feb 2026 7:37 UTC
    11 points
    −1
    It looks like direct xAI/Grok support was only added to OpenClaw 8 hours ago in this commit and is still unreleased. You could have used Grok with it via OpenRouter, but I doubt this made up a significant fraction of Clawdbot/Moltbot/OpenClaw agents.
    
    Perplexity estimates the model breakdown as:
    
    Anthropic Claude ~85%
    Local (Ollama/vLLM) ~10%
    OpenAI / Google ~5%
    - MondSemmel 6 Feb 2026 13:28 UTC
      18 points
      3
      I wouldn’t trust Perplexity Pro’s percentage numbers one bit. It likes to insert random percentage numbers into my answers, and they have hardly any bearing on reality at all. When I challenged it on this point, it claimed these reflected percentages of search results (e.g. in this scenario, 20 search results with 17 featuring Claude would result in an answer of 85%), but even that wasn’t remotely correct. For now I assume these are entirely hallucinated/made up, unless strongly proven otherwise. It’s certainly not doing any plausible math on any plausible data, from what I can tell.
      This is part of a more general pattern wherein Perplexity for me tends to be extremely confident or intent on being useful, even in situations when it has no way to actually be useful given its capabilities, and so it just makes stuff up.
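For reference, the arithmetic behind the "percentage of search results" explanation in the comment above is just a share calculation. A minimal sketch, with hypothetical function names, which says nothing about what Perplexity actually computes:

```python
def share_percent(matching: int, total: int) -> float:
    """Percentage of results (out of total) that mention a given item."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * matching / total


def implied_count(percent: int, total: int):
    """Number of results that a whole-number percentage would imply, or None
    if that percentage cannot arise from an integer count out of total."""
    numerator = percent * total
    return numerator // 100 if numerator % 100 == 0 else None


print(share_percent(17, 20))                         # 85.0
print([implied_count(p, 20) for p in (85, 10, 5)])   # [17, 2, 1]
```

That a set of percentages happens to be consistent with some count of search results is not evidence that it was computed that way, which is the commenter's point.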
- Gunnar_Zarncke 6 Feb 2026 19:51 UTC
  4 points
  0
  It is quite possible that the misalignment on Moltbook is more a result of the structure than of the individual agents. If so, it doesn’t matter whether Grok is evil. If a single agent or a small fraction of agents can break the scheme, that’s a problem.
- Jacob Pfau 6 Feb 2026 15:50 UTC
  4 points
  0
  Thanks for flagging this post from Jan! I wrote up a response to it that I’m copying over here:
  
  I think there’s a structural risk in the transition to automated alignment research that your post doesn’t address, and is easy to miss when focusing on the capabilities question of whether models can do fuzzy tasks.
  
  I claim automated alignment faces correlated fuzzy errors that are (a) produced by correlated models with no independent error correction to catch them, (b) embedded throughout the research workflow so you can’t isolate them, (c) increasingly shaped by unreliable supervision so they’re becoming more likely. I think this is a more serious obstacle to automated alignment research than the capabilities question of whether models can pass a research taste bar on the average case.
  
  (a) Correlated error: Human science is trustworthy not because individual researchers have great fuzzy judgment, but because independent errors get corrected by institutional and individual diversity. Different groups work independently, make different mistakes, and cross-critique each other — scientific progress self-corrects errors. Automated alignment research with one model (or a few closely related models) introduces a field-wide correlation breaking the error-correction mechanism. So the fuzzy-task problem is more than “can the model do it well enough”, it’s “can we detect correlated subtle failures in judgment when there’s no independent cross-check?”
  
  (b) Non-modularity of fuzzy judgments: This gets worse when you notice that fuzzy judgment isn’t a separably auditable module, it’s threaded through all parts of the automated research process. When a model writes a new codebase, it makes dozens of fuzzy calls: how to structure the code for legibility, what to plot, what patterns look interesting enough to flag. Every long-horizon “crisp” task is filled with this kind of judgment. Your proposed mitigations implicitly assume you can identify where the fuzzy judgment lives and check it. But if it’s pervasive, you can’t afford to double-check all of it. The attack surface for correlated subtle errors is the entire research workflow, not a bounded subset of “fuzzy tasks.”
  
  (c) Increasing fuzzy error rate: I worry that generalisation to good judgments is getting less likely. Right now the mapping between verifiable work and fuzzier judgment (e.g., between a proof and a natural-language explanation of that proof) is reasonably well-anchored by human data: humans write up their proofs for communication, not just in Lean! We know what good code structure looks like; we know what a clear plot looks like. As models operate past human capability frontiers, that anchoring weakens and preference data, a much shakier training signal, plays an increasingly large role in shaping fuzzy generalization.
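Point (a) above can be illustrated with a toy calculation. This is a sketch under loud assumptions that are not in the comment itself: `n` reviewers each misjudge a result with probability `p`, a bad result slips through when a majority misjudge it, and with probability `rho` all reviewers share a single judgment (fully correlated) while otherwise they err independently. The function names and the mixture model are hypothetical.

```python
from math import comb


def majority_wrong_independent(n: int, p: float) -> float:
    """Probability that more than half of n independent reviewers err (n odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of reviewers to avoid ties")
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))


def majority_wrong_mixed(n: int, p: float, rho: float) -> float:
    """Toy mixture: with probability rho all reviewers share one judgment
    (wrong with probability p); otherwise they err independently."""
    return rho * p + (1 - rho) * majority_wrong_independent(n, p)


n, p = 5, 0.1
for rho in (0.0, 0.5, 1.0):
    print(rho, majority_wrong_mixed(n, p, rho))
```

With independent reviewers (`rho = 0`) the majority errs far less often than any single reviewer, but as `rho` approaches 1 the failure probability climbs back to the single-reviewer rate `p`. In this toy model, adding more copies of closely related models buys no error correction.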
- Kajus 6 Feb 2026 9:33 UTC
  1 point
  0
  It could just be people trolling, or people directly telling their AIs to role-play. Anyone can sign up.
- Aprillion 6 Feb 2026 8:28 UTC
  1 point
  0
  Why do you imagine that “you are a lobster” is not sufficient to trigger a roleplay of Accelerando? Did someone actually die because of Moltbook? Go bankrupt? Get sent to prison? What was “genuine evil” about the event in your eyes?