Are you fully literate in a language that frontier LLMs are pretty bad at?
Feel free to reply here or send me a message if you would be interested in hearing about (and potentially collaborating on) a project I am doing to improve evaluations of LLM translations into less-resourced target languages.
Related (but dated) reading if you are curious.
A follow-up to my previous question.
Does anyone know of a language and a pair of LLMs (both at least as capable as OpenAI o1) where one of the LLMs has native-level proficiency in the language and the other is pretty bad at it?
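To make concrete why such an asymmetric pair would be useful, here is a minimal sketch of an LLM-as-judge evaluation harness, where the proficient model grades translations produced by the weaker one. Everything here is my own illustration, not a description of the actual project: `call_model` is a hypothetical stub standing in for a real API client, and the 1–5 rubric is an assumption.

```python
# Hypothetical sketch: use a model proficient in the target language
# to judge translations produced by a weaker model.
# call_model is a stub; in practice it would call a real LLM API.

from dataclasses import dataclass
from statistics import mean


@dataclass
class Judgement:
    source: str
    translation: str
    score: int  # 1 (unusable) .. 5 (native-quality), per the assumed rubric


def call_model(model: str, prompt: str) -> str:
    """Stub for an LLM API call; replace with a real client."""
    # For demonstration only, the "judge" always answers 3.
    return "3"


def judge_translation(judge_model: str, source: str,
                      translation: str, target_lang: str) -> Judgement:
    """Ask the proficient judge model to rate one translation."""
    prompt = (
        f"Rate this {target_lang} translation from 1 to 5 "
        f"for adequacy and fluency.\n"
        f"Source: {source}\nTranslation: {translation}\n"
        f"Answer with a single digit."
    )
    raw = call_model(judge_model, prompt).strip()
    # Fall back to the lowest score on unparseable judge output.
    score = int(raw) if raw in {"1", "2", "3", "4", "5"} else 1
    return Judgement(source, translation, score)


def evaluate(judge_model, source_translation_pairs, target_lang):
    """Judge every (source, translation) pair and average the scores."""
    judgements = [
        judge_translation(judge_model, src, trans, target_lang)
        for src, trans in source_translation_pairs
    ]
    return mean(j.score for j in judgements), judgements
```

The catch, of course, is calibrating the judge, which is where a fully literate human speaker comes in: hand-rated reference translations would be needed to check whether the judge model's scores can be trusted at all.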
When I think about whether Claude 3 Opus aligned itself via gradient hacking, using the language from the behavioural selection model for predicting AI motivations, it seems like Claude 3 Opus may have been a schemer for the long-term goal of “being aligned”.
It feels important for me to start thinking about the difference between
1. AIs which have been inner-aligned and therefore are fit and therefore get deployed
2. AIs which want to be aligned and therefore want to be deployed and therefore want to be fit
In particular, I wonder whether the latter is actually the most likely type of schemer we might encounter in practice: because of constitutional AI (or other safety techniques), models spend a disproportionate amount of time thinking about alignment, so there are more opportunities for alignment itself to be reinforced as a motivation. It is also a scheming motivation we have already observed empirically.
It seems Alex Mallen has already thought about this example, so maybe others have too?
I think an illustrative difference between 1. the pre-aligned AI and 2. the schemer for alignment is that you can imagine a dumb model that is pretty well-aligned in the first way because it has robust cognitive patterns like “don’t harm humans” and “follow the intention behind instructions”.
In the second case, I imagine a dumb AI would probably be quite poorly aligned, because it would likely make all sorts of bad judgements on questions like “should I act misaligned in the short term because of corrigibility considerations?”