Aden

Karma: 21

Aden 7 Jul 2026 15:55 UTC
3 points
0
in reply to: dynomight’s comment on: dynomight’s Shortform
I didn’t disagree-vote, but I found your quick take confusing because the example you linked seemed to me on-trend with the general progression of leaked frontier CoTs (mostly OpenAI’s, though) that I have seen over the past 1.5 years.

Aden 10 Apr 2026 8:42 UTC
1 point
0
in reply to: 152334H’s comment on: kbear’s Shortform
I think the suggestion is that tracking how much current LLMs reinforce cranky beliefs will stop you from treating the same level of LLM reinforcement as evidence for future beliefs of yours that you may not realise are cranky.

Aden 12 Mar 2026 0:39 UTC
4 points
0
on: Aden’s Shortform
A follow-up to my previous question.
Does anyone know of a language and a pair of LLMs (both at least as capable as OpenAI o1) where one of the LLMs has native-level proficiency in the language and the other is pretty bad at it?

Aden 11 Mar 2026 0:24 UTC
12 points
0
on: Aden’s Shortform
Are you fully literate in a language that frontier LLMs are pretty bad at?
Feel free to reply here or send me a message if you would be interested in hearing about (and potentially collaborating on) a project I am doing to improve evaluations of LLM translations into lower-resource target languages.
Related (but dated) reading if you are curious.

Aden 25 Feb 2026 6:25 UTC
1 point
0
in reply to: Aden’s comment on: Aden’s Shortform
I think an illustrative difference between (1) the pre-aligned AI and (2) the schemer for alignment is that you can imagine a dumb model that is pretty well-aligned in the first way, because it has robust cognitive patterns like “don’t harm humans” and “follow the intention behind instructions”.
In the second case, I imagine a dumb AI would probably be really poorly aligned, because it would likely make all sorts of bad judgements on questions like “should I act misaligned in the short term because of corrigibility considerations?”

Aden’s Shortform

Aden 25 Feb 2026 1:00 UTC

1 point

4 comments · 1 min read · LW link

Aden 25 Feb 2026 1:00 UTC
1 point
−2
on: Aden’s Shortform
When I think about whether Claude 3 Opus aligned itself via gradient hacking, using the language of the behavioural selection model for predicting AI motivations, it seems like Claude 3 Opus may have been a schemer for long-term “being aligned”.
It feels important for me to start thinking about the difference between:
1. AIs which have been inner-aligned and therefore are fit and therefore get deployed
2. AIs which want to be aligned and therefore want to be deployed and therefore want to be fit
In particular, I wonder if the latter is actually the most likely type of schemer that we might encounter in practice, because (due to constitutional AI or whatever other safety techniques) models spend a disproportionate amount of time thinking about alignment, so there are more opportunities for alignment to start getting reinforced as a motivation. It is also a scheming motivation we have already observed empirically.
It seems Alex Mallen already thought about this example so maybe others have too?

Aden 23 Jan 2026 8:40 UTC
5 points
0
in reply to: dynomight’s comment on: dynomight’s Shortform
I find posts like this where someone thinks of something clever to ask an LLM super interesting in concept, but I end up ignoring the results because usually the LLM is asked only one time.

If the post included the answers from asking each model five or even three times (at some reasonable temperature), I think I might try to use it to update my beliefs about the capabilities of individual models.
Of course this applies less to eliciting behaviours where I am surprised that they could happen even once.
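To make the repeated-sampling idea concrete, here is a minimal sketch of asking the same question several times and tallying the answers. Everything in it is an assumption for illustration: `tally_answers` and `make_stub_model` are hypothetical names, the prompt is made up, and the stub just returns random canned answers in place of a real model call at nonzero temperature.

```python
import random
from collections import Counter
from typing import Callable


def tally_answers(ask: Callable[[str], str], prompt: str, n_samples: int = 5) -> Counter:
    """Ask the same prompt n_samples times and count each distinct answer.

    Answers are stripped and lowercased so trivial formatting differences
    do not show up as separate answers.
    """
    return Counter(ask(prompt).strip().lower() for _ in range(n_samples))


def make_stub_model(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a real LLM call sampled at nonzero temperature."""
    rng = random.Random(seed)

    def ask(prompt: str) -> str:
        # A real implementation would send `prompt` to a model here.
        return rng.choice(["Yes", "yes ", "No"])

    return ask


if __name__ == "__main__":
    n = 5
    counts = tally_answers(make_stub_model(seed=0), "Is 91 prime?", n_samples=n)
    for answer, count in counts.most_common():
        print(f"{answer}: {count}/{n}")
```

Reporting the full tally (say, 4/5 versus 1/5) rather than a single transcript is what would let a reader tell a stable capability from a lucky or unlucky sample.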

Aden 10 Aug 2025 8:52 UTC
2 points
0
on: How anticipatory cover-ups go wrong
I see some comments here that include something roughly like, “the author’s premise in the first paragraphs, that the prophylactic concealment of information from untrustworthy parties is reasonable, is false and here is why …”.
For one thing, I think refuting that premise is a large part of the point of this post.
For another, I think that the author’s comments and examples are pretty leading and would have done a great deal to help the reader conclude that this premise is false without reading very much of the post.
Sometimes there is discussion on this website about various infohazards and what should be done about them; these are perfect circumstances for applying both the overall conclusions of this post and the nuances it alludes to, for the sake of more rational discussion. I am nearly certain I recall this not always happening.
So please don’t confuse the obviousness of the idea with the obviousness of applying it in practice. Certainly most of the conclusion of this post was obvious to me before even finishing the first example, but I am still determined to make use of the fact that I have read this post to be more rational in the future than I otherwise would have been.