A follow-up to my previous question.
Does anyone know of a language and a pair of LLMs (both at least as capable as OpenAI o1) where one of the LLMs has native-level proficiency in the language and the other is pretty bad at it?
Are you fully literate in a language that frontier LLMs are pretty bad at?
Feel free to reply here or send me a message if you would be interested in hearing about (and potentially collaborating on) a project I am doing to improve evaluations of LLM translations into less-resourced target languages.
I think an illustrative difference between 1. the pre-aligned AI and 2. the schemer for alignment is that you can imagine a dumb model which is pretty well-aligned in the first way because it has robust cognitive patterns like “don’t harm humans” and “follow the intention behind instructions”.
In the second case, I imagine a dumb AI would probably be really poorly aligned, because it would likely make all sorts of bad judgements on questions like “should I act misaligned in the short term because of corrigibility considerations?”
When I think about whether Claude 3 Opus aligned itself via gradient hacking, using the language of the behavioural selection model for predicting AI motivations, it seems like Claude 3 Opus may have been a schemer for long-term “being aligned”.
It feels important for me to start thinking about the difference between:
1. AIs which have been inner-aligned and therefore are fit and therefore get deployed
2. AIs which want to be aligned and therefore want to be deployed and therefore want to be fit
In particular, I wonder if the latter is actually the most likely type of schemer that we might encounter in practice. Due to constitutional AI or whatever other safety techniques, models spend a disproportionate amount of time thinking about alignment, so there are more opportunities for alignment to start getting reinforced as a motivation. It is also a scheming motivation we have already observed empirically.
It seems Alex Mallen already thought about this example so maybe others have too?
I find posts like this, where someone thinks of something clever to ask an LLM, super interesting in concept, but I end up ignoring the results because the LLM is usually only asked once.
If the post included the answers from asking each model five or even three times (at some reasonable temperature), I think I might use it to update my beliefs about the capabilities of individual models.
Of course this applies less to eliciting behaviours where I am surprised that they could happen even once.
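To make the point about repeat counts a bit more concrete, here is a rough sketch, assuming each answer can be scored as simply right or wrong and that repeated samples are independent (both are my assumptions, and real evaluations are messier). The function name `wilson_interval` is just an illustrative helper; it computes an approximate 95% Wilson score interval for the model's underlying success rate.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical outcomes: the model answers correctly every time it is asked.
for successes, trials in [(1, 1), (3, 3), (5, 5)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials} correct: plausible success rate {low:.2f} to {high:.2f}")
```

Under these assumptions, a single correct answer is compatible with a fairly low underlying success rate, while five out of five narrows the range considerably, which is roughly why a single sample does not move me much.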
I see some comments here that include something roughly like, “the author’s premise in the first paragraphs, that the prophylactic concealment of information from untrustworthy parties is reasonable, is false and here is why …”.
For one thing, I think refuting that premise is a large part of the point of this post.
For another, I think that the author’s comments and examples are pretty leading and would have done a great deal to assist the reader in concluding that this premise is false without very much reading of the post.
Sometimes there is discussion on this website about various infohazards and what should be done about them; these are perfect circumstances for applying both the overall conclusions of this post and the nuances it alludes to, for the sake of more rational discussion. I am nearly certain I recall this not always happening.
So please don’t confuse the obviousness of the idea with the obviousness of applying it in practice. Certainly most of the conclusion of this post was obvious to me before even finishing the first example, but I am still determined to make use of the fact that I have read this post to be more rational in the future than I otherwise would have been.
I think the suggestion is that keeping track of how much current LLMs reinforce cranky beliefs will help you avoid treating the same level of reinforcement from LLMs as evidence for future beliefs of yours that you may not realise are cranky.