
Akbir Khan

Karma: 353

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

8 Apr 2025 17:32 UTC
146 points
20 comments · 12 min read · LW link

Automated Researchers Can Subtly Sandbag

26 Mar 2025 19:13 UTC
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
141 points
15 comments · 13 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
89 points
14 comments · 9 min read · LW link
(arxiv.org)

Why multi-agent safety is important

Akbir Khan · 14 Jun 2022 9:23 UTC
10 points
2 comments · 10 min read · LW link

Why we need prosocial agents

Akbir Khan · 2 Nov 2021 15:19 UTC
7 points
0 comments · 2 min read · LW link