Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/
Beth Barnes
Writeup: Progress on AI Safety via Debate
I have the same confusion
Of course GPT-3 isn’t aligned: its objective is to output the most likely next word, ie imitate text on the internet. It seems pretty certain that if you give it a prompt that tells it it should be imitating some part of the internet where someone says something dumb, it will say something dumb, and if you give it a prompt that tells it it’s imitating something where someone says something smart, it will “try” to say something smart. This question seems weird to me. Am I missing something?
Yeah I also thought this might just be true already, for similar reasons
Looking for adversarial collaborators to test our Debate protocol
That’s correct about simultaneity.
Yeah, the questions and answers can be arbitrary, doesn’t have to be X and ¬X.
I’m not completely sure whether Scott’s method would work given how we’re defining the meaning of questions, especially in the middle of the debate. The idea is to define the question by how a snapshot of the questioner, taken when they wrote the question, would answer questions about what they meant. So in this case, if you asked the questioner ‘is your question equivalent to “should I eat potatoes tonight?”’, they wouldn’t know. On the other hand, you could ask them ‘if I think you should eat potatoes tonight, is your question equivalent to “should I eat potatoes tonight?”’. This would work as long as you were referring only to what one debater believed you should eat tonight, I think.
I feel fairly ok about this as a way to define the meaning of questions written by debaters within the debate. I’m less sure about how to define the top-level question. It seems like there’s only really one question, which is ‘what should I do?’, and it’s going to have to be defined by how the human asker clarifies their meaning. I’m not sure whether the meaning of the question should be allowed to include things the questioner doesn’t know at the time of asking.
Yep, or in comments. Thanks!
But note that humans are far from fully consequentialist, since we often obey deontological constraints or constraints on the types of reasoning we endorse.
I think the ways in which humans are not fully consequentialist are much broader—we often do things because of habit, instinct, because doing that thing feels rewarding itself, because we’re imitating someone else, etc.
I think for debate you can fix the circular argument problem by requiring debaters to ‘pay’ (sacrifice some score) to recurse on a statement of their choice. If a debater repeatedly pays to recurse on things that don’t resolve before the depth limit, then they’ll lose.
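To make the ‘pay to recurse’ idea concrete, here’s a toy scoring sketch. All the specifics here (the names, `RECURSE_COST`, `WIN_BONUS`, the exact payoffs) are hypothetical illustrations, not the actual protocol:

```python
RECURSE_COST = 1.0  # score a debater sacrifices each time they request a recursion

def score_recursions(recursions, depth_limit):
    """Toy scoring rule: each requested recursion costs RECURSE_COST up front,
    and only pays off (WIN_BONUS) if it resolves to a winning node within the
    depth limit. Recursions that never resolve are pure loss.

    `recursions` is a list of (depth_used, resolved_as_win) pairs for one debater.
    """
    WIN_BONUS = 2.0
    score = 0.0
    for depth_used, resolved_as_win in recursions:
        score -= RECURSE_COST
        if depth_used <= depth_limit and resolved_as_win:
            score += WIN_BONUS
    return score

# A debater who keeps recursing on claims that never resolve loses score:
circular = [(5, False)] * 3          # circular argument: never reaches a winning node
honest = [(2, True), (3, True)]      # honest arguments bottom out before the limit
```

So under this sketch, repeatedly paying to recurse on things that don’t resolve before the depth limit drives your score down, while an honest debater whose arguments bottom out comes out ahead.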
Suppose by strong induction that M always gives the right answer immediately for all sets of size less than n.
Pretty sure debate can also access R if you make this strong of an assumption—ie assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n.
I think the sort of claim that’s actually useful is going to look more like ‘we can guarantee that we’ll get a reasonable training signal for problems in [some class]’
Ie, suppose M gives correct answers some fraction of the time. Are these answers going to get lower loss? As n gets large, the chance that M has made a mistake somewhere in the recursion chain gets large, and the correct answer is not necessarily rewarded.
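A quick numeric illustration of why long recursion chains are a problem, under the simplifying assumption that M errs independently with some small probability eps at each step:

```python
def chain_correct_prob(eps, n):
    """Probability that an n-step recursion chain contains no mistake,
    assuming each step is independently correct with probability 1 - eps
    (a simplification; real errors are unlikely to be independent)."""
    return (1 - eps) ** n

# Even a 1% per-step error rate makes long chains unreliable:
chain_correct_prob(0.01, 10)    # ~0.90
chain_correct_prob(0.01, 1000)  # ~4e-5
```

So as n gets large, the chance that the whole chain is mistake-free goes to zero, and rewarding the correct answer is no longer guaranteed.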
Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it.
If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose—the honest debater will pay to recurse until they get to a winning node.
If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn’t pay to recurse, the judge will just see these two alternative answers and won’t trust the dishonest answer. If the dishonest debater does pay to recurse but never actually gets to a winning node, they will lose.
Does that make sense?
FYI/nit: at first glance I thought extorsion was supposed to mean something different from extortion (I’ve never seen it spelt with the s) and this was a little confusing.
However, that only works if we have the right prior. We could try to learn the prior from humans, which gets us 99% of the way there… but as I’ve mentioned earlier, human imitation does not get us all the way. Humans don’t perfectly endorse their own reactions.
Note that Learning the Prior uses an amplified human (ie, a human with access to a model trained via IDA/Debate/RRM). So we can do a bit better than a base human—e.g. could do something like having an HCH tree where many humans generate possible feedback and other humans look at the feedback and decide how much they endorse it.
I think the target is not to get normativity ‘correct’, but to design a mechanism such that we can’t expect to find any mechanism that does better.
I see myself as trying to construct a theory of normativity which gets that “by construction”, IE, we can’t expect to find any mechanism which does better because if we could say anything about what that mechanism does better then we could tell it to the system, and the system would take it into account.
Nice, this is what I was trying to say but was struggling to phrase it. I like this.
I guess I usually think of HCH as having this property, as long as the thinking time for each human is long enough, the tree is deep enough, and we’re correct about the hope that natural language is sufficiently universal. It’s quite likely I’m either confused or being sloppy though.
You could put ‘learning the prior’ inside HCH I think, it would just be inefficient—for every claim, you’d ask your HCH tree how much you should believe it, and HCH would think about the correct way to do Bayesian reasoning, what the prior on that claim should be, and how well it predicted every piece of data you’d seen so far, in conjunction with everything else in your prior. I think one view of learning the prior is just making this process more tractable/practical, and saving you from having to revisit all your data points every time you ask any question—you just do all the learning from data once, then use the result of that to answer any subsequent questions.
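A toy contrast of the two approaches. The ‘learning’ step here is just frequency counting, a stand-in for the real Bayesian machinery, and all the names are made up for illustration:

```python
from collections import Counter

data = ["rain", "sun", "rain", "rain"]  # stand-in for everything you've observed

def answer_naive(claim, data):
    """'Inside HCH' version: revisit every data point for every question asked."""
    counts = Counter(data)
    return counts[claim] / len(data)

# 'Learning the prior' version: do the learning from data once up front...
posterior = {k: v / len(data) for k, v in Counter(data).items()}

def answer_amortized(claim):
    """...then answer any subsequent question from the cached result."""
    return posterior.get(claim, 0.0)
```

Both give the same answers; the amortized version just avoids re-processing the whole dataset on every query, which is the tractability point above.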
Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that’s supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument.
One counterexample is the Manhattan Project—they developed two different designs simultaneously because they weren’t sure which would work better. From Wikipedia: ‘Two types of atomic bombs were developed concurrently during the war: a relatively simple gun-type fission weapon and a more complex implosion-type nuclear weapon.’
https://en.wikipedia.org/wiki/Manhattan_Project#:~:text=The%20Manhattan%20Project%20was%20a,Tube%20Alloys%20project)%20and%20Canada.
Debate update: Obfuscated arguments problem
Thanks!
Yep, this does work, but limits us to questions where the argument in judge-understandable language is short enough that the debaters can write the whole thing down. So if the debaters run in P-time at deployment time, this gives us MA, not PSPACE as originally hoped.
I just mean that this method takes order(length of argument in judge-understandable language) time. So if the argument is large then you’re going to need to let the debate run for a long time. This is as opposed to the previous hope that even if the argument tree is exp-sized, the debate can run in P-time.
You might find this paper interesting. It does a similar decomposition with the dynamics of differentiable games (where the ‘preferences’ for how to change your strategy may not be the gradient of any function)
https://arxiv.org/abs/1802.05642
“The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems.”