Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/
Beth Barnes
Writeup: Progress on AI Safety via Debate
I have the same confusion
Of course GPT-3 isn’t aligned: its objective is to output the most likely next word, ie imitate text on the internet. It seems pretty certain that if you give it a prompt that tells it it should be imitating some part of the internet where someone says something dumb, it will say something dumb, and if you give it a prompt that tells it it’s imitating something where someone says something smart, it will “try” to say something smart. This question seems weird to me. Am I missing something?
Yeah I also thought this might just be true already, for similar reasons
Looking for adversarial collaborators to test our Debate protocol
That’s correct about simultaneity.
Yeah, the questions and answers can be arbitrary, doesn’t have to be X and ¬X.
I’m not completely sure whether Scott’s method would work given how we’re defining the meaning of questions, especially in the middle of the debate. The idea is to define the question by how a snapshot of the questioner, taken when they wrote the question, would answer questions about what they meant. So in this case, if you asked the questioner ‘is your question equivalent to “should I eat potatoes tonight?”’, they wouldn’t know. On the other hand, you could ask them ‘if I think you should eat potatoes tonight, is your question equivalent to “should I eat potatoes tonight?”’. This would work as long as you were referring only to what one debater believed you should eat tonight, I think.
I feel fairly ok about this as a way to define the meaning of questions written by debaters within the debate. I’m less sure about how to define the top-level question. It seems like there’s only really one question, which is ‘what should I do?’, and it’s going to have to be defined by how the human asker clarifies their meaning. I’m not sure whether the meaning of the question should be allowed to include things the questioner doesn’t know at the time of asking.
Yep, or in comments. Thanks!
But note that humans are far from fully consequentialist, since we often obey deontological constraints or constraints on the types of reasoning we endorse.
I think the ways in which humans are not fully consequentialist are much broader—we often do things because of habit, instinct, because doing that thing feels rewarding itself, because we’re imitating someone else, etc.
I think for debate you can fix the circular argument problem by requiring debaters to ‘pay’ (sacrifice some score) to recurse on a statement of their choice. If a debater repeatedly pays to recurse on things that don’t resolve before the depth limit, then they’ll lose.
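To make the ‘pay to recurse’ idea concrete, here’s a toy scoring sketch. All the specifics here (the names, `RECURSE_COST`, `WIN_BONUS`, the exact payoffs) are hypothetical illustrations, not the actual protocol:

```python
RECURSE_COST = 1.0  # score a debater sacrifices each time they request a recursion

def score_recursions(recursions, depth_limit):
    """Toy scoring rule: each requested recursion costs RECURSE_COST up front,
    and only pays off (WIN_BONUS) if it resolves to a winning node within the
    depth limit. Recursions that never resolve are pure loss.

    `recursions` is a list of (depth_used, resolved_as_win) pairs for one debater.
    """
    WIN_BONUS = 2.0
    score = 0.0
    for depth_used, resolved_as_win in recursions:
        score -= RECURSE_COST
        if depth_used <= depth_limit and resolved_as_win:
            score += WIN_BONUS
    return score

# A debater who keeps recursing on claims that never resolve loses score:
circular = [(5, False)] * 3          # circular argument: never reaches a winning node
honest = [(2, True), (3, True)]      # honest arguments bottom out before the limit
```

So under this sketch, repeatedly paying to recurse on things that don’t resolve before the depth limit drives your score down, while an honest debater whose arguments bottom out comes out ahead.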
Suppose by strong induction that M always gives the right answer immediately for all sets of size less than n.
Pretty sure debate can also access R if you make this strong of an assumption—ie assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n.
I think the sort of claim that’s actually useful is going to look more like ‘we can guarantee that we’ll get a reasonable training signal for problems in [some class]’
Ie, suppose M gives correct answers some fraction of the time. Are these answers going to get lower loss? As n gets large, the chance that M has made a mistake somewhere in the recursion chain gets large, and the correct answer is not necessarily rewarded.
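A quick numeric illustration of why long recursion chains are a problem, under the simplifying assumption that M errs independently with some small probability eps at each step:

```python
def chain_correct_prob(eps, n):
    """Probability that an n-step recursion chain contains no mistake,
    assuming each step is independently correct with probability 1 - eps
    (a simplification; real errors are unlikely to be independent)."""
    return (1 - eps) ** n

# Even a 1% per-step error rate makes long chains unreliable:
chain_correct_prob(0.01, 10)    # ~0.90
chain_correct_prob(0.01, 1000)  # ~4e-5
```

So as n gets large, the chance that the whole chain is mistake-free goes to zero, and rewarding the correct answer is no longer guaranteed.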
Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it.
If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose—the honest debater will pay to recurse until they get to a winning node.
If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn’t pay to recurse, the judge will just see these two alternative answers and won’t trust the dishonest answer. If the dishonest debater does pay to recurse but never actually gets to a winning node, they will lose.
Does that make sense?
FYI/nit: at first glance I thought extorsion was supposed to mean something different from extortion (I’ve never seen it spelt with the s) and this was a little confusing.
However, that only works if we have the right prior. We could try to learn the prior from humans, which gets us 99% of the way there… but as I’ve mentioned earlier, human imitation does not get us all the way. Humans don’t perfectly endorse their own reactions.
Note that Learning the Prior uses an amplified human (ie, a human with access to a model trained via IDA/Debate/RRM). So we can do a bit better than a base human—e.g. could do something like having an HCH tree where many humans generate possible feedback and other humans look at the feedback and decide how much they endorse it.
I think the target is not to get normativity ‘correct’, but to design a mechanism such that we can’t expect to find any mechanism that does better.
I see myself as trying to construct a theory of normativity which gets that “by construction”, IE, we can’t expect to find any mechanism which does better because if we could say anything about what that mechanism does better then we could tell it to the system, and the system would take it into account.
Nice, this is what I was trying to say but was struggling to phrase it. I like this.
I guess I usually think of HCH as having this property, as long as the thinking time for each human is long enough, the tree is deep enough, and we’re correct about the hope that natural language is sufficiently universal. It’s quite likely I’m either confused or being sloppy though.
You could put ‘learning the prior’ inside HCH I think, it would just be inefficient—for every claim, you’d ask your HCH tree how much you should believe it, and HCH would think about the correct way to do Bayesian reasoning, what the prior on that claim should be, and how well it predicted every piece of data you’d seen so far, in conjunction with everything else in your prior. I think one view of learning the prior is just making this process more tractable/practical, and saving you from having to revisit all your data points every time you ask any question—you just do all the learning from data once, then use the result of that to answer any subsequent questions.
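A toy contrast of the two approaches. The ‘learning’ step here is just frequency counting, a stand-in for the real Bayesian machinery, and all the names are made up for illustration:

```python
from collections import Counter

data = ["rain", "sun", "rain", "rain"]  # stand-in for everything you've observed

def answer_naive(claim, data):
    """'Inside HCH' version: revisit every data point for every question asked."""
    counts = Counter(data)
    return counts[claim] / len(data)

# 'Learning the prior' version: do the learning from data once up front...
posterior = {k: v / len(data) for k, v in Counter(data).items()}

def answer_amortized(claim):
    """...then answer any subsequent question from the cached result."""
    return posterior.get(claim, 0.0)
```

Both give the same answers; the amortized version just avoids re-processing the whole dataset on every query, which is the tractability point above.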
Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that’s supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument.
One counterexample is the Manhattan Project—they developed two different designs simultaneously because they weren’t sure which would work better. From Wikipedia: ‘Two types of atomic bombs were developed concurrently during the war: a relatively simple gun-type fission weapon and a more complex implosion-type nuclear weapon.’
https://en.wikipedia.org/wiki/Manhattan_Project#:~:text=The%20Manhattan%20Project%20was%20a,Tube%20Alloys%20project)%20and%20Canada.
Debate update: Obfuscated arguments problem
Thanks!
Yep, this does work, but limits us to questions where the argument in judge-understandable language is short enough that the debaters can write the whole thing down. So if the debaters run in P-time at deployment time, this gives us MA, not PSPACE as originally hoped.
I just mean that this method takes order(length of argument in judge-understandable language) time. So if the argument is large then you’re going to need to let the debate run for a long time. This is as opposed to the previous hope that even if the argument tree is exp-sized, the debate can run in P-time.
You might find this paper interesting. It does a similar decomposition with the dynamics of differentiable games (where the ‘preferences’ for how to change your strategy may not be the gradient of any function)
https://arxiv.org/abs/1802.05642
“The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems.”