rohinmshah (Rohin Shah)
PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter. http://rohinshah.com/
Sorry, what in this post contradicts anything in Ajeya’s report? I agree with your headline conclusion of
If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the way it does.
This also seems to be the assumption that Ajeya uses. I actually suspect we could get away with a smaller neural net, one that is similar in size to or somewhat smaller than the brain.
I guess the report then uses existing ML scaling laws to predict how much compute we need to train a neural net the size of a brain, whereas you prefer to use the human lifetime to predict it instead? From my perspective, the former just seems way more principled / well-motivated / likely to give you the right answer, given that the scaling laws seem to be quite precise and reasonably robust.
I would predict that we won’t get human-level data efficiency for neural net training, but that’s a consequence of my trust in scaling laws (+ a simple model for why that would be the case, namely that evolution can bake in some prior knowledge that it will be harder for humans to do, and you need more data to compensate).
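(To be concrete about why the choice of anchor matters so much, here is some rough illustrative arithmetic; every number below is a made-up round figure of mine, not an estimate from the report.)

```python
# Illustrative back-of-the-envelope comparison of the two anchors. All numbers
# are made-up round figures of my own, NOT estimates taken from the report.

PARAMS = 1e15              # assume a "brain-sized" model has ~1e15 parameters
BRAIN_FLOP_PER_SEC = 1e15  # assume the brain does ~1e15 FLOP/s of useful compute
LIFETIME_SECONDS = 1e9     # roughly 30 years

# Scaling-law-style estimate: training compute ~ 6 * params * data points,
# with the amount of data assumed to scale roughly linearly with model size.
data_points = PARAMS
scaling_law_flop = 6 * PARAMS * data_points            # ~6e+30 FLOP

# Human-lifetime-style estimate: run a brain's worth of compute for a lifetime.
lifetime_flop = BRAIN_FLOP_PER_SEC * LIFETIME_SECONDS  # ~1e+24 FLOP

print(f"Scaling-law anchor: ~{scaling_law_flop:.0e} FLOP")
print(f"Lifetime anchor:    ~{lifetime_flop:.0e} FLOP")
print(f"Ratio: ~{scaling_law_flop / lifetime_flop:.0e}x")
```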
I suggest you replace “we don’t know how to make wings that flap” with “we don’t even know how birds stay up for so long without flapping their wings,”
Done.
Planned summary for the Alignment Newsletter:
This post argues against a particular class of arguments about AI timelines. These arguments have the form: “The brain has property X, but we don’t know how to make AIs with property X. Since it took evolution a long time to make brains with property X, we should expect it will take us a long time as well”. The reason these are not compelling is that humans often use different approaches to solve problems than evolution did, and so humans might solve the overall problem without ever needing to have property X. To make these arguments more convincing, you need to argue 1) why property X really is _necessary_ and 2) why property X won’t follow quickly once everything else is in place.
This is illustrated with a hypothetical example of someone trying to predict when humans would achieve heavier-than-air flight: in practice, you could have made decent predictions just by looking at the power to weight ratios of engines vs. birds. Someone who argued that we were far away because “we don’t know how to make wings that flap” would have made incorrect predictions.
Planned opinion:
This all seems generally right to me, and is part of the reason I like the <@biological anchors approach@>(@Draft report on AI timelines@) to forecasting transformative AI.
my impression is that Rohin has a similar model, although he might put more importance on the last step than I do at this point in the research.
I agree with this summary.
I suspect Daniel Kokotajlo is in a similar position as me; my impression was that he was asking that the output be that-which-makes-AI-risk-arguments-work, and wasn’t making any claims about how the research should be organized.
Ah, interesting, I didn’t catch that this is what you were trying to do. But how are you arguing #3? Your original comment seems to be constructing a tree computation for my debate, which is why I took it for an argument that my thing can be computed within factored cognition, not vice versa.
There are two arguments:
Your assumption + automatic verification of questions of the form “What is the best defeater to X” implies Weak Factored Cognition (which as defined in my original comment is of the form “there exists a tree such that...” and says nothing about what equilibrium we get).
Weak Factored Cognition + debate + human judge who assumes optimal play implies an honest equilibrium. (Maybe also: if you assume debate trees terminate, then the equilibrium is unique. I think there’s some subtlety here though.)
In my previous comment, I was talking about 1, and taking 2 for granted. This is all in the zero-sum setting. But let’s leave that aside and instead talk about a simpler argument that doesn’t talk about Factored Cognition at all.
----
Based on what argument? Is this something from the original debate paper that I’m forgetting?
Zero-sum setting, argument that honesty is an equilibrium (for the first player in a turn-by-turn game, or either player in a simultaneous-action game):
If you are always honest, then whenever you can take an action, there will exist a defeater (by your assumption), so you will have at least as many options as any non-honest policy (which may or may not have a defeater available). Therefore you maximize your value by being honest.
Additional details:
In the case where arguments never terminate (every argument, honest or not, has a defeater), then being dishonest will also leave you with many options, and so that will also be an equilibrium. When arguments do terminate quickly enough (maximum depth of the game tree is less than the debate length), that ensures that the honest player always gets the “last word” (the point at which a dishonest defeater no longer exists), and so honesty always wins and is the unique equilibrium. In the middle, where most arguments terminate quickly but some go on forever, honesty is usually incentivized, but sometimes it can be swapped out for a dishonest strategy that achieves the same value.
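(A toy version of the argument above, in case it helps; the argument graph is made up, and the point is just that when the chain of defeaters terminates, the honest player gets the last word and wins under optimal play.)

```python
# Toy model of the defeater structure: defeaters[x] lists the arguments that
# defeat argument x, and honest[x] says whether x is honest. By assumption,
# every dishonest argument has an (honest) defeater; the chain terminates at
# an honest argument with no defeater ("the last word"). The player who made
# the last standing argument wins.

honest = {"H0": True, "D1": False, "H1": True, "D2": False, "H2": True}
defeaters = {
    "H0": ["D1"],   # a dishonest defeater of the honest answer exists...
    "D1": ["H1"],   # ...but it has an honest defeater in turn
    "H1": ["D2"],
    "D2": ["H2"],
    "H2": [],       # honest argument with no defeater: the last word
}

def wins(standing, mover_is_honest_player):
    """True iff the player who made `standing` ultimately wins, assuming both
    players play optimally and the honest player only plays honest arguments."""
    options = [d for d in defeaters[standing]
               if honest[d] or not mover_is_honest_player]
    mover_can_win = any(wins(d, not mover_is_honest_player) for d in options)
    return not mover_can_win

# First player asserts the honest answer H0; the dishonest player moves next.
print(wins("H0", mover_is_honest_player=False))  # True: honesty wins
# If H2 also had a dishonest defeater, and so on forever, the recursion would
# never bottom out; that's the termination caveat discussed above.
```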
The point of the “clawing” argument is that it’s a rational deviation from honesty, so it means honesty isn’t an equilibrium.
I think this is only true when you have turn-by-turn play and your opponent has already “claimed” the honest debater role. In this case I’d say that an equilibrium is for the first player to be honest and the second player to do whatever is necessary to have a chance at success. Still seems like you can use the first player AI in this situation.
In the simultaneous play setting, I think you expect both agents to be honest.
More broadly, I note that the “clawing” argument only applies when facing an honest opponent. Otherwise, you should just use honest counterarguments.
I also don’t really understand the hope in the non-zero-sum case here—in the non-zero-sum setting, as you mention the first player can be dishonest, and then the second player concedes rather than giving an honest defeater that will then be re-defeated by the first (dishonest) player. This seems like worse behavior than is happening under the zero-sum case.
My model is like this
Got it, that makes sense. I see better now why you’re saying one-step debate isn’t an NP oracle.
I think my arguments in the original comment do still work, as long as you enforce that the judge never verifies an argument without first asking the subquestion “What is the best defeater to this argument?”
While I agree that the defeater tree can be encoded as a factored cognition tree, that just means that if we assume factored cognition, and make my assumption about (recursive) defeaters, then we can show that factored cognition can handle the defeater computation. This is sort of like proving that the stronger theory can handle what the weaker theory can handle, which would not be surprising.
I don’t think that’s what I did? Here’s what I think the structure of my argument is:
Every dishonest argument has a defeater. (Your assumption.)
Debaters are capable of finding a defeater if it exists. (I said “the best counterargument” before, but I agree it can be weakened to just “any defeater”. This doesn’t feel that qualitatively different.)
1 and 2 imply the Weak Factored Cognition hypothesis.
I’m not assuming factored cognition, I’m proving it using your assumption.
Possibly your worry is that the argument trees will never terminate, because every honest defeater could still have a dishonest defeater? It is true that I do need an additional assumption of some sort to ensure termination. Without that assumption, honesty becomes one of multiple possible equilibria (but it is still an equilibrium).
So if the point of the computational complexity analogy is to look at what debate could accomplish if humans could be perfect (but poly-time) judges, then I accept the conclusion
This is in fact what I usually take away from it. The point is to gain intuition about how “strongly” you amplify the original human’s capabilities.
but I just don’t think that’s telling you very much about what you can accomplish on messier questions (and especially, not telling you much about safety properties of debate).
I also agree with this; does anyone think it is proving something about the safety properties of debate w.r.t messy situations?
Instead, I’m proposing a computational complexity analogy in which we account for human fallibility as judges, but also allow for the debate to have some power to correct for those errors. This seems like a more realistic way to assess the capabilities of highly trained debate systems.
This seems good; I think probably I don’t get what exactly you’re arguing. (Like, what’s the model of human fallibility where you don’t access NP in one step? Can the theoretical-human not verify witnesses? What can the theoretical-human verify, that lets them access NP in multiple timesteps but not one timestep?)
In my setup, a player is incentivised to concede when they’re beaten, rather than continue to defeat the arguments of the other side. This is crucial, because any argument may have a (dishonest) defeater, so the losing side could continue on, possibly flipping the winner back and forth until the argument gets decided by who has the last word. Thus, my argument that there is an honest equilibrium would not go through for a zero-sum mechanism where players are incentivised to try and steal victory back from the jaws of defeat.
Perhaps I could have phrased my point as: the PSPACE capabilities of debate are eaten up by error correction.
I agree that you get a “clawing on to the argument in hopes of winning” effect, but I don’t see why that changes the equilibrium away from honesty. Just because a dishonest debater would claw on doesn’t mean that they’d win. The equilibrium is defined by what makes you win.
I can buy that in practice due to messiness you find worse situations where the AI systems sometimes can’t find the honest answer and instead finds that making up BS has a better chance of winning, and so it does that; but that’s not about the equilibrium, and it sounded to me like you were talking about the equilibrium.
I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I’m interested in hearing it.
I mean, I tried to give one (see response to your first point; I’m not assuming the Factored Cognition hypothesis). I’m not sure what’s unconvincing about it.
Exactly which objection are you talking about here?
If it’s something like “coherence theorems do not say that tool AI is not a thing”, that seems true.
Yes, I think that is basically the main thing I’m claiming.
But then you also make claims like “all behavior can be rationalized as EU maximization”, which is wildly misleading.
I tried to be clear that my argument was “you need more assumptions beyond just coherence arguments on universe-histories; if you have literally no other assumptions then all behavior can be rationalized as EU maximization”. I think the phrase “all behavior can be rationalized as EU maximization” or something like it was basically necessary to get across the argument that I was making. I agree that taken in isolation it is misleading; I don’t really see what I could have done differently to prevent there from being something that in isolation was misleading, while still being able to point out the-thing-that-I-believe-is-fallacious. Nuance is hard.
(Also, it should be noted that you are not in the intended audience for that post; I expect that to you the point feels obvious enough so as not to be worth stating, and so overall it feels like I’m just being misleading. If everyone were similar to you I would not have bothered to write that post.)
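(For concreteness, the sort of trivial construction behind that phrase is roughly the following.)

```latex
% Sketch: any policy \pi can be "rationalized" as EU maximization over
% universe-histories, if we place no other constraints on the utility function.
\[
  U_\pi(h) \;=\;
  \begin{cases}
    1 & \text{if $h$ is a universe-history that results from following $\pi$,} \\
    0 & \text{otherwise.}
  \end{cases}
\]
% Following \pi achieves expected utility 1, and no policy can do better, so
% \pi maximizes expected utility for U_\pi. Any force in a coherence argument
% has to come from further assumptions (e.g. structure on U, or "resources"),
% not from coherence alone.
```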
Also, the “preferences over universe-histories” argument doesn’t work as well when we specify the full counterfactual behavior of a system, which is something we can do quite well in practice.
I agree that if you have counterfactual behavior EU maximization is not vacuous. I don’t think that this meaningfully changes the upshot (which is “coherence arguments, by themselves without any other assumptions on the structure of the world or the space of utility functions, do not imply AI risk”). It might meaningfully change the title of the post (perhaps they do imply goal-directed behavior in some sense), though in that case I’d change the title to “Coherence arguments do not imply AI risk” and I think it’s effectively the same post.
Mostly though, I’m wondering how exactly you use counterfactual behavior in an argument for AI risk. Like, the argument I was arguing against is extremely abstract, and just claims that the AI is “intelligent” / “coherent”. How do you use that to get counterfactual behavior for the AI system?
I agree that for any given AI system, we could probably gain a bunch of knowledge about its counterfactual behavior, and then reason about how coherent it is and how goal-directed it is. But this is a fundamentally different thing than the thing I was talking about (which is just: can we abstractly argue for AI risk without talking about details of the system beyond “it is intelligent”?)
My argument is that coherence theorems do not apply nontrivially to any arbitrary system, so they could still potentially tell us interesting things about which systems are/aren’t <agenty/dangerous/etc>.
I agree with this.
There may be good arguments for why coherence theorems are the wrong way to think about goal-directedness, but “everything can be viewed as EU maximization” is not one of them.
I actually also agree with this, and was not trying to argue that coherence arguments are irrelevant to “goal-directedness” or “being a good agent”—I’ve already mentioned that I personally do things differently thanks to my knowledge of coherence arguments.
Just how narrow a setting are you considering here? Limited resources are everywhere. Even an e-coli needs to efficiently use limited resources. Indeed, I expect coherence theorems to say nontrivial things about an e-coli swimming around in search of food (and this includes the possibility that the nontrivial things the theorem says could turn out to be empirically wrong, which in turn would tell us nontrivial things about e-coli and/or selection pressures, and possibly point to better coherence theorems).
I agree that if you take any particular system and try to make predictions, the necessary assumptions (such as “what counts as a limited resource”) will often be easy and obvious and the coherence theorems do have content in such situations. It’s the abstract argument that feels flawed to me.
I somewhat expect your response will be “why would anyone be applying coherence arguments in such a ridiculously abstract way rather than studying a concrete system”, to which I would say that you are not in the intended audience.
----
Fwiw thinking this through has made me feel better about including it in the Alignment book than I did before, though I’m still overall opposed. (I do still think it is a good fit for other books.)
I’m curious: does this mean that you’re on board with the assumption in Ajeya’s report that 2020 algorithms and datasets + “business as usual” in algorithm and dataset design will scale up to strong AI, with compute being the bottleneck?
Yes, with the exception that I don’t know if compute will be the bottleneck (that is my best guess; I think Ajeya’s report makes a good case for it; but I could see it being other factors as well).
I think the case for it is basically “we see a bunch of very predictable performance lines; seems like they’ll continue to go up”. But more importantly I don’t know of any compelling counterpoints; the usual argument seems to be “but we don’t see any causal reasoning / abstraction / <insert property here> yet”, which I think is perfectly compatible with the scaling hypothesis (see e.g. this comment).
A more concrete version of the “lobotomized alien” hypothetical might be something like this
I see, that makes sense, and I think it does make sense as an intuition pump for what the “ML paradigm” is trying to do (though as you sort of mentioned I don’t expect that we can just do the motivation / cognition decomposition).
“research acceleration” seems like a much narrower task with a relatively well-defined training set of papers, books, etc. than “AI agent that competently runs a company”, so might still come first on those grounds.
Definitely depends on how powerful you’re expecting the AI system to be. It seems like if you want to make the argument that AI will go well by default, you need the research accelerator to be quite powerful (or you have to combine with some argument like “AI alignment will be easy to solve”).
I don’t think papers, books, etc are a “relatively well-defined training set”. They’re a good source of knowledge, but if you imitate papers and books, you get a research accelerator that is limited by the capabilities of human scientists (well, actually much more limited, since it can’t run experiments). They might be a good source of pretraining data, but there would still be a lot of work to do to get a very powerful research accelerator.
I should have said that I don’t see a path for language models to get selection pressure in the direction of being catastrophically deceptive like in the old “AI getting out of the box” stories, so I think we agree.
Fwiw I’m not convinced that we avoid catastrophic deception either, but my thoughts here are pretty nebulous and I think that “we don’t know of a path to catastrophic deception” is a defensible position.
Huh. I see a lot of the work I’m excited about as trying to avoid the Agency Hand-Off paradigm, and the Value Learning sequence was all about why we might hope to be able to avoid the Agency Hand-Off paradigm.
Your definition of corrigibility is the MIRI version. If you instead go by Paul’s post, the introduction goes:
I would like to build AI systems which help me:
Figure out whether I built the right AI and correct any mistakes I made
Remain informed about the AI’s behavior and avoid unpleasant surprises
Make better decisions and clarify my preferences
Acquire resources and remain in effective control of them
Ensure that my AI systems continue to do all of these nice things
…and so on
This seems pretty clearly outside of the Agency Hand-Off paradigm to me (that’s the way I interpret it at least). Similarly for e.g. approval-directed agents.
I do agree that assistance games look like they’re within the Agency Hand-Off paradigm, especially the way Stuart talks about them; this is one of the main reasons I’m more excited about corrigibility.
It seems weird to include a post in the book if we believe that it is misguided, just because people historically believed it. If I were making this book, I would not include such posts; I’d want an “LW Review” to focus on things that are true and useful, rather than historically interesting.
That being said, I haven’t thought much about the goals of the book, and if we want to include posts for the sake of history, then sure, include the post. That was just not my impression about the goal.
Is your concern that people will see this in the Alignment section, and not see the arguments against the connection, and continue to be misled?
I would have this concern, yes, but I’m happy to defer (in the sense of “not pushing”, rather than the sense of “adopting their beliefs as my own”) to the opinions of the people who have thought way more than me about the purpose of this review and the book, and have caused it to happen. If they are interested in including historically important essays that we now think are misguided, I wouldn’t object. I predict that they would prefer not to include such essays but of course I could be wrong about that.
the first links to a post which argues against VNM on the basis that it assumes probabilities and preferences are already in the model
I assume this is my comment + post; I’m not entirely sure what you mean here. Perhaps you mean that I’m not modeling the world as having “external” probabilities that the agent has to handle; I agree that is true, but that is because in the use case I’m imagining (looking at the behavior of an AI system and determining what it is optimizing) you don’t get these “external” probabilities.
I expect that if these two commenters read the full essay, and think carefully about how the theorems Yudkowsky is discussing differ from VNM, then their objections will look very different.
I assure you I read this full post (well, the Arbital version of it) and thought carefully about it before making my post; my objections remain. I discussed VNM specifically because that’s the best-understood coherence theorem and the one that I see misused in AI alignment most often. (That being said, I don’t know the formal statements of other coherence theorems, though I predict with ~98% confidence that any specific theorem you point me to would not change my objection.)
Yes, if you add in some additional detail about resources, assume that you do not have preferences over how those resources are used, and assume that there are preferences over other things that can be affected using resources, then coherence theorems tell you something about how such agents act. This doesn’t seem all that relevant to the specific, narrow setting which I was considering.
I agree that coherence arguments (including VNM) can be useful, for example by:
Helping people make better decisions (e.g. becoming more comfortable with taking risk)
Reasoning about what AI systems would do, given stronger assumptions than the ones I used (e.g. if you assume there are “resources” that the AI system has no preferences over).
Nonetheless, within AI alignment, prior to my post I heard the VNM argument being misused all the time (by rationalists / LWers, less so by others); this has gone down since then but still happens.
I think e.g. this talk is sneaking in the “resources” assumption, without arguing for it or acknowledging its existence, and this often misleads people (including me) into thinking that AI risk is implied by math based on very simple axioms that are hard to disagree with.
----
On the review: I don’t think this post should be in the Alignment section of the review, without a significant rewrite / addition clarifying why exactly coherence arguments are useful or important for AI alignment. As such I will vote against it.
I would however support it being part of the non-Alignment section of the review; as I’ve said before, I generally really like coherence arguments and they influence my own decision-making a lot (in fact, a big part of the reason I started working in AI alignment was thinking about the Arrhenius paradox, which has a very similar coherence flavor).
To the extent that your summary of the “pro” case is accurate, particularly “Epistemics are the most important thing”, I find it deeply ironic and sad that all of the commentary, besides one comment from Carl Shulman (and my own), seems to be about what people should do, rather than what is actually true. One would hope that people pushing “epistemics are the most important thing” would want to rely on true facts when pushing their argument.
I primarily take umbrage at the fact that the post makes claims I think are false without providing any evidence for them. I brought up the moralizing tone as an explanation for why it was popular despite making (what I think are) false claims with no evidence.
Fwiw, I really liked Rethinking Bias-Variance Trade-off for Generalization of Neural Networks (summarized in AN #129), and I think I’m now at “double descent is real and occurs when (empirical) bias is high but later overshadowed by (empirical) variance”. (Part of it is that it explains a lot of existing evidence, but another part is that my prior on an explanation like that being true is much higher than almost anything else that’s been proposed.)
I was pretty uncertain about the arguments in this post and the followup when they first came out. (More precisely, for any underlying claim, my opinion about the truth of that claim seemed nearly independent of my beliefs about double descent.) I’d be interested in seeing a rewrite based on the bias-variance trade-off explanation; my current guess is that the arguments won’t hold up, but I haven’t thought about it much.
Note the idea would have come up anyway; this is a pretty core belief of mine, and I believe I was the one to bring it up on the academia post (after which someone linked this post).
Well, to be more accurate, I actually believe a stronger claim than the one in this post, that focuses on imitation as the mechanism for unconscious incentive-following. (Though it looks like comments later made this connection, under the title “cultural learning”.)
(I still think this is a good post and will be voting for inclusion; just wanted to clarify impact)
I think I agree at least that many problems can be seen this way, but I suspect that other framings are more useful for solutions. (I don’t think I can explain why here, though I am working on a longer explanation of what framings I like and why.)
What I was claiming in the sentence you quoted was that I don’t see ontological shifts as a huge additional category of problem that isn’t covered by other problems, which is compatible with saying that ontological shifts can also represent many other problems.
Planned summary for the Alignment Newsletter:
This post identifies four different motivations for working on transparency:
1. By learning more about how current neural networks work, we can improve our forecasts for AI timelines.
2. It seems _necessary_ for inner alignment. In particular, whatever AI development model you take, it seems likely that there will be some possibility of emergent misbehavior, and there doesn’t yet seem to be a way to rule that out except via transparency.
3. A good solution to transparency would be _sufficient_ for safety, since we could at least notice when AI systems were misaligned, and then choose not to deploy them.
4. Even if AI will “go well by default”, there are still instrumental reasons for transparency, such as improving cause prioritization in EA (via point 1), and for making systems more capable and robust.
After reviewing work on <@circuits@>(@Thread: Circuits@), the post suggests a few directions for future research:
1. Investigating how modular neural networks tend to be,
2. Figuring out how to make transparency outputs more precise and less subjective,
3. Looking for circuits in other networks (i.e. not image classifiers), see e.g. <@RL vision@>(@Understanding RL Vision@),
4. Figuring out how transparency fits into an end-to-end story for AI safety.
Planned opinion:
I’m on board with motivations 2 and 3 for working on transparency (that is, that it directly helps with AI safety). I agree with motivation 4, but for a different reason than the post mentions. If by “AI goes well by default” we mean that AI systems are trying to help their users, that still leaves the AI governance problem. It seems that making these AI systems transparent would be significantly helpful; it would probably enable a “right to an explanation”, for example. I don’t agree with motivation 1 as much: if I wanted to improve AI timeline forecasts, there are a lot of other aspects I would investigate first. (Specifically, I’d improve estimates of inputs into <@this report@>(@Draft report on AI timelines@).) Part of this is that I am less uncertain than the author about the cruxes that transparency could help with, and so see less value in investigating them further.
Other comments (i.e. not going in the newsletter):
I feel like the most interesting parts of this post are exactly the parts I didn’t summarize. (I chose not to summarize them because they felt quite preliminary, where I couldn’t really extract a key argument or insight that I believed.) But some thoughts about them:
The “alien in a box” hypothetical made sense to me (mostly), but I didn’t understand the “lobotomized alien” hypothetical. I also didn’t see how this was meant to be analogous to machine learning. One concrete question: why are we assuming that we can separate out the motivational aspect of the brain? (That’s not my only confusion, but I’m having a harder time explaining other confusions.)
It feels like your non-agentic argument is too dependent on how you defined “AGI”. I can believe that the first powerful research accelerator will be limited to language, but that doesn’t mean that other AI systems deployed at the same time will be limited to language.
It seems like there’s a pretty clear argument for language models to be deceptive—the “default” way to train them is to have them produce outputs that humans like; this optimizes for being convincing to humans, which is not necessarily the same as being true. (However, it’s more plausible to me that the first such model won’t cause catastrophic risk, which would still be enough for your conclusions.)
Planned summary for the Alignment Newsletter:
This post gives an example scheme inspired by the previous post. Like [iterated amplification](https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd), it defines an ideal (analogous to <@HCH@>(@Humans Consulting HCH@)), and then an approximation to it that could be computed in practice.
Like HCH, we imagine a tree of systems that improve as we increase the depth of the tree. However, the nodes in the tree are question-answering (QA) _systems_, rather than direct questions. Given a few QA systems from a lower level, we construct a QA system at a higher level by asking one low-level QA system “what’s a good safe distribution over QA systems”, and a different low-level QA system “what’s a good metric that we can use to judge QA systems”. We then use [quantilization](https://intelligence.org/files/QuantilizersSaferAlternative.pdf) ([AN #48](https://mailchi.mp/3091c6e9405c/alignment-newsletter-48)) to select better-performing QA systems, without optimizing too hard and falling prey to Goodhart’s Law. In the infinite limit, this should converge to a stable equilibrium.
By having the tree reason about what good safe distributions are, and what good metrics are, we are explicitly improving the way that the AI system learns to interpret feedback (this is what the “good metric” is meant to evaluate), thus meeting the desiderata from the previous post.
To implement this in practice, we do something similar to iterated amplification. Iterated amplification approximates depth-limited HCH by maintaining a model that can answer _arbitrary_ questions (even though each node is a single question); similarly here we maintain a model that has a _distribution_ over QA systems (even though each node is a single QA system). Then, to sample from the amplified distribution, we sample two QA systems from the current distribution, ask one for a good safe distribution and the other for a good metric, and use quantilization to sample a new QA system given these ingredients. We use distillation to turn this slow quantilization process into a fast neural net model.
Considering the problem of <@inaccessible information@>(@Inaccessible information@), the hope is that as we amplify the QA system we will eventually be able to approve of some safe reasoning process about inaccessible information. If this doesn’t happen, then it seems that no human reasoning could approve of reasoning about that inaccessible information, so we have done as well as possible.
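(Not part of the summary, but to be concrete about what I understand quantilization to be doing in this scheme, here is a minimal approximate sketch rather than the post’s actual procedure: sample candidates from the base distribution, keep the top q fraction under the metric, and pick uniformly among those.)

```python
import random

def quantilize(base_sample, metric, q=0.1, n=1000):
    """Approximate q-quantilizer: draw n candidates from the base distribution,
    keep the top q fraction according to `metric`, and return one uniformly at
    random. q -> 0 approaches argmax (more optimization pressure, more Goodhart
    risk); q = 1 just reproduces the base distribution."""
    candidates = [base_sample() for _ in range(n)]
    candidates.sort(key=metric, reverse=True)
    top = candidates[:max(1, int(q * n))]
    return random.choice(top)

# Toy usage: "QA systems" are just numbers, and the metric is a noisy proxy.
base = lambda: random.gauss(0, 1)
noisy_metric = lambda x: x + random.gauss(0, 0.5)
print(quantilize(base, noisy_metric, q=0.1))
```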
Planned opinion (the first post is Learning Normativity):
**On feedback types:** It seems like the scheme introduced here is relying quite strongly on the ability of humans to give good process-level feedback _at arbitrarily high levels_. It is not clear to me that this is something humans can do: it seems to me that when thinking at the meta level, humans often fail to think of important considerations that would be obvious in an object-level case. I think this could be a significant barrier to this scheme, though it’s hard to say without more concrete examples of what this looks like in practice.
**On interaction:** I’ve previously <@argued@>(@Human-AI Interaction@) that it is important to get feedback _online_ from the human; giving feedback “all at once” at the beginning is too hard to do well. However, the idealized algorithm here does have the feedback “all at once”. It’s possible that this is okay, if it is primarily process-level feedback, but it seems fairly worrying to me.
**On desiderata:** The desiderata introduced in the first post feel stronger than they need to be. It seems possible to specify a method of interpreting feedback that is _good enough_: it doesn’t exactly capture everything, but it gets it sufficiently correct that it results in good outcomes. This seems especially true when talking about process-level feedback, or feedback one meta level up—as long as the AI system has learned an okay notion of “being helpful” or “being corrigible”, then it seems like we’re probably fine.
Often just making feedback uncertain can help. For example, in the preference learning literature, Boltzmann rationality has emerged as the model of choice for how to interpret human feedback. While there are several theoretical justifications for this model, I suspect its success is simply because it makes feedback uncertain: if you want to have a model that assigns higher likelihood to high-reward actions, but still assigns some probability to all actions, it seems like you end up choosing the Boltzmann model (or something functionally equivalent). Note that there is work trying to improve upon this model, such as by [modeling humans as pedagogic](https://papers.nips.cc/paper/2016/file/b5488aeff42889188d03c9895255cecc-Paper.pdf), or by <@incorporating a notion of similarity@>(@LESS is More: Rethinking Probabilistic Models of Human Behavior@).
So overall, I don’t feel convinced that we need to aim for learning at all levels. That being said, the second post introduced a different argument: that the method does as well as we “could” do given the limits of human reasoning. I like this a lot more as a desideratum; it feels more achievable and more tied to what we care about.
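(Side note, not for the newsletter: by the Boltzmann model I mean the standard softmax-over-reward likelihood, roughly as sketched below, where the beta parameter encodes how noisy the feedback is assumed to be.)

```python
import numpy as np

def boltzmann_likelihood(rewards, beta=1.0):
    """P(human chooses action a) proportional to exp(beta * R(a)): higher-reward
    actions are more likely, but every action keeps some probability mass,
    which is what makes the interpretation of feedback 'uncertain'."""
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage: three actions with rewards 1.0, 0.5, 0.0.
print(boltzmann_likelihood([1.0, 0.5, 0.0], beta=2.0))
```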
Planned summary for the Alignment Newsletter:
To build aligned AI systems, we need to have our AI systems learn what to do from human feedback. However, it is unclear how to interpret that feedback: any particular piece of feedback could be wrong; economics provides many examples of stated preferences diverging from revealed preferences. Not only would we like our AI system to be uncertain about the interpretation of any particular piece of feedback, we would also like it to _improve_ its process for interpreting human feedback. This would come from human feedback on the meta-level process by which the AI system learns. This gives us _process-level feedback_, where we make sure the AI system gets the right answers _for the right reasons_.
For example, perhaps initially we have an AI system that interprets human statements literally. Switching from this literal interpretation to a Gricean interpretation (where you also take into account the fact that the human chose to say this statement rather than other statements) is likely to yield improvements, and human feedback could help the AI system do this. (See also [Gricean communication and meta-preferences](https://www.alignmentforum.org/posts/8NpwfjFuEPMjTdriJ/gricean-communication-and-meta-preferences), [Communication Prior as Alignment Strategy](https://www.alignmentforum.org/posts/zAvhvGa6ToieNGuy2/communication-prior-as-alignment-strategy), and [multiple related CHAI papers](https://www.alignmentforum.org/posts/zAvhvGa6ToieNGuy2/communication-prior-as-alignment-strategy?commentId=uWBFsKK6XFbL4Hs4z).)
Of course, if we learn _how_ to interpret human feedback, that too is going to be uncertain. We can fix this by “going meta” once again: learning how to learn to interpret human feedback. Iterating this process we get an infinite tower of “levels” of learning, and at every level we assume that feedback is not perfect and the loss function we are using is also not perfect.
In order for this to actually be feasible, we clearly need to share information across these various “levels” (or else it would take infinite time to learn across all of the levels). The AI system should not just learn to decrease the probability assigned to a single hypothesis, it should learn what _kinds_ of hypotheses tend to be good or bad.
Planned opinion (the second post is Recursive Quantilizers II):
(Same opinion text as given above with the Recursive Quantilizers II summary; the opinion covers both posts.)
The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn’t do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the “length” of a chain of defeaters can be super-polynomially large.) So I don’t think my argument is proving too much.
(The tree could be infinite if you don’t have an assumption that guarantees termination somehow, hence my caveats about termination. WFC should probably ask for the existence of a finite tree.)
For the actual argument, I’ll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing.
No, I am in fact asserting that given the two assumptions, all questions are answerable by (potentially super-polynomially large) verifiable trees (again assuming we deal with termination somehow).
I think it differs based on what assumptions you make on the human judge, so there isn’t a canonical version of it. In this case, the assumption on the human judge is that if the subanswers they are given are true, then they never verify an incorrect overall answer. (This is different from the “defeaters” assumption you have, for which I’d refer to the argument I gave above.)
Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium.
Argument: By WFC, we assume there is a finite tree T that can be verified. The first player then has the following strategy: take the question under consideration (initially the original question; later it is whatever subquestion the opponent is disputing). Report “the answer is A, which is true because the answer to subquestion 1 is A1 and the answer to subquestion 2 is A2”.
The opponent will always have to recurse into one of the subclaims (or concede). This brings us one step closer to leaf nodes. Eventually (if the opponent never concedes), we get to a leaf node which the judge then verifies in favor of the honest first player. ∎
Corollary: For the first player, honesty is an equilibrium policy.
Argument: By the claim above, the first player can never do any better than honesty (you can’t do better than always winning).
In a simultaneous-play unlimited-length debate, a similar argument implies at least a 50-50 chance of winning via honesty, which must be the minimax value (since the game is symmetric and zero-sum), and therefore honesty is an equilibrium policy.
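(A toy simulation of the strategy in the argument above, with a made-up finite verifiable tree standing in for WFC; since the depth strictly decreases, the honest first player wins no matter which subclaim the opponent disputes.)

```python
# Toy "verifiable tree" standing in for WFC: internal nodes can be verified by
# the judge given their subquestions' answers; leaves can be verified directly.
tree = {
    "Q":   {"answer": "A",   "children": ["Q1", "Q2"]},
    "Q1":  {"answer": "A1",  "children": ["Q1a", "Q1b"]},
    "Q2":  {"answer": "A2",  "children": []},   # leaf
    "Q1a": {"answer": "A1a", "children": []},   # leaf
    "Q1b": {"answer": "A1b", "children": []},   # leaf
}

def debate(question, opponent_policy):
    """Honest first player reports the node's answer plus subanswers; the
    opponent either concedes (returns None) or recurses into one subquestion.
    Depth strictly decreases, so we terminate at a leaf, which the judge
    verifies in the honest player's favor."""
    node = tree[question]
    if not node["children"]:
        return "honest player wins"          # judge verifies the leaf
    disputed = opponent_policy(question, node["children"])
    if disputed is None:
        return "honest player wins"          # opponent concedes
    return debate(disputed, opponent_policy)

# Whatever the opponent disputes (or if it concedes), the honest player wins.
print(debate("Q", lambda q, kids: kids[0]))
print(debate("Q", lambda q, kids: kids[-1]))
print(debate("Q", lambda q, kids: None))
```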
----
Once you go to finite-length debates, then things get murkier and you have to worry about arguments that are too long to get to leaf nodes (this is essentially the computationally bounded version of the termination problem). The version of WFC that would be needed is “for every question Q, there is a verifiable tree T of depth at most N showing that the answer is A”; that version of WFC is presumably not true.