My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering “consequentialist reasoning”).
VojtaKovarik
AI Safety Debate and Its Applications
Deconfuse Yourself about Agency
Redefining Fast Takeoff
That is, I can easily see how factored cognition allows you to stick to cognitive strategies that definitely solve a problem in a safe way, but don’t see how it does that and allows you to develop new cognitive strategies to solve a problem that doesn’t result in an opening for inner optimizers—not within units, but within assemblages of units.
Do you have some intuition for how inner optimizers would arise within assemblages of units, without being initiated by some unit higher in the hierarchy? Or is that what you are pointing at?
(I don’t have much experience thinking in these terms, so maybe the question is dumb/already answered in the post. But anyway: )
Do you have some more-detailed (and stupidly explicit) examples of the intentional and algorithmic views on the same thing, and how to translate between them?
Yep, that totally makes sense.
Observations inspired by your comment: While this shouldn’t necessarily be so, it seems the particular formulations make a lot of difference when it comes to exchanging ideas. If I read your comment without the
(although maybe “intelligence” would be a better word?)
bracket, I immediately go “aaa, this is so wrong!”. And if I substitute “intelligent” for “agent”, I totally agree with it. Not sure whether this is just me, or whether it generalizes to other people.
More specifically, I agree that from the different concepts in the vicinity of “agency”, “the ability to learn the environment and exploit this knowledge towards a certain goal” seems to be particularly important to AI alignment. I think the word “agency” is perhaps not well suited for this particular concept, since it comes with so many other connotations. But “intelligence” seems quite right.
I am not even sure what the input/output channels of a rock are supposed to be
I guess you imagine that the input is the physical forces affecting the ball and the output is the forces the ball exerts on the environment. Obviously, this is very much not useful for anything. But it suddenly becomes non-trivial if you consider something like the billiard-ball computer (seems like a theoretical construct, not sure if anybody actually built one...but it seems like a relevant example anyway).
I have a sense that (formalized) versions of A(Θ)-morphism are going to be more useful (or easier?) for the behavioral side, though it isn’t really clear.
I think A(Θ)-morphisation is primarily useful for describing what we often mean when we say “agency”. In particular, I view this as distinct from which concepts we should be thinking about in this space. (I think the promising candidates include learning that Vanessa points to in her comment, optimization, search, and the concepts in the second part of my post.)
However, I think it might also serve as a useful part of the language for describing (non) agent-like behavior. For example, we might want to SGD-morphise an E. coli bacterium independently of whether it actually implements some form of stochastic gradient descent w.r.t. the concentration of some chemicals in the environment.
You mention the distinction between agent-like architecture and agent-like behavior (which I find similar to my distinction between selection and control), but how does the concept of A(Θ)-morphism account for this distinction?
I think of agent-like architectures as something objective, or related to the territory. In contrast, agent-like behavior is something subjective, something in the map. Importantly, agent-like behavior, or the lack of it, of some X is something that exists in the map of some entity Y (where often Y ≠ X).
The selection/control distinction seems related, but not quite similar to me. Am I missing something there?
Intuitively, I agree that the vacation question is under-defined / has too many “right” answers. On the other hand, I can also imagine a world where you can develop some objective fun theory, or just something which actually makes the questions well-posed. And the AIs could use this fact in the debate:
Bob: “Actually, you can derive a well-defined fun theory and use it to answer this question. And then Bali clearly wins.”
Alice: “There could never be any such thing!”
Bob: “Actually, there indeed is such a theory, and its central idea is [...].”
[They go on like this for a bit, and eventually, Bob wins.]
Indeed, this seems like a thing you could do (by explaining that integration is a thing) if somebody tried to convince you that there is no principled way to measure the area of a circle.
However, if true, this only shows that there are fewer under-defined questions than we think. The “Ministry of Ambiguity versus the Department of Clarity” fight is still very much a thing, as are the incentives to manipulate the human. And perhaps most importantly, routinely holding debates where the AI “explains to you how to think about something” seems extremely dangerous...
I agree with your summary :). The claim was that humans often predict behavior by assuming that something has a particular architecture.
(And some confusions about agency seem to appear precisely because of not making the architecture/behavior distinction.)
First off, while I feel somewhat de-confused about X-like behavior, I don’t feel very confident about X-like architectures. Maybe the meaning is somewhat clear on higher levels of abstraction (e.g., if my brain goes “realize I want to describe a concept --> visualize several explanations and judge each for suitability --> pick the one that seems the best --> send a signal to start typing it down”, then this would be a kind of search/optimization-thingy). But on the level of physics, I don’t really know what an architecture means. So take this with a grain of salt.
Maybe the term “physical structure” is misleading. The thing I was trying to point at is the distinction between being able to accurately model Y using model X, and Y actually being X. In the sense that there might be a giant look-up table (GLUT) that accurately predicts your behavior, but on no level of abstraction is it correct to say that you actually are a GLUT. Whereas modelling you as having some goals, planning, etc. might be less accurate but somewhat more, hm, true. I realize this isn’t very precise, but I guess you can see what I mean.
That being said, I suppose that what I meant by “optimization architecture” is, for example, a stochastic gradient descent with the emphasis on “this is the input”, “this is the part of the algorithm that does the calculation”, and “this is the output”. An “implementation of an optimization architecture” would be...well, the atoms of your computer that perform SGD, or maybe some simple bacterium that moves in the direction where the concentration of whatever-it-likes is the highest (not that anything I know would implement precisely SGD, but still).
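To make the “input / calculation / output” reading concrete, here is a minimal sketch of the bacterium example as plain gradient ascent in one dimension (deterministic, so not SGD in the strict sense). The concentration profile, the function names, and all parameters are hypothetical and chosen only for illustration; nothing here is meant as a model of a real organism.

```python
import math

def concentration(x: float) -> float:
    """Hypothetical concentration of whatever-it-likes, peaked at x = 2.0."""
    return math.exp(-(x - 2.0) ** 2)

def climb(x: float, steps: int = 500, step_size: float = 0.1, eps: float = 1e-4) -> float:
    """Repeatedly step in the direction where the sensed concentration increases."""
    for _ in range(steps):
        # Input: concentration sensed slightly to either side of the current position.
        left, right = concentration(x - eps), concentration(x + eps)
        # Calculation: central finite-difference estimate of the local gradient.
        grad = (right - left) / (2 * eps)
        # Output: a move proportional to the estimated gradient.
        x += step_size * grad
    return x

final_position = climb(0.0)
```

Starting from x = 0.0, the loop ends up very close to the peak at x = 2.0. The same trajectory could in principle be produced by a look-up table from positions to moves, which is the behavior/architecture distinction in miniature.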
Ad “interesting physical structure” behind the ant-colony: If by “evolution” we mean the atoms that the world is made of, as they changed over time until your ant colony emerged...then yeah, this is a physical structure causally upstream of the ant colony, and one that is responsible for the ant colony behaving the way it does. I wouldn’t say it is interesting (to me, and w.r.t. the ant colony) though, since it is totally incomprehensible to me. (But maybe “interestingness” doesn’t really make sense on the level of physics, and is only relevant in relation to our abstract world-models and their understanding.)
Finally, the ideal thing a “X-like behavior ==> Y-like architecture” theorem would cash out into is a criterion that you can actually check and say with certainty that the thing will not exhibit X-like behavior. (Whether this is reasonable to hope for is another matter.) So, even if all that I have written in this comment turns out to be nonsense, getting such a criterion is what we are after :-).
New paper: (When) is Truth-telling Favored in AI debate?
I guess on first reading, you can cheat by reading the introduction, Section 2 right after that, and the conclusion. One level above that is reading the text but skipping the more technical sections (4 and 5). Or possibly reading 4 and 5 as well, but only focusing on the informal meaning of the formal results.
Regarding the background knowledge required for the paper: It uses some game theory (Nash equilibria, extensive form games) and probability theory (expectations, probability measures, conditional probability). Strictly speaking, you can get all of this from looking up the relevant keywords on Wikipedia. I think that all of the concepts used there are basic in the corresponding fields, and in particular no special knowledge of measure theory is required. However, I studied both game theory and measure theory, so I am biased, and you shouldn’t trust me. (Moreover, there is a difference between “strictly speaking, only this is needed” and “my intuitions are informed by X, Y, and Z”.)
Another thing is that the AAAI workshop where this will appear has a page limit, which means that some explanations might have gotten less space than they would deserve. In particular, the arguments in Section 4 are much easier to digest if you can draw the functions that the text talks about. To understand the formal results, I think I visualized two-dimensional slices of the “world space” (i.e., squares), and assumed that the value of the function is 0 by default, except for being 1 at some selected subset of the square. This allows you to compute all the expectations and conditionals visually.
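The visualization described above can also be sketched numerically. The snippet below is only an illustration under assumptions that are not taken from the paper: the two-dimensional slice is the unit square discretized into a grid, the prior over worlds is uniform, and the function f (a hypothetical example) equals 1 on the lower-left quarter and 0 elsewhere. Expectations then reduce to areas, and conditionals to areas within a sub-rectangle.

```python
n = 100
# Midpoints of an n-by-n grid over the unit square; each cell has equal prior weight.
ticks = [(i + 0.5) / n for i in range(n)]
cells = [(w1, w2) for w1 in ticks for w2 in ticks]

def f(w1: float, w2: float) -> float:
    """Hypothetical function: 1 on the lower-left quarter of the square, 0 elsewhere."""
    return 1.0 if (w1 < 0.5 and w2 < 0.5) else 0.0

def expectation(condition=lambda w1, w2: True) -> float:
    """Average of f over the cells satisfying `condition` (uniform prior assumed)."""
    selected = [(w1, w2) for (w1, w2) in cells if condition(w1, w2)]
    return sum(f(w1, w2) for (w1, w2) in selected) / len(selected)

prior_mean = expectation()                                    # area where f = 1
mean_given_low_w1 = expectation(lambda w1, w2: w1 < 0.5)      # E[f | w1 < 0.5]
mean_given_high_w1 = expectation(lambda w1, w2: w1 >= 0.5)    # E[f | w1 >= 0.5]
```

With this choice of f, the unconditional expectation is 0.25, learning that the first coordinate is below 0.5 raises it to 0.5, and learning the opposite drops it to 0.0, which matches what one reads off the picture of the square.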
+1
(Just noticed your comment for the other debate post/paper. I will reply to it during the weekend.)
Thank you for the comments!
A quick reaction to the truth-seeking definition: When writing the definition (of truth-promotion), I imagined a (straw) scenario where I am initially uncertain about what the best answer is—perhaps I have some belief, but upon reflection, I put little credence in it. In particular, I wouldn’t be willing to act on it. Then I run the debate, become fully convinced that the debate’s outcome is the correct answer, and act on it.
The other story seems also valid: you start out with some belief, update it based on the debate, and you want to know how much the debate helped. Which of the two options is better will, I guess, depend on the application in mind.
“I’d be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space,”
To dissolve a possible confusion: By “claims about a space of questions” you mean “a claim about every question from a space of questions”? Would this mean that the agents would commit to many claims at once (possibly more than the human judge can understand at once)? (Something I recall Beth Barnes suggesting.) Or do you mean that they would make a single “meta” claim, understandable by the judge, that specified many smaller claims (e.g., “for any meal you ask me to cook, I will be able to cook it better than any of my friends”; horribly false, btw.)?
Anyway, yeah, I agree that this seems promising. I still don’t know how to capture the relations between different claims (which I somehow expect to be important if we are to prove some guarantees for debate).
I agree with your high-level points regarding the feature debate formalization. I should clarify one thing that might not be apparent from the paper: the message of the counterexamples was meant to be “these are some general issues which we expect to see in debate, and here is how they can manifest in the feature debate toy model”, rather than “these specific examples will be a problem in general debates”. In particular, I totally agree that the specific examples immediately go away if you allow the agents to challenge each other’s claims. However, I have an intuition that even with other debate protocols, similar general issues might arise with different specific examples.
For example, I guess that even with other debate protocols, you will be “having a hard time when your side requires too difficult arguments”. I imagine there will always be some maximum “inferential distance that a debater can bridge” (with the given judge and debate protocol). And any claim which requires more supporting arguments than this will be a lost cause. What will such an example look like? Without a specific debate design, I can’t really say. Either way, if true, it becomes important whether you will be able to convincingly argue that a question is too difficult to explain (without making this a universal strategy even in cases where it shouldn’t apply).
A minor point:
“If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised.”
I agree with your point here—debate being wrong in a very unlikely world is not a bug. However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior. So the claim should be “rational judges can have unstable debates in unlikely worlds” and “biased judges can have unstable debates even in typical worlds”.
A side-note:
Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.
Can’t remember the specific reference but: Imperfect-information game theory has some research on abstractions. Naturally, objects of interest are “optimal” abstractions, i.e., ones that are as small as possible for given accuracy, or as accurate as possible for given size. However, there are typically some negative results, stating that getting (near-) optimal abstractions is at least as expensive as finding the (near-) optimal solution of the full game. Intuitively, I would expect this to be a recurring theme for abstractions in general.
The implication of this is that all the goals should implicitly have the caveat that the maps have to be “not-too-expensive to construct”. (This is intended to be a side-note, not a call to change the formulation. The one you have there is accessible and memorable :-).)
AI Services as a Research Paradigm
I agree with your points in the suggested summary. However, I feel like they are not fully representative of the text. But, as the author, I might be imagining the version of the document in my head rather than the one I actually wrote :-).
My estimate is that after reading it, I would gain the impression that the text revolves around the abstract model. Which I thought wasn’t the case; definitely wasn’t the intention.
Also, I am not sure if it is intended that your summary doesn’t mention the examples and the “classifying research questions” subsection (which seems equally important to me as the list it generates).
Finally, from your planned opinion, I might get the impression that the text suggests no technical problems at all. I think that some of them either are technical problems (e.g., undesired appearance of agency, preventing error propagation and correlated failures, “Tools vs Agents” in Section 6) or have important technical components (all the problems listed as related to changes in the environment, system, or users). Although whether these are AI-specific is arguable.
Side-note 1: I also think that most of the classical AI safety problems also appear in systems of AI services (either in individual services, or in “system-wide variants”). But this is only mentioned in the text briefly, since I am not yet fully clear on how to do the translation between agent-like AIs and systems of AI services. (Also, on the extent to which such translation even makes sense.)
Side-note 2: I imagine that many “non-AI problems” might become “somewhat-AI problems” or even “problems that AI researchers need to deal with” once we get enough progress in AI to automate the corresponding domains.
Fixed the wrong section numbers and frame problem description.
Informally, we can assume that some description of the world is given by context and view a task as something specified by an initial state and an end state (or states) - accomplishing the task amounts to causing a transformation from the starting state to one of the desired end states.
I feel like this definition is not capturing what I mean by a “task”. Many “agent-like” things, such as “become supreme ruler of the world”, seem like tasks according to this definition; many useless things like “twitching randomly” can be thought of as completing a “task” as defined here and so would be counted as “services”.
Could it be that the problem is not in the “task” part but in the definition of a service? If I consider the task of building me a house that I will like, I can envision a very service-like way of doing that (ask me a bunch of routine questions, select house-model correspondingly, then proceed to build it in a cook-book manner by calling on other services). But I can also imagine going about this in a very agent-like manner.
(Also, “twitching randomly” seems like a perfectly valid task, and a twitch-bot as a perfectly valid service. Just a very stupid one that nobody would want to build or pay for. Uhm, probably. Hopefully.)
I agree with Lanrian. A perhaps better metric is the chance that randomly selected pixels of a randomly selected image will cause the judge to guess the label correctly. This corresponds to “judge accuracy (random pixels)” in Table 2 of the original paper, and it’s 48.2% and 59.4% for 4 and 6 pixels, respectively.