By “enemy” I meant the hypothetical terrorist in the “some terrorist group likes to eat babies” example.
I’m very confused about what you’re perceiving here, so I think some very severe miscommunication has occurred. Did you accidentally respond to a different comment than you thought?
I probably won’t get to that soon, but I’ll put it on the list.
I also want to say that I’m sorry for kicking off this giant tangential thread on your post. I know this sort of thing can be a disincentive to write in the future, so I want to explicitly say that you’re a good writer, this was a piece worth reading, and I would like to read more of your posts in the future.
This is a really good point and a great distinction to make.
As an example, suppose I hear a claim that some terrorist group likes to eat babies. Such a claim may very well be true. On the other hand, it’s the sort of claim which I would expect to hear even in cases where it isn’t true. In general, I expect to hear claims of the form “<enemy> is/wants/does <evil thing>” regardless of whether those claims have any basis.
Now, clearly looking into the claim is an all-around solid solution, but it’s also an expensive solution—it takes time and effort. So a reasonable question to ask is: should the burden of proof be on the writer or the critic? One could imagine a community norm where that sort of statement needs to come with a citation, or a community norm where it’s the commenters’ job to prove it wrong. I don’t think either of those standards is a good idea, because both of them require the expensive work to be done. There’s a correct Bayesian update whether or not the work of finding a citation is done, and community norms should work reasonably well whether or not the work is done.
A norm which makes more sense to me: there’s nothing wrong with writers occasionally dropping conflict-theory-esque claims. But readers should be suspicious of such claims a-priori, and just as it’s reasonable for authors to make the claim without citation, it’s reasonable for readers to question the claim on a-priori grounds. It makes sense to say “I haven’t specifically looked into whether <enemy> wants <evil thing>, but that sounds suspicious a-priori.”
I generally endorse this line of reasoning.
If you’re looking to improve your matrix calculus skills, I specifically recommend practicing tensor index notation and the Einstein summation convention. It will make neural networks much more pleasant, especially recurrent nets. (This may have been obvious already, but it’s sometimes tough to tell what’s useful when learning a subject.)
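For concreteness, here’s a minimal NumPy sketch of the convention (my own illustration; the shapes and names are just for the example): in np.einsum, any index repeated across inputs is summed over, which maps directly onto tensor index notation.

```python
import numpy as np

# Einstein summation: a repeated index is implicitly summed over.
# A dense layer in index notation, y_i = sum_j W_ij x_j:
W = np.random.randn(4, 3)
x = np.random.randn(3)
y = np.einsum('ij,j->i', W, x)           # same result as W @ x

# A batched recurrent-style update, h'_bi = sum_j R_ij h_bj,
# where b is a batch index that is carried through, not summed:
R = np.random.randn(4, 4)
h = np.random.randn(8, 4)
h_next = np.einsum('ij,bj->bi', R, h)    # same result as h @ R.T
```

Once the index strings become readable at a glance, writing and checking update equations for recurrent nets gets much less painful.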
It does sound like our disagreement is the same thing outlined in Realism about Rationality (although I disagree with almost all of the “realism about rationality” examples in that post—e.g. I don’t think AGI will necessarily be an “agent”, I don’t think Turing machines or Kolmogorov complexity are useful foundations for epistemology, I’m not bothered by moral intuitions containing contradictions, etc).
I would also describe my “no proofs ⇒ doomed” view, not as the proofs being causally important, but as the proofs being evidence of understanding. If we don’t have the proofs, it’s highly unlikely that we understand the system well enough to usefully predict whether it is safe—but the proofs themselves play a relatively minor role.
I do not know of any engineering discipline which rests most of its confidence in safety on comprehensive, expensive testing. Every engineering discipline I have ever studied starts from understanding the system under design and the principles which govern its function, then designs a system which is expected to be safe based on that understanding. As long as those underlying principles are understood, the most likely errors are either simple mistakes (e.g. a metric/standard units mixup) or missing some fundamental phenomenon (e.g. the aerodynamics of a bridge). Those are the sorts of problems which testing is good at catching. Testing is a double-check that we haven’t missed something critical; it is not the primary basis for thinking the system is safe.
A simple example, in contrast to AI: every engineering discipline I know of uses “safety factors”—i.e. make a beam twice as strong as it needs to be, give a wire twice the current capacity it needs, etc. A safety factor of 2 is typical in a wide variety of engineering fields. In AI, we cannot use safety factors because we do not even know what number we could double to make the AI more safe. Today, given any particular aspect of an AI system, we do not know whether adjusting any particular parameter will make the AI more or less reliable/risky.
Three possibly-relevant points here.
First, when I say “proof-level guarantees will be easy”, I mean “team of experts can predictably and reliably do it in a year or two”, not “hacker can do it over the weekend”.
Second, suppose we want to prove that a sorting algorithm always returns sorted output. We don’t do that by explicitly quantifying over all possible outputs. Rather, we do that using some insights into what it means for something to be sorted—e.g. expressing it in terms of a relatively small set of pairwise comparisons. Indeed, the insights needed for the proof are often exactly the same insights needed to design the algorithm. Once you’ve got the insights and the sorting algorithm in hand, the proof isn’t actually that much extra work, although it will still take some experts chewing on it a bit to make sure it’s correct.
That’s the sort of thing I expect to happen for friendly AI: we are missing some fundamental insights into what it means to be “aligned”. Once those are figured out, I don’t expect proofs to be much harder than algorithms. Coming back to the “see whether the AI runs a check for whether it can deceive humans” example, the proof wouldn’t involve writing the checker and then quantifying over all possible inputs. Rather, it would involve writing the AI in such a way that it always passes the check, by construction—just like we write sorting algorithms so that they will always pass an is_sorted() check by construction.
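To make the sorting analogy concrete, here’s a minimal sketch (my own illustration): the sortedness property reduces to adjacent pairwise comparisons, and an insertion sort maintains sortedness as an invariant, so it passes the check by construction rather than by quantifying over all inputs.

```python
def is_sorted(xs):
    # Sortedness reduces to len(xs) - 1 adjacent pairwise comparisons;
    # no need to quantify over all possible orderings.
    return all(a <= b for a, b in zip(xs, xs[1:]))

def insertion_sort(xs):
    out = []
    for x in xs:
        # Invariant: out is sorted before and after every insertion,
        # so the final output passes is_sorted() by construction.
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

assert is_sorted(insertion_sort([3, 1, 4, 1, 5, 9, 2, 6]))
```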
Third, continuing from the previous point: the question is not how hard it is to prove compared to test. The question is how hard it is to build a provably-correct algorithm, compared to an algorithm which happens to be correct even though we don’t have a proof.
I mentioned that I expect proof-level guarantees will be easy once the conceptual problems are worked out. Strong interpretability is part of that: if we know how to “see whether the AI runs a check for whether it can deceive humans”, then I expect systems which provably don’t do that won’t be much extra work. So we might disagree less on that front than it first seemed.
The question of whether to model the AI as an open-ended optimizer is one I figured would come up. I don’t think we need to think of it as truly open-ended in order to use any of the above arguments, especially the wish-granting analogy. The relevant point is that limited optimization implies limited wish-granting ability. In order to grant more “difficult” wishes, the AI needs to steer the universe into a smaller chunk of state-space; in other words, it needs to perform stronger optimization. So AIs with limited optimization capability will be safer to exactly the extent that they are unable to grant unsafe wishes—i.e. the chunks of state-space which they can access just don’t contain really bad outcomes.
Still unsafe, in both cases.
The second case is simpler. Think about it in analogy to a wish-granting genie/demon: if we have some intuitive argument that our wish-contract is safe and a few human-designed tests, do we really expect it to have no loopholes exploitable by the genie/demon? I certainly wouldn’t bet on it. The problem here is that the AI is smarter than we are, and can find loopholes we will not think of.
The first case is more subtle, because most of the complexity is hidden under a human-intuitive abstraction layer. If we had an unaligned genie/demon and said “I wish for you to passively study me for a year, learn what would make me most happy, and then give me that”, then that might be a safe wish—assuming the genie/demon already has an appropriate understanding of what “happy” means, including things like long-term satisfaction etc. But an AI will presumably not start with such an understanding right out of the gate. Abstractly, the AI can learn its optimization target, but in order to do that it needs a learning target—the thing it’s trying to learn. And that learning target is itself what needs to be aligned. If we want the AI to learn what makes humans “happy”, in a safe way, then whatever it’s using as a proxy for “happiness” needs to be a safe optimization target.
On a side note, Yudkowsky’s “The Hidden Complexity of Wishes” is in many ways a better explanation of what I’m getting at. The one thing it doesn’t explain is how “more powerful” in the sense of “ability to grant more difficult wishes” translates into a more powerful optimizer. But that’s a pretty easy jump to make: wishes require satisficing, so we use the usual approach of a two-valued utility function.
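To spell out that jump (my gloss, not Yudkowsky’s): a wish picks out a set $W$ of acceptable world-states, and satisficing the wish is equivalent to maximizing the two-valued utility

$$U(x) = \begin{cases} 1 & \text{if } x \in W \\ 0 & \text{otherwise.} \end{cases}$$

A “more difficult” wish is one where $W$ is a smaller chunk of state-space, so maximizing $U$ requires steering the universe into a smaller region, i.e. stronger optimization.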
I believe the empirical claim. As I see it, the main issue is Goodhart: an AGI is probably going to be optimizing something, and open-ended optimization tends to go badly. The main purpose of proof-level guarantees is to make damn sure that the optimization target is safe. (You might imagine something other than a utility-maximizer, but at the end of the day it’s either going to perform open-ended optimization of something, or be not very powerful.)
The best analogy here is something like an unaligned wish-granting genie/demon. You want to be really careful about wording that wish, and make sure it doesn’t have any loopholes.
I think the difficulty of getting those proof-level guarantees is more conceptual than technical: the problem is that we don’t have good ways to rigorously express many of the core ideas, e.g. the idea that physical systems made of atoms can “want” things. Once the core problems of embedded agency are resolved, I expect the relevant guarantees will not be difficult.
I get what you’re saying about theories vs theorists. I agree that there are plenty of people who hold conflict theories about some things but not others, and that there are multiple reasons for holding a conflict theory.
None of this changes the original point: explaining a problem by saying someone is evil is still a mind-killer. Treating one’s own arguments as soldiers is still a mind-killer. Holding a conflict theory about any particular situation is still a mind-killer, at least to the extent that we’re talking about conflict theory in the form of “bad thing happens because of this bad person” as opposed to “this person’s incentives are misaligned”. We can explain other people’s positions by saying they’re using a conflict theory, and that has some predictive power, but we should still expect those people to usually be mind-killed by default—even if their arguments happen to be correct.
As you say, explaining Calhoun and Buchanan’s use of public choice theory as entirely a rationalization for their political goals is a conflict theory. Saying that people bring up public choice theory not due to differing economic understanding but due to differing political goals is a conflict theory. And I expect people using either of those explanations to be mind-killed by default, even if the particular interpretation were correct.
Even after all this discussion of theories vs theorists, “conflict theory = predictably wrong” still seems like a solid heuristic.
This explanation loses predictive power compared to the explanation I gave above. In particular, if we think of conflict theory as “bad things happen because of bad people”, then it’s clear why conflict theorists would think public choice theory makes black people worse off rather than better off. In your explanation, we need that as an additional assumption.
Let’s imagine for a minute that we didn’t know any of the background, and just think about what we might have predicted ahead of time.
Frame 1: conflict theory is characterized by the idea that problems mostly come from people following their own self-interest. Not knowing anything else, what do we expect conflict theorists to think about public choice theory—a theory whose central premise is modeling public servants as following their own self-interests/incentives? Like, the third sentence of the Wikipedia article is “it is the subset of positive political theory that studies self-interested agents (voters, politicians, bureaucrats) and their interactions”.
If conflict theory is about problems stemming from people following their self-interest, public choice theory ought to be right up the conflict theorist’s alley. This whole “meta-level conflict” thing sounds like a rather contrived post-hoc explanation; a-priori there doesn’t seem to be much reason for all this meta stuff. And conflict theorists in practice seem to be awfully selective about when to go meta, in a way that we wouldn’t predict just based on “problems mostly stem from people following their self-interest”.
On the other hand...
Frame 2: conflict theory is characterized by the idea that bad things mostly happen because of bad people, and the solution is to punish them. In this frame, what would we expect conflict theorists to think of public choice theory?
Well, we’d expect them to dismiss it as obviously wrong—it doesn’t denounce any bad people—and therefore probably also an attempt by bad people to steer things the way they want.
If conflict theory is characterized by “bad things happen because of bad people”, then an article about how racism secretly underlies public choice theory is exactly the sort of thing we’d predict.
After reading this and the comments you linked, I think people mean several different things by conflict/mistake theory.
I mostly think of conflict theory as a worldview characterized by (a) assuming that bad things mostly happen because of bad people, and (b) assuming that the solution is mostly to punish them and/or move power away from them. I think of mistake theory as a worldview characterized by assuming that people do not intend to be evil (although they can still have bad incentives). I see mechanism design as the prototypical mistake theory approach: if people are misaligned, then restructure the system to align their incentives. It’s a technical problem, and getting angry at people is usually unhelpful.
In the comment thread you linked, Scott characterizes conflict theory as “the main driver of disagreement is self-interest rather than honest mistakes”. That view matches up more with the example you give: the mistake theorist assumes that people have “good” intent, and if you just explain that their actions are harmful, then they’ll stop. Under this interpretation, mechanism design is conflict-theory-flavored; it’s thinking of people as self-interested and then trying to align them anyway.
(I think part of the confusion is that some people are coming in with the assumption that acting in self-interest is automatically bad, and others are coming in with more of an economic/game theory mindset. Like, from an economic viewpoint, there’s no reason why “the main driver of disagreement is self-interest” would lead to arguing that public choice theory is racist, which was one of Scott’s original examples.)
So I guess one good question to think about is: how do we categorize mechanism design? Is it conflict, is it mistake, is it something else? Different answers correspond to different interpretations of what “conflict” and “mistake” theory mean. I’m pretty sure my interpretation is a much better fit to the examples and explanations in Scott’s original post on the topic, and it seems like a natural categorization to me. On the other hand, it also seems like there’s another natural category of naive-mistake-theorists who just assume honest mistakes, as in your Bob-Charlie example, and apparently some people are using the terms to capture that category.
Personally, my view is that mechanism design is more or less always the right way to think about these kinds of problems. Sometimes that will lead to the conclusion that someone is making an honest mistake, sometimes it will lead to the conclusion that punishment is an efficient strategy, and often it will lead to other conclusions.
Mechanism design is, to a large extent, a conflict theory
I would say that mechanism design is how mistake theorists respond to situations where conflict theory is relevant—i.e., where there really is a “bad guy”. Mechanism design is not about “what consequences should happen to different agents”; it’s about designing a system to achieve a goal using unaligned agents. “Consequences” are just one tool in the toolbox, and mechanism design (and mistake theory) is perfectly happy to use other tools as well.
the main thesis is that power allows people to avoid committing direct crime while having less-powerful people commit those crimes instead … This is a denotative statement that can be evaluated independent of “who should we be angry at”.
There’s certainly a denotative idea in the OP which could potentially be useful. On the other hand, saying “the post has a few sentences about moral blame” seems like a serious understatement of the extent to which the OP is about who to be angry at.
in some cases “who we should be angry at” if that’s the best available implementation
The OP didn’t talk about any other possible implementations, which is part of why it smells like conflict theory. Framing it through principal-agent problems would at least have immediately suggested others.
Something about this piece felt off to me, like I couldn’t see anything specifically wrong with it but still had a strong instinctive prior that lots of things were wrong.
After thinking about it for a bit, I think my main heuristic is: this whole piece sounds like it’s built on a conflict-theory worldview. The whole question of the essay is basically “who should we be angry at?” Based on that, I’d expect that many or most of the individual examples are probably inaccurately understood or poorly analyzed. Lark’s comment about the Wells Fargo case confirms that instinct for one of the examples.
Then I started thinking about the “conflict theory = predictably wrong” heuristic. We say “politics is the mindkiller”, but I don’t think that’s quite right—people have plenty of intelligent discussions about policy, even when those discussions inherently involve politics. “Tribalism is the mindkiller” is another obvious formulation, but I’d also propose “conflict theory is the mindkiller”. Models like “arguments are soldiers” or “our enemies are evil” are the core of Yudkowsky’s original argument for viewing politics as a mind-killer. But these sort of models are essentially synonymous with conflict theory; if we could somehow have a tribalistic or political discussion without those conflict-theoretic elements, I’d expect it wouldn’t be so mindkiller-ish.
Looping back to the main topic of the OP: what would be a more mistake-theoretic way to view the same examples? One theme that jumps out to me is principal-agent problems: when something is outsourced, it’s hard to align incentives. That topic has a whole literature in game theory, and I imagine more useful insight could be had by thinking about how it applies to the examples above, rather than thinking about “moral culpability”—a.k.a. who to be angry at.
Those are consistent path-dependent preferences, so they can be modeled by a committee of subagents by the method outlined in the post. It would require something like n·2^(n−1) states, I think, one for each current topping times each possible set of toppings tried already. Off the top of my head, I’m not sure how many dimensions it would require, but you can probably figure it out by trying a few small examples.
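As a quick sanity check of that count (my reading of the state space; the post’s construction may differ in details), a state is (current topping, set of toppings tried so far), where the tried-set must include the current topping:

```python
from itertools import combinations

def preference_states(toppings):
    # A state is (current topping, frozenset of toppings tried so far),
    # where the tried-set must include the current topping.
    states = []
    for current in toppings:
        others = [t for t in toppings if t != current]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                states.append((current, frozenset(subset) | {current}))
    return states

# n = 3 toppings gives 3 * 2**2 = 12 states:
print(len(preference_states(["pepperoni", "mushroom", "olive"])))  # 12
```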
That said, the right way to model those particular preferences is to introduce uncertainty and Bayesian reasoning. The “hidden state” in this case is clearly information the agent has learned about each topping.
This raises another interesting question: can we model all path-dependent preferences by introducing uncertainty? What subset can be modeled this way? Nonexistence of a representative agent for markets suggests that we can’t always just use uncertainty, at least without changing our interpretations of “system” or “preference” or “state” somewhat. On the other hand, in some specific cases it is possible to interpret the wealth distribution in a market as a probability distribution in a mixture model—log utilities let us do this, for instance. So I’d guess that there’s some clever criterion that would tell us whether a committee/market with given utilities can be interpreted as a single Bayesian utility maximizer.
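As a hedged sketch of the log-utility case (my gloss; a careful formalization might differ): with log utilities each trader bets Kelly, the market prices an event $e$ at the wealth-weighted mixture of beliefs $p(e) = \sum_i w_i\, p_i(e)$, and after observing $e$ the wealth fractions update as

$$w_i' = \frac{w_i\, p_i(e)}{\sum_j w_j\, p_j(e)},$$

which is exactly Bayes’ rule applied to the mixture weights. In that case the market as a whole behaves like a single Bayesian reasoner with a mixture prior.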
One potential problem: if the two utilities have different asymptotic behavior, then one of them can dominate decision-making. For instance, suppose we’re using 0-1 normalization, but one of the two utilities has a big spike or tail somewhere. Then it’s going to have near-zero slope everywhere else.
A more concrete example: on the hedonism axis, humans have more capacity for severe pain than extreme pleasure. So that end of the axis has a big downward spike, and the hedonism-utility would be near-flat at the not-severe-pain end (at least for any of the normalizations you suggest, other than max-mean, which has the same problem with the other end of the axis). But if the preferences-utility lacks a big spike like that, then we’re liable to end up with constant low-grade hedonic unhappiness.
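A minimal numeric sketch of the flattening effect (illustrative numbers only, not a claim about actual human utilities):

```python
import numpy as np

# Two utilities over the same five outcomes; the hedonism utility has a
# deep pain spike at one end (illustrative numbers only).
u_hedonism = np.array([-100.0, -1.0, 0.0, 1.0, 2.0])
u_prefs    = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

def normalize_01(u):
    # 0-1 normalization: rescale so the minimum maps to 0 and the maximum to 1.
    return (u - u.min()) / (u.max() - u.min())

print(normalize_01(u_hedonism))  # [0.  0.971  0.980  0.990  1.]  near-flat off the spike
print(normalize_01(u_prefs))     # [0.  0.25   0.5    0.75   1.]
```

After normalization, almost all of the hedonism utility’s slope is concentrated at the pain spike, so away from that end it contributes almost nothing to the combined objective.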
That’s still a lot better than plenty of other possible outcomes—preference-utility still looks good, and we’re not in constant severe pain. But it still seems not very good.
It doesn’t have to be particularly good, just give non-crazy results.
The intertheoretic utility post makes a lot more sense in that light; I had mostly dismissed it as a hack job when I first saw it. But if this is the sort of thing you’re trying to do, it seems more useful. Thanks for clarifying.
How do you imagine standardizing the utility functions? E.g., if we multiply U_h by 2, then it does just as good a job representing our happiness, but gets twice as much weight.