I suspect describing AI as having “values” feels more alien than “goals,” but I don’t have an easy way to figure this out.
The Matchless Match
[Image: “why not both?” meme]
Here’s my current four-point argument for AI risk/danger from misaligned AIs.
1. We are on the path to creating intelligences capable of being better than humans at almost all economically and militarily relevant tasks.
2. There are strong selection pressures and trends to make these intelligences into goal-seeking minds acting in the real world, rather than disembodied high-IQ pattern-matchers.
3. Unlike traditional software, we have little ability to know or control what these goal-seeking minds will do; we can only give directional input.
4. Minds much better than humans at seeking their goals, with goals different enough from our own, may end us all, either as a preventative measure or as a side effect.
Request for feedback: I’m curious whether there are points that people think I’m critically missing, and/or ways that these arguments would not be convincing to “normal people.” I’m trying to write the argument to lay out the simplest possible case.
Yeah I believe this too. Possibly one of the relatively few examples of the midwit meme being true in real life.
What are people’s favorite arguments/articles/essays trying to lay out the simplest possible case for AI risk/danger?
Every single argument for AI danger/risk/safety I’ve seen seems to overcomplicate things. Either they have too many extraneous details, or they appeal to overly complex analogies, or they seem to spend much of their time responding to insider debates.
I might want to try my hand at writing the simplest possible argument that is still rigorous and clear, without being trapped by common pitfalls. To do that, I want to quickly survey the field so I can learn from the best existing work as well as avoid the mistakes they make.
Ashli Babbitt, an unarmed <protestor or rioter, depending on your party affiliation>
I appreciate your attempt to be charitable, but I don’t think the left-wing/liberal concerns with Jan 6 are appropriately summarized as “riot.”
Alas, no, my model is rather limited.
I would not treat Dean Ball as a trustworthy actor on the object level, and certainly would not take any of his statements at face value! I think it’s much better to model him as a combination of a political actor saying whatever words cause his political aims to be achieved, plus someone willing to pursue random vendettas.
I’d recommend trying to talk to people 1:1, especially about topics that are more in their wheelhouses than in yours. At least I’ve found my average conversation with Uber drivers to be more interesting and insightful than reading my phone.
My guess is that I do this more than you do, but one thing I find unpleasant about interacting with large groups of people I don’t know well is that I wind up doing a bunch of semi-conscious theory-of-mind modeling, emotional-regulation-type management of different levels of a conversation, etc.,[1] so it’s harder for me to focus on the object level.[2] I think this is much less of a problem in 1:1 conversations, where maintaining the multilevel tracking feels quite natural.
[1] It’s unclear to me if I do this more or less than “normies.” The case for “less” is that I don’t think I’ve spent a lot of my skill points on people-modeling compared to other things. The case for “more” is that the people I interact with often have almost laughably simplistic or non-existent models of other people.
[2] I would not be surprised if I specifically happen to be in a midwit part of the curve, alas.
I personally find the “virtue is good because bounded optimization is too hard” framing less valuable/persuasive than the “virtue is good because your own brain and those of other agents are trying to trick you” framing. Basically, the adversarial dynamics seem key in these situations, otherwise a better heuristic might be to focus on the highest order bit first and then go down the importance ladder.
Though of course both are relevant parts of the story here.
Thanks, this is a helpful point! The second one has been on my mind re: assassinations, and is implicitly part of my model for uncertainty about assassination effectiveness. (I still think my original belief is largely correct, but I can’t rule out psyops.)
I often see people advocate others sacrifice their souls. People often justify lying, political violence, coverups of “your side’s” crimes and misdeeds, or professional misconduct of government officials and journalists, because their cause is sufficiently True and Just. I’m overall skeptical of this entire class of arguments.
This is not because I intrinsically value “clean hands” or seeming good over actual good outcomes. Nor is it because I have a sort of magical thinking common in movies, where things miraculously work out well if you just ignore tradeoffs.
Rather, it’s because I think the empirical consequences of deception, violence, criminal activity, and other norm violations are often (not always) quite bad, and people aren’t smart or wise enough to tell the exceptions apart from the general case, especially when they’re ideologically and emotionally compromised, as is often the case.
Instead, I think it often helps to be interpersonally nice, conduct yourself with honor, and overall be true to your internal and/or society-wide notions of ethics and integrity.
I’m especially skeptical of galaxy-brained positions where to be a hard-nosed consequentialist or whatever, you are supposed to do a specific and concrete Hard Thing (usually involving harming innocents) to achieve some large, underspecified, and far-off positive outcome.
I think it’s like those thought experiments about torturing a terrorist (or a terrorist’s child) to find the location of a ticking nuclear bomb under Manhattan, where somehow you know the torture would work.
I mean, sure, if presented that way I’d think it’s a good idea, but has anybody here checked the literature on the reliability of evidence extracted under torture? Is that really the most effective interrogation technique?
So many people seem eager to rush to sell their souls, without first checking to see if the Devil’s willing to fulfill his end of the bargain.
The 7 Types Of Advice (And 3 Common Failure Modes)
One dispositional difference between me and other people is that, if Bob says statement X that’s false and dumb, I’m much more likely to believe that Bob did not meaningfully understand something about X.
I think other people are much more likely to jump to “Bob actually has a deeper reason Y for saying X” if they like Bob or “Bob is just trolling” if they dislike Bob.
Either reason might well be true, but a) I think they are often not true, and b) even if they are, I still think Bob most likely didn’t understand X, even if X is not Bob’s true rejection, or if Bob would’ve found a different thing to troll about had he understood X better.
This is one of the reasons I find it valuable to proffer relatively simple explanations for complex concepts, at least sometimes.
Thanks, this is helpful.
Schelling’s specific point actually feels relevant to me, and like a blind spot among (at least some) rationalists or EAs when they talk about “conflict” vs “mistake” theory. I’ve recently thought about the “conflict vs mistake theory” framing some more, and I think it misses a lot of the lessons that are standard in, e.g., negotiation classes or bargaining theory, or international relations/game theory writ large.
I think a lot of the time a better position is something roughly like: “I have my interests and intend to pursue my own interests to the best of my ability. I respect you as an agent with your own interests, willing to pursue yours. Sometimes our interests come into conflict, and we take actions detrimental to each other. However, it is implausible that our interests are directly opposed, and there are often plausible gains from trade.”
A plausible example of mistake theory inhibiting gains from trade is when (supposedly) Obama often tried to lecture Republican lawmakers about their mistakes, instead of taking their interests as a given and trying to negotiate more.
Of course, conflict theory can also inhibit gains from trade, if it prevents people from coming to the negotiation table, or keeps them from noticing that bargaining is almost always a better option than war.
Thanks for the Wiki article; it was helpful to read!
Yeah, it’s hard to balance the examples well. The most common examples of being wrong about X are often not the most central/clean examples of being wrong about X. This was also an issue for me in the Theory of Mind examples (neurotypical adults have at least some ToM in the developmental psychology sense; some of the most common failures are more sophisticated ones, like the typical mind fallacy[1]; but in a sense, neither are the most interesting examples to bring up).
For me, an interesting example of Grice’s maxims not being fully integrated is this post, which argues that you need to understand postmodern philosophy to get why “Stating true facts sometimes makes people angrier rather than updating their beliefs,” whereas in practice I think in many (not all) cases, the people “stating true facts” and “just asking questions” that predictably make people angry are failing to integrate Grice’s maxims on a normative level, and/or have poor theory of mind on a descriptive level.
Obviously there’s more than one way to surmount a mountain, and continental philosophy has other teachings and benefits as well, so I don’t want to begrudge people too much for becoming better at strategic empathy and conversational pragmatics through continental philosophy rather than the tools I’m more familiar with. But it does feel like overkill to me, and unfortunately continental philosophy seems to shackle people with other commitments and attitudes.
[1] The problem with “typical mind fallacy” as a prototypical example of a cognitive error is that in many cases it can also be correctly described as a typical mind heuristic.
Yeah, after getting enough people tripped up by/upset at me for invoking IVT-like intuitions for discontinuous functions,[1] I suspect something like the above is the subtler point I should’ve led with. Elsewhere I wrote that I think part of the argument is that if you have a complicated distribution over a bunch of unknown discontinuous functions “in reality,” then from your epistemic state it would often essentially look continuous to you when you combine the probabilities together, and you should treat it as such.
I think your formalism is helpful/might aid in thinking more clearly, but I’m also worried people would jump on it if their uncertainty is slightly non-uniform (without noticing that changing the math a tiny bit only changes the bottom-line result a tiny bit).
In a lot of situations you can still treat things as locally linear despite non-global uniformity (see my third point, on differentiable functions being locally linear), but that argument is more about “negotiating price”; my first (IVT-inspired) point was about establishing that it’s possible to have an effect at all.
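To make the “a mixture of unknown discontinuous functions often looks continuous from your epistemic state” point concrete, here’s a minimal numpy sketch (my own toy illustration with a made-up threshold model, not anything from the original thread): each candidate world is a sharp step function that jumps at an unknown threshold, but averaging over your uncertainty about the threshold gives an expected effect that hugs the smooth line y = x, and tilting the prior slightly away from uniform only shifts the answer slightly.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_effect(x, thresholds, weights):
    """Expected effect at x under a mixture of step functions with uncertain jump points."""
    steps = (x >= thresholds[:, None]).astype(float)  # row i: the step function that jumps at thresholds[i]
    return weights @ steps                            # probability-weighted average over the mixture

xs = np.linspace(0.0, 1.0, 201)
thresholds = rng.uniform(0.0, 1.0, size=1000)         # epistemic uncertainty over where the jump is

# Uniform weights over the sampled thresholds: the mixture is close to the smooth line y = x.
uniform_w = np.full(1000, 1.0 / 1000)
mix_uniform = expected_effect(xs, thresholds, uniform_w)

# Slightly non-uniform weights: the bottom-line curve barely moves.
tilt = 1.0 + 0.1 * (thresholds - 0.5)                 # mild tilt away from uniform
tilted_w = tilt / tilt.sum()
mix_tilted = expected_effect(xs, thresholds, tilted_w)

print("max |mixture - x| under uniform uncertainty:", round(float(np.max(np.abs(mix_uniform - xs))), 3))
print("max shift from a slightly tilted prior:", round(float(np.max(np.abs(mix_tilted - mix_uniform))), 3))
```

Both printed numbers come out small: the first is the “looks continuous despite underlying discontinuities” point, and the second is the “changing the math a tiny bit only changes the bottom-line result a tiny bit” point in miniature.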
Totally agree that the train example is a clear elucidation. I’ve used it before in other contexts when trying to explain EV-style reasoning more directly.
[1] Obviously IVT doesn’t hold for all discontinuous functions. But IVT-style intuitions still hold up for reasons like the ones you illustrate, most of the time.
I think this sort of assumes that terminal-ish goals are developed earlier and are thus more stable, while instrumental-ish goals are developed later and are more subject to change.
I think this may or may not be true on the individual level, but it’s probably false on the ecological level.
Competitive pressures shape many instrumental-ish goals to be convergent, whereas terminal-ish goals have more free parameters.