Yeah, it looks like maybe the same argument, just expressed very differently? Like, I think the “coherence implies goal-directedness” argument basically goes through if you just consider computational complexity, but I’m still not sure if you agree? (Maybe I’m being way too vague.)
Or maybe I want a stronger conclusion? I’d like to say something like “REAL, GENERAL intelligence” REQUIRES goal-directed behavior (given the physical limitations of the real world). It seems like maybe our disagreement (if there is one) is around how much departure from goal-directedness is feasible/desirable, and/or how much we expect such departures to affect performance (the trade-off also gets worse for more intelligent systems).
It seems likely the AI’s beliefs would be logically coherent whenever the corresponding human beliefs are logically coherent. This seems quite different from arguing that the AI has a goal.
Yeah, it’s definitely only an *analogy* (in my mind), but I find it pretty compelling *shrug*.
But really my answer is “there are lots of ways you can get confidence in a thing that are not proofs”.
Totally agree; it’s an under-appreciated point!
Here’s my counter-argument: we have no idea what epistemological principles explain this empirical observation. Therefore we don’t actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.).
The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things / via such means. However, we may be badly mistaken, and thus still be extremely likely, objectively speaking, to be wrong.
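To make the point concrete, here’s a toy calculation (the numbers and model names are mine, purely for illustration): we mix over two hypotheses about how reliable our informal reasoning is, and the expectation over both kinds of uncertainty yields high subjective confidence even though, under one of the hypotheses, we’re objectively likely to be wrong.

```python
# Toy illustration: subjective confidence under a mixture of models
# vs. the objective chance conditional on the true model.

# Two hypotheses about how reliable our informal reasoning is:
#   "sound":  informal arguments track truth 99% of the time
#   "broken": they track truth only 20% of the time
p_model = {"sound": 0.9, "broken": 0.1}                   # epistemic (prior) weights
p_correct_given_model = {"sound": 0.99, "broken": 0.20}   # aleatory part

# Expectation over both kinds of uncertainty -> high subjective confidence:
subjective = sum(p_model[m] * p_correct_given_model[m] for m in p_model)
print(subjective)  # ≈ 0.911

# But if "broken" happens to be the true model, we are objectively
# likely to be wrong despite that subjective confidence:
print(p_correct_given_model["broken"])  # 0.2
```

So the two numbers can come apart arbitrarily far; the subjective expectation hides how badly we’d do under the unfavorable hypothesis.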
This also probably explains a lot of the disagreement, since different people probably just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems.
I’m personally quite uncertain about that question ATM. I tend to think we can get pretty far with this kind of informal reasoning in the “early days” of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly super-human intelligences. I’d like to see more work in epistemology aimed at addressing this (and other Xrisk-relevant concerns, e.g.: what principles of “social epistemology” would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I’d argue we’re in the process of failing catastrophically at that).
RE the title, a quick list:
FHI (and associated orgs)
I think a lot of orgs that are more focused on social issues which can or do arise from present day AI / ADM (automated decision making) technology should be thinking more about global coordination, but seem focused on national (or subnational, or EU) level policy. It seems valuable to make the most compelling case for stronger international coordination efforts to these actors. Examples of this kind of org that I have in mind are AINow and Montreal AI ethics institute (MAIEI).
As mentioned in other comments, there are many private conversations among people concerned about AI-Xrisk, and (IMO, legitimate) info-hazards / unilateralist curse concerns loom large. It seems prudent to make progress on those meta-level issues (i.e. how to engage the public and policymakers on AI(-Xrisk) coordination efforts) as a community as quickly as possible, because:
Getting effective AI governance in place seems like it will be challenging and take a long time.
There are a rapidly growing number of organizations seeking to shape AI policy, who may have objectives that are counter-productive from the point of view of AI-Xrisk. And there may be a significant first-mover advantage (e.g. via setting important legal or cultural precedents, and framing the issue for the public and policymakers).
There is massive untapped potential for people who are not currently involved in reducing AI-Xrisk to contribute (consider the raw number of people who haven’t been exposed to serious thought on the subject).
Info-hazard-y ideas are becoming public knowledge anyways, on the timescale of years. There may be a significant advantage to getting ahead of the “natural” diffusion of these memes and seeking to control the framing / narrative.
My answers to your 6 questions:
1. Hopefully the effect will be transient and minimal.
2. I strongly disagree. I think we (ultimately) need much better coordination.
3. Good question. As an incomplete answer, I think personal connections and trust play a significant (possibly indispensable) role.
4. I don’t know. Speculating/musing/rambling: the kinds of coordination where IT has made a big difference (recently, i.e. starting with the internet) are primarily economic and consumer-facing. For international coordination, the stakes are higher; it’s geopolitics, not economics, and you need effective international institutions to provide enforcement mechanisms.
5. Yes, but this doesn’t seem like a crucial consideration (for the most part). Do you have specific examples in mind?
6. Social science and economics seem really valuable to me: game theory, mechanism design, behavioral game theory. I imagine there’s probably a lot of really valuable work on how people/orgs make collective decisions that the stakeholders are satisfied with in some other fields as well (psychology? sociology? anthropology?). We need experts in these fields (especially the softer fields, which I think are underrepresented) to inform the AI-Xrisk community about existing findings and to create research agendas.
BoMAI is in this vein, as well ( https://arxiv.org/pdf/1905.12186.pdf )
I don’t understand how this answers the question.
As a clarification, I’m considering the case where we consider the state space to be the set of all “possible” histories (including counter-logical ones), like the standard “general RL” (i.e. AIXI-style) set-up.
I don’t know how Deep Blue worked. My impression is that it didn’t use learning, so the answer would be no.
A starting point for Tom and Stuart’s works: https://scholar.google.com/scholar?rlz=1C1CHBF_enCA818CA819&um=1&ie=UTF-8&lr&cites=1927115341710450492
Naturally (as an author on that paper), I agree to some extent with this argument.
I think it’s worth pointing out one technical ‘caveat’: the agent should get utility 0 *on all future timesteps* as soon as it takes an action other than the one specified by the policy. We say the agent gets reward 1 “if and only if its history is an element of the set H”, *not* iff “the policy would take action a given history h”. Without this caveat, I think the agent might take other actions in order to capture more future utility (e.g. to avoid terminal states). [Side-note (SN): this relates to a question I asked ~10 days ago about whether decision theories and/or policies need to specify actions for impossible histories.]
My main point, however, is that I think you could do some steelmanning here and recover most of the arguments you are criticizing (based on complexity arguments). TBC, I think the thesis (i.e. the title) is a correct and HIGHLY valuable point! But I think there are still good arguments for intelligence strongly suggesting some level of “goal-directed behavior”. E.g., it’s probably physically impossible to implement policies (over histories) that are effectively random, since implementing them would require look-up tables larger than the physical universe. So when we build AIs, we are building things that aren’t at that extreme end of the spectrum. Eliezer has a nice analogy in a comment on one of Paul’s posts (I think), about an agent that behaves like it understands math, except that it thinks 2+2=5. You don’t have to believe the extreme version of this view to believe that it’s harder to build agents that aren’t coherent *in a more intuitively meaningful sense* (i.e. closer to caring about states, which is (I think, e.g. see Hutter’s work on state aggregation) equivalent to putting some sort of equivalence relation on histories).
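To make the construction under discussion concrete, here’s a minimal sketch (the policy, horizon, and names are my own toy choices, not from the paper): an arbitrary deterministic policy over histories, and the indicator utility function — reward 1 iff the full history is in the set H of histories the policy itself would produce — for which that policy is trivially optimal.

```python
# Sketch: ANY policy over histories is optimal for the utility function
# that pays 1 iff the full history is consistent with the policy
# (and 0 on all future timesteps after any deviation).
from itertools import product

ACTIONS = ["a", "b"]
HORIZON = 3

def some_policy(history):
    # An arbitrary, even "incoherent"-looking, policy over histories.
    return "a" if len(history) % 2 == 0 else "b"

# H = the set of full histories consistent with the policy.
consistent = set()
def build(history=()):
    if len(history) == HORIZON:
        consistent.add(history)
        return
    build(history + (some_policy(history),))
build()

def utility(history):
    return 1 if tuple(history) in consistent else 0

# Check optimality: the policy's own trajectory gets utility 1,
# and every deviating trajectory gets utility 0.
for traj in product(ACTIONS, repeat=HORIZON):
    follows = all(traj[i] == some_policy(traj[:i]) for i in range(HORIZON))
    assert utility(traj) == (1 if follows else 0)
print("policy is optimal for its indicator utility")
```

Note how the caveat above shows up here: utility is a function of the *full* history, so any deviation forfeits all future utility, and the agent gains nothing by steering toward or away from terminal states.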
I also want to mention Laurent Orseau’s paper: “Agents and Devices: A Relative Definition of Agency”, which can be viewed as attempting to distinguish “real” agents from things that merely satisfy coherence via the construction in our paper.
I strongly agree.
I should’ve been more clear.
I think this is a situation where our intuition is likely wrong.
This sort of thing is why I say “I’m not satisfied with my current understanding”.
The two examples here seem to not have alarming/obvious enough Ws. It seems like you are arguing against a straw-man who makes bad predictions, based on something like a typical mind fallacy.
my concern that decision theory research (as done by humans in the foreseeable future) can’t solve decision theory in a definitive enough way that would obviate the need to make sure that any potentially superintelligent AI can find and fix decision theoretic flaws in itself
So you’re saying we need to solve decision theory at the meta-level, instead of the object-level. But can’t we view any meta-level solution as also (trivially) an object level solution?
In other words, “[making] sure that any potentially superintelligent AI can find and fix decision theoretic flaws in itself” sounds like a special case of “[solving] decision theory in a definitive enough way”.
I’ll start by objecting to your (6): that seems like an important goal, to my mind. And if we *aren’t* able to verify that an AI is free from decision-theoretic flaws, then how can we trust it to self-modify to be free of such flaws?
Your perspective still makes sense to me if you say: “this AI (call it ALICE) is exploitable, but it’ll fix that within 100 years, so if it doesn’t get exploited in the meanwhile, then we’ll be OK”. And OFC in principle, making an agent that will have no flaws within X years of when it is created is easier than the special case of X=0.
In reality, it seems plausible to me that we can build an agent like ALICE and have a decent chance that ALICE won’t get exploited within 100 years.
But I still don’t see why you dismiss the goal of (6); I don’t think we have anything like definitive evidence that it is an (effectively) impossible goal.
we haven’t seen any examples of them trying to e.g. kill other processes on your computer so they can have more computational resources and play a better game.
It’s a good point, but… we won’t see examples like this if the algorithms that produce this kind of behavior take longer to produce the behavior than the amount of time we’ve let them run.
I think there are good reasons to view the effective horizon of different agents as part of their utility function. Then I think a lot of the risk we incur is because humans act as if we have short effective horizons. But I don’t think we *actually* do have such short horizons. In other words, our revealed preferences are more myopic than our considered preferences.
Now, one can say that this actually means we don’t care that much about the long-term future, but I don’t agree with that conclusion; I think we *do* care (at least, I do), but aren’t very good at acting as if we(/I) do.
Anyways, if you buy this line of argument about effective horizons, then you should be worried that we will easily be outcompeted by some process/entity that behaves as if it has a much longer effective horizon, so long as it also finds a way to make a “positive-sum” trade with us (e.g. “I take everything after 2200 A.D., and in the meanwhile, I give you whatever you want”).
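Here are some toy numbers (mine, purely illustrative) for why such a trade looks “positive-sum” to both parties: an agent that discounts the future much less than we behaviorally do will happily give up the near term in exchange for everything after the cutoff.

```python
# Toy model of the "effective horizon" trade: a behaviorally myopic human
# and a patient agent split a reward stream at a cutoff date.

def discounted_value(rewards, gamma):
    """Sum of rewards discounted geometrically by gamma per period."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# 1 unit of reward per period for 10 periods; periods 0-2 are "the
# meanwhile", periods 3-9 are "everything after 2200 A.D.".
near = [1.0, 1.0, 1.0] + [0.0] * 7   # keep only the near term
far  = [0.0, 0.0, 0.0] + [1.0] * 7   # keep only the far future

myopic_gamma, patient_gamma = 0.5, 0.99

# A behaviorally myopic human prefers the near-term share...
assert discounted_value(near, myopic_gamma) > discounted_value(far, myopic_gamma)
# ...while a patient agent prefers the far-future share:
assert discounted_value(far, patient_gamma) > discounted_value(near, patient_gamma)
print("both sides prefer their side of the trade")
```

The worry, of course, is that if our *considered* discount rate is closer to the patient one, we end up regretting a trade our revealed preferences happily accepted.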
I view the chess-playing algorithm as either *not* fully goal-directed, or somehow fundamentally limited in its understanding of the world, or level of rationality. Intuitively, it seems easy to make agents that are ignorant or indifferent (/“irrational”) in such a way that they will only seek to optimize things within the ontology we’ve provided (in this case, of the chess game), instead of outside it (i.e. seizing additional compute). However, our understanding of such things doesn’t seem mature… at least I’m not satisfied with my current understanding. I think Stuart Armstrong and Tom Everitt are the main people who’ve done work in this area, and their work on this stuff seems quite under-appreciated.
Yeah, I think it totally does! (and that’s a very interesting / “trippy” line of thought :D)
However, it does seem somewhat unlikely to me, since it would require fairly advanced intelligence, and I don’t think evolution is likely to have produced such advanced intelligence with us remaining totally unaware of it, whereas I think something about the way we train AI is more strongly selecting for “savant-like” intelligence, which is sort of what I’m imagining here. I can’t think of why I have that intuition OTTMH.
So I don’t take EY’s post as about AI researchers’ competence, as much as their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me.
I don’t think I’m underestimating AI researchers, either, but for a different reason… let me elaborate a bit: I think there are waaaaaay too many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I’m imagining something more like options, or having accurate generalized value functions (GVFs), than tasks.
Regarding long-term planning, I’d factor this into 2 components:
1) having a good planning algorithm
2) having a good world model
I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model.
I don’t think it’s necessary to have a very “complete” world-model (i.e. enough knowledge to look smart to a person) in order to find “steganographic” long-term strategies like the ones I’m imagining.
I also don’t think it’s even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs… (i.e. be some sort of savant).
I’m not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I’m actually more interested in hearing your take on those lines of argument than saying mine ATM :P
Not a direct response: It’s been argued (e.g. I think Paul said this in his 2nd 80k podcast interview?) that this isn’t very realistic, because the low-hanging fruit (of easy to attack systems) is already being picked by slightly less advanced AI systems. This wouldn’t apply if you’re *already* in a discontinuous regime (but then it becomes circular).
Also not a direct response: It seems likely that some AIs will be much more/less cautious than humans, because they (e.g. implicitly) have very different discount rates. So AIs might take very risky gambles, which means both that we might get more sinister stumbles (good thing), but also that they might readily risk the earth (bad thing).
I do think this is an overly optimistic picture. The amount of traction an argument gets seems to be something like a product of how good the argument is, how credible those making the argument are, and how easy it is to process the argument.
Also, regarding this:
But the credibility system is good enough that the top credible people are really pretty smart, so to an extent can be swayed by good arguments presented well.
It’s not just intelligence that determines if people will be swayed; I think other factors (like “rationality”, “open-mindedness”, and other personality factors) play a very big role.
Oops, missed that, sry.
I think a potentially more interesting question is not about running a single AI system, but rather the overall impact of AI technology (in a world where we don’t have proofs of things like beneficence). It would be easier to hold the analogue of the empirical claim there.
Yep, good catch ;)
I *do* put a non-trivial weight on models where the empirical claim is true, and not just out of epistemic humility. But overall, I’m epistemically humble enough these days to think it’s not reasonable to say “nearly inevitable” if you integrate out epistemic uncertainty.
But maybe it’s enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?
Or are you just trying to see if anyone can defeat the epistemic humility “trump card”?