Is this supposed to answer my entire comment, in the sense that the general audience doesn’t need precise definitions? That may work for some people but can be off-putting to others. And surely it’s more important to convince AI researchers. Much of the general public already hates AI.
zulupineapple
I don’t really understand what you mean by goal-oriented
This is the key point. “The Problem” uses the word “goal” some 80 times, but does not define it, does not acknowledge that it’s a complex concept, and does not consider whether some AI might not have one. I wish I could just use your concept of a goal; we shouldn’t need to discuss this, it should have been precisely defined in “The Problem” or some other introductory text.
Personally, by “goal” or “goal-oriented” I mean that the utility function of the AI has a simple description. E.g. in the narrow domain of chess moves, the actions a chess bot chooses are very well explained by “trying to win”. On the other hand, in the real world there are many actions that would help with winning, which the chess bot ignores, and not for lack of intelligence; therefore “trying to win” is no longer a good predictor, and I claim that this bot is not actually “goal-oriented”, or at least that its goal is very different from plain “winning”. Maybe you would call this property “robustly goal-oriented”?
A second definition could be that any system which moves towards any goal, more than away from it, is “oriented” to that goal. This is reasonable because that’s the economically useful part. And with this definition, the statement “ASI is very likely to exhibit goal-oriented behavior” is trivial. But that’s a very low bar, and extrapolating something about the long term behavior of the system from this definition seems like a mistake.
I think this explanation constitutes me answering in the affirmative to (the spirit of) your first question.
I’m happy with the explanation, my issue is that I don’t feel like I’ve seen this explicitly acknowledged, neither in “The Problem” nor in “A List of Lethalities” (maybe it falls under “We can’t just build a very weak system”, but I don’t agree that it has to be weak) nor in Paul’s posts I’ve read.
not all useful agents are goal-oriented.
The theorem prover I mentioned is one such useful agent. I understand you would call the prover “goal-oriented”, but it doesn’t necessarily reach that level under my definition. And at least we agree that provers can be safe. The usefulness is, among other things, that we could use the prover to work out alignment for more general agents. I don’t want to go too far down the tangent of whether this would actually work, but surely it is a coherent sequence of events, right?
The chess example, as I recall, is a response to two common objections
I don’t hold these objections, and I don’t think anyone reasonable does, especially with the “never” in them. At best I could argue that humans aren’t actually great at pursuing goals robustly, and therefore the AI might also not be.
contemporary general systems do not pursue their goals especially robustly, and it may be hard to make improvement on this axis
It’s not just “hard” to make improvements, but also unnecessary and even suicidal. X-risk arguments seem to assume that goals are robust, but do not convincingly explain why they must be.
To be clear, you believe that making aligned narrow AI is easy, regardless of how intelligent it is? Even something more useful than a chess bot, like a theorem prover? And the only reason AIs will be goal-oriented to a dangerous extent, is because people will intentionally make them like that, despite obvious risks? I’m not saying they won’t, but is that really enough to justify high p(doom)? When I was reading “The Problem”, I was sure that goal-oriented AI was seen as inevitable for some reason deeper than “Goal-oriented behavior is economically useful”.
I’d still like to argue that “goal-oriented” is not a simple concept, and it’s not trivial to produce a goal-oriented agent even if you try, and that not all useful agents are goal-oriented. But if the answer is that people will try very hard to kill themselves, I wouldn’t know how to reply to that.
All AI is trained in narrow domains, to some extent. There is no way to make a training environment as complex as the real world. I could have made the same post about LLMs, except there the supposed goal is a lot less clear. Do you have a better example of a “goal oriented” AI in a complex domain?
You might reasonably argue that making aligned narrow AI is easy, but greedy capitalists will build unaligned AI instead. I think it would be off topic to debate here how likely that is. But I don’t think this is the prevailing thought, and I don’t think it produces the p(doom)=0.9 that some people hold.
And I do personally believe that EY and many others believe that, with enough optimization, even a chess bot would become dangerous. I’m not sure there is any evidence for that belief.
Chess bots do not have goals
In reference to Said criticizing Benquo, you seem to be ignoring the crucial point, which is that Said was right. Benquo made the simple claim, that knowing about yeast is useful in everyday life, and this claim is clearly wrong, regardless of what either of them said about it. Benquo could have admitted this, or he could have found another example. But instead he doubled down on being wrong, which naturally leads to frustration. It’s concerning that you picked this conversation as an example, as if you can’t tell.
I’m also confused by the “asymmetric effort” part. You describe it like a competition between author and critic. But if the criticism is correct, the author should be happy to receive it (isn’t that the point of making posts?). And how much effort does it take to say “I don’t understand what you mean” or “this doesn’t seem important to my core point” or “you’re right”? And what reward do you think Said gets from this? Certainly not social status, as he clearly doesn’t have any.
By the way, I appreciate you pointing to real conversations instead of vaguely hand waving. If you want to change anyone’s mind about norms, you need to point to negative as well as positive examples as evidence. Hopefully you’ll do that when you eventually explain the right way to fight the LinkedIn attractor.
Given that the training data contains contradictory statements and blatant lies, it’s natural that the LLM would learn to lie in order to reproduce those. Rather, it’s surprising that the LLM has any coherent concept of truth at all.
As for solutions, I’d suggest either having LLMs produce formal verifiable proofs, or having them fact check each other, not that you need my suggestions.
The solution is to realize that your rationality skills are also bullshit, it’s just a slightly different brand of bullshit, not compatible with the bullshit of lay-philosophy. There certainly isn’t much evidence to the contrary.
In practice I recommend playing a video game, finding people who are better than you at it, finding that your rationality isn’t all that useful for beating them, and developing some respect for them, despite the shit they might say sometimes.
Maybe I should just let you tell me what framework you are even using in the first place.
I’m looking at the Savage theory from your own https://plato.stanford.edu/entries/decision-theory/ and I see U(f) = ∑ᵢ u(f(sᵢ))P(sᵢ), so at least they have no problem with the domains (O and S) being different. Now I see the confusion is that to you Omega=S (and also O=S), but to me Omega=dom(u)=O.
Furthermore, if O={o₀, o₁}, then I can group the terms into u(o₀)P(“we’re in a state where f evaluates to o₀”) + u(o₁)P(“we’re in a state where f evaluates to o₁”). I’m just moving all of the complexity out of EU and into P, which I assume to work by some magic (e.g. LI) that doesn’t involve literally iterating over every possible S.
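To make the grouping concrete, here is a toy sketch (all states, probabilities, and utilities are made up for illustration): summing u(f(s))P(s) over states gives the same number as summing u(o) times the total probability of the states where f evaluates to o.

```python
# Hypothetical toy example of the grouping described above.
states = ["s1", "s2", "s3", "s4"]
P = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.4}  # belief over states
f = {"s1": "o0", "s2": "o1", "s3": "o0", "s4": "o1"}  # act: state -> outcome
u = {"o0": 0.0, "o1": 1.0}  # utility on outcomes only

# Savage-style sum over states: U(f) = sum_i u(f(s_i)) P(s_i)
eu_states = sum(u[f[s]] * P[s] for s in states)

# Grouped by outcome: u(o) * P("we're in a state where f evaluates to o")
eu_grouped = sum(u[o] * sum(P[s] for s in states if f[s] == o) for o in u)

assert abs(eu_states - eu_grouped) < 1e-12  # identical by rearrangement
```

Of course, a real agent would not enumerate states like this; the point is only that the algebraic regrouping is valid, so the complexity can live in P rather than in U.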
We can either start with a basic set of “worlds” (eg, ) and define our “propositions” or “events” as sets of worlds <...>
That’s just math speak, you can define a lot of things as a lot of other things, but that doesn’t mean that the agent is going to be literally iterating over infinite sets of infinite bit strings and evaluating something on each of them.
By the way, I might not see any more replies to this.
A classical probability distribution over Ω with a utility function understood as a random variable can easily be converted to the Jeffrey-Bolker framework, by taking the JB algebra as the sigma-algebra, and V as the expected value of U.
Ok, you’re saying that JB is just a set of axioms, and U already satisfies those axioms. And in this construction “event” really is a subset of Omega, and “updates” are just updates of P, right? Then of course U is not more general, I had the impression that JB is a more distinct and specific thing.
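As I understand the quoted conversion, on a finite toy space it amounts to defining the value of an event as the conditional expectation of U given that event. A minimal sketch, with all worlds, probabilities, and utilities hypothetical:

```python
# Toy version of the conversion: events are subsets of Omega,
# and V(E) = E[U | E], the expected utility conditional on E.
Omega = ["w1", "w2", "w3"]
P = {"w1": 0.5, "w2": 0.3, "w3": 0.2}  # classical distribution
U = {"w1": 0.0, "w2": 1.0, "w3": 4.0}  # utility as a random variable

def V(event):
    """V(E) = sum_{w in E} U(w) P(w) / P(E)."""
    p_event = sum(P[w] for w in event)
    return sum(U[w] * P[w] for w in event) / p_event

# V of the whole space is just the ordinary expected utility of U.
assert abs(V(Omega) - 1.1) < 1e-12
```

On this construction, “updates” really are just updates of P, and V inherits them automatically through the conditioning.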
Regarding the other direction, my sense is that you will have a very hard time writing down these updates, and when it works, the code will look a lot like code with a utility function. But, again, the example in “Updates Are Computable” isn’t detailed enough for me to argue anything. Although now that I look at it, it does look a lot like U(p) = 1 − p(“never press the button”).
events (ie, propositions in the agent’s internal language)
I think you should include this explanation of events in the post.
construct ‘worlds’ as maximal specifications of which propositions are true/false
It remains totally unclear to me why you demand the world to be such a thing.
I’m not sure why you say Omega can be the domain of U but not the entire ontology.
My point is that if U has two output values, then it only needs two possible inputs. Maybe you’re saying that if |dom(U)|=2, then there is no point in having |dom(P)|>2, and maybe you’re right, but I feel no need to make such claims. Even if the domains are different, they are not unrelated, Omega is still in some way contained in the ontology.
I agree that we can put even more stringent (and realistic) requirements on the computational power of the agent
We could and I think we should. I have no idea why we’re talking math, and not writing code for some toy agents in some toy simulation. Math has a tendency to sweep all kinds of infinite and intractable problems under the rug.
Answering out of order:
<...> then I think the Jeffrey-Bolker setup is a reasonable formalization.
Jeffrey is a reasonable formalization, it was never my point to say that it isn’t. My point is only that U is also reasonable, and possibly equivalent or more general. That there is no “case against” it. Although, if you find Jeffrey more elegant or comfortable, there is nothing wrong with that.
do you believe that any plausible utility function on bit-strings can be re-represented as a computable function (perhaps on some other representation, rather than bit-strings)?
I don’t know what “plausible” means, but no, that sounds like a very high bar. I believe that if there is at least one U that produces an intelligent agent, then utility functions are interesting and worth considering. Of course I believe that there are many such “good” functions, but I would not claim that I can describe the set of all of them. At the same time, I don’t see why any “good” utility function should be uncomputable.
I think there is a good reason to imagine that the agent structures its ontology around its perceptions. The agent cannot observe whether-the-button-is-ever-pressed; it can only observe, on a given day, whether the button has been pressed on that day. |Omega|=2 is too small to even represent such perceptions.
I agree with the first sentence, however Omega is merely the domain of U, it does not need to be the entire ontology. In this case Omega = {“button has been pressed”, “button has not been pressed”} and P(“button has been pressed” | “I’m pressing the button”) ~ 1. Obviously, there is also no problem with extending Omega with the perceptions, all the way up to |Omega|=4, or with adding some clocks.
We could expand the scenario so that every “day” is represented by an n-bit string.
If you want to force the agent to remember the entire history of the world, then you’ll run out of storage space before you need to worry about computability. A real agent would have to start forgetting days, or keep some compressed summary of that history. It seems to me that Jeffrey would “update” the daily utilities into total expected utility; in that case, U can do something similar.
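The “compressed summary” idea can be sketched in a few lines (everything here is hypothetical): rather than storing every day’s outcome, the agent folds each day’s utility into a constant-size summary as it goes, much like an incremental “update”.

```python
# Toy sketch: keep a running (total, count) summary of daily utilities
# instead of the full history of days.

def update(summary, daily_utility):
    """Fold one day's utility into a constant-size summary."""
    total, days = summary
    return (total + daily_utility, days + 1)

summary = (0.0, 0)  # (total utility so far, number of days observed)
for day_u in [1.0, 0.0, 1.0, 1.0]:  # made-up daily utilities
    summary = update(summary, day_u)

total, days = summary
assert (total, days) == (3.0, 4)  # storage stays O(1) regardless of history
```

Whether the fold is a sum, an average, or something discounted is a design choice; the point is that neither U nor Jeffrey’s V needs the full uncompressed history to be evaluated.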
I can always “extend” a world with an extra, new fact which I had not previously included. IE, agents never “finish” imagining worlds; more detail can always be added
You defined U at the very beginning, so there is no need to send these new facts to U, it doesn’t care. Instead, you are describing a problem with P, and it’s a hard problem, but Jeffrey also uses P, so that doesn’t solve it.
> … set our model to be a list of “events” we’ve observed …
I didn’t understand this part. If you “evaluate events”, then events have some sort of bit representation in the agent, right? I don’t clearly see the events in your “Updates Are Computable” example, so I can’t say much and may be confused, but I have a strong feeling that you could define U as a function on those bits and get the same agent.
This is an interesting alternative, which I have never seen spelled out in axiomatic foundations.
The point would be to set U(p) = p(“button has been pressed”) and then decide to “press the button” by evaluating U(P conditioned on “I’m pressing the button”) * P(“I’m pressing the button” | “press the button”), where P is the agent’s current belief, and p is a variable of the same type as P.
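A simplified sketch of that decision rule (the joint distribution and numbers are invented, and for brevity it conditions only on the action, omitting the P(“I’m pressing the button” | “press the button”) factor): U takes a belief p and returns p(“button has been pressed”), and each action is scored by applying U to the current belief conditioned on that action.

```python
# Toy joint belief over (action taken, button pressed): P[(action, pressed)].
P = {
    ("press", True): 0.45, ("press", False): 0.05,
    ("wait", True): 0.10, ("wait", False): 0.40,
}

def condition(P, action):
    """Return the belief p(. | action) as a distribution over 'pressed'."""
    z = sum(v for (a, _), v in P.items() if a == action)
    return {pressed: v / z for (a, pressed), v in P.items() if a == action}

def U(p):
    """U takes a belief p and returns p("button has been pressed")."""
    return p[True]

# Score each action by evaluating U on the conditioned belief.
scores = {a: U(condition(P, a)) for a in ("press", "wait")}
assert scores["press"] > scores["wait"]
```

The key feature is that U’s argument is a belief p of the same type as the agent’s current P, not a world, which is the construction the parent comment calls unusual.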
If you actually do want to work on AI risk, but something is preventing you, you can just say “personal reasons”, I’m not going to ask for details.
I understand that my style is annoying to some. Unfortunately, I have not observed polite and friendly people getting interesting answers, so I’ll have to remain like that.
OK, there are many people writing explanations, but if all of them are rehashing the same points from Superintelligence book, then there is not much value in that (and I’m tired of reading the same things over and over). Of course you don’t need new arguments or new evidence, but it’s still strange if there aren’t any.
Anyone who has read this FAQ and others, but isn’t a believer yet, will have some specific objections. But I don’t think everyone’s objections are unique, a better FAQ should be able to cover them, if their refutations exist to begin with.
Also, are you yourself working on AI risk? If not, why not? Is this not the most important problem of our time? Would EY not say that you should work on it? Could it be that you and him actually have wildly different estimates of P(AI doom), despite agreeing on the arguments?
As for Raemon, you’re right, I probably misunderstood why he’s unhappy with newer explanations.
Stampy seems pretty shallow, even more so than this FAQ. Is that what you meant by it not filling “this exact niche”?
By the way, I come from AGI safety from first principles, where I found your comment linking to this. Notably, that sequence says “My underlying argument is that agency is not just an emergent property of highly intelligent systems, but rather a set of capabilities which need to be developed during training, and which won’t arise without selection for it.” which is reasonable and seems an order of magnitude more conservative than this FAQ, which doesn’t really touch the question of agency at all.
I’m talking specifically about discussions on LW. Of course in reality Alice ignores Bob’s comment 90% of the time, and that’s a problem in its own right. It would be ideal if people who have distinct information would choose to exchange that information.
I picked a specific and reasonably grounded topic, “x-risk”, or “the probability that we all die in the next 10 years”, which is one number, so not hard to compare, unless you want to break it down by cause of death. In contrived philosophical discussions, it can certainly be hard to determine who agrees on what, but I have a hunch that this is the least of the problems in those discussions.
A lot of things have zero practical impact, and that’s also a problem in its own right. It seems to me that we’re barely ever having “is working on this problem going to have practical impact?” type of discussions.
I want neither. I observe that Raemon cannot find an up to date introduction that he’s happy with, and I point out that this is really weird. What I want is an explanation to this bizarre situation.
Is your position that Raemon is blind, and good, convincing explanations are actually abundant? If so, I’d like to see them, it doesn’t matter where from.
“The world is full of adversarial relationships” is pretty much the weakest possible argument and is not going to convince anyone.
Are you saying that MIRI website has convincing introductory explanation of AI risk, the kind that Raemon wishes he had? Surely he would have found them already? If there aren’t, then, again, why not?
Why We Disagree
If our relationship to them is adversarial, we will lose. But you also need to argue that this relationship will (likely) be adversarial.
Also, I’m not asking you to make the case here, I’m asking why the case is not being made on front page of LW and on every other platform. Would that not help with advocacy and recruitment? No idea what “keeping up with current events” means.
On second thought, I don’t agree that the number of outputs is the right criterion. It’s the “narrowness” of the training environment that matters. E.g. you could also train an LLM to play chess. I believe that it could get good, but this would not transfer into any kind of “preference for chess” or “desire to win”, neither in the actions it takes, nor in the self-descriptions it generates. Because the training environment rewards no such things. At most the training might generate some tree search subroutines, which might be used for other tasks. Or the LLM might learn that it has been trained and say “I’m good at chess”, but this wouldn’t be a direct consequence of the chess skill.