This seems like it is not about the “motivational system”, and if this were implemented in a robot that does have a separate “motivational system” (i.e. it is goal-directed), I worry about a nearest unblocked strategy.
I am confused about where you think the motivation system comes into my statement. It sounds like you are imagining that what I said is a constraint, which could somehow be coupled with a separate motivation system. If that’s your interpretation, that’s not what I meant at all, unless random sampling counts as a motivation system. I’m saying that all you do is sample from what’s consented to.
But, maybe what you are saying is that in “the intersection of what the user expects and what the user wants”, the first is functioning as a constraint, and the second is functioning as a motivation system (basically the usual IRL motivation system). If that’s what you meant, I think that’s a valid concern. What I was imagining is that you are trying to infer “what the user wants” not in terms of end goals, but rather in terms of actions (really, policies) for the AI. So, it is more like an approval-directed agent to an extent. If the human says “get me groceries”, the job of the AI is not to infer the end state the human is asking the robot to optimize for, but rather, to infer the set of policies which the human is trying to point at.
There’s no optimization on top of this which would find perverse instantiations of the constraints; the AI just follows the policy which it infers the human would like. Of course, the powerful learning system required for this to work may perversely instantiate these beliefs (i.e., there may be daemons, aka inner optimizers).
(The most obvious problem I see with this approach is that it seems to imply that the AI can’t help the human do anything which the human doesn’t already know how to do. For example, if you don’t know how to get started filing your taxes, then the robot can’t help you. But maybe there’s some way to differentiate between more benign cases like that and less benign cases like using nanotechnology to more effectively get groceries?)
A third interpretation of your concern is that you’re saying that if the thing is doing well enough to get groceries, there has to be powerful optimization somewhere, and wherever it is, it’s going to be pushing toward perverse instantiations one way or another. I don’t have any argument against this concern, but I think it mostly amounts to a concern about inner optimizers.
(I feel compelled to mention again that I don’t feel strongly that the whole idea makes any sense. I just want to convey why I don’t think it’s about constraining an underlying motivation system.)
Non-Consequentialist Cooperation? (Abram Demski): [...]
However, this also feels different from corrigibility, in that it feels more like a limitation put on the AI system, while corrigibility seems more like a property of the AI’s “motivational system”. This might be fine, since the AI might just not be goal-directed. One other benefit of corrigibility is that if you are “somewhat” corrigible, then you would like to become more corrigible, since that is what the human would prefer; informed-consent-AI doesn’t seem to have an analogous benefit.
You could definitely think of it as a limitation to put on a system, but I actually wasn’t thinking of it that way when I wrote the post. I was trying to imagine something which only operates from this principle. Granted, I didn’t really explain how that could work. I was imagining that it does something like sample from a probability distribution which is (speaking intuitively) the intersection of what you expect it to do and what you would like it to do.
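One way to make the "intersection" intuition concrete is a toy sketch like the one below. It is an illustration only, not a formalization the comment commits to: reading "intersection" as a renormalized pointwise product of two distributions is an assumption, and the policy names, the probabilities, and the helpers `intersect` and `sample_policy` are all hypothetical.

```python
import random

def intersect(expected, wanted):
    """Pointwise product of two distributions over policies, renormalized.

    Assumption: this product is just one possible reading of "intersection".
    """
    weights = {p: expected.get(p, 0.0) * wanted.get(p, 0.0) for p in expected}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no policy is both expected and wanted")
    return {p: w / total for p, w in weights.items()}

def sample_policy(dist, rng=random):
    """Sample one policy; there is no optimization step on top of the sampling."""
    policies = list(dist)
    return rng.choices(policies, weights=[dist[p] for p in policies], k=1)[0]

# Hypothetical numbers for a "get me groceries" request.
expected = {"walk_to_store": 0.6, "order_delivery": 0.3, "build_nanobots": 0.1}
wanted = {"walk_to_store": 0.5, "order_delivery": 0.5, "build_nanobots": 0.0}

consented = intersect(expected, wanted)
chosen = sample_policy(consented)
```

In this sketch a policy that the human does not want gets zero weight however strongly it is expected, and the system only samples from what is left rather than searching for the best-scoring option.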
(It now seems to me that although I put “non-consequentialist” in the title of the post, I didn’t explain the part where it isn’t consequentialist very well. Which is fine, since the post was very much just spitballing.)
Agreed. I’ll at least edit the post to point to this comment.
I’m not sure which you’re addressing, but, note that I’m not objecting to the practice of illustrating variables with diamonds and boxes rather than only circles so that you can see at a glance where the choices and the utility are (although I don’t tend to use the convention myself). I’m objecting to the further implication that doing this makes it not a Bayes net.
I hear there is a way to fiddle with the foundations of probability theory so that conditional probabilities are taken as basic and ordinary probabilities are defined in terms of them. Maybe this would solve the problem?
This does help somewhat. See here. But, in order to get good answers from that, you need to already know enough about the structure of the situation.
Maybe I’m late to the party, in which case sorry about that & I look forward to hearing why I’m wrong, but I’m not convinced that epsilon-exploration is a satisfactory way to ensure that conditional probabilities are well-defined. Here’s why:
I agree, but I also think there are some things pointing in the direction of “there’s something interesting going on with epsilon exploration”. Specifically, there’s a pretty strong analogy between epsilon exploration and modal UDT: MUDT is like the limit as you send exploration probability to zero, so it never actually happens but it still happens in nonstandard models. However, that only seems to work when you know the structure of the situation logically. When you have to learn it, you have to actually explore sometimes to get it right.
To the extent that MUDT looks like a deep result about counterfactual reasoning, I take this as a point in favor of epsilon exploration telling us something about the deep structure of counterfactual reasoning.
Anyway, see here for some more recent thoughts of mine. (But I didn’t discuss the question of epsilon exploration as much as I could have.)
I disagree. All the nodes in the network should be thought of as grounding out in imagination, in that it’s a world-model, not a world. Maybe I’m not seeing your point.
I would definitely like to see a graphical model that’s more capable of representing the way the world-model itself is recursively involved in decision-making.
One argument for calling an influence diagram a generalization of a Bayes net could be that the conditional probability table for the agent’s policy given observations is not given as part of the influence diagram, and instead must be solved for. But we can still think of this as a special case of a Bayes net, rather than a generalization, by thinking of an influence diagram as a special sort of Bayes net in which the decision nodes have to have conditional probability tables obeying some optimality notion (such as the CDT optimality notion, the EDT optimality notion, etc).
This constraint is not easily represented within the Bayes net itself, but instead imposed from outside. It would be nice to have a graphical model in which you could represent that kind of constraint naturally. But simply labelling things as decision nodes doesn’t do much. I would rather have a way of identifying something as agent-like based on the structure of the model for it. (To give a really bad version: suppose you allow directed cycles, rather than requiring DAGs, and you think of the “backwards causality” as agency. But, this is really bad, and I offer it only to illustrate the kind of thing I mean—allowing you to express the structure which gives rise to agency, rather than taking agency as a new primitive.)
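To illustrate the idea of a decision node as an ordinary node whose conditional probability table is solved for rather than given, here is a minimal sketch. It assumes a deliberately simple optimality notion (pick the action with the highest expected utility for each observation) and does not try to distinguish CDT from EDT. The function `solve_decision_cpt`, the umbrella example, and the utility numbers are all made up for illustration.

```python
def solve_decision_cpt(observations, actions, expected_utility):
    """Fill in P(action | observation) for a decision node.

    The table is not supplied with the model; it is imposed from outside by
    an optimality condition (here: maximize the given expected utility).
    """
    cpt = {}
    for obs in observations:
        best = max(actions, key=lambda a: expected_utility[(obs, a)])
        # A deterministic policy is still a valid conditional probability table.
        cpt[obs] = {a: 1.0 if a == best else 0.0 for a in actions}
    return cpt

observations = ["rain", "sun"]
actions = ["umbrella", "no_umbrella"]
# Hypothetical expected utilities for each (observation, action) pair.
expected_utility = {
    ("rain", "umbrella"): 1.0,
    ("rain", "no_umbrella"): -1.0,
    ("sun", "umbrella"): 0.0,
    ("sun", "no_umbrella"): 1.0,
}

decision_cpt = solve_decision_cpt(observations, actions, expected_utility)
```

Once `decision_cpt` is filled in, the decision node looks like any other node in the net. The optimality requirement lives in `solve_decision_cpt`, outside the graph.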
All in all, I can’t wrap my head around “what is the difference between a producer and a consumer of thought?” because the question as posed seems to hold rigor, even quality, constant/irrelevant.
I’m not trying to hold it constant, I’m just trying to understand a relatively low standard, because that’s the part I feel confused about. It seems relatively much easier to look at bad intellectual output and say how it could have been better, think about the thought processes involved, etc. Much harder to say what goes into producing output at all vs not doing so.
It’s important to note that accuracy and calibration are two different things. I’m mentioning this because the OP asks for calibration metrics, but several answers so far give accuracy metrics. Any proper scoring rule is a measure of accuracy as opposed to calibration.
It is possible to be very well-calibrated but very inaccurate; for example, you might know that it is going to be Monday 1/7th of the time, so you give a probability of 1/7th. Everyone else just knows what day it is. On a calibration graph, you would be perfectly lined up; when you say 1/7th, the thing happens 1/7th of the time.
It is also possible to have high accuracy and poor calibration. Perhaps you can guess coin flips when no one else can, but you are wary of your precognitive powers, which makes you underconfident. So, you always place 60% probability on the event that actually happens (heads or tails). Your calibration graph is far out of line, but your accuracy is higher than anyone else’s.
In terms of improving rationality, the interesting thing about calibration is that (as in the precog example) if you know you’re poorly calibrated, you can boost your accuracy simply by improving your calibration. In some sense it is a free improvement: you don’t need to know anything more about the domain; you get more accurate just by knowing more about yourself (by seeing a calibration chart and adjusting).
However, if you just try to be more calibrated without any concern for accuracy, you could be like the person who says 1/7th. So, just aiming to do well on a score of calibration is not a good idea. This could be part of the reason why calibration charts are presented instead of calibration scores. (Another reason being that calibration charts help you know how to adjust to increase calibration.)
That being said, a decomposition of a proper scoring rule into components including a measure of calibration, like Dark Denego gives, seems like the way to go.
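The two examples above can be checked numerically. The sketch below uses the Brier score as the accuracy measure and a frequency-weighted squared gap between stated probability and observed frequency as the calibration measure. Both choices are assumptions made for illustration, as are the specific outcome sequences and all the names in the code.

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

def calibration_gap(forecasts, outcomes):
    """Weighted mean squared gap between each stated probability and how often
    the event happened when that probability was stated (0 = perfectly calibrated)."""
    by_forecast = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        by_forecast[f].append(o)
    return sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in by_forecast.items()
    ) / len(outcomes)

# "Is it Monday?" asked once per day over one week.
is_monday = [1, 0, 0, 0, 0, 0, 0]
always_one_seventh = [1 / 7] * 7               # calibrated but uninformed
knows_the_day = [float(o) for o in is_monday]  # calibrated and accurate

# Coin flips (1 = heads) with an underconfident precog.
flips = [1, 0, 1, 1, 0, 0, 1, 0]
underconfident_precog = [0.6 if o == 1 else 0.4 for o in flips]
no_information = [0.5] * len(flips)
# Recalibrating: after seeing the chart, say 100% wherever you used to say 60%.
recalibrated_precog = [1.0 if f > 0.5 else 0.0 for f in underconfident_precog]
```

On these toy sequences, the person who always says 1/7 has a calibration gap of 0 but a worse Brier score than the person who knows the day. The precog has a better Brier score (0.16) than the uninformed 50% guesser (0.25) despite a nonzero calibration gap, and recalibrating brings the precog's Brier score to 0 with no new knowledge about the coin.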
I guess, philosophically, I worry that giving the nodes special types like that pushes people toward thinking about agents as not-embedded-in-the-world, thinking things like “we need to extend Bayes nets to represent actions and utilities, because those are not normal variable nodes”. Not that memoryless cartesian environments are any better in that respect.
Hrm. I realize that the post would be comprehensible to a much wider audience with a glossary, but there’s one level of effort needed for me to write posts like this one, and another level needed for posts where I try to be comprehensible to someone who lacks all the jargon of MIRI-style decision theory. Basically, if I write with a broad audience in mind, then I’m modeling all the inferential gaps and explaining a lot more details. I would never get to points like the one I’m trying to make in this post. (I’ve tried.) Posts like this are primarily for the few people who have kept up with the CDT=EDT sequence so far, to get my updated thinking in writing in case anyone wants to go through the effort of trying to figure out what in the world I mean. To people who need a glossary, I recommend searching LessWrong and the Stanford Encyclopedia of Philosophy.
I’ve avoided people/conversations on those grounds, but I’m not sure it is the best way to deal with it. And I really do think good intellectual progress can be made at level 2. As Ruby said in the post I’m replying to, intellectual debate is common in analytic philosophy, and it does well there.
Maybe my description of intellectual debate makes you think of all the bad arguments-are-soldiers stuff. Which it should. But, I think there’s something to be said about highly developed cultures of intellectual debate. There are a lot of conventions which make it work better, such as a strong norm of being charitable to the other side (which, in intellectual-debate culture, means an expectation that people will call you out for being uncharitable). This sort of simulates level 3 within level 2.
As for level 1, you might be able to develop some empathy for it at times when you feel particularly vulnerable and need people to do something to affirm your belongingness in a group or conversation. Keep an eye out for times when you appreciate level-one behavior from others, times when you would have appreciated some level-one comfort, or times when other people engage in level one (and decide whether it was helpful in the situation). It’s nice when we can get to a place where no one’s ego is on the line when they offer ideas, but sometimes it just is. Ignoring it doesn’t make it go away, it just makes you manage it ineptly. My guess is that you are involved with more level one situations than you think, and would endorse some of it.
(lightly edited version of my original email reply to above comment; note that Diffractor was originally replying to a version of the Dutch-book which didn’t yet call out the fact that it required an assumption of nonzero probability on actions.)
I agree that this Dutch-book argument won’t touch probability zero actions, but my thinking is that it really should apply in general to actions whose probability is bounded away from zero (in some fairly broad setting). I’m happy to require an epsilon-exploration assumption to get the conclusion.
Your thought experiment raises the issue of how to ensure in general that adding bets to a decision problem doesn’t change the decisions made. One thought I had was to make the bets always smaller than the difference in utilities. Perhaps smaller Dutch-books are in some sense less concerning, but as long as they don’t vanish to infinitesimal, seems legit. A bet that’s desirable at one scale is desirable at another. But scaling down bets may not suffice in general. Perhaps a bet-balancing scheme to ensure that nothing changes the comparative desirability of actions as the decision is made?
For your cosmic ray problem, what about:
You didn’t specify the probability of a cosmic ray. I suppose it should have probability higher than the probability of exploration. Let’s say 1/million for cosmic ray, 1/billion for exploration.
Before the agent makes the decision, it can be given the option to lose .01 util if it goes right, in exchange for +.02 utils if it goes right & cosmic ray. This will be accepted (by either a CDT agent or EDT agent), because it is worth approximately +.01 util conditioned on going right, since cosmic ray is almost certain in that case.
Then, while making the decision, cosmic ray conditioned on going right looks very unlikely in terms of CDT’s causal expectations. We give the agent the option of getting .001 util if it goes right, if it also agrees to lose .02 conditioned on going right & cosmic ray.
CDT agrees to both bets, and so loses money upon going right.
Ah, that’s not a very good money pump. I want it to lose money no matter what. Let’s try again:
Before decision: option to lose 1 millionth of a util in exchange for 2 utils if right&ray.
During decision: option to gain .1 millionth util in exchange for −2 util if right&ray.
That should do it. CDT loses .9 millionth of a util, with nothing gained. And the trick is almost the same as my Dutch book for Death in Damascus. I think this should generalize well.
The amounts of money lost in the Dutch Book get very small, but that’s fine.
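The bookkeeping for the second pair of bets can be checked by enumerating the four possible outcomes. This sketch only verifies the net payoff once both bets have been accepted; it does not model why the agent accepts each one. It assumes the two small fees are paid unconditionally, which is one reading of the bets as stated, and the function name `net_payoff` is made up.

```python
from itertools import product

FEE_BEFORE = -1e-6    # "lose 1 millionth of a util" before the decision
BONUS_DURING = 0.1e-6 # "gain .1 millionth util" during the decision
STAKE = 2.0           # paid out, then paid back, if right & cosmic ray

def net_payoff(goes_right, cosmic_ray):
    """Total utility from both bets in one outcome (assumes unconditional fees)."""
    hit = goes_right and cosmic_ray
    first_bet = FEE_BEFORE + (STAKE if hit else 0.0)
    second_bet = BONUS_DURING - (STAKE if hit else 0.0)
    return first_bet + second_bet

# Enumerate every combination of (goes right?, cosmic ray?).
payoffs = {
    (right, ray): net_payoff(right, ray)
    for right, ray in product([True, False], repeat=2)
}
```

The contingent 2-util payments cancel whenever they trigger, so every outcome leaves the agent down about 0.9 millionths of a util, up to floating-point rounding.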
“The expectations should be equal for actions with nonzero probability”: this means a CDT agent should have equal causal expectations for any action taken with nonzero probability, and EDT agents should similarly have equal evidential expectations. Actually, I should revise my statement to be more careful: in the case of epsilon-exploring agents, the condition is >epsilon rather than >0. In any case, my statement there isn’t about evidential and causal expectations being equal to each other, but rather about one of them being constant across (sufficiently probable) actions.
“differing counterfactual and evidential expectations are smoothly more and more tenable as actions become less and less probable”: this means that the amount we can take from a CDT agent through a Dutch Book, for an action which is given a different causal expectation than evidential expectation, smoothly reduces as the probability of an action goes to zero. In that statement, I was assuming you hold the difference between evidential and causal expectations constant as you reduce the probability of the action. Otherwise it’s not necessarily true.
I think it’s usually a good idea overall, but there is a less cooperative conversational tactic which tries to masquerade as this: listing a number of plausible straw-men in order to create the appearance that all possible interpretations of what the other person is saying are bad. (Feels like from the inside: all possible interpretations are bad; I’ll demonstrate it exhaustively...)
It’s not completely terrible, because even this combative version of the conversational move opens up the opportunity for the other person to point out the (n+1)th interpretation which hasn’t been enumerated.
You can try to differentiate yourself from this via tone (by not sounding like you’re trying to argue against the other person in asking the question), but, this will only be somewhat successful since someone trying to make the less cooperative move will also try to sound like they’re honestly trying to understand.
My gut response is that hillclimbing is itself consequentialist, so this doesn’t really help with fragility of value; if you get the hillclimbing direction slightly wrong, you’ll still end up somewhere very wrong. On the other hand, Paul’s approach rests on something which we could call a deontological approach to the hillclimbing part (IE, amplification steps do not rely on throwing more optimization power at a pre-specified function).
I wouldn’t say that preference utilitarianism “falls apart”; it just becomes much harder to implement.
And I’d like a little more definition of “autonomy” as a value—how do you operationally detect whether you’re infringing on someone’s autonomy?
My (still very informal) suggestion is that you don’t try to measure autonomy directly and optimize for it. Instead, you try to define and operate from informed consent. This (maybe) allows a system to have enough autonomy to perform complex and open-ended tasks, but not so much that you expect perverse instantiations of goals.
My proposed definition of informed consent is “the human wants X and understands the consequences of the AI doing X”, where X is something like a probability distribution on plans which the AI might enact. (… that formalization is very rough)
Is it just the right to make bad decisions (those which contradict stated goals and beliefs)?
This is certainly part of respecting an agent’s autonomy. I think more generally respecting someone’s autonomy means not taking away their freedom, not making decisions on their behalf without having prior permission to do so, and avoiding operating from assumptions about what is good or bad for a person.
Autonomy is a value and can be expressed as a part of a utility function, I think. So ambitious value learning should be able to capture it, so an aligned AI based on ambitious value learning would respect someone’s autonomy when they value it themselves. If they don’t, why impose it upon them?
One could make a similar argument for corrigibility: ambitious value learning would respect our desire for it to behave corrigibly if we actually wanted that, and if we didn’t want that, why impose it?
Corrigibility makes sense as something to ensure in its own right because it is good to have in case the value learning is not doing what it should (or something else is going wrong).
I think respect for autonomy is similarly useful. It helps avoid evil-genie (perverse instantiation) type failures by requiring that we understand what we are asking the AI to do. It helps avoid preference-manipulation problems which value learning approaches might otherwise have, because regardless of how well expected-human-value is optimized by manipulating human preferences, such manipulation usually involves fooling the human, which violates autonomy.
(In cases where humans understand the implications of value manipulation and consent to it, it’s much less concerning—though we still want to make sure the AI isn’t prone to pressure humans into that, and think carefully about whether it is really OK.)
Is the point here that you expect we can’t solve those problems and therefore need an alternative? The idea doesn’t help with “the difficulties of assuming human rationality” though so what problems does it help with?
It’s less an alternative in terms of avoiding the things which make value learning hard, and more an alternative in terms of providing a different way to apply the same underlying insights, to make something which is less of a ruthless maximizer at the end.
In other words, it doesn’t avoid the central problems of ambitious value learning (such as “what does it mean for irrational beings to have values?”), but it is a different way to try to put those insights together into a safe system. You might add other safety precautions to an ambitious value learner, such as [ambitious value learning + corrigibility + mild optimization + low impact + transparency]. Consent-based systems could be an alternative to that agglomerated approach, either replacing some of the safety measures or making them less difficult to include by providing a different foundation to build on.
Is the idea that even trying to do ambitious value learning constitutes violating someone’s autonomy (in other words someone could have a preference against having ambitious value learning done on them) and by the time we learn this it would be too late?
I think there are a couple of ways in which this is true.
I mentioned cases where a value-learner might violate privacy in ways humans wouldn’t want, because the overall result is positive in terms of the extent to which the AI can optimize human values. This is somewhat bad, but it isn’t X-risk bad. It’s not my real concern. I pointed it out because I think it is part of the bigger picture; it provides a good example of the kind of optimization a value-learner is likely to engage in, which we don’t really want.
I think the consent/autonomy idea actually gets close (though maybe not close enough) to something fundamental about safety concerns which follow an “unexpected result of optimizing something reasonable-looking” pattern. As such, it may be better to make it an explicit design feature, rather than trust the system to realize that it should be careful about maintaining human autonomy before it does anything dangerous.
It seems plausible that, interacting with humans over time, a system which respects autonomy at a basic level would converge to different overall behavior than a value-learning system which trades autonomy off with other values. If you actually get ambitious value learning really right, this is just bad. But, I don’t endorse your “why impose it on them?” argument. Humans could eventually decide to run all-out value-learning optimization (without mild optimization, without low-impact constraints, without hard-coded corrigibility). Preserving human autonomy in the meantime seems
Abstracting your idea a little: in order to go beyond first thoughts, you need some kind of strategy for developing ideas further. Without one, you will just have the same thoughts when you try to “think more” about a subject. I’ve edited my answer to elaborate on this idea.
Well, my original intention was definitely more like “why don’t more people keep developing their ideas further?” as opposed to “why don’t more people have ideas?”—but, I definitely grant that sharing ideas is what I actually am able to observe.
If someone had commented with a one-line answer like “people are intellectually active if it is rewarding”, I would have been very meh about it—it’s obvious, but trivial. All the added detail you gave makes it seem like a pretty useful observation, though.
Two possible caveats --
What determines what’s rewarding? Any set of behaviors can be explained by positing that they’re rewarding, so for this kind of model to be meaningful, there’s got to be a set of rewards involved which are relatively simple and have relatively broad explanatory power.
In order for a behavior to be rewarded in the first place, it has to be generated the first time. How does that happen? Animal trainers build up complicated tricks by rewarding steps incrementally approaching the desired behavior. Are there similar incremental steps here? What are they, and what rewards are associated with them?
(Your spelled-out details give some ideas in those directions.)