Thanks for the feedback!
I am indeed thinking about your intuitions on goal-directed behavior, because they seem quite important. I currently lack a clear (and as formal as possible) idea of what you mean, and thus I have trouble weighing your arguments that it is not necessary, or that it causes most problems in safety. And since these arguments would have significant implications, I want to have as informed an opinion on them as possible.
Since you say that goal-directed behavior is not about having a model or not, is it about the form of the model? Or about the use of the model? Would a model-based agent that did not adapt its model when the environment changed be considered not goal-directed (like the lookup-table agent in your example)?
I’m curious about what you think people are aware of: that the idea of goal-directedness from the value learning sequence is captured by model-based RL, or that any sufficiently powerful agent (implicitly goal-directed) needs to be model-based instead of model-free?
If that’s the former, I’m really interested in links to posts and comments pointing that out, as I don’t know of any. And if that’s the latter, then it seems to go back to asking whether powerful agents must be goal-directed.
For more examples, here are the bugs I found following the prompts:
I consistently take too much time to wake up in the morning, between 30 minutes and 2 hours too much.
When working on something, I tend to do “just enough” to make good progress on it, and then stop for the day. Even if I could have kept going.
Although I am very comfortable in conversations, I have a weird anxiety about starting one with a complete stranger.
I have a consistent reluctance to start a new activity, like a coding project or cooking a new recipe. Whereas I thrive on new ideas.
My focus wanes around an hour after I start working on something, even on the best days, and I would like it to last longer.
I keep procrastinating on washing my dishes.
I take too much time thinking about how to do things and what I should do, and too little doing the things.
I regularly feel I’m not important to people.
I have trouble focusing when reading maths, and that’s something I would want to improve.
I’m not sure these are bugs at the right level, but that’s what I got out of the prompts.
I am not sure I understand exactly what you are aiming for with your take 3: is applied rationality a rationality that I don’t have to follow when I want to use emotions/intuitions/breaks, or is it a rationality that considers these options when making decisions? The former seems too permissive, in that there is never a place where I have to use rationality, while the latter might fall into the same issues as the restaurant example, by pondering too much whether I should use intuition to choose my meal.
That being said, I like the different perspectives offered by the takes.
Nice post. Being convinced myself of the importance of mathematics, both for understanding the world in general and for the specific problems of AI safety, I found it interesting to see what arguments you marshaled for and against this position.
About the unreasonable effectiveness of mathematics, I’d like to throw in the “follow-up” statement: the unreasonable ineffectiveness of mathematics beyond physics (for example in biology). The counterargument, at least for biology, is that Wigner was talking mostly about differential equations, which do seem somewhat ineffective in biology; but theoretical computer science, which one can see as the mathematical study of computation, and thus somewhat a branch of mathematics, might be better suited to biology.
A general comment about your perspective is that you seem to equate mathematics with formal specifications and proofs. That’s not necessarily an issue, but most modern mathematicians tend not to be strict formalists, so I thought it important to point out.
For the rest of my comments:
Rather than precise, I would say that mathematics is formal. The difference lies in the fact that a precise statement captures an idea almost exactly, whereas a formalization provides an objective description of… something. Given that the main difficulty in applying mathematics, and in writing specifications for formal methods, is this ontological identification between the formalization and the object in the world, I feel it’s a bit too easy to say that maths captures ideas precisely.
Similarly, the fact that the definitions themselves are unambiguous (if they are formal) does not mean that their interpretation, meaning, and use are. I agree that a formal definition is far less ambiguous than a natural-language one, but that does not mean it is completely unambiguous. Many disagreements I have had in research were about the interpretation of the formalisms themselves.
Although I agree with the idea of mathematics capturing some concept of simplicity, I would add that it is simplicity once everything is made explicit. That’s rather obvious for rationalists. Formal definitions tend to be full of subtleties and hard to manage, but fully explicit versions of the “simpler” informal models would actually be more complex than that.
Nitpick about the “quantitative”: what of abstract algebra, and all the subfields that are not explicitly quantitative? Are they useful only insofar as they serve the more quantitative parts of maths, or am I taking this argument too far, and you just meant that one use of maths lies in its quantitative parts?
The talk about Serial Depth makes me think about deconfusion. I feel it is indeed rather easy to make someone not confused about making a sandwich, while this has yet to be done for AI Safety.
The Anthropocentrism argument feels right to me, but I think it doesn’t apply if one is trying to build prosaic aligned AGI. Then the “most important” part is to deal with rather anthropocentric models of decisions and values, instead of abstracting them away. But I might be wrong on that one.
I find discussions about AI takeoff to be very confusing.
So do I. So thanks a lot for this summary!
Why should all equivalence classes of linked worlds have the same average utility? That ensures the uniqueness of the utility function up to translation, but I’m not sure it’s always the best way to do it. What is the intuition behind this specific choice?
Thanks, I’ll keep going then.
I don’t see the link with my objection: the part of your post you quote is about value impact (which depends on the values of the specific agents), while I am talking about the need for context even for objective impact (which you present as independent of the values and objectives of specific agents).
I have one potential criticism of the examples:
Because I was not sure what the concrete implications of the asteroid impact were, the reveal that anybody would value it negatively, because they risk death, had little impact on me (pun intended). Had you written that the asteroid strikes near the agent, or that it causes massive catastrophes, then I would probably have thought that it mattered as much to local pebblehoarders as to humans. Also, the asteroid might destroy pebbles (or, depending on your definition of pebble, make new ones).
Also, I feel that some of your examples of objective impact are indeed relevant to agents in general (not dying/being destroyed), while others depend on sharing a common context (cash, which would be utterly useless in Pebblia if the local economy were based on exchanging pebbles for pebbles).
Do you just always consider this context as implicit?
Thanks, I’m looking into the toy model. :)
I really like the refinement of the formalization, with the explanations of what to keep and what was missing.
That said, I feel like the final formalization could be defined directly as a special type of preorder, one composed only of disjoint chains and cycles, because as I understand the rest of the post, that is what you use when computing the utility function. This formalization would also be more direct, with one less layer of abstraction.
Is there any reason to prefer the “injective function” definition to the “special preorder” one?
Another modality of relating, introduced to me by a friend a couple of weeks ago, is “what kind of experience do you take from this relation”. My friend has a quite idiosyncratic classification, but you could separate the people you see by combinations of intellectual stimulation, sense of security, being cared for… In my mind this is quite orthogonal to the other directions: whatever a relation holds for you, it might matter tremendously or very little.
The main use I have for this modality is to clarify what I am missing in my life. For example, when I feel lonely, I feel a discrepancy with my social situation: I have many friends, some really close who care about me and about whom I care. But when considering what experience I feel I am missing in my relationships, I can say that it’s attraction and passion for the other and sexual tension and action.
Yes, I agree that you are focusing more on how to see the mistake in a meta way, instead of from the outside view as Nate does.
Though I don’t think your example of the distinction is exactly the right one: Nate’s idea of banning “should” or cashing out “should” would be able, IMHO, to unearth the underlying “I should be taking things seriously” and apply the consequentialist analysis of “you will not be measured by how you felt or who you punished. You will be measured by what actually happened, as will we all” (paraphrasing). What I feel is different is that the Way provides a means for systematically finding this underlying should and explaining it from the inside.
Nonetheless, I find both useful, and I am better for having the Curse of the Counterfactual in my mental toolbox.
I just found another interesting reference: Categories for the practising physicist. Although this is not exactly about discarding undue abstraction, it does present many concepts in terms of concrete examples, and there are even real-world categories defined in it!
Great post! I want to chew on it a bit before making a longer comment, but I noticed similarities between this post and Nate Soares’s Replacing Guilt sequence (which I consider the most important sequence… ever). More specifically, he seems to say similar things about guilt and should in “should” considered harmful, Not because you “should” and Your “shoulds” are not a duty.
For example, from “should” considered harmful:
I see lots of guilt-motivated people use “shoulds” as ultimatums: “either I get the meds, or I am a bad person.” They leave themselves only two choices: go out of their way on the way to work and suffer through awkward human interaction at the pharmacy, or be bad. Either way, they lose: the should has set them up for failure.
But the actual options aren’t “suffer” or “be bad.” The actual options are “incur the social/time costs of buying meds” or “incur the physical/mental costs of feeling ill.” It’s just a choice: you weigh the branches, and then you pick. Neither branch makes you “bad.” It’s ok to decide that the social/time costs outweigh the physical/mental costs. It’s ok to decide the opposite. Neither side is a “should.” Both sides are an option.
And the idea of preferring to punish someone (me or another), instead of actually looking at the situation and accepting it, makes me think of tolerification:
There’s a certain type of darkness in the world that most people simply cannot see. It’s not the abstract darkness: people will readily acknowledge that the world is broken, and explain how and why the hated out-group is responsible. And that’s exactly what I’m pointing at: upon seeing that the world is broken, people experience an impulse to explain the brokenness in a way that relieves the tension. When seeing that the world is broken, people reflexively feel a need to explain. Carol can acknowledge that there is suffering abroad, but this acknowledgement comes part and parcel with an explanation about why she bears no responsibility. Dave can acknowledge that he failed to pass the interview, but his mind automatically generates reasons why this is an acceptable state of affairs.
This is the type of darkness in the world that most people cannot see: they cannot see a world that is unacceptable. Upon noticing that the world is broken, they reflexively list reasons why it is still tolerable. Even cynicism, I think, can fill this role: I often read cynicism as an attempt to explain a world full of callous neglect and casual cruelty, in a framework that makes neglect and cruelty seem natural and expected (and therefore tolerable).
I call this reflexive response “tolerification,” and if you watch for it, you can see it everywhere.
The approach to these questions in the Replacing Guilt series is not exactly at the same level; most notably, I feel Nate is trying to explain why shoulds are not “useful” and only cause harm without serving your goals. On the other hand, I see this post as more about examining the exact mechanism underlying this error we make.
Still, I feel the connection is strong enough to encourage people to read both.
Great post! It explained clearly both positions, clarified the potential uses of PAL and proposed variations when it was considered accessible.
Maybe my only issue is with the (lack of) definition of the principal-agent problem. The rest of the post works relatively well without you defining it explicitly, but I think a short definition (even just a rephrasing of the one on Wikipedia) would make the post even more readable.
Okay, so we agree that it is in general improbable (at least for decision problems) to be able to verify an answer faster than finding it. What you care about are the cases where verification is easier, as is conjectured for example for NP (where verification is polynomial, but finding an answer is believed not to be).
For IP, if we only want to verify some real-world property, I actually have a simple example that I give in my intro to complexity theory lectures. Imagine that you are color-blind (precisely, a specific red and a specific green look exactly the same to you). If I have two balls, perfectly similar except that one is green and the other is red, I can convince you that these balls are of different colors. It is basically the interactive protocol for graph non-isomorphism: you flip a coin and, depending on the result, you exchange the balls (or not) without me seeing it. If I can correctly tell whether you exchanged the balls over sufficiently many rounds, then you should become convinced that I can actually distinguish them.
Of course this is not necessarily applicable to questions like tastes. Moreover, it is a protocol for showing that I can distinguish between the balls; it does not show why.
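To make the protocol concrete, here is a minimal sketch of a round-by-round simulation, under the assumption that the prover answers by checking whether the ball on the left changed; all names are illustrative, not from any existing library or post.

```python
# Minimal sketch of the ball-swapping protocol described above, in the spirit of
# the interactive proof for graph non-isomorphism. Names are illustrative assumptions.
import random

def prover_says_swapped(shown_left, original_left):
    """A prover who can see colors simply checks whether the left ball changed."""
    return shown_left != original_left

def run_protocol(rounds=20, prover_sees_colors=True):
    """The color-blind verifier flips a coin each round and secretly swaps the
    balls (or not); the prover must say whether a swap happened. An honest prover
    is always right; a bluffing prover can only guess, so she passes all rounds
    with probability 2**-rounds."""
    left = "red"  # the verifier starts with the red ball on the left
    for _ in range(rounds):
        swapped = random.random() < 0.5
        shown_left = ("green" if left == "red" else "red") if swapped else left
        if prover_sees_colors:
            claim = prover_says_swapped(shown_left, left)
        else:
            claim = random.random() < 0.5  # color-blind prover can only guess
        if claim != swapped:
            return False  # caught: verifier rejects
    return True  # correct every round: verifier is convinced

if __name__ == "__main__":
    print("seeing prover accepted:", run_protocol(prover_sees_colors=True))
    print("guessing prover accepted:", run_protocol(prover_sees_colors=False))
```

With 20 rounds, a prover who cannot actually distinguish the balls is accepted with probability about one in a million, which is the sense in which the verifier should get convinced.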
Could you give a list of some open problems or open questions related to this agenda (maybe with some pointers to the more relevant posts)? I am potentially interested in working on it, but I find it far easier to study a topic (and you sir write a lot of technical posts) while trying to solve some concrete problem.
Thanks in advance!