You are just normalizing on the dollar. You could ask “how many chickens would I kill to save a human life” instead, and you would normalize on a chicken.
Utility functions are invariant up to positive affine transformation. I don’t need to say how much I value a human life or how much I value a chicken life to make decisions in weird trolley problems involving humans and chickens. I only need to know relative values. However, utility uncertainty messes this up. Say I have two hypotheses: one in which human and chicken lives have the same value, and one in which humans are a million times more valuable. I assign the two hypotheses equal weight.
I could normalize and say that in both cases a human is worth 1 util. Then, when I average across utility functions, humans are about twice as valuable as chickens. But if I normalize and say that in both cases a chicken is worth 1 util, then when I average, the human is worth about 500,000 times as much. (You can still treat it like other uncertainty, but you have to make this normalization choice.)
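To make the arithmetic concrete, here is a minimal sketch of the two normalizations. The function name and the encoding of each hypothesis as a human:chicken ratio are my own illustrative choices; the numbers are just the two equally weighted hypotheses above.

```python
def average_ratio(numeraire):
    """Human:chicken value ratio after averaging over the two hypotheses.

    numeraire is the thing pinned at 1 util in every hypothesis.
    """
    # Each hypothesis is encoded as (value of a human) / (value of a chicken).
    ratios = [1.0, 1_000_000.0]
    weights = [0.5, 0.5]
    if numeraire == "human":
        human = 1.0
        # A chicken is worth 1/ratio utils in each hypothesis.
        chicken = sum(w / r for w, r in zip(weights, ratios))
    elif numeraire == "chicken":
        chicken = 1.0
        # A human is worth ratio utils in each hypothesis.
        human = sum(w * r for w, r in zip(weights, ratios))
    else:
        raise ValueError("numeraire must be 'human' or 'chicken'")
    return human / chicken


print(average_ratio("human"))    # roughly 2
print(average_ratio("chicken"))  # roughly 500,000
```

The hypotheses and weights are identical in both calls; only the choice of what counts as 1 util changes the answer.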
I think it was wrong about the MtG post. I mostly think the negative effects of posting ideas (related to technical topics) that people think are bad are small enough to ignore, except insofar as it messes with my internal state. My system 2 thinks my system 1 is wrong about the external effects, but intends to cooperate with it anyway, because not cooperating with it could be internally bad.
As another example, months ago, you asked me to talk about how embedded agency fits in with the rest of AI safety, and I said something like that I didn’t want to force myself to make any public arguments for or against the usefulness of agent foundations. This is because I think research prioritization is especially prone to rationalization, so it is important to me that any thoughts about research prioritization are not pressured by downstream effects on what I am allowed to work on. (It still can change what I decide to work on, but only through channels that are entirely internal.)
So, I feel like I am concerned for everyone, including myself, but also including people who do not think that it would affect them. A large part of what concerns me is that the effects could be invisible.
For example, I think that I am not very affected by this, but I recently noticed a connection between how difficult it is to get to work on writing a blog post that I think it is good to write, and how much my system 1 expects some people to receive the post negatively. (This happened when writing the recent MtG post.) This is only anecdotal, but I think that posts that seem like bad PR caused akrasia, even when controlling for how good I think the post is on net. The scary part is that there was a long time before I noticed this. If I believed that there was a credible way to detect when there are thoughts you can’t have in the first place, I would be less worried.
I didn’t have many data points, and the above connection might have been a coincidence, but the point I am trying to make is that I don’t feel like I have good enough introspective access to rule out a large, invisible effect. Maybe others do have enough introspective access, but I do not think that just not seeing the outer incentives pulling on you is enough to conclude that they are not there.
I am not saying to falsely encourage him. I think I am mostly saying to continue giving him some attention/platform to get his ideas out in a way that would be heard. The real thing that I want is whatever will cause Bob to not end up back-propagating from the group epistemics into his individual idea generation.
I apologize for using the phrase “epistemic status” in a way that disagrees with the accepted technical term.
I think informed oversight fits better with MtG white than it does with boxing. I agree that the three main examples are boxing-like, and informed oversight is not, but it still feels white to me.
I do think that corrigibility done right is a thing that is in some sense less agentic. I think that things that have goals outside of them are less agentic than things that have their goals inside of them, but I think corrigibility is stronger than that. I want to say something like a corrigible agent not only has its goals partially on the outside (in the human), but also partially has its decision theory on the outside. Idk.
Abram and I submit Embedded Agency.
Yeah, it is just functions that take in two sentences and put both their Gödel numbers into a fixed formula (with 2 inputs).
Thanks, I actually wanted to get rid of the earlier condition that f(x)≥x for all x, and I did that.
This is not a complete answer, but it is part of my picture:
(It is the part of the picture that I can give while being only descriptive, and not prescriptive. For epistemic hygiene reasons, I want to avoid discussions of how much of different approaches we need in contexts (like this one) that would make me feel like I was justifying my research in a way that people might interpret as an official statement from the agent foundations team lead.)
I think that Embedded Agency is basically a refactoring of Agent Foundations in a way that gives one central curiosity-based goalpost, rather than making it look like a bunch of independent problems. It is mostly all the same problems, but it was previously packaged as “Here are a bunch of things we wish we understood about aligning AI,” and is now repackaged as “Here is a central mystery of the universe, and here are a bunch of things we don’t understand about it.” It is not a coincidence that they are the same problems, since they were generated in the first place by people paying close attention to what mysteries of the universe related to AI we haven’t solved yet.
I think of Agent Foundations research as having a different type signature than most other AI Alignment research, in a way that looks kind of like Agent Foundations:other AI alignment::science:engineering. I think of AF as more forward-chaining and other stuff as more backward-chaining. This may seem backwards if you think about AF as reasoning about superintelligent agents, and other research programs as thinking about modern ML systems, but I think it is true. We are trying to build up a mountain of understanding, until we collect enough that the problem seems easier. Others are trying to make direct plans on what we need to do, see what is wrong with those plans, and try to fix the problems. One consequence of this is that AF work is more likely to be helpful given long timelines, partially because AF is trying to be the start of a long journey of figuring things out, but also because AF is more likely to be robust to huge shifts in the field.
I actually like to draw an analogy with this: (taken from this post by Evan Hubinger)
I was talking with Scott Garrabrant late one night recently and he gave me the following problem: how do you get a fixed number of DFA-based robots to traverse an arbitrary maze (if the robots can locally communicate with each other)? My approach to this problem was to come up with and then try to falsify various possible solutions. I started with a hypothesis, threw it against counterexamples, fixed it to resolve the counterexamples, and iterated. If I could find a hypothesis which I could prove was unfalsifiable, then I’d be done.
When Scott noticed I was using this approach, he remarked on how different it was than what he was used to when doing math. Scott’s approach, instead, was to just start proving all of the things he could about the system until he managed to prove that he had a solution. Thus, while I was working backwards by coming up with possible solutions, Scott was working forwards by expanding the scope of what he knew until he found the solution.
(I don’t think it quite communicates my approach correctly, but I don’t know how to do better.)
A consequence of the type signature of Agent Foundations is that my answer to “What are the other major chunks of the larger problem?” is “That is what I am trying to figure out.”
So if we view an epistemic subsystem as a superintelligent agent who has control over the map and has the goal of making the map match the territory, one extreme failure mode is that it takes a hit to short-term accuracy by slightly modifying the map in such a way as to trick the things looking at the map into giving the epistemic subsystem more control. Then, once it has more control, it can use it to manipulate the territory to make the territory more predictable. If your goal is to minimize surprise, you should destroy all the surprising things.
Note that we would not make an epistemic system this way. A more realistic model of the goal of an epistemic system we would build is “make the map match the territory better than any other map in a given class,” or even “make the map match the territory better than any small modification to the map.” But a large point of the section is that if you search for strategies that “make the map match the territory better than any other map in a given class,” at small scales, this is the same as “make the map match the territory.” So you might find “make the map match the territory” optimizers, and then go wrong in the way above.
I think all this is pretty unrealistic, and I expect you are much more likely to go off in a random direction than something that looks like a specific subsystem the programmers put in gets too much power and stably optimizes for what the programmers said. We would need to understand a lot more before we would even hit the failure mode of making a system where the epistemic subsystem was agentically optimizing what it was supposed to be optimizing.
Some last minute emphasis:
We kind of open with how agents have to grow and learn and be stable, but talk most of the time about this two agent problem, where there is an initial agent and a successor agent. When thinking about it as the succession problem, it seems like a bit of a stretch as a fundamental part of agency. The first two sections were about how agents have to make decisions and have models, and choosing a successor does not seem like as much of a fundamental part of agency. However, when you think of it as an agent having to stably continue to optimize over time, it seems a lot more fundamental.
So, I want to emphasize that when we say there are multiple forms of the problem, like choosing successors or learning/growing over time, the view in which these are different at all is a dualistic view. To an embedded agent, the future self is not privileged, it is just another part of the environment, so there is no difference between making a successor and preserving your own goals.
It feels very different to humans. This is because it is much easier for us to change ourselves over time than it is to make a clone of ourselves and change the clone, but that difference is not fundamental.
But how do you avoid proving with certainty that p=1/2?
Since your proposal does not say what to do if you find inconsistent proofs that the linear function is two different things, I will assume that if it finds multiple different proofs, it defaults to 5 for the following.
Here is another example:
You are in a 5 and 10 problem. You have a twin that is also in a 5 and 10 problem. You have exactly the same source code. There is a consistency checker, and if you and your twin do different things, you both get 0 utility.
You can prove that you and your twin do the same thing. Thus you can prove that the function is 5+5p. You can also prove that your twin takes 5 by Löb’s theorem. (You can also prove that you take 5 by Löb’s theorem, but you ignore that proof, since “there is always a chance.”) Thus, you can prove that the function is 5-5p. Your system doesn’t know what to do with two functions, so it defaults to 5. (If it is provable that you both take 5, you both take 5, completing the proof by Löb.)
I am doing the same thing as before, but because I put it outside of the agent, it does not get flagged with the “there is always a chance” module. This is trying to illustrate that your proposal takes advantage of a separation between the agent and the environment that was snuck in, and could be done incorrectly.
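The two “proved” functions can be written out as a rough sketch. Here p is the probability of taking 10, and `choose` is a hypothetical stand-in for the proposal’s decision rule: the default to 5 on disagreement is the assumption I stated above, and the “take whichever endpoint is better” step is my guess at the rest, not something the proposal specifies.

```python
def u_if_twin_matches(p):
    # Premise "my twin does whatever I do": 5 with probability 1-p, 10 with probability p.
    return (1 - p) * 5 + p * 10  # = 5 + 5p


def u_if_twin_takes_5(p):
    # Premise "my twin takes 5": taking 10 trips the consistency checker and pays 0.
    return (1 - p) * 5 + p * 0  # = 5 - 5p


def choose(proved, default=5):
    """Hypothetical rule: act on the proved function if it is unique, else default."""
    # An affine function on [0, 1] is pinned down by its values at p=0 and p=1.
    endpoints = {(u(0), u(1)) for u in proved}
    if len(endpoints) != 1:
        return default  # the proofs disagree (or there are none): fall back to 5
    u_at_0, u_at_1 = next(iter(endpoints))
    return 10 if u_at_1 > u_at_0 else 5


print(choose([u_if_twin_matches]))                     # 10
print(choose([u_if_twin_matches, u_if_twin_takes_5]))  # 5
```

The two functions agree at p=0 and disagree only at p=1, so taking 10 is the action that would exhibit the inconsistency.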
Two possible fixes:
1) You could say that the agent, instead of taking 5 when finding inconsistency, takes some action that exhibits the inconsistency (something on which the two functions give different values). This is very similar to the chicken rule, and if you add something like this, you don’t really need the rest of your system. Take an agent that, whenever it proves it does something, does something else. This agent will prove (given enough time) that if it takes 5 it gets 5, and if it takes 10 it gets 10.
2) I had one proof system, and just ignored the proofs that I found that I did a thing. I could instead give the agent a special proof system that is incapable of proving what it does, but how do you do that? Chicken rule seems like the place to start.
One problem with the chicken rule is that it was developed in a system that was deductively closed, so you can’t prove something that passes through a proof of P without proving P. If you violate this, by having a random theorem prover, you might have a system that fails to prove “I take 5” but proves “I take 5 and 1+1=2” and uses this to complete the Löb loop.
Sure. How do you do that?
My point was that I don’t know where to assume the linearity is. Whenever I have private randomness, I have linearity over what I end up choosing with that randomness, but not linearity over what probability I choose. But I think this is not getting at the disagreement, so I pivot to:
In your model, what does it mean to prove that U is some linear (affine) function? If I prove that my probability p is 1/2 and that U=7.5, have I proven that U is the constant function 7.5? If there is only one value of p, it is not defined what the utility function is, unless I successfully carve the universe in such a way as to let me replace the action with various things and see what happens. (Or, assuming linearity, replace the probability with enough linearly independent things (in this case 2) to define the function.)
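As a small illustration of why one value of p is not enough, here is a sketch with a hypothetical helper `affine_from` that recovers U(p) = a + b·p from observed (p, U) pairs. The three candidate functions are just examples I picked that all pass through (1/2, 7.5).

```python
def affine_from(observations):
    """Return (a, b) with U(p) = a + b*p, or None if the data cannot pin it down."""
    by_p = {}
    for p, u in observations:
        by_p.setdefault(p, u)  # keep one utility value per distinct p
    if len(by_p) < 2:
        return None  # a single value of p leaves the slope undetermined
    (p0, u0), (p1, u1) = sorted(by_p.items())[:2]
    b = (u1 - u0) / (p1 - p0)
    a = u0 - b * p0
    return a, b


# All of these are affine and all give U = 7.5 at p = 1/2.
candidates = [lambda p: 7.5, lambda p: 5 + 5 * p, lambda p: 10 - 5 * p]
print([f(0.5) for f in candidates])

print(affine_from([(0.5, 7.5)]))               # None
print(affine_from([(0.0, 5.0), (1.0, 10.0)]))  # (5.0, 5.0)
```

Two linearly independent inputs fix the function, but only if you have some way to actually evaluate U at a p you do not choose.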
Yeah, so it’s like you have this private data, which is an infinite sequence of bits, and if you see all 0’s you take an exploration action. I think that by giving the agent these private bits and promising that the bits do not change the rest of the world, you are essentially giving the agent access to a causal counterfactual that you constructed. You don’t even have to mix with what the agent actually does, you can explore with every action and ask if it is better to explore and take 5 or explore and take 10. By doing this, you are essentially giving the agent access to a causal counterfactual, because conditioning on these infinitesimals is basically like coming in and changing what the agent does. I think giving the agent a true source of randomness actually does let you implement CDT.
If the environment learns from the other possible worlds, it might punish or reward you in one world for stuff that you do in the other world, so you can’t just ask which world is best to figure out what to do.
I agree that that is how you want to think about the matching pennies problem. However the point is that your proposed solution assumed linearity. It didn’t empirically observe linearity. You have to be able to tell the difference between the situations in order to know not to assume linearity in the matching pennies problem. The method for telling the difference is how you determine whether or not and in what ways you have logical control over Omega’s prediction of you.
A conversation that just went down in my head:
Me: “You observe that a bunch of attempts to write down what we want get Goodharted, and so you suggest writing down what we want using data. This seems like it will have all the same problems.”
Straw You: “The reason you fail is because you can’t specify what we really want, because value is complex. Trying to write down human values is qualitatively different from trying to write down human values using a pointer to all the data that happened in the past. That pointer cheats the argument from complexity, since it lets us fit lots of data into a simple instruction.”
Me: “But the instruction is not simple! Pointing at what the “human” is is hard. Dealing with the fact that the human is inconsistent with itself gives more degrees of freedom. If you just look at the human actions, and don’t look inside the brain, there are many, many goals consistent with the actions you see. If you do look inside the brain, you need to know how to interpret that data. None of these are objective facts about the universe that you can just learn. You have to specify them, or specify a way to specify them, and when you do that, you do it wrong and you get Goodharted.”
So, your suggestion is not just an inconsequential grain of uncertainty, it is a grain of exploration. The agent actually does take 10 with some small probability. If you try to do this with just uncertainty, things would be worse, since that uncertainty would not be justified.
One problem is that you actually do explore a bunch, and since you don’t get a reset button, you will sometimes explore into irreversible actions, like shutting yourself off. However, if the agent has a source of randomness, and also the ability to simulate worlds in which that randomness went another way, you can have an agent that with probability 1−ε does not explore ever, and learns from the other worlds in which it does explore. So, you can either explore forever, and shut yourself off, or you can explore very very rarely and learn from other possible worlds.
The problem with learning from other possible worlds is that to get good results out of it, you have to assume that the environment does not also learn from other possible worlds, which is not very embedded.
But you are suggesting actually exploring a bunch, and there is a problem other than just shutting yourself off. You are getting past this problem in this case by only allowing linear functions, but that is not an accurate assumption. Let’s say you are playing matching pennies with Omega, who has the ability to predict what probability you will pick but not what action you will pick.
(In matching pennies, you each choose H or T, you win if they match, they win if they don’t.)
Omega will pick H if your probability of H is less than 1/2 and T otherwise. Your utility as a function of probability is piecewise linear with two parts. Trying to assume that it will be linear will make things messy.
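A minimal sketch of this game, taking “utility” to be your probability of winning (my simplification) and using hypothetical function names:

```python
def omega_move(p_heads):
    # Omega sees your probability of H, not your realized action.
    return "H" if p_heads < 0.5 else "T"


def win_probability(p_heads):
    # You win when your coin matches Omega's.
    if omega_move(p_heads) == "H":
        return p_heads        # you match by playing H
    return 1.0 - p_heads      # you match by playing T


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, win_probability(p))

# A linear function would take, at the midpoint of 0.25 and 0.75, the average
# of its values at those two points. The actual value at 0.5 is different.
linear_guess = (win_probability(0.25) + win_probability(0.75)) / 2
print(linear_guess, win_probability(0.5))
```

Fitting a line through the two pure actions (p=0 and p=1) gives a constant 0, which misses that mixing near 1/2 does much better.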
There is this problem where sometimes the outcome of exploring into taking 10, and the outcome of actually taking 10 because it is good are different. More on this here.