I came here to say something pretty similar to what Duncan said, but I had a different focus in mind.
It seems like it’s easier for organizations to coordinate around PR than it is for them to coordinate around honor. People can have really deep intractable, or maybe even fundamental and faultless, disagreements about what is honorable, because what is honorable is a function of what normative principles you endorse. It’s much easier to resolve disagreements about what counts as good PR. You could probably settle most disagreements about what counts as good PR using polls.
Maybe for this reason we should expect being into PR to be a relatively stable property of organizations, while being into honor is a fragile and precious thing for an organization.
This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.
There are a few black boxes in the theory. The first takes you and returns your true utility function, whatever that is. Maybe it’s just the utility function you endorse, and that’s up to you. The other black box is the space of programs that you could be. Maybe it’s limited by memory, maybe it’s limited by run time, or maybe it’s any finite state machine with less than 10^20 states, maybe it’s python programs less than 5000 characters long, some limited set of programs that takes your sensory data and motor output history as input, and returns a motor output. The limitations could be whatever, don’t have to be like this.
Then you take one of these ideal rational agents with your true utility function and the right prior, and you give them the decision problem of designing your policy, but they can only use policies that are in the limited space of bounded programs you could be. Their expected utility assignments over that space of programs is then our measure of the rationality of a bounded agent. You could also give the ideal agent access to your data and see how that changes their ranking, if it does. If you can change yourself such that the program you become is assigned higher expected utility by the agent, then that is an improvement.
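A toy sketch of the ranking step might make this more concrete. Everything in it is a made-up stand-in: the two worlds, the prior, the utility function, and the "bounded" policy class (policies that ignore their input) are hypothetical, chosen only so the expected utilities can be checked by hand.

```python
# Hypothetical toy setup: two possible worlds with a prior over them.
PRIOR = {"w0": 0.75, "w1": 0.25}
# What the bounded agent observes in each world (assumed for illustration).
OBSERVATION = {"w0": "a", "w1": "b"}
ACTIONS = ("left", "right")


def utility(world, action):
    """Stand-in for the 'true utility' black box: 1 if the action fits the world."""
    return 1.0 if (world, action) in {("w0", "left"), ("w1", "right")} else 0.0


def constant_policies():
    """Stand-in for the limited program space: policies that ignore their input."""
    return {f"always_{a}": (lambda obs, a=a: a) for a in ACTIONS}


def expected_utility(policy):
    """The ideal agent's score for a policy: prior-weighted utility across worlds."""
    return sum(p * utility(w, policy(OBSERVATION[w])) for w, p in PRIOR.items())


# Rank every program in the bounded space by expected utility.
scores = {name: expected_utility(pol) for name, pol in constant_policies().items()}
best = max(scores, key=scores.get)
```

In this toy, changing yourself from `always_right` to `always_left` counts as an improvement because the ideal agent assigns the second program higher expected utility.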
I don’t think we should be surprised that any reasonable utility function is uncomputable. Consider a set of worlds with utopias that last only as long as a Turing machine in the world does not halt and are otherwise identical. There is one such world for each Turing machine. All of these worlds are possible. No computable utility function can assign higher utility to every world with a never halting Turing machine.
I do think this is an important concept to explain our conception of goal-directedness, but I don’t think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).
This definition is also supposed to explain why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as the optimization power of your best model of that system as an optimizer increases.
Here is an idea for a disagreement resolution technique. I think this will work best:
* with one other partner you disagree with.
* when the beliefs you disagree about are clearly about what the world is like.
* when the beliefs you disagree about are mutually exclusive.
* when everybody genuinely wants to figure out what is going on.
Probably doesn’t really require all of those though.
The first step is that you both write out your beliefs on a shared work space. This can be a notebook or a whiteboard or anything like that. Then you each write down your credences next to each of the statements on the work space.
Now, when you want to make a new argument or present a new piece of evidence, you should ask your partner if they have heard it before after you present it. Maybe you should ask them questions about it beforehand to verify that they have not. If they have not heard it before, or had not considered it, you give it a name and write it down between the two propositions. Now you ask your partner how much they changed their credence as a result of the new argument. They write down their new credences below the ones they previously wrote down, and write down the changes next to the argument that just got added to the board.
When your partner presents a new argument or piece of evidence, be honest about whether you have heard it before. If you have not, it should change your credence some. How much do you think? Write down your new credence. I don’t think you should worry too much about being a consistent Bayesian here or anything like that. Just move your credence a bit for each argument or piece of evidence you have not heard or considered, and move it more for better arguments or stronger evidence. You don’t have to commit to the last credence you write down, but you should think at least that the relative sizes of all of the changes were about right.
I think this is the core of the technique. I would love to try this. I think it would be interesting because it would focus the conversation and give players a record of how much their minds changed, and why. I also think this might make it harder to just forget the conversation and move back to your previous credence by default afterwards.
You could also iterate it. If you do not think that your partner changed their mind enough as a result of a new argument, get a new workspace and write down how much you think they should have changed their credence. They do the same. Now you can both make arguments relevant to that, and incrementally change your estimate of how much they should have changed their mind, and you both have a record of the changes.
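If you wanted to keep the record digitally instead of on a whiteboard, the bookkeeping could look something like the sketch below. The class name `CredenceBoard` and its methods are hypothetical, invented here for illustration; the only thing it does is store a starting credence, log each named argument with the credence you moved to, and report the size of each change.

```python
from dataclasses import dataclass, field


@dataclass
class CredenceBoard:
    """One person's credence in one proposition, plus a log of named arguments."""
    proposition: str
    initial_credence: float
    # Each entry is (argument name, credence after hearing that argument).
    history: list = field(default_factory=list)

    def current(self) -> float:
        """Most recently written credence, or the initial one if nothing is logged."""
        return self.history[-1][1] if self.history else self.initial_credence

    def record(self, argument: str, new_credence: float) -> float:
        """Log a new argument and return how far the credence moved."""
        if not 0.0 <= new_credence <= 1.0:
            raise ValueError("credence must be between 0 and 1")
        delta = new_credence - self.current()
        self.history.append((argument, new_credence))
        return delta

    def changes(self) -> list:
        """Return (argument, change) pairs in the order the arguments were made."""
        out, previous = [], self.initial_credence
        for name, credence in self.history:
            out.append((name, credence - previous))
            previous = credence
        return out


# Hypothetical usage: the proposition and argument names are placeholders.
board = CredenceBoard("proposition A", initial_credence=0.8)
board.record("argument 1", 0.7)
board.record("argument 2", 0.5)
```

For the iterated version, you could open a second board whose proposition is "how much my partner should have moved on argument 1" and log changes there the same way.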
If you come up with a test or set of tests that it would be impossible to actually run in practice, but that we could do in principle if money and ethics were no object, I would still be interested in hearing those. After talking to one of my friends who is enthusiastic about chakras for just a little bit, I would not be surprised if we in fact make fairly similar predictions about the results of such tests.
Sometimes I sort of feel like a grumpy old man that read the sequences back in the good old fashioned year of 2010. When I am in that mood I will sometimes look around at how memes spread throughout the community and say things like “this is not the rationality I grew up with”. I really do not want to stir things up with this post, but I guess I do want to be empathetic to this part of me and I want to see what others think about the perspective.
One relatively small reason I feel this way is that a lot of really smart rationalists, who are my friends or who I deeply respect or both, seem to have gotten really into chakras, and maybe some other woo stuff. I want to better understand these folks. I’ll admit now that I have weird biased attitudes towards woo stuff in general, but I am going to use chakras as a specific example here.
One of the sacred values of rationality that I care a lot about is that one should not discount hypotheses/perspectives because they are low status, woo, or otherwise weird.
Another is that one’s beliefs should pay rent.
To be clear, I am worried that we might be failing on the second sacred value. I am not saying that we should abandon the first one as I think some people may have suggested in the past. I actually think that rationalists getting into chakras is strong evidence that we are doing great on the first sacred value.
Maybe we are not failing on the second sacred value. I want to know whether we are or not, so I want to ask rationalists who think a lot or talk enthusiastically about chakras a question:
Do chakras exist?
If you answer “yes”, how do you know they exist?
I’ve thought a bit about how someone might answer the second question if they answer “yes” to the first question without violating the second sacred value. I’ve thought of basically two ways that seems possible, but there are probably others.
One way might be that you just think that chakras literally exist in the same ways that planes literally exist, or in the way that waves literally exist. Chakras are just some phenomena that are made out of some stuff like everything else. If that is the case, then it seems like we should be able to at least in principle point to some sort of test that we could run to convince me that they do exist, or you that they do not. I would definitely be interested in hearing proposals for such tests!
Another way might be that you think chakras do not literally exist like planes do, but you can make a predictive profit by pretending that they do exist. This is sort of like how I do not expect that if I could read and understand the source code for a human mind, that there would be some parts of the code that I could point to and call the utility and probability functions. Nonetheless, I think it makes sense to model humans as optimization processes with some utility function and some probability function, because modeling them that way allows me to compress my predictions about their future behavior. Of course, I would get better predictions if I could model them as mechanical objects, but doing so is just too computationally expensive for me. Maybe modeling people as having chakras, including yourself, works sort of the same way. You use some of your evidence to infer the state of their chakras, and then use that model to make testable predictions about their future behavior. In other words, you might think that chakras are real patterns. Again it seems to me that in this case we should at least in principle be able to come up with tests that would convince me that chakras exist, or you that they do not, and I would love to hear any such proposals.
Maybe you think they exist in some other sense, and then I would definitely like to hear about that.
Maybe you do not think they exist in any way, or make any predictions of any kind, and in that case, I guess I am not sure how continuing to be enthusiastic about thinking about chakras or talking about chakras is supposed to jibe with the sacred principle that one’s beliefs should pay rent.
I guess it’s worth mentioning that I do not feel as averse to Duncan’s color wheel thing, maybe because it’s not coded as “woo” to my mind. But I still think it would be fair to ask about that taxonomy exactly how we think that it cuts the universe at its joints. Asking that question still seems to me like it should reduce to figuring out what sorts of predictions to make if it in fact does, and then figuring out ways to test them.
I would really love to have several cooperative conversations about this with people who are excited about chakras, or other similar woo things, either within this framework of finding out what sorts of tests we could run to get rid of our uncertainty, or questioning the framework I propose altogether.
Here is an idea I just thought of in an Uber ride for how to narrow down the space of languages it would be reasonable to use for universal induction. To express the k-complexity of an object $O$ relative to a programming language $L$ I will write:
$$K_L(O)$$
Suppose we have two programming languages. The first is Python. The second is Qython, which is a lot like Python, except that it interprets the string “A” as a program that outputs some particular algorithmically large, random-looking character string $S$ with $K_{\text{Python}}(S) \approx 10^{15}$. I claim that intuitively, Python is a better language to use for measuring the complexity of a hypothesis than Qython. That’s the notion that I just thought of a way to formally express.
There is a well known theorem that if you are using $L_1$ to measure the complexity of objects, and I am using $L_2$ to measure the complexity of objects, then there is a constant $c_2$ such that for any object $O$:
$$K_{L_1}(O) \le K_{L_2}(O) + c_2$$
In words, this means that you might think that some objects are less complicated than I do, and you might think that some objects are more complicated than I do, but you won’t think that any object is more than $c_2$ complexity units more complicated than I do. Intuitively, $c_2$ is just the length of the shortest program in $L_1$ that is a compiler for $L_2$. So worst case scenario, the shortest program in $L_1$ that outputs $O$ will be a compiler for $L_2$ written in $L_1$ (which is $c_2$ characters long) plus giving that compiler the program in $L_2$ that outputs $O$ (which would be $K_{L_2}(O)$ characters long).
I am going to define the k-complexity of a function $f: X \to Y$ relative to a programming language as the length of the shortest program in that language such that when it is given $x$ as an input, it returns $f(x)$. This is probably already defined that way, but just in case. So say we have a function from programs in $L_2$ to their outputs and we call that function $C_2$, then:
$$K_{L_1}(C_2) = c_2$$
There is also another constant:
$$K_{L_2}(C_1) = c_1$$
The first is the length of the shortest compiler for $L_2$ written in $L_1$, and the second is the length of the shortest compiler for $L_1$ written in $L_2$. Notice that these do not need to be equal. For instance, I claim that the compiler for Qython written in Python is roughly $10^{15}$ characters long, since we have to write the program that outputs $S$ in Python, which by hypothesis was about $10^{15}$ characters long, and then a bit more to get it to run that program when it reads “A”, and to get that functionality to play nicely with the rest of Qython however that works out. By contrast, it shouldn’t take very long to write a compiler for Python in Qython. Since Qython basically is Python, it might not take any characters, but if there are weird rules in Qython for how the string “A” is interpreted when it appears in an otherwise Python-like program, then it still shouldn’t take any more characters than it takes to write a Python interpreter in regular Python.
So this is my proposed method for determining which of two programming languages it would be better to use for universal induction. Say again that we are choosing between $L_1$ and $L_2$. We find the pair of constants such that $K_{L_1}(C_2) = c_2$ and $K_{L_2}(C_1) = c_1$, and then compare their sizes. If $c_1$ is less than $c_2$ this means that it is easier to write a compiler for $L_1$ in $L_2$ than vice versa, and so there is more hidden complexity in $L_2$’s encodings than in $L_1$’s, and so we should use $L_1$ instead of $L_2$ for assessing the complexity of hypotheses.
Let’s say that if $K_{L_2}(C_1) < K_{L_1}(C_2)$ then $L_2$ hides more complexity than $L_1$.
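As a rough worked version of the comparison, suppose the bound from the theorem holds in both directions with the two compiler lengths as the constants. This is an idealization that ignores small overheads, and I am assuming it rather than proving it:

$$K_{L_1}(O) \le K_{L_2}(O) + c_2, \qquad K_{L_2}(O) \le K_{L_1}(O) + c_1$$

$$\Longrightarrow \quad -c_2 \;\le\; K_{L_2}(O) - K_{L_1}(O) \;\le\; c_1$$

With $L_1$ = Python and $L_2$ = Qython, the argument above suggests $c_2 \approx 10^{15}$ while $c_1$ is small, so $c_1 < c_2$ and Qython is the one that hides more complexity. The string $S$ shows how lopsided the disagreement can get: Qython needs at most one character for it (the program “A”), while $K_{\text{Python}}(S) \approx 10^{15}$, so the difference $K_{\text{Qython}}(S) - K_{\text{Python}}(S)$ sits near the $-c_2$ end of the interval.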
A few complications:
It is probably not always decidable whether the smallest compiler for L1 written in L2 is smaller than the smallest compiler for L2 written in L1, but this at least in principle gives us some way to specify what we mean by one language hiding more complexity than another, and it seems like at least in the case of Python vs. Qython, we can make a pretty good argument that the smallest compiler for Python written in Qython is smaller than the smallest compiler for Qython written in Python.
It is possible (I’d say probable) that if we started with some group of candidate languages and looked for languages that hide less complexity, we might run into a circle. Like the smallest compiler for $L_1$ in $L_2$ might be the same size as the smallest compiler for $L_2$ in $L_1$ but there might still be an infinite set of objects $O_i$ such that:
$$K_{L_1}(O_i) \neq K_{L_2}(O_i)$$
In this case, the two languages would disagree about the complexity of an infinite set of objects, but at least they would disagree about it by no more than the same fixed constant in both directions. Idk, seems like probably we could do something clever there, like take the average or something, idk. If we introduce an L3 and the smallest compiler for L3 in L1 is larger than it is in L2, then it seems like we should pick L1.
If there is an infinite set of languages that all stand in this relationship to each other, ie, all of the languages in an infinite set disagree about the complexity of an infinite set of objects and hide less complexity than any language not in the set, then idk, seems pretty damning for this approach, but at least we narrowed down the search space a bit?
Even if it turns out that we end up in a situation where we have an infinite set of languages that disagree about an infinite set of objects by exactly the same constant, it might be nice to have some upper bound on what that constant is.
In any case, this seems like something somebody would have thought of, and then proved the relevant theorems addressing all of the complications I raised. Ever seen something like this before? I think a friend might have suggested a paper that tried some similar method, and concluded that it wasn’t a feasible strategy, but I don’t remember exactly, and it might have been a totally different thing.
When I started writing this comment I was confused. Then I got myself fairly less confused I think. I am going to say a bunch of things to explain my confusion, how I tried to get less confused, and then I will ask a couple questions. This comment got really long, and I may decide that it should be a post instead.
Take a system X with 8 possible states. Imagine X is like a simplified Rubik’s cube type puzzle. (Thinking about mechanical Rubik’s cube solvers is how I originally got confused, but using actual Rubik’s cubes to explain would make the math harder.) Suppose I want to measure the optimization power of two different optimizers that optimize X, and share the following preference ordering:
$$x_1 \prec x_2 \prec x_3 \prec x_4 \prec x_5 \prec x_6 \prec x_7 \prec x_8$$
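When I let optimizer1 operate on X, optimizer1 always leaves X=x8. So, taking the 8 states to be equally likely in the absence of optimization, on the first time I give optimizer1 X I get:
$$\mathrm{OP} = \log_2 \frac{8}{1} = 3 \text{ bits}$$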
If I give X to optimizer1 a second time I get:
$$\mathrm{OP} = \log_2 \frac{8}{1} + \log_2 \frac{8}{1} = 6 \text{ bits}$$
This seems a bit weird to me. If we are imagining a mechanical robot with a camera that solves a Rubik’s cube like puzzle, it seems weird to say that the solver gets stronger if I let it operate on the puzzle twice. I guess this would make sense for a measure of optimization pressure exerted instead of a measure of the power of the system, but that doesn’t seem to be what the post was going for exactly. I guess we could fix this by dividing by the number of times we give optimizer1 X, and then we would get 3 no matter how many times we let optimizer1 operate on X. This would avoid the weird result that a mechanical puzzle solver gets more powerful the more times we let it operate on the puzzle.
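Here is a small sketch of that arithmetic. It assumes, as the numbers above seem to, that the 8 states are equally likely in the absence of optimization and that x8 is the most preferred state; the function name `optimization_power` is just a label I made up for the log-ratio.

```python
import math

N_STATES = 8  # assumption: uniform measure over 8 states absent optimization


def optimization_power(rank, n_states=N_STATES):
    """Bits of optimization for landing on the state x_rank (x8 is best).

    The number of states at least as good as x_rank is n_states - rank + 1.
    """
    at_least_as_good = n_states - rank + 1
    return math.log2(n_states / at_least_as_good)


# optimizer1 leaves X = x8 both times it is given the puzzle.
runs = [8, 8]
total_bits = sum(optimization_power(r) for r in runs)  # adds up across runs
per_run_bits = total_bits / len(runs)                  # stays fixed per run
```

The total grows with the number of runs (3 bits, then 6), while the per-run figure stays at 3 however many times the solver is handed the puzzle.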
Say that when I let optimizer2 operate on X, it leaves X=x8 with probability p, and leaves X=x7 with probability 1−p, but I do not know p. If I let optimizer2 operate on X one time, and I observe X=x7, I get:
$$\mathrm{OP} = \log_2 \frac{8}{2} = 2 \text{ bits}$$
If I let optimizer2 operate on X three times, and I observe X1=x7, X2=x7, X3=x8, then I get:
$$\mathrm{OP} = \log_2 \frac{8}{2} + \log_2 \frac{8}{2} + \log_2 \frac{8}{1} = 7 \text{ bits}$$
Now we could use the same trick we used before and divide by the number of instances on which optimizer2 was allowed to exert optimization pressure, and this would give us 7⁄3. The thing is though that we do not know p and it seems like p is relevant to how strong optimizer2 is. We can estimate p to be 2⁄5 using Laplace’s rule, but it might be that the long run frequency of times that optimizer2 leaves X=x8 is actually .9999 and we just got unlucky. (I’m not a frequentist, long run frequency just seemed like the closest concept. Feel free to replace “long run frequency” with the prob a solomonoff bot using the correct language assigns at the limit, or anything else reasonable.) If the long run frequency is in fact that large, then it seems like we are underestimating the power of optimizer2 just because we got a bad sample of its performance. The higher p is the more we are underestimating optimizer2 when we measure its power from these observations.
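To check the numbers for optimizer2, here is a sketch under the same assumptions as before (8 equally likely states absent optimization, x8 best). The helper names are hypothetical, and `laplace_estimate` is just the usual rule of succession, (successes + 1) / (trials + 2), applied to the runs that ended in x8.

```python
import math
from fractions import Fraction


def optimization_power(rank, n_states=8):
    """Bits of optimization for landing on x_rank under a uniform measure."""
    return math.log2(n_states / (n_states - rank + 1))


def laplace_estimate(successes, trials):
    """Rule of succession: estimated probability of success on the next trial."""
    return Fraction(successes + 1, trials + 2)


# Observed final states from three runs of optimizer2: x7, x7, x8.
observed = [7, 7, 8]
total_bits = sum(optimization_power(r) for r in observed)
per_run_bits = total_bits / len(observed)

# Estimated chance that the next run ends in x8, given one x8 in three runs.
est_x8 = laplace_estimate(observed.count(8), len(observed))
```

The per-run average comes out to 7/3 bits and the estimate for ending in x8 is 2/5, which is what makes a true long run frequency of .9999 look like bad luck in this sample.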
So it seems then like there is another thing that we need to know besides the preference ordering of an optimizer, the measure over the target system in the absence of optimization, and the observed state of the target system, in order to perfectly measure the optimization power of an optimizer. In this case, it seems like we need to know p. This is a pretty easy fix: we can just take the expectation of the optimization power as originally defined with respect to the probability of observing that state when the optimizer is present. But it does seem more complicated, and it is different.
With $o$ being the observed outcome, $U$ being the utility function of the optimization process, and $P$ being the distribution over outcomes in the absence of optimization, I took the definition in the original post to be:
$$\mathrm{OP}(o) = \log_2 \frac{1}{P(\{x : U(x) \ge U(o)\})}$$
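The definition I am proposing instead, writing $P_{\text{opt}}$ for the distribution over outcomes in the presence of optimization, is:
$$\mathrm{OP}_{\text{expected}} = \sum_{o} P_{\text{opt}}(o) \, \log_2 \frac{1}{P(\{x : U(x) \ge U(o)\})}$$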
That is, you take the expectation of the original measure with respect to the distribution over outcomes you expect to observe in the presence of optimization. We could then call the original measure “optimization pressure exerted”, and the second measure optimization power. For systems that are only allowed to optimize once, like humans, these values are very similar; for systems that might exert their full optimization power on several occasions depending on circumstance, like Rubik’s cube solvers, these values will be different insofar as the system is allowed to optimize several times. We can think of the first measure as measuring the actual amount of optimization pressure that was exerted on the target system on a particular instance, and we can think of the second measure as the expected amount of optimization pressure that the optimizer exerts on the target system.
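A minimal sketch of the second measure for the 8-state puzzle, again assuming a uniform measure absent optimization and x8 as the best state. The function `expected_optimization_power` and the example distributions are illustrative assumptions, not anything from the original post.

```python
import math


def optimization_power(rank, n_states=8):
    """Original measure: bits for landing on x_rank under a uniform measure."""
    return math.log2(n_states / (n_states - rank + 1))


def expected_optimization_power(outcome_probs, n_states=8):
    """Expectation of the original measure over outcomes with the optimizer present.

    outcome_probs maps a state's rank to its probability in the presence
    of optimization; the probabilities must sum to 1.
    """
    if abs(sum(outcome_probs.values()) - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(q * optimization_power(rank, n_states)
               for rank, q in outcome_probs.items())


# optimizer1 always ends in x8, so the two measures agree on every run.
power_opt1 = expected_optimization_power({8: 1.0})

# A hypothetical optimizer2 that ends in x8 with long run frequency .9999.
power_opt2 = expected_optimization_power({7: 0.0001, 8: 0.9999})

# Per-run average from the unlucky sample x7, x7, x8.
sample_average = 7 / 3
```

Under these assumptions the expected measure for optimizer2 is just under 3 bits, noticeably more than the 7/3 bits per run suggested by the sample.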
To hammer the point home, there is the amount of optimization pressure that I in fact exerted on the universe this time around. Say it was a trillion bits. Then there is the expected amount of optimization pressure that I exert on the universe in a given life. Maybe I just got lucky (or unlucky) on this go around. It could be that if you reran the universe from the point at which I was born several times while varying some things that seem irrelevant, I would on average only increase the negentropy of variables I care about by a million bits. If that were the case, then using the amount of optimization pressure that I exerted on this go around as an estimate of my optimization power in general would be a huge overestimate.
Ok, so what’s up here? This seems like an easy thing to notice, and I’m sure Eliezer noticed it.
Eliezer talks about how from the perspective of Deep Blue, it is exerting optimization pressure every time it plays a game, but from the perspective of the programmers, creating Deep Blue was a one time optimization cost. Is that a different way to cash out the same thing? It still seems weird to me to say that the more times Deep Blue plays chess, the higher its optimization power is. It does not seem weird to me to say that the more times a human plays chess, the higher their optimization power is. Each chess game is a subsystem of the target system of that human, e.g., the environment over time. Whereas it does seem weird to me to say that if you uploaded my brain and let my brain operate on the same universe 100 times, the optimization power of my uploaded brain would be 100 times greater than if you only did this once.
This is a consequence of one of the nice properties of Eliezer’s measure: OP sums for independent systems. It makes sense that if I think an optimizer is optimizing two independent systems, then when I measure their OP with respect to the first system and add it to their OP with respect to the second, I should get the same answer I would if I were treating the two systems jointly as one system. The Rubik’s cube the first time I give it to a mechanical Rubik’s cube solver, and the second time I give it to a mechanical Rubik’s cube solver, are in fact two such independent systems. So are the first time you simulate the universe after my birth and the second time. It makes sense to me that my optimization power for independent parts of the universe in a particular go around should sum to my optimization power with respect to the two systems taken jointly as one, but it doesn’t make sense to me that you should just add the optimization pressure I exert on each go to get my total optimization power. Does the measure I propose here actually sum nicely with respect to independent systems? It seems like it might, but I’m not sure.
Is this just the same as Eliezer’s proposal for measuring optimization power for mixed outcomes? Seems pretty different, but maybe it isn’t. Maybe this is another way to extend optimization power to mixed outcomes? It does take into account that the agent might not take an action that guarantees an outcome with certainty.
Is there some way that I am confused or missing something in the original post that it seems like I am not aware of?
Is there a particular formula for negentropy that OP has in mind? I am not seeing how the log of the inverse of the probability of observing an outcome as good or better than the one observed can be interpreted as the negentropy of a system with respect to that preference ordering.
Edit: Actually, I think I figured it out, but I would still be interested in hearing what other people think.