The agent could be programmed to have a certain hard-coded ontology rather than searching through all possible hypotheses weighted by description length.
I haven’t heard the term “platonic goals” before. There’s been plenty written on capability control, but I don’t know of any previous writing on the strategy I described in this post (although it’s entirely possible that such writing exists and I’m just not aware of it).
Are you worried about leaks from the abstract computational process into the real world, leaks from the real world into the abstract computational process, or both? (Or maybe neither and I’m misunderstanding your concern?)
There will definitely be tons of leaks from the abstract computational process into the real world; just looking at the result is already such a leak. The point is that the AI should have no incentive to optimize such leaks, not that the leaks don’t exist, so the existence of additional leaks that we didn’t know about shouldn’t be concerning.
Leaks from the outside world into the computational abstraction would be more concerning, since the whole point is to prevent those from existing. It seems like it should be possible to make hardware arbitrarily reliable by devoting enough resources to error detection and correction, which would prevent such leaks, though I’m not an expert, so it would be good to know if this is wrong. There may be other ways to get the AI to act similarly to the way it would in the idealized toy world even when hardware errors create small differences. This is certainly the sort of thing we would want to take seriously if hardware can’t be made arbitrarily reliable.
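As a toy illustration of why I expect error correction to help (a minimal sketch with made-up numbers, not a claim about real hardware engineering): if each redundant copy of a computation fails independently with probability p, then majority-voting over n copies drives the combined failure probability down exponentially in n.

```python
from math import comb

def majority_failure_prob(p: float, n: int) -> float:
    """Probability that a strict majority of n independent copies fail,
    assuming each copy fails independently with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

# With a per-copy error rate of one in a million, a handful of redundant
# copies already makes the voted result astronomically more reliable.
for n in (1, 3, 5, 7):
    print(n, majority_failure_prob(1e-6, n))
```

Of course, real hardware failures aren’t independent, which is part of why I flagged that I’m not an expert here.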
Incidentally, that story about accidental creation of a radio with an evolutionary algorithm was part of what motivated my post in the first place. If the evolutionary algorithm had used tests of its oscillator design in a computer model, rather than in the real world, then it would not have built a radio receiver, since radio signals from nearby computers would not have been included in the computer model of the environment, even though they were present in the actual environment.
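To make the difference concrete (a purely hypothetical sketch, not the actual experiment): the selection pressure can only exploit features that exist in the evaluation environment, so the same search produces different designs depending on whether ambient radio signals are part of that environment.

```python
import random

def fitness(candidate, ambient_signal):
    """Toy fitness: reward oscillating output, however it is produced.
    A candidate can oscillate on its own or amplify an ambient signal."""
    return candidate["own_oscillation"] + candidate["antenna_gain"] * ambient_signal

random.seed(0)
population = [{"own_oscillation": random.random(), "antenna_gain": random.random()}
              for _ in range(100)]

# Evaluated on real hardware, nearby computers supply a free signal to exploit.
best_on_hardware = max(population, key=lambda c: fitness(c, ambient_signal=1.0))
# Evaluated in a computer model containing no such signal, the "radio receiver"
# strategy earns nothing, so a genuine oscillator wins instead.
best_in_model = max(population, key=lambda c: fitness(c, ambient_signal=0.0))

print(best_on_hardware, best_in_model)
```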
What I meant was that the computation isn’t extremely long in the sense of description length, not in the sense of computation time. Also, we aren’t doing policy search over the set of all Turing machines; we’re doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on).
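To illustrate the kind of restriction I have in mind (a minimal sketch; `Policy` objects with `initial_state` and `step` methods are hypothetical placeholders, not a concrete design): each policy in the search set comes with a step bound that guarantees evaluation halts, and the search is exhaustive over that finite set.

```python
def run_policy(policy, observations, max_steps):
    """Run a policy for at most max_steps interactions; every policy in the
    search set is chosen so that this bound is known to be sufficient."""
    actions, state = [], policy.initial_state()
    for obs in observations[:max_steps]:
        action, state = policy.step(obs, state)
        actions.append(action)
    return actions

def best_policy(policy_set, observations, utility, max_steps=10_000):
    """Exhaustive search over a finite set of halting-guaranteed policies."""
    return max(policy_set, key=lambda p: utility(run_policy(p, observations, max_steps)))
```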
Wouldn’t the set of all action sequences have lower description length than some large finite set of policies? There’s also the potential problem that all of the policies in the large finite set you’re searching over could be quite far from optimal.
Ok, understood on the second assumption. U is not a function to [0,1], but a function to the set of [0,1]-valued random variables, and your assumption is that this random variable is uncorrelated with certain claims about the outputs of certain policies. The intuitive explanation of the third condition made sense; my complaint was that even with the intended interpretation at hand, the formal statement made no sense to me.
I’m pretty sure you’re assuming that ϕ is resolved on day n, not that it is resolved eventually.
Searching over the set of all Turing machines won’t halt in a reasonably short amount of time, and in fact won’t halt ever, since the set of all Turing machines is non-compact. So I don’t see what you mean when you say that the computation is not extremely long.
This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you’re suggesting we build an AI without preferences, but it doesn’t sound like that.)
Ok, here’s another attempt to explain what I meant. Somewhere in the platonic realm of abstract mathematical structures, there is a small world with physics quite a lot like ours, containing an AI running on some idealized computational hardware, and trying to arrange the rest of the small world so that it has some desired property. Humans simulate this process so they can see what the AI does in the small world, and copy what it does. The AI could try messing with us spectators, so that we end up giving more compute to the physical instantiation of the AI in the human world (which is different from the AI in the platonic mathematical structure), which the physical instantiation of the AI in the human world can use to better manipulate the simulation of the toy world that we are running in the human world (which is also different from the platonic mathematical structure). The platonic mathematical structure itself does not have a human world with extra compute in it that can be grabbed, so trying to mess with human spectators would, in the platonic mathematical structure, just end up being a waste of compute, so this strategy will be discarded if it somehow gets considered in the first place. Thus a real-world simulation of this AI-in-a-platonic-mathematical-structure will, if accurate, behave in the same way.
I suggest stating the result you’re proving before giving the proof.
You have some unusual notation that I think makes some of this unnecessarily confusing. Instead of this underlined vs non-underlined thing, you should have different functions $U \colon A^{\omega} \to [0,1]$ and $U_{a^{\mathrm{st}}_{1:n-1}} \colon A \times \Pi \to [0,1]$, where the first maps action sequences to utilities, and the second maps a pair consisting of an action $x$ and a future policy $\pi$ to the utility of the action sequence beginning with $a^{\mathrm{st}}_{1:n-1}$, followed by $x$, followed by the action sequence generated by $\pi$. Your first assumption would then be stated $U_{a^{\mathrm{st}}_{1:n-1}}(x,\pi) = U(a^{\mathrm{st}}_{1:n-1}, x, a^{\pi}_{1:\infty})$. Your second assumption (fairness of the environment) is implicit in the type signature of the utility function $U \colon A^{\omega} \to [0,1]$. If your utility depends on something other than the action sequence, then it doesn’t make sense to write it as a function of the action sequence. It’s good to point out assumptions that are implicit in the formalism you’re using, but by the time you identify utility as a function of action sequences, you don’t need to assume fairness of the environment as an additional axiom. I do not understand what your third assumption is.
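Spelled out as displayed formulas (my reconstruction of the notation I’m suggesting, with the superscript $\mathrm{st}$ marking the actions actually taken so far and $a^{\pi}_{1:\infty}$ the sequence generated by $\pi$):

$$U \colon A^{\omega} \to [0,1], \qquad U_{a^{\mathrm{st}}_{1:n-1}} \colon A \times \Pi \to [0,1]$$

$$U_{a^{\mathrm{st}}_{1:n-1}}(x, \pi) \;=\; U\bigl(a^{\mathrm{st}}_{1:n-1},\, x,\, a^{\pi}_{1:\infty}\bigr)$$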
This is emphatically false in general, but there’s a special condition that makes it viable, namely that the distribution at time n is guaranteed to assign probability 1 to ϕ iff ϕ. My epistemic state about this is “this seems extremely plausible, but I don’t know for sure if logical inductors attain this property in the limit”
They don’t. For instance, let ϕ be any true undecidable sentence. The logical inductor does not assign probability 1 to ϕ even in the limit. Your fourth assumption does not seem reasonable. Does $\mathbb{E}_{n-1}(\mathbb{E}_n(U \mid \phi)) = \mathbb{E}_{n-1}(U \mid \phi)$ not give you what you want?
Note that this only explicitly writes out the starting code and the code that it might be modified into, not the past or future action sequence! This is important for the agent to be able to reason about this computation, despite it taking an infinite input.
I think this is exactly backwards. The property that makes spaces easy to search through and reason about is compactness, not finiteness. If $A$ is finite, then $A^{\omega}$ is compact, and thus easy to search through and reason about, provided the relevant functions on it are continuous. But the space of computer programs is an infinite discrete space, hence non-compact, and hard to search through and reason about, except by remembering that the purpose of selecting a program is so that it will generate an element of the nice, easily-searchable compact space of action sequences.
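To spell out the standard facts I’m leaning on here (nothing specific to this setting):

$$A \text{ finite} \;\Longrightarrow\; A^{\omega} \text{ is compact in the product topology (Tychonoff)}$$

$$U \colon A^{\omega} \to [0,1] \text{ continuous} \;\Longrightarrow\; \exists\, a^{*} \in A^{\omega} \text{ with } U(a^{*}) = \sup_{a \in A^{\omega}} U(a)$$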
The model I had in mind was that the AI and the toy world are both abstract computational processes with no causal influence from our world, and that we are merely simulating/spectating on both the AI itself and the toy world it optimizes. If the AI messes with people simulating it so that they end up simulating a similar AI with more compute, this can give it more influence over these people’s simulation of the toy world the AI is optimizing, but it doesn’t give the AI any more influence over the abstract computational process that it (another abstract computational process) was interfacing with and optimizing over.
Separately, I also find it hard to imagine us building a virtual world that is similar enough to the real world that we are able to transfer solutions between the two, even with some finetuning in the real world.
Yes, this could be difficult, and would likely limit what we could do, but I don’t see why this would prevent us from getting anything useful out of a virtual-world-optimizer. Lots of engineering tasks don’t require more explicit physics knowledge than we already have.
I agree. I didn’t mean to imply that I thought this step would be easy, and I would also be interested in more concrete ways of doing it. It’s possible that creating a hereditarily restricted optimizer along the lines I was suggesting could end up being approximately as difficult as creating an aligned general-purpose optimizer, but I intuitively don’t expect this to be the case.
It seems unlikely to me that a re-evaluation of how many QALYs buying a sandwich is worth would arise from a re-evaluation of how valuable QALYs are, rather than a re-evaluation of how much buying the sandwich is worth.
I disagree with this. The value of a QALY could depend on other features of the universe (such as your lifespan) in ways that are difficult to explicitly characterize, and thus are subject to revision upon further thought. That is, you might not be able to say exactly how valuable the difference between living 50 years and living 51 years is, denominated in units of the difference between living 1000 years and living 1001 years. Your estimate of this ratio might be subject to revision once you think about it for longer. So the value of a QALY isn’t stable under re-evaluation, even when expressed in units of QALYs under different circumstances. In general, I’m skeptical that the concept of good reference points whose values are stable in the way you want is a coherent one.
Ok, I see what you’re getting at now.
I don’t think that specifying the property of importance is simple and helps narrow down S. I think that in order for predicting S to be important, S must be generated by a simple process. Processes that take large numbers of bits to specify are correspondingly rarely occurring, and thus less useful to predict.
Suppose that I just specify a generic feature of a simulation that can support life + expansion (the complexity of specifying “a simulation that can support life” is also paid by the intended hypothesis, so we can factor it out). Over a long enough time such a simulation will produce life, that life will spread throughout the simulation, and eventually have some control over many features of that simulation.
Oh yes, I see. That does cut the complexity overhead down a lot.
Once you’ve specified the agent, it just samples randomly from the distribution of “strings I want to influence.” That has a way lower description length than the “natural” complexity of a string I want to influence. For example, if 1/quadrillion strings are important to influence, then the attackers are able to save log(quadrillion) bits.
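For concreteness about the sizes involved (taking “quadrillion” to mean $10^{15}$):

$$\log_2(10^{15}) = 15 \log_2(10) \approx 49.8 \text{ bits}$$

So picking out “a string the agent wants to influence” rather than describing the string directly saves on the order of 50 bits.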
I don’t understand what you’re saying here.
I didn’t mean that an agenty Turing machine would find S and then decide that it wants you to correctly predict S. I meant that to the extent that predicting S is commonly useful, there should be a simple underlying reason why it is commonly useful, and this reason should give you a natural way of computing S that does not have the overhead of any agency that decides whether or not it wants you to correctly predict S.
This reasoning seems to rely on there being such strings S that are useful to predict far out of proportion to what you would expect from their complexity. But a description of the circumstance in which predicting S is so useful should itself give you a way of specifying S, so I doubt that this is possible.
I think decision problems with incomplete information are a better model in which to measure optimization power than deterministic decision problems with complete information are. If the agent knows exactly what payoffs it would get from each action, it is hard to explain why it might not choose the optimal one. In the example I gave, the first agent could have mistakenly concluded that the .9-utility action was better than the 1-utility action while making only small errors in estimating the consequences of each of its actions, while the second agent would need to make large errors in estimating the consequences of its actions in order to think that the .1-utility action was better than the 1-utility action.
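A toy sketch of that intuition (made-up numbers, with Gaussian noise standing in for “small errors” in estimating consequences): with payoffs 1 and .9, modest estimation noise frequently flips the ranking, while with payoffs 1 and .1 it essentially never does.

```python
import random

def pick_best_rate(true_payoffs, noise_sd, trials=100_000):
    """Fraction of trials in which an agent with noisy payoff estimates
    still selects the truly best action."""
    best = true_payoffs.index(max(true_payoffs))
    hits = 0
    for _ in range(trials):
        estimates = [u + random.gauss(0, noise_sd) for u in true_payoffs]
        hits += estimates.index(max(estimates)) == best
    return hits / trials

random.seed(0)
print(pick_best_rate([1.0, 0.9], noise_sd=0.1))  # ranking often flips
print(pick_best_rate([1.0, 0.1], noise_sd=0.1))  # ranking almost never flips
```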