AI alignment researcher supported by HUJI, MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org



Yes, absolutely! The contest is not a publication venue.

A major impediment in applying RL theory to any realistic scenario is that even the control problem^{[1]} is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:

In real life, processes can often be modeled as made of independent co-existing parts. For example, if I need to decide on my exercise routine for the next month and also on my research goals for the next month, the two can be optimized more or less independently.

In real life, planning can often be decomposed across timescales, s.t. you don’t need to make short timescale plans for steps that only happen later on the long timescale. For example, if I’m in the process of planning a trip to Paris, I might need to worry about (i) booking hotel and tickets (long timescale), (ii) navigating the website I’m using to find a flight (medium timescale) and (iii) moving my finger towards the correct key for entering some specific text into a field (short timescale). But I don’t need to worry about walking down the escalator in the airport at this moment.

Here’s an attempt to formalize these properties.

We will define a certain formal language for describing environments. These environments are going to be certain *asymptotic regions* in the space of MDPs.

Each term has a type which consists of a tuple of inputs and a single output. Each input is associated with an HV-polytope^{[2]}. The output is associated with an H-polytope^{[3]}. The inputs represent action spaces (to get a discrete action set, we use the simplex of probability distributions on this set). The output represents the space of admissible equilibria.

The atomic terms are finite communicating^{[4]} MDPs, in which each state is associated with a particular input, and a transition kernel which has to be an affine mapping. For an atomic term, the output is the polytope of stationary state-action distributions. Notice that it’s efficiently computable.

Given two terms, we can construct a new term, their product: we set its output to be the product of the two outputs. This represents a process made of two independent parts.
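Concretely, the “efficiently computable” claim for atomic terms can be illustrated with a linear program: average-reward control of a finite MDP is maximization of a linear reward over the polytope of stationary state-action distributions (non-negativity, normalization, and flow-conservation constraints). A minimal sketch in Python; the toy MDP and all names below are invented for illustration:

```python
# Sketch: average-reward control of a small MDP by linear programming
# over the polytope of stationary state-action distributions.
import numpy as np
from scipy.optimize import linprog

def solve_average_reward(P, r):
    """P[s, a, s'] = transition probability, r[s, a] = reward.
    Maximize sum_{s,a} mu[s,a] * r[s,a] subject to:
      mu >= 0, sum mu = 1,
      flow conservation: sum_a mu[s,a] = sum_{s',a'} mu[s',a'] P[s',a',s].
    """
    S, A, _ = P.shape
    n = S * A
    c = -r.reshape(n)  # linprog minimizes, so negate the reward
    A_eq = np.zeros((S + 1, n))
    b_eq = np.zeros(S + 1)
    for s in range(S):
        for a in range(A):
            j = s * A + a
            A_eq[s, j] += 1.0               # outflow from state s
            for s2 in range(S):
                A_eq[s2, j] -= P[s, a, s2]  # inflow into state s2
    A_eq[S, :] = 1.0                        # normalization row
    b_eq[S] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    mu = res.x.reshape(S, A)
    return mu, -res.fun  # stationary distribution, optimal average reward

# Toy 2-state, 2-action communicating MDP: action 1 moves to state 1,
# action 0 moves to state 0; staying in state 1 pays reward 1.
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.0, 1.0]
P[1, 0] = [1.0, 0.0]; P[1, 1] = [0.0, 1.0]
r = np.array([[0.0, 0.0], [0.0, 1.0]])
mu, value = solve_average_reward(P, r)
```

The polytope here has polynomially many constraints in the number of states and actions, which is the source of the tractability for atomic terms.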

Given a term, a tuple of terms (one per input of the first), and surjective affine mappings from their outputs onto the corresponding inputs, we can construct a new term. This represents an environment governed by the first term on long timescales and by the other terms on short timescales. Notice that it’s possible to efficiently verify that such an affine mapping is a surjection, which is why we use HV-polytopes for inputs^{[5]}.
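To illustrate why HV-descriptions make the surjectivity check easy: the image of a polytope under an affine map is the convex hull of the images of its vertices, so image ⊆ target can be checked against the target’s inequalities, and target ⊆ image by one LP membership query per target vertex. A sketch under these assumptions; all names are invented:

```python
# Sketch: verifying that an affine map x -> M @ x + c maps polytope P
# onto polytope Q, given vertices of P and both inequalities (A @ y <= b)
# and vertices of Q.
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, points):
    """Is `point` a convex combination of the rows of `points`? (one LP)"""
    k, d = points.shape
    # Find lambda >= 0 with sum(lambda) = 1 and points.T @ lambda = point.
    A_eq = np.vstack([points.T, np.ones((1, k))])
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)
    return res.status == 0

def is_surjective(M, c, P_vertices, Q_ineq_A, Q_ineq_b, Q_vertices, tol=1e-9):
    image_vertices = P_vertices @ M.T + c
    # Image inside Q: every mapped vertex satisfies Q's inequalities.
    if np.any(image_vertices @ Q_ineq_A.T > Q_ineq_b + tol):
        return False
    # Q inside image: every vertex of Q lies in conv(mapped vertices).
    return all(in_convex_hull(v, image_vertices) for v in Q_vertices)

# Toy check: projecting the unit square onto the segment [0, 1].
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
M = np.array([[1.0, 0.0]])  # (x, y) -> x
c = np.zeros(1)
seg_A = np.array([[1.0], [-1.0]]); seg_b = np.array([1.0, 0.0])
seg_V = np.array([[0.0], [1.0]])
ok = is_surjective(M, c, square, seg_A, seg_b, seg_V)  # True
```

Both directions reduce to polynomially many LPs precisely because the source’s vertices and the target’s inequalities *and* vertices are all given, which is the point of requiring HV-polytopes on the inputs.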

It might be useful to think of these two constructions as vertical and horizontal composition, in the category-theoretic sense.

In order to assign semantics to this language, we need to define the environment associated with each term. We will do so by assigning a state space, to each state an input (which determines the action space at this state), and a transition kernel. This is done recursively:

For the atomic terms, it is straightforward.

For the product:

The state space is the product of the two state spaces, together with an extra binary factor. Here, the last factor represents which subenvironment is active. This is needed because we want the two subenvironments to be asynchronous, i.e. their time dynamics don’t have to be in lockstep.

The transition kernel is defined by updating the active subenvironment’s state according to its own transition kernel, and then changing the active-subenvironment flag according to some *arbitrary* probabilistic rule, as long as this rule switches the active subenvironment sufficiently often. The degrees of freedom here are one reason we get an asymptotic region in MDP-space rather than a specific MDP.
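As a sanity check on the product construction, here is a sketch of the combined kernel; the switching rule (a fixed probability `p_switch`) and all names are invented for illustration, not the post’s notation:

```python
# States are pairs of subenvironment states plus a flag for which part is
# active; the active part steps by its own kernel, then the flag switches
# with probability p_switch (one arbitrary choice of switching rule).
import itertools
import numpy as np

def product_env(P1, P2, p_switch=0.5):
    """P1[s, a, s'], P2[t, b, t'] are the component kernels.
    Returns kernel Q indexed by (state, action, state'), where states are
    triples (s, t, active) and the action is interpreted by whichever
    component is active."""
    S1, A1, _ = P1.shape
    S2, A2, _ = P2.shape
    states = list(itertools.product(range(S1), range(S2), range(2)))
    idx = {x: i for i, x in enumerate(states)}
    n, A = len(states), max(A1, A2)
    Q = np.zeros((n, A, n))
    for (s, t, active) in states:
        i = idx[(s, t, active)]
        for a in range(A1 if active == 0 else A2):
            # Keep the flag with prob. 1 - p_switch, flip it with p_switch.
            for flag2, pf in ((active, 1 - p_switch), (1 - active, p_switch)):
                if active == 0:
                    for s2 in range(S1):
                        Q[i, a, idx[(s2, t, flag2)]] += pf * P1[s, a, s2]
                else:
                    for t2 in range(S2):
                        Q[i, a, idx[(s, t2, flag2)]] += pf * P2[t, a, t2]
    return Q, states

# Two 2-state, 1-action components; the product has 2 * 2 * 2 = 8 states.
P = np.zeros((2, 1, 2)); P[0, 0] = [0.0, 1.0]; P[1, 0] = [1.0, 0.0]
Q, states = product_env(P, P)
```

Note the state count multiplies under composition, which is how a short description yields an exponentially large environment.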

For the timescale composition:

The state space consists of a long-timescale state paired with a state of the short-timescale subenvironment attached to its input, where we abuse notation to identify the input with its index inside the tuple.

The input assignment is extended from the subenvironments in the obvious way.

The transition kernel is defined so that (i) with high probability, the short-timescale state is updated according to the short-timescale transition kernel, and (ii) with low probability, the long-timescale state is updated according to the long-timescale transition kernel, where the long-timescale action is determined by the *frequency* of state-action pairs since the last type II transition: it is easy to see that the set of possible such frequencies is always a polytope in an appropriately defined space of state-action distributions.
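The “frequency of state-action pairs since the last type II transition” can be made concrete with a small tracker; this is only an illustrative sketch with invented names, not the post’s formalism:

```python
# Track the empirical state-action distribution between long-timescale
# ("type II") transitions; the resulting frequency vector lives in the
# simplex over state-action pairs, hence in a polytope.
from collections import Counter

class FrequencyTracker:
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def record(self, state, action):
        self.counts[(state, action)] += 1
        self.total += 1

    def frequencies(self):
        """Empirical state-action distribution since the last reset."""
        return {sa: n / self.total for sa, n in self.counts.items()}

    def reset(self):
        """Called on a type II (long-timescale) transition."""
        self.counts.clear()
        self.total = 0

t = FrequencyTracker()
for sa in [("s0", "a0"), ("s0", "a0"), ("s1", "a1"), ("s1", "a1")]:
    t.record(*sa)
freq = t.frequencies()  # {('s0', 'a0'): 0.5, ('s1', 'a1'): 0.5}
```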

The upshot is that, given a list of term definitions (which has a structure similar to a directed acyclic graph, since the definition of each term can refer to previously defined terms), we get an environment that can have an exponentially large number of states, but the control problem can be solved in time polynomial in the size of this description, given some assumptions about the reward function. Specifically, we “decorate” our terms with reward functions in the following way:

For atomic terms, we just specify the reward function in the straightforward way.

For the product, we specify some coefficients. The reward is then a linear combination of the individual rewards with these coefficients (and doesn’t depend on which subenvironment is active).

For a timescale composition, we need the long-timescale reward to factor through the surjection via some affine mapping which is part of the decoration. This can be validated efficiently (here it’s important again that the input is an HV-polytope). In addition, we specify some coefficients, and the reward is a linear combination, with these coefficients, of the long-timescale reward and the short-timescale reward.

For timescale decomposition, this planning algorithm can be regarded as a formalization of instrumental goals.

An important open problem is understanding the sample complexity of learning hypothesis classes made of such environments: first in the computationally unbounded case, and then with polynomial-time learning algorithms.

- ↩︎
“Control” means finding the optimal policy given a known transition kernel and reward function.

- ↩︎
An HV-polytope is a polytope described by a list of inequalities *and* a list of vertices (notice that it’s possible to efficiently validate such a description).

- ↩︎
An H-polytope is a polytope described by a list of inequalities.

- ↩︎
Maybe we can drop this requirement and use the polytope of *reachable* stationary state-action distributions for the output.

- ↩︎
According to Tiwary 2008, projection of H-polytopes is NP-hard even in the output-sensitive sense, but for non-degenerate projection directions it is output-sensitive polynomial time. In particular, this means we should be able to efficiently verify surjectivity in the non-degenerate case even for H-polytopes on the inputs. However, the proof given there seems poorly written and the paper is not peer-reviewed AFAICT.

Some random observations:

What we actually want to optimize is how much *subjective* time we survive, *not* surviving as long as possible on the physical timeline.

If for some reason we wanted to survive as long as possible on the physical timeline, the best strategy might be storing energy/negentropy harvested during early periods rather than finding ways to harvest remaining energy during late periods.

AFAICT, it might be possible to harvest energy without stars or even black holes, by collecting cosmic gas and doing nuclear fusion, or something even better than nuclear fusion (e.g. inducing proton decay or throwing matter into small rapidly evaporating black holes).

AFAIK, it’s possible that there are bound states of neutrinos held together by the weak force, and bound states of such “multi-neutrinos” held together by gravity. And the particles that dark matter is made of open up even more possibilities. Maybe we can build computers out of this stuff after all the protons decay.

If our vacuum is only metastable, we might be able to induce some kind of controlled vacuum collapse and escape into the resulting new universe, which opens a bunch of more options.

A question that often comes up in discussions of IRL: are agency and values purely behavioral concepts, or do they depend on *how* the system produces its behavior? The cartesian measure of agency I proposed seems purely behavioral, since it only depends on the policy. The physicalist version seems less so, since it depends on the source code, but this difference might be minor: the role of the source code is merely telling the agent “where” it is in the universe. However, on closer examination, the physicalist version is far from purely behavioral, and this is true even for cartesian Turing RL. Indeed, the policy describes not only the agent’s interaction with the actual environment but also its interaction with the “envelope” computer. In a sense, the policy can be said to reflect the agent’s “conscious thoughts”.

This means that specifying an agent requires not only specifying its source code but also the “envelope semantics” (possibly we also need to penalize for its complexity in the definition of the agency measure). Identifying that an agent exists requires not only that its source code is running, but also, at least, that its history is consistent with the corresponding variable of the bridge transform. That is, any history of the agent must be realized by some destiny in the bridge transform. In other words, we want any computation the agent ostensibly runs on the envelope to be one that is physically manifest (it might be that this condition isn’t sufficiently strong, since it doesn’t seem to establish a causal relation between the manifesting and the agent’s observations, but it’s at least necessary).

Notice also that the computational power of the envelope becomes another characteristic of the agent’s intelligence, together with the measure of agency as a function of the cost of computational resources. It might be useful to come up with natural ways to quantify this power.

If you’re making this monthly, I suggest that you create a way for people who don’t have Facebook to follow it, for example by posting these events on the (currently defunct) page of the LessWrong Israel group.

The spectrum you’re describing is related, I think, to the spectrum that appears in the AIT definition of agency, where there is dependence on the *cost of computational resources*. This means that the same system can appear agentic from a resource-scarce perspective but non-agentic from a resource-abundant perspective. The former then corresponds to the Vingean regime and the latter to the predictable regime. However, the framework does have a notion of prior and not just utility, so it *is* possible to ascribe beliefs to Vingean agents. I think that makes sense: the beliefs of another agent can predictably differ from your own beliefs, if only because there is some evidence that you have seen but the other agent, to the best of your knowledge, has not^{[1]}.

- ↩︎
You need to allow for the possibility that the other agent inferred this evidence from some pattern you are not aware of, but you should not be confident of this. For example, even an arbitrarily intelligent AI that received zero external information should have a hard time inferring certain things about the world that we know.


# Causality in IBP

There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis, we consider its bridge transform. Given some subset of programs, we can then project to it^{[1]}. We can then take the bridge transform *again*. The resulting factor now tells us which programs causally affect the manifestation of the programs in our subset. Notice that by Proposition 2.8 in the IBP article, when the subset is all programs, we just get all programs that are running, which makes sense.

# Agreement Rules Out Mesa-Optimization

The version of PreDCA without any explicit malign hypothesis filtering might be immune to malign hypotheses, and here is why. It seems plausible that IBP admits an agreement theorem (analogous to Aumann’s) which informally amounts to the following: given two agents Alice and Bobcat that (i) share the same physical universe, (ii) have a sufficiently tight causal relationship (each can see what the other sees), (iii) have unprivileged locations inside the physical universe, (iv) start from similar/compatible priors and (v) [maybe needed?] have similar utility functions, they converge to similar/compatible beliefs, regardless of the complexity of translation between their subjective viewpoints. This is plausible because (i) as opposed to the cartesian framework, different bridge rules don’t lead to different probabilities and (ii) if Bobcat considers a simulation hypothesis plausible, and the simulation is sufficiently detailed to fool it indefinitely, then the simulation contains a detailed simulation of Alice, and hence Alice must also consider this a plausible hypothesis.

If the agreement conjecture is true, then the AI will converge to hypotheses that all contain the user, in a causal relationship with the AI that affirms them as the user. Moreover, those hypotheses will be compatible with the user’s own posterior (i.e. the differences can be attributed to the AI’s superior reasoning). Therefore, the AI will act on the user’s behalf, leaving no room for mesa-optimizers. Any would-be mesa-optimizer has to take the shape of a hypothesis that the user should also believe, within which the pointer-to-values still points to the right place.

Two nuances:

Maybe in practice there’s still room for simulation hypotheses of the AI which contain coarse-grained simulations of the user. In this case, the user detection algorithm might need to allow for coarsely simulated agents.

If the agreement theorem needs condition (v), we get a self-referential loop: if the AI and the user converge to the same utility function, the theorem guarantees they converge to the same utility function, but otherwise it doesn’t. This might make the entire thing a useless tautology, or there might be a way to favorably resolve the self-reference, vaguely analogous to how Löb’s theorem allows resolving the self-reference in Prisoner’s Dilemma games between FairBots.

- ↩︎
There are actually two ways to do this, corresponding to two natural mappings between the relevant spaces. The first is just projecting the subset directly, the second is analogous to what’s used in Proposition 2.16 of the IBP article. I’m not entirely sure which is correct here.

The problem of future unaligned AI leaking into human imitation is something I wrote about before. Notice that IDA-style recursion helps a lot, because instead of simulating a process going deep into the external timeline’s future, you’re simulating a “groundhog day” where the researcher wakes up over and over at the same external time (more realistically, the restart time is drifting forward with the time outside the simulation) with a written record of all their previous work (but no memory of it). There can still be a problem if there is a positive probability of unaligned AI takeover in the present (i.e. during the time interval of the simulated loop), but it’s a milder problem. It can be further ameliorated if the AI has enough information about the external world to make confident predictions about the possibility of unaligned takeover during this period. The out-of-distribution problem is also less severe: the AI can occasionally query the real researcher to make sure its predictions are still on track.

I think it’s a terrible idea to automatically adopt an equilibrium notion which incentivises the players to come up with increasingly nasty threats as fallback if they don’t get their way. And so there seems to be a good chunk of remaining work to be done, involving poking more carefully at the CoCo value and seeing which assumptions going into it can be broken.

I’m not convinced there is any real problem here. The intuitive negative reaction we have to this “ugliness” is because of (i) empathy and (ii) morality. Empathy is just a part of the utility function which, when accounted for, already ameliorates some of the ugliness. Morality is a reflection of the fact that we are already in some kind of bargaining equilibrium. Therefore, thinking about all the threats invokes a feeling of all existing agreements getting dissolved, sending us back to “square one” of the bargaining. And the latter is something that, reasonably, nobody wants to do. But none of this implies this is not the correct ideal notion of bargaining equilibrium.

There is a sense in which agency is a fundamental concept. Before we can talk about physics, we need to talk about metaphysics (what is a “theory of physics”? how do we know which theories are true and which are false?). My best-guess theory of metaphysics is infra-Bayesian physicalism (IBP), where agency is a central pillar: we need to talk about hypotheses *of the agent*, and counterfactual policies *of the agent*. It also looks like epistemic rationality is inseparable from instrumental rationality: it’s impossible to do metaphysics without also doing decision theory.

Does this refute reductionist materialism? Well, it depends on how you define “reductionist materialism”. There is a sense in which IBP is very harmonious with reductionist materialism, because each hypothesis talks about the universe from a “bird’s eye view”, without referring to the relationship of the agent with the universe (this relationship turns out to be possible to infer using the agent’s knowledge of its own source code), or even assuming any agent exists inside the universe described by the hypothesis. But the agent is still implicit in the question of *whose* hypothesis it is.

Once we accept the “viewpoint agent” (i.e. the agent who hypothesizes/infers/decides) as fundamental, we can still ask: what about *other* agents? The answer is: other agents are programs with a high value of the agency measure (see Definition 1.6 in the IBP article) which the universe is “running” (this is a well-defined thing in IBP). In this sense, other agents are sort of like rocks: emergent from the fundamental reductionist description of the universe. However, there’s a nuance: this reductionist description of the universe is a *belief* of the viewpoint agent. The fact that it is a belief (formalized as a homogeneous ultradistribution) is crucial in the definition. So, once again, we cannot eliminate agency from the picture.

The silver lining is that, even though the concept of which programs are running is defined using beliefs, i.e. requires a *subjective* ontology, it seems likely that different agents inhabiting the same universe can agree on it (see the subsection “are manifest facts objective” in the IBP article), so there is a sense in which it is objective after all. Decide for yourself whether to call this “reductionist materialism”.

This is a fascinating result, but there is a caveat worth noting. When we say that e.g. AlphaGo is “superhuman at go”, we are comparing it to humans who (i) spent *years* training on the task and (ii) were selected for being the best at it among a sizable population. On the other hand, with next-token prediction we’re nowhere near that amount of optimization on the human side. (That said, I also agree that optimizing a model on next-token prediction is very different from what optimizing it for text coherence would be, if we could accomplish the latter.)

I’m glad that you guys are interested in working on IBP/PreDCA. Here are a few points that might help you:

The scope of this project seems extremely ambitious. I think that the road from here to empirical demonstrations (assuming that you mean in the real world rather than some artificial toy setting) is a programme for many people working over many years. Therefore, I think you will benefit from zooming in and deciding on the particular first step you want to take along that road.

Technicality: the definition of the agency measure in the IBP post is for a fixed loss function, because that was sufficient for the purposes of that post, but the definition of the cartesian version is loss-function-agnostic. Ofc it’s completely straightforward to write a loss-function-specific cartesian version and a loss-function-agnostic physicalist version.

Regarding the agentometer/utiliscope, I think it’s probably best to start from studying the cartesian versions, because that’s likely to be simpler and the physicalist theory will build on that.

Specifically, we want to get theorems along the lines of: (i) for an agent with a sufficiently high measure of agency, the asymptotic behavior of the utility function can be inferred nearly unambiguously; (ii) inferring the utility function of agent-1 and then optimizing it via agent-2 that e.g. has a richer observation channel leads to results that are better for agent-1 than what agent-1 can do on its own, in the long time horizon limit.

The epistemic status of the utiliscope formula from my presentation is: I’m pretty optimistic that there is *some* correct formula along those lines, but the specific formula I wrote there is just my best guess after thinking for a short time, and I am far from confident it is correct. My confidence would become much higher if we demonstrated some non-trivial theorems showing that it satisfies some intuitive desiderata.

The epistemic status of the definition of the agency measure is: I’m pretty optimistic it is *close* to correct, but there is definitely room for quibbling over the details.

While studying the computational complexity of the relevant mathematical objects is reasonable, I advise steering clear of practical implementations of IBRL/IBP (assuming that “practical” means “competitive+ with ANNs”) because of the associated risks, until we are *much* further along on the theoretical side.

Also, I am completely open to discussing the details of your project in private, if you’re serious about it.

The short answer is, I don’t know.

The long answer is, here are some possibilities, roughly ordered from “boring” to “weird”:

1. The framework is wrong.

2. The framework is incomplete: there is some extension which gets rid of monotonicity. There are some obvious ways to make such extensions, but they look uglier, and without further research it’s hard to say whether they break important things or not.

3. Humans are just not physicalist agents; you’re not supposed to model them using this framework, even if this framework can be useful for AI. This is why humans took so much time coming up with science.

4. Like #3, and also, if we thought long enough we would become convinced of some kind of simulation/deity hypothesis (where the simulator/deity is a physicalist), and this is normatively correct for us.

5. Because the universe is effectively finite (since it’s asymptotically de Sitter), there are only so many computations that can run. Therefore, even if you only assign positive value to running certain computations, it effectively implies that running other computations is bad. Moreover, the fact that the universe is finite is unsurprising, since infinite universes tend to have *all* possible computations running, which makes them roughly irrelevant hypotheses for a physicalist.

6. We are just confused about hell being worse than death. For example, maybe people in hell have no qualia. This makes some sense if you endorse the (natural for physicalists) anthropic theory that only the best-off future copy of you matters. You can imagine there always being a “dead copy” of you, so that if something worse-than-death happens to the apparent-you, your subjective experiences go into the “dead copy”.

The problem is that if the hypothesis implies that one program creates another, but you consider a counterfactual in which the first doesn’t create the second, then you get an inconsistent hypothesis, i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it “hard counterfactuals”) only makes sense when the condition you’re counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action, but not safe to assume in general). In a child post I suggested solving this by defining “soft counterfactuals”, where you consider coarsenings of the hypothesis in addition to the hypothesis itself.

it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText

Transformers are Turing complete, so “model of this type” is not much of a constraint. On the other hand, I guess it’s theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It also seems possible that the scaling law doesn’t go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the “irreducible” term.
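For illustration, here is what such a phase transition could look like numerically; the functional forms and all constants below are made up for illustration, not fitted to any real data:

```python
# Illustrative sketch: a standard scaling law L(N) = a * N**(-b) + c with
# an "irreducible" term c, versus a hypothetical curve that transitions at
# N0 onto a new trend that dips below c.
def smooth_law(N, a=10.0, b=0.3, c=1.0):
    return a * N ** (-b) + c

def broken_law(N, a=10.0, b=0.3, c=1.0, N0=1e9, b2=0.1):
    if N <= N0:
        return smooth_law(N, a, b, c)
    # Past the transition, even the former "irreducible" term decays.
    L0 = smooth_law(N0, a, b, c)
    return L0 * (N / N0) ** (-b2)

for N in [1e3, 1e6, 1e9, 1e12, 1e15]:
    print(f"N={N:.0e}  smooth={smooth_law(N):.3f}  broken={broken_law(N):.3f}")
```

The smooth curve never drops below its irreducible term, while the broken one eventually does, which is the distinction the paragraph above is pointing at.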

Master post for ideas about infra-Bayesian physicalism.

Other relevant posts:

PreDCA alignment protocol

Telephone Theorem, Redundancy/Resampling, and Maxent for the math, Chaos for the concepts.

Thank you!

Just because something can be learned efficiently doesn’t mean it’s convergent for a wide variety of cognitive systems.

I believe that the relevant cognitive systems all look like learning algorithms for a prior of a certain fairly specific type. I don’t know what this prior looks like, but it’s something very rich on the one hand and efficiently learnable on the other hand. So, if you showed that your formalism naturally produces priors that seem closer to that “holy grail prior”, in terms of richness/efficiency, compared to priors that we already know (e.g. MDPs with a small number of states, which are not rich enough, or the Solomonoff prior, which is both statistically and computationally intractable), that would at least be *evidence* that you’re going in the right direction.

And even if such hypothesis classes couldn’t be learned efficiently in full generality, it would still be possible for a subset of that hypothesis class to be convergent for a wide variety of cognitive systems, in which case general properties of the hypothesis class would still apply to those systems’ cognition.

Hmm, I’m not sure what it would mean for a subset of a hypothesis class to be “convergent”.

The question we actually want here is “Is abstraction, as captured by John’s formalism, instrumentally convergent for a wide variety of cognitive systems?”.

That’s interesting, but I’m still not sure what it means exactly. Let’s say we take a reinforcement learner with a specific hypothesis class, such as all MDPs of a certain size, or some family of MDPs with low eluder dimension, or the actual AIXI. How would you determine whether your formalism is “instrumentally convergent” for each of those? Is there a rigorous way to state the question?

As I see it, the core theory of natural abstractions is now 80% nailed down

Question 1: What’s the minimal set of articles one should read to understand this 80%?

Question/Remark 2: AFAICT, your theory has a major missing piece, which is proving that “abstraction” (formalized according to your way of formalizing it) is actually a crucial ingredient of learning/cognition. The way I see it, such a proof should work by demonstrating that hypothesis classes defined in terms of probabilistic graph models / abstraction hierarchies can be learned with good sample complexity (and better yet if you can tell something about the computational complexity), in a manner that cannot be achieved if you discard any of the important-according-to-you pieces. You might have some different approach to this, but I’m not sure what it is.

This is a classical example where having a prediction market creates really bad incentives.