Text whose primary goal is conveying information (as opposed to emotion, experience or aesthetics) should be skimming-friendly. Time is expensive, words are cheap. Skimming is a vital mode of engaging with text, either to evaluate whether it deserves a deeper read or to extract just the information you need. As a reader, you should nurture your skimming skills. As a writer, you should treat skimmers as a legitimate and important part of your target audience. Among other things, this means:

Good title and TLDR/abstract

Clear and useful division into sections

Putting the high-level picture and conclusions first, the technicalities and detailed arguments later. Never leave the reader clueless about where you’re going with something for a long time.

Visually emphasize the central points and make them as self-contained as possible. For example, in the statement of mathematical theorems avoid terminology whose definition is hidden somewhere in the bulk of the text.

Stronger: as a writer you should assume your modal reader is a skimmer, both because they are, and because even non-skimmers are only going to remember about the same number of things that a good skimmer does.

An AI progress scenario which seems possible and which I haven’t seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That’s because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we’re still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevertheless, some superhuman performance might be produced by sped-up simulation, reducing noise in human behavior and controlling the initial conditions (e.g. simulating a human on a good day). As a result, we will have some period of time during which AGI is already here, automation is in full swing, but there’s little or no further escalation. At the end of this period, the missing ingredients will be assembled (maybe with the help of AI researchers) and superhuman AI (possibly a fast takeoff) begins.

It’s interesting to try and work out the consequences of such a scenario, and the implications on AI strategy.

This seems similar to gaining uploads prior to AGI, and opens up all those superorg upload-city amplification/distillation constructions which should get past human level shortly after. In other words, the limitations of the dataset can be solved by amplification as soon as the AIs are good enough to be used as building blocks for meaningful amplification, and something human-level-ish seems good enough for that. Maybe even GPT-n is good enough for that.

That is similar to gaining uploads (borrowing terminology from Egan, we can call them “sideloads”), but it’s not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you’re distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that’s not better than plain speedup.

To me this seems to be essentially another limitation of the human Internet archive dataset: reasoning is presented in an opaque way (most slow/deliberative thoughts are not in the dataset), so it’s necessary to do a lot of guesswork to figure out how it works. A better dataset both explains and summarizes the reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3 can do that to an extent by roleplaying Feynman).

Any algorithm can be represented by a habit of thought (Turing machine style if you must), and if those are in the dataset, they can be learned. The habits of thought that are simple enough to summarize get summarized and end up requiring fewer steps. My guess is that the human faculties needed for AGI can be both represented by sequences of thoughts (probably just text, stream of consciousness style) and easily learned with current ML. So right now the main obstruction is that it’s not feasible to build a dataset with those faculties represented explicitly that’s good enough and large enough for current sample-inefficient ML to grok. More compute in the learning algorithm is only relevant for this to the extent that we get a better dataset generator that can work on the tasks before it more reliably.

I don’t see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior “stream of thought”. In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of thought that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.

I was arguing that near human level babblers (including the imitation plateau you were talking about) should quickly lead to human level AGIs by amplification via stream of consciousness datasets, which doesn’t pose new ML difficulties other than design of the dataset. Superintelligence follows from that by any of the same arguments as for uploads leading to AGI (much faster technological progress; if amplification/distillation of uploads is useful straight away, we get there faster, but it’s not necessary). And amplified babblers should be stronger than vanilla uploads (at least implausibly well-educated, well-coordinated, high IQ humans).

For your scenario to be stable, it needs to be impossible (in the near term) to run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain less effective than very high IQ humans. Otherwise you get acceleration of technological progress, including ML. So my point is that feasibility of imitation plateau depends on absence of compute overhang, not on ML failing to capture some of the ingredients of human general intelligence.

The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhang, whereas success in capturing all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an “intrinsic” AGI might lead to the classical “foom” scenario.

EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.

I agree. But GPT-3 seems to me like a good estimate for how much compute it takes to run stream of consciousness imitation learning sideloads (assuming that learning is done in batches on datasets carefully prepared by non-learning sideloads, so the cost of learning is less important). And with that estimate we already have enough compute overhang to accelerate technological progress as soon as the first amplified babbler AGIs are developed, which, as I argued above, should happen shortly after babblers actually useful for automation of human jobs are developed (because generation of stream of consciousness datasets is a special case of such a job).

So the key things to make imitation plateau last for years are either sideloads requiring more compute than it looks like (to me) they require, or amplification of competent babblers into similarly competent AGIs being a hard problem that takes a long time to solve.

Another thing that might happen is a data bottleneck.

Maybe there will be a good enough dataset to produce a sideload that simulates an “average” person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are far fewer competent AI researchers than people in general).

Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That’s because, unless you’re training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your “episodes”. Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length.

In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. And, simulating progress requires by design going off-distribution along certain dimensions which might make things worse.

Another way to describe the same (or similar) plateau: we could think about GPT-n as a GLUT with approximation between prerecorded answers. It can produce intelligent products similar to the ones which were created by humans in the past and are present in its training dataset, but not above the human intelligence level, as there are no superintelligent examples in the dataset.

I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn’t make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user’s beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

More formally, we consider (some extension of) a delegative IRL setting (i.e. there is a single set of input/output channels, the control of which can be toggled between the user and the AI by the AI). Let $\pi^u_\upsilon$ be the user’s policy in universe $\upsilon$ and $\pi^a$ the AI policy. Let $T$ be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability 1 for any policy. Let $V_\upsilon$ be the value of a state from the user’s subjective POV, in universe $\upsilon$. Let $\mu_\upsilon$ be the environment in universe $\upsilon$. Finally, let $\zeta$ be the AI’s prior over universes and $\epsilon$ some sufficiently small bound. We require

$$\forall T: \mathbb{E}_{\upsilon\sim\zeta}\left[\left(\mathbb{E}_{\pi^u_\upsilon\mu_\upsilon}[V_\upsilon(T)]-\mathbb{E}_{\pi^a\mu_\upsilon}[V_\upsilon(T)]\right)_+\right]\leq\epsilon$$

Here, $V_\upsilon(T)$ designates the value after event $T$ happens, and $(x)_+$ is defined to be 0 for $x<0$ and $x$ otherwise.
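To make the constraint concrete, here is a minimal sketch that evaluates the left-hand side for a finite set of universes and a single fixed event T. The function names, the three-universe prior and all the numbers are assumptions made up for illustration; the actual principle quantifies over every T and over a general prior.

```python
from typing import Sequence


def hippocratic_gap(prior: Sequence[float],
                    user_value: Sequence[float],
                    ai_value: Sequence[float]) -> float:
    """Prior-expected positive part of (user value - AI value).

    prior[i]      : the AI's prior probability of universe i
    user_value[i] : expected V at event T if the user acts alone in universe i
    ai_value[i]   : expected V at event T under the AI policy in universe i
    """
    return sum(p * max(u - a, 0.0)
               for p, u, a in zip(prior, user_value, ai_value))


def is_hippocratic(prior, user_value, ai_value, eps: float) -> bool:
    # The constraint for this one T: the expected loss must not exceed eps.
    return hippocratic_gap(prior, user_value, ai_value) <= eps


# Hypothetical example: the AI helps in two universes and slightly
# hurts in the third.
prior = [0.5, 0.3, 0.2]
user_value = [0.6, 0.4, 0.9]
ai_value = [0.8, 0.5, 0.85]
gap = hippocratic_gap(prior, user_value, ai_value)
```

Note that gains in the first two universes do not offset the loss in the third: only the positive part of the difference enters the expectation, so here the gap is roughly 0.2 × 0.05.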

I now outline an example of a Hippocratic protocol, which I call Hippocratic timeline-driven learning (HTDL).

Suppose, for simplicity, that the goal is accomplishing some binary task (i.e. we can either succeed or fail). We can imagine the task to be something like “construct a defense system against malign AI”. At any given moment, the user reports their subjective timeline: i.e., for any t, the probabilities of success and failure after time t, assuming the user will continue working on the task by themselves. We assume the AI can learn a good model of the user’s policy and the user’s nominal report, but this nominal report might deviate from the true timeline because the system can be corrupted (i.e. the user manipulated to enter false information or the input channel hacked). Moreover, we assume that such corruption counts as failing the task.

The AI operates by maximizing the AI-expectation of the user’s (unknown) utility function $\mathbb{E}_{\upsilon\sim\zeta}[\mathbb{E}_{\pi\mu_\upsilon}[U_\upsilon]]$, subject to the Hippocratic principle as a constraint. Since we’re working in a delegative setting, the AI can trivially achieve the constraint by always delegating. However, it can do much better than that, by short-term quantilizing w.r.t. the user’s reported success probability (with the user’s policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user’s reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation. Then, after completing this short-term maneuver, the AI can do another short-term quantilization, and continue in this vein. Essentially, the AI is gradually bending the user’s trajectory towards success.

Of course, we can improve the guarantees of such a protocol either by adding more information or by using stronger assumptions (e.g. assume the user is quasi-rational). But timeline-driven short-term quantilization serves as a useful feasibility proof.

However, it can do much better than that, by short-term quantilizing w.r.t. the user’s reported success probability (with the user’s policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user’s reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation.

I don’t understand what you mean here by quantilizing. The meaning I know is to take a random action over the top α actions, on a given base distribution. But I don’t see a distribution here, or even a clear ordering over actions (given that we don’t have access to the utility function).

I’m probably missing something obvious, but more details would really help.

The distribution is the user’s policy, and the utility function for this purpose is the eventual success probability estimated by the user (as part of the timeline report), in the end of the “maneuver”. More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it, for example I did it for MDPs.

So is the general idea that we quantilize such that we’re choosing in expectation an action that doesn’t have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

I also wonder if using the user policy to sample actions isn’t limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

So is the general idea that we quantilize such that we’re choosing in expectation an action that doesn’t have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is ϵ and your quantilization fraction is ϕ then the AI’s probability of corruption is bounded by ϵ/ϕ.

I also wonder if using the user policy to sample actions isn’t limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn’t specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
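The corruption bound discussed in this exchange can be checked on a toy example. The sketch below is a one-shot quantilizer over a finite action set, not the MDP generalization mentioned above; the action names, scores and probabilities are hypothetical. It keeps the top ϕ fraction of the base distribution's probability mass, ranked by score, and renormalizes, so no action's probability grows by more than a factor of 1/ϕ.

```python
def quantilize(base, utility, phi):
    """Return the phi-quantilized version of a finite base distribution.

    base    : dict mapping action -> base probability (here, the user's policy)
    utility : dict mapping action -> ranking score (here, the reported
              success probability at the end of the maneuver)
    phi     : fraction of base probability mass to keep, with 0 < phi <= 1
    """
    remaining = phi
    quantilized = {}
    # Walk through actions from best to worst score, keeping base mass
    # until a total of phi has been collected, then renormalize by phi.
    for action in sorted(base, key=lambda a: utility[a], reverse=True):
        if remaining <= 0:
            break
        kept = min(base[action], remaining)
        quantilized[action] = kept / phi
        remaining -= kept
    return quantilized


# Hypothetical numbers: "bad" is a corrupting action that nevertheless
# receives the highest (manipulated) score.
user_policy = {"a": 0.5, "b": 0.3, "c": 0.15, "bad": 0.05}
reported_score = {"a": 0.2, "b": 0.5, "c": 0.6, "bad": 0.9}
eps = user_policy["bad"]  # corruption probability under the user policy
phi = 0.25

ai_policy = quantilize(user_policy, reported_score, phi)
corruption_probability = ai_policy.get("bad", 0.0)
```

In this worst case the corrupting action sits at the very top of the ranking, and the quantilized policy selects it with probability ϵ/ϕ = 0.2, so the bound is tight. Shrinking ϕ optimizes harder but loosens the bound.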

I like this because it’s simple and obviously correct. Also I can see at least one way you could implement it:

a. Suppose the AI is ‘shadowing’ a human worker doing a critical task. Say it is ‘shadowing’ a human physician.

b. Each time the AI observes the same patient, it regresses between [data from the patient] and [predicted decision a ‘good’ physician would make, predicted outcome for the ‘good’ decision]. Once the physician makes a decision and communicates it, the AI regresses between [decision the physician made] and [predicted outcome for that decision].

c. The machine also must have a confidence or this won’t work.

With large numbers of observations and outright errors made by the physician, it’s then possible to detect the cases where the [decision the physician made] has a substantially worse outcome than the [predicted decision a ‘good’ physician would make]. When the AI has high confidence of this [requiring many observations of similar situations], it’s time to call for a second opinion.

In the long run, of course, there will be a point where the [predicted decision a ‘good’ physician would make] is better than the [information gain from a second human opinion] and you really would do best by firing the physician and having the AI make the decisions from then on, trusting it to call for a second opinion when it is not confident.

(As an example, AlphaGo Zero likely doesn’t benefit from asking another master Go player for a ‘second opinion’ when it sees the player it is advising make a bad call.)

I have repeatedly argued for a departure from pure Bayesianism that I call “quasi-Bayesianism”. But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here’s another way to understand it, using Bayesianism’s own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.

A possible counterargument is, we don’t need to depart far from Bayesianism to win here. We only need to somehow justify randomization, perhaps by something like infinitesimal random perturbations of the belief state (like with reflective oracles). But, in a way, this is exactly what quasi-Bayesianism does: a quasi-Bayes-optimal policy is in particular Bayes-optimal when the prior is taken to be in Nash equilibrium of the associated zero-sum game. However, Bayes-optimality underspecifies the policy: not every optimal reply to a Nash equilibrium is a Nash equilibrium.

This argument is not entirely novel: it is just a special case of an environment that the agent cannot simulate, which is the original motivation for quasi-Bayesianism. In some sense, any Bayesian agent is dogmatic: it dogmatically believes that the environment is computationally simple, since it cannot consider a hypothesis which is not. Here, Omega exploits this false dogmatic belief.
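The Dutch-booking scenario can be played out in a toy simulation. This is only an illustration of why predictability is exploitable and why randomization helps; it is not an implementation of quasi-Bayesianism. The betting rule, the payoffs of ±1 and the way Omega "predicts" (by running a copy of the agent) are all assumptions made for the sketch.

```python
import random


def deterministic_alice(history):
    # Any deterministic rule will do; this one backs whichever side has
    # won more often so far (ties go to heads).
    heads = sum(1 for outcome in history if outcome == "H")
    return "H" if 2 * heads >= len(history) else "T"


def make_randomizer(rng):
    # A policy that ignores the history and flips its own private coin.
    def policy(history):
        return rng.choice("HT")
    return policy


def play(agent, omega_model, rounds):
    """Net winnings of `agent` when Omega rigs each bet against its prediction."""
    history, total = [], 0
    for _ in range(rounds):
        predicted = omega_model(history)
        outcome = "T" if predicted == "H" else "H"  # the predicted side loses
        choice = agent(history)
        total += 1 if choice == outcome else -1
        history.append(outcome)
    return total


rounds = 1000
# Omega can run an exact copy of a deterministic agent...
deterministic_total = play(deterministic_alice, deterministic_alice, rounds)
# ...but only a copy with independent coin flips of a randomizing one.
random_total = play(make_randomizer(random.Random(0)),
                    make_randomizer(random.Random(1)), rounds)
```

The deterministic agent loses every single bet, whereas the randomizing agent's net winnings fluctuate around zero because Omega's copy does not share its coin flips.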

Bayesians are allowed to understand that there are agents with better estimates than they have. And that being offered a bet _IS_ evidence that the other agent THINKS they have an advantage.

Randomization (aka “mixed strategy”) is well-understood as the rational move in games where opponents are predicting your choices. I have read nothing that would even hint that it’s unavailable to Bayesian agents. The relevant probability (updated per Bayes’s Rule) would be “is my counterpart trying to minimize my payout based on my choices”.

edit: I realize you may be using a different definition of “Bayesianism” than I am. I’m thinking humans striving for rational choices, which perforce includes the knowledge of incomplete computation and imperfect knowledge. Naive agents can be imagined that don’t have this complexity. Those guys are stuck, and Omega’s gonna pwn them.

I’m thinking humans striving for rational choices,

It feels like there are better words for this, like rationality, whereas Bayesianism is a more specific philosophy about how best to represent and update beliefs.

And here I thought the reason was going to be that Bayesianism doesn’t appear to include the cost of computation. (Thus, the usual Dutch book arguments should be adjusted so that “optimal betting” does not leave one worse off for having paid, say, an oracle, too much for computation.)

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requires considering shallow time discounts, otherwise there is no notion of long-term planning. For shallow time discounts, the game theory of a repeated game is very different from the game theory of the original normal form game. In fact, the folk theorem asserts that any payoff vector above the maximin of each player is a possible Nash payoff. So, proving convergence to a Nash equilibrium amounts (more or less) to proving convergence to at least the maximin payoff. This is possible using incomplete models, but doesn’t seem very interesting: to receive the maximin payoff, the agents only have to learn the rules of the game, they need not learn the reward functions of the other players or anything else about them.

We arrive at the question: what setting is realistic (in the sense of involving learning with shallow time discount) and is expected to produce Nash equilibria for a normal form game? I suggest the following. Instead of a fixed set of agents repeatedly playing against each other, we consider a population of agents that are teamed up randomly on each round of the game. The population is assumed to be large enough for agents not to encounter each other more than once. This can be formalized as follows. Let $A_i$ be the pure strategy set of the $i$-th agent and $O:=\prod_i A_i$ the set of pure outcomes. The set of $n$-round outcome histories is $O^n$. The population of agents on the $n$-th round can then be described as a probability measure $\mu_n\in\Delta O^n$. Suppose the policy of the $i$-th player (that is, of all the agents that take the role of the $i$-th player) is $\pi_i:O^n\to\Delta A_i$. Then we can define a time evolution rule that produces $\mu_{n+1}$ from $\mu_n$. This rule works as follows: in order to sample $\mu_{n+1}$ we sample $\mu_n$ once per player (this is the history the given player has seen), sample the policy of each player on its own history, and produce a new history by appending the resulting outcome to one of the old histories (it doesn’t matter which). A set of policies is considered to be in equilibrium when, for any $i$ and any alternative policy $\pi'_i$, letting $\pi'_i$ play against the same population (i.e. all other copies of the $i$-th player still play $\pi_i$) doesn’t improve expected utility. In other words, on each round the “mutant” agent retains its own history but the other player histories are still sampled from the same $\mu_n$. It is easy to see that any equilibrium payoff in this setting is a Nash payoff in the original normal form game. We can then legitimately ask whether taking the $\pi_i$ to be learning algorithms would result in convergence to a Nash payoff in the $\gamma\to 1$ (shallow time discount) limit.
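The time evolution rule above can be sketched as a particle approximation, in which the measure over histories is replaced by a finite list of sampled histories. This is an assumption made for illustration: with a finite list, agents can effectively meet more than once, which the large-population setting rules out. The two constant policies and all names are hypothetical.

```python
import random


def sample_next_history(population, policies, rng):
    """Draw one history from the next-round measure, given a sample of the current one."""
    # One independently sampled history per player role.
    histories = [rng.choice(population) for _ in policies]
    # Each role applies its policy to the history it has seen.
    outcome = tuple(policy(history, rng)
                    for policy, history in zip(policies, histories))
    # Append the joint outcome to one of the old histories (here the first).
    return histories[0] + (outcome,)


def evolve(population, policies, rng):
    # Resample the whole population to approximate the next-round measure.
    return [sample_next_history(population, policies, rng)
            for _ in range(len(population))]


# Two hypothetical constant policies, just to exercise the update rule.
def always_cooperate(history, rng):
    return "C"


def always_defect(history, rng):
    return "D"


rng = random.Random(0)
population = [()] * 100  # round 0: every agent has the empty history
for _ in range(5):
    population = evolve(population, [always_cooperate, always_defect], rng)
```

Replacing the constant policies with learning algorithms that read `history` gives a way to experiment numerically with the convergence question posed above.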

For example, consider the Prisoner’s dilemma. In the repeated Prisoner’s dilemma with shallow time discount, CC is an equilibrium because of the tit-for-tat policy. On the other hand, in the “population” (massively multi-player?) repeated Prisoner’s dilemma, DD is the only equilibrium. Tit-for-tat doesn’t work because a single “defect bot” can exploit a population of tit-for-tats: on each round it plays with a new opponent that doesn’t know the defect bot defected on the previous round.
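The contrast between the two settings can be checked numerically. The sketch below assumes the conventional payoff values 5, 3, 1, 0 and a discount of 0.99, neither of which comes from the text, and models the large-population limit by giving the mutant a fresh opponent every round whose own past opponents all cooperated.

```python
T, R, P, S = 5, 3, 1, 0  # assumed Prisoner's dilemma payoffs
PAYOFF = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}


def tit_for_tat(opponent_moves_seen):
    # Cooperate first, then copy the last move played against you.
    return opponent_moves_seen[-1] if opponent_moves_seen else "C"


def defect_bot(opponent_moves_seen):
    return "D"


def discounted_payoff(policy, gamma, rounds, fresh_opponent_each_round):
    """Normalized discounted payoff of `policy` against tit-for-tat."""
    my_seen, opp_seen, total = [], [], 0.0
    for n in range(rounds):
        if fresh_opponent_each_round:
            # A new opponent who has only ever met cooperators.
            opp_seen = ["C"] * n
        my_move = policy(my_seen)
        opp_move = tit_for_tat(opp_seen)
        total += (1 - gamma) * gamma ** n * PAYOFF[(my_move, opp_move)]
        my_seen.append(opp_move)
        if not fresh_opponent_each_round:
            opp_seen.append(my_move)  # the same opponent remembers us
    return total


gamma, rounds = 0.99, 5000
tft_repeated = discounted_payoff(tit_for_tat, gamma, rounds, False)
defect_repeated = discounted_payoff(defect_bot, gamma, rounds, False)
tft_population = discounted_payoff(tit_for_tat, gamma, rounds, True)
defect_population = discounted_payoff(defect_bot, gamma, rounds, True)
```

Against a single remembered opponent, defecting earns the temptation payoff once and the punishment payoff afterwards, which is worse than mutual cooperation. Against fresh opponents it earns the temptation payoff every round, so the tit-for-tat population is not an equilibrium.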

Note that we get a very different setting if we allow the players to see each other’s histories, more similar (equivalent?) to the regular repeated game. For example, in the Prisoner’s Dilemma we have a version of tit-for-tat that responds to what its current opponent played in its previous round (against a different opponent). This may be regarded as a confirmation of the idea that agents that know each other’s source code are effectively playing a repeated game: in this setting, knowing the source code amounts to knowing the history.

We can modify the population game setting to study superrationality. In order to do this, we can allow the agents to see a fixed size finite portion of their opponents’ histories. This should lead to superrationality for the same reasons I discussed before. More generally, we can probably allow each agent to submit a finite state automaton of limited size, s.t. the opponent history is processed by the automaton and the result becomes known to the agent.

What is unclear about this is how to define an analogous setting based on source code introspection. While arguably seeing the entire history is equivalent to seeing the entire source code, seeing part of the history, or processing the history through a finite state automaton, might be equivalent to some limited access to source code, but I don’t know how to define this limitation.

EDIT: Actually, the obvious analogue is processing the source code through a finite state automaton.

Instead of postulating access to a portion of the history or some kind of limited access to the opponent’s source code, we can consider agents with full access to history / source code but finite memory. The problem is, an agent with fixed memory size usually cannot have regret going to zero, since it cannot store probabilities with arbitrary precision. However, it seems plausible that we can usually get learning with memory of size $O(\log\frac{1}{1-\gamma})$. This is because something like “counting pieces of evidence” should be sufficient. For example, if we consider finite MDPs, then it is enough to remember how many transitions of each type occurred to encode the belief state. The question is whether assuming $O(\log\frac{1}{1-\gamma})$ memory (or whatever is needed for learning) is enough to reach superrationality.
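As a rough illustration of the counting argument, the sketch below estimates how many bits are needed to store one transition counter per (state, action, next state) triple in a finite MDP. It assumes each counter only needs to count up to the effective horizon 1/(1−γ); that cap is a simplification chosen for this example, not a regret bound.

```python
import math


def counter_memory_bits(n_states: int, n_actions: int, gamma: float) -> int:
    """Bits needed to count transitions of each type up to the effective horizon."""
    horizon = math.ceil(1.0 / (1.0 - gamma))
    bits_per_counter = math.ceil(math.log2(horizon + 1))
    n_counters = n_states * n_actions * n_states
    return n_counters * bits_per_counter


# Memory grows only logarithmically as the time discount becomes shallower.
memory_by_gamma = {g: counter_memory_bits(3, 2, g)
                   for g in (0.9, 0.99, 0.999, 0.9999)}
```

For a fixed MDP size, each tenfold increase of the horizon adds only three or four bits per counter.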

What do you mean by equivalent? The entire history doesn’t say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.

For a fixed policy, the history is the only thing you need to know in order to simulate the agent on a given round. In this sense, seeing the history is equivalent to seeing the source code.

The claim is: In settings where the agent has unlimited memory and sees the entire history or source code, you can’t get good guarantees (as in the folk theorem for repeated games). On the other hand, in settings where the agent sees part of the history, or is constrained to have finite memory (possibly of size $O(\log\frac{1}{1-\gamma})$?), you can (maybe?) prove convergence to Pareto efficient outcomes or some other strong desideratum that deserves to be called “superrationality”.

In the previous “population game” setting, we assumed all players are “born” at the same time and learn synchronously, so that they always play against players of the same “age” (history length). Instead, we can consider a “mortal population game” setting where each player has a probability 1−γ to die on every round, and new players are born to replenish the dead. So, if the size of the population is N (we always consider the “thermodynamic” N→∞ limit), N(1−γ) players die and the same number of players are born on every round. Each player’s utility function is a simple sum of rewards over time, so, taking mortality into account, effectively ey have geometric time discount. (We could use age-dependent mortality rates to get different discount shapes, or allow each type of player to have different mortality=discount rate.) Crucially, we group the players into games randomly, independent of age.

As before, each player type $i$ chooses a policy $\pi_i$. (We can also consider the case where players of the same type may have different policies, but let’s keep it simple for now.) In the thermodynamic limit, the population is described as a distribution over histories, which now are allowed to be of variable length: $\mu_n\in\Delta O^*$. For each assignment of policies to player types, we get dynamics $\mu_{n+1}=T_\pi(\mu_n)$ where $T_\pi:\Delta O^*\to\Delta O^*$. So, as opposed to immortal population games, mortal population games naturally give rise to dynamical systems.

If we consider only the age distribution, then its evolution doesn’t depend on $\pi$ and it always converges to the unique fixed point distribution $\zeta(k)=(1-\gamma)\gamma^k$. Therefore it is natural to restrict the dynamics to the subspace of $\Delta O^*$ that corresponds to the age distribution $\zeta$. We denote it $P$.
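The convergence of the age distribution is easy to verify numerically. The sketch below iterates the deterministic (thermodynamic-limit) update, in which a fraction 1−γ of every age group dies each round and is replaced by newborns of age 0, starting from a population that is entirely newborn. The value γ = 0.9 and the number of rounds are arbitrary choices for the example.

```python
def step_age_distribution(dist, gamma):
    """One round of mortality: dist[k] is the fraction of players of age k."""
    new = [0.0] * (len(dist) + 1)
    new[0] = 1.0 - gamma           # newborns replace the dead
    for k, mass in enumerate(dist):
        new[k + 1] = gamma * mass  # survivors age by one round
    return new


gamma = 0.9
dist = [1.0]  # everybody starts at age 0
for _ in range(200):
    dist = step_age_distribution(dist, gamma)

# Compare with the claimed fixed point (1 - gamma) * gamma**k.
geometric = [(1 - gamma) * gamma ** k for k in range(len(dist))]
total_variation = 0.5 * sum(abs(a - b) for a, b in zip(dist, geometric))
```

After n rounds the distribution already agrees with the geometric one on all ages below n, and the remaining mass of the original cohort decays like γ to the power n.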

Do the dynamics have fixed points? $O^*$ can be regarded as a subspace of $(O\sqcup\{\bot\})^\omega$. The latter is compact (in the product topology) by Tychonoff’s theorem and Polish, but $O^*$ is not closed. So, w.r.t. the weak topology on probability measure spaces, $\Delta(O\sqcup\{\bot\})^\omega$ is also compact but $\Delta O^*$ isn’t. However, it is easy to see that $P$ is closed in $\Delta(O\sqcup\{\bot\})^\omega$ and therefore compact. It may also be regarded as a convex subset of an appropriate Banach space (the dual of the space of Lipschitz functions on some metrization of $(O\sqcup\{\bot\})^\omega$). Moreover, it is easy to see $T_\pi$ is continuous (for populations that are close in the Kantorovich-Rubinstein metric, only the old players may have very different distributions, but old players are a small fraction of the population so their effect on the next round is small). By the Schauder fixed-point theorem, it follows that $T_\pi$ has a fixed point.

What are the fixed points like? Of course it depends on π. In a fixed point, every player observes a sequence of IID plays in all of eir games. Therefore, if π satisfies the (very mild!) learning-theoretic desideratum that, upon observing an IID sequence, it converges to optimal response in the γ→1 limit, then, in the same limit, fixed points are Nash equilibria. This works even for extremely simple learning algorithms, such as “assume the plays in the next game will be sampled from a random past game”, and it works for any Bayesian or “quasi-Bayesian” (i.e. using incomplete/fuzzy models) agent that includes all IID processes in its prior.

This raises a range of interesting questions:

Are any/all of the fixed points attractors?

Does convergence to a fixed point occur for all or at least almost all initial conditions?

Do all Nash equilibria correspond to fixed points?

Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?

Mortal population games are obviously reminiscent of evolutionary game theory. However, there are substantial differences. In mortal population games, the game doesn’t have to be symmetric, we consider a single policy rather than many competing policies, the policies learn from experience instead of corresponding to fixed strategies, and mortality rate doesn’t depend on the reward. In evolutionary game theory, convergence usually cannot be guaranteed. For example, in the rock-paper-scissors game, the population may cycle among the different strategies. On the other hand, in mortal population games, if the game is two-player zero-sum (which includes rock-paper-scissors), and the policy is quasi-Bayesian with appropriate prior, convergence is guaranteed. This is because each player can easily learn to guarantee maximin payoff. Continuity arguments probably imply that at least for small perturbations of zero-sum, there will still be convergence. This leads to some hope that convergence can be guaranteed even in general games, or at least under some relatively mild conditions.
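To make the setting concrete, here is a toy simulation of a mortal population playing rock-paper-scissors. It is a sketch under assumptions of mine: a finite population instead of the N→∞ limit, a single player type, and the simple learner that best-responds to the empirical distribution of the opponent plays it has seen so far (one reading of “assume the plays in the next game will be sampled from a random past game”). It is not a quasi-Bayesian policy and it makes no claim about convergence; it only shows the bookkeeping of death, rebirth and age-independent matching.

```python
import random

ACTIONS = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}   # key beats value
LOSES_TO = {loser: winner for winner, loser in BEATS.items()}        # key loses to value


def best_response(counts, rng):
    """Best-respond to the empirical distribution of observed opponent plays."""
    if sum(counts.values()) == 0:
        return rng.choice(ACTIONS)      # a newborn has no history yet

    def score(action):
        # (wins - losses) against the plays seen so far
        return counts[BEATS[action]] - counts[LOSES_TO[action]]

    top = max(score(a) for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if score(a) == top])


def simulate(n_players=1000, gamma=0.99, rounds=300, seed=0):
    """Return the per-round action frequencies of the whole population."""
    rng = random.Random(seed)
    players = [dict.fromkeys(ACTIONS, 0) for _ in range(n_players)]
    frequencies = []
    for _ in range(rounds):
        # Death and rebirth: with probability 1 - gamma a player's history is wiped.
        for i in range(n_players):
            if rng.random() > gamma:
                players[i] = dict.fromkeys(ACTIONS, 0)
        # Random matching, independent of age.
        order = list(range(n_players))
        rng.shuffle(order)
        plays = [best_response(players[i], rng) for i in range(n_players)]
        for a, b in zip(order[0::2], order[1::2]):
            players[a][plays[b]] += 1
            players[b][plays[a]] += 1
        frequencies.append({a: plays.count(a) / n_players for a in ACTIONS})
    return frequencies


history = simulate(n_players=200, gamma=0.95, rounds=100, seed=1)
```

Plotting `history` for different values of γ is a cheap way to get intuition for the questions about attractors above, although a finite simulation cannot settle them.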

I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:

Studying the space of simple mathematical laws which produce counterfactual universes with intelligent life.

Studying the distribution over utility-function-space (and, more generally, mindspace) of those counterfactual minds.

Studying the distribution of the amount of resources available to the counterfactual civilizations, and broad features of their development trajectories.

Using all of the above to produce a distribution over concretized simulation hypotheses.

This concept is of potential interest for several reasons:

It can be beneficial to actually research metacosmology, in order to draw practical conclusions. However, knowledge of metacosmology can pose an infohazard, and we would need to precommit not to accept blackmail from potential simulators.

The metacosmology knowledge of a superintelligent AI determines the extent to which it poses risk via the influence of potential simulators.

In principle, we might be able to use knowledge of metacosmology in order to engineer an “atheist prior” for the AI that would exclude simulation hypotheses. However, this might be very difficult in practice.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, “death” can be understood to include modification, since modification is effectively destroying an agent and replacing it with a different agent^{[1]}. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent’s reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don’t give the right object, since the reward is still tied to the agent’s actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology of the external world. Formally, such an ontology can be an incomplete^{[2]} Markov chain, the reward function being a function of the state. Examples:

The Markov chain is a representation of known physics (or some sector of known physics). The reward corresponds to the total mass of diamond in the world. To make this example work, we only need enough physics to be able to define diamonds. For example, we can make do with quantum electrodynamics + classical gravity and have the Knightian uncertainty account for all nuclear and high-energy phenomena.

The Markov chain is a representation of people and social interactions. The reward corresponds to concepts like “happiness” or “friendship” et cetera. Everything that falls outside the domain of human interactions is accounted for by Knightian uncertainty.

The Markov chain is Botworld with some of the rules left unspecified. The reward is the total number of a particular type of item.

Now we need to somehow connect the agent to the ontology. Essentially we need a way of drawing Cartesian boundaries inside the (a priori non-Cartesian) world. We can accomplish this by specifying a function that assigns an observation and projected action to every state out of some subset of states. Entering this subset corresponds to agent creation, and leaving it corresponds to agent destruction. For example, we can take the ontology to be Botworld + marked robot and the observations and actions be the observations and actions of that robot. If we don’t want marking a particular robot as part of the ontology, we can use a more complicated definition of Cartesian boundary that specifies a set of agents at each state plus the data needed to track these agents across time (in this case, the observation and action depend to some extent on the history and not only the current state). I will leave out the details for now.

Finally, we need to define the prior. To do this, we start by choosing some prior over refinements of the ontology. By “refinement”, I mean removing part of the Knightian uncertainty, i.e. considering incomplete hypotheses which are subsets of the “ontological belief”. For example, if the ontology is underspecified Botworld, the hypotheses will specify some of what was left underspecified. Given such an “objective” prior and a Cartesian boundary, we can construct a “subjective” prior for the corresponding agent. We transform each hypothesis via postulating that taking an action that differs from the projected action leads to a “Nirvana” state. Alternatively, we can allow for stochastic action selection and use the gambler construction.

Does this framework guarantee effective planning for death? A positive answer would correspond to some kind of learnability result (regret bound). To get learnability, we will first need the reward to be either directly or indirectly observable. By “indirectly observable” I mean something like with semi-instrumental reward functions, but accounting for agent mortality. I am not ready to formulate the precise condition atm. Second, we need to consider an asymptotic in which the agent is long lived (in addition to time discount being long-term), otherwise it won’t have enough time to learn. Third (this is the trickiest part), we need the Cartesian boundary to flow with the asymptotic as well, making the agent “unspecial”. For example, consider Botworld with some kind of simplicity prior. If I am a robot born at cell zero and time zero, then my death is an event of low description complexity. It is impossible to be confident about what happens after such a simple event, since there will always be competing hypotheses with different predictions and a probability that is only lower by a factor of Ω(1). On the other hand, if I am a robot born at cell 2439495 at time 9653302 then it would be surprising if the outcome of my death would be qualitatively different from the outcome of the death of any other robot I observed. Finding some natural, rigorous and general way to formalize this condition is a very interesting problem. Of course, even without learnability we can strive for Bayes-optimality or some approximation thereof. But, it is still important to prove learnability under certain conditions to test that this framework truly models rational reasoning about death.

Additionally, there is an intriguing connection between some of these ideas and UDT, if we consider TRL agents. Specifically, a TRL agent can have a reward function that is defined in terms of computations, exactly like UDT is often conceived. For example, we can consider an agent whose reward is defined in terms of a simulation of Botworld, or in terms of taking expected value over a simplicity prior over many versions of Botworld. Such an agent would be searching for copies of itself inside the computations it cares about, which may also be regarded as a form of “embeddedness”. It seems like this can be naturally considered a special case of the previous construction, if we allow the “ontological belief” to include beliefs pertaining to computations.

Unless it’s some kind of modification that we treat explicitly in our model of the agent, for example a TRL agent reprogramming its own envelope. ↩︎

“Incomplete” in the sense of Knightian uncertainty, like in quasi-Bayesian RL. ↩︎

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely “behaviorist” view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions A, a set of observations O, a policy is a mapping π:(A×O)∗→ΔA, a reward function is a mapping r:(A×O)∗→[0,1], the utility function is a time discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining “goal-directed intelligence” is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are “contrived”. However, description complexity is only naturally well-defined up to an additive constant. So, if we want to have a crisp concept, we need to consider an asymptotic in which the complexity of something goes to infinity. Indeed, it seems natural to ask that the complexity of the policy should be much higher than the complexity of the prior and the reward function: in this case we can say that the “intentional stance” is an efficient description. However, this doesn’t make sense with description complexity: the description “optimal policy for U and ζ” is of size K(U)+K(ζ)+O(1) (K(x) stands for “description complexity of x”).

To salvage this idea, we need to take not only description complexity but also computational complexity into account. [EDIT: I was wrong, and we can get a well-defined concept in the unbounded setting too, see child comment. The bounded concept is still interesting.] For the intentional stance to be non-vacuous we need to demand that the policy does some “hard work” in order to be optimal. Let’s make it formal. Consider any function of the type f:Σ∗→ΔΞ where Σ and Ξ are some finite alphabets. Then, we can try to represent it by a probabilistic automaton T:S×Σ→Δ(S×Ξ), where S is the finite state space, T is the transition kernel, and we’re feeding symbols into the automaton one by one. Moreover, T can be represented as a boolean circuit R and this circuit can be the output of some program P executed by some fixed universal Turing machine. We can associate with this object 5 complexity parameters:

The description complexity, which is the length of P.

The computation time complexity, which is the size of R.

The computation space complexity, which is the maximum between the depth of R and log|S|.

The precomputation time complexity, which is the time it takes P to run.

The precomputation space complexity, which is the space P needs to run.

It is then natural to form a single complexity measure by applying a logarithm to the times and taking a linear combination of all 5 (we apply a logarithm so that a brute force search over n bits is roughly equivalent to hard-coding n bits). The coefficients in this combination represent the “prices” of the various resources (but we should probably fix the price of description complexity to be 1). Of course not all coefficients must be non-vanishing, it’s just that I prefer to keep maximal generality for now. We will denote this complexity measure C.
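As a sanity check on the “brute force over n bits ≈ hard-coding n bits” remark, here is a minimal sketch of C. The assumptions are mine: base-2 logarithms, the logarithm applied only to the two time parameters, the price of description complexity fixed to 1, and illustrative field names. None of this is meant as the definitive form of the measure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    description_length: float   # length of the program P, in bits
    circuit_size: float         # size of the circuit R (computation time)
    circuit_space: float        # max(depth of R, log|S|) (computation space)
    precompute_time: float      # number of steps P takes to run
    precompute_space: float     # space P needs to run


@dataclass(frozen=True)
class Prices:
    time: float = 1.0
    space: float = 1.0
    precompute_time: float = 1.0
    precompute_space: float = 1.0


def complexity(res, prices=Prices()):
    """Linear combination of the 5 parameters, with a logarithm on the times."""
    return (res.description_length                      # price fixed to 1
            + prices.time * math.log2(res.circuit_size)
            + prices.space * res.circuit_space
            + prices.precompute_time * math.log2(res.precompute_time)
            + prices.precompute_space * res.precompute_space)


# Two hypothetical ways of obtaining the same circuit:
# hard-code 20 extra bits, or search over 2**20 candidates at precomputation time.
hard_coded = Resources(120, 64, 8, 1024, 16)
brute_force = Resources(100, 64, 8, 1024 * 2 ** 20, 16)

c_hard_coded = complexity(hard_coded)
c_brute_force = complexity(brute_force)
```

With all prices equal to 1 the two options cost the same, which is the intended effect of the logarithm. Raising `Prices.precompute_time` above 1 makes hard-coding the cheaper option.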

We can use such automatons to represent policies, finite POMDP environments and reward functions (ofc not any policy or reward function, but any that can be computed on a machine with finite space). In the case of policies, the computation time/space complexity can be regarded as the time/space cost of applying the “trained” algorithm, whereas the precomputation time/space complexity can be regarded as the time/space cost of training. If we wish, we can also think of the boolean circuit as a recurrent neural network.

We can also use C to define a prior ζ0, by ranging over programs P that output a valid POMDP and assigning probability proportional to 2^{−C} to each instance. (Assuming that the environment has a finite state space might seem restrictive, but becomes quite reasonable if we use a quasi-Bayesian setting with quasi-POMDPs that are not meant to be complete descriptions of the environment; for now we won’t go into details about this.)
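For illustration, here is the same construction restricted to a finite list of candidate environments. This is only a sketch: the real prior ranges over all programs that output a valid POMDP, and the environment names and complexity values below are made up.

```python
def prior_from_complexities(complexities):
    """Assign each hypothesis a probability proportional to 2**(-C)."""
    weights = {name: 2.0 ** (-c) for name, c in complexities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical values of C for three candidate environments.
prior = prior_from_complexities({"env_a": 10.0, "env_b": 12.0, "env_c": 15.0})
```

A difference of k in complexity translates into a factor of 2^k in prior probability, regardless of the normalization.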

Now, return to our policy π. Given g>0, we define that “π has goal-directed intelligence (at least) g” when there is a suitable prior ζ and utility function U s.t. for any policy π′, if Eζπ′[U]≥Eζπ[U] then C(π′)≥DKL(ζ0||ζ)+C(U)+g. When g=+∞ (i.e. no finite automaton can match the expected utility of π; in particular, this implies π is optimal since any policy can be approximated by a finite automaton), we say that π is “perfectly goal-directed”. Here, DKL(ζ0||ζ) serves as a way to measure the complexity of ζ, which also ensures ζ is non-dogmatic in some rather strong sense.
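For readability, the definition above can be typeset as follows. The placement of sub- and superscripts is my reading of the flattened notation: E with ζ and π attached denotes the expectation when policy π interacts with an environment sampled from ζ.

```latex
\pi \text{ has goal-directed intelligence at least } g
\;\iff\;
\exists\, \zeta, U \;\; \forall \pi' :\quad
\mathbb{E}_{\zeta}^{\pi'}[U] \ge \mathbb{E}_{\zeta}^{\pi}[U]
\;\Longrightarrow\;
C(\pi') \ge D_{\mathrm{KL}}(\zeta_0 \,\|\, \zeta) + C(U) + g
```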

[EDIT: if we fix U and ζ then g is essentially the same as Yudkowsky’s definition of optimization power if we regard the policy as the “outcome” and use 2^{−C} as our measure on the space of outcomes.]

With this definition we cannot “cheat” by encoding the policy into the prior or into the utility function, since that would allow no complexity difference. Therefore this notion seems like a non-trivial requirement on the policy. On the other hand, this requirement does hold sometimes, because solving the optimization problem can be much more computationally costly than just evaluating the utility function or sampling the prior.

Actually, as opposed to what I claimed before, we don’t need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself.

Given g>0, we define that “π has (unbounded) goal-directed intelligence (at least) g” when there is a prior ζ and utility function U s.t. for any policy π′, if Eζπ′[U]≥Eζπ[U] then K(π′)≥DKL(ζ0||ζ)+K(U)+g. Here, ζ0 is the Solomonoff prior and K is Kolmogorov complexity. When g=+∞ (i.e. no computable policy can match the expected utility of π; in particular, this implies π is optimal since any policy can be approximated by a computable policy), we say that π is “perfectly (unbounded) goal-directed”.

Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which is the maximum of the LH measure) becomes computable or even really stupid. For example, it can always keep taking the same action because of the fear that taking any other action leads to an inescapable “hell” state. On the other hand, goal-directed intelligence differs only by O(1) between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and the notion of which policies are such doesn’t depend on the UTM at all.

I think that it’s also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any ϵ>0 there is g s.t. the probability to get intelligence at least g is smaller than ϵ.

Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by O(1), and a policy that is perfectly goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for any price vector. Informally speaking, an agent that is very smart relative to a context with cheap computational resources is still very smart relative to a context where they are expensive, which makes intuitive sense.

If we choose just one computational resource, we can speak of the minimal price for which a given policy is perfectly goal-directed, which is another way to measure intelligence with a more restricted domain. Curiously, our bounded Solomonoff-like prior has the shape of a Maxwell-Boltzmann distribution in which the prices are thermodynamic parameters. Perhaps we can regard the minimal price as the point of a phase transition.

Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong’s thesis about the impossibility to deduce preferences from behavior alone.

1. Conjecture. Informally: It is unlikely to produce intelligence by chance. Formally: Denote Π the space of deterministic policies, and consider some μ∈ΔΠ. Suppose μ is equivalent to a stochastic policy π∗. Then, Eπ∼μ[g(π)]=O(C(π∗)).

2. Find an “intelligence hierarchy theorem”. That is, find an increasing sequence {gn} s.t. for every n, there is a policy with goal-directed intelligence in (gn,gn+1) (no more and no less).

3. What is the computational complexity of evaluating g given (i) oracle access to the policy or (ii) description of the policy as a program or automaton?

4. What is the computational complexity of producing a policy with given g?

5. Conjecture. Informally: Intelligent agents have well defined priors and utility functions. Formally: For every (U,ζ) with C(U)<∞ and DKL(ζ0||ζ)<∞, and every ϵ>0, there exists g∈(0,∞) s.t. for every policy π with intelligence at least g w.r.t. (U,ζ), and every (~U,~ζ) s.t. π has intelligence at least g w.r.t. them, any optimal policies π∗,~π∗ for (U,ζ) and (~U,~ζ) respectively satisfy Eζ~π∗[U]≥Eζπ∗[U]−ϵ.

re: #5, that doesn’t seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn’t some U1≠U2 such that π∗(U1,ζ)=π∗(U2,ζ).

(And as pointed out elsewhere, it isn’t Stuart’s thesis, it’s a well known and basic result in the decision theory / economics / philosophy literature.)

re: #5, that doesn’t seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming.

You misunderstand the intent. We’re talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown U, but producing some behavior that optimizes the unknown U. Ofc if the policy you’re observing is optimal then it’s trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like “the policy you’re observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity.”

(Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.)

(And as pointed out elsewhere, it isn’t Stuart’s thesis, it’s a well known and basic result in the decision theory / economics / philosophy literature.)

I am referring to this and related work by Armstrong.

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know of is Shalizi 2009; however, it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a “quasi-realizable” setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. But, so far I haven’t studied quasi-Bayesian learning algorithms much, so how do we know it is actually easier than the agnostic setting? Here is a simple example to demonstrate that it is.

Consider a multi-armed bandit, where the arm space is [0,1]. First, consider the following realizable setting: the reward is a deterministic function r:[0,1]→[0,1] which is known to be a polynomial of degree d at most. In this setting, learning is fairly easy: it is enough to sample d+1 arms in order to recover the reward function and find the optimal arm. It is a special case of the general observation that learning is tractable when the hypothesis space is low-dimensional in the appropriate sense.
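The realizable case can be sketched in a few lines: pull d+1 distinct arms, interpolate, and maximize the interpolant. The assumptions here are mine: equally spaced sample arms, Lagrange interpolation, and a grid search standing in for exact maximization. The reward function in the example is hypothetical.

```python
def lagrange_interpolate(points):
    """Return the unique polynomial of degree < len(points) through the (x, y) points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly


def find_best_arm(pull, degree, grid_size=1001):
    """Sample degree + 1 arms, recover the reward polynomial, return its grid argmax."""
    arms = [k / degree for k in range(degree + 1)] if degree > 0 else [0.0]
    samples = [(arm, pull(arm)) for arm in arms]
    model = lagrange_interpolate(samples)
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return max(grid, key=model)


def reward(x):
    # A hypothetical degree-2 reward with its maximum at x = 0.3.
    return 1.0 - (x - 0.3) ** 2


best_arm = find_best_arm(reward, degree=2)
```

Only three pulls are used here, no matter how fine the grid is, because the grid search is performed on the recovered model rather than on the bandit.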

Now, consider a closely related agnostic setting. We can still assume the reward function is deterministic, but nothing is known about its shape and we are still expected to find the optimal arm. The arms form a low-dimensional space (one-dimensional actually) but this helps little. It is impossible to predict anything about any arm except those we already tested, and guaranteeing convergence to the optimal arm is therefore also impossible.

Finally, consider the following quasi-realizable setting: each incomplete hypothesis in our class states that the reward function is lower-bounded by a particular polynomial f:[0,1]→[0,1] of degree d at most. Our algorithm needs to converge to a reward which is at least the maximum of maxima of correct lower bounds. So, the desideratum is weaker than in the agnostic case, but we still impose no hard constraint on the reward function. In this setting, we can use the following algorithm. On each step, fit the most optimistic lower bound to those arms that were already sampled, find its maximum and sample this arm next. I haven’t derived the convergence rate, but it seems probable the algorithm will converge rapidly (for low d). This is likely to be a special case of some general result on quasi-Bayesian learning with low-dimensional priors.
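Here is a toy version of the optimistic algorithm. It is a sketch that assumes a finite class of candidate lower bounds and a finite grid of arms instead of all polynomials of degree at most d on [0,1]. Everything in it (names, hypotheses, reward function) is hypothetical. In this simplified form the argument for convergence is short: every step either falsifies the current most optimistic hypothesis, or yields a reward at least as large as its maximum, which in turn is at least the maximum of every correct lower bound.

```python
import math


def make_grid(n=201):
    return [k / (n - 1) for k in range(n)]


def optimistic_lower_bound_bandit(pull, hypotheses, steps, arms=None):
    """hypotheses: callables f, each claiming reward(x) >= f(x) for every arm x."""
    arms = arms or make_grid()
    alive = list(hypotheses)
    rewards = []
    for _ in range(steps):
        if not alive:
            break
        # Most optimistic surviving hypothesis: the one with the largest maximum.
        best = max(alive, key=lambda f: max(f(a) for a in arms))
        arm = max(arms, key=best)
        observed = pull(arm)
        rewards.append(observed)
        # Drop every hypothesis whose claimed lower bound was violated at this arm.
        alive = [f for f in alive if f(arm) <= observed + 1e-12]
    return rewards, alive


def f_a(x):
    return 0.3 + 0.2 * x                    # correct lower bound, maximum 0.5


def f_b(x):
    return 0.7 - 1.5 * (x - 0.4) ** 2       # correct lower bound, maximum 0.7


def f_c(x):
    return 0.95 - (x - 0.9) ** 2            # incorrect: claims too much near x = 0.9


def reward(x):
    # Not a polynomial, so no hypothesis describes it exactly,
    # but f_a and f_b are valid lower bounds by construction.
    return max(f_a(x), f_b(x)) + 0.05 * abs(math.sin(40 * x))


rewards, surviving = optimistic_lower_bound_bandit(reward, [f_a, f_b, f_c], steps=10)
```

In this run the first pull falsifies `f_c`, after which the algorithm settles on the maximum of `f_b` and obtains a reward of at least 0.7, the maximum of maxima of the correct lower bounds.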

Much of the orthodox LessWrongian approach to rationality (as it is expounded in Yudkowsky’s Sequences and onwards) is grounded in Bayesian probability theory. However, I now realize that pure Bayesianism is wrong, instead the right thing is quasi-Bayesianism. This leads me to ask, what are the implications of quasi-Bayesianism on human rationality? What are the right replacements for (the Bayesian approach to) bets, calibration, proper scoring rules et cetera? Does quasi-Bayesianism clarify important confusing issues in regular Bayesianism such as the proper use of inside and outside view? Is there rigorous justification to the intuition that we should have more Knightian uncertainty about questions with less empirical evidence? Does any of it influence various effective altruism calculations in surprising ways? What common LessWrongian wisdom does it undermine, if any?

Probably not too original but I haven’t seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to infiltrate the present world. We can try to defend by adding a button for “malign AI is attacking”, but that still leaves us open to surprise takeovers in which there is no chance to press the button.

Amplifying by subjective time: The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a “groundhog day” scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than amplifying by objective time, because learning previous results is an overhead, and illegible intuitions might be hard to transmit. This is safer than amplifying by objective time, but if there is some probability of malign AI created in the short time period, there is still an attack vector. The malign AI leakage in this method is roughly proportional to subjective time of simulation times the present rate of malign AI takeover, as opposed to amplification by objective time where leakage is proportional to subjective time of simulation times some average future rate of malign AI takeover. However, by the time we are able to create this benign AI, the present rate of malign AI takeover might also be considerable.

Amplifying by probability: We allow the user(s) to choose “success” or “failure” (or some continuous metric) after completing their work, and make the AI skew the distribution of predictions toward success. This is similar to amplifying by subjective time without any transmission of information. It is weaker and about as safe. The potential advantage is, lower sample complexity: the AI only needs to have a reliable distribution of outcomes after the initial state instead of subsequent states.

Amplifying by parallelization: The AI is predicting the output of many copies of the user working together, by having strictly defined interfaces between the copies, over a time period similar to real time. For example, we can imagine a hierarchical organization where each person gives subtasks to their subordinates. We can then simulate such an organization with a copy of some subset of users in each role. To do this, the AI only needs to learn what a given subset of users would do given a particular task from their supervisors and particular results by their subordinates. This method is weaker than previous methods since it requires that the task at hand can be parallelized. But, it is also the safest since the rate of malign AI takeover is only amplified by O(1) compared to the background. [EDIT: Actually, it’s not safer than subjective time because the AI would sample the external world independently for each node in the organization. To avoid this, we would need to somehow define a correspondence between the outcome sets of worlds in which the user was queried at different nodes, and I don’t know how to do this.]

A complete solution can try to combine all of those methods, by simulating a virtual organization where the members can control which method is applied at every point. This way they can strive for the optimal risk-performance balance: parallelize everything that can be parallelized and amplify by other methods the tasks that cannot be parallelized, change the subjective/objective time balance based on research into malign AI timelines etc.

More observations about this attack vector (“attack from counterfactuals”). I focus on “amplifying by subjective time”.

The harder the takeoff the more dangerous this attack vector: During every simulation cycle, ability to defend against simulated malign AI depends on the power of the defense system at the beginning of the cycle^{[1]}. On the other hand, the capability of the attacker depends on its power at the end of the cycle. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier.

Inner control of anchor makes system safer: Given a fixed objective time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective time anchor forward in time, in order to benefit from improvements in the defense system.

Additional information about the external world makes system safer: Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 63% (since 1−0.999^1000≈0.63). However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.
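The arithmetic in the last item can be made explicit. With independent cycles, the chance of surviving 1000 cycles at 0.1% risk each is 0.999^1000 ≈ 37%, so the failure probability is about 63%. To interpolate between the independent and the fully correlated case, the sketch below uses a two-world model that is my own illustrative assumption: with probability q the world is “dangerous” and every cycle is attacked independently with probability p/q, otherwise no attack ever succeeds. The marginal per-cycle risk is p in all cases.

```python
def failure_probability(p_cycle, cycles, q_dangerous=1.0):
    """Probability of at least one successful attack over the given number of cycles.

    q_dangerous = 1       -> cycles are independent (the AI knows nothing extra)
    q_dangerous = p_cycle -> fully correlated (the AI knows whether an attack succeeds)
    """
    if not p_cycle <= q_dangerous <= 1.0:
        raise ValueError("need p_cycle <= q_dangerous <= 1")
    per_cycle_in_dangerous_world = p_cycle / q_dangerous
    return q_dangerous * (1.0 - (1.0 - per_cycle_in_dangerous_world) ** cycles)


independent = failure_probability(0.001, 1000)                       # about 0.632
partially_informed = failure_probability(0.001, 1000, q_dangerous=0.01)
fully_informed = failure_probability(0.001, 1000, q_dangerous=0.001)  # 0.001
```

The failure probability never exceeds q, so the more sharply the AI can tell dangerous worlds from safe ones, the closer the total risk gets to the single-cycle risk.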

More precisely, it is somewhat better than this since, if we are at some late cycle among a sequence of cycles anchored to the same point in objective time, the simulated user can use knowledge generated in previous cycles to improve the defense system during the current cycle. So, early cycles contribute more to the risk. ↩︎

In Hanson’s futarchy, the utility function of the state is determined by voting but the actual policy is determined by a prediction market. But, voting incentivizes misrepresenting your values to get a larger share of the pie. So, shouldn’t it be something like the VCG mechanism instead?

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di—es—can-ic-ul-ar—es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a “subjective regret bound”: that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communication protocol between the AI and the user that will allow transmitting this knowledge state to the AI (including knowledge about the user’s values). Dialogic RL attacks this problem in the manner which seems the most straightforward and powerful: allowing the AI to ask the user questions in some highly expressive formal language, which we will denote F.

F allows making formal statements about a formal model M of the world, as seen from the AI’s perspective. M includes such elements as observations, actions, rewards and corruption. That is, M reflects (i) the dynamics of the environment (ii) the values of the user (iii) processes that either manipulate the user, or damage the ability to obtain reliable information from the user. Here, we can use different models of values: a traditional “perceptible” reward function, an instrumental reward function, a semi-instrumental reward function, dynamically-inconsistent rewards, rewards with Knightian uncertainty etc. Moreover, the setup is self-referential in the sense that M also reflects the question-answer interface and the user’s behavior.

A single question can consist, for example, of asking for the probability of some sentence in F or the expected value of some expression of numerical type in F. However, in order to address important features of the world, such questions have to be very complex. It is infeasible to demand that the user understands such complex formal questions unaided. Therefore, the AI always produces a formal question qF together with a natural language (N) annotation qN. This annotation has to explain the question in human understandable terms, and also convince the user that qN is indeed an accurate natural language rendering of qF. The user’s feedback then consists of (i) accepting/rejecting/grading the annotation (ii) answering the question if the annotation is correct and the user can produce the answer. Making this efficient requires a process of iteratively constructing a correspondence between N and F, i.e. effectively building a new shared language between the user and the AI. We can imagine concepts defined in F and explained in N that serve to define further, more complex, concepts, where at each stage the previous generation of concepts can be assumed given and mutually understandable. In addition to such intensional definitions we may also allow extensional definitions, as long as the generalization is assumed to be via some given function space that is relatively restricted (e.g. doesn’t admit subagents). There seem to be some strong connections between the subproblem of designing the annotation system and the field of transparency in AI.

The first major concern that arises at this point is that questions can serve as an attack vector. This is addressed by quantilization. The key assumption is: it requires much less optimization power to produce some useful question than to produce a malicious question. Under this assumption, the quantilization parameter can be chosen to make the question interface safe but still effective. Over time, the agent accumulates knowledge about corruption dynamics that allows it to steer even further away from malicious questions while making the choice of questions even more effective. For the attack vector of deceitful annotations, we can improve safety using the debate approach, i.e. having the agent produce additional natural language text that attempts to refute the validity of the annotation.
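To make the role of the quantilization parameter concrete, here is a minimal Monte Carlo sketch of a q-quantilizer: instead of picking the highest-scoring candidate, it draws candidates from a trusted base distribution and then picks uniformly among the top q fraction. The names `quantilize`, `sample_base` and `utility` are hypothetical, and the finite-sample approximation is an assumption of this sketch, not part of the scheme as described.

```python
import math
import random


def quantilize(sample_base, utility, q, n_samples=1000, rng=None):
    """Monte Carlo sketch of a q-quantilizer.

    sample_base(rng) draws one candidate (e.g. a question) from the base
    distribution; utility(candidate) scores it. We return a uniformly
    random candidate from the top q fraction of the drawn samples.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    rng = rng or random.Random()
    candidates = [sample_base(rng) for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top_k = max(1, math.ceil(q * n_samples))
    return rng.choice(candidates[:top_k])


# Toy usage: candidates are numbers in [0, 1), utility is the number itself.
choice = quantilize(lambda rng: rng.random(), utility=lambda x: x, q=0.1)
print(choice)
```

The design point is that an exact q-quantilizer assigns any event at most 1/q times its base probability. So if malicious questions make up a fraction ε of the base distribution, the chance of asking one is at most ε/q, while q < 1 still concentrates on useful questions. Smaller q means more optimization and a weaker safety bound.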

Of course, in addition to the question interface, the physical interface (direct interaction with environment) is also an attack vector (like in any RL system). There, safety is initially guaranteed by following a baseline policy (which can be something like “do nothing” or human imitation). Later, the agent starts deviating from the baseline policy while staying safe, by leveraging the knowledge it previously gained through both the question and the physical interface. Besides being safe, the algorithm also needs to be effective, and for this it has to (in particular) find the learning strategy that optimally combines gaining knowledge through the question interface and gaining knowledge through autonomous exploration.

Crucially, we want our assumptions about user competence to be weak. This means that the user can produce answers that are (i) incomplete (just refuse to answer) (ii) fickle (change eir answers) and (iii) inconsistent (contradictory answers). We address (i) by either assuming that the answerable questions are sufficient or requiring a weaker regret bound where the reference agent knows all obtainable answers rather than knowing the full knowledge state they refer to. We address (ii) by believing later answers over earlier answers and via the commitment mechanism (see below). We address (iii) by explaining the contradictions to the user and asking for their resolution. In addition to all that, the AI can predict the answers instead of actually asking (i.e. the user’s knowledge state is also assumed to be regular to some extent). The commitment mechanism is a key tool for achieving optimal trade-off between allowing the user more time to think vs. acting quickly enough to address external dangers, while keeping the user informed of the consequences. It works by having the AI ask the user for a committed answer at the point of time when the AI has to commit to an irreversible course of action (and not much earlier), based on this particular answer. The AI provides the user with a description of the Pareto frontier for the different possible answers. Moreover, even committed answers can be changed later. In this case, the algorithm cannot guarantee global optimality but it should still guarantee “subgame perfection”: i.e., whatever irreversible mistakes were made, the AI recovers as best as possible.

In order to improve the user’s ability to produce answers and stay informed about the AI’s strategy, we introduce another interface through which the user can ask the AI questions. In this direction, the user asks the question in N and the AI both translates it to F and provides a justification (as well as a refutation) of the translation, again employing quantilization. The user can then accept or reject the translation. If the translation is rejected, ey may require the AI to try again. If it is accepted, the AI produces an answer to the best of its ability. In general, it may be that producing an answer requires compiling a formal proof, in which case the AI may or may not succeed. Here, there is a concern about the scenario where (i) the user asks a question about a possible future (ii) the AI answers with a detailed description of the future (iii) as a result, the user is exposed to infohazards (sources of corruption) within this future. To address this, we need to either restrict the interface to small volumes of data, or shape the prior s.t. the AI can on its own realize the inherent danger. In the latter approach, the AI can refuse to answer or provide a “censored” answer while pointing out the presence of a potential infohazard.

Finally, the subjective regret bound approach might seem to impose a strong cap on capability: if the AI follows a policy optimal w.r.t. the user’s knowledge state, then the limited computational abilities of the user might prevent the AI from creating models of the world that are more detailed. This can be addressed in a Turing Reinforcement Learning setting, by allowing the user’s knowledge state to refer to arbitrary computations. The user can then be incapable of executing those computations, but the AI might be more able to do so.

A variant of Dialogic RL with improved corrigibility. Suppose that the AI’s prior allows a small probability for “universe W” whose semantics are, roughly speaking, “all my assumptions are wrong, need to shut down immediately”. In other words, this is a universe where all our prior shaping is replaced by the single axiom that shutting down is much higher utility than anything else. Moreover, we add into the prior the assumption that the formal question “W?” is understood perfectly by the user even without any annotation. This means that, whenever the AI assigns a higher-than-threshold probability to the user answering “yes” if asked “W?” at any uncorrupt point in the future, the AI will shut down immediately. We should also shape the prior s.t. corrupt futures also favor shutdown: this is reasonable in itself, but will also ensure that the AI won’t arrive at believing too many futures to be corrupt and thereby avoid the imperative to shut down in response to a confirmation of W.

Now, this won’t help if the user only resolves to confirm W after something catastrophic already occurred, such as the AI releasing malign subagents into the wild. But, something of the sort is true for any corrigibility scheme: corrigibility is about allowing the user to make changes in the AI on eir own initiative, which can always be too late. This method doesn’t ensure safety in itself, just hardens a system that is supposed to be already close to safe.

It would be nice if we could replace “shutdown” by “undo everything you did and then shutdown” but that gets us into thorny specification issues. Perhaps it’s possible to tackle those issues by one of the approaches to “low impact”.

Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases, leading to shutdown. This serves as an additional “sanity test” precaution.

This design is made of so many parts! It might benefit from a proof that it is the universal answer to a formal question, which I expect to seem less overly complex.

I am not sure. AI alignment seems to touch on many different aspects of the world, and it is not obvious that it can be reduced to assumptions that are extremely simple and natural. Or, if it can be reduced that way, then it might require a theory that on some level explains human civilization, its evolution and its influence on the world (even if only on a fairly abstract level). I will share some thoughts on how the various assumptions can be reduced another step back, but proceeding to reduce all of them to a simple core seems like a challenging research programme.

Most of the parts of this design can be regarded as reflecting particular assumptions we make about the user as an agent.

The core idea of having a dialogue comes from modeling the user as a “linguistic agent”. Such agents may be viewed as nodes in a distributed AI system, but where each node has different objectives. It is an interesting philosophical question whether this assumption is necessary for value learning. It currently seems plausible to me that only for linguistic agents “values” are truly well-defined, or at least sufficiently well-defined to extrapolate them outside the trajectory that the agent follows on its own.

The need to quantilize, debate and censor infohazards comes from the assumption that the user can be manipulated (there is some small fraction of possible inputs that invalidate the usual assumptions about the user’s behavior). Specifically, debate might be justified by some kind of Bayesian framework where every argument is a piece of evidence, and providing biased arguments is like providing selective evidence.

The need to deal with “incoherent” answers and the commitment mechanism comes from the assumption that the user has limited access to its own knowledge state (including its own reward function). Perhaps we can formalize it further by modeling the user as a learning algorithm with some intrinsic source of information. Perhaps we can even explain why such agents are natural in the “distributed AI” framework, or by some evolutionary argument.

The need to translate between formal language and natural language comes from not knowing the “communication protocol” of the “nodes”. Formalizing this idea further requires some more detailed model of what “natural language” is, which might be possible via multi-agent learning theory.

Finally, the need to start from a baseline policy (and also the need to quantilize) comes from the assumption that the environment is not entirely secure. So that’s an assumption about the current state of the world, rather than about the user. Perhaps we can make formal the argument that this state of the world (short-term stable, long-term dangerous) is to be expected when agents have populated it for a long time.

Learning theory starts from formulating natural desiderata for agents, whereas “logic-AI” usually starts from postulating a logic-based model of the agent ad hoc.

Learning theory naturally allows analyzing computational complexity whereas logic-AI often uses models that are either clearly intractable or even clearly incomputable from the outset.

Learning theory focuses on objects that are observable or finite/constructive, whereas logic-AI often considers objects that are unobservable, infinite and unconstructive (which I consider to be a philosophical error).

Learning theory emphasizes induction whereas logic-AI emphasizes deduction.

However, recently I noticed that quasi-Bayesian reinforcement learning and Turing reinforcement learning have very suggestive parallels to logic-AI. TRL agents have beliefs about computations they can run on the envelope: these are essentially beliefs about mathematical facts (but, we only consider computable facts and computational complexity plays some role there). QBRL agents reason in terms of hypotheses that have logical relationships between them: the order on functions corresponds to implication, taking the minimum of two functions corresponds to logical “and”, taking the concave hull of two functions corresponds to logical “or”. (but, there is no “not”, so maybe it’s a sort of intuitionist logic?) In fact, fuzzy beliefs form a continuous dcpo, and considering some reasonable classes of hypotheses probably leads to algebraic dcpo-s, suggesting a strong connection with domain theory (also, it seems like considering beliefs within different ontologies leads to a functor from some geometric category (the category of ontologies) to dcpo-s).

These parallels suggest that the learning theory of QBRL/TRL will involve some form of deductive reasoning and some type of logic. But, this doesn’t mean that QBRL/TRL is redundant w.r.t. logic AI! In fact, QBRL/TRL might lead us to discover exactly which type of logic intelligent agents need and what role logic should play in the theory and inside the algorithms (instead of trying to guess and impose the answer ad hoc, which IMO did not work very well so far). Moreover, I think that the type of logic we are going to get will be something finitist/constructivist, and in particular this is probably how Goedelian paradoxes will be avoided. However, the details remain to be seen.

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving “Omega” (something that predicts the agent’s decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent “Murphy” as in Murphy’s law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding maximin value in pure strategies. (The stochastic version can be regarded as a special case of the deterministic version where the agent has access to an external random number generator that is hidden from the rest of the environment according to the hypothesis.) To every decision problem, we can now associate an incomplete hypothesis as follows. Every time Omega makes a prediction about the agent’s future action in some counterfactual, we have Murphy make a guess instead. This guess cannot be directly observed by the agent. If the relevant counterfactual is realized, then the agent’s action renders the guess false or true. If the guess is false, the agent receives infinite (or, sufficiently large) reward. If the guess is true, everything proceeds as usual. The maximin value then corresponds to the scenario where the guess is true and the agent behaves as if its action controls the guess. (Which is exactly what FDT and its variants try to achieve.)

For example, consider (repeated) counterfactual mugging. The incomplete hypothesis is a partially observable stochastic game (between the agent and Murphy), with the following states:

s0: initial state. Murphy has two actions: g+ (guess the agent will pay), transitioning to s1+ and g− (guess the agent won’t pay) transitioning to s1−. (Reward = 0)

s1+: Murphy guessed the agent will pay. Transitions to s2a+ or s2b+ with probability 1/2 to each (the coin flip). (Reward = 0)

s1−: Murphy guessed the agent won’t pay. Transitions to s2a− or s2b− with probability 1/2 to each (the coin flip). (Reward = 0)

s2a+: Agent receives the prize. Transitions to s3u. (Reward = +1)

s2b+: Agent is asked for payment. Agent has two actions: p+ (pay) transitioning to s3r+ and p− (don’t pay) transitioning to s3w−. (Reward = 0)

s2a−: Agent receives nothing. Transitions to s3u. (Reward = 0)

s2b−: Agent is asked for payment. Agent has two actions: p+ (pay) transitioning to s3w+ and p− (don’t pay) transitioning to s3r−. (Reward = 0)

s3u: Murphy’s guess remained untested. Transitions to s0. (Reward = 0)

s3r+: Murphy’s guess was right, agent paid. Transitions to s0. (Reward = −0.1)

s3r−: Murphy’s guess was right, agent didn’t pay. Transitions to s0. (Reward = 0)

s3w+: Murphy’s guess was wrong, agent paid. Transitions to s0. (Reward = +1.9)

s3w−: Murphy’s guess was wrong, agent didn’t pay. Transitions to s0. (Reward = +2)

The only percepts the agent receives are (i) the reward and (ii) whether it is asked for payment or not. The agent’s maximin policy is paying, since it guarantees an expected reward of 1/2⋅1 + 1/2⋅(−0.1) = 0.45 per round.
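The maximin claim can be checked by enumerating the game directly. The sketch below encodes the rewards from the state list above and compares the two stationary deterministic policies (always pay, never pay) against Murphy’s two guesses. Restricting the agent to these two policies is a simplifying assumption of the sketch, and the function names are introduced here for illustration.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Non-zero-or-relevant rewards collected during one round, keyed by state.
REWARD = {
    "s2a+": Fraction(1),        # prize (Murphy guessed "pay", coin side a)
    "s2a-": Fraction(0),        # nothing (Murphy guessed "won't pay", side a)
    "s3r+": Fraction(-1, 10),   # guess right, agent paid
    "s3r-": Fraction(0),        # guess right, agent didn't pay
    "s3w+": Fraction(19, 10),   # guess wrong, agent paid
    "s3w-": Fraction(2),        # guess wrong, agent didn't pay
}


def round_value(agent_pays: bool, murphy_guesses_pay: bool) -> Fraction:
    """Expected reward of one round for a fixed agent policy and Murphy guess."""
    # Coin side a: no payment is requested, the guess stays untested.
    side_a = REWARD["s2a+"] if murphy_guesses_pay else REWARD["s2a-"]
    # Coin side b: the agent is asked for payment, which tests the guess.
    if murphy_guesses_pay:
        side_b = REWARD["s3r+"] if agent_pays else REWARD["s3w-"]
    else:
        side_b = REWARD["s3w+"] if agent_pays else REWARD["s3r-"]
    return HALF * side_a + HALF * side_b


def maximin():
    """Return (best policy, guaranteed value), with Murphy minimizing."""
    worst_case = {
        pays: min(round_value(pays, guess) for guess in (True, False))
        for pays in (True, False)
    }
    best = max(worst_case, key=worst_case.get)
    return best, worst_case[best]


print(maximin())
```

Paying is guaranteed 9/20 = 0.45 per round (Murphy guesses “pay”), whereas not paying is guaranteed only 0 (Murphy guesses “won’t pay”), so the maximin policy is to pay.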

We can generalize this to an imperfect predictor (a predictor that sometimes makes mistakes), by using the same construction but adding noise to Murphy’s guess for purposes other than the guess’s correctness. Apparently, we can also generalize to the variant where the agent can randomize against Omega and Omega decides based on its predictions of the probabilities. This, however, is more complicated. In this variant there is no binary notion of “right” and “wrong” guess. Instead, we need to apply some statistical test to the guesses and compare it against a threshold. We can then consider a family of hypotheses with different thresholds, such that (i) with probability 1, for all but some finite number of thresholds, accurate guesses would never be judged wrong by the test (ii) with probability 1, consistently inaccurate guesses will be judged wrong by the test, with any threshold.

The same construction applies to logical counterfactual mugging, because the agent cannot distinguish between random and pseudorandom (by definition of pseudorandom). In TRL there would also be some family of programs the agent could execute s.t., according to the hypothesis, their outputs are determined by the same “coin flips” as the offer to pay. However, this doesn’t change the optimal strategy: the “logical time of precommitment” is determined by the computing power of the “core” RL agent, without the computer “envelope”.

My takeaway from this is that if we’re doing policy selection in an environment that contains predictors, instead of applying the counterfactual belief that the predictor is always right, we can assume that we get rewarded if the predictor is wrong, and then take maximin.

How would you handle Agent Simulates Predictor? Is that what TRL is for?

That’s about right. The key point is, “applying the counterfactual belief that the predictor is always right” is not really well-defined (that’s why people have been struggling with TDT/UDT/FDT for so long) while the thing I’m doing is perfectly well-defined. I describe agents that are able to learn which predictors exist in their environment and respond rationally (“rationally” according to the FDT philosophy).

TRL is for many things to do with rational use of computational resources, such as (i) doing multi-level modelling in order to make optimal use of “thinking time” and “interacting with environment time” (i.e. simultaneously optimize sample and computational complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian daemons (iv) preventing thought crimes. But, yes, it also provides a solution to ASP. TRL agents can learn whether it’s better to be predictable or predicting.

“The key point is, “applying the counterfactual belief that the predictor is always right” is not really well-defined”—What do you mean here?

I’m curious whether you’re referring to the same as or similar to the issue I was referencing in Counterfactuals for Perfect Predictors. The TLDR is that I was worried that it would be inconsistent for an agent that never pays in Parfit’s Hitchhiker to end up in town if the predictor is perfect, so that it wouldn’t actually be well-defined what the predictor was predicting. And the way I ended up resolving this was by imagining it as an agent that takes input and asking what it would output if given that inconsistent input. But not sure if you were referencing this kind of concern or something else.

It is not a mere “concern”, it’s the crux of the problem really. What people in the AI alignment community have been trying to do is, starting with some factual and “objective” description of the universe (such as a program or a mathematical formula) and deriving counterfactuals. The way it’s supposed to work is, the agent needs to locate all copies of itself or things “logically correlated” with itself (whatever that means) in the program, and imagine it is controlling this part. But a rigorous definition of this that solves all standard decision theoretic scenarios was never found.

Instead of doing that, I suggest a solution of a different nature. In quasi-Bayesian RL, the agent never arrives at a factual and objective description of the universe. Instead, it arrives at a subjective description which already includes counterfactuals. I then proceed to show that, in Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the same expected utility promised by UDT).

Yeah, I agree that the objective descriptions can leave out vital information, such as how the information you know was acquired, which seems important for determining the counterfactuals.

But in Newcomb’s problem, the agent’s reward in case of wrong prediction is already defined. For example, if the agent one-boxes but the predictor predicted two-boxing, the reward should be zero. If you change that to +infinity, aren’t you open to the charge of formalizing the wrong problem?

The point is, if you put this “quasi-Bayesian” agent into an iterated Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward associated with FDT). So, if you’re judging it from the side, you will have to concede it behaves rationally, regardless of its internal representation of reality.

Philosophically, my point of view is, it is an error to think that counterfactuals have objective, observer-independent, meaning. Instead, we can talk about some sort of consistency conditions between the different points of view. From the agent’s point of view, it would reach Nirvana if it dodged the predictor. From Omega’s point of view, if Omega predicted two-boxing and the agent one-boxed, the agent’s reward would be zero (and the agent would learn its beliefs were wrong). From a third-person point of view, the counterfactual “Omega makes an error of prediction” is ill-defined, it’s conditioning on an event of probability 0.

Yeah, I think I can make peace with that. Another way to think of it is that we can keep the reward structure of the original Newcomb’s problem, but instead of saying “Omega is almost always right” we add another person Bob (maybe the mad scientist who built Omega) who’s willing to pay you a billion dollars if you prove Omega wrong. Then minimaxing indeed leads to one-boxing. Though I guess the remaining question is why minimaxing is the right thing to do. And if randomizing is allowed, the idea of Omega predicting how you’ll randomize seems a bit dodgy as well.

Another explanation why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs, the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. So general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form

$$\lim_{\gamma\to1} \mathrm{E}_{\mu\pi_\gamma}[U(\gamma)] \geq f(\mu)$$

Here, f doesn’t have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left hand side is linear in μ, so any π that satisfies this will also satisfy it for the concave hull of f.
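The linearity step can be spelled out as follows. This is only a sketch: it assumes the limit exists for each environment, considers finite mixtures, and writes $\hat f$ (notation introduced here) for the concave hull of $f$, ignoring any closure subtleties.

```latex
% Assume \mu = \sum_i p_i \mu_i with p_i \ge 0 and \sum_i p_i = 1,
% and that the desideratum holds for every \mu_i.
\lim_{\gamma\to1} \mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]
  = \sum_i p_i \lim_{\gamma\to1} \mathrm{E}_{\mu_i\pi_\gamma}[U(\gamma)]
  \geq \sum_i p_i f(\mu_i)

% Taking the supremum over all such decompositions of \mu:
\lim_{\gamma\to1} \mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]
  \geq \sup_{\mu=\sum_i p_i\mu_i} \sum_i p_i f(\mu_i)
  = \hat f(\mu)
```

So replacing $f$ by $\hat f$ does not change which policies satisfy the desideratum.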

What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is

$$\lim_{\gamma\to1} \mathrm{E}_{\mu\pi_\gamma}[U(\gamma)] \geq V(\mu,\gamma) - f(\mu)$$

But, it has the same form! Therefore we can consider it as a special case of applying maximin (more precisely, it requires allowing the fuzzy belief to depend on γ, but this is not a problem for the basics of the formalism).

What if we want our policy to be at least as good as some fixed policy π′₀? Then the desideratum is

$$\lim_{\gamma\to1} \mathrm{E}_{\mu\pi_\gamma}[U(\gamma)] \geq \mathrm{E}_{\mu\pi'_0}[U(\gamma)]$$

It still has the same form!

Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form:

$$\lim_{\gamma\to1} \mathrm{E}_{\mu\pi_\gamma}[U(\gamma)] \geq f(\pi,\mu)$$

To achieve this, we postulate a predictor that guesses the policy, producing the guess π̂, and define the fuzzy belief using the function E_{h∼μ}[f(π̂(h), μ)] (we assume the guess is not influenced by the agent’s actions so we don’t need π in the expected value). Using the Nirvana trick, we effectively force the guess to be accurate.

In particular, this captures self-referential desiderata of the type “the policy cannot be improved by changing it in this particular way”. These are of the form:

$$\lim_{\gamma\to1} \mathrm{E}_{\mu\pi_\gamma}[U(\gamma)] \geq \mathrm{E}_{\mu F(\pi)}[U(\gamma)]$$

It also allows us to effectively restrict the policy space (e.g. impose computational resource constraints) by setting f(π,μ) to 1 for policies outside the space.

The fact that quasi-Bayesian RL is so general can also be regarded as a drawback: the more general a framework, the less information it contains and the fewer useful constraints it imposes. But, my perspective is that QBRL is the correct starting point, after which we need to start proving results about which fuzzy hypotheses classes are learnable, and within what sample/computational complexity. So, although QBRL in itself doesn’t impose many restrictions on what the agent should be, it provides the natural language in which desiderata should be formulated. In addition, we can already guess/postulate that an ideal rational agent should be a QBRL agent whose fuzzy prior is universal in some appropriate sense.

Well, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasi-Bayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge at least to the maximin for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more “philosophical” defense of maximin is possible, analogous to VNM / complete class theorems, but I don’t know (I actually saw some papers in that vein but haven’t read them in detail.)

If the agent has random bits that Omega doesn’t see, and Omega is predicting the probabilities of the agent’s actions, then I think we can still solve it with quasi-Bayesian agents but it requires considering more complicated models and I haven’t worked out the details. Specifically, I think that we can define some function X that depends on the agent’s actions and Omega’s predictions so far (a measure of Omega’s apparent inaccuracy), s.t. if Omega is an accurate predictor, then the supremum of X over time is finite with probability 1. Then, we consider a family of models, where model number n says that X<n for all times. Since at least one of these models is true, the agent will learn it, and will converge to behaving appropriately.

EDIT 1: I think X should be something like, how much money would a gambler following a particular strategy win, betting against Omega.

EDIT 2: Here is the solution. In the case of original Newcomb, consider a gambler that bets against Omega on the agent one-boxing. Every time the agent two-boxes, the gambler loses 1 dollar. Every time the agent one-boxes, the gambler wins 1/p − 1 dollars, where p is the probability Omega assigned to one-boxing. Now it’s possible to see that one-boxing guarantees the “CC” payoff under the corresponding model (in the γ→1 limit): If the agent one-boxes, the gambler keeps winning unless Omega converges to one-boxing rapidly enough. In the case of a general Newcomb-like problem, just replace “one-boxes” by “follows the FDT strategy”.
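A small numerical sketch of the gambler from EDIT 2 may help. It tracks the gambler’s cumulative winnings (playing the role of X) for an agent that always one-boxes, under two hypothetical Omegas: one stuck at p = 1/2 and one whose prediction converges to one-boxing geometrically fast. The function name and both example prediction sequences are assumptions chosen for illustration.

```python
def gambler_winnings(predictions, actions):
    """Cumulative winnings of a gambler betting on one-boxing at Omega's odds.

    predictions: Omega's probabilities p of one-boxing, each in (0, 1].
    actions: True where the agent one-boxed, False where it two-boxed.
    The stakes are fair under Omega's own prediction:
    p * (1/p - 1) - (1 - p) * 1 = 0.
    """
    total = 0.0
    for p, one_boxed in zip(predictions, actions):
        total += (1.0 / p - 1.0) if one_boxed else -1.0
    return total


rounds = 100
always_one_box = [True] * rounds

# Omega never updates: the gambler wins 1 dollar every round, so X is unbounded.
stuck = gambler_winnings([0.5] * rounds, always_one_box)

# Omega converges quickly (p_t = 1 - 2**-t): the winnings stay bounded.
fast = gambler_winnings([1.0 - 2.0 ** -t for t in range(1, rounds + 1)], always_one_box)

print(stuck, fast)
```

Under the model “X < n for all times”, the first Omega is eventually ruled out, while the second is consistent with it. This matches the statement that the gambler keeps winning unless Omega converges to one-boxing rapidly enough.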

I agree that you can assign whatever belief you want (e.g. whatever is useful for the agent’s decision-making process) for what happens in the counterfactual when Omega is wrong, in decision problems where Omega is assumed to be a perfect predictor. However if you want to generalise to cases where Omega is an imperfect predictor (as you do mention), then I think you will (in general) have to put in the correct reward for Omega being wrong, because this is something that might actually be observed.

The method should work for imperfect predictors as well. In the simplest case, the agent can model the imperfect predictor as a perfect predictor + random noise. So, it definitely knows the correct reward for Omega being wrong. It still believes in Nirvana if “idealized Omega” is wrong.

Epistemic status: moderately confident, based on indirect evidence

I realized that it is very hard to impossible to publish an academic work that takes more than one conceptual inferential step away from the current paradigm. Especially when the inferential steps happen in different fields of knowledge.

You cannot publish a paper where you use computational learning theory to solve metaphysics, and then use the new metaphysics to solve the interpretation of quantum mechanics. A physics publication will not understand the first part, or even understand how it can be relevant. As a result, they will also fail to understand the second part. A computer science publication will not understand or be interested in the second part.

Publishing the two parts separately one after the other also won’t work. The first part might be accepted, but the reviewers of the second part won’t be familiar with it, and the same problems will resurface. The only way to win seems to be: publish the first part, wait until it becomes widely accepted, and only then publish the second part.

Hmm. I think I need more detail on your model of publishing and wide-acceptance and their relationship to truth. It seems likely that unless they’re circularly dependent, you can publish the smaller-departure in parallel with exploring the further implications in different journals, and in research agendas rather than results publication.

I find it interesting to build simple toy models of the human utility function. In particular, I was thinking about the aggregation of value associated with other people. In utilitarianism this question is known as “population ethics” and is infamously plagued with paradoxes. However, I believe that is the result of trying to be impartial. Humans are very partial and this allows coherent ways of aggregation. Here is my toy model:

Let Alice be our viewpoint human. Consider all social interactions Alice has, categorized by some types or properties, and assign a numerical weight to each type of interaction. Let $i_t(A,B) > 0$ be the weight of the interaction person $A$ had with person $B$ at time $t$ (if there was no interaction at this time then $i_t(A,B) = 0$). Then, we can define Alice’s affinity to Bob as

$$\mathrm{aff}_t(\mathrm{Alice},\mathrm{Bob}) := \sum_{s=-\infty}^{t} \alpha^{t-s}\, i_s(\mathrm{Alice},\mathrm{Bob})$$

Here $\alpha \in (0,1)$ is some constant. Ofc $\alpha^{t-s}$ can be replaced by many other functions.
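As a rough illustration of the affinity formula, here is a minimal Python sketch. It assumes a finite recorded history (earlier time steps count as weight 0) and uses the fact that the geometric sum satisfies the recurrence aff_t = α·aff_{t−1} + i_t. The function name `affinity` and the sample weights are made up for the example.

```python
from typing import Sequence


def affinity(weights: Sequence[float], alpha: float) -> float:
    """Discounted sum of interaction weights, sum_s alpha**(t-s) * i_s.

    weights[k] is the interaction weight at the k-th recorded time step;
    the last entry is the current time t. Anything before the start of the
    list is treated as "no interaction" (weight 0), which is an assumption
    of this sketch.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    aff = 0.0
    for w in weights:
        # Recurrence equivalent to the geometric sum: aff_t = alpha * aff_{t-1} + i_t
        aff = alpha * aff + w
    return aff


history = [1.0, 0.0, 2.0]  # hypothetical weights for three time steps
print(affinity(history, alpha=0.5))  # 0.25 * 1 + 0.5 * 0 + 1 * 2 = 2.25
```

The recurrence form means Alice only needs to store one number per acquaintance, which is one reason the geometric kernel is convenient; other decay functions would generally need the full history.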

Now, we can define the social distance of Alice to Bob as

Here $\beta > 0$ is some constant, and the power law was chosen rather arbitrarily; there are many functions of $\mathrm{aff}$ that can work. Dead people should probably count in the infimum, but their influence wanes over time since they don’t interact with anyone (unless we count consciously thinking about a person as an interaction, which we might).

This is a time-dependent metric (or quasimetric, if we allow for asymmetric interactions such as thinking about someone or admiring someone from afar) on the set of people. If $i$ is bounded and there is a bounded number of people Alice can interact with at any given time, then there is some $C > 1$ s.t. the number of people within distance $r$ from Alice is $O(C^r)$. We now define the reward as

$$r_t(\mathrm{Alice}) := \sum_p \lambda^{d_t(\mathrm{Alice},p)}\, w_t(p)$$

Here $\lambda \in (0, \frac{1}{C})$ is some constant and $w_t(p)$ is the “welfare” of person $p$ at time $t$, or whatever is the source of value of people for Alice. Finally, the utility function is a time discounted sum of rewards, probably not geometric (because hyperbolic discounting is a thing). It is also appealing to make the decision rule to be minimax-regret over all sufficiently long time discount parameters, but this is tangential.

Notice how the utility function is automatically finite and bounded, and none of the weird paradoxes of population ethics and infinitary ethics crop up, even if there is an infinite number of people in the universe. I like to visualize people space as a tiling of hyperbolic space, with Alice standing in the center of a Poincaré or Beltrami-Klein model of it. Alice’s “measure of caring” is then proportional to volume in the model (this probably doesn’t correspond to exactly the same formula but it’s qualitatively right, and the formula is only qualitative anyway).
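To spell out why the reward is bounded, here is a short derivation sketch. It relies on two assumptions that are not stated explicitly above: welfare is bounded, $|w_t(p)| \le W$, and the growth condition holds with an explicit constant, i.e. the number $N(r)$ of people within distance $r$ of Alice satisfies $N(r) \le K C^r$.

```latex
% Assumptions (labels W, K, N are introduced here for illustration):
%   |w_t(p)| \le W for every person p,
%   N(r) \le K C^r, where N(r) counts people within distance r of Alice.
\begin{align*}
|r_t(\mathrm{Alice})|
  &\le W \sum_p \lambda^{d_t(\mathrm{Alice},p)} \\
  &\le W \sum_{k=0}^{\infty} N(k+1)\,\lambda^{k}
     && \text{(group people by } k \le d_t(\mathrm{Alice},p) < k+1\text{)} \\
  &\le W K C \sum_{k=0}^{\infty} (C\lambda)^{k} \\
  &= \frac{W K C}{1 - C\lambda} < \infty
     && \text{(since } \lambda < 1/C\text{)}
\end{align*}
```

The geometric series converges exactly because $\lambda < \frac{1}{C}$: caring decays with distance faster than the number of people grows.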

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...}. At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam’s razor (which is the only reasonable way to reason). Of course with bounded algorithms the convergence will be faster, something like the inverse bounded-busy-beaver, but still very slow. Any learning algorithm with inductive bias towards simplicity will have generalization failures when coming across the faultlines that carve reality at the joints, at every new level of the domain hierarchy.

This has an important consequence for alignment: in order to stand a chance, any alignment protocol must be fully online, meaning that whatever data sources it uses, those data sources must always stay in the loop, so that the algorithm can query the data source whenever it encounters a faultline. Theoretically, the data source can be disconnected from the loop at the point when it’s fully “uploaded”: the algorithm unambiguously converged towards a detailed accurate model of the data source. But in practice the convergence there will be very slow, and it’s very hard to know that it already occurred: maybe the model seems good for now but will fail at the next faultline. Moreover, convergence might literally never occur if the machine just doesn’t have the computational resources to contain such an upload (which doesn’t mean it doesn’t have the computational resources to be transformative!)^{[1]}

This is also a reason for pessimism regarding AI outcomes. AI scientists working through trial and error will see the generalization failures becoming more and more rare, with longer and longer stretches of stable function in between. This creates the appearance of increasing robustness. But, in reality robustness increases very slowly. We might reach a stable stretch between “subhuman” and “far superhuman” and the next faultline will be the end.

In the Solomonoff analogy, we can imagine the real data source as a short but prohibitively expensive program, and the learned model of the data source as an affordable but infinitely long program: as time progresses, more and more bits of this program will be learned, but there will always be bits that are still unknown. Of course, any prohibitively expensive program can be made affordable by running it much slower than real-time, which is something that Turing RL can exploit, but at some point this becomes impractical. ↩︎

An alignment-unrelated question: Can we, humans, increase the probability that something weird happens in our spacetime region (e.g., the usual laws of physics stop working) by making it possible to compress our spacetime location? E.g., by building a structure that is very regular (meaning that its description can be very short) and has never been built before in our space region, such as a huge, perfectly aligned rectangular grid of hydrogen atoms, or something like that.

It’s like a magical ritual for changing the laws of physics. This gives a new meaning to summoning circles, pentagrams, etc.

We can rephrase your question as follows: “Can we increase the probability of finding an error in the known laws of physics by performing an experiment with a simple property that never happened before, either naturally or artificially”? And the answer is: yes! This is actually what experimental physicists do all the time: perform experiments that try to probe novel circumstances where it is plausible (Occam-razor-wise) that new physics will be discovered.

As to magical rituals, sufficiently advanced technology is indistinguishable from magic :)

I have a sense that similar principles are at play with Spaced Repetition, and that pointing out that connection may be relevant to effectively handling this issue

convergence might literally never occur if the machine just doesn’t have the computational resources to contain such an upload

I think that in embedded settings (with a bounded version of Solomonoff induction) convergence may never occur, even in the limit as the amount of compute that is used for executing the agent goes to infinity. Suppose the observation history contains sensory data that reveals the probability distribution that the agent had, in the last time step, for the next number it’s going to see in the target sequence. Now consider the program that says: “if the last number was predicted by the agent to be 0 with probability larger than $1 - 2^{-10^{10}}$ then the next number is 1; otherwise it is 0.” Since it takes much less than $10^{10}$ bits to write that program, the agent will never predict two times in a row that the next number is 0 with probability larger than $1 - 2^{-10^{10}}$ (after observing only 0s so far).

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can’t understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be unlearnable, meaning that no algorithm has Bayesian regret going to zero in the γ→1 limit. The only obvious natural requirement for RL agents in this case is approximating Bayes-optimality. However, Bayes-optimality is not even “weakly feasible”: it is NP-hard w.r.t. using the number of states and number of hypotheses as security parameters. IMO, the central question is: what kind of natural tractable approximations are there?

Although a generic prior with traps is unlearnable, some priors with traps are learnable. Indeed, it can happen that it’s possible to study the environment in a predictably safe way that is guaranteed to produce enough information about the irreversible transitions. Intuitively, as humans we do often use this kind of strategy. But, it is NP-hard to even check whether a given prior is learnable. Therefore, it seems natural to look for particular types of learnable priors that are efficiently decidable.

In particular, consider the following setting, which I call “expanding safety envelope” (XSE). Assume that each hypothesis in the prior ζ is “decorated” by a set F of state-action pairs s.t. (i) any (s,a)∈F is safe, i.e. the leading term of Q(s,a,γ) in the γ→1 expansion is maximal (ii) for each s∈S, there is (s,a)∈F s.t. a is Blackwell-optimal for s (as a special case we can let F contain all safe actions). Imagine an agent that takes random actions among those a priori known to be in F. If there is no such action, it explodes. Then, it is weakly feasible to check (i) whether the agent will explode (ii) for each hypothesis, to which sets of states it can converge. Now, let the agent update on the transition kernel of the set of actions it converged to. This may lead to new actions becoming certainly known to be in F. We can then let the agent continue exploring using this new set. Iterating this procedure, the agent either discovers enough safe actions to find an optimal policy, or not.

Importantly, deciding this is weakly feasible. This is because, for each hypothesis (i) on the first iteration the possible asymptotic state sets are disjoint (ii) on subsequent iterations we might as well assume they are disjoint, since it’s possible to see that if you reach a particular state of an asymptotic state set, then you can add the entire state set (this modification will not create new final outcomes and will only eliminate final outcomes that are better than those remaining). Therefore the number of asymptotic state sets you have to store on each iteration is bounded by the total number of states.

The next questions are (i) what kind of regret bounds we can prove for decorated priors that are XSE-learnable? (ii) given an arbitrary decorated prior, is it possible to find the maximal-probability-mass set of hypotheses, which is XSE-learnable? I speculate that the second question might turn out to be related to the unique games conjecture. By analogy with other optimization problems that are feasible only when maximal score can be achieved, maybe the UGC implies that we cannot find the maximal set but we can find a set that is approximately maximal, with an optimal approximation ratio (using a sum-of-squares algorithm). Also, it might make sense to formulate stronger desiderata which reflect that, if the agent assumes a particular subset of the prior but discovers that it was wrong, it will still do its best in the following. That is, in this case the agent might fall into a trap but at least it will try to avoid further traps.

This has implications even for learning without traps. Indeed, most known theoretical regret bounds involve a parameter that has to do with how costly a mistake it is possible to make. This parameter can manifest as the MDP diameter, the bias span or the mixing time. Such regret bounds seem unsatisfactory since the worst-case mistake determines the entire guarantee. We can take the perspective that such costly but reversible mistakes are “quasi-traps”: not actual traps, but trap-like on short timescales. This suggests that applying an approach like XSE to quasi-traps should lead to qualitatively stronger regret bounds. Such regret bounds would imply learning faster on less data, and in episodic learning they would imply learning inside each episode, something that is notoriously absent in modern episodic RL systems like AlphaStar.

Moreover, we can also use this to do away with ergodicity assumptions. Ergodicity assumptions require the agent to “not wander too far” in state space, in the simplest case because the entire state space is small. But, instead of “wandering far” from a fixed place in state space, we can constrain “wandering far” w.r.t. the optimal trajectory. Combining this with XSE, this should lead to guarantees that depend on the prevalence of irreversible and quasi-irreversible departures from this trajectory.

In multi-armed bandits and RL theory, there is a principle known as “optimism in the face of uncertainty”. This principle says that you should always make optimistic assumptions: if you are wrong, you will find out (because you will get less reward than you expected). It explicitly underlies UCB algorithms and is implicit in other algorithms, like Thompson sampling. But, this fails miserably in the presence of traps. I think that approaches like XSE point at a more nuanced principle: “optimism in the face of cheap-to-resolve uncertainty, pessimism in the face of expensive-to-resolve uncertainty”. Following this principle doesn’t lead to actual Bayes-optimality, but perhaps it is in some sense a good enough approximation.
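For readers who haven’t seen the optimism principle in code, here is a minimal sketch of the standard UCB1 rule on a trap-free bandit. The Bernoulli arms, their means, the horizon and the function name `ucb1` are illustrative assumptions; nothing here handles traps, which is exactly the limitation discussed above.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    pulls = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            # Optimistic index: empirical mean plus a confidence bonus that
            # shrinks as the arm is sampled more often.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls


pulls = ucb1([0.2, 0.8], horizon=2000)  # hypothetical arm means
print(pulls)  # the better arm should receive the large majority of pulls
```

Over-optimism about the worse arm is corrected cheaply here: a few disappointing pulls shrink its index. With a trap, a single optimistic pull could be unrecoverable, so the same rule gives no protection.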

In the anthropic trilemma, Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution.

Consider a population of robots, each of which is a regular RL agent. The environment produces the observations of the robots, but can also make copies or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the “physical” baseline. Indeed, suppose that a particular observation c has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have c much more often in their history than the physical frequency with which c is encountered, due to the resulting “selection bias”. We call this setting “anthropic RL” (ARL).

The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the “physical” perspective. For example, we can consider an “anthropic MDP” (AMDP). An AMDP has finite sets of actions ($A$) and states ($S$), and a transition kernel $T: A \times S \to \Delta(S^*)$. The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are mutually inconsistent.

[EDIT: Actually, given that we don’t care about the order of robots, the signature of the transition kernel should be $T: A \times S \to \Delta \mathbb{N}^S$]

The consistency that is violated is exactly the causality property of environments. Luckily, we know how to deal with acausality: using the IB causal-acausal correspondence! The result can be described as follows: Murphy chooses a time moment $n \in \mathbb{N}$ and guesses the robot policy $\pi$ until time $n$. Then, a simulation of the dynamics of $(\pi, T)$ is performed until time $n$, and a single history is sampled from the resulting population. Finally, the observations of the chosen history unfold in reality. If the agent chooses an action different from what is prescribed, Nirvana results. Nirvana also happens after time $n$ (we assume Nirvana reward 1 rather than $\infty$).

This IB hypothesis is consistent with what the average robot sees at any given moment of time. Therefore, the average robot will learn this hypothesis (assuming learnability). This means that for $n \gg \frac{1}{1-\gamma} \gg 0$, the population of robots at time $n$ has expected average utility with a lower bound close to the optimum for this hypothesis. I think that for an AMDP this should equal the optimum expected average utility you can possibly get, but it would be interesting to verify.

Curiously, the same conclusions should hold if we do a weighted average over the population, with any fixed method of weighting. Therefore, the posterior of the average robot behaves adaptively depending on which sense of “average” you use. So, your epistemology doesn’t have to fix a particular method of counting minds. Instead different counting methods are just different “frames of reference” through which to look, and you can be simultaneously rational in all of them.

Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a “memory MDP” that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.

I’m not sure what you mean by that “unrolling”. Can you write a mathematical definition?

Let’s consider a simple example. There are two states: $s_0$ and $s_1$. There is just one action so we can ignore it. $s_0$ is the initial state. An $s_0$ robot transitions into an $s_1$ robot. An $s_1$ robot transitions into an $s_0$ robot and an $s_1$ robot. What will our population look like?

0th step: all robots remember $s_0$

1st step: all robots remember $s_0 s_1$

2nd step: 1/2 of robots remember $s_0 s_1 s_0$ and 1/2 of robots remember $s_0 s_1 s_1$

3rd step: 1/3 of robots remember $s_0 s_1 s_0 s_1$, 1/3 of robots remember $s_0 s_1 s_1 s_0$ and 1/3 of robots remember $s_0 s_1 s_1 s_1$

There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have $\Pr[s_0 s_1 s_0] = \frac{1}{2}$ and $\Pr[s_0 s_1 s_1] = \frac{1}{2}$. But, to be consistent with step 3 we must have $\Pr[s_0 s_1 s_0] = \frac{1}{3}$, $\Pr[s_0 s_1 s_1] = \frac{2}{3}$.

In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.

Or, at least it’s not obvious there is such a hypothesis. In this example, $\frac{\Pr[s_0 s_1 s_1]}{\Pr[s_0 s_1 s_0]}$ will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don’t know, maybe for finite state spaces it can work. Would definitely be interesting to check.

[EDIT: actually, in this example there is such a hypothesis but in general there isn’t, see below]
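The two-state example can be checked by brute force. The sketch below simulates the population of remembered histories under the copying rule described above and reports, at a given step, the fraction of robots whose history starts with each of the two length-3 prefixes. The helper names (`step`, `prefix_fraction`, `prefix_fractions_at`) are made up for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Copying rule of the example: an s0 robot becomes one s1 robot,
# an s1 robot becomes one s0 robot and one s1 robot.
SUCCESSORS = {"s0": ["s1"], "s1": ["s0", "s1"]}


def step(population: Counter) -> Counter:
    """Advance every robot one step; each copy inherits the full history."""
    nxt = Counter()
    for history, count in population.items():
        for s in SUCCESSORS[history[-1]]:
            nxt[history + (s,)] += count
    return nxt


def prefix_fraction(population: Counter, prefix: tuple) -> Fraction:
    """Fraction of robots whose remembered history starts with `prefix`."""
    total = sum(population.values())
    hits = sum(c for h, c in population.items() if h[: len(prefix)] == prefix)
    return Fraction(hits, total)


def prefix_fractions_at(n: int):
    """Fractions for prefixes s0 s1 s0 and s0 s1 s1 after n steps (n >= 2)."""
    population = Counter({("s0",): 1})
    for _ in range(n):
        population = step(population)
    return (
        prefix_fraction(population, ("s0", "s1", "s0")),
        prefix_fraction(population, ("s0", "s1", "s1")),
    )


for n in (2, 3, 20):
    a, b = prefix_fractions_at(n)
    print(n, a, b, float(b / a))
```

At step 2 the two fractions are 1/2 and 1/2, at step 3 they are 1/3 and 2/3, and by step 20 their ratio is a quotient of consecutive Fibonacci numbers, already very close to the golden ratio.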

Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states $s_0$, $s_1$ with the memory states $[s_0]$, $[s_1]$, $[s_0,s_0]$, $[s_0,s_1]$, etc. The action takes a robot in $[s_0]$ to memory state $[s_0,s_1]$, and a robot in $[s_0,s_1]$ to one robot in $[s_0,s_1,s_0]$ and another in $[s_0,s_1,s_1]$.

(Skip this paragraph unless the specifics of what’s going on aren’t obvious: given a transition distribution $P(s'^* \mid s, \pi)$ ($P$ being the distribution over sets of states $s'^*$ given starting state $s$ and policy $\pi$), we can define the memory transition distribution $P(s'^*_m \mid s_m, \pi)$ given policy $\pi$ and starting “memory state” $s_m \in S^*$ (note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of $s_m$ into the transition distribution as the current state. Then for each $s'^*$ in the domain, for each element in $s'^*$ we concatenate that element onto the end of $s_m$ and collect these $s'_m$ into a set $s'^*_m$, which is assigned the same probability $P(s'^*)$.)

So now at time $t=2$, if you sample a robot, the probability that its state begins with $[s_0,s_1,s_1]$ is 0.5. And at time $t=3$, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it’s just that we’ve turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.

I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

I’m not quite sure what you are trying to say here, probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It’s like in classical RL theory, when you’re proving a regret bound or whatever, your probability space consists of histories.

I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I’m talking about hypotheses which conform to the classical “cybernetic agent model”. If you wish, we can call it “Bayesian cybernetic hypothesis”.

Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.

Indeed, consider some $T: S \to \Delta \mathbb{N}^S$. We can take its expected value to get $\mathrm{E}T: S \to \mathbb{R}^S_+$. Assuming the chain is communicating, $\mathrm{E}T$ is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector $\eta \in \mathbb{R}^S_+$. We then get the subjective transition kernel:

## Vanessa Kosoy’s Shortform

An AI progress scenario which seems possible and which I haven’t seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms^{[1]} might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That’s because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we’re still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevertheless, some superhuman performance might be produced by sped-up simulation, reducing noise in human behavior and controlling the initial conditions (e.g. simulating a human on a good day). As a result, we will have some period of time during which AGI is already here, automation is in full swing, but there’s little or no further escalation. At the end of this period, the missing ingredients will be assembled (maybe with the help of AI researchers) and superhuman AI (possibly a fast takeoff) begins.

It’s interesting to try and work out the consequences of such a scenario, and the implications for AI strategy.

Such as GPT-n ↩︎

This seems similar to gaining uploads prior to AGI, and opens up all those superorg upload-city amplification/distillation constructions which should get past human level shortly after. In other words, the limitations of the dataset can be solved by amplification as soon as the AIs are good enough to be used as building blocks for meaningful amplification, and something human-level-ish seems good enough for that. Maybe even GPT-n is good enough for that.

That is similar to gaining uploads (borrowing terminology from Egan, we can call them “sideloads”), but it’s not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you’re distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that’s not better than plain speedup.

To me this seems to be essentially another limitation of the human Internet archive dataset: reasoning is presented in an opaque way (most slow/deliberative thoughts are not in the dataset), so it’s necessary to do a lot of guesswork to figure out how it works. A better dataset both explains and summarizes the reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3 can do that to an extent by roleplaying Feynman).

Any algorithm can be represented by a habit of thought (Turing machine style if you must), and if those are in the dataset, they can be learned. The habits of thought that are simple enough to summarize get summarized and end up requiring fewer steps. My guess is that the human faculties needed for AGI can be both represented by sequences of thoughts (probably just text, stream of consciousness style) and easily learned with current ML. So right now the main obstruction is that it’s not feasible to build a dataset with those faculties represented explicitly that’s good enough and large enough for current sample-inefficient ML to grok. More compute in the learning algorithm is only relevant for this to the extent that we get a better dataset generator that can work on the tasks before it more reliably.

I don’t see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior “stream of thought”. In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of thought that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.

I was arguing that near human level babblers (including the imitation plateau you were talking about) should quickly lead to human level AGIs by amplification via stream of consciousness datasets, which doesn’t pose new ML difficulties other than design of the dataset. Superintelligence follows from that by any of the same arguments as for uploads leading to AGI (much faster technological progress; if amplification/distillation of uploads is useful straight away, we get there faster, but it’s not necessary). And amplified babblers should be stronger than vanilla uploads (at least implausibly well-educated, well-coordinated, high IQ humans).

For your scenario to be stable, it needs to be impossible (in the near term) to run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain less effective than very high IQ humans. Otherwise you get acceleration of technological progress, including ML. So my point is that feasibility of imitation plateau depends on absence of compute overhang, not on ML failing to capture some of the ingredients of human general intelligence.

The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhang, whereas a success in capturing all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an “intrinsic” AGI might lead to the classical “foom” scenario.

EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.

I agree. But GPT-3 seems to me like a good estimate for how much compute it takes to run stream of consciousness imitation learning sideloads (assuming that learning is done in batches on datasets carefully prepared by non-learning sideloads, so the cost of learning is less important). And with that estimate we already have enough compute overhang to accelerate technological progress as soon as the first amplified babbler AGIs are developed, which, as I argued above, should happen shortly after babblers actually useful for automation of human jobs are developed (because generation of stream of consciousness datasets is a special case of such a job).

So the key things to make imitation plateau last for years are either sideloads requiring more compute than it looks like (to me) they require, or amplification of competent babblers into similarly competent AGIs being a hard problem that takes a long time to solve.

Another thing that might happen is a data bottleneck.

Maybe there will be a good enough dataset to produce a sideload that simulates an “average” person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are far fewer competent AI researchers than people in general).

Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That’s because, unless you’re training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your “episodes”. Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length.

In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. And, simulating progress requires by design going off-distribution along certain dimensions which might make things worse.

Another way to describe the same (or similar) plateau: we could think about GPT-n as a GLUT with approximation between prerecorded answers: it can produce intelligent products similar to the ones which were created by humans in the past and are present in its training dataset, but not above the human intelligence level, as there are no superintelligent examples in the dataset.

I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn’t make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user’s beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

More formally, we consider (some extension of) a delegative IRL setting (i.e. there is a single set of input/output channels, the control of which can be toggled between the user and the AI by the AI). Let $\pi^u_\upsilon$ be the user’s policy in universe $\upsilon$ and $\pi^a$ the AI policy. Let $T$ be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability 1 for any policy. Let $V_\upsilon$ be the value of a state from the user’s subjective POV, in universe $\upsilon$. Let $\mu_\upsilon$ be the environment in universe $\upsilon$. Finally, let $\zeta$ be the AI’s prior over universes and $\epsilon$ some sufficiently small bound. We require

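$$\forall T: \mathbb{E}_{\upsilon\sim\zeta}\left[\left(\mathbb{E}_{\pi^u_\upsilon\mu_\upsilon}[V_\upsilon(T)]-\mathbb{E}_{\pi^a\mu_\upsilon}[V_\upsilon(T)]\right)_+\right]\leq\epsilon$$

Here, $V_\upsilon(T)$ designates the value after event $T$ happens, and $(x)_+$ is defined to be 0 for $x<0$ and $x$ otherwise.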
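To make the constraint concrete, here is a minimal sketch for a finite set of universes and a single fixed event T. The function names and the toy numbers are illustrative assumptions, not part of the formalism above; the inputs stand for the user-subjective values achieved in each universe by the user's policy and by the AI's policy.

```python
from typing import Sequence


def hippocratic_shortfall(prior: Sequence[float],
                          user_values: Sequence[float],
                          ai_values: Sequence[float]) -> float:
    """Prior-expectation of the positive part of (user value - AI value).

    prior[i] is the AI's prior probability of universe i; user_values[i] and
    ai_values[i] are the expected values (for one fixed event T) obtained in
    universe i by the user's policy and by the AI's policy respectively.
    """
    if not (len(prior) == len(user_values) == len(ai_values)):
        raise ValueError("all inputs must have one entry per universe")
    return sum(p * max(u - a, 0.0)
               for p, u, a in zip(prior, user_values, ai_values))


def is_hippocratic(prior: Sequence[float],
                   user_values: Sequence[float],
                   ai_values: Sequence[float],
                   epsilon: float) -> bool:
    """Check the constraint for one event T (the principle quantifies over all T)."""
    return hippocratic_shortfall(prior, user_values, ai_values) <= epsilon


# Toy example (hypothetical numbers): the AI helps in universe 0,
# is neutral in universe 1, and hurts slightly in universe 2.
prior = [0.5, 0.3, 0.2]
user_values = [0.4, 0.6, 0.9]
ai_values = [0.7, 0.6, 0.8]
print(hippocratic_shortfall(prior, user_values, ai_values))
```

Note that, because of the positive part, the gain in universe 0 does not offset the loss in universe 2: only the universes where the AI does worse than the user contribute to the shortfall.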

I now outline an example of a Hippocratic protocol, which I call Hippocratic timeline-driven learning (HTDL).

Suppose, for simplicity, that the goal is accomplishing some binary task (i.e. we can either succeed or fail). We can imagine the task to be something like “construct a defense system against malign AI”. At any given moment, the user reports their subjective timeline: i.e., for any t, the probabilities of success and failure after time t, assuming the user will continue working on the task by themselves. We assume the AI can learn a good model of the user’s policy and the user’s nominal report, but this nominal report might deviate from the true timeline because the system can be corrupted (i.e. the user is manipulated into entering false information or the input channel is hacked). Moreover, we assume that such corruption counts as failing the task.

The AI operates by maximizing the AI-expectation of the user’s (unknown) utility function, $\mathbb{E}_{\upsilon\sim\zeta}\left[\mathbb{E}_{\pi\mu_\upsilon}[U_\upsilon]\right]$, subject to the Hippocratic principle as a constraint. Since we’re working in a delegative setting, the AI can trivially achieve the constraint by always delegating. However, it can do much better than that, by short-term quantilizing w.r.t. the user’s reported success probability (with the user’s policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user’s reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation. Then, after completing this short-term maneuver, the AI can do another short-term quantilization, and continue in this vein. Essentially, the AI is gradually bending the user’s trajectory towards success.

Of course, we can improve the guarantees of such a protocol either by adding more information or by using stronger assumptions (e.g. assume the user is quasi-rational). But timeline-driven short-term quantilization serves as a useful feasibility proof.

I don’t understand what you mean here by quantilizing. The meaning I know is to take a random action over the top \alpha actions, on a given base distribution. But I don’t see a distribution here, or even a clear ordering over actions (given that we don’t have access to the utility function).

I’m probably missing something obvious, but more details would really help.

The distribution is the user’s policy, and the utility function for this purpose is the eventual success probability estimated by the user (as part of the timeline report), at the end of the “maneuver”. More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it; for example, I did it for MDPs.

Oh, right, that makes a lot of sense.

So is the general idea that we quantilize such that we’re choosing in expectation an action that doesn’t have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

I also wonder if using the user policy to sample actions isn’t limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is $\epsilon$ and your quantilization fraction is $\phi$ then the AI’s probability of corruption is bounded by $\epsilon/\phi$.

Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn’t specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
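As a rough illustration of why the corruption bound holds, here is a one-shot quantilizer over a finite action set. This is only a sketch under simplifying assumptions: the action set, the base probabilities (standing in for the user's policy) and the scores (standing in for the reported eventual success probability) are all made up, and the real proposal is about short-term policies rather than single actions.

```python
from typing import List, Sequence


def quantilize(base: Sequence[float],
               utility: Sequence[float],
               phi: float) -> List[float]:
    """Condition the base distribution on its top-phi fraction by utility.

    Actions are taken in order of decreasing utility until a total base
    probability mass of phi is collected; that mass is then renormalized.
    Every action ends up with probability at most base[i] / phi.
    """
    if not 0.0 < phi <= 1.0:
        raise ValueError("phi must be in (0, 1]")
    order = sorted(range(len(base)), key=lambda i: utility[i], reverse=True)
    result = [0.0] * len(base)
    remaining = phi
    for i in order:
        take = min(base[i], remaining)
        result[i] = take / phi
        remaining -= take
        if remaining <= 0.0:
            break
    return result


# Hypothetical example: action 3 has the best *reported* score because it
# corrupts the report. The user takes it with probability epsilon = 0.05.
base = [0.5, 0.3, 0.15, 0.05]
reported_score = [0.2, 0.5, 0.7, 0.99]
phi = 0.25
q = quantilize(base, reported_score, phi)
print(q)  # the corrupt action gets probability at most 0.05 / 0.25
```

Since no action's probability is inflated by more than a factor of 1/φ, any event that has probability ε under the user's policy (such as corruption) has probability at most ε/φ under the quantilized policy, regardless of how the scores rank the actions.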

I like this because it’s simple and obviously correct. Also I can see at least one way you could implement it:

a. Suppose the AI is ‘shadowing’ a human worker doing a critical task. Say it is ‘shadowing’ a human physician.

b. Each time the AI observes the same patient, it regresses between [data from the patient] and [predicted decision a ‘good’ physician would make, predicted outcome for the ‘good’ decision]. Once the physician makes a decision and communicates it, the AI regresses between [decision the physician made] and [predicted outcome for that decision].

c. The machine also must have a confidence or this won’t work.

With large numbers and outright errors made by the physician, it’s then possible to detect all the cases where the [decision the physician made] has a substantially worse outcome than the [predicted decision a ‘good’ physician would make], and, when the AI has a high confidence of this [requiring many observations of similar situations], it’s time to call for a second opinion.

In the long run, of course, there will be a point where the [predicted decision a ‘good’ physician would make] is better than the [information gain from a second human opinion] and you really would do best by firing the physician and having the AI make the decisions from then on, trusting it to call for a second opinion when it is not confident.

(As an example, AlphaGo Zero likely doesn’t benefit from asking another master Go player for a ‘second opinion’ when it sees the player it is advising make a bad call.)

I have repeatedly argued for a departure from pure Bayesianism that I call “quasi-Bayesianism”. But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact Bayesianism is somehow deficient. So, here’s another way to understand it, using Bayesianism’s own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.

A possible counterargument is, we don’t need to depart far from Bayesianism to win here. We only need to somehow justify randomization, perhaps by something like infinitesimal random perturbations of the belief state (like with reflective oracles). But, in a way, this is exactly what quasi-Bayesianism does: a quasi-Bayes-optimal policy is in particular Bayes-optimal when the prior is taken to be in Nash equilibrium of the associated zero-sum game. However, Bayes-optimality underspecifies the policy: not every optimal reply to a Nash equilibrium is a Nash equilibrium.

This argument is not entirely novel: it is just a special case of an environment that the agent cannot simulate, which is the original motivation for quasi-Bayesianism. In some sense, any Bayesian agent is dogmatic: it dogmatically believes that the environment is computationally simple, since it cannot consider a hypothesis which is not. Here, Omega exploits this false dogmatic belief.

Bayesians are allowed to understand that there are agents with better estimates than they have. And that being offered a bet _IS_ evidence that the other agent THINKS they have an advantage.

Randomization (aka “mixed strategy”) is well-understood as the rational move in games where opponents are predicting your choices. I have read nothing that would even hint that it’s unavailable to Bayesian agents. The relevant probability (updated per Bayes’s Rule) would be “is my counterpart trying to minimize my payout based on my choices”.

edit: I realize you may be using a different definition of “Bayesianism” than I am. I’m thinking of humans striving for rational choices, which perforce includes the knowledge of incomplete computation and imperfect knowledge. Naive agents can be imagined that don’t have this complexity. Those guys are stuck, and Omega’s gonna pwn them.

It feels like there are better words for this, like rationality, whereas Bayesianism is a more specific philosophy about how best to represent and update beliefs.

And here I thought the reason was going to be that Bayesianism doesn’t appear to include the cost of computation. (Thus, the usual Dutch book arguments should be adjusted so that “optimal betting” does not leave one worse off for having paid, say, an oracle, too much for computation.)

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can only be achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requires considering shallow time discounts, otherwise there is no notion of long-term planning. For shallow time discounts, the game theory of a repeated game is very different from the game theory of the original normal form game. In fact, the folk theorem asserts that any payoff vector above the maximin of each player is a possible Nash payoff. So, proving convergence to a Nash equilibrium amounts (more or less) to proving convergence to at least the maximin payoff. This is possible using incomplete models, but doesn’t seem very interesting: to receive the maximin payoff, the agents only have to learn the rules of the game, they need not learn the reward functions of the other players or anything else about them.

We arrive at the question, what setting is realistic (in the sense of involving learning with shallow time discount) and is expected to produce Nash equilibria for a normal form game? I suggest the following. Instead of a fixed set of agents repeatedly playing against each other, we consider a population of agents that are teamed-off randomly on each round of the game. The population is assumed to be large enough for agents not to encounter each other more than once. This can be formalized as follows. Let $A_i$ be the pure strategy set of the $i$-th agent and $O:=\prod_i A_i$ the set of pure outcomes. The set of $n$-round outcome histories is $O^n$. The population of agents on the $n$-th round can then be described as a probability measure $\mu_n\in\Delta O^n$. Suppose the policy of the $i$-th player (that is, of all the agents that take the role of the $i$-th player) is $\pi_i:O^n\to\Delta A_i$. Then we can define a time evolution rule that produces $\mu_{n+1}$ from $\mu_n$. This rule works as follows: in order to sample $\mu_{n+1}$ we sample $\mu_n$ once per player (this is the history the given player has seen), sample the policy of each player on its own history, and produce a new history by appending the resulting outcome to one of the old histories (it doesn’t matter which). A set of policies is considered to be in equilibrium, when for any $i$, and any alternative policy $\pi'_i$, letting $\pi'_i$ play against the same population (i.e. all other copies of the $i$-th player still play $\pi_i$) doesn’t improve expected utility. In other words, on each round the “mutant” agent retains its own history but the other player histories are still sampled from the same $\mu_n$. It is easy to see that any equilibrium payoff in this setting is a Nash payoff in the original normal form game. We can then legitimately ask whether taking the $\pi_i$ to be learning algorithms would result in convergence to a Nash payoff in the $\gamma\to1$ (shallow time discount) limit.

For example, consider the Prisoner’s dilemma. In the repeated Prisoner’s dilemma with shallow time discount, CC is an equilibrium because of the tit-for-tat policy. On the other hand, in the “population” (massively multi-player?) repeated Prisoner’s dilemma, DD is the only equilibrium. Tit-for-tat doesn’t work because a single “defect bot” can exploit a population of tit-for-tats: on each round it plays with a new opponent that doesn’t know the defect bot defected on the previous round.
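The contrast between the two settings can be checked with a small calculation. The sketch below is only illustrative: the payoff numbers (5, 3, 1, 0), the discount 0.99 and the finite horizon are assumptions, and the population is idealized as infinite, so that every opponent the mutant meets has a history containing only cooperation.

```python
# Standard Prisoner's dilemma payoffs for the row player (assumed values).
PAYOFF = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}


def tit_for_tat(last_seen):
    """Cooperate first, then copy the last move observed from an opponent."""
    return "C" if last_seen is None else last_seen


def mutant_payoff(mutant_move, gamma, rounds, fresh_opponents):
    """Normalized discounted payoff of a constant-move mutant against tit-for-tat.

    fresh_opponents=False: ordinary repeated game against one fixed opponent.
    fresh_opponents=True: population game, a new opponent every round who has
    never observed the mutant (idealized infinite population).
    """
    total = 0.0
    last_seen_by_opponent = None
    for n in range(rounds):
        if fresh_opponents:
            last_seen_by_opponent = None  # the new opponent has not met the mutant
        opponent_move = tit_for_tat(last_seen_by_opponent)
        total += (1.0 - gamma) * gamma ** n * PAYOFF[(mutant_move, opponent_move)]
        last_seen_by_opponent = mutant_move
    return total


gamma, rounds = 0.99, 2000
for fresh in (False, True):
    setting = "population" if fresh else "repeated"
    print(setting,
          "always-C:", round(mutant_payoff("C", gamma, rounds, fresh), 3),
          "always-D:", round(mutant_payoff("D", gamma, rounds, fresh), 3))
```

Against a fixed tit-for-tat opponent, defecting pays once and is punished forever, so cooperation is the better reply. Against fresh opponents the punishment never arrives, so the defect bot strictly outperforms the tit-for-tat players it exploits.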

Note that we get a very different setting if we allow the players to see each other’s histories, more similar (equivalent?) to the regular repeated game. For example, in the Prisoner’s Dilemma we have a version of tit-for-tat that responds to what its current opponent played in its previous round (against a different opponent). This may be regarded as a confirmation of the idea that agents that know each other’s source code are effectively playing a repeated game: in this setting, knowing the source code amounts to knowing the history.

We can modify the population game setting to study superrationality. In order to do this, we can allow the agents to see a fixed size finite portion of their opponents’ histories. This should lead to superrationality for the same reasons I discussed before. More generally, we can probably allow each agent to submit a finite state automaton of limited size, s.t. the opponent history is processed by the automaton and the result becomes known to the agent.

What is unclear about this is how to define an analogous setting based on source code introspection. While arguably seeing the entire history is equivalent to seeing the entire source code, seeing part of the history, or processing the history through a finite state automaton, might be equivalent to some limited access to source code, but I don’t know how to define this limitation.

EDIT: Actually, the obvious analogue is processing the source code through a finite state automaton.

Instead of postulating access to a portion of the history or some kind of limited access to the opponent’s source code, we can consider agents with full access to history / source code but finite memory. The problem is, an agent with fixed memory size usually cannot have regret going to zero, since it cannot store probabilities with arbitrary precision. However, it seems plausible that we can usually get learning with memory of size $O(\log\frac{1}{1-\gamma})$. This is because something like “counting pieces of evidence” should be sufficient. For example, if we consider finite MDPs, then it is enough to remember how many transitions of each type occurred to encode the belief state. The question is whether assuming $O(\log\frac{1}{1-\gamma})$ memory (or whatever is needed for learning) is enough to reach superrationality.

What do you mean by equivalent? The entire history doesn’t say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.

For a fixed policy, the history is the only thing you need to know in order to simulate the agent on a given round. In this sense, seeing the history is equivalent to seeing the source code.

The claim is: In settings where the agent has unlimited memory and sees the entire history or source code, you can’t get good guarantees (as in the folk theorem for repeated games). On the other hand, in settings where the agent sees part of the history, or is constrained to have finite memory (possibly of size $O(\log\frac{1}{1-\gamma})$?), you can (maybe?) prove convergence to Pareto efficient outcomes or some other strong desideratum that deserves to be called “superrationality”.

In the previous “population game” setting, we assumed all players are “born” at the same time and learn synchronously, so that they always play against players of the same “age” (history length). Instead, we can consider a “mortal population game” setting where each player has a probability 1−γ to die on every round, and new players are born to replenish the dead. So, if the size of the population is N (we always consider the “thermodynamic” N→∞ limit), N(1−γ) players die and the same number of players are born on every round. Each player’s utility function is a simple sum of rewards over time, so, taking mortality into account, effectively ey have geometric time discount. (We could use age-dependent mortality rates to get different discount shapes, or allow each type of player to have different mortality=discount rate.) Crucially, we group the players into games randomly, independent of age.

As before, each player type $i$ chooses a policy $\pi_i$. (We can also consider the case where players of the same type may have different policies, but let’s keep it simple for now.) In the thermodynamic limit, the population is described as a distribution over histories, which now are allowed to be of variable length: $\mu_n\in\Delta O^*$. For each assignment of policies to player types, we get dynamics $\mu_{n+1}=T_\pi(\mu_n)$ where $T_\pi:\Delta O^*\to\Delta O^*$. So, as opposed to immortal population games, mortal population games naturally give rise to dynamical systems.

If we consider only the age distribution, then its evolution doesn’t depend on $\pi$ and it always converges to the unique fixed point distribution $\zeta(k)=(1-\gamma)\gamma^k$. Therefore it is natural to restrict the dynamics to the subspace of $\Delta O^*$ that corresponds to the age distribution $\zeta$. We denote it $P$.
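The convergence of the age distribution is easy to check numerically. The following is a small sketch, assuming a truncated representation in which the last entry lumps together all ages at or above the cutoff; the function name, the cutoff of 50 and the choice of 0.9 for the survival probability are illustrative.

```python
from typing import List


def step_age_distribution(dist: List[float], gamma: float) -> List[float]:
    """One round of the age dynamics in the infinite-population limit.

    dist[k] is the fraction of players of age k for k < len(dist) - 1, and
    dist[-1] is the fraction of players of age >= len(dist) - 1.
    Each player survives with probability gamma and ages by one round;
    the dead fraction (1 - gamma) is replaced by newborns of age 0.
    """
    new = [0.0] * len(dist)
    new[0] = 1.0 - gamma
    for k in range(1, len(dist)):
        new[k] = gamma * dist[k - 1]
    new[-1] += gamma * dist[-1]  # survivors in the tail bucket stay in it
    return new


gamma, cutoff = 0.9, 50
dist = [1.0] + [0.0] * cutoff  # start with everyone newborn
for _ in range(200):
    dist = step_age_distribution(dist, gamma)

# Compare with the geometric fixed point (1 - gamma) * gamma**k.
max_error = max(abs(dist[k] - (1.0 - gamma) * gamma ** k) for k in range(cutoff))
print(max_error)
```

Starting from a population of newborns, the mass at age k equals (1−γ)γ^k as soon as more than k rounds have passed, which matches the fixed point stated above.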

Does the dynamics have fixed points? $O^*$ can be regarded as a subspace of $(O\sqcup\{\bot\})^\omega$. The latter is compact (in the product topology) by Tychonoff’s theorem and Polish, but $O^*$ is not closed. So, w.r.t. the weak topology on probability measure spaces, $\Delta(O\sqcup\{\bot\})^\omega$ is also compact but $\Delta O^*$ isn’t. However, it is easy to see that $P$ is closed in $\Delta(O\sqcup\{\bot\})^\omega$ and therefore compact. It may also be regarded as a convex subset of an appropriate Banach space (the dual of the space of Lipschitz functions on some metrization of $(O\sqcup\{\bot\})^\omega$). Moreover, it is easy to see $T_\pi$ is continuous (for populations that are close in the Kantorovich-Rubinstein metric, only the old players may have very different distributions, but old players are a small fraction of the population so their effect on the next round is small). By the Schauder fixed-point theorem, it follows that $T_\pi$ has a fixed point.

What are the fixed points like? Of course it depends on $\pi$. In a fixed point, every player observes a sequence of IID plays in all of eir games. Therefore, if $\pi$ satisfies the (very mild!) learning-theoretic desideratum that, upon observing an IID sequence, it converges to optimal response in the $\gamma\to1$ limit, then, in the same limit, fixed points are Nash equilibria. This works even for extremely simple learning algorithms, such as “assume the plays in the next game will be sampled from a random past game”, and it works for any Bayesian or “quasi-Bayesian” (i.e. using incomplete/fuzzy models) agent that includes all IID processes in its prior.

This raises a range of interesting questions:

Are any/all of the fixed points attractors?

Does convergence to a fixed point occur for all or at least almost all initial conditions?

Do all Nash equilibria correspond to fixed points?

Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?

Mortal population games are obviously reminiscent of evolutionary game theory. However, there are substantial differences. In mortal population games, the game doesn’t have to be symmetric, we consider a single policy rather than many competing policies, the policies learn from experience instead of corresponding to fixed strategies, and mortality rate doesn’t depend on the reward. In evolutionary game theory, convergence usually cannot be guaranteed. For example, in the rock-paper-scissors game, the population may cycle among the different strategies. On the other hand, in mortal population games, if the game is two-player zero-sum (which includes rock-paper-scissors), and the policy is quasi-Bayesian with appropriate prior, convergence is guaranteed. This is because each player can easily learn to guarantee maximin payoff. Continuity arguments probably imply that at least for small perturbations of zero-sum, there will still be convergence. This leads to some hope that convergence can be guaranteed even in general games, or at least under some relatively mild conditions.

I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:

Studying the space of simple mathematical laws which produce counterfactual universes with intelligent life.

Studying the distribution over utility-function-space (and, more generally, mindspace) of those counterfactual minds.

Studying the distribution of the amount of resources available to the counterfactual civilizations, and broad features of their development trajectories.

Using all of the above to produce a distribution over concretized simulation hypotheses.

This concept is of potential interest for several reasons:

It can be beneficial to actually research metacosmology, in order to draw practical conclusions. However, knowledge of metacosmology can pose an infohazard, and we would need to precommit not to accept blackmail from potential simulators.

The metacosmology knowledge of a superintelligent AI determines the extent to which it poses risk via the influence of potential simulators.

In principle, we might be able to use knowledge of metacosmology in order to engineer an “atheist prior” for the AI that would exclude simulation hypotheses. However, this might be very difficult in practice.

Some thoughts about embedded agency.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, “death” can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent^{[1]}. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent’s reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don’t give the right object, since the reward is still tied to the agent’s actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology of the external world. Formally, such an ontology can be an incomplete^{[2]} Markov chain, the reward function being a function of the state. Examples:

The Markov chain is a representation of known physics (or some sector of known physics). The reward corresponds to the total mass of diamond in the world. To make this example work, we only need enough physics to be able to define diamonds. For example, we can make do with quantum electrodynamics + classical gravity and have the Knightian uncertainty account for all nuclear and high-energy phenomena.

The Markov chain is a representation of people and social interactions. The reward corresponds to concepts like “happiness” or “friendship” et cetera. Everything that falls outside the domain of human interactions is accounted for by Knightian uncertainty.

The Markov chain is Botworld with some of the rules left unspecified. The reward is the total number of a particular type of item.

Now we need to somehow connect the agent to the ontology. Essentially we need a way of drawing Cartesian boundaries inside the (a priori non-Cartesian) world. We can accomplish this by specifying a function that assigns an observation and projected action to every state out of some subset of states. Entering this subset corresponds to agent creation, and leaving it corresponds to agent destruction. For example, we can take the ontology to be Botworld + marked robot and the observations and actions be the observations and actions of that robot. If we don’t want to mark a particular robot as part of the ontology, we can use a more complicated definition of Cartesian boundary that specifies a set of agents at each state plus the data needed to track these agents across time (in this case, the observation and action depend to some extent on the history and not only the current state). I will leave out the details for now.

Finally, we need to define the prior. To do this, we start by choosing some prior over refinements of the ontology. By “refinement”, I mean removing part of the Knightian uncertainty, i.e. considering incomplete hypotheses which are subsets of the “ontological belief”. For example, if the ontology is underspecified Botworld, the hypotheses will specify some of what was left underspecified. Given such an “objective” prior and a Cartesian boundary, we can construct a “subjective” prior for the corresponding agent. We transform each hypothesis via postulating that taking an action that differs from the projected action leads to a “Nirvana” state. Alternatively, we can allow for stochastic action selection and use the gambler construction.

Does this framework guarantee effective planning for death? A positive answer would correspond to some kind of learnability result (regret bound). To get learnability, we will first need that the reward is either directly or indirectly observable. By “indirectly observable” I mean something like with semi-instrumental reward functions, but accounting for agent mortality. I am not ready to formulate the precise condition atm. Second, we need to consider an asymptotic in which the agent is long lived (in addition to time discount being long-term), otherwise it won’t have enough time to learn. Third (this is the trickiest part), we need the Cartesian boundary to flow with the asymptotic as well, making the agent “unspecial”. For example, consider Botworld with some kind of simplicity prior. If I am a robot born at cell zero and time zero, then my death is an event of low description complexity. It is impossible to be confident about what happens after such a simple event, since there will always be competing hypotheses with different predictions and a probability that is only lower by a factor of $\Omega(1)$. On the other hand, if I am a robot born at cell 2439495 at time 9653302 then it would be surprising if the outcome of my death would be qualitatively different from the outcome of the death of any other robot I observed. Finding some natural, rigorous and general way to formalize this condition is a very interesting problem. Of course, even without learnability we can strive for Bayes-optimality or some approximation thereof. But, it is still important to prove learnability under certain conditions to test that this framework truly models rational reasoning about death.

Additionally, there is an intriguing connection between some of these ideas and UDT, if we consider TRL agents. Specifically, a TRL agent can have a reward function that is defined in terms of computations, exactly like UDT is often conceived. For example, we can consider an agent whose reward is defined in terms of a simulation of Botworld, or in terms of taking expected value over a simplicity prior over many versions of Botworld. Such an agent would be searching for copies of itself inside the computations it cares about, which may also be regarded as a form of “embeddedness”. It seems like this can be naturally considered a special case of the previous construction, if we allow the “ontological belief” to include beliefs pertaining to computations.

[1] Unless it’s some kind of modification that we treat explicitly in our model of the agent, for example a TRL agent reprogramming its own envelope.

[2] “Incomplete” in the sense of Knightian uncertainty, like in quasi-Bayesian RL.

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely “behaviorist” view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions $A$, a set of observations $O$, a policy is a mapping $\pi:(A\times O)^*\to\Delta A$, a reward function is a mapping $r:(A\times O)^*\to[0,1]$, the utility function is a time discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining “goal-directed intelligence” is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.
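The first half of the vacuity argument can be spelled out in a few lines. This is only a sketch under simplifying assumptions: the policy is deterministic, histories are tuples of (action, observation) pairs, and the toy policy and environment are arbitrary placeholders.

```python
from typing import Callable, Tuple

History = Tuple[Tuple[str, str], ...]
Policy = Callable[[History], str]


def contrived_reward(policy: Policy) -> Callable[[History], float]:
    """Reward 1 as long as every past action agrees with `policy`, else 0."""
    def reward(history: History) -> float:
        for t, (action, _observation) in enumerate(history):
            if action != policy(history[:t]):
                return 0.0
        return 1.0
    return reward


def discounted_return(policy: Policy,
                      reward: Callable[[History], float],
                      env: Callable[[History, str], str],
                      gamma: float,
                      horizon: int) -> float:
    """Normalized discounted sum of rewards along the rollout of `policy`."""
    history: History = ()
    total = 0.0
    for t in range(horizon):
        action = policy(history)
        observation = env(history, action)
        history = history + ((action, observation),)
        total += (1.0 - gamma) * gamma ** t * reward(history)
    return total


# An arbitrary, purposeless-looking policy: alternate between "a" and "b".
def alternating(history: History) -> str:
    return "a" if len(history) % 2 == 0 else "b"


def constant(history: History) -> str:
    return "a"


def env(history: History, action: str) -> str:
    return "o"  # a trivial environment; the argument does not depend on it


r = contrived_reward(alternating)
print(discounted_return(alternating, r, env, 0.9, 50))  # maximal possible return
print(discounted_return(constant, r, env, 0.9, 50))     # loses reward after deviating
```

Whatever the environment does, the policy collects the maximal reward at every step under its own contrived reward function, so it is optimal for any prior, which is exactly why the unrestricted definition says nothing.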

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are “contrived”. However, description complexity is only naturally well-defined up to an additive constant. So, if we want to have a crisp concept, we need to consider an asymptotic in which the complexity of something goes to infinity. Indeed, it seems natural to ask that the complexity of the policy should be much higher than the complexity of the prior and the reward function: in this case we can say that the “intentional stance” is an efficient description. However, this doesn’t make sense with description complexity: the description “optimal policy for U and ζ” is of size K(U)+K(ζ)+O(1) (K(x) stands for “description complexity of x”).

To salvage this idea, we need to take not only description complexity but also computational complexity into account. [EDIT: I was wrong, and we can get a well-defined concept in the unbounded setting too, see child comment. The bounded concept is still interesting.] For the intentional stance to be non-vacuous we need to demand that the policy does some “hard work” in order to be optimal. Let’s make it formal. Consider any function of the type f:Σ∗→ΔΞ where Σ and Ξ are some finite alphabets. Then, we can try to represent it by a probabilistic automaton T:S×Σ→Δ(S×Ξ), where S is the finite state space, T is the transition kernel, and we’re feeding symbols into the automaton one by one. Moreover, T can be represented as a boolean circuit R and this circuit can be the output of some program P executed by some fixed universal Turing machine. We can associate with this object 5 complexity parameters:

The description complexity, which is the length of P.

The computation time complexity, which is the size of R.

The computation space complexity, which is the maximum between the depth of R and log|S|.

The precomputation time complexity, which is the time it takes P to run.

The precomputation space complexity, which is the space P needs to run.

It is then natural to form a single complexity measure by applying a logarithm to the times and taking a linear combination of all 5 (we apply a logarithm so that a brute force search over n bits is roughly equivalent to hard-coding n bits). The coefficients in this combination represent the “prices” of the various resources (but we should probably fix the price of description complexity to be 1). Of course not all coefficients must be non-vanishing, it’s just that I prefer to keep maximal generality for now. We will denote this complexity measure C.
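As a rough illustration of the combined measure, here is a minimal Python sketch. The class and field names (`AutomatonCosts`, `Prices`, `complexity`) are hypothetical, the logarithm base (2) is an assumption, and the price of description complexity is fixed to 1 as suggested above. Only the two time parameters are passed through a logarithm; the space parameters enter linearly.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AutomatonCosts:
    """The five raw parameters of a (P, R, S) representation (hypothetical names)."""
    description_length: int  # length of the program P, in bits
    circuit_size: int        # size of the boolean circuit R (computation time)
    circuit_depth: int       # depth of R
    num_states: int          # |S|, the number of automaton states
    precompute_time: int     # number of steps P takes to run
    precompute_space: int    # amount of space P needs to run


@dataclass(frozen=True)
class Prices:
    """Prices of the four computational resources; description price is fixed to 1."""
    comp_time: float = 1.0
    comp_space: float = 1.0
    pre_time: float = 1.0
    pre_space: float = 1.0


def complexity(costs: AutomatonCosts, prices: Prices) -> float:
    """Linear combination of the five parameters, with a log applied to the times."""
    # Computation space is the maximum of the circuit depth and log|S|.
    comp_space = max(costs.circuit_depth, math.log2(costs.num_states))
    return (
        costs.description_length
        + prices.comp_time * math.log2(costs.circuit_size)
        + prices.comp_space * comp_space
        + prices.pre_time * math.log2(costs.precompute_time)
        + prices.pre_space * costs.precompute_space
    )


example = AutomatonCosts(
    description_length=100,
    circuit_size=1024,
    circuit_depth=8,
    num_states=16,
    precompute_time=2 ** 20,
    precompute_space=50,
)
print(complexity(example, Prices()))
```

Setting all four prices to zero recovers plain description length, which matches the remark that not all coefficients need to be non-vanishing.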

We can use such automatons to represent policies, finite POMDP environments and reward functions (ofc not any policy or reward function, but any that can be computed on a machine with finite space). In the case of policies, the computation time/space complexity can be regarded as the time/space cost of applying the “trained” algorithm, whereas the precomputation time/space complexity can be regarded as the time/space cost of training. If we wish, we can also think of the boolean circuit as a recurrent neural network.

We can also use C to define a prior ζ0, by ranging over programs P that output a valid POMDP and assigning probability proportional to 2^{−C} to each instance. (Assuming that the environment has a finite state space might seem restrictive, but becomes quite reasonable if we use a quasi-Bayesian setting with quasi-POMDPs that are not meant to be complete descriptions of the environment; for now we won’t go into details about this.)
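To make the “probability proportional to 2^{−C}” construction concrete, here is a toy sketch over a finite, hand-picked set of hypotheses. The function name `prior_from_complexities` and the hypothesis labels are hypothetical; the actual prior ranges over all programs that output a valid POMDP, which this does not attempt.

```python
from typing import Dict


def prior_from_complexities(complexities: Dict[str, float]) -> Dict[str, float]:
    """Assign each hypothesis a weight proportional to 2^(-C) and normalize."""
    raw = {name: 2.0 ** (-c) for name, c in complexities.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


# Hypothetical complexities (in bits) for three toy environments.
zeta0 = prior_from_complexities({"env_a": 1.0, "env_b": 2.0, "env_c": 2.0})
print(zeta0)
```

Raising the price of a resource raises C for the hypotheses that use it, so their weight in this toy prior falls exponentially.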

Now, return to our policy π. Given g>0, we define that “π has goal-directed intelligence (at least) g” when there is a suitable prior ζ and utility function U s.t. for any policy π′, if E_{ζπ′}[U] ≥ E_{ζπ}[U] then C(π′) ≥ D_KL(ζ0||ζ)+C(U)+g. When g=+∞ (i.e. no finite automaton can match the expected utility of π; in particular, this implies π is optimal since any policy can be approximated by a finite automaton), we say that π is “perfectly goal-directed”. Here, D_KL(ζ0||ζ) serves as a way to measure the complexity of ζ, which also ensures ζ is non-dogmatic in some rather strong sense.

[EDIT: if we fix U and ζ then g is essentially the same as Yudkowsky’s definition of optimization power if we regard the policy as the “outcome” and use 2^{−C} as our measure on the space of outcomes.]
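For intuition, the definition can be evaluated on a toy finite set of candidate policies once U and ζ are fixed. The sketch below is only an illustration under strong assumptions: the real definition quantifies over all finite automata and takes the best choice of (U, ζ), whereas here the candidates, their expected utilities and their complexities are simply given as numbers. The function name `goal_directed_intelligence` is hypothetical.

```python
import math
from typing import List, Tuple


def goal_directed_intelligence(
    target_expected_utility: float,
    candidates: List[Tuple[float, float]],  # (expected utility, complexity C) pairs
    kl_of_prior: float,                     # stands in for D_KL(zeta0 || zeta)
    utility_complexity: float,              # stands in for C(U)
) -> float:
    """Largest g such that every candidate matching the target's expected
    utility has complexity at least kl_of_prior + utility_complexity + g."""
    matching = [c for eu, c in candidates if eu >= target_expected_utility]
    if not matching:
        # No candidate matches the target: "perfectly goal-directed" in this toy set.
        return math.inf
    return min(matching) - kl_of_prior - utility_complexity


toy_candidates = [(0.5, 10.0), (0.9, 40.0), (0.95, 70.0)]
print(goal_directed_intelligence(0.9, toy_candidates, kl_of_prior=5.0, utility_complexity=8.0))
```

A non-positive result means the pair (U, ζ) does not certify any goal-directedness for that policy, since the definition requires g>0.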

With this definition we cannot “cheat” by encoding the policy into the prior or into the utility function, since that would allow no complexity difference. Therefore this notion seems like a non-trivial requirement on the policy. On the other hand, this requirement does hold sometimes, because solving the optimization problem can be much more computationally costly than just evaluating the utility function or sampling the prior.

Actually, as opposed to what I claimed before, we don’t need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself.

Given g>0, we define that “π has (unbounded) goal-directed intelligence (at least) g” when there is a prior ζ and utility function U s.t. for any policy π′, if E_{ζπ′}[U] ≥ E_{ζπ}[U] then K(π′) ≥ D_KL(ζ0||ζ)+K(U)+g. Here, ζ0 is the Solomonoff prior and K is Kolmogorov complexity. When g=+∞ (i.e. no computable policy can match the expected utility of π; in particular, this implies π is optimal since any policy can be approximated by a computable policy), we say that π is “perfectly (unbounded) goal-directed”.

Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which is the maximum of the LH measure) becomes computable or even really stupid. For example, it can always keep taking the same action because of the fear that taking any other action leads to an inescapable “hell” state. On the other hand, goal-directed intelligence differs only by O(1) between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and the notion of which policies are such doesn’t depend on the UTM at all.

I think that it’s also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any ϵ>0 there is g s.t. the probability to get intelligence at least g is smaller than ϵ.

Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by O(1), and a policy that is perfectly goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for any price vector. Informally speaking, an agent that is very smart relative to a context with cheap computational resources is still very smart relative to a context where they are expensive, which makes intuitive sense.

If we choose just one computational resource, we can speak of the minimal price for which a given policy is perfectly goal-directed, which is another way to measure intelligence with a more restricted domain. Curiously, our bounded Solomonoff-like prior has the shape of a Maxwell-Boltzmann distribution in which the prices are thermodynamic parameters. Perhaps we can regard the minimal price as the point of a phase transition.

Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong’s thesis about the impossibility to deduce preferences from behavior alone.

1. Conjecture. Informally: Intelligence is unlikely to be produced by chance. Formally: Denote Π the space of deterministic policies, and consider some μ∈ΔΠ. Suppose μ is equivalent to a stochastic policy π∗. Then, E_{π∼μ}[g(π)] = O(C(π∗)).

2. Find an “intelligence hierarchy theorem”. That is, find an increasing sequence {g_n} s.t. for every n, there is a policy with goal-directed intelligence in (g_n, g_{n+1}) (no more and no less).

3. What is the computational complexity of evaluating g given (i) oracle access to the policy or (ii) description of the policy as a program or automaton?

4. What is the computational complexity of producing a policy with given g?

5. Conjecture. Informally: Intelligent agents have well defined priors and utility functions. Formally: For every (U,ζ) with C(U)<∞ and D_KL(ζ0||ζ)<∞, and every ϵ>0, there exists g∈(0,∞) s.t. for every policy π with intelligence at least g w.r.t. (U,ζ), and every (~U,~ζ) s.t. π has intelligence at least g w.r.t. them, any optimal policies π∗, ~π∗ for (U,ζ) and (~U,~ζ) respectively satisfy E_{ζ~π∗}[U] ≥ E_{ζπ∗}[U] − ϵ.

re: #5, that doesn’t seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn’t some U1≠U2 such that π∗(U1,ζ)=π∗(U2,ζ).

(And as pointed out elsewhere, it isn’t Stuart’s thesis, it’s a well known and basic result in the decision theory / economics / philosophy literature.)

You misunderstand the intent. We’re talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown U, but producing some behavior that optimizes the unknown U. Ofc if the policy you’re observing is optimal then it’s trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like “the policy you’re observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity.”

(Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.)

I am referring to this and related work by Armstrong.

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a “quasi-realizable” setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. But, so far I haven’t studied quasi-Bayesian learning algorithms much, so how do we know it is actually easier than the agnostic setting? Here is a simple example to demonstrate that it is.

Consider a multi-armed bandit, where the arm space is [0,1]. First, consider the following realizable setting: the reward is a deterministic function r:[0,1]→[0,1] which is known to be a polynomial of degree d at most. In this setting, learning is fairly easy: it is enough to sample d+1 arms in order to recover the reward function and find the optimal arm. It is a special case of the general observation that learning is tractable when the hypothesis space is low-dimensional in the appropriate sense.
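The realizable case can be sketched in a few lines of Python. This is an illustration, not part of the original argument: the function names (`lagrange_interpolant`, `find_best_arm`) are hypothetical, the d+1 sample points are chosen equally spaced, and the maximum is located by a simple grid search, which only approximates the true optimal arm.

```python
from typing import Callable, List


def lagrange_interpolant(xs: List[float], ys: List[float]) -> Callable[[float], float]:
    """Return the unique polynomial of degree < len(xs) passing through the points."""
    def poly(x: float) -> float:
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly


def find_best_arm(pull: Callable[[float], float], degree: int, grid_size: int = 10001) -> float:
    """Pull exactly degree+1 arms, recover the reward polynomial, maximize it on a grid."""
    xs = [i / degree for i in range(degree + 1)] if degree > 0 else [0.0]
    ys = [pull(x) for x in xs]  # the only interactions with the bandit
    poly = lagrange_interpolant(xs, ys)
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return max(grid, key=poly)  # grid search: an approximation of the argmax


# Toy reward: a degree-2 polynomial mapping [0,1] into [0,1], maximized at 0.3.
pulls = []

def toy_reward(x: float) -> float:
    pulls.append(x)
    return 1.0 - (x - 0.3) ** 2

best = find_best_arm(toy_reward, degree=2)
print(len(pulls), best)
```

Only the three initial pulls touch the bandit; everything after that is computation on the recovered polynomial.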

Now, consider a closely related agnostic setting. We can still assume the reward function is deterministic, but nothing is known about its shape and we are still expected to find the optimal arm. The arms form a low-dimensional space (one-dimensional actually) but this helps little. It is impossible to predict anything about any arm except those we already tested, and guaranteeing convergence to the optimal arm is therefore also impossible.

Finally, consider the following quasi-realizable setting: each incomplete hypothesis in our class states that the reward function is lower-bounded by a particular polynomial f:[0,1]→[0,1] of degree d at most. Our algorithm needs to converge to a reward which is at least the maximum of maxima of correct lower bounds. So, the desideratum is weaker than in the agnostic case, but we still impose no hard constraint on the reward function. In this setting, we can use the following algorithm. On each step, fit the most optimistic lower bound to those arms that were already sampled, find its maximum and sample this arm next. I haven’t derived the convergence rate, but it seems probable the algorithm will converge rapidly (for low d). This is likely to be a special case of some general result on quasi-Bayesian learning with low-dimensional priors.

Much of the orthodox LessWrongian approach to rationality (as it is expounded in Yudkowsky’s Sequences and onwards) is grounded in Bayesian probability theory. However, I now realize that pure Bayesianism is wrong, instead the right thing is quasi-Bayesianism. This leads me to ask, what are the implications of quasi-Bayesianism on human rationality? What are the right replacements for (the Bayesian approach to) bets, calibration, proper scoring rules et cetera? Does quasi-Bayesianism clarify important confusing issues in regular Bayesianism such as the proper use of inside and outside view? Is there rigorous justification to the intuition that we should have more Knightian uncertainty about questions with less empirical evidence? Does any of it influence various effective altruism calculations in surprising ways? What common LessWrongian wisdom does it undermine, if any?

Probably not too original but I haven’t seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to infiltrate the present world. We can try to defend by adding a button for “malign AI is attacking”, but that still leaves us open to surprise takeovers in which there is no chance to press the button.

Amplifying by subjective time: The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a “groundhog day” scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than amplifying by objective time, because learning previous results is an overhead, and illegible intuitions might be hard to transmit. This is safer than amplifying by objective time, but if there is some probability of malign AI created in the short time period, there is still an attack vector. The malign AI leakage in this method is roughly proportional to subjective time of simulation times the present rate of malign AI takeover, as opposed to amplification by objective time where leakage is proportional to subjective time of simulation times some average future rate of malign AI takeover. However, by the time we are able to create this benign AI, the present rate of malign AI takeover might also be considerable.

Amplifying by probability: We allow the user(s) to choose “success” or “failure” (or some continuous metric) after completing their work, and make the AI skew the distribution of predictions toward success. This is similar to amplifying by subjective time without any transmission of information. It is weaker and about as safe. The potential advantage is lower sample complexity: the AI only needs to have a reliable distribution of outcomes after the initial state instead of subsequent states.

Amplifying by parallelization: The AI is predicting the output of many copies of the user working together, by having strictly defined interfaces between the copies, over a time period similar to real time. For example, we can imagine a hierarchical organization where each person gives subtasks to their subordinates. We can then simulate such an organization with a copy of some subset of users in each role. To do this, the AI only needs to learn what a given subset of users would do given a particular task from their supervisors and particular results by their subordinates. This method is weaker than previous methods since it requires that the task at hand can be parallelized. ~~But, it is also the safest since the rate of malign AI takeover is only amplified by O(1) compared to the background.~~ [EDIT: Actually, it’s not safer than subjective time because the AI would sample the external world independently for each node in the organization. To avoid this, we would need to somehow define a correspondence between the outcome sets of worlds in which the user was queried at different nodes, and I don’t know how to do this.]

A complete solution can try to combine all of those methods, by simulating a virtual organization where the members can control which method is applied at every point. This way they can strive for the optimal risk-performance balance: parallelize everything that can be parallelized and amplify in other ways the tasks that cannot be parallelized, change the subjective/objective time balance based on research into malign AI timelines etc.

More observations about this attack vector (“attack from counterfactuals”). I focus on “amplifying by subjective time”.

The harder the takeoff the more dangerous this attack vector: During every simulation cycle, ability to defend against simulated malign AI depends on the power of the defense system in the beginning of the cycle^{[1]}. On the other hand, the capability of the attacker depends on its power in the end of the cycle. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier.

Inner control of anchor makes system safer: Given a fixed objective time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective time anchor forward in time, in order to benefit from improvements in the defense system.

Additional information about the external world makes system safer: Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 63% (since 1 − 0.999^1000 ≈ 0.63). However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not, the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.

More precisely, it is somewhat better than this since, if we are at some late cycle among a sequence of cycles anchored to same point in objective time, the simulated user can use knowledge generated in previous cycles to improve the defense system during the current cycle. So, early cycles contribute more to the risk. ↩︎
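The arithmetic in the 1000-cycle example can be checked with a short sketch. Under independence, the probability that no attack succeeds in n cycles is (1 − p)^n, which for p = 0.1% and n = 1000 is about 37%, so the probability of at least one successful attack is about 63%. The function names are hypothetical, and the “fully correlated” case is modeled in the simplest possible way, as a single shared event.

```python
def failure_probability_independent(p_single: float, cycles: int) -> float:
    """Probability of at least one successful attack when cycles are independent."""
    return 1.0 - (1.0 - p_single) ** cycles


def failure_probability_fully_correlated(p_single: float, cycles: int) -> float:
    """If the outcome is the same in every cycle (the AI knows it with certainty),
    repeating the cycle adds no risk beyond the single-cycle probability."""
    return p_single


p, n = 0.001, 1000
print(failure_probability_independent(p, n))
print(failure_probability_fully_correlated(p, n))
```

Intermediate amounts of knowledge on the AI's side would fall between these two extremes.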

I think this would make a good top-level post. I have the feeling I’ll want to link to it later.

I retracted part of that, see the edit.

In Hanson’s futarchy, the utility function of the state is determined by voting but the actual policy is determined by a prediction market. But, voting incentivizes misrepresenting your values to get a larger share of the pie. So, shouldn’t it be something like the VCG mechanism instead?

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di—es—can-ic-ul-ar—es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a “subjective regret bound”: that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communication protocol between the AI and the user that will allow transmitting this knowledge state to the AI (including knowledge about the user’s values). Dialogic RL attacks this problem in the manner which seems the most straightforward and powerful: allowing the AI to ask the user questions in some highly expressive formal language, which we will denote F.

F allows making formal statements about a formal model M of the world, as seen from the AI’s perspective. M includes such elements as observations, actions, rewards and corruption. That is, M reflects (i) the dynamics of the environment (ii) the values of the user (iii) processes that either manipulate the user, or damage the ability to obtain reliable information from the user. Here, we can use different models of values: a traditional “perceptible” reward function, an instrumental reward function, a semi-instrumental reward function, dynamically-inconsistent rewards, rewards with Knightian uncertainty etc. Moreover, the setup is self-referential in the sense that M also reflects the question-answer interface and the user’s behavior.

A single question can consist, for example, of asking for the probability of some sentence in F or the expected value of some expression of numerical type in F. However, in order to address important features of the world, such questions have to be very complex. It is infeasible to demand that the user understands such complex formal questions unaided. Therefore, the AI always produces a formal question qF together with a natural language (N) annotation qN. This annotation has to explain the question in human understandable terms, and also convince the user that qN is indeed an accurate natural language rendering of qF. The user’s feedback then consists of (i) accepting/rejecting/grading the annotation (ii) answering the question if the annotation is correct and the user can produce the answer. Making this efficient requires a process of iteratively constructing a correspondence between N and F, i.e. effectively building a new shared language between the user and the AI. We can imagine concepts defined in F and explained in N that serve to define further, more complex, concepts, where at each stage the previous generation of concepts can be assumed given and mutually understandable. In addition to such intensional definitions we may also allow extensional definitions, as long as the generalization is assumed to be via some given function space that is relatively restricted (e.g. doesn’t admit subagents). There seem to be some strong connections between the subproblem of designing the annotation system and the field of transparency in AI.

The first major concern that arises at this point is that questions can serve as an attack vector. This is addressed by quantilization. The key assumption is:

it requires much less optimization power to produce some useful question than to produce a malicious question. Under this assumption, the quantilization parameter can be chosen to make the question interface safe but still effective. Over time, the agent accumulates knowledge about corruption dynamics that allows it to steer even further away from malicious questions while making the choice of questions even more effective. For the attack vector of deceitful annotations, we can improve safety using the debate approach, i.e. having the agent produce additional natural language text that attempts to refute the validity of the annotation.

Of course, in addition to the question interface, the physical interface (direct interaction with environment) is also an attack vector (like in any RL system). There, safety is initially guaranteed by following a baseline policy (which can be something like “do nothing” or human imitation). Later, the agent starts deviating from the baseline policy while staying safe, by leveraging the knowledge it previously gained through both the question and the physical interface. Besides being safe, the algorithm also needs to be effective, and for this it has to (in particular) find the learning strategy that optimally combines gaining knowledge through the question interface and gaining knowledge through autonomous exploration.

Crucially, we want our assumptions about user competence to be weak. This means that the user can produce answers that are (i) incomplete (just refuse to answer) (ii) fickle (change eir answers) and (iii) inconsistent (contradictory answers). We address (i) by either assuming that the answerable questions are sufficient or requiring a weaker regret bound where the reference agent knows all obtainable answers rather than knowing the full knowledge state they refer to. We address (ii) by believing later answers over earlier answers and via the commitment mechanism (see below). We address (iii) by explaining the contradictions to the user and asking for their resolution. In addition to all that, the AI can predict the answers instead of actually asking (i.e. the user’s knowledge state is also assumed to be regular to some extent). The commitment mechanism is a key tool for achieving optimal trade-off between allowing the user more time to think vs. acting quickly enough to address external dangers, while keeping the user informed of the consequences. It works by having the AI ask the user for a committed answer at the point of time when the AI has to commit to an irreversible course of action (and not much earlier), based on this particular answer. The AI provides the user with a description of the Pareto frontier for the different possible answers. Moreover, even committed answers can be changed later. In this case, the algorithm cannot guarantee global optimality but it should still guarantee “subgame perfection”: i.e., whatever irreversible mistakes were made, the AI recovers as best as possible.

In order to improve the user’s ability to produce answers and stay informed about the AI’s strategy, we introduce another interface through which the user can ask the AI questions. In this direction, the user asks the question in N and the AI both translates it to F and provides a justification (as well as a refutation) of the translation, again employing quantilization. The user can then accept or reject the translation. If the translation is rejected, ey may require the AI to try again. If it is accepted, the AI produces an answer to the best of its ability. In general, it may be that producing an answer requires compiling a formal proof, in which case the AI may or may not succeed. Here, there is concern about the scenario where (i) the user asks a question about a possible future (ii) the AI answers with a detailed description of the future (iii) as a result, the user is exposed to infohazards (sources of corruption) within this future. To address this, we need to either restrict the interface to small volumes of data, or shape the prior s.t. the AI can on its own realize the inherent danger. In the latter approach, the AI can refuse to answer or provide a “censored” answer while pointing out the presence of a potential infohazard.

Finally, the subjective regret bound approach might seem to impose a strong cap on capability: if the AI follows a policy optimal w.r.t. the user’s knowledge state, then the limited computational abilities of the user might prevent the AI from creating models of the world that are more detailed. This can be addressed in a Turing Reinforcement Learning setting, by allowing the user’s knowledge state to refer to arbitrary computations. The user can then be incapable of executing those computations, but the AI might be more able to do so.
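The quantilization step used for the question interface above can be illustrated with a generic sketch. It assumes the usual formulation of a quantilizer: draw candidates from some trusted base distribution, rank them by estimated usefulness, and pick at random among the top q fraction rather than taking the single best one. The function name `quantilize`, the candidate set and the scoring function are all hypothetical stand-ins, not the scheme's actual components.

```python
import math
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def quantilize(
    base_samples: List[T],
    usefulness: Callable[[T], float],
    q: float,
    rng: random.Random,
) -> T:
    """Pick uniformly at random among the top q fraction of base samples."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=usefulness, reverse=True)
    keep = max(1, math.ceil(q * len(ranked)))  # always keep at least one candidate
    return rng.choice(ranked[:keep])


# Toy usage: candidate "questions" are integers, usefulness is the value itself.
candidates = list(range(100))
chosen = quantilize(candidates, usefulness=float, q=0.1, rng=random.Random(0))
print(chosen)
```

Smaller q means more optimization pressure and better expected usefulness, but also a higher chance of landing on a rare malicious candidate; q = 1 reduces to sampling from the base distribution.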

I gave a talk on Dialogic Reinforcement Learning in the AI Safety Discussion Day, and there is a recording.

A variant of Dialogic RL with improved corrigibility. Suppose that the AI’s prior allows a small probability for “universe W” whose semantics are, roughly speaking, “all my assumptions are wrong, need to shut down immediately”. In other words, this is a universe where all our prior shaping is replaced by the single axiom that shutting down is much higher utility than anything else. Moreover, we add into the prior the assumption that the formal question “W?” is understood perfectly by the user even without any annotation. This means that, whenever the AI assigns a higher-than-threshold probability to the user answering “yes” if asked “W?” at any uncorrupt point in the future, the AI will shut down immediately. We should also shape the prior s.t. corrupt futures also favor shutdown: this is reasonable in itself, but will also ensure that the AI won’t arrive at believing too many futures to be corrupt and thereby avoid the imperative to shut down as a response to a confirmation of W.

Now, this won’t help if the user only resolves to confirm W after something catastrophic already occurred, such as the AI releasing malign subagents into the wild. But, something of the sort is true for any corrigibility scheme: corrigibility is about allowing the user to make changes in the AI on eir own initiative, which can always be too late. This method doesn’t ensure safety in itself, just hardens a system that is supposed to be already close to safe.

It would be nice if we could replace “shutdown” by “undo everything you did and then shutdown” but that gets us into thorny specification issues. Perhaps it’s possible to tackle those issues by one of the approaches to “low impact”.

Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases leading to shutdown. This serves as an additional “sanity test” precaution.

This design is made of so many parts! It might benefit from a proof that it is the universal answer to a formal question, which I expect to seem less overly complex.

I am not sure. AI alignment seems to touch on many different aspects of the world, and it is not obvious that it can be reduced to assumptions that are extremely simple and natural. Or, if it can be reduced that way, then it might require a theory that on some level explains human civilization, its evolution and its influence on the world (even if only on a fairly abstract level). I will share some thoughts on how the various assumptions can be reduced another step back, but proceeding to reduce all of them to a simple core seems like a challenging research programme.

Most of the parts of this design can be regarded as reflecting particular assumptions we make about the user as an agent.

The core idea of having a dialogue comes from modeling the user as a “linguistic agent”. Such agents may be viewed as nodes in a distributed AI system, but where each node has different objectives. It is an interesting philosophical question whether this assumption is necessary for value learning. It currently seems plausible to me that only for linguistic agents “values” are truly well-defined, or at least sufficiently well-defined to extrapolate them outside the trajectory that the agent follows on its own.

The need to quantilize, debate and censor infohazards comes from the assumption that the user can be manipulated (there is some small fraction of possible inputs that invalidate the usual assumptions about the user’s behavior). Specifically debate might be possible to justify by some kind of Bayesian framework where every argument is a piece of evidence, and providing biased arguments is like providing selective evidence.

The need to deal with “incoherent” answers and the commitment mechanism comes from the assumption the user has limited access to its own knowledge state (including its own reward function). Perhaps we can formalize it further by modeling the user as a learning algorithm with some intrinsic source of information. Perhaps we can even explain why such agents are natural in the “distributed AI” framework, or by some evolutionary argument.

The need to translate between formal and natural languages comes from not knowing the “communication protocol” of the “nodes”. Formalizing this idea further requires some more detailed model of what “natural language” is, which might be possible via multi-agent learning theory.

Finally, the need to start from a baseline policy (and also the need to quantilize) comes from the assumption that the environment is not entirely secure. So that’s an assumption about the current state of the world, rather than about the user. Perhaps we can make formal the argument that this state of the world (short-term stable, long-term dangerous) is to be expected when agents have populated it for a long time.

In the past I considered the learning-theoretic approach to AI theory as somewhat opposed to the formal logic approach popular in MIRI (see also discussion):

Learning theory starts from formulating natural desiderata for agents, whereas “logic-AI” usually starts from postulating a logic-based model of the agent ad hoc.

Learning theory naturally allows analyzing computational complexity, whereas logic-AI often uses models that are either clearly intractable or even clearly incomputable from the outset.

Learning theory focuses on objects that are observable or finite/constructive, whereas logic-AI often considers objects that are unobservable, infinite and unconstructive (which I consider to be a philosophical error).

Learning theory emphasizes induction whereas logic-AI emphasizes deduction.

However, recently I noticed that quasi-Bayesian reinforcement learning and Turing reinforcement learning have very suggestive parallels to logic-AI. TRL agents have beliefs about computations they can run on the envelope: these are essentially beliefs about mathematical facts (but, we only consider computable facts and computational complexity plays some role there). QBRL agents reason in terms of hypotheses that have logical relationships between them: the order on functions corresponds to implication, taking the minimum of two functions corresponds to logical “and”, taking the concave hull of two functions corresponds to logical “or” (but, there is no “not”, so maybe it’s a sort of intuitionistic logic?). In fact, fuzzy beliefs form a continuous dcpo, and considering some reasonable classes of hypotheses probably leads to algebraic dcpo-s, suggesting a strong connection with domain theory (also, it seems like considering beliefs within different ontologies leads to a functor from some geometric category (the category of ontologies) to dcpo-s).

These parallels suggest that the learning theory of QBRL/TRL will involve some form of deductive reasoning and some type of logic. But, this doesn’t mean that QBRL/TRL is redundant w.r.t. logic AI! In fact, QBRL/TRL might lead us to discover exactly which type of logic intelligent agents need and what role logic should play in the theory and inside the algorithms (instead of trying to guess and impose the answer ad hoc, which IMO did not work very well so far). Moreover, I think that the type of logic we are going to get will be something finitist/constructivist, and in particular this is probably how Goedelian paradoxes will be avoided. However, the details remain to be seen.

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving “Omega” (something that predicts the agent’s decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent “Murphy” as in Murphy’s law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding maximin value in pure strategies. (The stochastic version can be regarded as a special case of the deterministic version where the agent has access to an external random number generator that is hidden from the rest of the environment according to the hypothesis.) To every decision problem, we can now associate an incomplete hypothesis as follows. Every time Omega makes a prediction about the agent’s future action in some counterfactual, we have Murphy make a guess instead. This guess cannot be directly observed by the agent. If the relevant counterfactual is realized, then the agent’s action renders the guess false or true. If the guess is false, the agent receives infinite (or, sufficiently large) reward. If the guess is true, everything proceeds as usual. The maximin value then corresponds to the scenario where the guess is true and the agent behaves as if its action controls the guess. (Which is exactly what FDT and its variants try to achieve.)

For example, consider (repeated) counterfactual mugging. The incomplete hypothesis is a partially observable stochastic game (between the agent and Murphy), with the following states:

s0: initial state. Murphy has two actions: g+ (guess the agent will pay), transitioning to s1+ and g− (guess the agent won’t pay) transitioning to s1−. (Reward = 0)

s1+: Murphy guessed the agent will pay. Transitions to s2a+ or s2b+ with probability $\frac{1}{2}$ to each (the coin flip). (Reward = 0)

s1−: Murphy guessed the agent won’t pay. Transitions to s2a− or s2b− with probability $\frac{1}{2}$ to each (the coin flip). (Reward = 0)

s2a+: Agent receives the prize. Transitions to s3u. (Reward = +1)

s2b+: Agent is asked for payment. Agent has two actions: p+ (pay) transitioning to s3r+ and p− (don’t pay) transitioning to s3w−. (Reward = 0)

s2a−: Agent receives nothing. Transitions to s3u. (Reward = 0)

s2b−: Agent is asked for payment. Agent has two actions: p+ (pay) transitioning to s3w+ and p− (don’t pay) transitioning to s3r−. (Reward = 0)

s3u: Murphy’s guess remained untested. Transitions to s0. (Reward = 0)

s3r+: Murphy’s guess was right, agent paid. Transitions to s0. (Reward = −0.1)

s3r−: Murphy’s guess was right, agent didn’t pay. Transitions to s0. (Reward = 0)

s3w+: Murphy’s guess was wrong, agent paid. Transitions to s0. (Reward = +1.9)

s3w−: Murphy’s guess was wrong, agent didn’t pay. Transitions to s0. (Reward = +2)

The only percepts the agent receives are (i) the reward and (ii) whether it is asked for payment or not. The agent’s maximin policy is paying, since it guarantees an expected reward of $\frac{1}{2}\cdot 1+\frac{1}{2}\cdot(-0.1)=0.45$ per round.
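The maximin claim can be checked by brute force over the two stationary pure policies of the agent and the two guesses available to Murphy. The following is a minimal sketch, assuming the per-round rewards listed in the states above; the function names are hypothetical and introduced only for illustration.

```python
# Per-round reward of the counterfactual-mugging game, read off the state list.
# All function names here are illustrative, not part of the formalism.

def round_reward(guess_pay: bool, asked: bool, agent_pays: bool) -> float:
    if not asked:
        # s2a+ / s2a-: the prize is given iff Murphy guessed "pay"; guess stays untested
        return 1.0 if guess_pay else 0.0
    if agent_pays == guess_pay:
        # s3r+ / s3r-: Murphy's guess was right
        return -0.1 if agent_pays else 0.0
    # s3w+ / s3w-: Murphy's guess was wrong (the large "Nirvana"-style reward)
    return 1.9 if agent_pays else 2.0

def expected_reward(agent_pays: bool, guess_pay: bool) -> float:
    # Fair coin: with probability 1/2 the agent is asked for payment.
    return 0.5 * round_reward(guess_pay, False, agent_pays) \
         + 0.5 * round_reward(guess_pay, True, agent_pays)

def worst_case(agent_pays: bool) -> float:
    # Murphy picks the guess that minimizes the agent's expected reward.
    return min(expected_reward(agent_pays, g) for g in (True, False))

# Maximin over the agent's two pure policies.
best_policy_pays = max((True, False), key=worst_case)
print("pay:", round(worst_case(True), 2), "refuse:", round(worst_case(False), 2))
```

Under these numbers the paying policy is worst-case when Murphy guesses “pay”, and the refusing policy is worst-case when Murphy guesses “won’t pay”, so in both cases the minimizing guess is the correct one.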

We can generalize this to an imperfect predictor (a predictor that sometimes makes mistakes), by using the same construction but adding noise to Murphy’s guess for purposes other than the guess’s correctness. Apparently, we can also generalize to the variant where the agent can randomize against Omega and Omega decides based on its predictions of the probabilities. This, however, is more complicated. In this variant there is no binary notion of “right” and “wrong” guess. Instead, we need to apply some statistical test to the guesses and compare it against a threshold. We can then consider a family of hypotheses with different thresholds, such that (i) with probability 1, for all but some finite number of thresholds, accurate guesses would never be judged wrong by the test (ii) with probability 1, consistently inaccurate guesses will be judged wrong by the test, with any threshold.

The same construction applies to logical counterfactual mugging, because the agent cannot distinguish between random and pseudorandom (by definition of pseudorandom). In TRL there would also be some family of programs the agent could execute s.t., according to the hypothesis, their outputs are determined by the same “coin flips” as the offer to pay. However, this doesn’t change the optimal strategy: the “logical time of precommitment” is determined by the computing power of the “core” RL agent, without the computer “envelope”.

My takeaway from this is that if we’re doing policy selection in an environment that contains predictors, instead of applying the counterfactual belief that the predictor is always right, we can assume that we get rewarded if the predictor is wrong, and then take maximin.

How would you handle Agent Simulates Predictor? Is that what TRL is for?

That’s about right. The key point is, “applying the counterfactual belief that the predictor is always right” is not really well-defined (that’s why people have been struggling with TDT/UDT/FDT for so long) while the thing I’m doing is perfectly well-defined. I describe agents that are able to learn which predictors exist in their environment and respond rationally (“rationally” according to the FDT philosophy).

TRL is for many things to do with rational use of computational resources, such as (i) doing multi-level modelling in order to make optimal use of “thinking time” and “interacting with environment time” (i.e. simultaneously optimize sample and computational complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian daemons (iv) preventing thought crimes. But, yes, it also provides a solution to ASP. TRL agents can learn whether it’s better to be predictable or predicting.

“The key point is, “applying the counterfactual belief that the predictor is always right” is not really well-defined”—What do you mean here?

I’m curious whether you’re referring to the same as or similar to the issue I was referencing in Counterfactuals for Perfect Predictors. The TLDR is that I was worried that it would be inconsistent for an agent that never pays in Parfit’s Hitchhiker to end up in town if the predictor is perfect, so that it wouldn’t actually be well-defined what the predictor was predicting. And the way I ended up resolving this was by imagining it as an agent that takes input and asking what it would output if given that inconsistent input. But not sure if you were referencing this kind of concern or something else.

It is not a mere “concern”, it’s the crux of the problem really. What people in the AI alignment community have been trying to do is start with some factual and “objective” description of the universe (such as a program or a mathematical formula) and derive counterfactuals. The way it’s supposed to work is, the agent needs to locate all copies of itself or things “logically correlated” with itself (whatever that means) in the program, and imagine it is controlling this part. But a rigorous definition of this that solves all standard decision theoretic scenarios was never found.

Instead of doing that, I suggest a solution of a different nature. In quasi-Bayesian RL, the agent never arrives at a factual and objective description of the universe. Instead, it arrives at a subjective description which already includes counterfactuals. I then proceed to show that, in Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the same expected utility promised by UDT).

Yeah, I agree that the objective descriptions can leave out vital information, such as how the information you know was acquired, which seems important for determining the counterfactuals.

But in Newcomb’s problem, the agent’s reward in case of wrong prediction is already defined. For example, if the agent one-boxes but the predictor predicted two-boxing, the reward should be zero. If you change that to +infinity, aren’t you open to the charge of formalizing the wrong problem?

The point is, if you put this “quasi-Bayesian” agent into an iterated Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward associated with FDT). So, if you’re judging it from the side, you will have to concede it behaves rationally, regardless of its internal representation of reality.

Philosophically, my point of view is, it is an error to think that counterfactuals have objective, observer-independent meaning. Instead, we can talk about some sort of consistency conditions between the different points of view. From the agent’s point of view, it would reach Nirvana if it dodged the predictor. From Omega’s point of view, if Omega two-boxed and the agent one-boxed, the agent’s reward would be zero (and the agent would learn its beliefs were wrong). From a third-person point of view, the counterfactual “Omega makes an error of prediction” is ill-defined: it’s conditioning on an event of probability 0.

Yeah, I think I can make peace with that. Another way to think of it is that we can keep the reward structure of the original Newcomb’s problem, but instead of saying “Omega is almost always right” we add another person Bob (maybe the mad scientist who built Omega) who’s willing to pay you a billion dollars if you prove Omega wrong. Then minimaxing indeed leads to one-boxing. Though I guess the remaining question is why minimaxing is the right thing to do. And if randomizing is allowed, the idea of Omega predicting how you’ll randomize seems a bit dodgy as well.

Another explanation why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs, the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. So general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form

$$\lim_{\gamma\to 1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq f(\mu)$$

Here, f doesn’t have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left hand side is linear in μ so any π that satisfies this will also satisfy it for the concave hull of f.

What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is

$$\lim_{\gamma\to 1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq V(\mu,\gamma)-f(\mu)$$

But, it has the same form! Therefore we can consider it as a special case of applying maximin (more precisely, it requires allowing the fuzzy belief to depend on γ, but this is not a problem for the basics of the formalism).
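To spell out the reduction, here is a short sketch. The symbol $f'$ is introduced here purely for illustration, and $V(\mu,\gamma)$ is read as the optimal expected utility in environment $\mu$ at discount $\gamma$, which is an assumption since the text does not define it. The regret bound says the shortfall from $V$ is at most $f(\mu)$, so collecting the right-hand side into a single γ-dependent bound gives a condition of the same shape as the maximin desideratum:

```latex
\begin{align*}
f'(\mu,\gamma) &:= V(\mu,\gamma) - f(\mu) \\
\lim_{\gamma\to 1}\Bigl(\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)] - f'(\mu,\gamma)\Bigr) &\geq 0
\end{align*}
```

The only new feature is that the lower bound $f'$ now depends on $\gamma$, which is exactly the extension mentioned above.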

What if we want our policy to be at least as good as some fixed policy $\pi'_0$? Then the desideratum is

$$\lim_{\gamma\to 1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq \mathrm{E}_{\mu\pi'_0}[U(\gamma)]$$

It still has the same form!

Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form:

$$\lim_{\gamma\to 1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq f(\pi,\mu)$$

To achieve this, we postulate a predictor that guesses the policy, producing the guess $\hat{\pi}$, and define the fuzzy belief using the function $\mathrm{E}_{h\sim\mu}[f(\hat{\pi}(h),\mu)]$ (we assume the guess is not influenced by the agent’s actions so we don’t need π in the expected value). Using the Nirvana trick, we effectively force the guess to be accurate.

In particular, this captures self-referential desiderata of the type “the policy cannot be improved by changing it in this particular way”. These are of the form:

$$\lim_{\gamma\to 1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq \mathrm{E}_{\mu F(\pi)}[U(\gamma)]$$

It also allows us to effectively restrict the policy space (e.g. impose computational resource constraints) by setting f(π,μ) to 1 for policies outside the space.

The fact that quasi-Bayesian RL is so general can also be regarded as a drawback: the more general a framework, the less information it contains and the fewer useful constraints it imposes. But, my perspective is that QBRL is the correct starting point, after which we need to start proving results about which fuzzy hypothesis classes are learnable, and within what sample/computational complexity. So, although QBRL in itself doesn’t impose many restrictions on what the agent should be, it provides the natural language in which desiderata should be formulated. In addition, we can already guess/postulate that an ideal rational agent should be a QBRL agent whose fuzzy prior is universal in some appropriate sense.

Well, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasi-Bayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge at least to the maximin for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more “philosophical” defense of maximin is possible, analogous to VNM / complete class theorems, but I don’t know (I actually saw some papers in that vein but haven’t read them in detail).

If the agent has random bits that Omega doesn’t see, and Omega is predicting the probabilities of the agent’s actions, then I think we can still solve it with quasi-Bayesian agents but it requires considering more complicated models and I haven’t worked out the details. Specifically, I think that we can define some function X that depends on the agent’s actions and Omega’s predictions so far (a measure of Omega’s apparent inaccuracy), s.t. if Omega is an accurate predictor, then the supremum of X over time is finite with probability 1. Then, we consider a family of models, where model number n says that X<n for all times. Since at least one of these models is true, the agent will learn it, and will converge to behaving appropriately.

EDIT 1: I think X should be something like, how much money would a gambler following a particular strategy win, betting against Omega.

EDIT 2: Here is the solution. In the case of original Newcomb, consider a gambler that bets against Omega on the agent one-boxing. Every time the agent two-boxes, the gambler loses 1 dollar. Every time the agent one-boxes, the gambler wins $\frac{1}{p}-1$ dollars, where p is the probability Omega assigned to one-boxing. Now it’s possible to see that one-boxing guarantees the “CC” payoff under the corresponding model (in the γ→1 limit): If the agent one-boxes, the gambler keeps winning unless Omega converges to one-boxing rapidly enough. In the case of a general Newcomb-like problem, just replace “one-boxes” by “follows the FDT strategy”.
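The gambler’s bookkeeping is easy to simulate. The sketch below is only an illustration under assumed inputs: the two Omega prediction sequences are made up, the function name is hypothetical, and every predicted probability is assumed to be strictly positive.

```python
def gambler_winnings(omega_probs, agent_one_boxes):
    """Running winnings of a gambler betting on one-boxing at Omega's odds.

    omega_probs[t] is the probability Omega assigned to one-boxing in round t
    (assumed > 0); agent_one_boxes[t] is what the agent actually did.
    """
    total = 0.0
    history = []
    for p, one_box in zip(omega_probs, agent_one_boxes):
        # Fair odds: win 1/p - 1 on a one-box, lose 1 on a two-box.
        total += (1.0 / p - 1.0) if one_box else -1.0
        history.append(total)
    return history

T = 50
always_one_box = [True] * T

# Hypothetical Omega that rapidly becomes confident: p_t = 1 - 2^(-t).
fast = gambler_winnings([1 - 2.0 ** -t for t in range(1, T + 1)], always_one_box)

# Hypothetical Omega that never updates from 50/50.
stubborn = gambler_winnings([0.5] * T, always_one_box)

print(fast[-1], stubborn[-1])
```

Against the fast-converging Omega the winnings stay bounded, so some model “X<n” in the family holds; against the stubborn Omega they grow linearly, so every such model is eventually violated.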

I agree that you can assign whatever belief you want (e.g. whatever is useful for the agent’s decision making process) for what happens in the counterfactual when Omega is wrong, in decision problems where Omega is assumed to be a perfect predictor. However if you want to generalise to cases where Omega is an imperfect predictor (as you do mention), then I think you will (in general) have to put in the correct reward for Omega being wrong, because this is something that might actually be observed.

The method should work for imperfect predictors as well. In the simplest case, the agent can model the imperfect predictor as perfect predictor + random noise. So, it definitely knows the correct reward for Omega being wrong. It still believes in Nirvana if “idealized Omega” is wrong.

Epistemic status: moderately confident, based on indirect evidence

I realized that it is very hard to impossible to publish an academic work that takes more than one conceptual inferential step away from the current paradigm. Especially when the inferential steps happen in different fields of knowledge.

You cannot publish a paper where you use computational learning theory to solve metaphysics, and then use the new metaphysics to solve the interpretation of quantum mechanics. A physics publication will not understand the first part, or even understand how it can be relevant. As a result, they will also fail to understand the second part. A computer science publication will not understand or be interested in the second part.

Publishing the two parts separately one after the other also won’t work. The first part might be accepted, but the reviewers of the second part won’t be familiar with it, and the same problems will resurface. The only way to win seems to be: publish the first part, wait until it becomes widely accepted, and only then publish the second part.

Hmm. I think I need more detail on your model of publishing and wide-acceptance and their relationship to truth. It seems likely that unless they’re circularly dependent, you can publish the smaller-departure in parallel with exploring the further implications in different journals, and in research agendas rather than results publication.

So there’s journals of X, Y, and Z, but not XYZ?

(In hindsight this sounds obvious, though the only obvious alternatives would be:

It’s hard, but the hardness is in figuring out which place can handle the combination/complexity

Publishing anything is hard (or there’s a limit to the time/space allocated per month, and timing matters))

I find it interesting to build simple toy models of the human utility function. In particular, I was thinking about the aggregation of value associated with other people. In utilitarianism this question is known as “population ethics” and is infamously plagued with paradoxes. However, I believe that is the result of trying to be impartial. Humans are very partial and this allows coherent ways of aggregation. Here is my toy model:

Let Alice be our viewpoint human. Consider all social interactions Alice has, categorized by some types or properties, and assign a numerical weight to each type of interaction. Let $i_t(A,B)>0$ be the weight of the interaction person A had with person B at time t (if there was no interaction at this time then $i_t(A,B)=0$). Then, we can define Alice’s affinity to Bob as

$$\mathrm{aff}_t(\mathrm{Alice},\mathrm{Bob}):=\sum_{s=-\infty}^{t}\alpha^{t-s}\,i_s(\mathrm{Alice},\mathrm{Bob})$$

Here α∈(0,1) is some constant. Of course, $\alpha^{t-s}$ can be replaced by many other functions.

Now, we can define the social distance of Alice to Bob as

$$d_t(\mathrm{Alice},\mathrm{Bob}):=\inf_{p_1\ldots p_n:\;p_1=\mathrm{Alice},\,p_n=\mathrm{Bob}}\;\sum_{k=1}^{n-1}\mathrm{aff}_t(p_k,p_{k+1})^{-\beta}$$

Here β>0 is some constant, and the power law was chosen rather arbitrarily, there are many functions of aff that can work. Dead people should probably count in the infimum, but their influence wanes over time since they don’t interact with anyone (unless we count consciously thinking about a person as an interaction, which we might).

This is a time-dependent metric (or quasimetric, if we allow for asymmetric interactions such as thinking about someone or admiring someone from afar) on the set of people. If i is bounded and there is a bounded number of people Alice can interact with at any given time, then there is some C>1 s.t. the number of people within distance r from Alice is $O(C^r)$. We now define the reward as

$$r_t(\mathrm{Alice}):=\sum_{p}\lambda^{d_t(\mathrm{Alice},p)}\,w_t(p)$$

Here $\lambda\in(0,\frac{1}{C})$ is some constant and $w_t(p)$ is the “welfare” of person p at time t, or whatever is the source of value of people for Alice. Finally, the utility function is a time discounted sum of rewards, probably not geometric (because hyperbolic discounting is a thing). It is also appealing to make the decision rule to be minimax-regret over all sufficiently long time discount parameters, but this is tangential.
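The three definitions compose into a short computation: accumulate discounted interaction weights into affinities, run a shortest-path search with edge cost aff^(−β), and sum welfare weighted by λ^distance. The following is a minimal sketch, assuming a finite interaction log and directed interactions (the quasimetric case); the constants, the data format, and all names are illustrative choices, not part of the model.

```python
import heapq
from collections import defaultdict

# Illustrative constants; the text only requires 0 < ALPHA < 1, BETA > 0, 0 < LAM < 1/C.
ALPHA, BETA, LAM = 0.9, 1.0, 0.3

def social_distances(interactions, source, t):
    """Distances d_t(source, p) for every reachable p.

    interactions: list of (time, person_a, person_b, weight) records (assumed format).
    """
    # aff_t(a, b) = sum over s <= t of ALPHA^(t-s) * i_s(a, b)
    aff = defaultdict(float)
    for s, a, b, w in interactions:
        if s <= t:
            aff[(a, b)] += ALPHA ** (t - s) * w
    # Each edge costs aff^(-BETA); the infimum over paths is a shortest path (Dijkstra).
    graph = defaultdict(list)
    for (a, b), value in aff.items():
        if value > 0:
            graph[a].append((b, value ** -BETA))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph[u]:
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def reward(interactions, welfare, source, t):
    """r_t(source) = sum over p of LAM^d_t(source, p) * w_t(p); unreachable people contribute 0."""
    dist = social_distances(interactions, source, t)
    return sum(LAM ** d * welfare[p] for p, d in dist.items() if p in welfare)

# Toy data: Alice interacts with Bob, Bob with Carol, all at time 0 with weight 1.
log = [(0, "Alice", "Bob", 1.0), (0, "Bob", "Carol", 1.0)]
welfare = {"Alice": 1.0, "Bob": 1.0, "Carol": 1.0}
print(social_distances(log, "Alice", 0), reward(log, welfare, "Alice", 0))
```

In the toy data Carol is two hops from Alice, so her welfare is weighted by λ², which shows how caring decays along chains of acquaintance rather than with raw headcount.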

Notice how the utility function is automatically finite and bounded, and none of the weird paradoxes of population ethics and infinitary ethics crop up, even if there is an infinite number of people in the universe. I like to visualize people space as a tiling of hyperbolic space, with Alice standing in the center of a Poincare or Beltrami-Klein model of it. Alice’s “measure of caring” is then proportional to volume in the model (this probably doesn’t correspond to exactly the same formula but it’s qualitatively right, and the formula is only qualitative anyway).

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...}. At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam’s razor (which is the only reasonable way to reason). Of course with bounded algorithms the convergence will be faster, something like the inverse bounded-busy-beaver, but still very slow. Any learning algorithm with inductive bias towards simplicity will have generalization failures when coming across the faultlines that carve reality at the joints, at every new level of the domain hierarchy.

This has an important consequence for alignment: in order to stand a chance, any alignment protocol must be fully online, meaning that whatever data sources it uses, those data sources must always stay in the loop, so that the algorithm can query the data source whenever it encounters a faultline. Theoretically, the data source can be disconnected from the loop at the point when it’s fully “uploaded”: the algorithm unambiguously converged towards a detailed accurate model of the data source. But in practice the convergence there will be very slow, and it’s very hard to know that it already occurred: maybe the model seems good for now but will fail at the next faultline. Moreover, convergence might literally never occur if the machine just doesn’t have the computational resources to contain such an upload (which doesn’t mean it doesn’t have the computational resources to be transformative!)^{[1]}

This is also a reason for pessimism regarding AI outcomes. AI scientists working through trial and error will see the generalization failures becoming more and more rare, with longer and longer stretches of stable function in between. This creates the appearance of increasing robustness. But, in reality robustness increases very slowly. We might reach a stable stretch between “subhuman” and “far superhuman” and the next faultline will be the end.

In the Solomonoff analogy, we can imagine the real data source as a short but prohibitively expensive program, and the learned model of the data source as an affordable but infinitely long program: as time progresses, more and more bits of this program will be learned, but there will always be bits that are still unknown. Of course, any prohibitively expensive program can be made affordable by running it much slower than real-time, which is something that Turing RL can exploit, but at some point this becomes impractical. ↩︎

An alignment-unrelated question: Can we, humans, increase the probability that something weird happens in our spacetime region (e.g., the usual laws of physics stop working) by making it possible to compress our spacetime location? E.g., by building a structure that is very regular (meaning that its description can be very short) and has never been built before in our space region, something like make a huge perfectly aligned rectangular grid of hydrogen atoms, or something like that.

It’s like a magical ritual for changing the laws of physics. This gives a new meaning to summoning circles, pentagrams, etc.

We can rephrase your question as follows: “Can we increase the probability of finding an error in the known laws of physics by performing an experiment with a simple property that never happened before, either naturally or artificially”? And the answer is: yes! This is actually what experimental physicists do all the time: perform experiments that try to probe novel circumstances where it is plausible (Occam-razor-wise) that new physics will be discovered.

As to magical rituals, sufficiently advanced technology is indistinguishable from magic :)

I have a sense that similar principles are at play with Spaced Repetition, and that pointing out that connection may be relevant to effectively handling this issue

I think that in embedded settings (with a bounded version of Solomonoff induction) convergence may never occur, even in the limit as the amount of compute that is used for executing the agent goes to infinity. Suppose the observation history contains sensory data that reveals the probability distribution that the agent had, in the last time step, for the next number it’s going to see in the target sequence. Now consider the program that says: “if the last number was predicted by the agent to be 0 with probability larger than $1-2^{-10^{10}}$ then the next number is 1; otherwise it is 0.” Since it takes much less than $10^{10}$ bits to write that program, the agent will never predict two times in a row that the next number is 0 with probability larger than $1-2^{-10^{10}}$ (after observing only 0s so far).

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can’t understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be unlearnable, meaning that no algorithm has Bayesian regret going to zero in the γ→1 limit. The only obvious natural requirement for RL agents in this case is approximating Bayes-optimality. However, Bayes-optimality is not even “weakly feasible”: it is NP-hard w.r.t. using the number of states and number of hypotheses as security parameters. IMO, the central question is: what kind of natural tractable approximations are there?

Although a generic prior with traps is unlearnable, some priors with traps are learnable. Indeed, it can happen that it’s possible to study the environment in a predictably safe way that is guaranteed to produce enough information about the irreversible transitions. Intuitively, as humans we do often use this kind of strategy. But, it is NP-hard to even check whether a given prior is learnable. Therefore, it seems natural to look for particular types of learnable priors that are efficiently decidable.

In particular, consider the following setting, that I call “expanding safety envelope” (XSE). Assume that each hypothesis in the prior ζ is “decorated” by a set F of state-action pairs s.t. (i) any (s,a)∈F is safe, i.e. the leading term of Q(s,a,γ) in the γ→1 expansion is maximal (ii) for each s∈S, there is (s,a)∈F s.t. a is Blackwell-optimal for s (as a special case we can let F contain all safe actions). Imagine an agent that takes random actions among those a priori known to be in F. If there is no such action, it explodes. Then, it is weakly feasible to check (i) whether the agent will explode (ii) for each hypothesis, to which sets of states it can converge. Now, let the agent update on the transition kernel of the set of actions it converged to. This may lead to new actions becoming certainly known to be in F. We can then let the agent continue exploring using this new set. Iterating this procedure, the agent either discovers enough safe actions to find an optimal policy, or not. Importantly, deciding this is weakly feasible. This is because, for each hypothesis (i) on the first iteration the possible asymptotic state sets are disjoint (ii) on subsequent iterations we might as well assume they are disjoint, since it’s possible to see that if you reach a particular state of an asymptotic state set, then you can add the entire state set (this modification will not create new final outcomes and will only eliminate final outcomes that are better than those remaining). Therefore the number of asymptotic state sets you have to store on each iteration is bounded by the total number of states.

The next questions are (i) what kind of regret bounds we can prove for decorated priors that are XSE-learnable? (ii) given an arbitrary decorated prior, is it possible to find the maximal-probability-mass set of hypotheses, which is XSE-learnable? I speculate that the second question might turn out to be related to the unique games conjecture. By analogy with other optimization problems that are feasible only when maximal score can be achieved, maybe the UGC implies that we cannot find the maximal set but we can find a set that is approximately maximal, with an optimal approximation ratio (using a sum-of-squares algorithm). Also, it might make sense to formulate stronger desiderata which reflect that, if the agent assumes a particular subset of the prior but discovers that it was wrong, it will still do its best in the following. That is, in this case the agent might fall into a trap but at least it will try to avoid further traps.

This has implications even for learning without traps. Indeed, most known theoretical regret bounds involve a parameter that has to do with how costly a mistake it is possible to make. This parameter can manifest as the MDP diameter, the bias span or the mixing time. Such regret bounds seem unsatisfactory since the worst-case mistake determines the entire guarantee. We can take the perspective that such costly but reversible mistakes are “quasi-traps”: not actual traps, but trap-like on short timescales. This suggests that applying an approach like XSE to quasi-traps should lead to qualitatively stronger regret bounds. Such regret bounds would imply learning faster on less data, and in episodic learning they would imply learning inside each episode, something that is notoriously absent in modern episodic RL systems like AlphaStar.

Moreover, we can also use this to do away with ergodicity assumptions. Ergodicity assumptions require the agent to “not wander too far” in state space, in the simplest case because the entire state space is small. But, instead of “wandering far” from a fixed place in state space, we can constrain “wandering far” w.r.t. the optimal trajectory. Combining this with XSE should lead to guarantees that depend on the prevalence of irreversible and quasi-irreversible departures from this trajectory.

In multi-armed bandits and RL theory, there is a principle known as “optimism in the face of uncertainty”. This principle says that you should always make optimistic assumptions: if you are wrong, you will find out (because you will get less reward than you expected). It explicitly underlies UCB algorithms and is implicit in other algorithms, like Thompson sampling. But, this fails miserably in the presence of traps. I think that approaches like XSE point at a more nuanced principle: “optimism in the face of cheap-to-resolve uncertainty, pessimism in the face of expensive-to-resolve uncertainty”. Following this principle doesn’t lead to actual Bayes-optimality, but perhaps it is in some sense a good enough approximation.
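To make the optimistic principle concrete, here is a minimal sketch of the standard UCB1 index rule for a multi-armed bandit. The function name `ucb1`, the `pull` callback and the toy reward values in the usage note are illustrative assumptions, not something taken from the discussion above. Note that the rule tries every arm at least once and keeps revisiting under-sampled arms, which is harmless when mistakes are reversible and is exactly what goes wrong when some arm is a trap.

```python
import math


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds and return the pull count of each arm.

    `pull(arm)` is a hypothetical callback returning a reward in [0, 1].
    """
    counts = [0] * n_arms   # number of times each arm was pulled
    sums = [0.0] * n_arms   # total reward collected from each arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialisation: pull every arm once so the index is defined.
            arm = t - 1
        else:
            # Optimism: empirical mean plus an exploration bonus that
            # shrinks as the arm accumulates samples.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward

    return counts
```

For example, `ucb1(lambda arm: [0.2, 0.8][arm], n_arms=2, horizon=1000)` concentrates most pulls on the second arm while still sampling the first one occasionally.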

Master post for ideas about infra-Bayesianism.

In the anthropic trilemma, Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution.

Consider a population of robots, each of which is a regular RL agent. The environment produces the observations of the robots, but can also make copies or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the “physical” baseline. Indeed, suppose that a particular observation c has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have c much more often in their history than the physical frequency with which c is encountered, due to the resulting “selection bias”. We call this setting “anthropic RL” (ARL).
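The size of this selection bias can be computed exactly in a toy version of the setup. The sketch below makes several assumptions that are not in the text: observation c occurs independently with physical probability `p` on every step, a robot that observes c is replaced by `copies` robots in total, and the population is large enough that sampled fractions equal ratios of expected counts. The function name `sampled_frequency` is hypothetical.

```python
from fractions import Fraction
from math import comb


def sampled_frequency(p, copies, n):
    """Expected fraction of c-observations in the history of a random robot.

    p      -- physical probability of observing c on any step (assumption)
    copies -- robots that replace a robot after it observes c (assumption)
    n      -- number of steps in every history, n >= 1
    """
    # Expected number of robots whose history contains exactly k copies of c:
    # C(n, k) physical histories, each with probability p^k (1-p)^(n-k),
    # each multiplied copies^k times.
    weights = [
        comb(n, k) * (copies * p) ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
    ]
    total = sum(weights)
    # Average fraction k/n of c-observations over the sampled population.
    return sum(Fraction(k, n) * w for k, w in enumerate(weights)) / total
```

With `p = Fraction(1, 10)` and `copies = 10` the sampled frequency is 10/19, more than five times the physical frequency of 1/10, whereas with `copies = 1` the two coincide.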

The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the “physical” perspective. For example, we can consider an “anthropic MDP” (AMDP). An AMDP has finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗). The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are mutually inconsistent.

[EDIT: Actually, given that we don’t care about the order of robots, the signature of the transition kernel should be T:A×S→Δ(ℕ^S)]

The consistency that is violated is exactly the causality property of environments. Luckily, we know how to deal with acausality: using the IB causal-acausal correspondence! The result can be described as follows: Murphy chooses a time moment n∈N and guesses the robot policy π until time n. Then, a simulation of the dynamics of (π,T) is performed until time n, and a single history is sampled from the resulting population. Finally, the observations of the chosen history unfold in reality. If the agent chooses an action different from what is prescribed, Nirvana results. Nirvana also happens after time n (we assume Nirvana reward 1 rather than ∞).

This IB hypothesis is consistent with what the average robot sees at any given moment of time. Therefore, the average robot will learn this hypothesis (assuming learnability). This means that for n ≫ 1/(1−γ) ≫ 0, the population of robots at time n has expected average utility with a lower bound close to the optimum for this hypothesis. I think that for an AMDP this should equal the optimum expected average utility you can possibly get, but it would be interesting to verify.

Curiously, the same conclusions should hold if we do a weighted average over the population, with any fixed method of weighting. Therefore, the posterior of the average robot behaves adaptively depending on which sense of “average” you use. So, your epistemology doesn’t have to fix a particular method of counting minds. Instead different counting methods are just different “frames of reference” through which to look, and you can be simultaneously rational in all of them.

Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a “memory MDP” that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.

I’m not sure what you mean by that “unrolling”. Can you write a mathematical definition?

Let’s consider a simple example. There are two states: s0 and s1. There is just one action so we can ignore it. s0 is the initial state. An s0 robot transitions into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What will our population look like?

0th step: all robots remember s0

1st step: all robots remember s0s1

2nd step: 1/2 of robots remember s0s1s0 and 1/2 of robots remember s0s1s1

3rd step: 1/3 of robots remember s0s1s0s1, 1/3 of robots remember s0s1s1s0 and 1/3 of robots remember s0s1s1s1

There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have Pr[s0s1s0]=1/2 and Pr[s0s1s1]=1/2. But, to be consistent with step 3 we must have Pr[s0s1s0]=1/3, Pr[s0s1s1]=2/3.

In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.

Or, at least it’s not obvious there is such a hypothesis. In this example, Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don’t know, maybe for finite state spaces it can work. Would definitely be interesting to check.
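The population in this example is small enough to enumerate exactly, which gives a quick check of the step-by-step fractions and of the golden-ratio limit. The following is only a sketch: the names `SUCCESSORS`, `population` and `prefix_probability` are made up for illustration, and sampling a robot is assumed to be uniform over the population at the given step.

```python
from collections import Counter
from fractions import Fraction

# Transition rule of the example: an s0 robot becomes one s1 robot,
# an s1 robot becomes one s0 robot and one s1 robot.
SUCCESSORS = {"s0": ["s1"], "s1": ["s0", "s1"]}


def population(step):
    """Map each remembered history (a tuple of states) to its robot count."""
    pop = Counter({("s0",): 1})  # step 0: a single robot that remembers s0
    for _ in range(step):
        nxt = Counter()
        for history, count in pop.items():
            # Every successor robot inherits the memory of its parent.
            for state in SUCCESSORS[history[-1]]:
                nxt[history + (state,)] += count
        pop = nxt
    return pop


def prefix_probability(step, prefix):
    """Probability that a uniformly sampled robot at `step` remembers `prefix`."""
    pop = population(step)
    total = sum(pop.values())
    hits = sum(c for h, c in pop.items() if h[: len(prefix)] == prefix)
    return Fraction(hits, total)


if __name__ == "__main__":
    a = ("s0", "s1", "s0")
    b = ("s0", "s1", "s1")
    for step in (2, 3, 20):
        pa, pb = prefix_probability(step, a), prefix_probability(step, b)
        print(step, pa, pb, float(pb / pa))
```

The same prefix gets probability 1/2 at step 2 and 1/3 at step 3, which is the inconsistency described above, while the ratio of the two prefix probabilities approaches (1+√5)/2 as the step grows.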

[EDIT: actually, in this example there is such a hypothesis but in general there isn’t, see below]

Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states s0, s1 with the memory states [s0], [s1], [s0,s0], [s0,s1], etc. The action takes a robot in [s0] to memory state [s0,s1], and a robot in [s0,s1] to one robot in [s0,s1,s0] and another in [s0,s1,s1].

(Skip this paragraph unless the specifics of what’s going on aren’t obvious: given a transition distribution P(s′*|s,π) (P being the distribution over sets of states s′* given starting state s and policy π), we can define the memory transition distribution P(s′*_m|s_m,π) given policy π and starting “memory state” s_m∈S* (Note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of s_m into the transition distribution as the current state. Then for each s′* in the domain, for each element in s′* we concatenate that element onto the end of s_m and collect these s′_m into a set s′*_m, which is assigned the same probability P(s′*).)

So now at time t=2, if you sample a robot, the probability that its state begins with [s0,s1,s1] is 0.5. And at time t=3, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it’s just that we’ve turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.

I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

I’m not quite sure what you are trying to say here, probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It’s like in classical RL theory, when you’re proving a regret bound or whatever, your probability space consists of histories.

Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I’m talking about hypotheses which conform to the classical “cybernetic agent model”. If you wish, we can call it “Bayesian cybernetic hypothesis”.

Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot, even if the number of states is finite.

Indeed, consider some T:S→Δ(ℕ^S). We can take its expected value to get ET:S→ℝ^S_+. Assuming the chain is communicating, ET is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector η∈ℝ^S_+. We then get the subjective transition kernel:

ST(t∣s)=ET(t∣s)ηt∑t′∈SE