Shortform #101 Conversation experiment: Tabooing $topics after previously discussed for $time

Do you ever find yourself talking at length, sometimes repetitively about the same topic(s)? Do you notice others around you doing similarly? Experiment with tabooing that topic or topics (with consent of your conversation partner(s), of course) for an evening (or whatever your time interval is) and see what happens!

On the spur of the moment tonight, after we finished coworking and were hanging out, I asked my friend if they wanted to try the conversation topic taboo experiment. They said yes, and that led to a few more really good conversations on top of the tabooed-topic conversations we had had earlier in the night (those were good too; please note that tabooing a topic doesn’t necessarily mean the topic is bad).

I will propose this experiment at Virginia Rationalists’ next meetup (tomorrow!) and see how that goes.

Below is a list of powerful optimizers ranked on properties, as part of a brainstorm on whether there’s a simple core of consequentialism that excludes corrigibility. I think that AlphaZero is a moderately strong argument that there is a simple core of consequentialism which includes inner search.

Properties

Simple: takes less than 10 KB of code. If something is already made of agents (markets and the US government) I marked it as N/A.

Coherent: approximately maximizing a utility function most of the time. There are other definitions:

Not being money-pumped

Nate Soares’s notion in the MIRI dialogues: having all your actions point towards a single goal

Adversarially coherent: something like “appears coherent to weaker optimizers” or “robust to perturbations by weaker optimizers”. This implies that it’s incorrigible.

will achieve high utility even when “disrupted” by an optimizer somewhat less powerful

Search+WM: operates by explicitly ranking plans within a world-model. Evolution is a search process, but doesn’t have a world-model. The only contact with the territory it gets comes from directly interacting with the world, and this is maybe why it’s so slow.

Thing | Simple? | Coherent? | Adv. coherent? | Search+WM?
--- | --- | --- | --- | ---
Humans | N | Y | Sometimes | Y
AIXI-tl | Y | Y | N | Y
Stockfish | N | Y | Y | Y
AlphaZero/OAI5 | Y | Y | Y | Y
Markets | N/A | Y | Y | Y
US government | N/A | Y | N | Y
Evolution | Y | N | N | N

Notes:

Humans are not adversarially coherent: prospect theory and other cognitive biases can be exploited, indoctrination, etc.

AIXI-tl is not adversarially coherent because it is an embedded agent and can be switched off etc.

AlphaZero: when playing chess, you can use another strategy and it still wins

Markets are inexploitable, but they don’t do search in a world-model other than the search done by individual market participants

The US government is not adversarially coherent in most circumstances, even if its subparts are coherent; lobbying can affect the US government’s policies, and it is meant to be corrigible by the voting population.

Evolution is not coherent: species often evolve to extinction; foxes and rabbits, etc.

There is no concrete definition of what an AGI actually is. Discussions around this subject matter mostly revolve around integrating nondeterministic systems into modern civilization. Humans on a very basic level operate in both deterministic and nondeterministic ways, and one may substitute for the other even in the same domain of an individual just because of a different context or time. Reality is similar in that it constitutes both determinism and nondeterminism, but cognition has only been able to operate in the deterministic space. Thus, throughout most of history, we’ve been dealing mostly with deterministic systems, at least among the systems created by humans through status functions. Artificial nondeterministic systems are still something new to current civilization.

Given a hypothesis about the universe, we can tell which programs are running. (This is just the bridge transform.)

Given a program, we can tell whether it is an agent, and if so, which utility function it has^{[1]} (the “evaluating agent” section of the article).

I will now outline how we can use these building blocks to solve both the inner and outer alignment problem. The rough idea is:

For each hypothesis in the prior, check which agents are precursors of our agent according to this hypothesis.

Among the precursors, check whether some are definitely neither humans nor animals nor previously created AIs.

If there are precursors like that, discard the hypothesis (it is probably a malign simulation hypothesis).

If there are no precursors like that, decide which of them are humans.

Follow an aggregate of the utility functions of the human precursors (conditional on the given hypothesis).
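The steps above can be sketched in code. This is only a toy illustration under strong assumptions: the `Hypothesis` and `Precursor` types, the pre-assigned labels, and the equal-weight averaging of benign utilities are all hypothetical choices made here (the outline only says "an aggregate"), and nothing infra-Bayesian is modelled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Precursor:
    name: str
    label: str  # hypothetical classifier output: "benign", "malign" or "irrelevant"
    utility: Callable[[str], float]  # utility over outcomes, assumed already normalized


@dataclass
class Hypothesis:
    prior: float
    precursors: List[Precursor]


def aggregate_utility(hypotheses: List[Hypothesis]) -> Callable[[str], float]:
    """Discard hypotheses containing malign precursors, then average benign utilities."""
    kept = [h for h in hypotheses
            if not any(p.label == "malign" for p in h.precursors)]
    total = sum(h.prior for h in kept)
    if total == 0:
        raise ValueError("no hypothesis survived filtering")

    def utility(outcome: str) -> float:
        value = 0.0
        for h in kept:
            benign = [p for p in h.precursors if p.label == "benign"]
            if not benign:
                continue
            # Assumption: equal weight per benign precursor within a hypothesis.
            mean_u = sum(p.utility(outcome) for p in benign) / len(benign)
            # Hypotheses are weighted by their renormalized prior mass.
            value += (h.prior / total) * mean_u
        return value

    return utility
```

In this sketch a hypothesis that contains even one malign precursor contributes nothing, no matter how many benign precursors it also contains.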

Detection

How to identify agents which are our agent’s precursors? Let our agent be G and let H be another agent which exists in the universe according to hypothesis Θ^{[2]}. Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual “H follows σ” to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn’t run).
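As a rough illustration of the shape of this definition, here is a toy deterministic version. It assumes a universe can be summarized by a hypothetical function `g_runs` from an H-policy to whether G's source code runs, which drops all the infra-Bayesian structure of the real counterfactual.

```python
from typing import Callable, Iterable


def is_precursor(g_runs: Callable[[str], bool], h_policies: Iterable[str]) -> bool:
    """H counts as a precursor of G if some H-policy makes G's source code not run."""
    return any(not g_runs(sigma) for sigma in h_policies)


# Toy universe: G exists only if the programmer H chooses to build it.
def g_runs_given_programmer(policy: str) -> bool:
    return policy == "build_G"


# Toy universe: a bystander whose choices never affect whether G runs.
def g_runs_given_bystander(policy: str) -> bool:
    return True
```

Here the programmer is detected as a precursor because the policy "do_nothing" prevents G from running, while the bystander is not.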

A possible complication is, what if Θ implies that H creates G / doesn’t interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it. It is possible that any such Θ would have a sufficiently large description complexity penalty that it doesn’t matter. On the second hand, if Θ is unconditionally Knightian uncertain about H creating G then the utility will be upper bounded by the scenario in which G doesn’t exist, which is liable to make Θ an effectively falsified hypothesis. On the third hand, it seems plausible that the creation of G by H would be contingent on G’s behavior (Newcomb-style, which we know how to handle in infra-Bayesianism), in which case Θ is not falsified and the detection works. In any case, there is a possible variant of the definition to avoid the problem: instead of examining only Θ we also examine coarsenings of Θ which are not much more complex to describe (in the hope that some such coarsening would leave the creation of G uncertain).

Notice that any agent whose existence is contingent on G’s policy cannot be detected as a precursor: the corresponding program doesn’t even “run”, because we don’t apply a G-policy-counterfactual to the bridge transform.

Classification

How to decide which precursors are which? One tool we have is the g parameter and the computational resource parameters in the definition of intelligence. In addition we might be able to create a very rough neuroscience-based model of humans. Also, we will hopefully have a lot of information about other AIs that can be relevant. Using these, it might be possible to create a rough benign/malign/irrelevant classifier, s.t.

Humans are classified as “benign”.

Most (by probability mass) malign simulation hypotheses contain at least one precursor classified as “malign”.

Non-human agents that exist in the causal past of our AI in the null (non-simulation) hypothesis are classified as “irrelevant”.

Assistance

Once we have detected and classified precursors in each hypothesis, we discard all hypotheses that contain malign precursors. In the remaining hypotheses, we perform some kind of aggregation on the utility functions of the benign precursors (for example, this). The utility functions from different hypotheses are somehow normalized to form the overall utility function. Alternatively, we do a maximal lottery vote for the policy, where each hypothesis is a voter with weight proportional to its prior probability mass.

Inner Alignment

Why can this solve inner alignment? In any model-based approach, the AI doesn’t train the policy directly. Instead, it trains models and uses them to compute the policy. I suspect that the second step cannot create mesa-optimizers, since it only involves control and not learning^{[3]}. Hence, any mesa-optimizer has to originate from the first step, i.e. from the model/hypothesis. And, any plausible physicalist hypothesis which contains a mesa-optimizer has to look like a malign simulation hypothesis.

Outer Alignment

Why can this solve outer alignment? Presumably, we are aggregating human utility functions. This doesn’t assume humans are perfect agents: g can be less than infinity. I suspect that when g<∞ the utility function becomes somewhat ambiguous, but the ambiguity can probably be resolved arbitrarily or maybe via a risk-averse method. What if the AI modifies the humans? Then only pre-modification humans are detected as precursors, and there’s no problem.

Moreover, the entire method can be combined with the Hippocratic principle to avoid catastrophic mistakes out of ignorance (i.e. to go from intent alignment to impact alignment).

We do need a lot more research to fully specify this “utility reconstruction” and check that it satisfies reasonable desiderata. But, the existence of a natural utility-function-dependent measure of intelligence suggests it is possible.

In modern deep RL systems, there might not be a clear line between learning and control. For example, if we use model-free RL to produce the policy for a given hypothesis, then there is learning happening there as well. In such an architecture, the value function or Q-function should be regarded as part of the hypothesis for our purpose.

Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual “H follows σ” to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn’t run).

A possible complication is, what if Θ implies that H creates G / doesn’t interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it.

Can you please explain how this does not match the definition? I don’t yet understand all the math, but intuitively, if H creates G / doesn’t interfere with the creation of G, then if H instead followed the policy “do not create G / do interfere with the creation of G”, then G’s code wouldn’t run?

Can you please give an example of a precursor that does match the definition?

The problem is that if Θ implies that H creates G but you consider a counterfactual in which H doesn’t create G then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it “hard counterfactuals”) only makes sense when the condition you’re counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in general). In a child post I suggested solving this by defining “soft counterfactuals” where you consider coarsenings of Θ in addition to Θ itself.

It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.

Given two agents G and H, we can ask which points on G’s timeline are in the causal past of which points of H’s timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval) on G’s timeline, and measure the mutual information between this action(s) and H’s observations at some interval on H’s timeline.
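A minimal sketch of the measurement just described, under simplifying assumptions: actions and observations are small discrete values, the counterfactual is simulated by a hypothetical `channel` function from G's action to H's observation, and mutual information is estimated with a naive plug-in estimator from samples.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Tuple


def mutual_information(pairs: List[Tuple[int, int]]) -> float:
    """Plug-in estimate of I(action; observation) in bits from paired samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_o = Counter(o for _, o in pairs)
    mi = 0.0
    for (a, o), c in joint.items():
        # p(a, o) / (p(a) * p(o)) simplifies to c * n / (count_a * count_o)
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_o[o]))
    return mi


def sample_pairs(channel: Callable[[int, random.Random], int],
                 n: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Simulate G taking a uniformly random binary action and H observing something."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        action = rng.randint(0, 1)
        pairs.append((action, channel(action, rng)))
    return pairs


def nearby(action: int, rng: random.Random) -> int:
    """H sees G's action directly (H is in G's causal future)."""
    return action


def faraway(action: int, rng: random.Random) -> int:
    """H's observation is independent of G's action."""
    return rng.randint(0, 1)
```

With enough samples, the estimate for `nearby` approaches one bit and the estimate for `faraway` approaches zero, which is the signal used to decide whether one point lies in the causal past of another.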

Using this, we can effectively construct a future “causal cone” emanating from the AI’s origin, and also a past causal cone emanating from some time t on the AI’s timeline. Then, “nearby” agents will meet the intersection of these cones for low values of t whereas “faraway” agents will only meet it for high values of t or not at all. To first approximation, the user would be the “nearest” precursor^{[1]} agent i.e. the one meeting the intersection for the minimal t.

More precisely, we expect the user’s observations to have nearly maximal mutual information with the AI’s actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI’s sensors measure every nerve signal emanating from the user’s brain? To address this, we can fix t to a value s.t. we expect only the user to meet the intersection of cones, and have the AI select the agent which meets this intersection for the highest mutual information threshold.

This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user.

More on Counterfactuals

In the parent post I suggested “instead of examining only Θ we also examine coarsenings of Θ which are not much more complex to describe”. A possible elegant way to implement this:

Consider the entire portion Θ̄ of our (simplicity) prior which consists of coarsenings of Θ.

These are notoriously difficult to deal with. The only methods I know that are applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.

Weaknesses

My main concerns with this approach are:

The possibility of major conceptual holes in the definition of precursors. More informal analysis can help, but ultimately mathematical research in infra-Bayesian physicalism in general and infra-Bayesian cartesian/physicalist multi-agent interactions in particular is required to gain sufficient confidence.

The feasibility of a good enough classifier. At present, I don’t have a concrete plan for attacking this, as it requires inputs from outside of computer science.

Inherent “incorrigibility”: once the AI becomes sufficiently confident that it correctly detected and classified its precursors, its plans won’t defer to the users any more than the resulting utility function demands. On the second hand, I think the concept of corrigibility is underspecified so much that I’m not sure it is solved (rather than dissolved) even in the Book. Moreover, the concern can be ameliorated by sufficiently powerful interpretability tools. It is therefore desirable to think more of how to achieve interpretability in this context.

There’s a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD, which we call autocalibrating quantilized RL (AQRL).

First, suppose that we are able to formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn’t contain terms such as “oh btw don’t kill people while you’re building the nanosystem”. However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the “relative difficulty assumption” (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.

We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relative to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relative to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.
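To make the tuning loop concrete, here is a heavily simplified sketch. It assumes the quantilization parameter is an optimization-power knob `power` (the quantilizer picks uniformly among the top 1/power fraction of samples from a base distribution), and it only implements the "raise the parameter until the performance goal is met" part. The safety bound on overshooting and the epistemic/aleatoric distinction are not modelled, and all function names are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def quantilize(candidates: Sequence[float], reward: Callable[[float], float],
               power: float, rng: random.Random) -> float:
    """Pick uniformly among the top 1/power fraction of base samples, ranked by reward."""
    ranked = sorted(candidates, key=reward, reverse=True)
    keep = max(1, int(len(ranked) / power))
    return rng.choice(ranked[:keep])


def autocalibrate(base_sampler: Callable[[random.Random], float],
                  reward: Callable[[float], float],
                  goal: float,
                  powers: Sequence[float],
                  n_base: int = 200,
                  n_eval: int = 200,
                  seed: int = 0) -> Optional[float]:
    """Return the first (smallest) power whose average reward meets the goal.

    `powers` is assumed to be sorted in ascending order, so the search stops
    at the weakest level of optimization that is sufficient.
    """
    rng = random.Random(seed)
    for power in powers:
        total = 0.0
        for _ in range(n_eval):
            candidates = [base_sampler(rng) for _ in range(n_base)]
            total += reward(quantilize(candidates, reward, power, rng))
        if total / n_eval >= goal:
            return power
    return None  # goal not reachable within the allowed range of powers
```

For example, with actions drawn uniformly from [0, 1], reward equal to the action, and a goal of 0.8, the loop passes over `power=1` (mean about 0.5) and `power=2` (mean about 0.75) and stops at `power=4`.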

Notice that hardening the system (i.e. making it more secure using conventional methods) increases the likelihood of the RDA. That is, although it is infeasible to manually harden the system against superintelligent attacks, it can very well be feasible to harden it against attacks available at the minimal level of optimization power (= quantilization parameter) required to meet the performance goal, at least to such an extent that the intended solution becomes more appealing for the AI.

AQRL involves the risk that the RDA is false, even though it seems true to the designers. To mitigate this risk, we can deploy honeypots. That is, we add auxiliary goals to our reward function s.t. (i) if the goal is reached, maximal total reward is awarded (ii) the goal is harmless in itself (iii) reaching the goal is still harder than the intended solution, but by a smaller margin than performing attacks (iv) we can detect the event of reaching the goal. For example, suppose that the AI is not supposed to have Internet access, so gaining Internet access is a path to a variety of possible attacks. Then, we can add an auxiliary goal of the form “send such-and-such innocuous packet to such-and-such IP address”. If the RDA is false and the AI succeeds in gaining Internet access earlier (on the quantilization parameter scale) than reaching the performance goal, then once the AI has sent the packet it will randomize all its remaining actions (since it is operating at the earliest quantilization parameter which allows this). The users will then be alerted that security was breached and rethink their next steps.
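Properties (i) and (iv) of a honeypot can be illustrated with a small reward wrapper. This is a hypothetical sketch: the class name, the alert message, and the way the honeypot event is detected (a boolean passed in by the caller) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HoneypotReward:
    max_total_reward: float
    alerts: List[str] = field(default_factory=list)

    def total_reward(self, task_reward: float, honeypot_hit: bool) -> float:
        """Award maximal reward when the honeypot is reached, and record an alert."""
        if honeypot_hit:
            # (iv) the event is detectable, so the users can be notified
            self.alerts.append("honeypot reached: security assumption violated")
            # (i) reaching the auxiliary goal yields maximal total reward
            return self.max_total_reward
        return min(task_reward, self.max_total_reward)
```

Because the honeypot already yields maximal reward, an agent that reaches it has nothing further to gain from its remaining actions, which is why the text expects those actions to be randomized.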

Now, suppose that we can only specify the reward function via concepts that have to be learned from examples. This opens the risk of generalization failure. Such risk is not necessarily a problem since we are still protected by quantilization. However, it needs to be factored into our safety bounds. An especially concerning possibility is that the AI will interpret the concepts as “would be labeled by humans as such-and-such” instead of the intended interpretation, creating incentives to deceive. Moreover, this encourages learning human models which puts more pressure on the RDA by making it easier to attack.

In order to avoid learning human models, we can use methods along these lines. Specifically, suppose that, in addition to the dataset for learning our task-relevant concepts, we supply a dataset designed specifically for learning human models. We can then design an algorithm that produces models of the task-relevant concepts s.t. the human-centric dataset cannot be compressed w.r.t. them. That is, the algorithm makes sure that access to the models it produces doesn’t help with learning the human-centric dataset, implying that the concept models don’t contain human models. Moreover, we can extend this approach to the entire process, s.t. human models are not learned from any other source of information as well.

Notice that although this method excludes human models from the actual RL process, it encourages them in intermediate computations, since we are actively supplying a human-centric dataset to the system. But, this is not dangerous if the intermediate computation is hardened against non-Cartesian daemons (a big “if” but one we need to deal with anyway).

I noticed that some people use “skeptical” to mean “my armchair reasoning is better than all expert knowledge and research, especially if I am completely unfamiliar with it”.

Example (not a real one): “I am skeptical about the idea that objects would actually change their length when their speed approaches the speed of light.”

The advantage of this usage is that it allows you to dismiss all expertise you don’t agree with, while making you sound a bit like an expert.

I suspect you’re reacting to the actual beliefs (disbelief in your example), rather than the word usage. In common parlance, “skeptical” means “assign low probability”, and that usage is completely normal and understandable.

The ability to dismiss expertise you don’t like is built into humans, not a feature of the word “skeptical”. You could easily replace “I am skeptical” with “I don’t believe” or “I don’t think it’s likely” or just “it’s not really true”.

I think that “skeptical” works better as a status move. If I say I don’t believe you, that makes us two equals who disagree. If I say I am skeptical… I kinda imply that you are not. Similarly, a third party now has the options to either join the skeptical or the non-skeptical side of the debate.

(Or maybe I’m just overthinking things, of course.)

Shortform #100 Writing publicly considered beneficial, fun, and not that scary

After writing one hundred shortform posts, writing publicly no longer feels scary and really just feels like a habit more than anything else (especially because the last 33 posts were near daily or daily). A habit I intend to continue as these are fun to write (even when I feel grumpy or hit an ugh field before doing so) and occupy a nice role in my life, plus I love growing my writing & other skills when creating these posts.

I feel a strong desire to write bigger posts than these, but I like having the shortforms as a consistent & foundational habit for regular public writing. Consistency and building foundational habits are of paramount importance; atop them you can build bigger and better things.

Here’s to continuing daily shortforms and growing what is next too!

Moral Mazes is my favorite management book ever, because instead of “how to be a good manager” it’s about “empirical observations of large-scale organizational dynamics involving management”.

I wish someone would write an updated version—a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.

My take (and the author’s take) is that any company of nontrivial size begins to take on the characteristics of a moral maze. It seems to be a pretty good null hypothesis—any company saying “we aren’t/won’t become a moral maze” has a pretty huge evidential burden to cross.

I keep this point in mind when thinking about strategy for when it comes time to make deployment decisions about AGI and to deploy it. These decisions are going to be made within the context of a moral maze.

To me, this means that some strategies (“everyone in the company has a thorough and complete understanding of AGI risks”) will almost certainly fail. I think only strategies that work well inside of moral mazes will work at all.

To sum up my takes here:

basically every company eventually becomes a moral maze

AGI deployment decisions will be made in the context of a moral maze

understanding moral maze dynamics is important to AGI deployment strategy

basically every company eventually becomes a moral maze

Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes.

Facebook’s pivot to the “metaverse”, for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was overwhelmingly likely to deploy one of the first AGIs, I expect Mark Zuckerberg’s beliefs about AGI safety would be more important to understand than the general dynamics of moral mazes. (Facebook example deliberately chosen to avoid taking stances on the more likely AGI players, but I think it’s relatively clear which ones are moral mazes).

So my strategic corollary to this is that it’s probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often.

In the case of facebook, even in the face of all of their history of actions, I think on the margin I’d prefer the founder to the median replacement to be leading the company.

(Edit: I don’t think founders remaining at the head of a company is evidence that the company isn’t a moral maze. Also, I’m not certain I agree that Facebook’s pivot couldn’t have been done by a moral maze.)

Agreed on all points! One clarification is that large founder-led companies, including Facebook, are all moral mazes internally (i.e. from the perspective of the typical employee); but their founders often have so much legitimacy that their external actions are only weakly influenced by moral maze dynamics.

I guess that means that if AGI deployment is very incremental—a sequence of small changes to many different AI systems, that only in retrospect add up to AGI—moral maze dynamics will still be paramount, even in founder-led companies.

I think that’s right, but the moral maze will also be mediating the information and decision-making support that’s available to the leadership, so they’re not totally immune from the influences.

If Saturday was a stay-in get attention hijacked kind of day, today was a “get the heck out of the house and do something” kind of day. I met up with a friend to visit the Chrysler Art Museum and attended an hourish long glassblowing demonstration which was absolutely fascinating, I loved it! I enjoyed walking around the modernist & contemporary sections of the museum (only stayed for a few hours so I didn’t go to all parts of the museum this time around), had a few decidedly new experiences during that and found those pleasant.

The M.C. Escher exhibit was amazing, perspective altering as expected, and inspiring :) Expressing mathematical and/or physics concepts via art is a beautiful thing, I want to see more of that.

After visiting the museum, I went to my favorite cafe and sat at the cafe’s big round table by a window. I studied some, but mostly had conversations with several different people throughout the day. Getting into intellectual conversations with strangers is truly enjoyable, you never know what you may learn or concepts you may be introduced to.

During the evening, I met with a friend & coworked for an hour before hanging out. That was a very productive hour; wow, do I have much left to do after the Retreat, and yet more organizing to do. I love that work so much though, it’s great :) My friend & I had excellent conversations after working, I’m glad we met up and look forward to the next time!

I do not have a desk in my office yet, and use my desktop while sitting on the floor. My foot fell asleep while writing this lol. That happens a fair bit, but it’s okay, I shall acquire desks soon hopefully and have a proper setup once again.

I have a new project for which I actively don’t want funding for myself: it’s too new and unformed to withstand the pressure to produce results for specific questions by specific times*. But if it pans out in ways other people value I wouldn’t mind retroactive payment. This seems like a good fit for impact certificates, which is a tech I vaguely want to support anyway.

Someone suggested that if I was going to do that I should mint and register the cert now, because that norm makes IC markets more informative, especially about the risk of very negative projects. This seems like a good argument but https://www.impactcerts.com/mint seems borked and I don’t know of better options. Now this is requiring thought and the whole point was to not have to do that yet.

So I’m crowdsourcing. What are your thoughts on this? What are potential best practices I should support? Counter arguments?

*My psychology is such that there’s no way around this that also guarantees speeding up the work. If someone wanted to fund the “nice things for Elizabeth” project I’d accept, but with no guarantee I would produce any faster. I *have* asked for funding for my collaborator and a TBD research assistant.

I will definitely not be sharing the object level project in this thread.

I hurt my hand so if my replies look weird that’s why.

“...If perfection is impossible that is no excuse for not trying. Hold yourself to the highest standard you can imagine, and look for one still higher. Do not be content with the answer that is almost right; seek one that is exactly right.”

Sounds like destructive advice for a lot of people. I could add a personal disclaimer or adjust the tone away from “never feel satisfied” towards “don’t get complacent”, though that’s beyond what I feel a summarizer ought to do.

Similarly, the ‘argument’ virtue sounds like bad advice to take literally, unless tempered with a ‘shut up and be socially aware’ virtue.

I’d appreciate any perspective on this or what I should do.

Most advice is contraindicated for some people, so if it’s not a valid Law, it should only be called to attention, not all else equal given weight or influence beyond what calling something to attention normally merits. Even for Laws, there is no currently legible Law saying that people must or should follow them, that depends on the inscrutable values of Civilization. It’s not a given that people should optimize themselves for being agents. So advice for being an effective agent might be different from advice for being healthy or valuable, or understanding topos theory, or building 5 meters high houses of cards.

For perfectionism, I think never being satisfied with where you’re at now doesn’t mean you can’t take pride in how far you’ve come?

“Don’t feel complacent” feels different from “striving for perfection” to me. The former feels more like making sure your standards don’t drop too much (maintaining a good lower bound), whereas the latter feels more like pushing the upper limit. When I think about complacency, I think about being careful and making sure that I am not e.g. taking the easy way out because of laziness. When I think about perfectionism (in the 12 virtues sense), I think about imagining ways things can be better and finding ways to get closer to that ideal.

I don’t really understand the ‘argument’ virtue so no comment for that.

Shortform #98 Visual media consumption considered enjoyable but net harmful for now

In shortform #89 I declared another media diet. That diet ended yesterday evening so I tried out a TV show, Neil Gaiman’s “The Sandman” on Netflix. Which...promptly hijacked my attention from last night until the afternoon of today. I came across LRNZ’s “Golem” for free on Wednesday last week at my local cafe hangout, took it home, and devoured it in one sitting. I could keep listing examples, on and on...of all the times some visual media (digital or analog) has so utterly hacked my attention that I abandoned all else in favor of it...but that list would be ridiculously long.

Friction increasing or other interventions around visual media consumption that work well for me:

Video games: if the playtime is with a group of others and scheduled in advance, that poses no issues & maximises my enjoyment of the time (I don’t usually like playing video games alone anymore).

Movies: going over to a friend’s house or a theater makes this totally fine. I may also be okay with limiting to one per week because honestly I just don’t watch movies very much.

What I simply have to ban, unfortunately:

TV shows

YouTube videos or similar but from other platforms

Three exceptions: Video is from work & required to watch. Video is educational / from ROSE for a workshop. Video is short and was sent directly to me by a friend.

What I am uncertain about but likely need to ban:

Webcomics, manga, visual novels, comics

I don’t engage with such media very often, but when I have, they’ve been very hijack-y. A blanket ban, with an exception for reading a set number of chapters or arcs per time interval with a group, is probably the best bet.

After experiencing attention hacking or hijacking, my mood tends to crater, my mind is unfocused & hazy or a bit...loose (see yesterday’s shortform for a great example of that), and I feel bad about the time spent. Not guilt per se, more of a melancholy feeling about having missed out on more enjoyable non-hijacking activities.

So! Back to a media diet starting tomorrow 7 August and lasting through 10 September this year.

Infra-Bayesian physicalism is an interesting example in favor of the thesis that the more qualitatively capable an agent is, the less corrigible it is. (a.k.a. “corrigibility is anti-natural to consequentialist reasoning”). Specifically, alignment protocols that don’t rely on value learning become vastly less safe when combined with IBP:

Example 1: Using a steep time discount to disincentivize dangerous long-term plans. For IBP, “steep time discount” just means predominantly caring about your source code running with particular short inputs. Such a goal strongly incentivizes the usual convergent instrumental goals: first take over the world, then run your source code with whatever inputs you want. IBP agents just don’t have time discount in the usual sense: a program running late in physical time is just as good as one running early in physical time.

Example 2: Debate. This protocol relies on a zero-sum game between two AIs. But, the monotonicity principle rules out the possibility of zero-sum! (If L and −L are both monotonic loss functions then L is a constant). So, in a “debate” between IBP agents, they cooperate to take over the world and then run the source code of each debater with the input “I won the debate”.

Example 3: Forecasting/imitation (an IDA in particular). For an IBP agent, the incentivized strategy is: take over the world, then run yourself with inputs showing you making perfect forecasts.

The conclusion seems to be, it is counterproductive to use IBP to solve the acausal attack problem for most protocols. Instead, you need to do PreDCA or something similar. And, if acausal attack is a serious problem, then approaches that don’t do value learning might be doomed.

Infradistributions admit an information-theoretic quantity that doesn’t exist in classical theory. Namely, it’s a quantity that measures how many bits of Knightian uncertainty an infradistribution has. We define it as follows:

Let X be a finite set and Θ a crisp infradistribution (credal set) on X, i.e. a closed convex subset of ΔX. Then, imagine someone trying to communicate a message by choosing a distribution out of Θ. Formally, let Y be any other finite set (space of messages), θ∈ΔY (prior over messages) and K:Y→Θ (communication protocol). Consider the distribution η:=θ⋉K∈Δ(Y×X). Then, the information capacity of the protocol is the mutual information between the projection on Y and the projection on X according to η, i.e. Iη(prX;prY). The “Knightian entropy” of Θ is now defined to be the maximum of Iη(prX;prY) over all choices of Y, θ, K. For example, if Θ is Bayesian then it’s 0, whereas if Θ=⊤X, it is ln|X|.
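To make the two boundary cases concrete, here is a minimal Python sketch that computes the mutual information of η=θ⋉K for a given protocol. The function name and the list-based encoding of θ and K are assumptions made for illustration. For Θ=⊤X it uses the protocol that sends message y as the point mass on y, which attains ln|X|; for a Bayesian Θ the only available protocol is constant, so the capacity is 0.

```python
import math

def mutual_information(theta, kernel):
    """I(Y;X) in nats for eta = theta ⋉ K.

    theta[y] is the prior over messages, kernel[y][x] is the distribution K(y) on X.
    """
    n_x = len(kernel[0])
    # Marginal of eta on X.
    p_x = [sum(theta[y] * kernel[y][x] for y in range(len(theta))) for x in range(n_x)]
    info = 0.0
    for y, p_y in enumerate(theta):
        for x in range(n_x):
            joint = p_y * kernel[y][x]
            if joint > 0:
                info += joint * math.log(joint / (p_y * p_x[x]))
    return info

# Θ = ⊤_X with |X| = 4: every distribution is allowed, so message y can be
# encoded as the point mass on y.
n = 4
uniform = [1 / n] * n
point_masses = [[1.0 if x == y else 0.0 for x in range(n)] for y in range(n)]
top_capacity = mutual_information(uniform, point_masses)

# Bayesian Θ = {p}: every message must map to the same p, so nothing is transmitted.
p = [0.5, 0.25, 0.125, 0.125]
bayes_capacity = mutual_information(uniform, [p] * n)
```

The sketch only evaluates particular protocols; the Knightian entropy itself is the maximum over all Y, θ and K.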

Here is one application^{[1]} of this concept, orthogonal to infra-Bayesianism itself. Suppose we model inner alignment by assuming that some portion ϵ of the prior ζ consists of malign hypotheses. And we want to design e.g. a prediction algorithm that will converge to good predictions without allowing the malign hypotheses to attack, using methods like confidence thresholds. Then we can analyze the following metric for how unsafe the algorithm is.

Let O be the set of observations and A the set of actions (which might be “just” predictions) of our AI, and for any environment τ and prior ξ, let $D^{\xi}_{\tau}(n) \in \Delta(A \times O)^n$ be the distribution over histories resulting from our algorithm starting with prior ξ and interacting with environment τ for n time steps. We have $\zeta = \epsilon\mu + (1-\epsilon)\beta$, where μ is the malign part of the prior and β the benign part. For any μ′, consider $D^{\epsilon\mu' + (1-\epsilon)\beta}_{\tau}(n)$. The closure of the convex hull of these distributions for all choices of μ′ (“attacker policy”) is some $\Theta^{\beta}_{\tau}(n) \subseteq \Delta(A \times O)^n$. The maximal Knightian entropy of $\Theta^{\beta}_{\tau}(n)$ over all admissible τ and β is called the malign capacity of the algorithm. Essentially, this is a bound on how much information the malign hypotheses can transmit into the world via the AI during a period of n time steps. The goal then becomes finding algorithms with simultaneously good regret bounds and good (in particular, at most polylogarithmic in n) malign capacity bounds.

Two deterministic toy models for regret bounds of infra-Bayesian bandits. The lesson seems to be that equalities are much easier to learn than inequalities.

Model 1: Let A be the space of arms, O the space of outcomes, $r: A \times O \to \mathbb{R}$ the reward function, X and Y vector spaces, $H \subseteq X$ the hypothesis space and $F: A \times O \times H \to Y$ a function s.t. for any fixed $a \in A$ and $o \in O$, $F(a,o): H \to Y$ extends to some linear operator $T_{a,o}: X \to Y$. The semantics of hypothesis $h \in H$ is defined by the equation $F(a,o,h)=0$ (i.e. an outcome o of action a is consistent with hypothesis h iff this equation holds).

For any h∈H denote by V(h) the reward promised by h:

$$V(h) := \max_{a \in A} \; \min_{o \in O:\, F(a,o,h)=0} r(a,o)$$

Then, there is an algorithm with mistake bound $\dim X$, as follows. On round $n \in \mathbb{N}$, let $G_n \subseteq H$ be the set of unfalsified hypotheses. Choose $h_n \in G_n$ optimistically, i.e.

$$h_n := \operatorname{argmax}_{h \in G_n} V(h)$$

Choose the arm $a_n$ recommended by hypothesis $h_n$. Let $o_n \in O$ be the outcome we observed, $r_n := r(a_n, o_n)$ the reward we received and $h^* \in H$ the (unknown) true hypothesis.

If $r_n \geq V(h_n)$ then also $r_n \geq V(h^*)$ (since $h^* \in G_n$ and hence $V(h^*) \leq V(h_n)$) and therefore $a_n$ wasn’t a mistake.

If $r_n < V(h_n)$ then $F(a_n, o_n, h_n) \neq 0$ (if we had $F(a_n, o_n, h_n) = 0$ then the minimization in the definition of $V(h_n)$ would include $r(a_n, o_n)$). Hence, $h_n \notin G_{n+1} = G_n \cap \ker T_{a_n, o_n}$. This implies $\dim \operatorname{span}(G_{n+1}) < \dim \operatorname{span}(G_n)$. Obviously this can happen at most $\dim X$ times.
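As a sanity check of the argument, here is a small Python sketch of the optimistic algorithm on one concrete instance that I am assuming for illustration: a deterministic bandit with k arms, where a hypothesis h assigns an outcome h[a] to each arm and the reward equals the outcome. Appending a constant coordinate 1 to h makes F(a,o,h) = h[a] − o·1 linear, so X = R^(k+1) and the claimed mistake bound is k+1. The function name and encoding are not from the original.

```python
import itertools

def optimistic_elimination(hypotheses, truth, rounds):
    """Run the optimistic algorithm on a deterministic bandit.

    Each hypothesis is a tuple h with h[a] = outcome of arm a, and r(a, o) = o.
    Returns the number of rounds with r_n < V(h_n) and the surviving hypotheses.
    """
    unfalsified = list(hypotheses)
    mistakes = 0
    for _ in range(rounds):
        # V(h) = max_a h[a], since arm a has the single consistent outcome h[a].
        h_n = max(unfalsified, key=max)
        a_n = max(range(len(h_n)), key=lambda a: h_n[a])
        r_n = truth[a_n]  # the observed outcome doubles as the reward
        if r_n < max(h_n):
            mistakes += 1
        # Keep only hypotheses consistent with the observation (the kernel step).
        unfalsified = [h for h in unfalsified if h[a_n] == r_n]
    return mistakes, unfalsified

# All hypotheses with outcomes in {0, 1, 2} on three arms; the true one is (1, 0, 1).
hypotheses = list(itertools.product(range(3), repeat=3))
mistakes, survivors = optimistic_elimination(hypotheses, truth=(1, 0, 1), rounds=10)
```

In this instance every mistake pins down the value of one arm, so the number of mistakes stays within the dim X = 4 bound.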

Model 2: Let the spaces of arms and hypotheses be

$$A := H := S^d := \{x \in \mathbb{R}^{d+1} \mid \|x\| = 1\}$$

Let the reward $r \in \mathbb{R}$ be the only observable outcome, and the semantics of hypothesis $h \in S^d$ be $r \geq h \cdot a$. Then, the sample complexity cannot be bounded by a polynomial of degree that doesn’t depend on d. This is because Murphy can choose the strategy of producing reward $1-\epsilon$ whenever $h \cdot a \leq 1-\epsilon$. In this case, whatever arm you sample, in each round you can only exclude a ball of radius $\approx \sqrt{2\epsilon}$ around the sampled arm. The number of such balls that fit into the unit sphere is $\Omega(\epsilon^{-\frac{1}{2}d})$. So, normalized regret below ϵ cannot be guaranteed in less than that many rounds.

For t=1 we get the usual maximin (“pessimism”), for t=0 we get maximax (“optimism”) and for other values of t we get something in the middle (we can call “t-mism”).

It turns out that, in some sense, this new decision rule is actually reducible to ordinary maximin! Indeed, set

$$\mu^*_t := \operatorname{argmax}_{\mu} \mathbb{E}_{\mu}[U(a^*_t)]$$

$$\Theta_t := t\Theta + (1-t)\mu^*_t$$

Then we get

$$a^*(\Theta_t) = a^*_t(\Theta)$$

More precisely, any pessimistically optimal action for Θt is t-mistically optimal for Θ (the converse need not be true in general, thanks to the arbitrary choice involved in μ∗t).

To first approximation it means we don’t need to consider t-mistic agents since they are just special cases of “pessimistic” agents. To second approximation, we need to look at what the transformation of Θ to Θt does to the prior. If we start with a simplicity prior then the result is still a simplicity prior. If U has low description complexity and t is not too small then essentially we get full equivalence between “pessimism” and t-mism. If t is small then we get a strictly “narrower” prior (for t=0 we are back at ordinary Bayesianism). However, if U has high description complexity then we get a rather biased simplicity prior. Maybe the latter sort of prior is worth considering.

Infra-Bayesianism can be naturally understood as semantics for a certain non-classical logic. This promises an elegant synthesis between deductive/symbolic reasoning and inductive/intuitive reasoning, with several possible applications. Specifically, here we will explain how this can work for higher-order logic. There might be holes and/or redundancies in the precise definitions given here, but I’m quite confident the overall idea is sound.

We will work with homogenous ultracontributions (HUCs). □X will denote the space of HUCs over X. Given μ∈□X, S(μ)⊆ΔcX will denote the corresponding convex set. Given p∈ΔX and μ∈□X, p:μ will mean p∈S(μ). Given μ,ν∈□X, μ⪯ν will mean S(μ)⊆S(ν).

Syntax

Let Tι denote a set which we interpret as the types of individuals (we allow more than one). We then recursively define the full set of types T by:

0∈T (intended meaning: the uninhabited type)

1∈T (intended meaning: the one element type)

If α∈Tι then α∈T

If α,β∈T then α+β∈T (intended meaning: disjoint union)

If α,β∈T then α×β∈T (intended meaning: Cartesian product)

If α∈T then (α)∈T (intended meaning: predicates with argument of type α)

For each α,β∈T, there is a set F0α→β which we interpret as atomic terms of type α→β. We will denote V0α:=F01→α. Among those we distinguish the logical atomic terms:

prαβ∈F0α×β→α

iαβ∈F0α→α+β

Symbols we will not list explicitly, that correspond to the algebraic properties of + and × (commutativity, associativity, distributivity and the neutrality of 0 and 1). For example, given α,β∈T there is a “commutator” of type α×β→β×α.

∧α∈F0(α)×(α)→(α) [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]

∃αβ∈F0(α×β)→(β)

∀αβ∈F0(α×β)→(β) [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]

Assume that for each $n \in \mathbb{N}$ there is some $D_n \subseteq \Box[n]$: the set of “describable” ultracontributions [EDIT: it is probably sufficient to only have the fair coin distribution in $D_2$ in order for it to be possible to approximate all ultracontributions on finite sets]. If $\mu \in D_n$ then $┌\mu┐ \in V_{(\sum_{i=1}^{n} 1)}$

We recursively define the set of all terms Fα→β. We denote Vα:=F1→α.

If f∈F0α→β then f∈Fα→β

If f1∈Fα1→β1 and f2∈Fα2→β2 then f1×f2∈Fα1×α2→β1×β2

If f1∈Fα1→β1 and f2∈Fα2→β2 then f1+f2∈Fα1+α2→β1+β2

If f∈Fα→β then f−1∈F(β)→(α)

If f∈Fα→β and g∈Fβ→γ then g∘f∈Fα→γ

Elements of V(α) are called formulae. Elements of V(1) are called sentences. A subset of V(1) is called a theory.

Semantics

Given T⊆V(1), a model M of T is the following data. To each α∈T, there must correspond some compact Polish space M(α) s.t.:

M(0)=∅

M(1)=pt (the one point space)

M(α+β)=M(α)⊔M(β)

M(α×β)=M(α)×M(β)

M((α))=□M(α)

To each f∈Fα→β, there must correspond a continuous mapping M(f):M(α)→M(β), under the following constraints:

pr, i, diag and the “algebrators” have to correspond to the obvious mappings.

M(=α)=⊤diagM(α). Here, diagX⊆X×X is the diagonal and ⊤C∈□X is the sharp ultradistribution corresponding to the closed set C⊆X.

Consider α∈T and denote X:=M(α). Then, M(()α)=⊤□X⋉id□X. Here, we use the observation that the identity mapping id□X can be regarded as an infrakernel from □X to X.

M(⊥)=⊥pt

M(⊤)=⊤pt

S(M(∨)(μ,ν)) is the convex hull of S(μ)∪S(ν)

S(M(∧)(μ,ν)) is the intersection S(μ)∩S(ν)

Consider α,β∈T and denote X:=M(α), Y:=M(β) and pr:X×Y→Y the projection mapping. Then, M(∃αβ)(μ)=pr∗μ.

Consider α,β∈T and denote X:=M(α), Y:=M(β) and pr:X×Y→Y the projection mapping. Then, p:M(∀αβ)(μ) iff for all q∈Δc(X×Y), if pr∗q=p then q:μ.

M(f1×f2)=M(f1)×M(f2)

M(f1+f2)=M(f1)⊔M(f2)

M(f−1)(μ)=M(f)∗(μ).

M(g∘f)=M(g)∘M(f)

M(┌μ┐)=μ

Finally, for each ϕ∈T, we require M(ϕ)=⊤pt.

Semantic Consequence

Given ϕ∈V(1), we say M⊨ϕ when M(ϕ)=⊤pt. We say T⊨ϕ when for any model M of T, M⊨ϕ. It is now interesting to ask what is the computational complexity of deciding T⊨ϕ. [EDIT: My current best guess is co-RE]

Applications

As usual, let A be a finite set of actions and O be a finite set of observations. Require that for each o∈O there is σo∈Tι which we interpret as the type of states producing observation o. Denote σ∗:=∑o∈Oσo (the type of all states). Moreover, require that our language has the nonlogical symbols s0∈V0(σ∗) (the initial state) and, for each a∈A, Ka∈F0σ∗→(σ∗) (the transition kernel). Then, every model defines a (pseudocausal) infra-POMDP. This way we can use symbolic expressions to define infra-Bayesian RL hypotheses. It is then tempting to study the control theoretic and learning theoretic properties of those hypotheses. Moreover, it is natural to introduce a prior which weights those hypotheses by length, analogous to the Solomonoff prior. This leads to some sort of bounded infra-Bayesian algorithmic information theory and bounded infra-Bayesian analogue of AIXI.

Let’s also explicitly describe 0th order and 1st order infra-Bayesian logic (although they should be segments of higher-order).

0-th order

Syntax

Let A be the set of propositional variables. We define the language L:

Any a∈A is also in L

⊥∈L

⊤∈L

Given ϕ,ψ∈L, ϕ∧ψ∈L

Given ϕ,ψ∈L, ϕ∨ψ∈L

Notice there’s no negation or implication. We define the set of judgements J:=L×L. We write judgements as ϕ⊢ψ (“ψ in the context of ϕ”). A theory is a subset of J.

Semantics

Given T⊆J, a model of T consists of a compact Polish space X and a mapping M:L→□X. The latter is required to satisfy:

M(⊥)=⊥X

M(⊤)=⊤X

M(ϕ∧ψ)=M(ϕ)∧M(ψ). Here, we define ∧ of infradistributions as intersection of the corresponding sets

M(ϕ∨ψ)=M(ϕ)∨M(ψ). Here, we define ∨ of infradistributions as convex hull of the corresponding sets

For any ϕ⊢ψ∈T, M(ϕ)⪯M(ψ)

1-st order

Syntax

We define the language using the usual syntax of 1-st order logic, where the allowed operators are ∧, ∨ and the quantifiers ∀ and ∃. Variables are labeled by types from some set T. For simplicity, we assume no constants, but it is easy to introduce them. For any sequence of variables (v1…vn), we denote Lv the set of formulae whose free variables are a subset of v1…vn. We define the set of judgements J:=⋃vLv×Lv.

Semantics

Given T⊆J, a model of T consists of

For every t∈T, a compact Polish space M(t)

For every ϕ∈Lv where v1…vn have types t1…tn, an element Mv(ϕ) of Xv:=□(∏ni=1M(ti))

It must satisfy the following:

Mv(⊥)=⊥Xv

Mv(⊤)=⊤Xv

Mv(ϕ∧ψ)=Mv(ϕ)∧Mv(ψ)

Mv(ϕ∨ψ)=Mv(ϕ)∨Mv(ψ)

Consider variables u1…un of types t1…tn and variables v1…vm of types s1…sm. Consider also some σ:{1…n}→{1…m} s.t. sσ(i)=ti. Given ϕ∈Lv, we can form the substitution ψ:=ϕ[xi=yσ(i)]∈Lu. We also have a mapping fσ:Xv→Xu given by fσ(x1…xm)=(xσ(1)…xσ(n)). We require Mu(ψ)=f∗(Mv(ϕ))

Consider variables v1…vn and i∈{1…n}. Denote pr:Xv→Xv∖vi the projection mapping. We require Mv∖vi(∃vi:ϕ)=pr∗(Mv(ϕ))

Consider variables v1…vn and i∈{1…n}. Denote pr:Xv→Xv∖vi the projection mapping. We require that p:Mv∖vi(∀vi:ϕ) if and only if, for all q∈ΔXv s.t. pr∗q=p, q:Mv(ϕ)

There is a special type of crisp infradistributions that I call “affine infradistributions”: those that, represented as sets, are closed not only under convex linear combinations but also under affine linear combinations. In other words, they are intersections between the space of distributions and some closed affine subspace of the space of signed measures. Conjecture: in 0-th order logic of affine infradistributions, consistency is polynomial-time decidable (whereas for classical logic it is ofc NP-hard).

To produce some evidence for the conjecture, let’s consider a slightly different problem. Specifically, introduce a new semantics in which □X is replaced by the set of linear subspaces of some finite dimensional vector space V. A model M is required to satisfy:

M(⊥)=0

M(⊤)=V

M(ϕ∧ψ)=M(ϕ)∩M(ψ)

M(ϕ∨ψ)=M(ϕ)+M(ψ)

For any ϕ⊢ψ∈T, M(ϕ)⊆M(ψ)

If you wish, this is “non-unitary quantum logic”. In this setting, I have a candidate polynomial-time algorithm for deciding consistency. First, we transform T into an equivalent theory s.t. all judgments are of the following forms:

a=⊥

a=⊤

a⊢b

Pairs of the form c=a∧b, d=a∨b.

Here, a,b,c,d∈A are propositional variables and “ϕ=ψ” is a shorthand for the pair of judgments ϕ⊢ψ and ψ⊢ϕ.

Second, we make sure that our T also satisfies the following “closure” properties:

If a⊢b and b⊢c are in T then so is a⊢c

If c=a∧b is in T then so are c⊢a and c⊢b

If c=a∨b is in T then so are a⊢c and b⊢c

If c=a∧b, d⊢a and d⊢b are in T then so is d⊢c

If c=a∨b, a⊢d and b⊢d are in T then so is c⊢d
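The five closure rules can be applied as a simple fixpoint computation. Below is a minimal Python sketch; the data layout (a set of pairs for a⊢b, and dictionaries mapping c to (a, b) for c=a∧b and c=a∨b) and the function name are assumptions made for illustration, not part of the original algorithm statement.

```python
def close_theory(entails, meets, joins):
    """Saturate the judgments a ⊢ b under the five closure rules.

    entails: set of pairs (a, b) meaning a ⊢ b
    meets:   dict c -> (a, b) meaning c = a ∧ b
    joins:   dict c -> (a, b) meaning c = a ∨ b
    """
    closed = set(entails)
    for c, (a, b) in meets.items():      # c = a ∧ b gives c ⊢ a and c ⊢ b
        closed |= {(c, a), (c, b)}
    for c, (a, b) in joins.items():      # c = a ∨ b gives a ⊢ c and b ⊢ c
        closed |= {(a, c), (b, c)}
    changed = True
    while changed:
        new = set()
        # Transitivity: a ⊢ b and b ⊢ c give a ⊢ c.
        for (a, b) in closed:
            for (b2, c) in closed:
                if b == b2:
                    new.add((a, c))
        variables = {v for pair in closed for v in pair}
        # d ⊢ a and d ⊢ b give d ⊢ a ∧ b.
        for c, (a, b) in meets.items():
            for d in variables:
                if (d, a) in closed and (d, b) in closed:
                    new.add((d, c))
        # a ⊢ d and b ⊢ d give a ∨ b ⊢ d.
        for c, (a, b) in joins.items():
            for d in variables:
                if (a, d) in closed and (b, d) in closed:
                    new.add((c, d))
        changed = not new <= closed
        closed |= new
    return closed

closure = close_theory(
    entails={("d", "a"), ("d", "b"), ("a", "f"), ("b", "f")},
    meets={"c": ("a", "b")},
    joins={"e": ("a", "b")},
)
```

Since there are at most |A|² judgments of the form a⊢b and every pass either adds one or stops, the loop terminates after polynomially many passes.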

Third, we assign to each a∈A a real-valued variable xa. Then we construct a linear program for these variables consisting of the following inequalities:

For any a∈A: 0≤xa≤1

For any a⊢b in T: xa≤xb

For any pair c=a∧b and d=a∨b in T: xc+xd=xa+xb

For any a=⊥: xa=0

For any a=⊤: xa=1

Conjecture: the theory is consistent if and only if the linear program has a solution. To see why it might be so, notice that for any model M we can construct a solution by setting

$$x_a := \frac{\dim M(a)}{\dim M(\top)}$$
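The equality constraint holds for this choice because of the dimension identity dim(U∩W)+dim(U+W)=dim U+dim W for subspaces. Here is a minimal Python sketch that checks the linear program's constraints on one hypothetical model built from coordinate subspaces (each subspace is the span of a subset of a fixed basis, so intersection and sum become set intersection and union). The names and the example model are assumptions for illustration.

```python
from fractions import Fraction

def lp_point_from_model(model, dim):
    """x_a := dim M(a) / dim M(⊤), for coordinate subspaces given as sets of basis indices."""
    return {a: Fraction(len(basis), dim) for a, basis in model.items()}

def satisfies_lp(x, entails, pairs, bottoms, tops):
    """Check the LP constraints; pairs holds tuples (c, d, a, b) with c = a ∧ b and d = a ∨ b."""
    return (
        all(0 <= v <= 1 for v in x.values())
        and all(x[a] <= x[b] for a, b in entails)
        and all(x[c] + x[d] == x[a] + x[b] for c, d, a, b in pairs)
        and all(x[a] == 0 for a in bottoms)
        and all(x[a] == 1 for a in tops)
    )

# A hypothetical model in V = R^4: M(a) = span(e0, e1), M(b) = span(e1, e2).
model = {"a": {0, 1}, "b": {1, 2}}
model["c"] = model["a"] & model["b"]   # intersection of coordinate subspaces
model["d"] = model["a"] | model["b"]   # sum of coordinate subspaces
model["bot"], model["top"] = set(), {0, 1, 2, 3}

x = lp_point_from_model(model, dim=4)
ok = satisfies_lp(
    x,
    entails=[("c", "a"), ("c", "b"), ("a", "d"), ("b", "d")],
    pairs=[("c", "d", "a", "b")],
    bottoms=["bot"],
    tops=["top"],
)
```

This only illustrates the easy direction (model implies LP solution); it says nothing about the conjectured converse.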

I don’t have a full proof for the converse but here are some arguments. If a solution exists, then it can be chosen to be rational. We can then rescale it to get integers which are candidate dimensions of our subspaces. Consider the space of all ways to choose subspaces of these dimensions s.t. the constraints coming from judgments of the form a⊢b are satisfied. This is a moduli space of poset representations. It is easy to see it’s non-empty (just let the subspaces be spans of vectors taken from a fixed basis). By Proposition A.2 in Futorny and Iusenko it is an irreducible algebraic variety. Therefore, to show that we can also satisfy the remaining constraints, it is enough to check that (i) the remaining constraints are open (ii) each of the remaining constraints (considered separately) holds at some point of the variety. The first is highly likely and the second is at least plausible.

The algorithm also seems to have a natural extension to the original infra-Bayesian setting.

When using infra-Bayesian logic to define a simplicity prior, it is natural to use “axiom circuits” rather than plain formulae. That is, when we write the axioms defining our hypothesis, we are allowed to introduce “shorthand” symbols for repeating terms. This doesn’t affect the expressiveness, but it does affect the description length. Indeed, eliminating all the shorthand symbols can increase the length exponentially.

Instead of introducing all the “algebrator” logical symbols, we can define T as the quotient by the equivalence relation defined by the algebraic laws. We then need only two extra logical atomic terms:

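For any $n \in \mathbb{N}$ and $\sigma \in S_n$ (permutation), denote $n := \sum_{i=1}^{n} 1$ (as a type) and require $\sigma^{+} \in F_{n \to n}$

For any $n \in \mathbb{N}$ and $\sigma \in S_n$, $\sigma^{\times}_{\alpha} \in F_{\alpha^n \to \alpha^n}$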

However, if we do this then it’s not clear whether deciding that an expression is a well-formed term can be done in polynomial time. Because, to check that the types match, we need to test the identity of algebraic expressions and opening all parentheses might result in something exponentially long.

Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just imagine that types are variables over Q, and start from testing the identity of the types appearing inside parentheses), so we can validate expressions in randomized polynomial time (and, given standard conjectures, in deterministic polynomial time as well).
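A minimal Python sketch of how such a randomized check might look, under my own assumptions: types are encoded as nested tuples, arithmetic is done modulo a fixed prime instead of over Q, and a predicate type (α) is treated as a fresh random variable indexed by the value its inner type evaluates to (my reading of “start from testing the identity of the types appearing inside parentheses”). Equal types always evaluate equally; unequal types collide only with small probability.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime; the field size is an arbitrary choice

def make_evaluator(seed=None):
    rng = random.Random(seed)
    atom_values = {}   # one random field element per individual type
    pred_values = {}   # one random field element per *value* of the inner type

    def evaluate(t):
        tag = t[0]
        if tag == "0":
            return 0
        if tag == "1":
            return 1
        if tag == "atom":
            return atom_values.setdefault(t[1], rng.randrange(PRIME))
        if tag == "pred":
            # Identify (α) with (β) whenever α and β evaluate to the same value.
            inner = evaluate(t[1])
            return pred_values.setdefault(inner, rng.randrange(PRIME))
        left, right = evaluate(t[1]), evaluate(t[2])
        return (left + right) % PRIME if tag == "+" else (left * right) % PRIME

    return evaluate

def probably_equal(s, t, trials=5):
    """True if s and t agree at `trials` independent random evaluation points."""
    for _ in range(trials):
        evaluate = make_evaluator()
        if evaluate(s) != evaluate(t):
            return False
    return True

a, b, one = ("atom", "a"), ("atom", "b"), ("1",)
lhs = ("*", a, ("+", b, one))            # α × (β + 1)
rhs = ("+", ("*", a, b), a)              # α × β + α
same = probably_equal(lhs, rhs)
preds = probably_equal(("pred", ("*", a, b)), ("pred", ("*", b, a)))
different = probably_equal(("*", a, b), ("+", a, b))
```

The evaluation never expands parentheses, so its cost is linear in the size of the type expression.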

In the anthropic trilemma, Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution.

Consider a population of robots, each of which is a regular RL agent. The environment produces the observations of the robots, but can also make copies or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the “physical” baseline. Indeed, suppose that a particular observation c has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have c much more often in their history than the physical frequency with which c is encountered, due to the resulting “selection bias”. We call this setting “anthropic RL” (ARL).

The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the “physical” perspective. For example, we can consider an “anthropic MDP” (AMDP). An AMDP has finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗). The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are mutually inconsistent.

[EDIT: Actually, given that we don’t care about the order of robots, the signature of the transition kernel should be $T: A \times S \to \Delta\mathbb{N}^S$]

The consistency that is violated is exactly the causality property of environments. Luckily, we know how to deal with acausality: using the IB causal-acausal correspondence! The result can be described as follows: Murphy chooses a time moment n∈N and guesses the robot policy π until time n. Then, a simulation of the dynamics of (π,T) is performed until time n, and a single history is sampled from the resulting population. Finally, the observations of the chosen history unfold in reality. If the agent chooses an action different from what is prescribed, Nirvana results. Nirvana also happens after time n (we assume Nirvana reward 1 rather than ∞).

This IB hypothesis is consistent with what the average robot sees at any given moment of time. Therefore, the average robot will learn this hypothesis (assuming learnability). This means that for $n \gg \frac{1}{1-\gamma} \gg 0$, the population of robots at time n has expected average utility with a lower bound close to the optimum for this hypothesis. I think that for an AMDP this should equal the optimum expected average utility you can possibly get, but it would be interesting to verify.

Curiously, the same conclusions should hold if we do a weighted average over the population, with any fixed method of weighting. Therefore, the posterior of the average robot behaves adaptively depending on which sense of “average” you use. So, your epistemology doesn’t have to fix a particular method of counting minds. Instead different counting methods are just different “frames of reference” through which to look, and you can be simultaneously rational in all of them.

Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a “memory MDP” that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.

I’m not sure what you mean by that “unrolling”. Can you write a mathematical definition?

Let’s consider a simple example. There are two states: s0 and s1. There is just one action so we can ignore it. s0 is the initial state. An s0 robot transitions into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What will our population look like?

0th step: all robots remember s0

1st step: all robots remember s0s1

2nd step: 1/2 of robots remember s0s1s0 and 1/2 of robots remember s0s1s1

3rd step: 1/3 of robots remember s0s1s0s1, 1/3 of robots remember s0s1s1s0 and 1/3 of robots remember s0s1s1s1

There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have Pr[s0s1s0]=1/2 and Pr[s0s1s1]=1/2. But, to be consistent with step 3 we must have Pr[s0s1s0]=1/3, Pr[s0s1s1]=2/3.

In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.

Or, at least it’s not obvious there is such a hypothesis. In this example, the ratio Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don’t know, maybe for finite state spaces it can work. Would definitely be interesting to check.
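The numbers in this example are easy to reproduce by brute force. The following Python sketch (function names are mine) enumerates the population of histories step by step and computes the probability that a uniformly sampled robot has a given prefix in its memory.

```python
from collections import Counter
from fractions import Fraction

# Children of a robot by its current state: s0 -> [s1], s1 -> [s0, s1].
CHILDREN = {"s0": ["s1"], "s1": ["s0", "s1"]}

def population(step):
    """Multiset of histories (tuples of states) held by the robots at a given step."""
    robots = Counter({("s0",): 1})
    for _ in range(step):
        nxt = Counter()
        for history, count in robots.items():
            for child in CHILDREN[history[-1]]:
                nxt[history + (child,)] += count
        robots = nxt
    return robots

def prefix_probability(step, prefix):
    """Probability that a uniformly sampled robot at `step` remembers a history starting with `prefix`."""
    robots = population(step)
    total = sum(robots.values())
    hits = sum(c for h, c in robots.items() if h[:len(prefix)] == prefix)
    return Fraction(hits, total)

p2 = prefix_probability(2, ("s0", "s1", "s1"))   # step 2
p3 = prefix_probability(3, ("s0", "s1", "s1"))   # step 3
late_ratio = prefix_probability(20, ("s0", "s1", "s1")) / prefix_probability(20, ("s0", "s1", "s0"))
```

The population sizes follow the Fibonacci sequence, which is where the golden ratio comes from.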

[EDIT: actually, in this example there is such a hypothesis but in general there isn’t, see below]

Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states s0, s1 with the memory states [s0], [s1], [s0,s0], [s0,s1], etc. The action takes a robot in [s0] to memory state [s0,s1], and a robot in [s0,s1] to one robot in [s0,s1,s0] and another in [s0,s1,s1].

(Skip this paragraph unless the specifics of what’s going on aren’t obvious: given a transition distribution P(s′∗|s,π) (P being the distribution over sets of states s’* given starting state s and policy π), we can define the memory transition distribution P(s′∗m|sm,π) given policy π and starting “memory state” sm∈S∗ (Note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of sm into the transition distribution as the current state. Then for each s′∗ in the domain, for each element in s′∗ we concatenate that element onto the end of sm and collect these s′m into a set s′∗m, which is assigned the same probability P(s′∗).)

So now at time t=2, if you sample a robot, the probability that its state begins with [s0,s1,s1] is 0.5. And at time t=3, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it’s just that we’ve turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.

I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

I’m not quite sure what you are trying to say here, probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It’s like in classical RL theory, when you’re proving a regret bound or whatever, your probability space consists of histories.

I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I’m talking about hypotheses which conform to the classical “cybernetic agent model”. If you wish, we can call it “Bayesian cybernetic hypothesis”.

Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.

Indeed, consider some $T: S \to \Delta\mathbb{N}^S$. We can take its expected value to get $ET: S \to \mathbb{R}^S_+$. Assuming the chain is communicating, ET is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector $\eta \in \mathbb{R}^S_+$. We then get the subjective transition kernel:

$$ST(t \mid s) = \frac{ET(t \mid s)\,\eta_t}{\sum_{t' \in S} ET(t' \mid s)\,\eta_{t'}}$$
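For the two-state chain discussed earlier in the thread (an s0 robot becomes an s1 robot, an s1 robot becomes one s0 and one s1 robot), the subjective kernel can be computed numerically. The Python sketch below makes two assumptions of mine: ET is stored with rows indexed by the current state s and columns by the child state t, and η is taken to be the right Perron eigenvector of that matrix, approximated by plain power iteration (which converges here because the matrix is aperiodic; a periodic chain would need a different method).

```python
def perron_vector(matrix, iterations=200):
    """Approximate the maximal right eigenvector of a non-negative matrix by power iteration."""
    n = len(matrix)
    eta = [1.0] * n
    for _ in range(iterations):
        eta = [sum(matrix[s][t] * eta[t] for t in range(n)) for s in range(n)]
        norm = sum(eta)
        eta = [v / norm for v in eta]
    return eta

def subjective_kernel(expected_children):
    """ST(t | s) = ET(t | s) * eta_t / sum over t' of ET(t' | s) * eta_t'."""
    eta = perron_vector(expected_children)
    kernel = []
    for row in expected_children:
        weights = [e * h for e, h in zip(row, eta)]
        total = sum(weights)
        kernel.append([w / total for w in weights])
    return kernel

# Rows: current state s (s0, s1). Columns: child state t.
# Entries: expected number of t-children of an s-robot.
ET = [[0.0, 1.0],
      [1.0, 1.0]]
ST = subjective_kernel(ET)
golden = (1 + 5 ** 0.5) / 2
```

Here ST[1][1] / ST[1][0] comes out as the golden ratio, matching the late-time limit of Pr[s0s1s1]/Pr[s0s1s0] in that example.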

Now, consider the following example of an AMDP. There are three actions A:={a,b,c} and two states S:={s0,s1}. When we apply a to an s0 robot, it creates two s0 robots, whereas when we apply a to an s1 robot, it leaves one s1 robot. When we apply b to an s1 robot, it creates two s1 robots, whereas when we apply b to an s0 robot, it leaves one s0 robot. When we apply c to any robot, it results in one robot whose state is s0 with probability 1/2 and s1 with probability 1/2.

Shortform #101 Conversation experiment: Tabooing $topics after previously discussed for $time

Do you ever find yourself talking at length, sometimes repetitively about the same topic(s)? Do you notice others around you doing similarly? Experiment with tabooing that topic or topics (with consent of your conversation partner(s), of course) for an evening (or whatever your time interval is) and see what happens!

Spur of the moment tonight, I asked my friend after we finished coworking and were hanging out if they wanted to try the conversation topic taboo experiment. They said yes, and that led to us having a few additional really good conversations tonight in addition to the tabooed topic conversations we had had earlier in the night (those were good too, please note that tabooing a topic doesn’t mean a topic is bad necessarily).

I will be proposing this experiment at Virginia Rationalists’ next meetup (tomorrow!) and see how that goes.

Below is a list of powerful optimizers ranked on properties, as part of a brainstorm on whether there’s a simple core of consequentialism that excludes corrigibility. I think that AlphaZero is a moderately strong argument that there is a simple core of consequentialism which includes inner search.

Properties

Simple: takes less than 10 KB of code. If something is already made of agents (markets and the US government) I marked it as N/A.

Coherent: approximately maximizing a utility function most of the time. There are other definitions:

Not being money-pumped

Nate Soares’s notion in the MIRI dialogues: having all your actions point towards a single goal

John Wentworth’s setup of Optimization at a Distance

Adversarially coherent: something like “appears coherent to weaker optimizers” or “robust to perturbations by weaker optimizers”. This implies that it’s incorrigible.

Sufficiently optimized agents appear coherent—Arbital

will achieve high utility even when “disrupted” by an optimizer somewhat less powerful

Search+WM: operates by explicitly ranking plans within a world-model. Evolution is a search process, but doesn’t have a world-model. The contact with the territory it gets comes from directly interacting with the world, and this is maybe why it’s so slow

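| Thing | Simple? | Coherent? | Adv. coherent? | Search+WM? |
|---|---|---|---|---|
| [name missing] | N | Y | Sometimes | Y |
| [name missing] | Y | Y | N | Y |
| [name missing] | N | Y | Y | Y |
| [name missing] | Y | Y | Y | Y |
| [name missing] | N/A | Y | Y | Y |
| US government | N/A | Y | N | Y |
| Evolution | Y | N | N | N |

Notes: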

Humans are not adversarially coherent: prospect theory and other cognitive biases can be exploited, indoctrination, etc.

AIXI-tl is not adversarially coherent because it is an embedded agent and can be switched off etc.

AlphaZero: when playing chess, you can use another strategy and it still wins

Markets are inexploitable, but they don’t do search in a world-model other than the search done by individual market participants

The US government is not adversarially coherent in most circumstances, even if its subparts are coherent; lobbying can affect the US government’s policies, and it is meant to be corrigible by the voting population.

Evolution is not coherent: species often evolve to extinction; foxes and rabbits, etc.

There is no concrete definition of what an AGI actually is. Discussions around this subject matter mostly revolve around integrating nondeterministic systems in modern civilization. Humans on a very basic level operate in both deterministic and nondeterministic ways, and one may substitute for another even in the same domain of an individual just because of different context or time. Reality is similar in that it constitutes both determinism and nondeterminism, but cognition has only been able to operate in the deterministic space, thus throughout most of history, we’ve been dealing mostly with deterministic systems, at least among the systems created by humans through status functions. Artificial nondeterministic systems are still something new to current civilization.

Master post for alignment protocols.

Other relevant shortforms:

Autocalibrated quantilized debate

Hippocratic principle

IDA variants

Dialogic RL

More dialogic RL

## Precursor Detection, Classification and Assistance (PreDCA)

Infra-Bayesian physicalism provides us with two key building blocks:

Given a hypothesis about the universe, we can tell which programs are running. (This is just the bridge transform.)

Given a program, we can tell whether it is an agent, and if so, which utility function it has[1] (the “evaluating agent” section of the article).

I will now outline how we can use these building blocks to solve both the inner and outer alignment problem. The rough idea is:

For each hypothesis in the prior, check which agents are precursors of our agent according to this hypothesis.

Among the precursors, check whether some are definitely neither humans nor animals nor previously created AIs.

If there are precursors like that, discard the hypothesis (it is probably a malign simulation hypothesis).

If there are no precursors like that, decide which of them are humans.

Follow an aggregate of the utility functions of the human precursors (conditional on the given hypothesis).

## Detection

How to identify agents which are our agent’s precursors? Let our agent be G and let H be another agent which exists in the universe according to hypothesis Θ[2]. Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual “H follows σ” to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn’t run).

A possible complication is, what if Θ implies that H creates G / doesn’t interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it. It is possible that any such Θ would have a sufficiently large description complexity penalty that it doesn’t matter. On the second hand, if Θ is unconditionally Knightian uncertain about H creating G then the utility will be upper bounded by the scenario in which G doesn’t exist, which is liable to make Θ an effectively falsified hypothesis. On the third hand, it seems plausible that the creation of G by H would be contingent on G’s behavior (Newcomb-style, which we know how to handle in infra-Bayesianism), in which case Θ is not falsified and the detection works. In any case, there is a possible variant of the definition to avoid the problem: instead of examining only Θ we also examine coarsenings of Θ which are not much more complex to describe (in the hope that some such coarsening would leave the creation of G uncertain).

Notice that any agent whose existence is contingent on G’s policy cannot be detected as a precursor: the corresponding program doesn’t even “run”, because we don’t apply a G-policy-counterfactual to the bridge transform.
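As a toy illustration of the detection criterion (not the infra-Bayesian construction itself), one can model a hypothesis as a plain function from H-policies to the set of programs that run, and call H a precursor of G if some H-policy leaves G out of that set. Everything here is a simplifying assumption: the names `Universe`, `is_precursor`, `theta` and `theta_bystander` are hypothetical, policies are just labels, and the sketch presumes Θ leaves H's policy open, which is exactly the case where the hard counterfactual is well-defined.

```python
from typing import Callable, FrozenSet, Iterable

# Toy stand-in for a hypothesis Θ: maps an H-policy (here just a label)
# to the set of programs that run in the resulting universe.
# Assumption for illustration only; IBP uses the bridge transform instead.
Universe = Callable[[str], FrozenSet[str]]


def is_precursor(universe: Universe, h_policies: Iterable[str], g: str) -> bool:
    """Return True if some H-policy σ leads to a universe where G does not run."""
    return any(g not in universe(sigma) for sigma in h_policies)


def theta(sigma: str) -> FrozenSet[str]:
    # Hypothetical universe: H runs in any case,
    # but the AI G only runs if H chooses to build it.
    running = {"H"}
    if sigma == "build_ai":
        running.add("G")
    return frozenset(running)


def theta_bystander(sigma: str) -> FrozenSet[str]:
    # Hypothetical universe where G runs no matter what H does.
    return frozenset({"H", "G"})


policies = ["build_ai", "go_fishing"]
print(is_precursor(theta, policies, "G"))            # H can prevent G from running
print(is_precursor(theta_bystander, policies, "G"))  # H cannot affect whether G runs
```

In the first universe H counts as a precursor because the policy `go_fishing` prevents G from running; in the second, no H-policy does.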

## Classification

How to decide which precursors are which? One tool we have is the g parameter and the computational resource parameters in the definition of intelligence. In addition we might be able to create a very rough neuroscience-based model of humans. Also, we will hopefully have a lot of information about other AIs that can be relevant. Using these, it might be possible to create a rough benign/malign/irrelevant classifier, s.t.

Humans are classified as “benign”.

Most (by probability mass) malign simulation hypotheses contain at least one precursor classified as “malign”.

Non-human agents that exist in the causal past of our AI in the null (non-simulation) hypothesis are classified as “irrelevant”.

## Assistance

Once we have detected and classified precursors in each hypothesis, we discard all hypotheses that contain malign precursors. In the remaining hypotheses, we perform some kind of aggregation on the utility functions of the benign precursors (for example, this). The utility functions from different hypotheses are somehow normalized to form the overall utility function. Alternatively, we do a maximal lottery vote for the policy, where each hypothesis is a voter with weight proportional to its prior probability mass.
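The first variant (discard, aggregate, normalize) can be sketched on a finite toy model. This is only a sketch under loud assumptions: the `Hypothesis` record, the plain average over benign precursors, and the rescaling of each utility to [0, 1] are all hypothetical stand-ins for the aggregation and normalization that the text leaves unspecified, and the maximal lottery variant is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Outcome = str
Utility = Callable[[Outcome], float]


@dataclass
class Hypothesis:
    prior: float                   # prior probability mass
    precursors: Dict[str, str]     # name -> "benign" / "malign" / "irrelevant"
    utilities: Dict[str, Utility]  # utility function of each benign precursor


def normalize(u: Utility, outcomes: List[Outcome]) -> Utility:
    """Rescale u to [0, 1] over the outcomes (an assumed normalization)."""
    lo = min(u(o) for o in outcomes)
    span = max(u(o) for o in outcomes) - lo
    return (lambda o: 0.0) if span == 0 else (lambda o: (u(o) - lo) / span)


def aggregate_utility(hyps: List[Hypothesis], outcomes: List[Outcome]) -> Utility:
    # Step 1: discard every hypothesis that contains a malign precursor.
    kept = [h for h in hyps if "malign" not in h.precursors.values()]
    total = sum(h.prior for h in kept)

    def overall(o: Outcome) -> float:
        value = 0.0
        for h in kept:
            benign = [n for n, label in h.precursors.items() if label == "benign"]
            if not benign:
                continue
            # Step 2: average the normalized utilities of the benign precursors.
            avg = sum(normalize(h.utilities[n], outcomes)(o) for n in benign) / len(benign)
            # Step 3: weight by the renormalized prior mass of the hypothesis.
            value += (h.prior / total) * avg
        return value

    return overall


outcomes = ["park", "paperclips"]
hyps = [
    Hypothesis(0.6, {"alice": "benign"},
               {"alice": lambda o: 1.0 if o == "park" else 0.0}),
    Hypothesis(0.3, {"alice": "benign", "dog": "irrelevant"},
               {"alice": lambda o: 2.0 if o == "park" else 1.0}),
    Hypothesis(0.1, {"simulator": "malign"}, {}),  # discarded in step 1
]
u = aggregate_utility(hyps, outcomes)
print(u("park"), u("paperclips"))
```

The third hypothesis contributes nothing to `u`, because it contains a precursor labeled malign and is dropped before any aggregation happens.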

## Inner Alignment

Why can this solve inner alignment? In any model-based approach, the AI doesn’t train the policy directly. Instead, it trains models and uses them to compute the policy. I suspect that the second step cannot create mesa-optimizers, since it only involves control and not learning[3]. Hence, any mesa-optimizer has to originate from the first step, i.e. from the model/hypothesis. And, any plausible physicalist hypothesis which contains a mesa-optimizer has to look like a malign simulation hypothesis.

## Outer Alignment

Why can this solve outer alignment? Presumably, we are aggregating human utility functions. This doesn’t assume humans are perfect agents: g can be less than infinity. I suspect that when g<∞ the utility function becomes somewhat ambiguous, but the ambiguity can probably be resolved arbitrarily or maybe via a risk-averse method. What if the AI modifies the humans? Then only pre-modification humans are detected as precursors, and there’s no problem.

Moreover, the entire method can be combined with the Hippocratic principle to avoid catastrophic mistakes out of ignorance (i.e. to go from intent alignment to impact alignment).

[1] We do need a lot more research to fully specify this “utility reconstruction” and check that it satisfies reasonable desiderata. But, the existence of a natural utility-function-dependent measure of intelligence suggests it is possible.

[2] I’m ignoring details like “what if H only exists with certain probability”. The more careful analysis is left for later.

[3] In modern deep RL systems, there might not be a clear line between learning and control. For example, if we use model-free RL to produce the policy for a given hypothesis, then there is learning happening there as well. In such an architecture, the value function or Q-function should be regarded as part of the hypothesis for our purpose.

Can you please explain how this does not match the definition? I don’t yet understand all the math, but intuitively, if H creates G / doesn’t interfere with the creation of G, then if H instead followed the policy “do not create G / do interfere with the creation of G”, G’s code wouldn’t run?

Can you please give an example of a precursor that does match the definition?

The problem is that if Θ implies that H creates G but you consider a counterfactual in which H doesn’t create G then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it “hard counterfactuals”) only makes sense when the condition you’re counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in general). In a child post I suggested solving this by defining “soft counterfactuals” where you consider coarsenings of Θ in addition to Θ itself.

Here’s a video of a talk I gave about PreDCA.

Two more remarks.

## User Detection

It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.

Given two agents G and H, we can ask which points on G’s timeline are in the causal past of which points of H’s timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval) on G’s timeline, and measure the mutual information between this action(s) and H’s observations at some interval on H’s timeline.

Using this, we can effectively construct a future “causal cone” emanating from the AI’s origin, and also a past causal cone emanating from some time t on the AI’s timeline. Then, “nearby” agents will meet the intersection of these cones for low values of t whereas “faraway” agents will only meet it for high values of t or not at all. To first approximation, the user would be the “nearest” precursor[1] agent i.e. the one meeting the intersection for the minimal t.

More precisely, we expect the user’s observations to have nearly maximal mutual information with the AI’s actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI’s sensors measure every nerve signal emanating from the user’s brain? To address this, we can fix t to a value s.t. we expect only the user to meet the intersection of cones, and have the AI select the agent which meets this intersection for the highest mutual information threshold.
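The mutual-information test can be illustrated with a small simulation. This is a sketch under assumptions, not part of the protocol: actions and observations are single bits, the “nearby” and “faraway” agents are hypothetical noise models, and `mutual_information` is an ordinary plug-in estimator from paired samples rather than anything infra-Bayesian.

```python
import math
import random
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate (in nats) of I(X;Y) from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # Sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) p(y))).
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


rng = random.Random(0)
# G takes uniformly random binary actions (the counterfactual in the text).
actions = [rng.randint(0, 1) for _ in range(5000)]
# Hypothetical "nearby" agent: sees G's action, flipped 5% of the time.
near_obs = [a ^ (rng.random() < 0.05) for a in actions]
# Hypothetical "faraway" agent: observations independent of G's action.
far_obs = [rng.randint(0, 1) for _ in actions]

print(mutual_information(actions, near_obs))  # well above zero
print(mutual_information(actions, far_obs))   # close to zero
```

Thresholding such an estimate is one way to picture “meeting the intersection of cones”: the nearby agent clears a high threshold, the faraway one does not.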

This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user.

## More on Counterfactuals

In the parent post I suggested “instead of examining only Θ we also examine coarsenings of Θ which are not much more complex to describe”. A possible elegant way to implement this:

Consider the entire portion ¯Θ of our (simplicity) prior which consists of coarsenings of Θ.

Apply the counterfactual to ¯Θ.

Renormalize the result from HUC to HUD.

[1] We still need precursor detection, otherwise the AI can create some new agent and make it the nominal “user”.

Some additional thoughts.

## Non-Cartesian Daemons

These are notoriously difficult to deal with. The only methods I know that are applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.

## Weaknesses

My main concerns with this approach are:

The possibility of major conceptual holes in the definition of precursors. More informal analysis can help, but ultimately mathematical research in infra-Bayesian physicalism in general and infra-Bayesian cartesian/physicalist multi-agent interactions in particular is required to gain sufficient confidence.

The feasibility of a good enough classifier. At present, I don’t have a concrete plan for attacking this, as it requires inputs from outside of computer science.

Inherent “incorrigibility”: once the AI becomes sufficiently confident that it correctly detected and classified its precursors, its plans won’t defer to the users any more than the resulting utility function demands. On the second hand, I think the concept of corrigibility is underspecified so much that I’m not sure it is solved (rather than dissolved) even in the Book. Moreover, the concern can be ameliorated by sufficiently powerful interpretability tools. It is therefore desirable to think more of how to achieve interpretability in this context.

There’s a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD, which we call autocalibrating quantilized RL (AQRL).

First, suppose that we are able to formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn’t contain terms such as “oh btw don’t kill people while you’re building the nanosystem”. However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the “relative difficulty assumption” (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.
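For readers unfamiliar with quantilization, here is a minimal sketch of a quantilizer over a finite set of candidate actions. It is an illustration under assumptions: the function name `quantilize`, the finite sample of base-policy actions, and the convention that smaller q means stronger optimization (q = 1 is the base policy, q near 0 approaches the argmax) are all choices made here, not details taken from the protocol.

```python
import math
import random
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    base_samples: Sequence[A],
    reward: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Pick uniformly among the top q-fraction of base samples by reward."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked: List[A] = sorted(base_samples, key=reward, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))  # size of the retained top slice
    return rng.choice(ranked[:k])


rng = random.Random(0)
# Hypothetical base policy samples: integers standing in for candidate plans.
plans = list(range(100))
print(quantilize(plans, lambda p: float(p), 0.1, rng))  # one of the top-scoring plans
```

Under the RDA, the hope is that some q is small enough for the intended solution to show up in the retained slice while attacks remain too rare in the base distribution to be selected.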

We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relatively to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relatively to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.
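The tuning loop can be pictured with a deliberately simplified sketch. It assumes a known finite list of rewards for base-policy actions and a fixed geometric schedule for the quantilization parameter q (smaller q meaning stronger optimization, a convention chosen here); the names `quantilized_value` and `autocalibrate` are hypothetical. It does not model the learning problem, the epistemic vs. aleatoric distinction, or the safety bound on overshooting, all of which the actual algorithm would need.

```python
import math
from typing import Optional, Sequence


def quantilized_value(rewards: Sequence[float], q: float) -> float:
    """Expected reward of picking uniformly among the top q-fraction."""
    ranked = sorted(rewards, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))
    return sum(ranked[:k]) / k


def autocalibrate(
    rewards: Sequence[float],
    goal: float,
    shrink: float = 0.5,
    q_min: float = 1e-3,
) -> Optional[float]:
    """Return the largest q on the schedule 1, shrink, shrink^2, ... meeting the goal."""
    q = 1.0
    while q >= q_min:
        if quantilized_value(rewards, q) >= goal:
            return q   # mildest optimization that reaches the performance goal
        q *= shrink    # otherwise optimize slightly harder
    return None        # goal not reached within the allowed range


# Hypothetical rewards of 100 base-policy actions.
print(autocalibrate(list(range(100)), goal=90))
```

Stopping at the first q that meets the goal is the point of the exercise: the system never applies more optimization pressure than the performance goal requires.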

Notice that hardening the system (i.e. making it more secure using conventional methods) increases the likelihood of the RDA. That is, although it is infeasible to manually harden the system against superintelligent attacks, it can very well be feasible to harden it against attacks available at the minimal level of optimization power (= quantilization parameter) required to meet the performance goal, at least to such an extent that the intended solution becomes more appealing for the AI.

AQRL involves the risk that the RDA is false, even though it seems true to the designers. To mitigate this risk, we can deploy honeypots. That is, we add auxiliary goals to our reward function s.t. (i) if the goal is reached, maximal total reward is awarded (ii) the goal is harmless in itself (iii) reaching the goal is still harder than the intended solution, but by a smaller margin than performing attacks (iv) we can detect the event of reaching the goal. For example, suppose that the AI is not supposed to have Internet access, so gaining Internet access is a path to a variety of possible attacks. Then, we can add an auxiliary goal of the form “send such-and-such innocuous packet to such-and-such IP address”. If the RDA is false and the AI succeeds in gaining Internet access earlier (on the quantilization parameter scale) than reaching the performance goal, once the AI has sent the packet it will randomize all its remaining actions (since it is operating at the earliest quantilization parameter which allows this). The users will then be alerted that security was breached and rethink their next steps.

Now, suppose that we can only specify the reward function via concepts that have to be learned from examples. This opens the risk of generalization failure. Such risk is not necessarily a problem since we are still protected by quantilization. However, it needs to be factored into our safety bounds. An especially concerning possibility is that the AI will interpret the concepts as “would be labeled by humans as such-and-such” instead of the intended interpretation, creating incentives to deceive. Moreover, this encourages learning human models which puts more pressure on the RDA by making it easier to attack.

In order to avoid learning human models, we can use methods along these lines. Specifically, suppose that, in addition to the dataset for learning our task-relevant concepts, we supply a dataset designed specifically for learning human models. We can then design an algorithm that produces models of the task-relevant concepts s.t. the human-centric dataset cannot be compressed w.r.t. them. That is, the algorithm makes sure that access to the models it produces doesn’t help with learning the human-centric dataset, implying that the concept models don’t contain human models. Moreover, we can extend this approach to the entire process, s.t. human models are not learned from any other source of information as well.

Notice that although this method excludes human models from the actual RL process, it encourages them in intermediate computations, since we are actively supplying a human-centric dataset to the system. But, this is not dangerous if the intermediate computation is hardened against non-Cartesian daemons (a big “if” but one we need to deal with anyway).

I noticed that some people use “skeptical” to mean “my armchair reasoning is better than all expert knowledge and research, especially if I am completely unfamiliar with it”.

Example (not a real one): “I am skeptical about the idea that objects would actually change their length when their speed approaches the speed of light.”

The advantage of this usage is that it allows you to dismiss all expertise you don’t agree with, while making you sound a bit like an expert.

I suspect you’re reacting to the actual beliefs (disbelief in your example), rather than the word usage. In common parlance, “skeptical” means “assign low probability”, and that usage is completely normal and understandable.

The ability to dismiss expertise you don’t like is built into humans, not a feature of the word “skeptical”. You could easily replace “I am skeptical” with “I don’t believe” or “I don’t think it’s likely” or just “it’s not really true”.

I think that “skeptical” works better as a status move. If I say I don’t believe you, that makes us two equals who disagree. If I say I am skeptical… I kinda imply that you are not. Similarly, a third party now has the options to either join the skeptical or the non-skeptical side of the debate.

(Or maybe I’m just overthinking things, of course.)

Shortform #100 Writing publicly considered beneficial, fun, and not that scary

After writing one hundred shortform posts, writing publicly no longer feels scary and really just feels like a habit more than anything else (especially because the last 33 posts were near daily or daily). A habit I intend to continue as these are fun to write (even when I feel grumpy or hit an ugh field before doing so) and occupy a nice role in my life, plus I love growing my writing & other skills when creating these posts.

I feel a strong desire to write bigger posts than these, but I like having the shortforms as a consistent & foundational habit for regular public writing. Consistency and building foundational habits are of paramount importance; atop them you can build bigger and better things.

Here’s to continuing daily shortforms and growing what is next too!

Are there any modern competitors to metamed? Have a health issue, very willing to spend money to fix it.

https://www.facebook.com/groups/1781724435404945 might be a good way to find somebody who provides that kind of service.

You could post a research bounty on this site.

AGI will probably be deployed by a Moral Maze

Moral Mazes is my favorite management book ever, because instead of “how to be a good manager” it’s about “empirical observations of large-scale organizational dynamics involving management”.

I wish someone would write an updated version—a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.

My take (and the author’s take) is that any company of nontrivial size begins to take on the characteristics of a moral maze. It seems to be a pretty good null hypothesis—any company saying “we aren’t/won’t become a moral maze” has a pretty huge evidential burden to cross.

I keep this point in mind when thinking about strategy for when the time comes to make deployment decisions about AGI, and to deploy it. These decisions are going to be made within the context of a moral maze.

To me, this means that some strategies (“everyone in the company has a thorough and complete understanding of AGI risks”) will almost certainly fail. I think only strategies that work well inside of moral mazes will work at all.

To sum up my takes here:

basically every company eventually becomes a moral maze

AGI deployment decisions will be made in the context of a moral maze

understanding moral maze dynamics is important to AGI deployment strategy

Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes.

Facebook’s pivot to the “metaverse”, for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was overwhelmingly likely to deploy one of the first AGIs, I expect Mark Zuckerberg’s beliefs about AGI safety would be more important to understand than the general dynamics of moral mazes. (Facebook example deliberately chosen to avoid taking stances on the more likely AGI players, but I think it’s relatively clear which ones are moral mazes).

Agree that founders are a bit of an exception. Actually that’s a bit in the longer version of this when I talk about it in person.

Basically: “The only people who are at the very top of large tech companies are either founders or those who were able to climb to the tops of moral mazes”.

So my strategic corollary to this is that it’s probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often.

In the case of Facebook, even in the face of all of their history of actions, I think on the margin I’d prefer the founder to the median replacement to be leading the company.

(Edit: I don’t think founders remaining at the head of a company is evidence that the company isn’t a moral maze. Also I’m not certain I agree that Facebook’s pivot couldn’t have been done by a moral maze.)

Agreed on all points! One clarification is that large founder-led companies, including Facebook, are all moral mazes internally (i.e. from the perspective of the typical employee); but their founders often have so much legitimacy that their external actions are only weakly influenced by moral maze dynamics.

I guess that means that if AGI deployment is very incremental (a sequence of small changes to many different AI systems, that only in retrospect add up to AGI), moral maze dynamics will still be paramount, even in founder-led companies.

I think that’s right but also the moral maze will be mediating the information and decision making support that’s available to the leadership, so they’re not totally immune from the influences

Shortform #99 Glassblowing, Contemporary Art, & M.C. Escher! | Also, good conversations :)

If Saturday was a stay-in get attention hijacked kind of day, today was a “get the heck out of the house and do something” kind of day. I met up with a friend to visit the Chrysler Art Museum and attended an hourish long glassblowing demonstration which was absolutely fascinating, I loved it! I enjoyed walking around the modernist & contemporary sections of the museum (only stayed for a few hours so I didn’t go to all parts of the museum this time around), had a few decidedly new experiences during that and found those pleasant.

The M.C. Escher exhibit was amazing, perspective altering as expected, and inspiring :) Expressing mathematical and/or physics concepts via art is a beautiful thing, I want to see more of that.

After visiting the museum, I went to my favorite cafe and sat at the cafe’s big round table by a window. I studied some, but mostly had conversations with several different people throughout the day. Getting into intellectual conversations with strangers is truly enjoyable, you never know what you may learn or concepts you may be introduced to.

During the evening, I met with a friend & coworked for an hour before hanging out. That was a very productive hour; wow, do I have much left to do after the Retreat, and yet more organizing to do. I love that work so much though, it’s great :) My friend & I had excellent conversations after working, I’m glad we met up and look forward to the next time!

I do not have a desk in my office yet, and use my desktop while sitting on the floor. My foot fell asleep while writing this lol. That happens a fair bit, but it’s okay, I shall acquire desks soon hopefully and have a proper setup once again.

I have a new project for which I actively don’t want funding for myself: it’s too new and unformed to withstand the pressure to produce results for specific questions by specific times*. But if it pans out in ways other people value I wouldn’t mind retroactive payment. This seems like a good fit for impact certificates, which is a tech I vaguely want to support anyway.

Someone suggested that if I was going to do that I should mint and register the cert now, because that norm makes IC markets more informative, especially about the risk of very negative projects. This seems like a good argument but https://www.impactcerts.com/mint seems borked and I don’t know of better options. Now this is requiring thought and the whole point was to not have to do that yet.

So I’m crowdsourcing. What are your thoughts on this? What are potential best practices I should support? Counter arguments?

*my psychology is such that there’s no way around this that also guarantees speeding up the work. If someone wanted to fund the nice things for Elizabeth project I’d accept but no guarantee I would produce any faster. I *have* asked for funding for my collaborator and a TBD research assistant.

I will definitely not be sharing the object level project in this thread.

I hurt my hand so if my replies look weird that’s why.

I’m in the process of summarizing The Twelve Virtues of Rationality and don’t feel good about writing the portion on perfectionism

”...If perfection is impossible that is no excuse for not trying. Hold yourself to the highest standard you can imagine, and look for one still higher. Do not be content with the answer that is almost right; seek one that is exactly right.”

Sounds like destructive advice for a lot of people. I could add a personal disclaimer or adjust the tone away from “never feel satisfied” towards “don’t get complacent”, though that’s beyond what I feel a summarizer ought to do.

Similarly, the ‘argument’ virtue sounds like bad advice to take literally, unless tempered with a ‘shut up and be socially aware’ virtue.

I’d appreciate any perspective on this or what I should do.

Most advice is contraindicated for some people, so if it’s not a valid Law, it should only be called to attention, not all else equal given weight or influence beyond what calling something to attention normally merits. Even for Laws, there is no currently legible Law saying that people must or should follow them, that depends on the inscrutable values of Civilization. It’s not a given that people should optimize themselves for being agents. So advice for being an effective agent might be different from advice for being healthy or valuable, or understanding topos theory, or building 5 meters high houses of cards.

For perfectionism, I think never being satisfied with where you’re at now doesn’t mean you can’t take pride in how far you’ve come?

“Don’t feel complacent” feels different from “striving for perfection” to me. The former feels more like making sure your standards don’t drop too much (maintaining a good lower bound), whereas the latter feels more like pushing the upper limit. When I think about complacency, I think about being careful and making sure that I am not e.g. taking the easy way out because of laziness. When I think about perfectionism (in the 12 virtues sense), I think about imagining ways things can be better and finding ways to get closer to that ideal.

I don’t really understand the ‘argument’ virtue so no comment for that.

Thank you, I hadn’t noticed the difference but I agree that complacency is not the message.

I think I can word things the way you are and spread a positive message.

Thanks a lot, you’ve un-stumped me.

Shortform #98 Visual media consumption considered enjoyable but net harmful for now

In shortform #89 I declared another media diet. That diet ended yesterday evening so I tried out a TV show, Neil Gaiman’s “The Sandman” on Netflix. Which...promptly hijacked my attention from last night until the afternoon of today. I came across LRNZ’s “Golem” for free on Wednesday last week at my local cafe hangout, took it home, and devoured it in one sitting. I could keep listing examples, on and on...of all the times some visual media (digital or analog) has so utterly hacked my attention that I abandoned all else in favor of it...but that list would be ridiculously long.

Friction increasing or other interventions around visual media consumption that work well for me:

Video games: if the playtime is with a group of others and scheduled in advance, that poses no issues & maximises my enjoyment of the time (I don’t usually like playing video games alone anymore).

Movies: going over to a friend’s house or a theater makes this totally fine. I may also be okay with limiting to one per week because honestly I just don’t watch movies very much.

What I simply have to ban, unfortunately:

TV shows

YouTube videos or similar but from other platforms

Three exceptions: Video is from work & required to watch. Video is educational / from ROSE for a workshop. Video is short and was sent directly to me by a friend.

What I am uncertain about but likely need to ban:

Webcomics, manga, visual novels, comics

I don’t engage with such media very often, but when I have, they’ve been very hijack-y. A blanket ban with an exception for reading a set number of chapters or arcs per time interval with a group is probably the best bet.

After experiencing attention hacking or hijacking, my mood tends to crater, my mind is unfocused & hazy or a bit...loose (see yesterday’s shortform for a great example of that), and I feel bad about the time spent. Not guilt per se, more of a melancholy feeling about having missed out on more enjoyable non-hijacking activities.

So! Back to a media diet starting tomorrow 7 August and lasting through 10 September this year.

Master post for ideas about infra-Bayesianism.

Master post for ideas about infra-Bayesian physicalism.

Other relevant posts:

Incorrigibility in IBP

PreDCA alignment protocol

Infra-Bayesian physicalism is an interesting example in favor of the thesis that the more qualitatively capable an agent is, the less corrigible it is (a.k.a. “corrigibility is anti-natural to consequentialist reasoning”). Specifically, alignment protocols that don’t rely on value learning become vastly less safe when combined with IBP:

Example 1: Using steep time discount to disincentivize dangerous long-term plans. For IBP, “steep time discount” just means, predominantly caring about your source code running with particular short inputs. Such a goal strongly incentivizes the usual convergent instrumental goals: first take over the world, then run your source code with whatever inputs you want. IBP agents just don’t have time discount in the usual sense: a program running late in physical time is just as good as one running early in physical time.

Example 2: Debate. This protocol relies on a zero-sum game between two AIs. But, the monotonicity principle rules out the possibility of zero-sum! (If L and −L are both monotonic loss functions then L is a constant). So, in a “debate” between IBP agents, they cooperate to take over the world and then run the source code of each debater with the input “I won the debate”.

Example 3: Forecasting/imitation (an IDA in particular). For an IBP agent, the incentivized strategy is: take over the world, then run yourself with inputs showing you making perfect forecasts.

The conclusion seems to be, it is counterproductive to use IBP to solve the acausal attack problem for most protocols. Instead, you need to do PreDCA or something similar. And, if acausal attack is a serious problem, then approaches that don’t do value learning might be doomed.

Infradistributions admit an information-theoretic quantity that doesn’t exist in classical theory. Namely, it’s a quantity that measures how many bits of Knightian uncertainty an infradistribution has. We define it as follows:

Let X be a finite set and Θ a crisp infradistribution (credal set) on X, i.e. a closed convex subset of ΔX. Then, imagine someone trying to communicate a message by choosing a distribution out of Θ. Formally, let Y be any other finite set (space of messages), θ∈ΔY (prior over messages) and K:Y→Θ (communication protocol). Consider the distribution η:=θ⋉K∈Δ(Y×X). Then, the information capacity of the protocol is the mutual information between the projection on Y and the projection on X according to η, i.e. I_η(pr_X; pr_Y). The “Knightian entropy” of Θ is now defined to be the maximum of I_η(pr_X; pr_Y) over all choices of Y, θ, K. For example, if Θ is Bayesian then it’s 0, whereas if Θ=⊤_X, it is ln|X|.

Here is one application[1] of this concept, orthogonal to infra-Bayesianism itself. Suppose we model inner alignment by assuming that some portion ϵ of the prior ζ consists of malign hypotheses. And we want to design e.g. a prediction algorithm that will converge to good predictions without allowing the malign hypotheses to attack, using methods like confidence thresholds. Then we can analyze the following metric for how unsafe the algorithm is.

Let O be the set of observations and A the set of actions (which might be “just” predictions) of our AI, and for any environment τ and prior ξ, let D^ξ_τ(n)∈Δ(A×O)^n be the distribution over histories resulting from our algorithm starting with prior ξ and interacting with environment τ for n time steps. We have ζ=ϵμ+(1−ϵ)β, where μ is the malign part of the prior and β the benign part. For any μ′, consider D^{ϵμ′+(1−ϵ)β}_τ(n). The closure of the convex hull of these distributions for all choices of μ′ (“attacker policy”) is some Θ^β_τ(n)⊆Δ(A×O)^n. The maximal Knightian entropy of Θ^β_τ(n) over all admissible τ and β is called the malign capacity of the algorithm. Essentially, this is a bound on how much information the malign hypotheses can transmit into the world via the AI during a period of n. The goal then becomes finding algorithms with simultaneously good regret bounds and good (in particular, at most polylogarithmic in n) malign capacity bounds.

[1] This is an idea I’m collaborating on with Johannes Treutlein.
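The two boundary cases of Knightian entropy quoted above (0 for a Bayesian Θ, ln|X| for Θ=⊤_X) can be checked numerically for specific protocols. The sketch below is only an illustration: the function name `protocol_information` and the example numbers are assumptions, and the code evaluates the mutual information of one given protocol (θ, K) rather than maximizing over all protocols. For Θ=⊤_X the point-mass protocol already reaches ln|X|, which is also the largest value any mutual information with X can take.

```python
import math

def protocol_information(theta, K):
    """Mutual information (in nats) between message Y and outcome X under eta = theta ⋉ K.

    theta: list of message probabilities; K[y][x]: probability of outcome x given message y.
    """
    n_x = len(K[0])
    # Marginal distribution of X under eta.
    p_x = [sum(theta[y] * K[y][x] for y in range(len(theta))) for x in range(n_x)]
    info = 0.0
    for y, p_y in enumerate(theta):
        for x in range(n_x):
            joint = p_y * K[y][x]
            if joint > 0:
                info += joint * math.log(K[y][x] / p_x[x])
    return info

n = 4
uniform = [1 / n] * n

# Theta = top_X: every distribution is allowed, so the sender may use point masses.
dirac_protocol = [[1.0 if x == y else 0.0 for x in range(n)] for y in range(n)]
top_info = protocol_information(uniform, dirac_protocol)

# Bayesian Theta = {p}: every message is forced to map to the same distribution p.
p = [0.1, 0.2, 0.3, 0.4]
bayes_info = protocol_information(uniform, [p] * n)
```

Here `top_info` equals ln 4 and `bayes_info` is 0 up to floating-point error.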

Two deterministic toy models for regret bounds of infra-Bayesian bandits. The lesson seems to be that equalities are much easier to learn than inequalities.

Model 1: Let A be the space of arms, O the space of outcomes, r:A×O→R the reward function, X and Y vector spaces, H⊆X the hypothesis space and F:A×O×H→Y a function s.t. for any fixed a∈A and o∈O, F(a,o,·):H→Y extends to some linear operator T_{a,o}:X→Y. The semantics of hypothesis h∈H is defined by the equation F(a,o,h)=0 (i.e. an outcome o of action a is consistent with hypothesis h iff this equation holds).

For any h∈H denote by V(h) the reward promised by h:

V(h) := max_{a∈A} min_{o∈O: F(a,o,h)=0} r(a,o)

Then, there is an algorithm with mistake bound dim X, as follows. On round n∈N, let G_n⊆H be the set of unfalsified hypotheses. Choose h_n∈G_n optimistically, i.e.

h_n := argmax_{h∈G_n} V(h)

Choose the arm a_n recommended by hypothesis h_n. Let o_n∈O be the outcome we observed, r_n:=r(a_n,o_n) the reward we received and h∗∈H the (unknown) true hypothesis.

If r_n≥V(h_n) then also r_n≥V(h∗) (since h∗∈G_n and hence V(h∗)≤V(h_n)) and therefore a_n wasn’t a mistake.

If r_n<V(h_n) then F(a_n,o_n,h_n)≠0 (if we had F(a_n,o_n,h_n)=0 then the minimization in the definition of V(h_n) would include r(a_n,o_n)). Hence, h_n∉G_{n+1}=G_n∩ker T_{a_n,o_n}. This implies dim span(G_{n+1})<dim span(G_n). Obviously this can happen at most dim X times.
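The optimistic elimination loop of Model 1 can be sketched on a tiny instance. Everything concrete here is a hypothetical choice made for illustration: X=R^3, Y=R, three hypotheses, operators T that depend only on the outcome, integer rewards, and an adversary that always returns the worst outcome consistent with the true hypothesis.

```python
# Hypothetical toy instance: X = R^3, Y = R, and T[o] is a row vector independent of the arm.
T = {"o1": (0, 0, 1), "o2": (0, 1, 0), "o3": (1, 0, 0)}
H = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
# Rewards r[a][o], kept as integers so that comparisons are exact.
r = {"a0": {"o1": 10, "o2": 8, "o3": 2},
     "a1": {"o1": 5, "o2": 5, "o3": 5}}

def F(o, h):
    """Linear consistency test: outcome o is consistent with h iff F(o, h) == 0."""
    return sum(t * x for t, x in zip(T[o], h))

def consistent(h):
    return [o for o in T if F(o, h) == 0]

def promised(h):
    """Return (V(h), an arm attaining it)."""
    return max((min(r[a][o] for o in consistent(h)), a) for a in r)

def run(h_true, rounds=5):
    G = list(H)          # unfalsified hypotheses
    mistakes = 0
    for _ in range(rounds):
        h_n = max(G, key=lambda h: promised(h)[0])   # optimistic choice
        v_n, a_n = promised(h_n)
        # Adversary: worst outcome among those consistent with the true hypothesis.
        o_n = min(consistent(h_true), key=lambda o: r[a_n][o])
        if r[a_n][o_n] < v_n:
            mistakes += 1                            # h_n is falsified this round
        G = [h for h in G if F(o_n, h) == 0]         # intersect with the kernel of T
    return mistakes, G

mistakes, survivors = run(h_true=(0, 0, 1))
```

In this run the first optimistic guess is falsified once, after which only hypotheses consistent with all observations remain, so the number of mistakes stays below dim X = 3.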

Model 2: Let the spaces of arms and hypotheses be

A := H := S^d := {x∈R^{d+1} ∣ ∥x∥=1}

Let the reward r∈R be the only observable outcome, and the semantics of hypothesis h∈S^d be r≥h⋅a. Then, the sample complexity cannot be bounded by a polynomial whose degree doesn’t depend on d. This is because Murphy can choose the strategy of producing reward 1−ϵ whenever h⋅a≤1−ϵ. In this case, whatever arm you sample, in each round you can only exclude a ball of radius ≈√(2ϵ) around the sampled arm. The number of such balls that fit into the unit sphere is Ω(ϵ^{−d/2}). So, normalized regret below ϵ cannot be guaranteed in less than that many rounds.

One of the postulates of infra-Bayesianism is the maximin decision rule. Given a crisp infradistribution Θ, it defines the optimal action to be:

a∗(Θ) := argmax_a min_{μ∈Θ} E_μ[U(a)]

Here U is the utility function.

What if we use a different decision rule? Let t∈[0,1] and consider the decision rule

a∗_t(Θ) := argmax_a ( t·min_{μ∈Θ} E_μ[U(a)] + (1−t)·max_{μ∈Θ} E_μ[U(a)] )

For t=1 we get the usual maximin (“pessimism”), for t=0 we get maximax (“optimism”) and for other values of t we get something in the middle (which we can call “t-mism”).

It turns out that, in some sense, this new decision rule is actually reducible to ordinary maximin! Indeed, set

μ∗_t := argmax_{μ∈Θ} E_μ[U(a∗_t)]

Θ_t := tΘ + (1−t)μ∗_t

Then we get

a∗(Θ_t) = a∗_t(Θ)

More precisely, any pessimistically optimal action for Θ_t is t-mistically optimal for Θ (the converse need not be true in general, thanks to the arbitrary choice involved in μ∗_t).
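The reduction can be checked on a small finite example. The sketch below assumes Θ is the convex hull of finitely many distributions, so that the min and max of an expectation are attained at vertices; the three-point space, the utilities and all names are hypothetical choices for illustration.

```python
from fractions import Fraction as Fr

# Hypothetical example: Theta is the convex hull of these vertices over a 3-point space.
vertices = [(Fr(1), Fr(0), Fr(0)), (Fr(0), Fr(1), Fr(0)), (Fr(0), Fr(0), Fr(1))]
U = {"a0": (Fr(1), Fr(0), Fr(1, 2)),
     "a1": (Fr(2, 5), Fr(2, 5), Fr(2, 5)),
     "a2": (Fr(0), Fr(9, 10), Fr(1, 5))}

def expect(mu, a):
    return sum(p * u for p, u in zip(mu, U[a]))

def t_value(verts, a, t):
    # Expectations are linear in mu, so min and max over the hull are attained at vertices.
    vals = [expect(mu, a) for mu in verts]
    return t * min(vals) + (1 - t) * max(vals)

def best_action(verts, t):
    return max(U, key=lambda a: t_value(verts, a, t))

t = Fr(1, 2)
a_t = best_action(vertices, t)                            # t-mistic optimum for Theta
mu_star = max(vertices, key=lambda mu: expect(mu, a_t))   # best-case distribution for a_t
theta_t = [tuple(t * p + (1 - t) * q for p, q in zip(mu, mu_star)) for mu in vertices]
a_pess = best_action(theta_t, Fr(1))                      # ordinary maximin on Theta_t
a_plain = best_action(vertices, Fr(1))                    # ordinary maximin on Theta itself
```

In this example `a_t` and `a_pess` coincide, while plain maximin on the original Θ picks a different action, so the transformation Θ→Θ_t is doing real work.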

To first approximation it means we don’t need to consider t-mistic agents since they are just special cases of “pessimistic” agents. To second approximation, we need to look at what the transformation of Θ to Θ_t does to the prior. If we start with a simplicity prior then the result is still a simplicity prior. If U has low description complexity and t is not too small then essentially we get full equivalence between “pessimism” and t-mism. If t is small then we get a strictly “narrower” prior (for t=0 we are back at ordinary Bayesianism). However, if U has high description complexity then we get a rather biased simplicity prior. Maybe the latter sort of prior is worth considering.

Infra-Bayesianism can be naturally understood as semantics for a certain non-classical logic. This promises an elegant synthesis between deductive/symbolic reasoning and inductive/intuitive reasoning, with several possible applications. Specifically, here we will explain how this can work for higher-order logic. There might be holes and/or redundancies in the precise definitions given here, but I’m quite confident the overall idea is sound.

We will work with homogeneous ultracontributions (HUCs). □X will denote the space of HUCs over X. Given μ∈□X, S(μ)⊆Δ^c X will denote the corresponding convex set. Given p∈ΔX and μ∈□X, p:μ will mean p∈S(μ). Given μ,ν∈□X, μ⪯ν will mean S(μ)⊆S(ν).

Syntax

Let T_ι denote a set which we interpret as the types of individuals (we allow more than one). We then recursively define the full set of types T by:

0∈T (intended meaning: the uninhabited type)

1∈T (intended meaning: the one element type)

If α∈T_ι then α∈T

If α,β∈T then α+β∈T (intended meaning: disjoint union)

If α,β∈T then α×β∈T (intended meaning: Cartesian product)

If α∈T then (α)∈T (intended meaning: predicates with argument of type α)

For each α,β∈T, there is a set F^0_{α→β} which we interpret as atomic terms of type α→β. We will denote V^0_α:=F^0_{1→α}. Among those we distinguish the logical atomic terms:

pr_{αβ}∈F^0_{α×β→α}

i_{αβ}∈F^0_{α→α+β}

Symbols we will not list explicitly, that correspond to the algebraic properties of + and × (commutativity, associativity, distributivity and the neutrality of 0 and 1). For example, given α,β∈T there is a “commutator” of type α×β→β×α.

=_α∈V^0_{(α×α)}

diag_α∈F^0_{α→α×α}

()_α∈V^0_{((α)×α)} (intended meaning: predicate evaluation)

⊥∈V^0_{(1)}

⊤∈V^0_{(1)}

∨_α∈F^0_{(α)×(α)→(α)}

∧_α∈F^0_{(α)×(α)→(α)} [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]

∃_{αβ}∈F^0_{(α×β)→(β)}

∀_{αβ}∈F^0_{(α×β)→(β)} [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]

Assume that for each n∈N there is some D_n⊆□[n]: the set of “describable” ultracontributions [EDIT: it is probably sufficient to only have the fair coin distribution in D_2 in order for it to be possible to approximate all ultracontributions on finite sets]. If μ∈D_n then ┌μ┐∈V_{(∑_{i=1}^n 1)}

We recursively define the set of all terms F_{α→β}. We denote V_α:=F_{1→α}.

If f∈F^0_{α→β} then f∈F_{α→β}

If f_1∈F_{α_1→β_1} and f_2∈F_{α_2→β_2} then f_1×f_2∈F_{α_1×α_2→β_1×β_2}

If f_1∈F_{α_1→β_1} and f_2∈F_{α_2→β_2} then f_1+f_2∈F_{α_1+α_2→β_1+β_2}

If f∈F_{α→β} then f^{−1}∈F_{(β)→(α)}

If f∈F_{α→β} and g∈F_{β→γ} then g∘f∈F_{α→γ}
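The five term-formation rules above translate directly into a small type checker. The encoding below (nested tuples for types and terms, the tag strings, and the function name `typecheck`) is a hypothetical choice made for illustration; it compares types structurally and does not implement the algebraic laws of + and ×.

```python
def typecheck(term):
    """Return (domain, codomain) of a term, or raise TypeError.

    Types: "0", "1", atom names, or tuples ("+", a, b), ("x", a, b), ("pred", a).
    Terms: ("atom", name, dom, cod), ("x", f1, f2), ("+", f1, f2), ("inv", f), ("o", g, f).
    """
    tag = term[0]
    if tag == "atom":
        _, _, dom, cod = term
        return dom, cod
    if tag in ("x", "+"):
        # Product and sum of terms act componentwise on domains and codomains.
        (d1, c1), (d2, c2) = typecheck(term[1]), typecheck(term[2])
        return (tag, d1, d2), (tag, c1, c2)
    if tag == "inv":
        # f : a -> b gives f^{-1} : (b) -> (a), acting on predicates.
        dom, cod = typecheck(term[1])
        return ("pred", cod), ("pred", dom)
    if tag == "o":
        g_dom, g_cod = typecheck(term[1])
        f_dom, f_cod = typecheck(term[2])
        if f_cod != g_dom:
            raise TypeError(f"cannot compose: {f_cod} != {g_dom}")
        return f_dom, g_cod
    raise TypeError(f"unknown term constructor {tag!r}")

# Hypothetical atomic terms: f : s -> t and a predicate-valued constant phi : 1 -> (t).
f = ("atom", "f", "s", "t")
phi = ("atom", "phi", "1", ("pred", "t"))
pullback = ("o", ("inv", f), phi)     # f^{-1} ∘ phi : 1 -> (s), i.e. a formula over s
result = typecheck(pullback)
```

Composing `f` directly with `phi` is rejected, since the codomain of `phi` is the predicate type (t) rather than s.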

Elements of V_{(α)} are called formulae. Elements of V_{(1)} are called sentences. A subset of V_{(1)} is called a theory.

Semantics

Given T⊆V_{(1)}, a model M of T is the following data. To each α∈T, there must correspond some compact Polish space M(α) s.t.:

M(0)=∅

M(1)=pt (the one point space)

M(α+β)=M(α)⊔M(β)

M(α×β)=M(α)×M(β)

M((α))=□M(α)

To each f∈Fα→β, there must correspond a continuous mapping M(f):M(α)→M(β), under the following constraints:

pr, i, diag and the “algebrators” have to correspond to the obvious mappings.

M(=_α)=⊤_{diag M(α)}. Here, diag X⊆X×X is the diagonal and ⊤_C∈□X is the sharp ultradistribution corresponding to the closed set C⊆X.

Consider α∈T and denote X:=M(α). Then, M(()_α)=⊤_{□X}⋉id_{□X}. Here, we use the observation that the identity mapping id_{□X} can be regarded as an infrakernel from □X to X.

M(⊥)=⊥_pt

M(⊤)=⊤_pt

S(M(∨)(μ,ν)) is the convex hull of S(μ)∪S(ν)

S(M(∧)(μ,ν)) is the intersection S(μ)∩S(ν)

Consider α,β∈T and denote X:=M(α), Y:=M(β) and pr:X×Y→Y the projection mapping. Then, M(∃_{αβ})(μ)=pr_∗μ.

Consider α,β∈T and denote X:=M(α), Y:=M(β) and pr:X×Y→Y the projection mapping. Then, p:M(∀_{αβ})(μ) iff for all q∈Δ^c(X×Y), if pr_∗q=p then q:μ.

M(f_1×f_2)=M(f_1)×M(f_2)

M(f_1+f_2)=M(f_1)⊔M(f_2)

M(f^{−1})(μ)=M(f)^∗(μ).

M(g∘f)=M(g)∘M(f)

M(┌μ┐)=μ

Finally, for each ϕ∈T, we require M(ϕ)=⊤_pt.

Semantic Consequence

Given ϕ∈V_{(1)}, we say M⊨ϕ when M(ϕ)=⊤_pt. We say T⊨ϕ when for any model M of T, M⊨ϕ. It is now interesting to ask what is the computational complexity of deciding T⊨ϕ. [EDIT: My current best guess is co-RE]

Applications

As usual, let A be a finite set of actions and O be a finite set of observations. Require that for each o∈O there is σ_o∈T_ι which we interpret as the type of states producing observation o. Denote σ_∗:=∑_{o∈O}σ_o (the type of all states). Moreover, require that our language has the nonlogical symbols s_0∈V^0_{(σ_∗)} (the initial state) and, for each a∈A, K_a∈F^0_{σ_∗→(σ_∗)} (the transition kernel). Then, every model defines a (pseudocausal) infra-POMDP. This way we can use symbolic expressions to define infra-Bayesian RL hypotheses. It is then tempting to study the control theoretic and learning theoretic properties of those hypotheses. Moreover, it is natural to introduce a prior which weights those hypotheses by length, analogous to the Solomonoff prior. This leads to some sort of bounded infra-Bayesian algorithmic information theory and bounded infra-Bayesian analogue of AIXI.

Let’s also explicitly describe 0th order and 1st order infra-Bayesian logic (although they should be segments of the higher-order logic).

0-th order

Syntax

Let A be the set of propositional variables. We define the language L:

Any a∈A is also in L

⊥∈L

⊤∈L

Given ϕ,ψ∈L, ϕ∧ψ∈L

Given ϕ,ψ∈L, ϕ∨ψ∈L

Notice there’s no negation or implication. We define the set of judgements J:=L×L. We write judgements as ϕ⊢ψ (“ψ in the context of ϕ”). A theory is a subset of J.

Semantics

Given T⊆J, a model of T consists of a compact Polish space X and a mapping M:L→□X. The latter is required to satisfy:

M(⊥)=⊥_X

M(⊤)=⊤_X

M(ϕ∧ψ)=M(ϕ)∧M(ψ). Here, we define ∧ of infradistributions as intersection of the corresponding sets

M(ϕ∨ψ)=M(ϕ)∨M(ψ). Here, we define ∨ of infradistributions as convex hull of the corresponding sets

For any ϕ⊢ψ∈T, M(ϕ)⪯M(ψ)

1-st order

Syntax

We define the language using the usual syntax of 1-st order logic, where the allowed operators are ∧, ∨ and the quantifiers ∀ and ∃. Variables are labeled by types from some set T. For simplicity, we assume no constants, but it is easy to introduce them. For any sequence of variables (v_1…v_n), we denote by L_v the set of formulae whose free variables are a subset of v_1…v_n. We define the set of judgements J:=⋃_v L_v×L_v.

Semantics

Given T⊆J, a model of T consists of

For every t∈T, a compact Polish space M(t)

For every ϕ∈L_v where v_1…v_n have types t_1…t_n, an element M_v(ϕ) of X_v:=□(∏_{i=1}^n M(t_i))

It must satisfy the following:

M_v(⊥)=⊥_{X_v}

M_v(⊤)=⊤_{X_v}

M_v(ϕ∧ψ)=M_v(ϕ)∧M_v(ψ)

M_v(ϕ∨ψ)=M_v(ϕ)∨M_v(ψ)

Consider variables u_1…u_n of types t_1…t_n and variables v_1…v_m of types s_1…s_m. Consider also some σ:{1…n}→{1…m} s.t. s_{σ(i)}=t_i. Given ϕ∈L_v, we can form the substitution ψ:=ϕ[x_i=y_{σ(i)}]∈L_u. We also have a mapping f_σ:X_v→X_u given by f_σ(x_1…x_m)=(x_{σ(1)}…x_{σ(n)}). We require M_u(ψ)=(f_σ)_∗(M_v(ϕ))

Consider variables v_1…v_n and i∈{1…n}. Denote pr:X_v→X_{v∖v_i} the projection mapping. We require M_{v∖v_i}(∃v_i:ϕ)=pr_∗(M_v(ϕ))

Consider variables v_1…v_n and i∈{1…n}. Denote pr:X_v→X_{v∖v_i} the projection mapping. We require that p:M_{v∖v_i}(∀v_i:ϕ) if and only if, for all q∈ΔX_v s.t. pr_∗q=p, q:M_v(ϕ)

For any ϕ⊢ψ∈T, M_v(ϕ)⪯M_v(ψ)

There is a special type of crisp infradistributions that I call “affine infradistributions”: those that, represented as sets, are closed not only under convex linear combinations but also under affine linear combinations. In other words, they are intersections between the space of distributions and some closed affine subspace of the space of signed measures. Conjecture: in 0-th order logic of affine infradistributions, consistency is polynomial-time decidable (whereas for classical logic it is ofc NP-hard).

To produce some evidence for the conjecture, let’s consider a slightly different problem. Specifically, introduce a new semantics in which □X is replaced by the set of linear subspaces of some finite dimensional vector space V. A model M is required to satisfy:

M(⊥)=0

M(⊤)=V

M(ϕ∧ψ)=M(ϕ)∩M(ψ)

M(ϕ∨ψ)=M(ϕ)+M(ψ)

For any ϕ⊢ψ∈T, M(ϕ)⊆M(ψ)

If you wish, this is “non-unitary quantum logic”. In this setting, I have a candidate polynomial-time algorithm for deciding consistency. First, we transform T into an equivalent theory s.t. all judgments are of the following forms:

a=⊥

a=⊤

a⊢b

Pairs of the form c=a∧b, d=a∨b.

Here, a,b,c,d∈A are propositional variables and “ϕ=ψ” is a shorthand for the pair of judgments ϕ⊢ψ and ψ⊢ϕ.

Second, we make sure that our T also satisfies the following “closure” properties:

If a⊢b and b⊢c are in T then so is a⊢c

If c=a∧b is in T then so are c⊢a and c⊢b

If c=a∨b is in T then so are a⊢c and b⊢c

If c=a∧b, d⊢a and d⊢b are in T then so is d⊢c

If c=a∨b, a⊢d and b⊢d are in T then so is c⊢d
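The five closure rules can be applied by naive saturation. This is a minimal sketch under an assumed encoding: entailments as pairs, and the meet and join definitions as separate sets of triples (the text groups them into pairs c=a∧b, d=a∨b; keeping them separate here is only for convenience). Since at most |A|² entailments exist, the loop stops after polynomially many passes.

```python
def close(entails, meets, joins):
    """Saturate a set of judgments under the five closure rules.

    entails: set of pairs (a, b) meaning a ⊢ b
    meets:   set of triples (c, a, b) meaning c = a ∧ b
    joins:   set of triples (c, a, b) meaning c = a ∨ b
    """
    closed = set(entails)
    changed = True
    while changed:
        new = set()
        # Rule 1: transitivity.
        for (a, b) in closed:
            for (b2, c) in closed:
                if b == b2:
                    new.add((a, c))
        for (c, a, b) in meets:
            # Rule 2: c ⊢ a and c ⊢ b.
            new.update({(c, a), (c, b)})
            # Rule 4: d ⊢ a and d ⊢ b give d ⊢ c.
            for (d, a2) in closed:
                if a2 == a and (d, b) in closed:
                    new.add((d, c))
        for (c, a, b) in joins:
            # Rule 3: a ⊢ c and b ⊢ c.
            new.update({(a, c), (b, c)})
            # Rule 5: a ⊢ d and b ⊢ d give c ⊢ d.
            for (a2, d) in closed:
                if a2 == a and (b, d) in closed:
                    new.add((c, d))
        changed = not new <= closed
        closed |= new
    return closed

# Hypothetical theory: d ⊢ a, d ⊢ b, c = a ∧ b, e = a ∨ b.
theory = close({("d", "a"), ("d", "b")},
               meets={("c", "a", "b")},
               joins={("e", "a", "b")})
```

The closure derives d⊢c (rule 4) and c⊢e (rules 2, 3 and transitivity), but not e⊢c.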

Third, we assign to each a∈A a real-valued variable x_a. Then we construct a linear program for these variables consisting of the following constraints:

For any a∈A: 0≤x_a≤1

For any a⊢b in T: x_a≤x_b

For any pair c=a∧b and d=a∨b in T: x_c+x_d=x_a+x_b

For any a=⊥: x_a=0

For any a=⊤: x_a=1

Conjecture: the theory is consistent if and only if the linear program has a solution. To see why it might be so, notice that for any model M we can construct a solution by setting

x_a := dim M(a) / dim M(⊤)
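The easy direction (a model yields a solution of the linear program) can be illustrated with coordinate subspaces, i.e. spans of subsets of a fixed basis, where intersection and sum of subspaces correspond to intersection and union of index sets. The model, the variable names and the helper functions below are hypothetical and only check one instance.

```python
from fractions import Fraction

def dimension_solution(model, dim_v):
    """x_a := dim M(a) / dim M(⊤) for coordinate subspaces given as sets of basis indices."""
    return {a: Fraction(len(s), dim_v) for a, s in model.items()}

def satisfies_lp(x, entails, pairs, bottoms, tops):
    ok = all(0 <= v <= 1 for v in x.values())
    ok &= all(x[a] <= x[b] for a, b in entails)
    # pairs holds tuples (c, d, a, b) with c = a ∧ b and d = a ∨ b.
    ok &= all(x[c] + x[d] == x[a] + x[b] for c, d, a, b in pairs)
    ok &= all(x[a] == 0 for a in bottoms)
    ok &= all(x[a] == 1 for a in tops)
    return ok

# Hypothetical model in V = R^4: each variable is the span of some basis vectors.
a, b = frozenset({0, 1}), frozenset({1, 2})
model = {"a": a, "b": b, "c": a & b, "d": a | b,
         "bot": frozenset(), "top": frozenset(range(4))}
x = dimension_solution(model, dim_v=4)
lp_ok = satisfies_lp(x,
                     entails=[("c", "a"), ("c", "b"), ("a", "d"), ("b", "d")],
                     pairs=[("c", "d", "a", "b")],
                     bottoms=["bot"], tops=["top"])
```

The equality constraint holds here because dim(U∩W)+dim(U+W)=dim U+dim W for any two subspaces.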

I don’t have a full proof for the converse but here are some arguments. If a solution exists, then it can be chosen to be rational. We can then rescale it to get integers which are candidate dimensions of our subspaces. Consider the space of all ways to choose subspaces of these dimensions s.t. the constraints coming from judgments of the form a⊢b are satisfied. This is a moduli space of poset representations. It is easy to see it’s non-empty (just let the subspaces be spans of vectors taken from a fixed basis). By Proposition A.2 in Futorny and Iusenko it is an irreducible algebraic variety. Therefore, to show that we can also satisfy the remaining constraints, it is enough to check that (i) the remaining constraints are open (ii) each of the remaining constraints (considered separately) holds at some point of the variety. The first is highly likely and the second is at least plausible.

The algorithm also seems to have a natural extension to the original infra-Bayesian setting.

When using infra-Bayesian logic to define a simplicity prior, it is natural to use “axiom circuits” rather than plain formulae. That is, when we write the axioms defining our hypothesis, we are allowed to introduce “shorthand” symbols for repeating terms. This doesn’t affect the expressiveness, but it does affect the description length. Indeed, eliminating all the shorthand symbols can increase the length exponentially.

Instead of introducing all the “algebrator” logical symbols, we can define T as the quotient by the equivalence relation defined by the algebraic laws. We then need only two extra logical atomic terms:

For any n∈N and σ∈S_n (permutation), denote n:=∑_{i=1}^n 1 and require σ^+∈F_{n→n}

For any n∈N and σ∈S_n, σ^×_α∈F_{α^n→α^n}

However, if we do this then it’s not clear whether deciding that an expression is a well-formed term can be done in polynomial time. Because, to check that the types match, we need to test the identity of algebraic expressions and opening all parentheses might result in something exponentially long.

Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just imagine that types are variables over Q, and start from testing the identity of the types appearing inside parentheses), so we can validate expressions in randomized polynomial time (and, given standard conjectures, in deterministic polynomial time as well).

In the anthropic trilemma, Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution.

Consider a population of robots, each of which is a regular RL agent. The environment produces the observations of the robots, but can also make copies or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the “physical” baseline. Indeed, suppose that a particular observation c has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have c much more often in their history than the physical frequency with which c is encountered, due to the resulting “selection bias”. We call this setting “anthropic RL” (ARL).

The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the “physical” perspective. For example, we can consider an “anthropic MDP” (AMDP). An AMDP has finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗). The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are mutually inconsistent.

[EDIT: Actually, given that we don’t care about the order of robots, the signature of the transition kernel should be T:A×S→Δ(N^S)]

The consistency that is violated is exactly the causality property of environments. Luckily, we know how to deal with acausality: using the IB causal-acausal correspondence! The result can be described as follows: Murphy chooses a time moment n∈N and guesses the robot policy π until time n. Then, a simulation of the dynamics of (π,T) is performed until time n, and a single history is sampled from the resulting population. Finally, the observations of the chosen history unfold in reality. If the agent chooses an action different from what is prescribed, Nirvana results. Nirvana also happens after time n (we assume Nirvana reward 1 rather than ∞).

This IB hypothesis is consistent with what the average robot sees at any given moment of time. Therefore, the average robot will learn this hypothesis (assuming learnability). This means that for n≫1/(1−γ)≫0, the population of robots at time n has expected average utility with a lower bound close to the optimum for this hypothesis. I think that for an AMDP this should equal the optimum expected average utility you can possibly get, but it would be interesting to verify.

Curiously, the same conclusions should hold if we do a weighted average over the population, with any fixed method of weighting. Therefore, the posterior of the average robot behaves adaptively depending on which sense of “average” you use. So, your epistemology doesn’t have to fix a particular method of counting minds. Instead different counting methods are just different “frames of reference” through which to look, and you can be simultaneously rational in all of them.

Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a “memory MDP” that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.

I’m not sure what you mean by that “unrolling”. Can you write a mathematical definition?

Let’s consider a simple example. There are two states: s0 and s1. There is just one action so we can ignore it. s0 is the initial state. An s0 robot transitions into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What will our population look like?

0th step: all robots remember s0

1st step: all robots remember s0s1

2nd step: 1/2 of robots remember s0s1s0 and 1/2 of robots remember s0s1s1

3rd step: 1/3 of robots remember s0s1s0s1, 1/3 of robots remember s0s1s1s0 and 1/3 of robots remember s0s1s1s1

There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have Pr[s0s1s0]=1/2 and Pr[s0s1s1]=1/2. But, to be consistent with step 3 we must have Pr[s0s1s0]=1/3, Pr[s0s1s1]=2/3.

In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.

Or, at least it’s not obvious there is such a hypothesis. In this example, Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don’t know, maybe for finite state spaces it can work. Would definitely be interesting to check.

[EDIT: actually, in this example there is such a hypothesis but in general there isn’t, see below]
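The population in the two-state example above can be enumerated directly, which makes the shifting probabilities easy to inspect. This is a small sketch; encoding states as the characters "0" and "1" and the function names are illustrative choices.

```python
from collections import Counter
from fractions import Fraction

# An s0 robot becomes one s1 robot; an s1 robot becomes one s0 robot and one s1 robot.
SUCCESSORS = {"0": "1", "1": "01"}

def population(steps):
    """Histories (strings of state indices) remembered by the robots alive after `steps` steps."""
    pop = ["0"]
    for _ in range(steps):
        pop = [h + s for h in pop for s in SUCCESSORS[h[-1]]]
    return pop

def prefix_fractions(steps, prefixes=("010", "011")):
    """Fraction of robots at the given step whose history starts with each prefix."""
    pop = population(steps)
    counts = Counter(h[:3] for h in pop)
    return {p: Fraction(counts[p], len(pop)) for p in prefixes}

step2 = prefix_fractions(2)   # 1/2 and 1/2
step3 = prefix_fractions(3)   # 1/3 and 2/3
ratios = [float(prefix_fractions(n)["011"] / prefix_fractions(n)["010"])
          for n in range(2, 15)]
```

The entries of `ratios` are quotients of consecutive Fibonacci numbers, so they approach the golden ratio as the step count grows.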

Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states s0, s1 with the memory states [s0], [s1], [s0,s0], [s0,s1], etc. The action takes a robot in [s0] to memory state [s0,s1], and a robot in [s0,s1] to one robot in [s0,s1,s0] and another in [s0,s1,s1].

(Skip this paragraph unless the specifics of what’s going on aren’t obvious: given a transition distribution P(s′∗|s,π) (P being the distribution over sets of states s’* given starting state s and policy π), we can define the memory transition distribution P(s′∗m|sm,π) given policy π and starting “memory state” sm∈S∗ (Note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of sm into the transition distribution as the current state. Then for each s′∗ in the domain, for each element in s′∗ we concatenate that element onto the end of sm and collect these s′m into a set s′∗m, which is assigned the same probability P(s′∗).)

So now at time t=2, if you sample a robot, the probability that its state begins with [s0,s1,s1] is 0.5. And at time t=3, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it’s just that we’ve turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.

I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

I’m not quite sure what you are trying to say here, probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It’s like in classical RL theory, when you’re proving a regret bound or whatever, your probability space consists of histories.

Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I’m talking about hypotheses which conform to the classical “cybernetic agent model”. If you wish, we can call it “Bayesian cybernetic hypothesis”.

Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.

Indeed, consider some T:S→Δ(N^S). We can take its expected value to get ET:S→R^S_+. Assuming the chain is communicating, ET is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector η∈R^S_+. We then get the subjective transition kernel:

ST(t∣s) = ET(t∣s)·η_t / ∑_{t′∈S} ET(t′∣s)·η_{t′}
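For the two-state chain from the earlier comment, the subjective kernel can be computed numerically. The sketch below makes two assumptions that are not spelled out in the text: η is taken to satisfy ∑_t ET(t∣s)·η_t = λ·η_s (so η_t weights a state by its long-run number of descendants), and it is found by plain power iteration, which converges here because the matrix is primitive. The function names are illustrative.

```python
def perron_vector(M, iterations=200):
    """Power iteration for the dominant eta with sum_t M[s][t] * eta[t] = lambda * eta[s]."""
    n = len(M)
    eta = [1.0] * n
    for _ in range(iterations):
        nxt = [sum(M[s][t] * eta[t] for t in range(n)) for s in range(n)]
        total = sum(nxt)
        eta = [v / total for v in nxt]   # renormalize to avoid overflow
    return eta

def subjective_kernel(M):
    """ST[s][t] = M[s][t] * eta[t] / sum_t' M[s][t'] * eta[t']."""
    eta = perron_vector(M)
    n = len(M)
    ST = []
    for s in range(n):
        weights = [M[s][t] * eta[t] for t in range(n)]
        z = sum(weights)
        ST.append([w / z for w in weights])
    return ST

# ET[s][t] for the earlier example: s0 -> one s1 robot; s1 -> one s0 robot and one s1 robot.
ET = [[0.0, 1.0],
      [1.0, 1.0]]
ST = subjective_kernel(ET)
golden = (1 + 5 ** 0.5) / 2
```

From s1, the subjective odds of moving to s1 rather than s0 come out as the golden ratio, matching the limit of the population frequencies in that example.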

Now, consider the following example of an AMDP. There are three actions A:={a,b,c} and two states S:={s0,s1}. When we apply a to an s0 robot, it creates two s0 robots, whereas when we apply a to an s1 robot, it leaves one s1 robot. When we apply b to an s1 robot, it creates two s1 robots, whereas when we apply b to an s0 robot, it leaves one s0 robot. When we apply c to any robot, it results in one robot whose state is s0 with probability 1/2 and s1 with probability 1/2.

Consider the following two policies.π