The problem is not in one of the conditions separately but in their conjunction: see my follow-up comment. You could argue that learning an exact model of Carol doesn’t really imply condition 2 since, although the model *does* imply everything Carol is ever going to say, Alice is not capable of extracting this information from the model. But then it becomes a philosophical question of what it means to “believe” something. I think there is value in the “behaviorist” interpretation that “believing X” means “behaving optimally given X”. In this sense, Alice can separately believe the two facts described by conditions 1 and 2, but cannot believe their conjunction.

# Vanessa Kosoy

IMO there are two reasons why finite-state MDPs are useful.

First, proving regret bounds for finite-state MDPs is just easier than for infinite-state MDPs (of course any environment can be thought of as an infinite-state MDP), so it serves as a good warm-up even if you want to go beyond it. Certainly many problems can be captured already within this simple setting. Moreover, some algorithms and proof techniques for finite-state MDPs can be generalized to e.g. continuous MDPs (which is already a far more general setting).

Second, we may be able to combine finite-state MDP techniques with an algorithm that *learns the relevant features*, where “features” in this case corresponds to a mapping from histories to states. Now, of course there needn’t be any projection into a finite state space that preserves the exact dynamics of the environment. However, if your algorithm can work with *approximate* models (as it must anyway), for example using my quasi-Bayesian approach, then such MDP models can be powerful.
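As a minimal sketch of this combination (my own illustration, with hypothetical names; not part of the original comment): tabular RL applied on top of a featurization that maps histories to finitely many states. If the featurization only approximately preserves the dynamics, the finite MDP is an approximate model in the sense above.

```python
import random

def make_featurized_agent(featurize, n_states, n_actions,
                          alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning on top of `featurize`: history -> state in range(n_states).
    The finite-state MDP over these states is only an (approximate) model
    of the true history-dependent environment."""
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def act(history):
        s = featurize(history)
        if random.random() < eps:          # epsilon-greedy exploration
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    def update(history, a, r, next_history):
        # Standard Q-learning update on the featurized states.
        s, s2 = featurize(history), featurize(next_history)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

    return act, update
```

Any finite-state MDP regret analysis then applies to the featurized process, up to the approximation error of the featurization.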

I think there is some confusion here coming from the unclear notion of a Bayesian agent with beliefs about theorems of PA. The reformulation I gave with Alice, Bob and Carol makes the problem clearer, I think.

Well, being surprised by Omega seems rational. If I found myself in a real life Newcomb problem I would also be very surprised and suspect a trick for a while.

Moreover, we need to unpack “learns that causality exists”. A quasi-Bayesian agent will eventually learn that it is part of a universe ruled by the laws of physics. The laws of physics are the ultimate “Omega”: they predict the agent and everything else. Given this understanding, it is not more difficult than it should be to understand Newcomb!Omega as a special case of Physics!Omega. (I don’t really have an understanding of quasi-Bayesian learning algorithms and how learning one hypothesis affects the learning of further hypotheses, but it seems plausible that things can work this way.)

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best *approximation* of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009; however, it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a “quasi-realizable” setting: there might be no perfect description of the environment in the hypothesis class, but there are some *incomplete* descriptions. But, so far I haven’t studied quasi-Bayesian learning algorithms much, so how do we know it is actually easier than the agnostic setting? Here is a simple example to demonstrate that it is.

Consider a multi-armed bandit, where the arm space is $[0,1]$. First, consider the following realizable setting: the reward is a deterministic function which is known to be a polynomial of degree at most $n$. In this setting, learning is fairly easy: it is enough to sample $n+1$ arms in order to recover the reward function and find the optimal arm. It is a special case of the general observation that learning is tractable when the hypothesis space is low-dimensional in the appropriate sense.
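To make the realizable case concrete, here is a minimal sketch (my own illustration, not part of the original argument): with a deterministic degree-$\le n$ polynomial reward, $n+1$ samples determine the polynomial exactly, after which the optimal arm can simply be read off.

```python
import numpy as np

def realizable_bandit(f, n, grid=np.linspace(0.0, 1.0, 1001)):
    """Recover a deterministic degree-<=n polynomial reward from n+1 samples,
    then return the best arm on a discretization of [0, 1]."""
    xs = np.linspace(0.0, 1.0, n + 1)           # any n+1 distinct arms suffice
    coeffs = np.polyfit(xs, [f(x) for x in xs], n)   # exact interpolation
    return grid[int(np.argmax(np.polyval(coeffs, grid)))]
```

The sample complexity is $n+1$ regardless of how finely we want to locate the optimum, illustrating the low-dimensionality point.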

Now, consider a closely related agnostic setting. We can still assume the reward function is deterministic, but nothing is known about its shape and we are still expected to find the optimal arm. The arms form a low-dimensional space (one-dimensional actually) but this helps little. It is impossible to predict anything about any arm except those we already tested, and guaranteeing convergence to the optimal arm is therefore also impossible.

Finally, consider the following quasi-realizable setting: each incomplete hypothesis in our class states that the reward function is *lower-bounded* by a particular polynomial of degree at most $n$. Our algorithm needs to converge to a reward which is at least the maximum of maxima of correct lower bounds. So, the desideratum is weaker than in the agnostic case, but we still impose no hard constraint on the reward function. In this setting, we can use the following algorithm. On each step, fit the most optimistic lower bound to those arms that were already sampled, find its maximum and sample this arm next. I haven’t derived the convergence rate, but it seems probable the algorithm will converge rapidly (for low $n$). This is likely to be a special case of some general result on quasi-Bayesian learning with low-dimensional priors.
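Here is a sketch of that algorithm (my own illustration; the coefficient bound, grid, and degree are assumptions I added to make it concrete). For each candidate arm it computes, via a small linear program, the most optimistic value consistent with the polynomial lower-bound hypotheses fitted to the arms sampled so far, and then samples the most optimistic arm.

```python
import numpy as np
from scipy.optimize import linprog

def optimistic_lb(x, xs, rs, deg, cmax):
    """Max of p(x) over polynomials p with deg(p) <= deg and |coeffs| <= cmax,
    subject to p(x_i) <= r_i on all sampled arms (a valid lower bound)."""
    obj = -np.array([x ** j for j in range(deg + 1)])   # linprog minimizes
    A = np.array([[xi ** j for j in range(deg + 1)] for xi in xs]) if xs else None
    b = np.array(rs) if rs else None
    res = linprog(obj, A_ub=A, b_ub=b, bounds=[(-cmax, cmax)] * (deg + 1))
    return -res.fun

def quasi_realizable_bandit(f, steps=50, deg=2, cmax=3.0):
    """Optimism-driven sampling: repeatedly play the arm with the highest
    optimistic lower-bound value; return the best reward found."""
    grid = np.linspace(0.0, 1.0, 41)
    xs, rs = [], []
    for _ in range(steps):
        vals = [optimistic_lb(x, xs, rs, deg, cmax) for x in grid]
        x = float(grid[int(np.argmax(vals))])
        xs.append(x)
        rs.append(f(x))
    return max(rs)
```

Note that nothing here assumes the true reward *is* a polynomial: any true lower bound in the class remains feasible forever, so the optimistic value at the true optimum never drops below it.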

Here’s another perspective. Suppose that now Bob and Carol have symmetrical roles: each one asks a question, allows Alice to answer, and then reveals the right answer. Alice gets a reward when ey answer correctly. We can now see that perfect honesty actually *is* tractable. It corresponds to an incomplete hypothesis. If Alice learns this hypothesis, ey answer correctly any question ey already heard before (no matter who asks now and who asked before). We can also consider a different incomplete hypothesis that allows real-time simulation of Carol. If Alice learns this hypothesis, ey answer correctly any question asked by Carol. However, the *conjunction* of both hypotheses is already intractable. There’s no impediment for Alice to learn both hypotheses: ey can both memorize previous answers *and* answer all questions by Carol. But, this doesn’t automatically imply learning the conjunction.

From my perspective, the trouble here comes from the honesty condition. This condition hides an unbounded quantifier: “if the speaker will *ever* say something, then it is true”. So it’s no surprise we run into computational complexity and even computability issues.

Consider the following setting. The agent Alice repeatedly interacts with two other entities: Bob and Carol. When Alice interacts with Bob, Bob asks Alice a yes/no question, Alice answers it and receives either +1 or −1 reward depending on whether the answer is correct. When Alice interacts with Carol, Carol tells Alice some question and the answer to that question.

Suppose that Alice starts with some low-information prior and learns over time about Bob and Carol both. The honesty condition becomes “if Carol will *ever* say $(q,a)$ and Bob asks the question $q$, then the correct answer is $a$”. But, this condition might be computationally intractable so it is not in the prior and cannot be learned. However, weaker versions of this condition might be tractable, for example “if Carol says $(q,a)$ at a time step between $t-1000$ and $t$, and Bob asks $q$ at time $t$, then the correct answer is $a$”. Since simulating Bob is still intractable, this condition cannot be expressed as a vanilla Bayesian hypothesis. However, it *can* be expressed as an incomplete hypothesis. We can also have an incomplete hypothesis that is the conjunction of this weak honesty condition with a full simulation of Carol. Once Alice has learned this incomplete hypothesis, ey answer correctly *at least* those questions which Carol has already taught em or *will* teach em within 1000 time steps.
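As an illustration (my own sketch, with hypothetical names), the weak honesty condition corresponds to a predictor that commits to answers only for questions Carol stated within the last 1000 steps, and makes no prediction otherwise — which is exactly what makes it an *incomplete* hypothesis:

```python
class WeakHonestyHypothesis:
    """Incomplete hypothesis: if Carol said (q, a) within the last `window`
    steps and Bob asks q, the correct answer is a; otherwise, no prediction."""

    def __init__(self, window=1000):
        self.window = window
        self.memory = {}   # question -> (answer, time Carol stated it)
        self.t = 0

    def carol_says(self, q, a):
        self.memory[q] = (a, self.t)
        self.t += 1

    def bob_asks(self, q):
        self.t += 1
        if q in self.memory:
            a, told = self.memory[q]
            if self.t - told <= self.window:
                return a
        return None   # the hypothesis is silent: any answer is consistent
```

The bounded window is what keeps the hypothesis tractable: checking it never requires simulating Carol’s entire future.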

I think that your reasoning here is essentially the same thing I was talking about before:

...the usual philosophical way of thinking about decision theory assumes that the model of the environment is given, whereas in our way of thinking, the model is learned. This is important: for example, if AIXI is placed in a repeated Newcomb’s problem, it will learn to one-box, since its model will predict that one-boxing causes the money to appear inside the box. In other words, AIXI might be regarded as a CDT, but the learned “causal” relationships are not the same as physical causality

Since then I evolved this idea into something that wins in counterfactual mugging as well, using quasi-Bayesianism.

I have repeatedly argued for a departure from pure Bayesianism that I call “quasi-Bayesianism”. But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here’s another way to understand it, using Bayesianism’s own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can *predict* Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.

A possible counterargument is, we don’t need to depart far from Bayesianism to win here. We only need to somehow justify randomization, perhaps by something like infinitesimal random perturbations of the belief state (like with reflective oracles). But, in a way, this is exactly what quasi-Bayesianism does: a quasi-Bayes-optimal policy is in particular Bayes-optimal when the prior is taken to be a Nash equilibrium of the associated zero-sum game. However, Bayes-optimality underspecifies the policy: not every optimal reply to a Nash equilibrium is a Nash equilibrium.
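A toy way to see why determinism is exploitable and randomization escapes (my own illustration, not the quasi-Bayesian formalism): against a perfect predictor in a matching-pennies-style bet, any deterministic policy loses every round, while a uniformly random policy wins half the time.

```python
import random

def play_against_predictor(policy, rounds=1000):
    """Omega predicts the agent's move and bets against it; the agent wins a
    round iff its move differs from Omega's prediction. For a deterministic
    policy the prediction is exact; a randomized policy is modeled by Omega
    getting only an independent sample of the same distribution."""
    wins = 0
    for t in range(rounds):
        prediction = policy(t)   # exact for deterministic policies
        move = policy(t)         # an unpredictable fresh draw if randomized
        wins += move != prediction
    return wins / rounds
```

The deterministic agent’s win rate is exactly 0, while the coin-flipping agent’s is about 0.5 — the Dutch book only works because the Bayes-optimal policy can be assumed deterministic.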

This argument is not entirely novel: it is just a special case of an environment that the agent cannot simulate, which is the original motivation for quasi-Bayesianism. In some sense, any Bayesian agent is dogmatic: it dogmatically believes that the environment is computationally simple, since it cannot consider a hypothesis which is not. Here, Omega exploits this false dogmatic belief.

I remember reading some speculation that zinc supplements and (separate speculation) garlic supplements might have some beneficial effect against COVID-19, but I can’t find the source. Does anyone know the status of that?

Probably stupid question, but why electrolyte drinks rather than just water?

I don’t want this. There’s a field of alignment outside of the community that uses the Alignment Forum, with very different ideas about how progress is made; it seems bad to have an evaluation of work they produce according to metrics that they don’t endorse.

This seems like a very strange claim to me. If the proponents of the MIRI-rationalist view think that (say) a paper by DeepMind has valuable insights *from the perspective of the MIRI-rationalist paradigm*, and should be featured in “best [according to MIRI-rationalists] of AI alignment work in 2018”, how is it bad? On the contrary, it is very valuable that the MIRI-rationalist community is able to draw each other’s attention to this important paper.

So, such a rating seems to have not much upside, and does have downside, in that non-experts who look at these ratings and believe them will get wrong beliefs about which work is useful.

*Anything* anyone says publicly can be read by a non-expert, and if something wrong was said, and the non-expert believes it, then the non-expert gets wrong beliefs. This is a general problem with non-experts, and I don’t see how it is worse here. Of course, if the MIRI-rationalist viewpoint is *true* then the resulting beliefs will not be wrong at all. But this just brings us back to the object-level question.

(I already see people interested in working on CHAI-style stuff who say things that the MIRI-rationalist viewpoint says, where my internal response is something like “I wish you hadn’t internalized these ideas before coming here”.)

So, not only is the MIRI-rationalist viewpoint wrong, it is *so* wrong that it irreversibly poisons the mind of anyone exposed to it? Isn’t it a good idea to let people evaluate ideas on their own merits? If someone endorses a wrong idea, shouldn’t you be able to convince em by presenting counterarguments? If you cannot present counterarguments, how are you so sure the idea is actually wrong? If the person in question cannot understand the counterargument, doesn’t it make em much less valuable for your style of work anyway? Finally, if you actually believe this, doesn’t it undermine the entire principle of AI debate? ;)

If I naively imagine using something close to the 2019 review for alignment (even within a single paradigm), I expect my concerns about “sort by prestige” to be much worse, because there are greater political consequences that one could screw up (and, lack of common knowledge about how large those consequences are and how bad they might be might make everyone too anxious to get buy-in).

I don’t think so.

Your main example for the prestige problem with the LW review was “affordance widths”. I admit that I was one of the people who assigned a lot of negative points to “affordance widths”, and also that I did it not purely on abstract epistemic grounds (in those terms the essay is merely mediocre) but because of the added context about the author. When I voted, the question I was answering was “should this be included in Best of 2018”, including all considerations. If I wasn’t supposed to do this then I’m sorry; I hadn’t noticed it before.

The main reason I think it would be terrible to include “affordance widths” is not exactly prestige. The argument I used before is prestige-based, but that’s because I expected this part to be more broadly accepted, and wished to avoid the more charged debate I anticipated if I ventured closer to the core. The main reason is, I think it would send a really bad message to women and other vulnerable populations who are interested in LessWrong: not because of the identity of the author, but because the essay was obviously designed to justify the author’s behavior. Some of the reputational ramifications of that would be well-earned (although I also expect the response to be disproportionate).

On the other hand, it is hard for me to imagine anything of the sort applying to the Alignment Forum. It would be much more tricky to somehow justify sexual abuse through discussion about AI risk, and if someone accomplished it then surely the AI-alignment-qua-AI-alignment value of that work would be very low. The sort of political considerations that do apply here are not considerations that would affect my vote, and I suspect (although ofc I cannot be sure) the same is true about most other voters.

Also, next time I will adjust my behavior in the LW vote as well, since clearly it is against the intent of the organizers. However, I suggest creating some process in parallel to the main vote, where context-dependent considerations *can* be brought up, either for public discussion or for the attention of the moderator team specifically.

I wonder whether Korzybski was indeed a “memetic ancestor” of LessWrong or more like a slightly crazy elder sibling? In other words, were Yudkowsky or other prominent rationalists significantly influenced by Korzybski, or did they just come up with similar-ish ideas independently?

I decided that the answer deserves its own post.

# The Reasonable Effectiveness of Mathematics or: AI vs sandwiches

As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is “doesn’t manipulate the user” or something like that.) I’m not sure what 9, 11 and 13 are about. For the others, I’d say they’re all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.

Hmm. I appreciate the effort, but I don’t understand this answer. Maybe discussing this point further is not productive in this format.

I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).

This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven’t yet seen a bridge fall down from resonance, and so you don’t think about it.

Yes, and in that perspective, the mathematical model can *tell* me about resonance. It’s actually incredibly easy: resonance appears already in simple harmonic oscillators. Moreover, even if I did not explicitly understand resonance, if I proved that the bridge is stable under certain assumptions about the magnitudes and spectra of external forces, it automatically guarantees that resonance will not crash the bridge (as long as the assumptions are realistic). Obviously people have not been so cautious over history, but that doesn’t mean we should be careless about AGI as well.

I understand the argument that sometimes creating and analyzing a realistic mathematical model is difficult. I agree that under time pressure it might be better to compromise on a combination of *unrealistic* mathematical models, empirical data and informal reasoning. But I don’t understand why we should give up so soon. We can work towards realistic mathematical models *and* prepare fallbacks, and even if we don’t arrive at a realistic mathematical model, it is likely that the effort will produce valuable insights.

Maybe I’m falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.

Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a “mathematical theory of how to write code”.

First, if I am asked to check whether an element is in an array, or some other easy manipulation of data structures, I obviously don’t *literally* start proving a theorem with pencil and paper. However, my not-fully-formal reasoning is such that I *could* prove a theorem if I wanted to. My model is not exactly “intuitive”: I could explicitly explain every step. And this is exactly how all of mathematics works! Mathematicians don’t write proofs that are machine verifiable (some people do that today, but it’s a novel and tiny fraction of mathematics). They write proofs that are good enough so that all the informal steps can be easily made formal by anyone with a reasonable background in the field (but actually doing that would be very labor intensive).

Second, what I actually meant is examples like: I am using an algorithm to solve a system of linear equations, or find the maximal matching in a graph, or find a rotation matrix that minimizes the sum of square distances between two sets, because I have a *proof* that this algorithm works (or, in some cases, a proof that it at least produces the right answer when it converges). Moreover, this applies to problems that explicitly involve the physical world as well, such as Kalman filters or control loops. Of course, in the latter case we need to make some assumptions about the physical world in order to prove anything. It’s true that in applications the assumptions are often false, and we merely hope that they are good enough approximations. But, when the extra effort is justified, we can do better: we can perform a mathematical analysis of *how much* the violation of these assumptions affects the result. Then, we can use outside knowledge to verify that the violations are within the permissible margin.

Third, we could also literally prove machine-verifiable theorems about the code. This is called formal verification, and people do that sometimes when the stakes are high (as they definitely are with AGI), although in this case I have no personal experience. But this is just a “side benefit” of what I was talking about. We need the mathematical theory to know that our algorithms are safe. Formal verification “merely” tells us that the implementation doesn’t have bugs (which is something we should definitely worry about too, when it becomes relevant).
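A small example of the “prove it works, then check the assumptions” stance (my own sketch, not from the original comment): rather than trusting a linear solver blindly, we can certify the computed answer after the fact by checking a residual bound, so that the guarantee we rely on is the verified bound rather than faith in the solver’s internals.

```python
import numpy as np

def solve_with_certificate(A, b, tol=1e-8):
    """Solve Ax = b, then independently verify the residual ||Ax - b||.
    If the check fails, we refuse the answer instead of silently using it."""
    x = np.linalg.solve(A, b)
    residual = float(np.linalg.norm(A @ x - b))
    if residual > tol * max(1.0, float(np.linalg.norm(b))):
        raise ValueError("certificate failed: residual too large")
    return x
```

The same pattern scales up: a proof gives conditions under which the answer is right, and a cheap runtime check confirms those conditions held in this instance.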

I’m curious what you think doesn’t require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that’s how I interpret the rocket alignment and security mindset posts.)

I’m not sure about the scope of your question. I made a sandwich this morning without building mathematical theory :) I think that the AI safety community definitely produced some important arguments about AI risk, and these arguments are valid evidence. But, I consider most of the big questions to be far from settled, and I don’t see how they *could* be settled only with this kind of reasoning.

First, if we take PSRL as our model algorithm, then at any given time we follow a policy optimal for some hypothesis sampled out of the belief state. Since our prior favors simple hypotheses, the hypothesis we sampled is likely to be simple. But, given a hypothesis $H$ of description complexity $n$, the corresponding optimal policy has description complexity $n + O(1)$, since the operation “find the optimal policy” has description complexity $O(1)$.

Taking computational resource bounds into account makes things more complicated. For some $H$, computing the optimal policy might be intractable, even though $H$ itself is “efficiently computable” in some sense. For example, we can imagine an $H$ that has exponentially many states plus some succinct description of the transition kernel.

One way to deal with it is using some heuristic for optimization. But then the description complexity is still $n + O(1)$.

Another way to deal with it is restricting ourselves to the kind of hypotheses for which finding the optimal policy *is* tractable, but allowing incomplete/fuzzy hypotheses, so that we can still deal with environments whose complete description falls outside this class. For example, this can take the form of looking for some small subset of features that has predictable behavior that can be exploited. In this approach, the description complexity is *probably* still something like $n + O(1)$, where this time the hypothesis is incomplete/fuzzy (but I don’t yet know how PSRL for incomplete/fuzzy hypotheses should work).

Moreover, using incomplete models we can in some sense go in the other direction, from policy to model. This might be a good way to think of model-based RL. In actor-critic algorithms, our network learns a pair consisting of a value function $V$ and a policy $\pi$. We can think of such a pair as an incomplete model that is defined by the Bellman inequality interpreted as a constraint on the transition kernel $T$ (or $T$ and the reward function $R$):

$$V(s) \le R(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim T(s, \pi(s))}[V(s')]$$

Assuming that our incomplete prior assigns this incomplete hypothesis a weight exponentially decaying in its description complexity, we get a sort of Occam’s razor for policies.

I’m claiming that intent alignment captures a large proportion of possible failure modes, that seem particularly amenable to a solution.

Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:

Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.

Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.

In this situation, machine A is a much better plan.
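For intuition, here is a quick Monte Carlo sketch (my own addition) of the arithmetic behind this judgment: machine A’s independent 1% reporting errors only rarely flip the 21-flip majority, so its end-to-end accuracy stays around 96–97%, comfortably above machine B’s expected accuracy of 0.5 · 1 + 0.5 · 0.5 = 75% (a coin-flip guess whenever the machine fails to get built).

```python
import random

def accuracy_A(trials=100_000, flips=21, err=0.01):
    """Monte Carlo estimate of machine A's end-to-end accuracy on the
    'more heads than tails?' question, with per-flip reporting error `err`."""
    correct = 0
    for _ in range(trials):
        coins = [random.random() < 0.5 for _ in range(flips)]
        # Each report is independently flipped with probability err.
        reports = [c != (random.random() < err) for c in coins]
        truth = sum(coins) > flips // 2
        guess = sum(reports) > flips // 2
        correct += truth == guess
    return correct / trials
```

The reason A degrades so gracefully is that an error only matters when the true count sits right at the 10/11 boundary, which happens in a minority of trials.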

I am struggling to understand how this works in practice. For example, consider dialogic RL. It is a scheme intended to solve AI alignment in the strong sense. The intent-alignment thesis seems to say that I should be able to find some proper subset of the features in the scheme which is sufficient for alignment in practice. I can approximately list the set of features as:

1. Basic question-answer protocol

2. Natural language annotation

3. Quantilization of questions

4. Debate over annotations

5. Dealing with no user answer

6. Dealing with inconsistent user answers

7. Dealing with changing user beliefs

8. Dealing with changing user preferences

9. Self-reference in user beliefs

10. Quantilization of computations (to combat non-Cartesian daemons, this is not in the original proposal)

11. Reverse questions

12. Translation of counterfactuals from user frame to AI frame

13. User beliefs about computations

EDIT: 14. Confidence threshold for risky actions

Which of these features are necessary for intent-alignment and which are only necessary for strong alignment? I can’t tell.

I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t resonance, because in fact they are not safe and do fall when resonance occurs.

I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance *in the relevant range of frequencies* (which is an example of using assumptions in our models that need independent validation). We want bridges that *don’t* fall, don’t we?

I don’t build mathematical theories of how to write code, and usually don’t prove my code correct

On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.

It also sounds like you’re making a normative claim for proofs; I’m more interested in the empirical claim.

I am making both claims to some degree. I can imagine a universe in which the empirical claim is true, and I consider it plausible (but far from certain) that we live in such a universe. But, even just understanding whether we live in such a universe requires building a mathematical theory.

First, you can have a *subjective* regret bound, which doesn’t require all actions to be recoverable (it does require *some* actions to be *approximately* recoverable, which is indeed the case in the real world).

Second, dealing rationally with non-recoverable actions should still translate into mathematical conditions, some of which might still look like sorts of regret bounds, and in any case finite MDPs are a natural starting point for analyzing them.

Third, checking regret bounds for priors in which all actions *are* recoverable serves as a sanity test for candidate AGI algorithms. It is not a *sufficient* desideratum, but I do think it is necessary.

I agree that *some* of the difficulties are not captured. I am curious whether you have more concrete examples in mind than what you wrote in the post?

This seems wrong to me. Can you elaborate on what you mean by “powerful” in this context? Continuous MDPs definitely describe a large variety of environments that cannot be captured by a finite state MDP, at least not without approximations. Solving continuous MDPs can also be much more difficult than finite state MDPs. For example, any POMDP can be made into a continuous MDP by treating beliefs as states, and finding the optimal policy for a POMDP is PSPACE-hard (as opposed to the case of finite state MDPs, which is solvable in polynomial time).

I guess that you might be thinking exclusively of algorithms that have something like a uniform prior over transition kernels. In this case there is obviously no way to learn about a state without visiting it. But we can also consider algorithms with more sophisticated priors and get much faster learning rates (if the environment is truly sampled from this prior, ofc). The best example is, I think, the work of Osband and Van Roy, where a regret bound is derived that scales with a certain *dimension parameter* of the hypothesis space (that can be much smaller than the number of states and actions), work on which I continued to build.