LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

# Charlie Steiner

# Limited agents need approximate induction

Neat. Okay, here’s my sketch of a counterexample (or more precisely, a way in which this may violate my imagined desiderata for counterfactuals). A statement that leads to a short proof of a contradiction does not seem like the counterfactual ancestor of every single statement that does not lead to a short proof of a contradiction.

E.g. “52*63=3126” doesn’t seem like it counterfactually implies that there exists a division of the plane that requires at least 5 colors to fill in without having identical colors touch. It seems much more like it counterfactually implies “51*63=3063” (i.e. the converse of the trolljecture seems really false). But I dunno, maybe that’s just my aesthetics about proofs that go through a contradiction. Presumably “legitimate counterfactual consequence” could be cashed out as some sort of pragmatic statement about a decision-making algorithm?

This is an interesting way of thinking about logical counterfactuals. It all seems to come down to what desiderata you desiderate.

We might assign a DAG to by choosing a reference theorem-prover, which uses theorems/syntactic rules to generate more theorems. We then draw an edge to each sentence in from its direct antecedents in its first proof in the reference theorem-prover. One option is that would only be allowed to disagree with sentences in for sentences that are descendants of . But this doesn’t specify your , because it doesn’t assign a place in the graph to sentences not in .
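As a sketch of this construction (all names here are hypothetical, and sentences are treated as opaque strings rather than real formulas): a toy forward-chaining prover can record, for each derived sentence, the antecedents used in its first proof, and those tuples are exactly the edges of the DAG.

```python
# Toy sketch: build a proof DAG from a reference theorem-prover.
# Sentences are opaque strings; rules are (premises, conclusion) pairs.

def build_proof_dag(axioms, rules):
    """Forward-chain and record each sentence's antecedents in its FIRST proof."""
    proved = set(axioms)
    parents = {a: () for a in axioms}  # axioms have no antecedents
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in parents and all(p in proved for p in premises):
                parents[conclusion] = tuple(premises)  # first proof wins; later proofs are ignored
                proved.add(conclusion)
                changed = True
    return parents

dag = build_proof_dag({"A", "B"}, [(("A",), "C"), (("A", "B"), "D"), (("C", "D"), "E")])
# "E" gets edges from "C" and "D", its direct antecedents in its first proof.
```

The "first proof wins" line is what makes the graph well-defined even when a sentence has many proofs; sentences the prover never reaches simply get no node, which is exactly the underspecification noted above.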

If we try a similar theorem-prover assignment to specify with false under PA, we’ll get very silly things as soon as the theorem-prover proves a contradiction and explodes; our graph will no longer follow the route analogous to the one for . Is there some way to enforce that analogy?

Ideally what I think I desiderate is more complicated than only disagreeing on -descendants of - for example if is a counterexample to a universal statement that is not a descendant of any individual cases, I’d like the universal statement to be counterfactually disproved.

This seems neat. The immediate question I have is whether this scheme assigns logical probabilities only to an infinite sequence, rather than to individual elements.

[Assuming that the limit exists. (One is restricted to a certain class of selection rules that don’t contain much information about , or else one could just pick only the fives of , or only the threes, and the limit would not exist.)]
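The restriction on selection rules can be seen in a quick simulation (a sketch with made-up numbers): a rule that picks positions without looking at the values preserves the limiting frequency, while a rule that peeks at the values, like "pick only the fives", forces whatever limit you like.

```python
import random

random.seed(0)
seq = [random.randrange(10) for _ in range(100_000)]  # stand-in for the digit sequence

overall = seq.count(5) / len(seq)         # limiting frequency of 5s, ~0.1

# A selection rule blind to the values (every other position):
fair = seq[::2]
fair_freq = fair.count(5) / len(fair)     # still ~0.1

# A rule that peeks at the values ("pick only the fives"):
cheat = [x for x in seq if x == 5]
cheat_freq = cheat.count(5) / len(cheat)  # forced to exactly 1.0
```

The fair rule carries no information about the sequence, so the subsequence inherits the same limit; the cheating rule encodes the answer into the selection itself.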

But I suppose I’ll just have to commit idle speculation until further posts are released :)

~~I dunno that I buy that the Benford test is a desideratum at all levels of computational resources (if n is even, A(n) is even), but~~ I’m still pretty hyped that you’re posting about logical uncertainty. Actually, now I’m curious about the modulo arithmetic properties of Ackermann’s function.
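Out of curiosity about those modulo-arithmetic properties, the small diagonal values are at least computable directly (a quick sketch; the two-argument Ackermann–Péter convention is an assumption here, since the post may use a different variant):

```python
import sys
from functools import lru_cache

sys.setrecursionlimit(100_000)

@lru_cache(maxsize=None)
def ack(m, n):
    """Two-argument Ackermann-Peter function."""
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))

diag = [ack(n, n) for n in range(4)]  # A(n) = ack(n, n): [1, 3, 7, 61]
mod_10 = [a % 10 for a in diag]       # residues mod 10
```

Beyond n = 3 the values are astronomically large, so probing residues of bigger entries would need the periodicity of the tower structure mod m rather than direct evaluation.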

It’s interesting that this is basically the opposite of the Gaifman condition—clearly there are conflicting intuitions about what makes a ‘good’ conditional logical probability.

On the open problem: in order to prove 3 from 2, all you need is that when proves − 3 follows from 2 if you do that substitution, and then divide by , which is less than or equal to 1 (this may assume an extra commonsense axiom that probabilities are positive).

Now consider applying rule 1 to , T proven by s. R proves that only one of is true, and also proves that only one of is true. Thus 3 is derivable from 1 and 2.

Is the idea that the proof necessary to use 1 is of infinite length, and you want your logic to be finitary? Hm. This seems odd, because is in some sense already a function with an infinitely long argument. How do you feel about using 2 in the form of , therefore , which has the same amount of argument as ? I’m confused about at least one thing.

Also, is there some reason you prefer not to reply using the button below the comment?

Oh, whoops! Managed to confuse myself. I’m not totally sure about Benford’s law, but am happy to assume for the sake of argument now that I’m not actually deluded.

Double Edit: Hm, the converse of the equidistribution theorem doesn’t hold even close to as strictly as I thought it did. Nevermind me.

I am, as usual, a bit confused. If you require a sentence to be consistent with (e.g.) PA before being added to , this proposal is unable to assign nonzero probability to the trillionth digit of pi being 2 - and conditional on the trillionth digit of pi counterfactually being 2, it is unable to go on to believe in PA.

It seems like some looser condition for adding to the theory is needed. Not just as a sop to practicality, but to get at some important desiderata of logical counterfactuals.

Here’s the picture of logical counterfactuals that I’m currently thinking under:

People have some method for generating mental models of math, and these mental models have independencies that the ground truth mathematics doesn’t. E.g., when I imagine the trillionth digit of pi being 2, it doesn’t change (in the mental model) whether the Collatz conjecture is true. In fact, for typical scenarios I consider, I can continue to endorse (in my mental model) the typical properties of the real numbers, even when considering a collection of statements, most of which are inconsistent with those properties (like assigning a distribution over some digit of pi).

This apparent independence produces an apparent partial causal graph (within a certain family of mental models), which leads to the use of causal language like “Even if one set the trillionth digit of pi to 2, it would not change the things I’m taking for granted in my mental models, nor would it change the things that would not change in my mental model when I change the setting of this digit of pi.”

When describing the failure mode, you have the approximate-Solomonoff agent try to predict (by assigning approximately uniform probability since it can’t invert the hash in O(n^2) time), and then plug that distribution into the reward function to compute the “expected reward” of action 1 (very small, since only one can be correct).

However, this problem would be circumvented if the agent did things the opposite way—first try to predict (by assigning probability 0.9 to based on the frequency data), and only then try to invert the hash and fail.
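Here’s a toy numerical version of that contrast (everything—the 0.9 base rate, the hash, the round count—is made up for illustration): an agent that works hash-first can only manage a uniform guess over preimages it can’t distinguish, while one that predicts the observable bit first recovers the 0.9 frequency.

```python
import hashlib
import random

random.seed(1)

def h(b, nonce):
    """Opaque hash of the bit; the agent cannot invert this cheaply."""
    return hashlib.sha256(f"{b}:{nonce}".encode()).hexdigest()

rounds = []
for nonce in range(1000):
    b = 1 if random.random() < 0.9 else 0  # the bit is 1 on ~90% of rounds
    rounds.append((b, h(b, nonce)))

# Hash-first: uniform over the two preimages it cannot distinguish.
hash_first_estimate = 0.5

# Frequency-first: predict the bit from its observed base rate; give up on the hash.
freq_first_estimate = sum(b for b, _ in rounds) / len(rounds)  # ~0.9
```

The ordering matters exactly as described: failing late (on the hash) leaves the good frequency estimate intact, while failing early poisons everything downstream of it.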

There might be a more general rule here: the order of prediction where you fail later rather than earlier gives better results. Or: the more approximate an answer, the less precedence it should have in the final prediction.

This is still a bit unsatisfying—it’s not abstract reasoning (at least not obviously) - but I think the equivalent would look more like abstract reasoning if the underlying predictor had to use smarter search on a smaller hypothesis space than Solomonoff induction.

I wonder if we can cash this counterfactual mugging out in terms of ignorance about physical facts. This dodges the difficulty of logical counterfactuals, but maybe not too boringly.

After all, what Omega says to you is “I calculated and took different actions depending on the result.” When you take the logical counterfactual problem head-on, you are attempting to answer “what if was different than what it was?” But there is another possible question, which is “why ?” Or, “what is my (prior-ish; we’re being updateless) distribution over this parameter that Omega used in deciding what action to take?”

If we knew Omega was just going to use something like a big tower of exponents, the distribution over the first digit follows Benford’s law. But maybe we also think Omega might have chosen a digit of pi, and so we have to add in some uniform contribution, etc.
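As a sketch with invented weights: say we put 70% credence on the tower-of-exponents model (leading digit Benford-distributed) and 30% on a digit-of-pi model (treated here, as a simplifying assumption, as uniform over 1–9). The prior over the first digit is just the mixture.

```python
import math

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]  # P(first digit = d), Benford's law
uniform = [1 / 9] * 9                                     # uniform over digits 1..9

w = 0.7  # hypothetical credence in the tower-of-exponents model
mixed = [w * b + (1 - w) * u for b, u in zip(benford, uniform)]
# mixed[0] is the prior probability that the leading digit is 1
```

The mixture stays a proper distribution and keeps Benford’s characteristic skew toward small leading digits, just flattened by the uniform component.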

In your example for bias, the agent only has incentive to manipulate humans because it’s treating human word as truth, rather than as evidence. For example, an AI that relies on button-pressing to learn about human morality will try to press its own buttons if it thinks that the buttons are identical to morality, but will not do so if it has a causal model of the world that allows for morality as one of several causes of button presses.

So a fully probabilistic value learner as in Dewey 2011 doesn’t have this manipulativeness—the trouble is just that we don’t know how to write down the perfect probabilistic model of the world that such a value learner needs in order to work. Hm, I wonder if there’s a way to solve this problem with lots of data and stochastic gradient descent.

(EDIT: For toy problems, you might try to learn correct moral updating from examples of correct moral updates, but the data would be hard to generate for the real world and the space to search would be huge. It seems to me that an AI couldn’t start ignorant, then learn how to learn about morality as it explored the world, then explore the world.)

I think you can put this scheme on a nicer foundation by talking about strategies rather than actions, and by letting the AI have some probability distribution over .

Then you just use the strategy that maximizes . You can also think of this as doing a simplification of the expected utility calculation that bakes in the assumption that the AI can’t change .

You can then reintroduce the action with the observation that the AI will also be well-behaved if it maximizes .

Ah, I see what you mean. I wonder what kind of real-world issues crop up if you identify with no output along the output channel.

Another way to approach identifying is with observational data collected by humans. Or (hopefully) with some stronger and more semantically-rich method that tries to separate the causes we really want to preserve from the causes that we regard as interference.

The reason we want a description of counterfactuals is to allow for a model of the world where we plug in counterfactual actions and get back the expected outcome, allowing us to choose between actions/strategies. Counterfactuals don’t have any reality outside of how we think about them.

Thus, the motivation for an improvement to the causal-intervention model of counterfactuals is not that it should correspond to some external reality, but that it should help reach good outcomes. We can still try to make a descriptive model of how humans do logical counterfactual reasoning, but our end goal should be to understand why something like that actually leads to good outcomes.

It’s important to note that it’s okay to use human reasoning to validate something that is supposedly not just a descriptive model of human reasoning. Sure, it creates selection bias, but what other reasoning are we going to use? See Neurath’s Boat (improving philosophy is like rebuilding a boat piece by piece while adrift at sea), Ironism (awareness and acceptance of the contingency of our beliefs).

In the end, I suspect that what counts as a good model for predicting outcomes of actions will vary strongly depending on the environment. See related rambling by me, particularly part 4. This is more related to Scott Garrabrant’s logical inductors in hindsight than it was in any kind of foresight.

As a corollary, I think this also pinpoints why the Doomsday argument isn’t magic. Sure, before I see more information, I expect to be halfway through my reference class. But precisely because it doesn’t use much information, this prior is very easy to update. As soon as we have some ideas about models of the world, we should probably just use those models.
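A toy Bayes calculation (all numbers invented) shows how weak that prior is: the self-sampling likelihood 1/N strongly favors a small world at first, but a modest likelihood ratio from actual world-models swamps it.

```python
def posterior_small(prior_small, lik_small, lik_large):
    """Posterior probability of the 'small world' hypothesis after one update."""
    num = prior_small * lik_small
    return num / (num + (1 - prior_small) * lik_large)

# Two hypothetical worlds: 2,000 total observers vs 2,000,000,000.
# Observing that you are observer #1000 has likelihood 1/N under each.
p_doom = posterior_small(0.5, 1 / 2_000, 1 / 2_000_000_000)  # ~0.999999

# Fold in (hypothetical) model-based evidence favoring the large world 10^7 : 1.
p_doom_after_models = posterior_small(p_doom, 1.0, 1e7)       # ~0.09
```

The doomsday-style update looks overwhelming in isolation, yet one round of ordinary model-based evidence flips the conclusion—precisely the "very easy to update" point.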

This seems like the sort of reasonable and introspective rationalization that could grow up around literally any tradition.

Even if something is a neutral or bad idea overall, it merely takes a little creativity to focus on the good parts that it has. How many generations did smart people support the gold standard? (To pick an idea that’s suboptimal and impactful but not too crazy or repulsive.) I’m sure there are some quite nice essays ruminating on its good points.

Now, suppose that you are reading an essay “drawing out the value” of the gold standard. How much does this essay help you arrive at useful beliefs about currency? Perhaps it is quite helpful, because even if it’s highlighting the good points of a bad idea, those particular good points might be general things it’s helpful to keep in mind. On the other hand, if those good points are cherry-picked to tell a favorable story about the gold standard, then maybe keeping them in mind will come at the expense of tracking other important factors.

Overall, I would say that an essay on the good points of the gold standard is only useful if you already have some key knowledge about currencies that will allow you to integrate the essay into a larger system of understanding, without being unduly swayed by it. And I think that this sort of flexible understanding requires a lot of work to gain, and most people don’t have it about most things. Therefore, an essay about the good points of an idea of unknown quality can be helpful for people with strong understanding of the subject, but is not very useful for most readers.

Why do people react to fire alarms? It’s not just that they’re public—smoke is public too. One big factor is that we’ve had reacting to fire alarms drilled into us since childhood, a policy probably formulated after a few incidents of children not responding to fire alarms.

What this suggests is that even if signals are unclear, maybe what we really need is training. If some semi-arbitrary advance is chosen, people may or may not change their behavior when that advance occurs, depending on whether they have been successfully trained to be able to change their behavior.

On the other hand, we should already be working on AI safety, and so attempting to set up a fire alarm may be pointless—we need people to already be evacuating and calling the firefighters.

What do you think about making burritos that have low information content given good burrito recipes (or sampling from a distribution f that has low information relative to the learned distribution b)? Fancy burritos seem a lot simpler to describe than nanobot-containing burritos, given the machinery to describe mediocre burritos.
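One way to make "low information relative to the learned distribution" concrete is relative entropy (a sketch over an invented three-outcome burrito space): the description cost of a proposal distribution f, given the learned distribution b, is D(f‖b).

```python
import math

def kl_bits(f, b):
    """Relative entropy D(f || b) in bits: extra information in f beyond baseline b."""
    return sum(p * math.log2(p / q) for p, q in zip(f, b) if p > 0)

# Hypothetical learned distribution b over [mediocre, fancy, nanobot] burritos:
b = [0.90, 0.0999, 0.0001]

fancy = [0.5, 0.5, 0.0]  # tilts toward fancy burritos: cheap to specify given b
nano = [0.0, 0.0, 1.0]   # all mass on nanobot burritos: very surprising under b

cheap = kl_bits(fancy, b)   # well under 1 bit
costly = kl_bits(nano, b)   # log2(10000), about 13.3 bits
```

This matches the intuition in the question: fancy burritos are a short edit to the mediocre-burrito distribution, while nanobot burritos require spending many bits against everything b learned.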