A (possibly dumb) question about G and v. If the sentences of L correspond to vertices of G, then are the arrows in G being interpreted as rules of inference? If so, how does this deal with rules of inference that take multiple sentences as input (both A and A→B are needed to arrive at B), since an arrow can only “link” two sentences?
Interestingly enough, the approximate coherence condition doesn’t hold when there is a short proof of φ from ψ, but it does hold when there is a sufficiently short proof of ψ from φ. (Very roughly, the valuation of (φ∧¬ψ) is negligible, while the valuation of (φ∧ψ) is approximately equal to the valuation of φ.) So coherence only works one way.
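Spelled out (assuming, as my reading of the setup, that the valuation is additive over the two conjunctions and that coherence here means v(φ) ≤ v(ψ) when φ proves ψ):

$$v(\varphi)=v(\varphi\wedge\psi)+v(\varphi\wedge\neg\psi)\approx v(\varphi\wedge\psi)\le v(\psi)$$

whenever the proof of ψ from φ is short enough to make v(φ∧¬ψ) negligible; no analogous chain runs in the other direction.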
On a mostly unrelated note, this sort of reasoning doesn’t seem to link well with the strategy of “add important mathematical facts that you discover into your pool of starting axioms, to speed up the process of deriving things.” While a set of axioms, and that set of axioms plus a bunch of utility tool theorems (Gödel’s second incompleteness theorem, Löb’s theorem, the fundamental theorem of calculus, etc.), may “span” the same space of theorems, the second one is much easier to quickly derive new interesting statements from, and seems to be how humans think about math. The inconsistency involved in getting 10 utility on the 5-and-10 problem is much easier to spot if the second incompleteness theorem is already in your pool of sentences to apply to a new problem.
As with the Russell’s Paradox example, in practice, counterfactuals seem to be mind-dependent, and vary depending on which of the many different lines of reasoning a mind heads down. If you define a subjective distance relative to a particular search algorithm, the objective valuation would just use the shortest possible subjective distance from the statement to a contradiction. The valuation, under the human distance function, of a statement in naive set theory would be high, because we were slow to notice the contradiction. So that aspect of counterfactual reasoning seems easy to capture.
Has anyone checked what this does on ASP problems?
UDT has this same problem, though. In UDT, model uncertainty is being exploited instead of environmental uncertainty, but conditioning on “Agent takes action A” introduces spurious correlations with features of the model where it takes action A.
In particular, only one of the actions will happen in the models where CON(PA) is true, so the rest of the actions occur in models where CON(PA) is false, and this causes problems, as detailed in “The Odd Counterfactuals of Playing Chicken” and the comments on “An Informal Conjecture on Proof Length and Logical Counterfactuals”.
I suspect this may also be relevant to non-optimality when the environment is proving things about the agent. The heart of doing well on those sorts of problems seems to be the agent trusting that the predictor will correctly predict its decision, but of course, a PA-based version of UDT can’t know that a PA or ZFC-based proof searcher will be sound regarding its own actions.
I don’t know, that line of reasoning that U()=10 seems like a pretty clear consequence of PA+Sound(PA)+A()=a, and the lack of a counterfactual for “X is false” doesn’t violate any of my intuitions. It’s just reasoning backwards from “The agent takes action a” to the mathematical state of affairs that must have produced it (there is a short proof of X).
On second thought, the thing that broke the original trolljecture was reasoning backwards from “I take action a” to the mathematical state of affairs that produced it. Making inferences about the mathematical state of affairs in your counterfactuals using knowledge of your own decision procedure does seem to be a failure mode at first glance.
Maybe use the counterfactual of “find-and-replace all instances of X’s source code in the universe program U with action a, and evaluate”? But that wouldn’t work for different algorithms that depend on checking the same math facts. There needs to be some way to go from “X takes action A” to “closely related algorithm Y takes action B”. But that’s just inferring mathematical statements from the combination of actions and knowing X’s decision rule.
I’ll stick with the trolljecture as the current best candidate for “objective” counterfactuals, because reasoning backwards from actions and decision rules a short way into math facts seems needed to handle “logically related” algorithms, and this counterexample looks intuitively correct.
Quick question: It is possible to drive the probability of x down arbitrarily far by finding a bunch of proofs of the form “x implies y” where y is a theorem. But the exact same argument applies to not x.
If the theorem-prover always finds a proof of the form “not x implies y” immediately afterwards, the probability wouldn’t converge, but it would fluctuate within a certain range, which looks good enough.
What, if any, conditions need to be imposed on the theorem prover to confine the probabilities assigned to an unprovable statement to a range that is narrower than (0, 1)?
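As a sanity check on that fluctuation picture, here’s a toy simulation. The update rule (each proof halves the relevant odds) is my own stand-in, not the actual construction:

```python
# Toy model: a theorem prover alternately "penalizes" x and not-x.
# Assumed update rule (my stand-in, not the actual scheme): each proof of
# "x implies y" (y a theorem) halves the odds of x; each proof of
# "not-x implies y" halves the odds of not-x, i.e. doubles the odds of x.

def to_prob(odds):
    return odds / (1.0 + odds)

odds = 1.0  # start at P(x) = 0.5
history = []
for step in range(20):
    odds /= 2.0              # prover finds "x implies y", y a theorem
    history.append(to_prob(odds))
    odds *= 2.0              # prover immediately finds "not-x implies y"
    history.append(to_prob(odds))

# P(x) never converges, but stays pinned inside [1/3, 1/2].
print(min(history), max(history))
```

With this alternation the probability oscillates in a fixed band forever, which matches the “good enough” behavior described above.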
It looks legitimate, actually.
Remember, is set-valued, so if , . In all other cases, . is a nonempty convex set-valued function, so all that’s left is to show the closed graph property. If the limiting value of is something other than 0, the closed graph property holds, and if the limiting value of is 0, the closed graph property holds because .
What does the Law of Logical Causality say about CON(PA) in Sam’s probabilistic version of the troll bridge?
My intuition is that, in that case, the agent would think CON(PA) is causally downstream of itself, because the distributions of actions conditional on CON(PA) and ¬CON(PA) are different.
Can we come up with any example where the agent thinking it can control CON(PA) (or any other thing that enables accurate predictions of its actions) actually gets it into trouble?
My first stab at it (which I’ll be doing over the weekend): collect a big list of drama and -storms, and look for commonalities or overarching patterns, either in the failure modes or in the list of what could have been done to prevent them ahead of time. There are lots of different group failure modes, but a lot of people seem to have an ugh field around even acknowledging the presence of drama, let alone participating in it.
Thus, this seems like a worthwhile thing to throw some effort at, with a special eye towards actually finding the social version of a nuclear reactor control rod.
Hm, I got the same result from a different direction.
(probably very confused/not-even-wrong thoughts ahead)
It’s possible to view a policy of the form “I’ll compute X and respond based on what X outputs” as… tying your output to X, in a sense. Logical link formation, if you will.
And policies of the form “I’ll compute X and respond in a way that makes that output of X impossible/improbable” (can’t always do this) correspond to logical link cutting.
And with this, we see what the chicken rule in MUDT/exploration in LIDT is doing. It’s systematically cutting all the logical links it can, and going, “well, if the statement remains correlated with me despite me trying my best to shake off anything that predicts me too well, I guess I ‘cause’ it.”
But some potentially-useful links were cut by this process, such as “having short abstract reasoning available that lets others predict what you will do” (a partner in a prisoner’s dilemma, the troll in troll bridge, etc.).
At the same time, some links should be cut by a policy that diagonalizes against predictions/calls upon an unpredictable process (anything that can be used to predict your behavior in matching pennies, evading Death when Death can’t crack your random number generator, etc.).
So I wound up with “predictable policy selection that forms links to stuff that would be useful to correlate with yourself, and cuts links to stuff that would be detrimental to have correlated with yourself”.
Predictably choosing an easy-to-predict policy is easy-to-predict, predictably choosing a hard-to-predict policy is hard-to-predict.
This runs directly into problem 1 of “how do you make sure you have good counterfactuals of what would happen if you had a certain pattern of logical links, if you aren’t acting unpredictably”, and maybe some other problems as well, but it feels philosophically appealing.
By the stated definitions, “v-avoidable event” is pretty much trivial when the event doesn’t lead to lasting utility loss. The conditions on “v-avoidable event” are basically:
The agent’s policy converges to optimality.
There’s a sublinear function D(t) such that the agent avoids the event for the first D(t) timesteps with probability approaching 1, in the limit as t goes to infinity.
By this definition, “getting hit in the face with a brick before round 3” is an avoidable event, even when the sequence of policies leads to the agent getting hit in the face with a brick on round 2 with certainty and it’s possible to dodge it. Let the sublinear function be the constant 1, let the sequence of policies converge to “dodge” on round 1 and “stay” on round 2, and let the brick incur sublinear utility loss.
This fulfills the conditions, so getting hit in the face with a brick before timestep 3 is a “v-avoidable” event despite certainly occurring. Thus, this condition is only meaningful for lasting failures that incur enough utility loss to prevent convergence to the optimal policy.
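Under my reading, the two conditions formalize as (symbols mine, not the paper’s):

$$\pi_t\xrightarrow{\;t\to\infty\;}\pi^{*},\qquad \exists\,D(t)=o(t):\ \lim_{t\to\infty}\Pr\big[\text{event avoided during steps }1,\dots,D(t)\big]=1$$

and the brick example satisfies both with D(t) ≡ 1, since only the first round needs to be safe and the sublinear utility loss from the brick doesn’t block convergence to the optimal policy.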
What is , in the context of the proof of Lemma A? I don’t believe it was defined anywhere else.
I don’t believe that was defined anywhere, but we “use the definition” in the proof of Lemma 1.
As far as I can tell, it’s a set of (j,y) pairs, where j is the index of a hypothesis, and y is an infinite history string, rather like the set .
How do the definitions of and differ?
A summary that might be informative to other people: Where does the requirement on the growth rate of the “rationality parameter” come from?
Well, the expected loss of the agent comes from two sources. Making a suboptimal choice on its own, and incurring a loss from consulting a not-fully-rational advisor. The policy of the agent is basically “defer to the advisor when the expected loss over all time of acting (relative to the optimal move by an agent who knew the true environment) is too high”. Too high, in this case, cashes out as “higher than ”, where t is the time discount parameter and is the level-of-rationality parameter. Note that as the operator gets more rational, the agent gets less reluctant about deferring. Also note that t is reversed from what you might think, high values of t mean that the agent has a very distant planning horizon, low values mean the agent is more present-oriented.
On most rounds, the agent acts on its own, so the expected all-time loss on a single round from taking suboptimal choices is on the order of , and also we’re summing up over about t rounds (technically exponential discount, but they’re similar enough). So the loss from acting on its own ends up being about .
On the other hand, delegation will happen on at most ~ rounds, with a loss of value, so the loss from delegation ends up being around .
Setting these two losses equal to each other/minimizing the exponent on the t when they are smooshed together gets you x=3. And then the rationality parameter must grow asymptotically faster than t^(2/3) to have the loss shrink to 0. So that’s basically where the 2⁄3 comes from: it comes from setting the delegation threshold to equalize the long-term losses from the AI acting on its own and from the human picking bad choices, as the time horizon t goes to infinity.
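Schematically, with my own placeholders for the elided quantities (τ(t) for the delegation threshold, N(t) for the number of delegation rounds, β(t) for the rationality parameter, and assuming each delegation costs on the order of 1/β(t)):

$$L_{\text{act}}\approx t\cdot\tau(t),\qquad L_{\text{delegate}}\approx\frac{N(t)}{\beta(t)},\qquad L_{\text{act}}\approx L_{\text{delegate}}\;\Rightarrow\; L\approx\frac{t^{2/3}}{\beta(t)}$$

which shrinks to 0 exactly when β(t) grows faster than t^{2/3}.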
Intuitively, it’d be overriding preferences in 1 (but only if pre-exiert humans generally approve of the existence of post-exiert humans. If post-exiert humans had significant enough value drift that humans would willingly avoid situations that cause exiert, then 1 wouldn’t be a preference override),
wouldn’t in 2 (but only if the AI informs humans that [weird condition]->[exiert] first),
would in 3 for lust and nostalgia (because there are lots of post-[emotion] people who approve of the existence of the emotion, and pre-[emotion] people don’t seem to regard post-[emotion] people with horror) but not for intense pain (because neither post-pain people nor pre-pain people endorse its presence),
wouldn’t in 4 for lust and nostalgia, and would for pain, for basically the inverse reasons
and wouldn’t be overriding preferences in 5 (but only if pre-exiert humans generally approve of the existence of post-exiert humans)
Ok, what rule am I using here? It seems to be something like “if both pre-[experience] and post-[experience] people don’t regard it as very bad to undergo [experience] or the associated value changes, then it is overriding human preferences to remove the option of undergoing [experience], and if pre-[experience] or post-[experience] people regard it as very bad to undergo [experience] or the associated value changes, then it is not overriding human preferences to remove the option of undergoing [experience]”
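That rule, written out as a toy predicate (my paraphrase of the above):

```python
# Removing the option of undergoing [experience] overrides human
# preferences iff neither the pre- nor the post-[experience] population
# regards the experience (or its value changes) as very bad.
# (Toy paraphrase of the rule stated above, not a serious proposal.)

def removal_overrides_preferences(pre_regard_as_very_bad, post_regard_as_very_bad):
    return not (pre_regard_as_very_bad or post_regard_as_very_bad)

print(removal_overrides_preferences(False, False))  # lust/nostalgia: True, removal overrides
print(removal_overrides_preferences(True, True))    # intense pain: False, removal doesn't
```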
“It is easy to confuse that which is stolen with that which New Caledonia’s cartographer made, in telling the difference you’ll map while you travel and cut with no blade.” is the easiest one to translate.
It’s easy to confuse stuff that corresponds with reality with second-hand stuff that is bullshit but doesn’t obviously seem like it; in telling the difference, you’ll have to figure out things as you go, and accomplish things when you don’t have the tools to do so *properly* (possibly because existing knowledge of how to do the thing is sketchy or inadequate).
Intermediate update:
The handwavy argument about how you’d get propositional inconsistency in the limit of imposing a constraint of the form “the string cannot contain [some given finite list of sentences] all together” is less clear than I thought. The problem is that, while the prior may learn that the constraint applies as it updates on more sentences, that particular constraint can get you into situations where adding either a sentence or its negation leads to a violation of the constraint.
So, running the prior far enough forward leads to the probability distribution being nearly certain that, while that particular constraint applied in the past, it will stop applying at some point in the future by vetoing both possible extensions of a string of sentences, and then less-constrained conditions will apply from that point forward.
On one hand, if you don’t have the computational resources to enforce full propositional consistency, it’s expected that most of the worlds you generate will be propositionally inconsistent, and that you’ll only discover this midway through generating them.
On the other hand, we want to be able to believe that constraints capable of painting themselves into a corner will apply to reality forevermore.
I’ll think about this a bit more. One possible line of attack is having the probabilities of a sentence and its negation not add up to one, because it’s possible that the sentence-generating process will just stop cold before one of the two shows up, and then renormalizing them to 1. But I’d have to check whether it’s still possible to approximate the distribution if we introduce this renormalization, and to be honest, I wouldn’t be surprised if there was a more elegant way around this.
EDIT: yes, it’s still possible to approximate the distribution in known time with this renormalization, although the bounds are really loose. Will type up the proof later.
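For concreteness, the renormalization in question (my notation):

$$\tilde P(\varphi)=\frac{P(\varphi)}{P(\varphi)+P(\neg\varphi)},\qquad \tilde P(\neg\varphi)=\frac{P(\neg\varphi)}{P(\varphi)+P(\neg\varphi)}$$

where P(φ)+P(¬φ) may be strictly less than 1 because the generating process can stop cold before producing either sentence.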
I read through the entire Logical Induction paper, most-everything on Agent Foundations Forum, the advised Linear Algebra textbook, part of a Computational Complexity textbook, and the Optimal Poly-Time Estimators paper.
I’d be extremely interested in helping out other people with learning MIRI-relevant math, having gone through it solo. I set up a Discord chatroom for it, but it’s been pretty quiet. I’ll PM you both.
If you drop the Pareto-improvement condition from the cell rank, and just have “everyone sorts things by their own utility”, then you won’t necessarily get a Pareto-optimal outcome (within the set of cell center-points), but you will at least get a point where there are no strict Pareto improvements (no point that leaves everyone strictly better off).
The difference between the two is… let’s say we’ve got a 2-player 2-move game that, in utility-space, makes some sort of quadrilateral. If the top and right edges join at 90 degrees, the Pareto frontier would be the corner point, while the set with “no strict Pareto improvements” would be the top and right edges.
If that corner is obtuse, then both “Pareto frontier” and “no strict Pareto improvements” agree that both edges are within the set, and if the corner is acute, both agree that only the corner is within the set. It actually isn’t much of a difference; it only manifests when the utilities for a player are exactly equal, and it is easily changed by a little bit of noise.
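Here’s the right-angle corner case with the two definitions encoded directly (my encoding, on a finite set of utility points):

```python
# "Pareto frontier" vs. "no strict Pareto improvement" on a finite set
# of utility tuples (one coordinate per player).

def weakly_dominates(y, x):
    # y is at least as good for everyone, strictly better for someone.
    return all(a >= b for a, b in zip(y, x)) and any(a > b for a, b in zip(y, x))

def strictly_dominates(y, x):
    # y is strictly better for everyone.
    return all(a > b for a, b in zip(y, x))

def pareto_frontier(points):
    return [x for x in points if not any(weakly_dominates(y, x) for y in points)]

def no_strict_improvement(points):
    return [x for x in points if not any(strictly_dominates(y, x) for y in points)]

# Top and right edges joining at 90 degrees, corner at (1, 1):
pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(pareto_frontier(pts))        # [(1, 1)]                 -- just the corner
print(no_strict_improvement(pts))  # [(0, 1), (1, 0), (1, 1)] -- the edges too
```

The two sets come apart only where some player’s utility is exactly tied, which is why a little noise collapses the difference.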
The utility-approximation issue you pointed out seems to be pointing towards the impossibility of guaranteeing convergence to a point on the Pareto frontier (as the cell size shrinks to zero), precisely because of that “this set is unstable under arbitrarily small noise” issue.
But the “set of all points that have no strict Pareto improvements by more than ϵ for all players”, i.e. the ϵ-fuzzed version of the “set of points with no strict Pareto improvement”, does seem to be robust against a little bit of noise, and doesn’t require the Pareto-improvement condition on everyone’s ranking of cells.
So I’m thinking that if that’s all we can attain (because of the complication you pointed out), then it lets us drop that inelegant Pareto-improvement condition.
I’ll work on the proof that for a sufficiently small cell size, you can get an outcome within ϵ of the set of “no strict Pareto improvements available”.
Nice job spotting that flaw.
There’s a difference between “consistency” (it is impossible to derive X and ¬X for any sentence X; testing this requires a halting oracle, because there are always more proof paths) and “propositional consistency”, which merely requires that there are no contradictions discoverable by boolean algebra alone. So A∧B is propositionally inconsistent with ¬A, and propositionally consistent with A. If there’s some clever way to prove that B implies ¬A, it wouldn’t affect their propositional consistency at all. Propositional consistency of a set of sentences can be verified in exponential time.
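A brute-force version of that check, matching the exponential-time bound (truth-table enumeration; encoding formulas as functions of an assignment is my choice):

```python
from itertools import product

def propositionally_consistent(formulas, atoms):
    # Consistent iff some truth assignment to the atoms satisfies
    # every formula simultaneously; 2^len(atoms) assignments to try.
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if all(f(v) for f in formulas):
            return True
    return False

A     = lambda v: v["A"]
notA  = lambda v: not v["A"]
AandB = lambda v: v["A"] and v["B"]

print(propositionally_consistent([AandB, notA], ["A", "B"]))  # False
print(propositionally_consistent([AandB, A], ["A", "B"]))     # True
```

Even if B provably implies ¬A in the background theory, nothing here changes, since the check only sees the boolean structure.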
If you are looking for a weaker inner reflection principle, does P(┌(a<P(┌φ┐)<b)→(P(┌a−ϵ<P(┌φ┐)<b+ϵ┐)=1)┐)=1, for some finite ϵ, sound viable, or are there fatal flaws with it?
This came about while trying to figure out how to break the proof in the probabilistic procrastination paper. Making the reflection principle unable to prove that P(eventually presses button) is above 1−ϵ came up as a possible way forward.