Pretty much that, actually. It doesn’t seem too irrational, though. Upon looking at a mathematical universe where torture was decided upon as a good thing, it isn’t an obvious failure of rationality to hope that a cosmic ray flips the sign bit of the utility function of an agent in there.
The practical problem with values that care about other mathematical worlds, however, is that if the agent you built has a UDT prior over values, it’s an improvement (from the perspective of the prior) for the nosy neighbors/values that care about other worlds to dictate some of what happens in your world (since the marginal contribution of your world to the prior expected utility looks like some linear combination of the various utility functions, weighted by how much they care about your world). So, in practice, it’d be a bad idea to build a UDT value learning prior containing utility functions that have preferences over all worlds, since it’d add a bunch of extra junk from different utility functions to our world if run.
If exploration is a hack, then why do pretty much all multi-armed bandit algorithms rely on exploration into suboptimal outcomes to prevent spurious underestimates of the value associated with a lever?
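To make the underestimate problem concrete, here is a minimal sketch of an epsilon-greedy bandit. The function name `run_bandit`, the Bernoulli rewards, and the zero initial estimates are all assumptions of mine for illustration, not any particular published algorithm. With `epsilon=0` the agent never revisits a lever whose estimate starts out too low, so the underestimate is never corrected.

```python
import random

def run_bandit(true_means, epsilon, steps, seed=0):
    """Epsilon-greedy on Bernoulli levers (hypothetical helper for illustration).

    Returns (pull counts, estimated mean reward) per lever.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms  # assumption: every lever starts at estimate 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pull a lever uniformly at random, even a "bad" one.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the lever with the highest current estimate
            # (ties go to the lowest index).
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average for this lever.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Lever 1 is much better, but its starting estimate (0) is too low.
greedy_counts, _ = run_bandit([0.2, 0.9], epsilon=0.0, steps=2000)
explore_counts, _ = run_bandit([0.2, 0.9], epsilon=0.1, steps=2000)
```

In this setup the pure-greedy run never pulls lever 1 at all, while the exploring run gets the chance to discover that lever 1 is better.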
Since beliefs/values combinations can be ruled out, would it then be possible to learn values by asking the human about their own beliefs?
It doesn’t hurt my brain, but there’s a brain fog that kicks in eventually, that’s kind of like a blankness with no new ideas coming, an aversion to further work, and a reduction in working memory, so I can stare at some piece of math for a while, and not comprehend it, because I can’t load all the concepts into my mind at once. It’s kind of like a hard limit for any cognition-intensive task.
This kicks in around the 2-hour mark for really intensive work/studying, although for less intensive work/studying, it can vary all the way up to 8 hours. As a general rule of thumb, the -afinil class of drugs triples my time limit until the brain fog kicks in, at a cost of less creative and lateral thinking.
Because of this, my study habits for school consisted of alternating 2-hour study blocks and naps.
The beliefs aren’t arbitrary, they’re still reasoning according to a probability distribution over propositionally consistent “worlds”. Furthermore, the beliefs converge to a single number in the limit of updating on theorems, even if the sentence of interest is unprovable. Consider some large but finite set S of sentences that haven’t been proved yet, such that the probability of sampling a sentence in that set before sampling the sentence of interest “x”, is very close to 1. Then pick a time N, that is large enough that by that time, all the logical relations between the sentences in S will have been found. Then, with probability very close to 1, either “x” or “notx” will be sampled without going outside of S.
So, if there’s some cool new theorem that shows up relating “x” and some sentence outside of S, like “y->x”, well, you’re almost certain to hit either “x” or “notx” before hitting “y”, because “y” is outside S, so this hot new theorem won’t affect the probabilities by more than a negligible amount.
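One way to make the “almost certain” step concrete, as a sketch under assumptions of my own: suppose sentences are sampled independently from a fixed distribution μ (my notation, not from the original construction), and both “x” and “notx” lie in S. Then the chance of hitting “x” or “notx” before hitting anything outside S is

```latex
\Pr\bigl[\,x \text{ or } \lnot x \text{ is sampled before any sentence outside } S\,\bigr]
  = \frac{\mu(x) + \mu(\lnot x)}{\mu(x) + \mu(\lnot x) + \mu(S^{c})}
  \;\longrightarrow\; 1
  \quad \text{as } \mu(S^{c}) \to 0 .
```

So picking S large enough that μ(S^c) is tiny compared to μ(x) + μ(notx) is what makes a theorem involving “y” nearly irrelevant.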
Also I figured out how to generalize the prior a bit to take into account arbitrary constraints other than propositional consistency, though there’s still kinks to iron out in that one. Check this.
Yup, that particular book is how I learned to prove stuff too.
(well, actually, there was a substantial time delay between reading that and being able to prove stuff, but it’s an extremely worthwhile overview)
You’re pretending that it’s what nature is doing when you update your prior. It works when sentences are shown to you in an adversarial order, but there’s the weird aspect that this prior expects the sentences to go back to being drawn from some fixed distribution afterwards. It doesn’t do a thing where it goes “ah, I’m seeing a bunch of blue blocks selectively revealed, even though I think there’s a bunch of red blocks, the next block I’ll have revealed will probably be blue”. Instead, it just sticks with its prior on red and blue blocks.
There’s a misconception here: it isn’t about finding sentences of the form A→B_i and ¬B_i, because if you do that, it immediately disproves A. It’s actually about merely finding many instances of A→B_i where P(B_i|A) is less than 1/2, and this lowers the probability of A. This is kind of like how finding out about the Banach-Tarski paradox (something you assign low probability to) may lower your degree of belief in the axiom of choice.
The particular thing that prevents trolling is that in this distribution, there’s a fixed probability of drawing A on the next round no matter how many implications and B’s you’ve found so far. So the way it evades trolling is a bit cheaty, in a certain sense, because it believes that the sequence of truth or falsity of math sentences that it sees is drawn from a certain fixed distribution, and doesn’t do anything like believing that it’s more likely to see a certain class of sentences come up soon.
There’s a difference between “consistency” (it is impossible to derive X and notX for any sentence X; this requires a halting oracle to test, because there are always more proof paths), and “propositional consistency”, which merely requires that there are no contradictions discoverable by boolean algebra only. So A∧B is propositionally inconsistent with notA, and propositionally consistent with A. If there’s some clever way to prove that B implies notA, it wouldn’t affect the propositional consistency of them at all. Propositional consistency of a set of sentences can be verified in exponential time.
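As a rough sketch of the exponential-time check, here is a brute-force truth-table test. The representation is my own assumption: an atomic sentence is a plain string, and compound sentences are tuples like `("and", "A", "B")`. The names `atoms`, `evaluate`, and `propositionally_consistent` are hypothetical helpers. Atoms are treated as opaque, so any deeper provable relation between them is invisible to the check.

```python
from itertools import product

def atoms(formula):
    """Collect the atomic sentence names appearing in a formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(atoms(sub) for sub in formula[1:]))

def evaluate(formula, assignment):
    """Evaluate a formula under a truth assignment (dict: atom -> bool)."""
    if isinstance(formula, str):
        return assignment[formula]
    op, *args = formula
    vals = [evaluate(arg, assignment) for arg in args]
    if op == "not":
        return not vals[0]
    if op == "and":
        return vals[0] and vals[1]
    if op == "or":
        return vals[0] or vals[1]
    if op == "implies":
        return (not vals[0]) or vals[1]
    raise ValueError(f"unknown connective: {op}")

def propositionally_consistent(sentences):
    """True if some truth assignment satisfies every sentence.

    Tries all 2**n assignments over the n atoms, hence exponential time.
    """
    names = sorted(set().union(*(atoms(s) for s in sentences)))
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(evaluate(s, assignment) for s in sentences):
            return True
    return False
```

For example, `propositionally_consistent([("and", "A", "B"), ("not", "A")])` is `False`, while `propositionally_consistent([("and", "A", "B"), "A"])` is `True`.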
I read through the entire Logical Induction paper, most-everything on Agent Foundations Forum, the advised Linear Algebra textbook, part of a Computational Complexity textbook, and the Optimal Poly-Time Estimators paper.
I’d be extremely interested in helping out other people with learning MIRI-relevant math, having gone through it solo. I set up a Discord chatroom for it, but it’s been pretty quiet. I’ll PM you both.
“It is easy to confuse that which is stolen with that which New Caledonia’s cartographer made, in telling the difference you’ll map while you travel and cut with no blade.” is the easiest one to translate.
It’s easy to confuse stuff that corresponds with reality with second-hand stuff that is bullshit but doesn’t obviously seem like it, in telling the difference you’ll have to figure out things as you go and accomplish things when you don’t have the tools to do so *properly* (possibly because existing knowledge on how to do the thing is sketchy or inadequate)
Intuitively, it’d be overriding preferences in 1 (but only if pre-exiert humans generally approve of the existence of post-exiert humans. If post-exiert humans had significant enough value drift that humans would willingly avoid situations that cause exiert, then 1 wouldn’t be a preference override),
wouldn’t in 2 (but only if the AI informs humans that [weird condition]->[exiert] first),
would in 3 for lust and nostalgia (because there are lots of post-[emotion] people who approve of the existence of the emotion, and pre-[emotion] people don’t seem to regard post-[emotion] people with horror) but not for intense pain (because neither post-pain people nor pre-pain people endorse its presence)
wouldn’t in 4 for lust and nostalgia, and would for pain, for basically the inverse reasons
and wouldn’t be overriding preferences in 5 (but only if pre-exiert humans generally approve of the existence of post-exiert humans)
Ok, what rule am I using here? It seems to be something like “if both pre-[experience] and post-[experience] people don’t regard it as very bad to undergo [experience] or the associated value changes, then it is overriding human preferences to remove the option of undergoing [experience], and if pre-[experience] or post-[experience] people regard it as very bad to undergo [experience] or the associated value changes, then it is not overriding human preferences to remove the option of undergoing [experience]”
My first stab at it (will be doing over the weekend): collect a big list of drama and -storms, and look for commonalities or overarching patterns, in either the failure modes, or in the list of what could have been done to prevent them ahead of time. There are lots of different group failure modes, but a lot of people seem to have an ugh field around even acknowledging the presence of drama, let alone participating in it. Thus, this seems like a worthwhile thing to throw some effort at, with a special eye towards actually finding the social version of a nuclear reactor control rod.