I think about AI alignment. Send help.
James Payor
Hm, I think an important piece of “intuitionistic proof” didn’t transfer, or is broken. Drawing attention to that part:
> Regardless of the details of how “decisions” are made, it seems easy for the choice to be one of the massive array of outcomes possible once you have control of the light-cone, made possible by acquiring power.
So here, I realize, I am relying on something like “the AI implicitly moves toward an imagined realizable future”. I think that’s a lot easier to get than the pipeline you sketch.
I think I’m being pretty unclear—I’m having trouble conveying my thought structure here. I’ll go make a meta-level comment instead.
Let’s go a little meta.
It seems clear that an agent that “maximizes utility” exhibits instrumental convergence. I think we can state a stronger claim: any agent that “plans to reach imagined futures”, with some implicit “preferences over futures”, exhibits instrumental convergence.
The question then is how much can you weaken the constraint “looks like a utility maximizer”, before instrumental convergence breaks? Where is the point in between “formless program” and “selects preferred imagined futures” at which instrumental convergence starts/stops applying?
---
This moves in the direction of working out exactly which components of utility-maximizing behaviour are necessary. (Personally, I think you might only need to assume “backchaining”.)
So, I’m curious: What do you think a minimal set of necessary pieces might be, before a program is close enough to “goal directed” for instrumental convergence to apply?
This might be a difficult question to answer, but it’s probably a good way to understand why instrumental convergence feels so real to other people.
Just a PSA: right-clicking or middle-clicking the posts on the frontpage toggles whether the preview is open. Please make them only expand on left clicks, or equivalent!
I’ve recently noticed something about myself: attempting to push away or not have an experience actually means pushing away those parts of myself that have that experience.
I then feel an urge to remind readers of a view of Rationalist Lent as an experiment. Don’t let this be another way that you look away from what’s real for you. But do let it be a way to learn more about what’s real for you.
Q7 (Python):

```python
Y = lambda s: eval(s)(s)
Y('lambda s: print(f"Y = lambda s: eval(s)(s)\\nY({s!r})")')
```

Q8 (Python):

Not sure about the interpretation of this one. Here’s a way to have it work for any fixed (Python function) f:

```python
f = 'lambda s: "\\n".join(s.splitlines()[::-1])'
go = 'lambda s: print(eval(f)(eval(s)(s)))'
eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')
```
> First problem with this argument: there are no coherence theories saying that an agent needs to maintain the same utility function over time.
This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future “you” and merge utility functions, which seems strictly better than not. (Side note: I’m pretty annoyed with all the use of “there’s no coherence theorem for X” in this post.)
As a separate note, the “further out” your goal is and the more that your actions are for instrumental value, the more it should look like world 1 in which agents are valuing abstract properties of world states, and the less we should observe preferences over trajectories to reach said states.
(This is a reason in my mind to prefer the approval-directed-agent frame, in which humans get to inject preferences that are more about trajectories.)
There is a nuclear analog for accident risk. A quote from Richard Hamming:
> Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, “It is the probability that the test bomb will ignite the whole atmosphere.” I decided I would check it myself! The next day when he came for the answers I remarked to him, “The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen—after all, there could be no experiments at the needed energy levels.” He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, “What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?” I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, “Never mind, Hamming, no one will ever blame you.”
https://en.wikipedia.org/wiki/Richard_Hamming#Manhattan_Project
It wasn’t meant as a reply to a particular thing—mainly I’m flagging this as an AI-risk analogy I like.
On that theme, one thing “we don’t know if the nukes will ignite the atmosphere” has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth’s atmosphere) that we don’t have experience with. Which is an entirely different question than “what happens with the nukes after we don’t ignite the atmosphere in a test explosion”.
I like thinking about coordination from this viewpoint.
Having a go at pointing at “reality-masking” puzzles:
There was the example of discovering how to cue your students into signalling they understand the content. I think this is about engaging with a reality-masking puzzle that might show up as “how can I avoid my students probing at my flaws while teaching” or “how can I have my students recommend me as a good tutor”, etc.
It’s a puzzle in the sense that it’s an aspect of reality you’re grappling with. It’s reality-masking in that the pressure was away from building true/accurate maps.
Having a go at the analogous thing for “disabling part of the epistemic immune system”: the cluster of things we’re calling an “epistemic immune system” is part of reality, and in fact important for people’s stability and thinking, but the puzzle of “trying to have people be able to think/be agenty/etc” has tended to push us toward ignoring that part of things.
Rather than, say, instinctively trusting that the “immune response” is telling us something important about reality and the person’s way of thinking/grounding, one might be looking to avoid or disable the response. This feels reality-masking; like not engaging with the data that’s there in a way that moves toward greater understanding and grounding.
My best so far on puzzle 1:
Score: 108
This is a variant on but we get via , where we implement divide by 2 with sqrt.
I have an intuition that the dutch-book arguments still apply in very relevant ways. I mostly want to talk about how maximization appears to be convergent. Let’s see how this goes as a comment.
My main point: if you think an intelligent agent forms and pursues instrumental goals, then I think that agent will be doing a lot of maximization inside, and will prefer to not get dutch-booked relative to its instrumental goals.
---
First, an obvious take on the pizza non-transitivity thing.
If I’m that person desiring a slice of pizza, I’m perhaps desiring it because it will leave me full + taste good + not cost too much.
Is there something wrong with me paying some money to switch the pizza slice back and forth? Well, if the reason I cared about the pizza was that it was low-cost tasty food, then I guess I’m doing a bad job at getting what I care about.
If I enjoy the process of paying for a different slice of pizza, or am indifferent to it, then that’s a different story. And it doesn’t hurt much to pay 1 cent a couple of times anyway.
---
Second, suppose I’m trying to get to the moon. How would I go about it?
I might start with estimates about how valuable different suboutcomes are, relative to my attempt to get to the moon. For instance, I might begin with the theory that I need to have a million dollars to get to the moon, and that I’ll need to acquire some rocket fuel too.
If I’m trying to get to the moon soon, I will be open to plans that make me money quickly, and teach me how to get rocket fuel. I would also like better ideas about how I should get to the moon, and if you told me about how calculus and finite-element analysis would be useful, I’d update my plans. (And if I were smarter, I might have figured that out on my own.)
If I think that I need a much better grasp of calculus, I might then dedicate some time to learning about it. If you offer me a plan for learning more about calculus, better and faster, I’ll happily update and follow it. If I’m smart enough to find a better plan on my own, by thinking, I’ll update and follow it.
---
So, you might think that I can be an intelligent agent, and basically not do anything in my mind that looks like “maximizing”. I disagree! In my above parable, it should be clear that my mind is continually selecting options that look better to me. I think this is happening very ubiquitously in my mind, and also in agents that are generally intelligent.
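To gesture at what I mean by maximization showing up internally, here’s a toy sketch (the candidate plans and scores below are made up, purely for illustration):

```python
# A toy sketch: even a simple "keep whichever plan currently looks better" loop
# is doing a local argmax. The candidate plans and their scores are made up.

def choose(plans, estimated_value):
    """Repeatedly keep whichever candidate plan looks better so far."""
    best = None
    for plan in plans:
        if best is None or estimated_value(plan) > estimated_value(best):
            best = plan  # a small act of maximization, every time
    return best

candidate_plans = {
    "save slowly from a normal job": 2,
    "start a business to fund the rocket": 5,
    "learn calculus, then plan the trajectory": 7,
}
print(choose(candidate_plans, candidate_plans.get))
# -> "learn calculus, then plan the trajectory"
```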
I’m pretty confused about how PCR testing can be so bad. Do you have more models/info here you can share?
In particular, I think it might be the case that we’ve done something like overupdate on poorly-done early Chinese PCR data. When I looked for data a while back, I only found the early Wuhan stuff, and the company-backed studies claiming 98% or 99% accuracy, neither of which seems trustworthy...
I currently suspect that PCR tests are effective, at least if the patient has grown enough virus to soon be infectious. I’d like to know if this is true. The main beliefs I have here (that may well be false):
The PCR methodology, when done right, should detect the presence of tiny amounts of viral fragments.
The amount of virus needed per unit saliva to infect someone is at least a few orders of magnitude greater than the detection threshold for PCR or other amplification techniques.
If my picture is right, I can perhaps still believe in a 50% false negative rate, but I would look to explain that as “you tested them too early in the infection”, and would suspect the false negative rate to be more like 1-5% for a patient that’s shedding enough virus to be infectious.
I once looked into the effectiveness of Australia and New Zealand’s quarantine programs to get a sense for this. I think, until recently, basically no infectious cases made it through their 2-week quarantines. Their track records have become more marred since Delta arrived.
For New Zealand, if I recall correctly, basically no community infection clusters were due to quarantine breakthroughs (citation needed!). Of the cases caught with PCR, 80% tested positive on day 3 of quarantine, and the remaining 20% were positive on day 12. So while some cases might not have been detected, it seems like these didn’t go on to infect others after the 2 weeks. The people getting tested on day 3 could have been infected on their plane flight, or perhaps some days before their flight. I’d guess the median infection was a week old by day 3 of quarantine.
Anyhow, NZ’s numbers seemed to rule out a 50% false-negative rate, because I think their quarantine would have failed if so. They also seem to rule out a 2% false-negative rate, at least for tests done early after infection.
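A rough back-of-the-envelope, with made-up numbers (the arrival count is hypothetical, and I’m assuming the day-3 and day-12 tests fail independently):

```python
# Back-of-the-envelope: if every infected arrival is tested on day 3 and day 12,
# and each test independently misses with probability fnr, both tests miss with
# probability fnr**2. The arrival count below is a hypothetical, not NZ data.
infected_arrivals = 300
for fnr in (0.50, 0.20, 0.05):
    expected_breakthroughs = infected_arrivals * fnr ** 2
    print(f"per-test FNR {fnr:.0%}: ~{expected_breakthroughs:.0f} expected breakthrough cases")
# A 50% per-test rate would predict dozens of breakthroughs, not a clean record.
```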
Also, when I look around, I find charts like these that suggest the claimed false negative rates vary absurdly!
Good points! My principled take is that you want to minimize your adversary’s success probability, as a function of the number of guesses they take.
In the usual case where guessing some wrong password X does nothing to narrow their search other than tell them “the password isn’t X”, the best the adversary can do is spend their guesses on passwords in order of decreasing likelihood. If they know your password-generating procedure, their best chance of succeeding in k tries is the sum of the likelihoods of the k most common passwords.
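To make that concrete, here’s a small sketch (the distributions below are made up, not real password data):

```python
# The adversary's cumulative success probability after 1, 2, 3, ... guesses,
# assuming they know the generating procedure and guess in order of likelihood.
from itertools import accumulate

def success_curve(password_probs):
    return list(accumulate(sorted(password_probs, reverse=True)))

# Hypothetical 20-password schemes: a skewed distribution vs. a fair D20 roll.
skewed  = [0.30, 0.20, 0.10] + [0.40 / 17] * 17
uniform = [1 / 20] * 20
print(success_curve(skewed)[:3])   # ~[0.30, 0.50, 0.60] after 3 guesses
print(success_curve(uniform)[:3])  # ~[0.05, 0.10, 0.15] after 3 guesses
```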
I note also that I could “fix” your roll-a-D20 password-generating procedure by rejecting samples until I get something the adversary assigns low probability to.
This won’t work in general though...
Nice!
I note you do at least get a partial ordering here, where some schemes give the adversary a lower cumulative probability of success than others at every number of guesses.
This should be similar to (perhaps more fine-grained than, idk) the min-entropy approach. But I haven’t thought about it :)
Well, the counterexample I have in mind is: flip a coin until it comes up tails, your password is the number of heads you got in a row.
While you could rejection-sample this to e.g. a uniform distribution on the numbers 0 through n, this would take exponentially many samples, and isn’t very feasible. (As you need roughly 2^n work to get n bits of strength!)
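Here’s a quick sketch of why (my code; the exponential cost is the point, not the exact constants):

```python
# The password is the number of heads before the first tail, so outcome n has
# probability 2**-(n+1). Rejection-sampling until the adversary-assigned
# probability is at most 2**-bits means waiting for an outcome with at least
# bits-1 heads, which takes about 2**(bits-1) draws on average.
import random

def draw():
    """Number of heads before the first tail."""
    n = 0
    while random.random() < 0.5:
        n += 1
    return n

def strong_password(bits):
    """Rejection-sample until the outcome has probability <= 2**-bits."""
    draws = 0
    while True:
        draws += 1
        n = draw()
        if n + 1 >= bits:  # P(outcome) = 2**-(n+1) <= 2**-bits
            return n, draws

# strong_password(10) takes ~500 draws on average; the cost grows like 2**bits.
```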
I do agree that your trick works in all reasonable cases, whenever it’s not hard to reach different possible passwords.
I think there’s an important thing to note, if it doesn’t already feel obvious: the concept of instrumental convergence applies to roughly anything that exhibits consequentialist behaviour, i.e. anything that does something like backchaining in its thinking.
Here’s my attempt at a poor intuitionistic proof:
If you have some kind of program that understands consequences or backchains or etc, then perhaps it’s capable of recognizing that “acquire lots of power” will then let it choose from a much larger set of possibilities. Regardless of the details of how “decisions” are made, it seems easy for the choice to be one of the massive array of outcomes possible once you have control of the light-cone, made possible by acquiring power. And thus I’m worried about “instrumental convergence”.
---
At this point, I’m already much more worried about instrumental convergence, because backchaining feels damn useful. It’s the sort of thing I’d expect most competent mind-like programs to be using in some form somewhere. It certainly seems more plausible to me that a random mind does backchaining, than a random mind looks like “utility function over here” and “maximizer over there”.
(For instance, even setting aside how AI researchers are literally building backchaining/planning into RL agents, one might expect most powerful reinforcement learners to benefit a lot from being able to reason in a consequentialist way about actions. If you can’t literally solve your domain with a lookup table, then causality and counterfactuals let you learn more from data, and better optimize your reward signal.)
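To illustrate, here’s a toy model I made up (hand-written actions, nothing learned):

```python
# A tiny backward-chaining planner over a made-up action graph. No utility
# function or maximizer is written down anywhere, yet plans for very different
# goals all route through the same "get resources" step.

# action name -> (precondition, effect); all names are illustrative only
ACTIONS = {
    "work a job":      (None, "have money"),
    "get resources":   ("have money", "have resources"),
    "build telescope": ("have resources", "see the stars"),
    "build rocket":    ("have resources", "reach the moon"),
    "fund research":   ("have resources", "cure a disease"),
}

def backchain(goal):
    """Chain backwards from a desired outcome to a sequence of actions."""
    for name, (precondition, effect) in ACTIONS.items():
        if effect == goal:
            prefix = [] if precondition is None else backchain(precondition)
            return prefix + [name]
    return []

for goal in ["see the stars", "reach the moon", "cure a disease"]:
    print(goal, "<-", backchain(goal))
# Every plan passes through "get resources": instrumental convergence in miniature.
```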
---
Finally, I should point at some relevant thinking around how consequentialists probably dominate the universal prior. (Meaning: if you do an AIXI-like random search over programs, you get back mostly consequentialists.) See this post from Paul, and a small discussion on agentfoundations.