Yeah, thanks for remembering me! You can also posit that the agent is omniscient from the start, so it did not change its policy due to learning. This argument proves that an agent cannot be corrigible and a maximizer of the same expected utility function of world states across multiple shutdowns. But it still leaves open the possibility of an agent that is corrigible while rewriting its utility function after every correction.
Another problem is that the system cannot represent and communicate the whole predicted future history of the universe to us. It has to choose some compact description. And the description can get a high evaluation either for being a genuinely good plan, or for neglecting to predict or mention bad outcomes and using persuasive language (if it’s a natural-language description).
Maybe we can have the human also report their happiness daily, and have the make_readable subroutine rewarded solely for how well the plan evaluation given beforehand matches the happiness levels reported afterwards? I don’t think that solves the problem of delayed negative consequences, or bad consequences the human will never learn about, or wireheading the human while using misleading descriptions of what’s happening, though.
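As a minimal sketch of what I mean (the function name, the 0-to-1 happiness scale, and the scoring rule are just my own illustrative assumptions):

```python
# Reward the make_readable subroutine only for how closely the plan evaluation
# it gave beforehand tracks the happiness the human reports afterwards.

def score_make_readable(predicted_eval: float, daily_happiness: list[float]) -> float:
    """Higher is better: negative mean squared gap between the pre-hoc
    plan evaluation and each day's reported happiness (both on a 0-1 scale)."""
    gaps = [(predicted_eval - h) ** 2 for h in daily_happiness]
    return -sum(gaps) / len(gaps)

# A plan sold as 0.9 but lived as ~0.3 scores badly; an honest 0.35 scores well.
print(score_make_readable(0.9, [0.3, 0.4, 0.2]))   # about -0.37
print(score_make_readable(0.35, [0.3, 0.4, 0.2]))  # about -0.009
```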
We desperately need
Wait, didn’t this post just make a case that older people don’t keep up with new technology because they don’t feel they need it?
New apps appeal to me less and less often. Sometimes something does look fun, like video editing, but the learning curve is so steep and I don’t need to make an Eye of The Tiger style training montage of my friends’ baby learning to buckle his car seat that badly, so I pass it by and focus on the millions of things I want to do that don’t require learning a new technical skill.
Doesn’t sound to me like you desperately need that app :)
So, you and your team spent six years of effort working full time for no pay (what did you even eat then?). You developed a product that worked just great, was in demand and could make a difference in fighting obesity by making a beep whenever the wearer eats. But even though the product was ready—“just put it on and good to go”—and you can easily reconstruct it, you and your whole team decided to abandon it and part ways. Because you simply aren’t that into diabetes prevention, and also your time is limited and you have more important things to do. But you would enthusiastically do part-time contract work on this project again.
I feel like this story doesn’t quite make sense. If the company was doing so well and you just didn’t want to run it anymore, why didn’t you sell it?
I looked up the source of Putin’s claims that NATO promised not to expand, and they don’t stand up to scrutiny. Putin cites the speech of NATO Secretary General Woerner in Brussels on 17 May 1990, during negotiations about NATO deployment in Germany. Here is the quote in context:
Our strategy and our Alliance are exclusively defensive. [...] This will also be true of a united Germany in NATO. The very fact that we are ready not to deploy NATO troops beyond the territory of the Federal Republic gives the Soviet Union firm security guarantees. Moreover we could conceive of a transitional period during which a reduced number of Soviet forces could remain stationed in the present-day GDR. This will meet Soviet concerns about not changing the overall East-West strategic balance.
It is clear that he is speaking about not deploying NATO troops on the territory of the former GDR, not about a broader commitment not to enlarge NATO. Gorbachev himself confirms that “the topic of NATO expansion was not discussed at all”. So this is just another of Putin’s lies.
Assuming your AI has a correct ungameable understanding of what a “human” is, it could do things such as:
genetic engineering. People with Williams-Beuren syndrome are pathologically trusting, and a decrease in intelligence makes you less likely to uncover any lies.
drug you. The hormone oxytocin can increase trust, and some drugs like alcohol or LSD have been shown to increase suggestibility, which may also mean they make you more trusting.
Once you have uncovered the AI’s lies, it can erase your memories by damaging a part of your brain.
Or maybe it can give you anterograde amnesia somehow; then you’re less likely to form troublesome memories in the first place.
If the AI cannot immediately recover lost trust through the methods above, it may want to isolate mistrustful people from the rest of society.
and make sure to convince everyone that if they lose faith in AI, they’ll go to hell. Maybe actually make it happen.
And, this is a minor point, but I think you are severely overestimating the effect of uncovering a lie on people’s trust. In my experience, people’s typical reaction to discovering that their favorite leader lied is to keep going as usual. For example:
[Politics warning, political examples follow]:
In his 2013 election campaign, Navalny claimed that migrants commit 50% of the crimes in Moscow, contradicting both common sense (in 2013, around 8-17% of the Moscow population were migrants) and official crime statistics, which say that migrants and stateless people committed 25% of crimes. Many liberal Russians recognise it as a lie but keep supporting Navalny, arguing that he has since changed and abandoned his chauvinist views. Navalny has not made any such statement.
Some of Putin’s supporters say things like “So what if he rigged the election? He would’ve won even without rigging anyway” or “For the sacred mission [of invading Ukraine], the whole country will lie!”.
Once people have decided that you’re “on their side”, they will often ignore all evidence that you’re evil.
Prediction: Ukraine is going to win the war. (60%).
they might have to avoid explicitly explaining that to avoid rudeness and bad PR
Well, I don’t think that is the thing to worry about. Eliezer having high standards would be no news to me, but if I learn about MIRI being dishonest for PR reasons a second time, I am probably going to lose all the trust I have left.
Ukraine update 06/03/2022
Sabotage challenge
When I lived in Russia, I would occasionally go to a protest rally and get detained. One day I was sitting in a police station, waiting for my 3 hours to pass, and no one was paying attention to me. So I decided to see what would happen if I just got up and walked out. I walked out, and no one stopped me. I tried that 4 times and it worked 3 times. No one bothers to guard detained protesters because no one bothers to try to escape.
I don’t know why everyone suddenly decided that the alignment problem is “impossible”; Eliezer just said that it’s hard and he wants more help, preferably soon. It was a call to action, not a call to give up.
By thinking about it yourself or donating to some org. I agree that it’s hard to know if you’re helping, and I don’t advocate donating if an org looks 99% likely to be scammy or useless.
Well, I agree, and thanks for the links! I don’t want to ask how to sabotage real countries, though, answering that would be illegal in some countries and I don’t want to accidentally get anyone in trouble.
Because there is a video of the shelling.
Russian media confirms that there was fighting and a fire, though they blame the fire on Ukrainians.
I guess it could be the Ukrainians shelling their own NPP, but it seems much more likely that Russian troops did it: they were the ones attacking, so they were more likely to have artillery in place, and the Ukrainians would have had to fire at their own people to shell it.
Thanks, I missed that, I’ll update.
I agree.
Reach out to the authors of any paper you liked for a call
I didn’t know one could do that. Do people really just agree to a call with a stranger?
When I see a photo like this, my instinctual interpretation is that this is a collaboration between Olga and the troops behind her to generate the photo opportunity.
The cops are just standing in a line at a protest rally according to their orders, and she is taking a photo in front of them. No collaboration required. They’re not all mobbing to arrest her, of course. If someone is coming to arrest her, it will be a group of 2-4 cops.
Is the question “Which occupations do you identify with?” asking where I work, or about my identity? I might not identify with my occupation.
So, let me try to summarize and check my understanding. In the first part of the post, you show that most random reward functions are not corrigible. This looks correct.
In the second part, you want to prove that VNM-coherence is incompatible with corrigibility in this universe, and I don’t think I follow. So, suppose that R(A_blue), R(B_blue), R(C_blue) > max(R(A_red), R(B_red), R(C_red)). Now we change the dynamics so that the human will not correct the agent by default, but can be manipulated into it. Then we need to add states A_black and C_black, and arrows from B_black to itself, A_black and C_black, to denote the states where the agent is neither corrected nor disables its shutdown behaviour. (Otherwise the agent only misbehaves because we left it no other option.) Suppose we also have some initial reward R_init, which correction changes into reward R_new. Then we can combine these utilities like in Koen’s Corrigibility with Utility Preservation: give the agent reward V_init(black_node) - V_new(blue_node) as soon as it gets corrected, where V_x stands for the maximum achievable reward under R_x. Also, let’s assign a reward R_red < min(R(A_black), R(B_black), R(C_black)) to red states, to make sure disabling the button is disincentivized. Then the agent is not incentivized (although also not disincentivized) to manipulate the human (as long as R_init by itself did not incentivize manipulation), and also not incentivized to disable its shutdown behaviour. It values the corrected and uncorrected states equally, and both more highly than the incorrigible (button-disabled) states.
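To spell out the arithmetic of that balancing term (the numbers below are mine, chosen only to show the indifference):

```python
# Toy illustration of the balancing-term construction described above.
# "black" = stays uncorrected, "blue" = corrected, "red" = shutdown button disabled.

V_init_black = 10.0  # best value still reachable under R_init if never corrected
V_new_blue = 4.0     # best value reachable under R_new after being corrected

balancing_term = V_init_black - V_new_blue  # paid out at the moment of correction

value_if_uncorrected = V_init_black
value_if_corrected = V_new_blue + balancing_term  # equals V_init_black
value_if_button_disabled = 3.0  # R_red is chosen below the rewards of the black states

assert value_if_corrected == value_if_uncorrected       # indifferent to being corrected
assert value_if_button_disabled < value_if_uncorrected  # disabling the button is dominated
```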
I am not claiming that the utility indifference approach is without problems, of course, only that it seems to work in this toy universe. Or what am I missing?
I do think the conclusion of your argument is correct. Suppose the human is going to change his mind on his own and decide to correct the agent at timestep = 2, but the agent can also manipulate the human into it and erase the memory of the manipulation at timestep = 1, so that the end results are exactly the same. A consequentialist agent should therefore evaluate both policies as equally good. So it chooses between them randomly and sometimes ends up being manipulative. But a corrigible agent should not manipulate the human.
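A tiny sketch of why a utility over final world states cannot see the difference (the states and the utility function are hypothetical, just for illustration):

```python
# Both policies end in the same world state, because the memory of the manipulation is erased.
final_state_honest = "agent corrected at t=2, human recalls deciding freely"
final_state_manipulated = "agent corrected at t=2, human recalls deciding freely"

def utility_over_world_states(state: str) -> float:
    # Any function of the final state alone gives both trajectories the same score.
    return 1.0 if "agent corrected" in state else 0.0

assert utility_over_world_states(final_state_honest) == utility_over_world_states(final_state_manipulated)
# So a maximizer of this utility is free to break the tie either way and may pick
# the manipulative policy, while a corrigible agent should never pick it.
```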