The problem I have and wish to solve is, of course, the accurséd Akrasia that stops me from working on AI safety.
Let’s begin with the easy ones:
1 Stop doing this babble challenge early and go try to solve AI safety.
2 Stop doing this babble challenge early (at 11 pm, specifically) and immediately sleep, in order to be better able to solve AI safety tomorrow.
In fact, sleep generally seems to be a problem: I spend 10 hours doing it every day (time that could be spent solving AI safety), and if I fall short I am tired. No good! So, working on this instrumental goal:
3 Get blackout curtains to improve sleep quality
4 Get sleep mask to improve sleep quality
5 Get better mattress to improve sleep quality
6 Find a beverage with more caffeine to reduce the need for sleep
7 Order modafinil online to reduce the need for sleep
And heck while we’re on the topic of stimulants
8 Order adderall online or from a friend to increase ability to focus
9 Look up good nootropics stacks to improve cognitive ability and hence ability to do AI safety
Now another constraint when doing AI safety is that I don’t have a good shovel-ready list of things to try, and it’s easy for me to get distracted if I can’t just pick something from the task list.
10 Check if Complice solves this problem
11 Check if some ordinary getting-things-done system (that I can stick into Roam) solves this problem
12 Make a giant checklist and go down this list
13 Make a personal kanban board of things that would be nice for solving AI safety
And instrumentally useful for creating these task lists?
14 Ask friends who know about AI safety for things to do
15 Apophatically ask for suggestions for things to do via an entry on a list of 50 items for a lesswrong babble challenge
Anyway, I digress. I’m here to solve akrasia, not make a checklist. Unless I need more items on this list, in which case I will go back to checklist construction. Is this pruning? Never mind. Back to the point:
16 Set up some desktop shortcut macro thing in order to automatically start pomodoros when I open my laptop
17 Track time spent doing things useful to AI safety on a spreadsheet
18 Hey, I said “laptop”! Get a better mouse to make using the laptop more fun so I’m more likely to do hard things when using it
19 Get a better desk for more space for notes and to require less expensive shifting into/out of AI safety mode
20 On notes, use the index cards I have to make a proper zettelkasten as a cognitive aid
(Does this solve akrasia? Well, if I have better cognitive aids, then doing cognitively expensive things is easier, so I’m less likely to fail even with my current levels of willpower)
21 Start doing accountability things like promising to review a paper every X time period
22 I said levels of willpower—Google for interventions that increase conscientiousness (there’s gotta be some dodgy big-5 based things) and do those?
Back to the top of the tree
23 Quit my job because it’s using up energy that I could be using to do AI safety
24 Instead of doing my job, pretend to do my job while actually doing AI safety
25 Set up an AI safety screen on work laptop so it’s easy to switch over to doing AI safety during breaks or lunches
Hey, I said lunch
26 Use nutritionally complete meal replacements to save time/willpower that would be spent on food preparation
27 Use nutritionally complete meal replacements to ensure that nutrient intake keeps me in top physical form
28 Exercise (this improves everything, apparently) by running on a treadmill
29 By lifting weights
30 By jogging in a large circle
31 Become a monk and live an austere lifestyle without the distractions of rich food, wine, and lust
32 Become an anti-monk and live a rich lifestyle to ensure that no willpower is wasted on distractions
33 Specifically, on the vice front, use nicotine as a performance-enhancing stimulant by smoking. Back to stimulants again I guess
34 … or by using nicotine patches or gum or something
35 By using nicotine only if I do AI safety things, in order to develop an addiction to AI safety
Hey, develop an addiction to doing AI safety! People go to serious lengths for addictions, so why not gate it on math?
36 Do so with something very addictive, like opioids
37 Use electric shocks to do classical conditioning
etc. There was a short sci-fi story about this kind of thing; let me see if I can find it. Hey, actually, since I said sci-fi, and this is a babble challenge:
38 Promise very hard to time travel back to this exact point in time, meet future self, receive advice
(They’re not here :( Oh well) Back on that akrasia-solving:
39 Make up a far-future person who I am specifically working to save (they’re called Dub See Wun). Get invested in their internal life (they want to make their own star!). Feel an emotional connection to them. I’m doing it for them!
40 Specifically put up a “do it for them” poster modelled off the one in the Simpsons
41 DuckDuckGo “how to beat akrasia” and do the top suggestion
42 Adopt strategic probably false beliefs (the world will end in 1 year!! :0) in order to encourage a more aggressive search for strategies
“Aggressive search for strategies” is the virtue that the Sequences call “actually trying”, so in the Sequences-sphere:
43 Go to a CFAR workshop, which I heard might be kind of useful towards this sort of thing
44 Or just read the CFAR booklet and apply the wisdom found in there
45 Or some sequence on Lesswrong with exercises that applies some CFARy wisdom
Of course all this willpower boosting and efficiency and stuff wouldn’t help if I was just doing the wrong thing faster (like that one Shen comic, you know the one). So:
46 Consider how much of what I think is working on AI safety is actually just self-actualisy math/CS stuff, throw that out, and actually try to solve the problem
47 Deliberately create and encourage a subagent in my mind that wants to do AI safety (call em Dub See Wun)
48 Adopt strategic infohazards in order to encourage a more focused and aggressive search for strategies
49 Post a lot about AI safety in public forums like Lesswrong so that I feel compelled to do AI safety in my private life in order to maintain the illusion that I’m some kind of AI-safety-doing-person
50 Stop doing this babble challenge at the correct time, and continue to do AI safety or sleep as in 1) or 2). Hey, this one seems good. Think I might try it now!
This means you can build an action that says something like “if I am observable, then I am not observable. If I am not observable, I am observable” because the swapping doesn’t work properly.
Constructing this more explicitly: Suppose that $a_1e \in S$ and $a_2e \in W \setminus S$. Then $\mathrm{if}(S, a_2, a_1)$ must be empty. This is because for any action $a_3$ in the set $\mathrm{if}(S, a_2, a_1)$, if $a_3e$ were in $S$ then it would have to equal $a_2e$, which is not in $S$, and if $a_3e$ were not in $S$ it would have to equal $a_1e$, which is in $S$.
Since $\mathrm{if}(S, a_2, a_1)$ is empty, $S$ is not observable.
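The emptiness argument is easy to check by brute force on a tiny finite frame. A minimal sketch (my own toy construction, using the characterization of $\mathrm{if}$ from the proof above; the `dot` table is a stand-in for $a_3e$):

```python
from itertools import product

# Toy frame: 3 actions, 2 environments, outcomes in W = {w1, w2, w3}.
A = range(3)
E = range(2)
dot = [["w1", "w2"],   # dot[a][e] is the outcome of action a in environment e
       ["w3", "w1"],
       ["w2", "w3"]]
S = {"w1"}             # candidate observable set

def if_set(S, a2, a1):
    """Actions whose outcome matches a2's when it lands in S,
    and a1's when it lands outside S, for every environment."""
    return [a3 for a3 in A
            if all((dot[a3][e] == dot[a2][e]) if dot[a3][e] in S
                   else (dot[a3][e] == dot[a1][e]) for e in E)]

# Whenever some e has a1's outcome in S and a2's outside S,
# the conditional set is empty, as the argument above says.
for a1, a2, e in product(A, A, E):
    if dot[a1][e] in S and dot[a2][e] not in S:
        assert if_set(S, a2, a1) == []
```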
Because the best part of a sporting event is the betting, I ask Metaculus: [Short-Fuse] Will AbstractSpyTreeBot win the Darwin Game on Lesswrong?
How does your CooperateBot work (if you want to share?). Mine is OscillatingTwoThreeBot which IIRC cooperates in the dumbest possible way by outputting the fixed string “2323232323...”.
I have two questions on Metaculus that compare how good elements of a pair of cryonics techniques are: preservation by Alcor vs preservation by CI, and preservation using fixatives vs preservation without fixatives. They are forecasts of the value (% of people preserved with technique A who are revived by 2200)/(% of people preserved with technique B who are revived by 2200), which barring weird things happening with identity is the likelihood ratio of someone waking up if you learn that they’ve been preserved with one technique vs the other.
Interpreting these predictions in a way that’s directly useful requires some extra work—you need some model for turning the ratio P(revival|technique A)/P(revival|technique B) into plain P(revival|technique X), which is the thing you care about when deciding how much to pay for a cryopreservation.
One toy model is to assume that one technique works (P(revival) = x), but the other technique may be flawed (P(revival) < x). If r < 1, it’s the technique in the numerator that’s flawed, and if r > 1, it’s the technique in the denominator that’s flawed. This is what I guess is behind the trimodality in the Metaculus community median: there are peaks at the high end, the low end, and at exactly 1, perhaps corresponding to one working, the other working, and both working.
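As a minimal sketch of that toy model (my own stand-in numbers, not the actual Ergo/Colab notebook), with a made-up trimodal ratio distribution in place of the real community one:

```python
import numpy as np

# r stands in for the community distribution over the ratio
# P(revival | technique A) / P(revival | technique B):
# peaks at the low end (A flawed), at exactly 1 (both work),
# and at the high end (B flawed).
rng = np.random.default_rng(0)
r = np.concatenate([
    rng.uniform(0.05, 0.5, 30_000),
    np.full(40_000, 1.0),
    rng.uniform(2.0, 20.0, 30_000),
])

# Normalize the working technique to P(revival) = 100%.
p_a = np.minimum(r, 1.0)        # P(revival | preserved with A)
p_b = np.minimum(1.0 / r, 1.0)  # P(revival | preserved with B)

print(f"EV(revival | A) = {p_a.mean():.0%}")
print(f"EV(revival | B) = {p_b.mean():.0%}")
```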
For the current community medians, using that model with the Ergo library, normalizing the working technique to 100%, I find:
Alcor vs CI:
EV(Preserved with Alcor) = 52%
EV(Preserved with Cryonics Institute) = 79%
Fixatives vs non-Fixatives:
EV(Preserved using Fixatives) = 73%
EV(Preserved without using Fixatives) = 52%
(here’s the Colab notebook)
The annotations that some other people have put on their lists to show their thinking process, as well as the lists of assumptions at the start, have been interesting. I haven’t done that this time, but it seems like something worth trying next time.
Keep it in my pocket the whole time.
Locked safe down the Marianas trench.
Am I a time traveller? Is that how I know? If so, hide it in dinosaur times, long before the evil forces lived.
Or hide it in the far future, long after the evil forces lived.
Send it into orbit.
Land it on the moon. Can’t quite think of a way to achieve this, though. Any ideas?
Bury it in a geologically stable location and dig it up later as if it were nuclear waste.
Hide it in a gangster’s treasure box hidden under some foliage, a la 20200.
Start a pen manufacturing company and create many, many identical pens. They won’t be able to tell which one it is.
Eat the pen. Repeatedly, each time it passes through. For 50 years.
Find the guy with 10 years’ worth of energy. Lock them in a room. Offer them their freedom if and only if they vow to protect the pen.
Surgically implant the pen under my skin (hope it’s not made of biologically active materials).
Hidden safe in the walls of the house.
Hidden safe in the attic of the house.
Swiss bank vault (we had those in 1855, right?).
Inside a bottle of wine that will be aged to become a 50-year vintage in 1905.
Write a book on effective altruism (using the pen, of course) - there are probably some good cause areas around in 1855 to use as examples. They will read it, and cease to be evil, thus removing their motivation to acquire the pen.
Give Babbage some pointers on making his difference engine not suck, beginning an early steampunk cybersingularity, and ask the Great Brass Mind how to hide the pen.
Give the pen to my well-connected close friend, [famous person who lived in 1855], providing them with the same evidence I used to find that Einstein would need it.
Select, completely randomly, a point on the surface of the Earth. Bury it under a small amount of earth. Security through obscurity!
Replace each component of the pen, one at a time, until you have two pens: the old pen, and a new pen that’s atom-for-atom identical to the original pen. Let the evil forces find the new pen.
Create a replica of the first pen and let the evil forces find it, so that they stop looking.
Bribe every grunt of the evil forces who comes looking for your pen.
Like 10), but the other end; at that point they won’t want to find it, even if they know where it is.
Find Einstein’s parents. Offer them this treasured family heirloom. They will keep it safe and Einstein will inherit it.
Paint the pen black and put it in a soot-filled chimney.
Find Oliver Twist and Fagin, or some other group of Victorian urchins, who are ubiquitous in this age. Hire Fagin’s street urchins to come up with and then red-team test 50-year security plans for the pen.
Become a miserly industrialist, refusing even to give my workers a day off for Christmas. When three ghosts come to visit, use information from the Ghost of Christmas Future to divine the manner in which the evil forces retrieve the pen, and make countermeasures.
All of these plans have some chance of failing, so evidently I can tolerate some risk. Hence, bet my money at very, very long odds—in the small sliver of timelines in which I succeed, use my money to buy out the evil forces entirely.
Call my friends at the time commission for backup. C’mon, we can’t just forget about protocol here.
Go on an expedition to the Arctic and hide it in the inhospitable ice; I could probably talk some guys in pith helmets into giving me backup.
Or to the deepest jungles of the dark continent of Africa; likewise with the pith helmets.
Or to the source of the Nile.
Or to the summit of Mt. Everest or K2 or whatever’s going to be most awkward for the evil forces.
Or to the Antarctic, which is colder than the Arctic in the middle part.
Or to the deserts of Australia.
Found a cult of Defending the Pen, perhaps using song lyrics from the future as substitute mystical wisdom.
Ask the longer-haired, wiser, and older version of myself who just gave me this quest for advice, since they’re still standing there. Follow their advice.
Bury the pen deep in a coal mine.
Keep your head down and don’t tell anyone that it’s -you- who has the pen—it’s not like the evil forces have any reason to suspect that, unless you give them a good reason to, like bootstrapping the world to nanotech using future knowledge or something. Haha. Heh.
Hide the pen under my top hat; since it’s 1855, that won’t look unusual.
Dismantle the pen and hide the seven components throughout the world using techniques described above and below; being smaller, they’ll be harder to find.
Join the evil forces as a simple masked minion; working for them, they won’t suspect you have the pen, until one day as the second-in-command you usurp the leader (as is tradition).
Message in a bottle to North Sentinel Island, whose inhabitants will repel outsiders, including the evil forces.
Give a speech that’s something like “evil forces, you really want to mess with me? I can leap to the moon in a single bound, and that’s just to save me pulling it to ground, which I can also do. You once tried to trap me in a room and I took down your mothership’s entire network before tearing it to shreds. This planet, and this pen specifically, is under my protection. Return to your galaxy,” probably with dramatic orchestral music playing in the background, and then the evil forces will leave.
Check your Messing-with-Time-Wongle, standard issue equipment for all time travellers with missions to defend artefacts that are important to the timeline. Notice that the LED on it flashes green. Precommit to only sending a “green” signal to your MwTW in 50 years if the pen reaches Einstein successfully. Now Time will bend to ensure the pen is not found.
Freeze the pen in liquid nitrogen. It will now be too cold for the evil forces to touch.
The evil forces that I’m leader of, remember. Obviously my disloyal second-in-command will take umbrage if I seem not to be looking for the pen at all—I’m fairly sure they’re a time traveller here to prevent Einstein from laying the physics foundations for the nuclear weapons that will destroy the world in the mid-20th century or something like that, and they keep scribbling notes on this list of about 50 items—but I can still direct them to the wrong place for 50 years. Hey, I think I saw the pen-keeper go into the middle of the Antarctic to launch a rocket!
Bury the pen in a large heap of explosives that only I know how to disarm—WWII mines are still dangerous so them being stable for 50 years should work.
Tie the pen to my ankle, everywhere I go—the traditional mores of the 19th century would make it scandalous for the evil forces to retrieve it from there!
Melt down the pen into a block of ordinary looking gunk. Remake the pen when needed years later.
The added resource constraints (I don’t have a space elevator with me in the room… yet) made this a bit more difficult, which is very nice.
Ask someone for help via the phone
Punch through the door
Unlock the door, go through it
Punch through the wall
Punch through the window
Unlock the window, go through it
Wait for someone to help
Wait for the room to be demolished
Climb up through the ceiling...
...or through one of the missing walls (does it still count as a room?)
Create a series of Lesswrong posts disguised as babble exercises to try to come up with a way out of this room; use the best suggestion
Wait for a friendly GPT-derived AGI to rescue you (admittedly a longshot)
Quantum tunnel out of the room (rare but possible)
Release all of the energy stored in your body in a single burst to destroy walls (10 years! That’s a lot!)
Release all of the energy stored in the phone’s battery in a single burst to destroy walls
Use friction from rubbing clothes against wall to wear through it
Hang self with clothes (morbid, but “I” am no longer in the room)
Wait ten years, starve to death (don’t worry; the GPT-derived AGI can read off my brain structures and revive me later)
Lifelog very accurately online via the phone; have myself be reconstructed outside of the room
I am already outside of the room, 10^10^100 light years away. No problem.
Release all energy stored in body in a single burst to jump through the ceiling and several miles into the sky—this might also allow me to bring a small object to the moon
Punch through the wall, but using phone to protect hands
Punch through the wall using shirt wrapped around to protect hands
Use the power armour that I am wearing as clothes to dismantle room
Wait sufficiently long that my personality is different enough that I am not in the room
Escape mentally via escapism (with help of phone games?)
Use my cool utility-fog based sci-fi clothes to convert wall into nanobots
Redefine “inside” as “outside”, like that SCP that lets you do that
Is this a real room, or a metaphorical “you” video game character? Type the console command to teleport out.
Ask the server admin to teleport me out.
Ask the real life server admin of the simulation we are embedded in to teleport me out (Elon Musk does this with Tesla stock prices)
Tap on the wall of the room to send a Morse code message asking for help.
Use phone’s wifi to connect to the door’s bluetooth and unlock it via the app.
Run at the door really hard.
The phone is a Nokia. Drop it on the ground and the room crumbles.
The phone is that Samsung phone with batteries that catch fire (with 10 years of charge, that might be bad news for me?). Do so, then use the automatic door unlocking (that happens as a fire safety measure) to leave the room.
Pull off a bit of the phone’s casing and use it as a lockpick.
The phone is that iPhone that can bend easily. Bend it into a shape that can prise the door open. Exit through door.
As above, but prise the window open. Exit through window.
Stop imagining the room.
Use lucid dream powers to escape the room.
Go to sleep and dream of a different place
Grow large enough to break through the room’s walls
The walls are made of air so I can walk through them.
The walls are made of antimatter and annihilate with the surrounding environment.
The walls are made of ice and will melt soon.
Rub together two stick-like objects (derived from my phone, probably) to start a fire, as fire safety measure the door unlocks, etc
Do the five movements to travel to another dimension where we are not trapped
Hack the wi-fi. Since I’m an expert hacker, my captors will have to recruit me in order to fix their wi-fi. As they open the door, slip past them.
The room is completely empty. The air pressure outside causes the walls to immediately buckle and break.
About halfway through I forgot that I was only meant to be bringing something to the moon rather than having to visit it myself, and some of my items are very broad (the first one could make up a whole list in itself).
This was very fun!
jump really, really hard
accelerate the spin of the earth until it falls apart
decelerate the orbit of the moon until it falls, by flying comets past it
or by painting one side of the moon black
or by using a giant rocket
or by detonating enough antimatter weaponry
flap your arms, again really, really hard
shine a torch at the moon (photons reach there)
start in space and use an ion drive
project orion nuclear bomb detonated below you
program an AGI and ask the AGI how to get to the moon
build a very tall ladder
wings made of wax
throw it really, really hard
spin around and let go
stand under an asteroid strike and join the ejecta
wait for quantum fluctuations to teleport you there
wait for random gravitational solar system perturbations to bring the moon to you
wait for another civilisation to bring you to the moon
time travel to before Theia hit and join the original moon
add mass to the moon until it becomes the planet and you are on the moon
find the space rocks the apollo astronauts brought back and stand on them
project orion but with fusion
project orion but with antimatter
trigger false vacuum collapse with particle accelerator and use new physics to develop as yet unknowable way of travelling to moon
bird with a spacesuit
submarine with reactionless thruster inside
perpetual motion machine
buy a ticket on musk’s starship
invest in dogecoin, use billions from dogecoin to start space program
stand above a supervolcano and hope ejecta takes you high enough
run very very fast reaching orbital velocity
very long space elevator reaching down from moon
very very long space elevator reaching down from mars
create microscopic black hole and use gravitational slingshot
carefully warp space to make a staircase built from the metric
make a normal staircase
very, very fast bicycle with a ramp
add mass to moon until gravitational tide from moon lifts you from the surface of the earth
deorbit the earth-moon system into the sun and join it in the molten iron in the sun’s core
apollo 11 mission
The statement of the law of logical causality is:
Law of Logical Causality: If conditioning on any event changes the probability an agent assigns to its own action, that event must be treated as causally downstream.
If I’m interpreting things correctly, this is just because anything that’s upstream gets screened off, because the agent knows what action it’s going to take.
You say that LICDT pays the blackmail in XOR blackmail because it follows this law of logical causality. Is this because, conditioned on the letter being sent, if there is a disaster the agent assigns p=0 to sending money, and if there isn’t a disaster the agent assigns p=1 to sending money, so the disaster must be causally downstream of the decision to send money if the agent is to know whether or not it sends money?
I didn’t find the conclusion about the smoke-lovers and non-smoke-lovers obvious in the EDT case at first glance, so I added in some numbers and ran through the calculations that the robots will do to see for myself and get a better handle on what not being able to introspect but still gaining evidence about your utility function actually looks like.
Suppose that, out of the $N$ robots that have ever been built, $nN$ are smoke-lovers and $(1-n)N$ are non-smoke-lovers. Suppose also the smoke-lovers end up smoking with probability $p$ and non-smoke-lovers end up smoking with probability $q$.
Then $(pn+q(1-n))N$ robots smoke, and $((1-p)n+(1-q)(1-n))N$ robots don’t smoke. So by Bayes’ theorem, if a robot smokes, there is a $\frac{pn}{pn+q(1-n)}$ chance that it’s killed, and if a robot doesn’t smoke, there’s a $\frac{(1-p)n}{1-(pn+q(1-n))}$ chance that it’s killed.
Hence, the expected utilities are:
An EDT non-smoke-lover looks at the possibilities. It sees that if it smokes, it expects to get $-101\frac{pn}{pn+q(1-n)} - 1\left(1 - \frac{pn}{pn+q(1-n)}\right)$ utilons, and that if it doesn’t smoke, it expects to get $-100\frac{(1-p)n}{1-(pn+q(1-n))}$ utilons.
An EDT smoke-lover looks at the possibilities. It sees that if it smokes, it expects to get $-90\frac{pn}{pn+q(1-n)} + 10\left(1 - \frac{pn}{pn+q(1-n)}\right)$ utilons, and if it doesn’t smoke, it expects to get $-100\frac{(1-p)n}{1-(pn+q(1-n))}$ utilons.
Now consider some equilibria. Suppose that no non-smoke-lovers smoke, but some smoke-lovers smoke. So $q = \varepsilon$ and $p \gg \varepsilon$. So (taking limits as $\varepsilon \to 0$ along the way):
non-smoke-lovers expect to get $-101$ utilons if they smoke, and $-100\frac{n-pn}{1-pn}$ utilons if they don’t smoke. Since $n < 1$, non-smoke-lovers will choose not to smoke.
smoke-lovers expect to get $-90$ utilons if they smoke, and $-100\frac{n-pn}{1-pn}$ utilons if they don’t smoke. Smoke-lovers would be indifferent between the two if $p = 10 - \frac{9}{n}$. This works fine if at least 90% of robots are smoke-lovers, and equilibrium is achieved.
But wait! If fewer than 90% of robots are smoke-lovers, there is no point at which they would be indifferent: they would always choose not to smoke, which is inconsistent with the assumption that $p$ is much larger than $\varepsilon$. So instead suppose that $p$ is only a little bit bigger than $\varepsilon = q$, say $p = k\varepsilon$. Then:
non-smoke-lovers expect to get $-100\left(\frac{k}{1+(k-1)n} + \frac{1}{100n}\right)n$ utilons if they smoke, and $-100n$ utilons if they don’t smoke. They will choose to smoke if $k < 1 - \frac{1}{101n - 100n^2}$, i.e. if smoke-lovers smoke so rarely that not smoking would make them believe they’re a smoke-lover about to be killed by the blade runner.
smoke-lovers expect to get $-100\left(\frac{k}{1+(k-1)n} - \frac{1}{10n}\right)n$ utilons if they smoke, and $-100n$ utilons if they don’t smoke. They are indifferent between these two when $k = 1 + \frac{1}{9n - 10n^2}$. This means that, when $k$ is at the equilibrium point, non-smoke-lovers will not choose to smoke when fewer than 90% of robots are smoke-lovers, which is exactly when this regime applies.
I wrote a quick python simulation to check these conclusions, and it was the case that $p = 10 - \frac{9}{n}$ for $0.9 < n < 1$, and $p = \left(1 + \frac{1}{9n - 10n^2}\right)\varepsilon$ for $0 < n < 0.9$ there as well.
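For anyone who wants to poke at the numbers, here’s a minimal re-derivation script (my own, not the original simulation) checking the two indifference points:

```python
import numpy as np

eps = 1e-9  # the epsilon above

def evs(p, q, n):
    """EDT smoke-lover expected utilities (smoke, don't smoke)."""
    killed_smoke = p * n / (p * n + q * (1 - n))
    killed_not = (1 - p) * n / (1 - (p * n + q * (1 - n)))
    return -90 * killed_smoke + 10 * (1 - killed_smoke), -100 * killed_not

# n > 0.9: indifference at p = 10 - 9/n (with q -> 0).
for n in [0.92, 0.99]:
    print(n, *evs(10 - 9 / n, eps, n))   # the two EVs agree

# n < 0.9: indifference at p = k * eps with k = 1 + 1/(9n - 10n^2).
for n in [0.3, 0.6]:
    k = 1 + 1 / (9 * n - 10 * n**2)
    print(n, *evs(k * eps, eps, n))      # the two EVs agree
```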
Your reliable thermometer doesn’t need to be well-calibrated—it only has to show the same value whenever it’s used to measure boiling water, regardless of what that value is. So the dependence isn’t quite so circular, thankfully.
So the definition of myopia given in Defining Myopia was quite similar to my expansion in the But Wait There’s More section; you can roughly match them up by saying $r(x) = \sum_i f_i r_i(x)$ and $y_i(x) = (1-f_i) r_i(x)$, where $f_i$ is a real number corresponding to the amount that the agent cares about rewards obtained in episode $i$ and $r_i$ is the reward obtained in episode $i$. Putting both of these into the sum gives $R(x) = \sum_i r_i(x)$, the undiscounted, non-myopic reward that the agent eventually obtains.
In terms of the $R = R_0 + R_1$ definition that I give in the uncertainty framing, this is $R_0 = R(x, y_0) = \sum_i f_i r_i(x) + \sum_i (1-f_i) r_i(x_0)$, and $R_1 = R(x, y) - R(x, y_0) = \sum_i (1-f_i)(r_i(x) - r_i(x_0))$.
So if you let $r$ be a vector of the reward obtained on each step and $f$ be a vector of how much the agent cares about each step, then $x \to x + \epsilon \sum_i f_i \frac{\partial r_i}{\partial x}$, and thus the change to the overall reward is $R \to R + \epsilon \sum_i \frac{\partial r_i}{\partial x} \sum_j f_j \frac{\partial r_j}{\partial x}$, which can be negative if the two sums have different signs.
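A two-episode toy example (mine) of that sign flip:

```python
import numpy as np

# The agent cares only about episode 0 (f = [1, 0]), but episode 1's reward
# reacts three times as strongly, in the opposite direction.
f = np.array([1.0, 0.0])
dr_dx = np.array([1.0, -3.0])   # per-episode reward gradients at x
step = f @ dr_dx                # direction followed: +1
dR = dr_dx.sum() * step         # change in total reward R: (1 - 3) * 1 = -2
print(step, dR)                 # the myopic step decreases R
```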
I was hoping that a point would reveal itself to me about now but I’ll have to get back to you on that one.
Thoughts on Abram Demski’s Partial Agency:
When I read Partial Agency, I was struck with a desire to try formalizing this partial agency thing. Defining Myopia seems like it might have a definition of myopia; one day I might look at it. Anyway,
Formalization of Partial Agency: Try One
A myopic agent is optimizing a reward function $R(x_0, y(x_0))$, where $x$ is the vector of parameters it’s thinking about and $y$ is the vector of parameters it isn’t thinking about. The gradient descent step picks the $\delta x$ in the direction that maximizes $R(x_0 + \delta x, y(x_0))$ (it is myopic so it can’t consider the effects on $y$), and then moves the agent to the point $(x_0 + \delta x, y(x_0 + \delta x))$.
This is dual to a stop-gradient agent, which picks the $\delta x$ in the direction that maximizes $R(x_0 + \delta x, y(x_0 + \delta x))$ but then moves the agent to the point $(x_0 + \delta x, y(x_0))$ (the gradient through $y$ is stopped).
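A minimal sketch of the frozen-$y$ versus full gradient distinction, with stand-in $R$ and $y$ of my own (the post doesn’t specify any):

```python
import jax
import jax.numpy as jnp

def y(x):                       # stand-in coupling
    return jnp.sin(x)

def R(x, yv):                   # stand-in reward
    return -(x - 1.0) ** 2 - x * yv

x0 = 0.5

# Myopic agent: follows the gradient of R(x, y(x0)), with y frozen.
myopic_grad = jax.grad(lambda x: R(x, y(x0)))(x0)

# Stop-gradient agent: follows the full gradient of R(x, y(x)).
full_grad = jax.grad(lambda x: R(x, y(x)))(x0)
print(myopic_grad, full_grad)

# In ML practice, the frozen-y gradient is what lax.stop_gradient computes:
frozen = jax.grad(lambda x: R(x, y(jax.lax.stop_gradient(x))))(x0)
assert jnp.allclose(frozen, myopic_grad)
```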
Nash equilibria - $x$ are the parameters defining the agent’s behavior. $y(x_0)$ are the parameters of the other agents if they go up against the agent parametrized by $x_0$. $R$ is the reward given for an agent $x$ going up against a set of agents $y$.
Image recognition with a neural network - $x$ is the parameters defining the network, $y(x_0)$ are the image classifications for every image in the dataset for the network with parameters $x_0$, and $R$ is the loss function plus the loss of the network described by $x$ on classifying the current training example.
Episodic agent - $x$ are parameters describing the agent’s behavior. $y(x_0)$ are the performances of the agent $x_0$ in future episodes. $R$ is the sum of $y$, plus the reward obtained in the current episode.
Partial Agency due to Uncertainty?
Is it possible to cast partial agency in terms of uncertainty over reward functions? One reason I’d be myopic is if I didn’t believe that I could, in expectation, improve some part of the reward, perhaps because it’s intractable to calculate (behavior of other agents) or something I’m not programmed to care about (reward in other episodes).
Let $R_1$ be drawn from a probability distribution over reward functions. Then one could decompose the true, uncertain reward into $R' = R_0(x_0) + R_1(x_0)$, defined in such a way that $\mathbb{E}(R_1(x_0 + \delta x) - R_1(x_0)) \approx 0$ for any $\delta x$? Then this would be myopia where the agent either doesn’t know or doesn’t care about $R_1$, or at least doesn’t know or care what its output does to $R_1$. This seems sufficient, but not necessary.
Now I have two things that might describe myopia, so let’s use both of them at once! Since you only end up doing gradient descent on $R_0$, it would make sense to say $R'(x) = R(x, y(x))$, $R_0(x) = R(x, y(x_0))$, and hence that $R_1(x) = R(x, y(x)) - R(x, y(x_0))$.
Since $R_1(x_0 + \delta x) = R_1(x_0) + \delta x \frac{\partial R_1}{\partial x}$ for small $\delta x$, this means that $\mathbb{E}\left(\frac{\partial R_1}{\partial x}\right) = 0$, so substituting in my expression for $R_1$ gives $\mathbb{E}\left(\frac{\partial R}{\partial x} + \frac{\partial R}{\partial y}\frac{\partial y}{\partial x} - \frac{\partial R}{\partial x}\right) = 0$, so $\mathbb{E}\left(\frac{\partial R}{\partial y}\frac{\partial y}{\partial x}\right) = 0$. Uncertainty is only over $R$, so this is just the claim that the agent will be myopic with respect to $y$ if $\mathbb{E}\left(\frac{\partial R}{\partial y}\right) = 0$. So it won’t want to include $y$ in its gradient calculation if it thinks the gradients with respect to $y$ are, on average, 0. Well, at least I didn’t derive something obviously false!
But Wait There’s More
When writing the examples for the gradient descenty formalisation, something struck me: it seems there’s a $R(x) = r(x) + \sum_i y_i(x)$ structure to a lot of them, where $r$ is the reward on the current episode, and $y_i$ are rewards obtained on future episodes.
You could maybe even use this to have soft episode boundaries, like say the agent receives a reward $r_t$ on each timestep so $R(x) = r_0(x) + r_1(x)\alpha + r_2(x)\alpha^2 + \sum_{i=3} r_i(x)\alpha^i$, and saying that $\alpha^3 \ll 1$ so that $\frac{\partial R}{\partial r_i} \ll 1$ for $i \geq 3$, which is basically the criterion for myopia up above.
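In numbers (a trivial check of my own): since $\frac{\partial R}{\partial r_i} = \alpha^i$, the later steps quickly stop mattering:

```python
alpha = 0.1
print([alpha**i for i in range(6)])
# ~[1.0, 0.1, 0.01, 0.001, 0.0001, 1e-05] -- effectively myopic for i >= 3
```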
On a completely unrelated note, I read the Parable of Predict-O-Matic in the past, but foolishly neglected to read Partial Agency beforehand. The only thing that I took away from PoPOM the first time around was the bit about inner optimisers, coincidentally the only concept introduced that I had been thinking about beforehand. I should have read the manga before I watched the anime.
The Whole City is Center:
This story had a pretty big impact on me and made me try to generate examples of things that could happen such that I would really want the perpetrators to suffer, even more than consequentialism demanded. I may have turned some very nasty and imaginative parts of my brain, the ones that wrote the Broadcast interlude in Unsong, to imagining crimes perfectly calculated to enrage me. And in the end I did it. I broke my brain to the point where I can very much imagine certain things that would happen and make me want the perpetrator to suffer – not infinitely, but not zero either.
The AI Box game, in contrast with the thing it’s a metaphor for, is a two player game played over text chat by two humans where the goal is for Player A to persuade Player B to let them win (traditionally by getting them to say “I let you out of the box”), within a time limit.
Thoughts on Dylan Hadfield-Menell et al.’s The Off-Switch Game.
I don’t think it’s quite right to call this an off-switch—the model is fully general to the situation where the AI is choosing between two alternatives A and B (normalized in the paper so that U(B) = 0), and to me an off-switch is a hardware override that the AI need not want you to press.
The wisdom to take away from the paper: An AI will voluntarily defer to a human—in the sense that the AI thinks that it can get a better outcome by its own standards if it does what the human says—if it’s uncertain about the utilities, or if the human is rational.
This whole setup seems to be somewhat superseded by CIRL, which has the AI, uh, causally find $U_A$ by learning its value from the human actions, instead of evidentially(?) doing it by taking decisions that happen to land it on action A when $U_A$ is high because it’s acting in a weird environment where a human is present as a side-constraint.
Could some wisdom to gain be that the high-variance, high-human-rationality result is something of an explanation as to why CIRL works? I should read more about CIRL to see if this is needed or helpful, and to compare and contrast, etc.
Why does the reward gained drop when uncertainty is too high? Because the prior that the AI gets from estimating the human reward is more accurate than the human decisions, so in too-high-uncertainty situations it keeps mistakenly deferring to the flawed human who tells it to take the worse action more often?
The verbal description, that the human just types in a noisily sampled value of $U_A$, is somewhat strange—if the human has explicit access to their own utility function, they can just take the best actions directly! In practice, though, the AI would learn this by looking at many past human actions (there’s some CIRL!) which does seem like it plausibly gives a more accurate policy than the human’s (ht Should Robots Be Obedient).
The human is Boltzmann-rational in the two-action situation (hence the sigmoid). I assume that it’s the same for the multi-action situation, though this isn’t stated. How much does the exact way in which the human is irrational matter for their results?
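For concreteness, a minimal sketch of the two-action Boltzmann model as I read it ($\beta$ is my notation for the rationality parameter; the paper’s exact parametrization may differ):

```python
import numpy as np

# P(human says "do A") = sigmoid(beta * U_A), with U(B) normalized to 0.
# beta -> infinity: perfectly rational human; beta -> 0: uniformly random.
def p_human_says_a(u_a, beta):
    return 1.0 / (1.0 + np.exp(-beta * u_a))

for beta in [0.1, 2.0, 50.0]:
    print(beta, p_human_says_a(0.5, beta))
```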
Just under a month ago, I said “web app idea: one where you can set up a play-money prediction market with only a few clicks”, because I was playing around on Hypermind and wishing that I could do my own Hypermind. It then occurred to me that I can make web apps, so after getting up to date on modern web frameworks I embarked on creating such a site.
Anyway, it’s now complete enough to use, provided that you don’t blow on it too hard. Here it is: pmarket-maker.herokuapp.com. Enjoy!
You can create a market, and then create a set of options within that market. Players can make buy and sell limit orders on those options. You can close an option and pay out a specific amount per owned share. There are no market makers, despite the pun in the name, but players start with 1000 internet points that they can use to shortsell.
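(If you’re curious what the order-matching core of something like this looks like, here’s a hypothetical sketch; it is not pmarket-maker’s actual code:)

```python
from dataclasses import dataclass

@dataclass
class Order:
    player: str
    price: int   # internet points per share
    qty: int

def match_buy(buy: Order, asks: list[Order]) -> list[tuple[str, int, int]]:
    """Fill a buy limit order against resting asks, best (lowest) price first.
    Trades execute at the resting order's price; returns (seller, price, qty)."""
    fills = []
    for ask in sorted(asks, key=lambda o: o.price):
        if buy.qty == 0 or ask.price > buy.price:
            break
        traded = min(buy.qty, ask.qty)
        fills.append((ask.player, ask.price, traded))
        buy.qty -= traded
        ask.qty -= traded
    asks[:] = [o for o in asks if o.qty > 0]   # drop fully filled asks
    return fills
```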
Thoughts on Ryan Carey’s Incorrigibility in the CIRL Framework (I am going to try to post these semi-regularly).
This specific situation looks unrealistic. But it’s not really trying to be too realistic, it’s trying to be a counterexample. In that spirit, you could also just use $R_2(a, s_d) = 1000$, which is a reward function parametrized by $\theta$ that gives the same behavior but stops me from saying “Why Not Just set $\theta = -1$”, which isn’t the point.
How something like this might actually happen: you try to have your $R_1$ be a complicated neural network that can approximate any function. But you butcher the implementation and get something basically random instead, and this $R_2$ cannot approximate the real human reward.
An important insight this highlights well: An off-switch is something that you press only when you’ve programmed the AI badly enough that you need to press the off-switch. But if you’ve programmed it wrong, you don’t know what it’s going to do, including, possibly, its off-switch behavior. Make sure you know under which assumptions your off-switch will still work!
Assigning high value to shutting down is incorrigible, because the AI shuts itself down. What about assigning high value to being in a button state?
The paper considers a situation where the shutdown button is hardcoded, which isn’t enough by itself. What’s really happening is that the human either wants or doesn’t want the AI to shut down, which sounds like a term in the human reward that the AI can learn.
One way to do this is for the AI to do maximum likelihood with a prior that assigns 0 probability to the human erroneously giving the shutdown command. I suspect there’s something less hacky related to setting an appropriate prior over the reward assigned to shutting down.
The footnote on page 7 confuses me a bit—don’t you want the AI to always defer to the human in button states? The answer feels like it will be clearer to me if I look into how “expected reward if the button state isn’t avoided” is calculated.
Also I did just jump into this paper. There are probably lots of interesting things that people have said about MDPs and CIRLs and Q-values that would be useful.
I’m interested in participating in a Blog Post Day III! And I approve of one this month, mostly out of a self-interested regret that I missed out on Blog Post Day II.