Drew the shoggoth and named notkilleveryoneism.
Tetraspace
There’s no official, endorsed CFAR handbook that’s publicly available for download. The CFAR handbook from summer 2016, which I found on libgen, warns
While you may be tempted to read ahead, be forewarned—we’ve often found that participants have a harder time grasping a given technique if they’ve already anchored themselves on an incomplete understanding. Many of the explanations here are intentionally approximate or incomplete, because we believe this content is best transmitted in person. It helps to think of this handbook as a companion to the workshop, rather than as a standalone resource.
which I think is still their view on the matter.
I have heard that they would be more comfortable with people learning rationality techniques in-person from a friend, so if you know any CFAR alumni you could ask them (they’d probably also have a better answer to your question).
Very nice! I like the colour-coding scheme, and the way it ties together those bullet points in MIRI’s research agenda.
Looks like these sequences are going to be a great (content-wise and aesthetically) introduction to a lot of the ideas behind agent foundations; I’m excited.
If anyone asks, I entered a code that I knew was incorrect as a precommitment to not nuke the site.
PMarket Maker
Just under a month ago, I said “web app idea: one where you can set up a play-money prediction market with only a few clicks”, because I was playing around on Hypermind and wishing that I could do my own Hypermind. It then occurred to me that I can make web apps, so after getting up to date on modern web frameworks I embarked on creating such a site.
Anyway, it’s now complete enough to use, provided that you don’t blow on it too hard. Here it is: pmarket-maker.herokuapp.com. Enjoy!
You can create a market, and then create a set of options within that market. Players can make buy and sell limit orders on those options. You can close an option and pay out a specific amount per owned share. There are no market makers, despite the pun in the name, but players start with 1000 internet points that they can use to shortsell.
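For concreteness, here’s a simplified sketch of the matching rule (my illustration, not the site’s actual code): an incoming buy fills against the cheapest resting sells at or below its limit price, and whatever is left rests on the book.

```python
# Simplified sketch of limit-order matching (not the site's actual code).

from dataclasses import dataclass

@dataclass
class Order:
    player: str
    price: int    # internet points per share
    shares: int

def match_buy(buy: Order, resting_sells: list[Order]) -> list[Order]:
    # Fill against the cheapest compatible sells first.
    for sell in sorted(resting_sells, key=lambda o: o.price):
        if sell.price > buy.price or buy.shares == 0:
            break
        filled = min(buy.shares, sell.shares)
        buy.shares -= filled
        sell.shares -= filled
        print(f"{buy.player} buys {filled} from {sell.player} at {sell.price}")
    # Orders with shares remaining stay on the book.
    return [s for s in resting_sells if s.shares > 0]
```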
EDIT 2023-02-25: Such a web app exists for real now, made as an actual product by other people who can devops and design UIs, it’s called Manifold Markets.
At the start of the Sequences, you are told that rationality is a martial art, used to amplify the power of the unaided mind in the same way that a martial art doesn’t necessarily make you stronger but just lets you use your body properly.
Bacon, on the other hand, throws the prospect of using the unaided mind right out; Baconian rationality is a machine, like a pulley or a lever, where you apply your mind however feebly to one end and by its construction the other end moves a great distance or applies a great force (either would do for the metaphor).
If I have my history right, Bacon’s machine is Science. Its function is to accumulate a huge mountain of evidence, so big that even a human could be persuaded by it, and instruction in the use of science is instruction in being persuaded by that mountain of evidence. Philosophers of old simply ignored the mountain of evidence (failed to use the machine) and maybe relied on syllogisms and definitions and hence failed to move the stone column.
And later, with the aid of Bacon’s machine, one discovers that you don’t really need the huge mountain of evidence or the systematic stuff after all: an ideal reasoner could simply perform a Bayesian update on each bit of evidence as it comes in and get to the truth far faster, avoiding all the slowness and all the mistakes that come from insisting on setting up the machine every single time. At your own risk, of course—get your stance slightly wrong lifting a stone column, and you throw your back out.
I.
Clicking on the button permanently switches it to a state where it’s pushed-down, below which is a prompt to enter launch codes. When moused over, the pushed-down button has the tooltip “You have pressed the button. You cannot un-press it.” Screenshot.
(On an unrelated note, on r/thebutton I have a purple flair that says “60s”.)
Upon entering a string longer than 8 characters, a button saying “launch” appears below the big red button. Screenshot.
II.
I’m nowhere near the PST timezone, so I wouldn’t be able to reliably pull a shenanigan whereby, if I had the launch codes, I would enter or not enter them depending on the amount of counterfactual money pledged to the Ploughshares Fund in the name of either launch-code-entry state. But this sentence is not apophasis.
III.
Conspiracy theory: There are no launch codes. People who claim to have launch codes are lying. The real test is whether people will press the button at all. I have failed that test. I came up with this conspiracy theory ~250 milliseconds after pressing the button.
IV. (Update)
I can no longer see the button when I am logged in. Could this mean that I have won?
I might as well post a monthly update on the things I’m doing that might be useful for doing AI safety.
I decided to just continue with what I was doing last year before I got distracted, and learn analysis from Tao’s Analysis I, on the grounds that it’s maths that’s important to know and that it lets me climb the skill tree analysis → topology → these fixed point exercises. I have done chapters 5, 6, and 7.
My question on what it would be most useful for me to be doing is still open, if anyone has any input.
I didn’t find the conclusion about the smoke-lovers and non-smoke-lovers obvious in the EDT case at first glance, so I added in some numbers and ran through the calculations that the robots will do to see for myself and get a better handle on what not being able to introspect but still gaining evidence about your utility function actually looks like.
Suppose that, out of the $N$ robots that have ever been built, $pN$ are smoke-lovers and $(1-p)N$ are non-smoke-lovers. Suppose also that the smoke-lovers end up smoking with probability $s$ and the non-smoke-lovers end up smoking with probability $n$.
Then $(ps + (1-p)n)N$ robots smoke, and $(p(1-s) + (1-p)(1-n))N$ robots don’t smoke. So by Bayes’ theorem, if a robot smokes, there is a $\frac{ps}{ps+(1-p)n}$ chance that it’s killed, and if a robot doesn’t smoke, there’s a $\frac{p(1-s)}{p(1-s)+(1-p)(1-n)}$ chance that it’s killed.
Hence, with the post’s payoffs ($+10$ utilons for a smoke-lover smoking, $-100$ utilons for being killed), the expected utilities are:
An EDT non-smoke-lover looks at the possibilities. It sees that if it smokes, it expects to get $-100 \cdot \frac{ps}{ps+(1-p)n}$ utilons, and that if it doesn’t smoke, it expects to get $-100 \cdot \frac{p(1-s)}{p(1-s)+(1-p)(1-n)}$ utilons.
An EDT smoke-lover looks at the possibilities. It sees that if it smokes, it expects to get $10 - 100 \cdot \frac{ps}{ps+(1-p)n}$ utilons, and if it doesn’t smoke, it expects to get $-100 \cdot \frac{p(1-s)}{p(1-s)+(1-p)(1-n)}$ utilons.
Now consider some equilibria. Suppose that no non-smoke-lovers smoke, but some smoke-lovers smoke, i.e. $n = 0$ and $s > 0$. So (taking limits as $n \to 0$ along the way):
non-smoke-lovers expect to get $-100$ utilons if they smoke, and $-100 \cdot \frac{p(1-s)}{p(1-s)+(1-p)}$ utilons if they don’t smoke. The latter is always at least as good, so non-smoke-lovers will choose not to smoke.
smoke-lovers expect to get $10 - 100 = -90$ utilons if they smoke, and $-100 \cdot \frac{p(1-s)}{p(1-s)+(1-p)}$ utilons if they don’t smoke. Smoke-lovers would be indifferent between the two if $\frac{p(1-s)}{p(1-s)+(1-p)} = \frac{9}{10}$, i.e. if $s = \frac{10p-9}{p}$. This works fine if at least 90% of robots are smoke-lovers, and equilibrium is achieved. But if less than 90% of robots are smoke-lovers, then there is no $s$ at which they would be indifferent, and they will always choose not to smoke.
But wait! This is fine if more than 90% are smoke-lovers, but if fewer than 90% are smoke-lovers, then they would always choose not to smoke, and that’s inconsistent with the assumption that $s$ is much larger than $n$. So instead suppose that $s$ is only a little bit bigger than $n$, with both close to zero. Then:
non-smoke-lovers expect to get $-100 \cdot \frac{ps}{ps+(1-p)n}$ utilons if they smoke, and $-100 \cdot \frac{p(1-s)}{p(1-s)+(1-p)(1-n)}$ utilons if they don’t smoke. They will choose to smoke if $\frac{ps}{ps+(1-p)n} < \frac{p(1-s)}{p(1-s)+(1-p)(1-n)}$, which happens only when $s < n$, i.e. if smoke-lovers smoke so rarely that not smoking would make them believe they’re a smoke-lover about to be killed by the blade runner.
smoke-lovers expect to get $10 - 100 \cdot \frac{ps}{ps+(1-p)n}$ utilons if they smoke, and $-100 \cdot \frac{p(1-s)}{p(1-s)+(1-p)(1-n)}$ utilons if they don’t smoke. They are indifferent between these two when $\frac{ps}{ps+(1-p)n} - \frac{p(1-s)}{p(1-s)+(1-p)(1-n)} = \frac{1}{10}$, which requires $s > n$. This means that, when $s$ is at the equilibrium point, non-smoke-lovers will not choose to smoke; and this indifference point exists when fewer than 90% of robots are smoke-lovers, which is exactly when this regime applies.
I wrote a quick python simulation to check these conclusions, and it was the case that $P(\text{killed} \mid \text{no smoke})$ settled at $\frac{9}{10}$ for $p > 0.9$, and that $P(\text{killed} \mid \text{smoke}) - P(\text{killed} \mid \text{no smoke})$ settled at $\frac{1}{10}$ for $p < 0.9$ there as well.
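A minimal reconstruction of that check (not the original script; it assumes the post’s payoffs of $+10$ for a smoke-lover smoking and $-100$ for being killed, with the blade runner killing smoke-lovers):

```python
# Reconstruction of the equilibrium check, not the original simulation.
# Assumed payoffs: +10 for a smoke-lover smoking, -100 for being killed.

def kill_probs(p, s, n):
    """P(killed|smoke) and P(killed|no smoke) by Bayes' theorem, given
    p = fraction of smoke-lovers, s/n = smoking rates of the two types."""
    smokers = p * s + (1 - p) * n
    k_smoke = p * s / smokers if smokers else 0.0
    k_quiet = p * (1 - s) / (p * (1 - s) + (1 - p) * (1 - n))
    return k_smoke, k_quiet

def gains_from_smoking(p, s, n):
    """EDT gain from smoking over not smoking, for each type of robot."""
    k_smoke, k_quiet = kill_probs(p, s, n)
    smoke_lover = (10 - 100 * k_smoke) - (-100 * k_quiet)
    non_smoke_lover = (-100 * k_smoke) - (-100 * k_quiet)
    return smoke_lover, non_smoke_lover

# Regime 1 (p >= 0.9): n = 0 and s = (10p - 9)/p makes smoke-lovers
# indifferent (gain ~0) while non-smoke-lovers still prefer not to smoke.
p = 0.95
print(gains_from_smoking(p, (10 * p - 9) / p, 0.0))   # ~(0.0, -10.0)

# Regime 2 (p < 0.9): s and n both small, with the ratio s/n tuned so that
# P(killed|smoke) - P(killed|no smoke) = 1/10.
p, n = 0.5, 1e-6
ratio = (p + 0.1) * (1 - p) / (p * (0.9 - p))
print(gains_from_smoking(p, ratio * n, n))            # ~(0.0, -10.0)
```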
I’d like to report a bug. My comments aren’t larger than worlds, which is a pity, because the kind of content I produce is clearly the most insightful and intelligent of all. I’m also humble to boot—more humble than you could ever believe—which is one of the rationalist virtues that any non-tribal fellow would espouse.
Maybe you got into trouble for talking about that because you are rude and presumptuous?
I think this is just a nod to how he’s literally Roko, for whom googling “Roko simulation” gives a Wikipedia article on what happened last time.
Thoughts on Abram Demski’s Partial Agency:
When I read Partial Agency, I was struck with a desire to try formalizing this partial agency thing. Defining Myopia seems like it might have a definition of myopia; one day I might look at it. Anyway,
Formalization of Partial Agency: Try One
A myopic agent is optimizing a reward function $R(x, y(x))$, where $x$ is the vector of parameters it’s thinking about and $y(x)$ is the vector of parameters it isn’t thinking about. The gradient descent step picks the $\Delta x$ in the direction that maximizes $R(x + \Delta x, y)$ with $y$ held fixed (it is myopic, so it can’t consider the effects on $y$), and then moves the agent to the point $(x + \Delta x, y(x + \Delta x))$.
This is dual to a stop-gradient agent, which picks the $\Delta x$ in the direction that maximizes $R(x + \Delta x, y(x + \Delta x))$, but then moves the agent to the point $(x + \Delta x, y(x))$ (the gradient through $y$ is stopped). There’s a toy numerical illustration of the difference after the examples below.
For example,
Nash equilibria - $x$ are the parameters defining the agent’s behavior. $y(x)$ are the parameters of the other agents if they go up against the agent parametrized by $x$. $R(x, y)$ is the reward given for an agent going up against a set of agents $y$.
Image recognition with a neural network - $x$ is the parameters defining the network, $y(x)$ are the image classifications for every image in the dataset for the network with parameters $x$, and $R(x, y)$ is a loss that depends on the classifications $y$, plus the loss of the network described by $x$ on classifying the current training example.
Episodic agent - $x$ are parameters describing the agent’s behavior. $y(x)$ are the performances of the agent in future episodes. $R(x, y)$ is the sum of the future performances $y$, plus the reward obtained in the current episode.
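Here’s a toy numerical illustration of the difference between the myopic update and the full update (my own sketch, with a made-up reward and response function):

```python
# Toy illustration of the myopic vs. full gradient step, with a made-up
# reward R and response function y(x); not anything from the post itself.

EPS = 1e-6
LR = 0.1

def y_of(x):
    # How the parameters the agent isn't thinking about respond to x.
    return 2.0 * x

def R(x, y):
    # An arbitrary reward function for illustration.
    return -(x - 1.0) ** 2 - (y - 1.0) ** 2

def myopic_step(x):
    # Direction ignores the effect of x on y (y held fixed at y(x))...
    y = y_of(x)
    grad = (R(x + EPS, y) - R(x - EPS, y)) / (2 * EPS)
    # ...but the world then moves y to y(x + dx) anyway.
    return x + LR * grad

def full_step(x):
    # Direction takes the response of y to x into account.
    grad = (R(x + EPS, y_of(x + EPS)) - R(x - EPS, y_of(x - EPS))) / (2 * EPS)
    return x + LR * grad

x_myopic = x_full = 0.0
for _ in range(500):
    x_myopic, x_full = myopic_step(x_myopic), full_step(x_full)
print(x_myopic, x_full)  # distinct fixed points: ~1.0 (myopic) vs ~0.6 (full)
```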
Partial Agency due to Uncertainty?
Is it possible to cast partial agency in terms of uncertainty over reward functions? One reason I’d be myopic is if I didn’t believe that I could, in expectation, improve some part of the reward, perhaps because it’s intractable to calculate (behavior of other agents) or something I’m not programmed to care about (reward in other episodes).
Let $R$ be drawn from a probability distribution over reward functions. Then one could decompose the true, uncertain reward into $R = R_x + R_y$, defined in such a way that $\mathbb{E}\left[\frac{\partial R_y}{\partial x}\right] = 0$ for any $x$? Then this would be myopia where the agent either doesn’t know or doesn’t care about $R_y$, or at least doesn’t know or care what its output does to $R_y$. This seems sufficient, but not necessary.
Now I have two things that might describe myopia, so let’s use both of them at once! Since you only end up doing gradient descent on the part of the reward that doesn’t go through $y$, it would make sense to say $R_x = R(x + \Delta x, y)$, $R_y = R(x + \Delta x, y + \Delta y) - R(x + \Delta x, y)$, and hence that $R = R_x + R_y = R(x + \Delta x, y + \Delta y)$.
Since $R(x + \Delta x, y + \Delta y) \approx R(x + \Delta x, y) + \frac{\partial R}{\partial y}\Delta y$ for small $\Delta y$, this means that $R_y \approx \frac{\partial R}{\partial y}\Delta y$, so substituting in my expression for $R_y$ gives $\mathbb{E}\left[\frac{\partial}{\partial x}\left(\frac{\partial R}{\partial y}\Delta y\right)\right] = 0$, so $\mathbb{E}\left[\frac{\partial R}{\partial y}\right]\frac{\partial y}{\partial x} = 0$. Uncertainty is only over $R$, so this is just the claim that the agent will be myopic with respect to $y$ if $\mathbb{E}\left[\frac{\partial R}{\partial y}\right] = 0$. So it won’t want to include $y$ in its gradient calculation if it thinks the gradients with respect to $y$ are, on average, 0. Well, at least I didn’t derive something obviously false!
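A quick numerical check of that conclusion (my own toy setup, not anything from the post): draw rewards whose $y$-gradient is zero-mean, let $y$ respond to $x$, and compare the expected full gradient with the myopic one.

```python
# Toy check: rewards R(x, y) = a*x + b*y with the y-gradient b zero-mean,
# and y responding to x as y = c*x. The full gradient of R(x, y(x)) is
# a + b*c; the myopic gradient (holding y fixed) is just a.

import random

random.seed(0)
a, c = 3.0, 2.0                                         # fixed x-gradient; y = c*x
bs = [random.gauss(0.0, 1.0) for _ in range(100_000)]   # b ~ N(0, 1), so E[b] = 0

expected_full_gradient = a + c * sum(bs) / len(bs)
print(expected_full_gradient)   # ~3.0: matches the myopic gradient on average
```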
But Wait There’s More
When writing the examples for the gradient descenty formalisation, something struck me: it seems there’s a structure to a lot of them, where $R_x$ is the reward on the current episode, and $R_y$ is the reward obtained on future episodes.
You could maybe even use this to have soft episode boundaries: say the agent receives a reward $r_t$ on each timestep, so that $R = \sum_t r_t$, and say that $\mathbb{E}\left[\frac{\partial r_t}{\partial x}\right] = 0$ for $t$ sufficiently far in the future, which is basically the criterion for myopia up above.
Unrelated Note
On a completely unrelated note, I read the Parable of Predict-O-Matic in the past, but foolishly neglected to read Partial Agency beforehand. The only thing that I took away from PoPOM the first time around was the bit about inner optimisers, coincidentally the only concept introduced that I had been thinking about beforehand. I should have read the manga before I watched the anime.
Thoughts on Ryan Carey’s Incorrigibility in the CIRL Framework (I am going to try to post these semi-regularly).
This specific situation looks unrealistic. But it’s not really trying to be too realistic, it’s trying to be a counterexample. In that spirit, you could also just use a reward function $R_\theta$, parametrized by $\theta$, that gives the same behavior but stops me from saying “Why Not Just set $\theta$ to match the human’s reward”, which isn’t the point.
How something like this might actually happen: you try to have your reward parametrization be a complicated neural network that can approximate any function. But you butcher the implementation and get something basically random instead, and this cannot approximate the real human reward.
An important insight this highlights well: An off-switch is something that you press only when you’ve programmed the AI badly enough that you need to press the off-switch. But if you’ve programmed it wrong, you don’t know what it’s going to do, including, possibly, its off-switch behavior. Make sure you know under which assumptions your off-switch will still work!
Assigning high value to shutting down is incorrigible, because the AI shuts itself down. What about assigning high value to being in a button state?
The paper considers a situation where the shutdown button is hardcoded, which isn’t enough by itself. What’s really happening is that the human either wants or doesn’t want the AI to shut down, which sounds like a term in the human reward that the AI can learn.
One way to do this is for the AI to do maximum likelihood with a prior that assigns 0 probability to the human erroneously giving the shutdown command. I suspect there’s something less hacky related to setting an appropriate prior over the reward assigned to shutting down.
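A sketch of how that zero-probability prior behaves (my illustration, not the paper’s construction): one button press becomes conclusive evidence that shutdown is really wanted.

```python
# Bayes' theorem on the button press. Setting p_press_if_not = 0 encodes the
# prior that the human never erroneously gives the shutdown command.

def p_wants_shutdown_given_press(prior_wants, p_press_if_wants=1.0,
                                 p_press_if_not=0.0):
    num = prior_wants * p_press_if_wants
    return num / (num + (1 - prior_wants) * p_press_if_not)

print(p_wants_shutdown_given_press(0.01))                       # 1.0: always defer
print(p_wants_shutdown_given_press(0.01, p_press_if_not=0.05))  # ~0.17: may ignore it
```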
The footnote on page 7 confuses me a bit—don’t you want the AI to always defer to the human in button states? The answer feels like it will be clearer to me if I look into how “expected reward if the button state isn’t avoided” is calculated.
Also I did just jump into this paper. There are probably lots of interesting things that people have said about MDPs and CIRLs and Q-values that would be useful.
The formalisation used in the Sequences (and in algorithmic information theory) is that the complexity of a hypothesis is the length of the shortest computer program that can specify that hypothesis.
An illustrative example is that, when explaining lightning, Maxwell’s equations are simpler in this sense than the hypothesis that Thor is angry, because the shortest computer program that implements Maxwell’s equations is much shorter than the shortest one that emulates a humanlike brain and its associated emotions.
In the case of many-worlds vs. the Copenhagen interpretation, a computer program that implemented either of them would start with the same algorithm (Schrödinger’s equation), but (the claim is) the program for Copenhagen would need an extra section specifying how collapse upon observation works, which many-worlds doesn’t need.
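For concreteness, the standard algorithmic-information-theory definition being gestured at here, relative to a universal machine $U$:

```latex
% Kolmogorov complexity of a hypothesis H: the length of the shortest
% program p that makes the universal machine U output (a specification of)
% H. "Simpler" means smaller K_U(H).
K_U(H) = \min_{p} \{\, |p| : U(p) = H \,\}
```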
I’m off from university (3rd year physics undergrad) for the summer and hence have a lot of free time, and I want to use this to make as much progress as possible towards the goal of getting a job in AI safety technical research. I have found that I don’t really know how to do this.
Some things that I can do:
work through undergrad-level maths and CS textbooks
basic programming (since I do physics, this is at the level required to implement simple numerical methods in MATLAB)
the stuff in Andrew Ng’s machine learning Coursera course
Thus far I’ve worked through the first half of Hutton’s Programming in Haskell on the grounds that functional programming maybe teaches a style of thought that’s useful and opens doors to more theoretical CS stuff.
I’m optimising for something slightly different from purely becoming good at AI safety, in that at the end I’d like to have some legible things to point to or list on a CV or something (or become better-placed to later acquire such legible things).
I’d be interested to hear from people who know more about what would be helpful for this.
I have two questions on Metaculus that compare how well the elements of a pair of cryonics techniques work: preservation by Alcor vs preservation by CI, and preservation using fixatives vs preservation without fixatives. They are forecasts of the value (% of people preserved with technique A who are revived by 2200)/(% of people preserved with technique B who are revived by 2200), which, barring weird things happening with identity, is the likelihood ratio of someone waking up if you learn that they’ve been preserved with one technique vs the other.
Interpreting these predictions in a way that’s directly useful requires some extra work—you need some model for turning the ratio P(revival|technique A)/P(revival|technique B) into plain P(revival|technique X), which is the thing you care about when deciding how much to pay for a cryopreservation.
One toy model is to assume that one technique works (P(revival) = x), but the other technique may be flawed (P(revival) < x). If r < 1, it’s the technique in the numerator that’s flawed, and if r > 1, it’s the technique in the denominator that’s flawed. This is what I guess is behind the trimodality in the Metaculus community distribution: there are peaks at the high end, the low end, and at exactly 1, perhaps corresponding to one technique working, the other working, and both working.
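Here’s a sketch of that calculation (my reconstruction rather than the actual Ergo notebook; the ratio samples below are made-up stand-ins for draws from the Metaculus community distribution):

```python
# Toy-model EV calculation. The ratio samples are hypothetical placeholders,
# not the real Metaculus data; in the real version they'd be drawn from the
# community distribution via the Ergo library.

def revival_probs(r):
    """P(revival|A), P(revival|B) under the toy model: one technique works
    (normalised to 100%) and the other may be flawed, with r = P(A)/P(B)."""
    if r < 1:
        return r, 1.0        # numerator technique is the flawed one
    return 1.0, 1.0 / r      # denominator technique is the flawed one

ratio_samples = [0.3, 0.8, 1.0, 1.0, 1.2, 5.0]   # hypothetical draws
ev_a = sum(revival_probs(r)[0] for r in ratio_samples) / len(ratio_samples)
ev_b = sum(revival_probs(r)[1] for r in ratio_samples) / len(ratio_samples)
print(f"EV(revival | A) = {ev_a:.0%}, EV(revival | B) = {ev_b:.0%}")
```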
For the current community medians (as of 2021-04-18), using that model and the Ergo library, and normalizing the working technique to 100%, I find:
Alcor vs CI:
EV(Preserved with Alcor) = 69%
EV(Preserved with Cryonics Institute) = 76%
Fixatives vs non-Fixatives:
EV(Preserved using Fixatives) = 83%
EV(Preserved without using Fixatives) = 34%
Since this hash is publicly posted, is there any timescale for when we should check back to see the preimage?
Well, at least we have a response to the doubters’ “why would anyone even press the button in this situation?”
There are about 8 billion people, so your 24,000 QALYs should be 24,000,000.
To make sure I have this right and my LW isn’t glitching: TurnTrout’s comment is a Drake meme, and the two other replies in this chain are actually blank?