I’m interested in participating in a Blog Post Day III! And I approve of one this month, mostly out of a self-interested regret that I missed out on Blog Post Day II.
Since this hash is publicly posted, is there any timescale for when we should check back to see the preimage?
Life 3.0 Liveblog/Review Thread
The prologue begins with a short story called the Tale of the Omega Team. It’s a wish-fulfilment pseudo-isekai about a group of effective altruist tech people at not-Google, called the Omegas, who make an AGI and then use it to take over the world.
But a cybersecurity specialist on their team talked them out of the game plan [...] risk of Prometheus breaking out and seizing control of its own destiny [...] weren’t sure how its goals would evolve [...] go to great lengths to keep Prometheus confined
For some reason, the Omegas in the story claim that Prometheus (the AI) might be unsafe, and then proceed to do things like have it write software which they then run on computers, let it produce long pieces of animated media, and let it send blueprints of technologies to scientists. There is a cybersecurity expert on the team who just barely stops them from straight up leaving the whole thing unboxed, and I do not envy her job position.
(Prometheus is safe, it turns out, which I can tell because there are humans alive at the end of the story.)
[...] Omega-controlled [...] controlled by the Omegas [...] the Omegas harnessed Prometheus [...] the Omegas’ [...] the Omegas’ [...]
There’s also another odd thing where it says that the Omegas are using Prometheus as a tool to do things, instead of what’s clearly actually happening, which is that Prometheus is achieving its goals, with the Omegas being some lumps of atoms that it’s been pushing around according to its whims, as it has been since they decided to switch it on.
All-in-all, I like it. It wouldn’t be out of place on r/rational. If wish-fulfilment pseudo-isekai does happen, then AGI sweeping aside the previous social order will be how (a real AGI would come close to some of the capabilities I’ve seen those protagonists have), and fiction about more plausible robopocalypses (or roboutopias) coming about is always great.
The note is just set-dressing; if it throws you off, you could give both boxes glass windows that let you see whether or not they contain a Bomb, and reach the same conclusions.
In the Parable of Predict-O-Matic, a subnetwork of the titular Predict-O-Matic becomes a mesa-optimiser and begins steering the future towards its own goals, independently of the rest of Predict-O-Matic. It does so in a way that sabotages the other subnetworks.
I am reminded of one specification problem that a run of Eurisko faced:
During one run, Lenat noticed that the number in the Worth slot of one newly discovered heuristic kept rising, indicating that Eurisko had made a particularly valuable find. As it turned out the heuristic performed no useful function. It simply examined the pool of new concepts, located those with the highest Worth values, and inserted its name in their My Creator slots.
One thing I wondered is whether this could happen in humans, and if not, why it doesn’t. A simplified description of memory that I learned in a flash game is that “neural connections” are “strengthened” whenever they are “used”, which sounds sort of like gradients in RL if you don’t think about it too hard. Maybe the analogue of this would be some memory that “wants” you to remember it repeatedly at the expense of other memories. Trauma?
Other things that Tim might mean when he says 20%:
Tim is being dishonest, and believes that the listeners will update away from the radical and low-status figure of 20% to avoid being associated with the lowly Tim.
Tim believes that other listeners will be encouraged to make their own probability estimates with explicit reasoning in response, which will make their expertise more legible to Tim and other listeners.
Tim wants to show cultural allegiance with the Superforecasting tribe.
Quick estimate: the global average is 4.8 tons of CO2 per person per year, which costs about $50 per year to offset per life saved, or ~$1500 in total over 30 additional years of life. So if you’re buying offsets, the cost over the course of an average person’s saved life is the same order of magnitude as the cost of saving a life via a GiveWell charity (~half).
For the people helped by GiveWell recommended charities, the additional CO2 emissions are probably lower; among the world’s poorest, <1 ton of CO2 per capita per year is pretty common, which is <$300 over a lifetime, about an order of magnitude less than the cost of saving a life.
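A quick sanity check of the arithmetic in the two paragraphs above, as a sketch. The only inputs are the figures quoted there; the per-ton offset price is not stated anywhere and is just what $50 per 4.8 tons implies, and the function name is made up for illustration.

```python
# Figures quoted in the estimate above.
GLOBAL_TONS_PER_YEAR = 4.8   # tons of CO2 per person per year, global average
OFFSET_COST_PER_YEAR = 50.0  # dollars per year to offset that amount
EXTRA_YEARS = 30             # additional years of life per life saved
POOR_TONS_PER_YEAR = 1.0     # upper bound quoted for the world's poorest

# Assumption: the offset price is whatever the two figures above imply.
price_per_ton = OFFSET_COST_PER_YEAR / GLOBAL_TONS_PER_YEAR


def lifetime_offset_cost(tons_per_year, years=EXTRA_YEARS, price=price_per_ton):
    """Dollars needed to offset the emissions of one extra lifetime."""
    return tons_per_year * price * years


average_cost = lifetime_offset_cost(GLOBAL_TONS_PER_YEAR)
poor_cost = lifetime_offset_cost(POOR_TONS_PER_YEAR)

print(f"implied offset price: ${price_per_ton:.2f} per ton")
print(f"global-average lifetime offset: ${average_cost:.0f}")
print(f"lifetime offset at 1 ton per year: ${poor_cost:.0f}")
```

At a full ton per year the last figure comes out slightly above $300, so the <$300 claim corresponds to emissions a little under 1 ton per year.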
Over the past few days I’ve been reading about reinforcement learning, because I understood how to make a neural network, say, recognise handwritten digits, but I wasn’t at all sure how that could be turned into getting a computer to play Atari games. So: what I’ve learned so far. Spinning Up’s Intro to RL probably explains this better.
(Brief summary, explained properly below: The agent is a neural network which runs in an environment and receives a reward. Each parameter in the neural network is increased in proportion to how much it increases the probability of making the agent do what it just did, and how good the outcome of what the agent just did was.)
Reinforcement learners play inside a game involving an agent and an environment. On turn $t$, the environment hands the agent an observation $o_t$, and the agent hands the environment an action $a_t$. For an agent acting in realtime, there can be sixty turns a second; this is fine.
The environment has a transition function which takes an observation-action pair $(o_t, a_t)$ and responds with a probability distribution over the observation on the next timestep, $o_{t+1}$; the agent has a policy that takes an observation $o_t$ and responds with a probability distribution over the action to take, $a_t$.
The policy is usually written as $\pi$, and the probability that $\pi$ outputs an action $a$ in response to an observation $o$ is $\pi(a \mid o)$. In practice, $\pi$ is usually a neural network that takes observations as input and has actions as output (using something like a softmax layer to give a probability distribution); the parameters of this neural network are $\theta$, and the corresponding policy is $\pi_\theta$.
At the end of the game, the entire trajectory $\tau = o_1 a_1 o_2 a_2 \ldots o_T a_T$ is assigned a score, $R(\tau)$, measuring how well the agent has done. The goal is to find the policy $\pi_\theta$ that maximises this score.
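To make the game concrete, here is a minimal sketch of the agent-environment loop described so far. Everything specific in it is an assumption made up for illustration: the four integer observations, the two actions, the transition probabilities, the scoring rule, and the use of a table of logits in place of a neural network.

```python
import math
import random

N_OBS = 4      # hypothetical: observations are the integers 0..3
N_ACTIONS = 2  # hypothetical: action 0 = "left", action 1 = "right"


def softmax(logits):
    """Turn a list of real numbers into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, obs):
    """pi_theta(. | obs). theta holds one row of logits per observation,
    a tabular stand-in for the neural network."""
    return softmax(theta[obs])


def transition(obs, action, rng):
    """Made-up transition function: sample o_{t+1} given (o_t, a_t)."""
    p_right = 0.9 if action == 1 else 0.1
    step = 1 if rng.random() < p_right else -1
    return min(max(obs + step, 0), N_OBS - 1)


def rollout(theta, n_turns, rng):
    """Play one game and return the trajectory as a list of (o_t, a_t)."""
    obs = 0
    trajectory = []
    for _ in range(n_turns):
        probs = policy(theta, obs)
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]
        trajectory.append((obs, action))
        obs = transition(obs, action, rng)
    return trajectory


def score(trajectory):
    """Made-up R(tau): number of turns spent at the rightmost observation."""
    return sum(1 for obs, _ in trajectory if obs == N_OBS - 1)


rng = random.Random(0)
theta = [[0.0] * N_ACTIONS for _ in range(N_OBS)]  # all-zero logits: uniform policy
tau = rollout(theta, n_turns=10, rng=rng)
print(tau, score(tau))
```

With all logits at zero the policy picks each action with probability one half, so this is the untrained starting point that learning is supposed to improve on.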
Since we’re using machine learning to maximise, we should be thinking of gradient ascent (gradient descent with the sign flipped), which involves finding the local direction in which to change the parameters $\theta$ in order to increase the expected value of $R$ by the greatest amount, and then moving them slightly in that direction.
In other words, we want to find $\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.
Writing the expectation value in terms of a sum over trajectories, this is $\nabla_\theta \sum_{\tau \in D} P(\tau \mid \theta) R(\tau) = \sum_{\tau \in D} \nabla_\theta P(\tau \mid \theta) R(\tau)$, where $P(\tau \mid \theta)$ is the probability of observing the trajectory $\tau$ if the agent follows the policy $\pi_\theta$, and $D$ is the space of possible trajectories.
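The probability of seeing a specific trajectory happen is the product of the probabilities of each individual step on the trajectory happening, and is hence $P(\tau \mid \theta) = \prod_{t=1}^{T} \pi_\theta(a_t \mid o_t) E(o_t \mid a_{t-1}, o_{t-1})$, where $E(o_{t+1} \mid a_t, o_t)$ is the probability that the environment outputs the observation $o_{t+1}$ in response to the observation-action pair $(o_t, a_t)$ (for $t = 1$, read this factor as the probability of the first observation). Products are awkward to work with, but products can be turned into sums by taking the logarithm: $\ln P(\tau \mid \theta) = \sum_{t=1}^{T} \left[ \ln \pi_\theta(a_t \mid o_t) + \ln E(o_t \mid a_{t-1}, o_{t-1}) \right]$.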
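The gradient of this is $\nabla_\theta \ln P(\tau \mid \theta) = \sum_{t=1}^{T} \left[ \nabla_\theta \ln \pi_\theta(a_t \mid o_t) + \nabla_\theta \ln E(o_t \mid a_{t-1}, o_{t-1}) \right]$. But what the environment does is independent of $\theta$, so that entire term vanishes, and we have $\nabla_\theta \ln P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t \mid o_t)$. The gradient of the policy is quite easy to find, since our policy is just a neural network, so you can use back-propagation.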
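For a feel of what the per-step gradient looks like, here is a small sketch that swaps the neural network for the simplest possible policy: a softmax over one logit per action. For that policy the gradient of the log-probability has a closed form (indicator of the chosen action minus the action probabilities), so no back-propagation is needed, and it can be checked against finite differences. The function names and the example logits are made up.

```python
import math


def softmax(logits):
    """Turn a list of real numbers into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_policy(logits, action):
    """ln pi_theta(action) for a softmax policy whose parameters are the logits."""
    return math.log(softmax(logits)[action])


def grad_log_policy(logits, action):
    """Closed-form gradient of ln pi_theta(action) with respect to each logit."""
    probs = softmax(logits)
    return [(1.0 if b == action else 0.0) - probs[b] for b in range(len(logits))]


def finite_difference_grad(logits, action, eps=1e-6):
    """Numerical check: central differences on each logit in turn."""
    grads = []
    for b in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[b] += eps
        down[b] -= eps
        grads.append((log_policy(up, action) - log_policy(down, action)) / (2 * eps))
    return grads


logits = [0.5, -1.0, 2.0]  # arbitrary example parameters
analytic = grad_log_policy(logits, action=2)
numeric = finite_difference_grad(logits, action=2)
print(analytic)
print(numeric)
```

With a real neural network the same quantity would come out of automatic differentiation instead of a formula, but the check against finite differences works the same way.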
Our expression for the expectation value is just in terms of the gradient of the probability, not the gradient of the logarithm of the probability, so we’d like to express one in terms of the other.
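Conveniently, the chain rule gives $\nabla_\theta \ln P(\tau \mid \theta) = \frac{1}{P(\tau \mid \theta)} \nabla_\theta P(\tau \mid \theta)$, so $\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta) \nabla_\theta \ln P(\tau \mid \theta)$. Substituting this back into the original expression for the gradient gives
$$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_{\tau \in D} P(\tau \mid \theta) \nabla_\theta \ln P(\tau \mid \theta) R(\tau),$$
and substituting our expression for the gradient of the logarithm of the probability gives
$$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_{\tau \in D} P(\tau \mid \theta) \sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t \mid o_t) R(\tau).$$
Notice that this is the definition of the expectation value of $\sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t \mid o_t) R(\tau)$, so writing the sum as an expectation value again we get
$$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t \mid o_t) R(\tau) \right].$$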
You can then find this expectation value easily by sampling a large number of trajectories (by running the agent in the environment many times), calculating the term inside the brackets, and then averaging over all of the runs.
(More sophisticated RL algorithms apply various transformations to the reward to use information more efficiently, and use various gradient descent tricks to use the gradients acquired to converge on the optimal parameters more efficiently)
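Putting the pieces together, here is a minimal sketch of the sampling recipe above on a toy game. The game is entirely made up: the environment always shows the same observation, there are two actions, a game lasts five turns, and the score is how many times action 1 was chosen. The policy is a softmax over two logits rather than a neural network, and the batch size, learning rate, and iteration count are arbitrary choices.

```python
import math
import random

N_ACTIONS = 2
T = 5  # turns per game (hypothetical)


def softmax(logits):
    """Turn a list of real numbers into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory(theta, rng):
    """Play one game; the toy environment always shows the same observation,
    so a trajectory is just the list of actions taken."""
    probs = softmax(theta)
    return [rng.choices(range(N_ACTIONS), weights=probs)[0] for _ in range(T)]


def score(actions):
    """Made-up R(tau): how many times action 1 was chosen."""
    return sum(1 for a in actions if a == 1)


def grad_log_policy(theta, action):
    """Gradient of ln pi_theta(action) with respect to the softmax logits."""
    probs = softmax(theta)
    return [(1.0 if b == action else 0.0) - probs[b] for b in range(N_ACTIONS)]


def estimate_gradient(theta, n_samples, rng):
    """Average of sum_t grad ln pi_theta(a_t | o_t) * R(tau) over sampled games."""
    total = [0.0] * N_ACTIONS
    for _ in range(n_samples):
        actions = sample_trajectory(theta, rng)
        r = score(actions)
        for a in actions:
            g = grad_log_policy(theta, a)
            for i in range(N_ACTIONS):
                total[i] += g[i] * r
    return [x / n_samples for x in total]


def train(n_iterations=200, n_samples=200, learning_rate=0.1, seed=0):
    """Plain gradient ascent on the sampled gradient estimate."""
    rng = random.Random(seed)
    theta = [0.0] * N_ACTIONS
    for _ in range(n_iterations):
        grad = estimate_gradient(theta, n_samples, rng)
        theta = [t + learning_rate * g for t, g in zip(theta, grad)]
    return theta


theta = train()
print("action probabilities after training:", softmax(theta))
```

The policy should end up strongly preferring action 1, since that is the only thing the score rewards. The estimate is noisy because every step of a game is credited with the whole game's score, which is one of the things the more sophisticated algorithms try to improve.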
Are we allowed to I-am-Groot the word “cake” to encode several bits per word, or do we have to do something like repeat “cake” until the primes that it factors into represent a desired binary string?
(edit: ah, only nouns, so I can still use whatever I want in the other parts of speech. or should I say that the naming cakes must be “cake”, and that any other verbal cake may be whatever this speaking cake wants)
Dank EA Memes is a Facebook group. It’s pretty good.
If anyone asks, I entered a code that I knew was incorrect as a precommitment to not nuke the site.
To make sure I have this right and my LW isn’t glitching: TurnTrout’s comment is a Drake meme, and the two other replies in this chain are actually blank?
Well, at least we have a response to the doubters’ “why would anyone even press the button in this situation?”
Clicking on the button permanently switches it to a state where it’s pushed-down, below which is a prompt to enter launch codes. When moused over, the pushed-down button has the tooltip “You have pressed the button. You cannot un-press it.” Screenshot.
(On an unrelated note, on r/thebutton I have a purple flair that says “60s”.)
Upon entering a string longer than 8 characters, a button saying “launch” appears below the big red button. Screenshot.
I’m nowhere near the PST timezone, so I wouldn’t be able to reliably pull a shenanigan whereby if I had the launch codes I would enter or not enter them depending on the amount of counterfactual money pledged to the Ploughshares Fund in the name of either launch-code-entry-state, but this sentence is not apophasis.
Conspiracy theory: There are no launch codes. People who claim to have launch codes are lying. The real test is whether people will press the button at all. I have failed that test. I came up with this conspiracy theory ~250 milliseconds after pressing the button.
I can no longer see the button when I am logged in. Could this mean that I have won?
At the start of the Sequences, you are told that rationality is a martial art, used to amplify the power of the unaided mind in the same way that a martial art doesn’t necessarily make you stronger but just lets you use your body properly.
Bacon, on the other hand, throws the prospect of using the unaided mind right out; Baconian rationality is a machine, like a pulley or a lever, where you apply your mind however feebly to one end and by its construction the other end moves a great distance or applies a great force (either would do for the metaphor).
If I have my history right, Bacon’s machine is Science. Its function is to accumulate a huge mountain of evidence, so big that even a human could be persuaded by it, and instruction in the use of science is instruction in being persuaded by that mountain of evidence. Philosophers of old simply ignored the mountain of evidence (failed to use the machine) and maybe relied on syllogisms and definitions and hence failed to move the stone column.
And later, with the aid of Bacon’s machine, one discovers that you don’t really need this huge mountain of evidence or the systematic stuff, and that an ideal reasoner could simply perform a Bayesian update on each bit that comes in and get to the truth way faster, while avoiding all the slowness and all the mistakes that come if you insist on setting up the machine every single time. At your own risk, of course: get your stance slightly wrong lifting a stone column, and you throw your back out.
An agent also faces a guaranteed payoffs problem in Parfit’s hitchhiker, since the driver has already made their prediction (the agent knows they’re safe in the town), so the agent’s choice is between losing $1,000 and losing $0. Is it also a bad idea for the agent to pay the $1,000 in this problem?
There’s something of a problem with sensitivity; if the x-risk from AI is ~0.1, and the difference in x-risk from some grant is ~10^-6, then any difference in the forecasts is going to be completely swamped by noise.
(while people in the market could fix any inconsistency between the predictions, they would only be able to look forward to 0.001% returns over the next century)
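The 0.001% figure can be reproduced with a couple of lines, as a sketch. The reading assumed here is that a trader who corrects the forecast from ~0.1 to ~0.1 plus the grant's effect earns a return proportional to the relative size of that correction; the annualised number is an extra illustration, not something from the comment.

```python
import math

BASELINE_XRISK = 0.1  # forecast probability of x-risk from AI (figure from the comment)
GRANT_EFFECT = 1e-6   # change in that probability from one grant (figure from the comment)
YEARS = 100           # "over the next century"

# Relative size of the correction a trader could profit from.
relative_shift = GRANT_EFFECT / BASELINE_XRISK

# The same return expressed per year, compounding over the century.
annualised = math.expm1(math.log1p(relative_shift) / YEARS)

print(f"relative shift in the forecast: {relative_shift:.3%}")
print(f"annualised over {YEARS} years: {annualised:.2e}")
```

Any forecasting noise larger than one part in a hundred thousand hides the effect of the grant completely.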
Is the issue that it’s pain-based and hence makes my life worse (probably false for me: maths is fun and gives me a sense of pride and accomplishment when I do it, it’s just that darn System 1 always saying “better for you if you play Kerbal Space Program”), or that social punishment isn’t always available and therefore ought not to be relied on (this is probably an issue for me), or some third thing?
In the intervening month I have done chapters 8 and 9 of Tao’s Analysis I, which feels terribly slow. Two chapters in a month? I could do the whole book in that time if I tried! And I know that I can because I have, like I’m getting a physics degree and it definitely feels like I’ve done at least one textbook worth of learning per term.
One of the active ingredients seems to be time pressure, which is present but not salient here—if I fail, all that happens is the wrong math is deployed to steer the future of the lightcone, which doesn’t hold a candle to me losing a little bit of status. Ah, to be a brain.
Thus: by October I’ll have finished Analysis I; think less of me if I haven’t.
(And perhaps I’ll have done even more!)
UPDATE SEP 26: You can rest easy now; I have completed the book.
This AI wouldn’t be trying to convince a human to help it, just that it’s going to succeed.
So instead of convincing humans that a hell-world is good, it would convince the humans that it was going to create a hell-world (and they would all disapprove, so it would score low).
I think what this ends up doing is having everyone agree with a world that sounds superficially good but is actually terrible in a way that’s difficult for unaided humans to realize e.g. the AI convinces everyone that it will create an idyllic natural world where people live forager lifestyles in harmony etc. etc., everyone approves because they like nature and harmony and stuff, it proceeds to create such an idyllic natural world, and wild animal suffering outweighs human enjoyment forevermore.