Doing alignment research with Vivek Hebbar’s team at MIRI.
FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before, as I expect Eliezer has, I’d be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), though Quintin would also have substantive disagreements.
What was the equation for research progress referenced in Ars Longa, Vita Brevis?
“Then we will talk this over, though rightfully it should be an equation. The first term is the speed at which a student can absorb already-discovered architectural knowledge. The second term is the speed at which a master can discover new knowledge. The third term represents the degree to which one must already be on the frontier of knowledge to make new discoveries; at zero, everyone discovers equally regardless of what they already know; at one, one must have mastered every previously-discovered fact before one can discover anything new. The fourth term represents potential for specialization; at one, it is impossible to understand any part without understanding the whole; at zero, it can be subdivided freely. The fifth…”
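The story leaves the equation unstated, so here is one possible formalization of the four terms it describes (entirely my own reconstruction: the symbols $k$, $K$, $a$, $d$, $f$, $s$ and the functional forms are all assumptions, and the fifth term is cut off mid-sentence):

$$\dot{k} = a\,(sK - k), \qquad \dot{K} = d\left(\min\!\left(\frac{k}{sK},\,1\right)\right)^{f/(1-f)}$$

where $k$ is a student's knowledge and $K$ the frontier of discovered knowledge; $a$ is the absorption speed (first term) and $d$ the discovery speed (second); $f \in [0,1]$ is the frontier requirement (third: at $f=0$ the exponent vanishes and everyone discovers at rate $d$ regardless of mastery; as $f \to 1$ the exponent diverges and discovery requires full mastery); and $s \in [0,1]$ is the specialization term (fourth: at $s=1$ one must absorb everything before discovering, and as $s \to 0$ almost nothing).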
A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?
We’re definitely unlucky that, of the two challenges, this has been solved and AI strategy is unsolved.
There’s a trivial sense in which the agent is optimizing the world and you can rationalize a utility function from that, but I think an agent that, from our perspective, basically just maximizes granite spheres can look quite different from the simple picture of an agent that always picks the top action according to some (not necessarily explicit) granite-sphere valuation of the actions, in ways such that the argument still goes through. For instance (see the toy sketch after this list):
The agent can have all the biases humans do.
The agent can violate the VNM axioms in any other way that doesn’t ruin it: basically anything of low enough frequency or importance.
The agent only tries to maximize granite spheres 1 out of every 5 seconds, and spends the other 4⁄5 of the time just trying not to be turned off.
The agent has arbitrary deontological restrictions, say against sending any command to its actuators whose hash starts with 123.
The agent has 5 goals it is jointly pursuing, but only one of them is consequentialist.
The agent will change its goal depending on which cosmic rays it sees, but is totally incorrigible to us.
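A toy sketch of the point (entirely illustrative; the action names, valuation, and hash rule are all made up): both agents below pump the world toward granite spheres, even though only the first is a clean argmax over a sphere valuation.

```python
import hashlib

ACTIONS = ["make_sphere", "defend_off_switch", "idle"]

def sphere_value(action: str) -> float:
    # Some (not necessarily explicit) granite-sphere valuation of actions.
    return {"make_sphere": 1.0, "defend_off_switch": 0.1, "idle": 0.0}[action]

def clean_maximizer(t: int) -> str:
    # The simple picture: always pick the top action under the valuation.
    return max(ACTIONS, key=sphere_value)

def messy_agent(t: int) -> str:
    # Only optimizes spheres 1 second out of every 5; the rest of the time
    # it just tries not to be turned off.
    candidate = "make_sphere" if t % 5 == 0 else "defend_off_switch"
    # Arbitrary deontological restriction: refuse any action whose hash
    # starts with "123" (a made-up rule mirroring the example above).
    if hashlib.sha256(candidate.encode()).hexdigest().startswith("123"):
        return "idle"
    return candidate

# Both trajectories, run long enough, fill the future with spheres and
# contain no special structure that rewards humans.
print([clean_maximizer(t) for t in range(10)])
print([messy_agent(t) for t in range(10)])
```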
The original wording of the tweet was “Suppose that the AI’s sole goal is to maximize the number of granite spheres in its future light cone.” This is a bit closer to my picture of EU maximization but some of the degrees of freedom still apply.
Here’s one factor that might push against the value of Steinhardt’s post as something to send to ML researchers: perhaps it is not arguing for anything controversial, and so is easier to defend convincingly. Steinhardt doesn’t explicitly make any claim about the possibility of existential risk, and barely mentions alignment. Gates spends the entire talk on alignment and existential risk, and might avoid being too speculative because their talk is about a survey of basically the same ML researcher population as the audience, and so can engage with the most important concerns, counterarguments, etc.
I’d guess that the typical ML researcher who reads Steinhardt’s blogposts will basically go on with their career unaffected, whereas one who watches the Gates video will now know that alignment is a real subfield of AI research, plus the basic arguments for catastrophic failure modes. Maybe they’ll even be on the lookout for impressive research from the alignment field, or empirical demonstrations of alignment problems.
Caveats: I’m far from a NeurIPS author and spent 10 minutes skipping through the video, so maybe all of this is wrong. Would love to see evidence one way or the other.
I agree with the following caveats:
I think you’re being unfair to that Rob tweet and the MIRI position; having enough goal-directedness to maximize the number of granite spheres + no special structure to reward humans is a far weaker assumption than utility maximization. The argument in the tweet also goes through if the AI has 1000 goals as alien as maximizing granite spheres, which I would guess Rob thinks is more realistic. (note that I haven’t talked to him and definitely don’t speak for him or MIRI)
Shard theory is mostly just a frame and hasn’t discovered anything yet; the nontrivial observations about how agents and values behave rely on ~9 nonobvious claims, and the obviously true observations are not very powerful in arguing for alternate models of how powerful AI behaves. [If this sounds critical of shard theory, note that I’m excited about the shard theory frame, it just seems premature to conclude things from the evidence we have]
[edited to add:] Reflection might give some degree of coherence. This is important in the MIRI frame and also in the shard theory frame.
I feel like FTX is a point against utilitarianism for the same reasons Bentham is a point for utilitarianism. If taking an ethical system to its logical conclusions leads you to anticipate feminism, animal rights, etc., that’s evidence that the algorithm creates good in practice; if it leads you to commit massive fraud, that’s evidence against.
This also doesn’t shift my meta-ethics much, so maybe I’m not one of the people you’re talking about?
I agree and now am pretty confused.
Hypothesis: much of this is explained by the simpler phenomenon of loss aversion. $1 to your ingroup is a gain, $1 to your outgroup is a loss and therefore mentally multiplied by ~2. The paper finds a factor of 3, so maybe there’s something else going on too.
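Rough arithmetic (the loss-aversion coefficient $\lambda \approx 2$ is the standard Kahneman–Tversky figure, my assumption rather than anything from the paper):

$$\text{predicted ratio} = \lambda \approx 2, \qquad \text{observed ratio} \approx 3, \qquad \text{residual factor} \approx \frac{3}{2} = 1.5$$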
Not Nate or a military historian, but it seems pretty likely to me that an actor ~100 human-years more technologically advanced than everyone else could get a decisive strategic advantage over the world.
In military history it seems pretty common for some tech advance to cause one side to get a big advantage. This seems to be true today as well with command-and-control and various other capabilities
I would guess pure fusion weapons are technologically possible, which means an AI sophisticated enough to design one can get nukes without uranium
Currently on the cutting edge, the most advanced actors have large multiples over everyone else in important metrics. This is due to either a few years’ lead or better research practices still within the human range
SMIC is mass producing the 14nm node whereas Samsung is at 3nm, which is something like 5x better FLOPS/watt
algorithmic improvements driven by the cognitive labor of ML engineers have produced multiple OOMs of improvement in value/FLOPS
SpaceX gets 10x better cost per ton to orbit than the next cheapest space launch provider, and this is before Starship. Also their internal costs are lower
This seems sufficient for “what failure looks like” scenarios, with faster disempowerment through hard takeoff likely to depend on other pathways like nanotech, social engineering, etc. As for the whole argument against “heavy precedent”, I’m not convinced either way and haven’t thought about it a ton.
There’s a clarification by John here. I heard it was going to be put on Superlinear but unclear if/when.
Could you say more about why you think these exercises are particularly valuable? I’m on Vivek’s team and I helped a bit with these exercises, so I’m naturally a fan, but I don’t think most people can decide to do these rather than Ngo’s exercises or other SERI MATS app questions without more information on why they’re good.
Market is implied by Bloomberg to be in danger of Bayes Hell. This graph is from Wednesday morning.
I’m guessing the graph was produced by taking some measure of the EV of the interest rate, and incorrectly interpreting (say) 0.65% as a 2⁄5 chance of a 0.50% hike and a 3⁄5 chance of a 0.75% hike. Not sure why they do this.
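A sketch of the inference I suspect produced the graph (my illustration; the function and numbers are hypothetical, not Bloomberg’s actual methodology):

```python
def implied_probs(ev: float, lo: float = 0.50, hi: float = 0.75) -> dict:
    """Decompose a market-implied expected hike `ev` into probabilities
    over the two nearest increments, assuming those are the only two
    possible outcomes -- which is exactly the questionable step."""
    p_hi = (ev - lo) / (hi - lo)
    return {lo: 1 - p_hi, hi: p_hi}

print(implied_probs(0.65))  # ~{0.5: 0.4, 0.75: 0.6}, i.e. 2/5 and 3/5
```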
How many “important” things do you expect the entire field of alignment to achieve in the next 4 years? All of the questions above are highly sensitive to this number.
Value fragility / value complexity. How close do you need to get to human values to get 50% of the value of the universe, and how complicated must the representation be? (Orthogonality would also have been on this list in the past, but it’s now widely believed.)
Possible post on suspicious multidimensional pessimism:

I think MIRI people (specifically Soares and Yudkowsky, but probably others too) are more pessimistic than the alignment community average on several different dimensions, both technical and non-technical: morality, civilizational response, takeoff speeds, probability of easy alignment schemes working, and our ability to usefully expand the field of alignment. Some of this is implied by technical models, and MIRI is not more pessimistic in every possible dimension, but it’s still awfully suspicious. I strongly suspect that one of the following is true:
the MIRI “optimism dial” is set too low
everyone else’s “optimism dial” is set too high. (Yudkowsky has said this multiple times in different contexts)
there are common generators beyond MIRI’s models that I don’t know about, which are not just an “optimism dial”
I’m only going to actually write this up if there is demand; the full post will have citations which are kind of annoying to find.
My point is that a forecaster can have the level of precision where they say 50% is much too low and 80% is much too high. I agree that 50% vs 80% only makes a 1.6x difference in the final number, which is fairly small when you and Tegmark differ by 30x.
50% vs 80% is a huge difference, 4x in odds terms.
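Spelling out both calculations (my arithmetic, just making the two claims explicit):

$$\frac{0.8/0.2}{0.5/0.5} = 4 \;\text{(odds ratio)}, \qquad \frac{0.8}{0.5} = 1.6 \;\text{(ratio of the final probabilities)}$$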
This gets a lot of points for concreteness, regardless of how likely to work it is. Also, I updated towards shard theory plans working despite my models being different from shard theory, because this plan didn’t seem to rely on claims I think are dodgy, e.g. internal game theory. Not too confident in this though because I haven’t thought about this much.