Disclaimer: writing quickly.
Consider the following path:
A. There is an AI warning shot.
B. Civilization allocates more resources for alignment and is more conservative pushing capabilities.
C. This reallocation is sufficient to solve and deploy aligned AGI before the world is destroyed.
I think that a warning shot is unlikely (P(A) < 10%), but won’t get into that here.
I am guessing that P(B | A) is the biggest crux. The OP primarily considers the ability of governments to implement policy that moves our civilization further from AGI ruin, but I think that the ML community is both more important and probably significantly easier to shift than government. I basically agree with this post as it pertains to government updates based on warning shots.
I anticipate that a warning shot would get most capabilities researchers to a) independently think about alignment failures, including the ones their own models will cause, and b) take the EA/LessWrong/MIRI/Alignment sphere’s worries a lot more seriously. My impression is that OpenAI is much more worried about misuse risk than accident risk; that emphasis makes sense if alignment is easy, since then the composition of the lightcone is primarily determined by the values of the AGI designers. Right now, there are ~100 capabilities researchers vs ~30 alignment researchers at OpenAI. I think a warning shot would dramatically update them towards worry about accident risk, and therefore I anticipate that OpenAI would drastically shift most of its resources to alignment research. I would guess P(B|A) ~= 80%.
P(C | A, B) primarily depends on alignment difficulty, about which I am pretty uncertain, and on how large the reallocation in B is, which I anticipate to be pretty large. The bar for destroying the world gets lower every year, but this reallocation would buy us a lot more time: I think we get several years of AGI capability before we deploy it. I’m estimating P(C | A, B) ~= 70%, but this is very low resilience.
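For reference, chaining these estimates together (taking P(A) at its 10% ceiling) gives P(A) × P(B|A) × P(C|A,B) ≲ 0.10 × 0.80 × 0.70 ≈ 0.056, i.e. roughly a 6% upper bound on this whole path going through.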
Hmm, the eigenfunctions just depend on the distribution of the input training data (which we call X), and in this experiment the inputs are distributed uniformly on the interval [−π, π). Since the eigendecomposition is independent of the labels, you’ll get the same NTK eigendecomposition regardless of the target function.
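As a quick sketch of why the labels never enter, here is roughly what that computation looks like, using a generic stand-in kernel rather than the actual NTK of the architecture:

```python
import numpy as np

# Stand-in kernel for illustration; in the actual experiment this would
# be the NTK of the network architecture being trained.
def kernel(x1, x2):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2)

# Inputs distributed uniformly on [-pi, pi), as in the experiment.
X = np.linspace(-np.pi, np.pi, 200, endpoint=False)

K = kernel(X, X)                      # Gram matrix: a function of X only
eigvals, eigvecs = np.linalg.eigh(K)  # the labels y appear nowhere here

# Sort by decreasing eigenvalue; each column of eigvecs is an (empirical)
# eigenfunction evaluated on X.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
```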
I’ll probably spin up some quick experiments with a multi-dimensional input space to see if it looks different, but I would be quite surprised if the eigenfunctions stopped being sinusoidal. Another thing to vary would be the distribution of input points.
An anonymous academic wrote a review of Joe Carlsmith’s ‘Is power-seeking AI an existential risk?’, in which the reviewer assigns a <1/100,000 probability to AI existential risk. The arguments given aren’t very good imo, but it is maybe worth reading.
Just made a fairly large edit to the post after lots of feedback from commenters. My most recent changes include the following:
- Note limitations in the introduction (lacks academics, depth not balanced in proportion to number of people, not endorsed by the researchers)
- Update CLR as per Jesse’s comment
- Update brain-like AGI to include this.
- Rewrite shard theory section
- Brain <-> shards
- Effort: 50 → 75 hours :)
- Add this paper to DeepMind
- Add some academics (David Krueger, Sam Bowman, Jacob Steinhardt, Dylan Hadfield-Menell, FHI)
- Add “other” category
- Summary table updates:
  - Update links in the table to make sure they work.
  - Add scale of organization
Thank you to everyone who commented, it has been very helpful.
Good point, I’ve updated the post to reflect this.
I’m excited for your project :)
Good point. We’ve added the Center for AI Safety’s full name into the summary table which should help.
Thanks for the update! We’ve edited the section on CLR to reflect this comment, let us know if it still looks inaccurate.
“not all sets all sets of reals which are bounded below have an infimum”
Do you mean ‘all sets of reals which are bounded below DO have an infimum’?
In the model-based RL setup, we are planning to give it actions that can directly modify the game state in any way it likes. This is sort of like an arbitrarily powerful superpower, because the agent can change anything it wants about the world, except of course that this is a Cartesian environment, so it can’t, e.g., recursively self-improve.
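As a minimal sketch of what I mean (all names here are hypothetical, not our actual code), the action space is just the space of game states, so the agent “acts” by writing the next state directly:

```python
# Minimal sketch, hypothetical names: an environment whose action space
# is the space of game states themselves.
class DirectStateEnv:
    def __init__(self, initial_state, reward_fn):
        self.state = initial_state
        self.reward_fn = reward_fn  # stand-in reward function over game states

    def step(self, action):
        # The action *is* the desired next state; there is no legality
        # check, so the agent can rewrite the game world however it likes.
        self.state = action
        return self.state, self.reward_fn(self.state)
```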
With model-free RL, this strategy doesn’t obviously carry over, so I agree that we are limited to easily codeable superpowers.
Strong upvoted, and I quite like this anecdote. I will work on adding my guess of the scale of these orgs to the table.
Hi Adam, thank you so much for writing this informative comment. We’ve added your summary of FAR to the main post (and linked this comment).
Agree with both aogara’s and Eli’s comments.
One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s the kind of reading between the lines that AI safety people should be able to do themselves.
For me, this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday, and while I could quite quickly summarize the work itself, it was quite hard for me to figure out the motivations.
“my current best guess is that gradient descent is going to want to make our models deceptive”
Can you quantify your credence in this claim?
Also, how much optimization pressure do you think that we will need to make models not deceptive? More specifically, how would your credence in the above change if we trained with a system that exerted 2x, 4x, … optimization pressure against deception?
If you don’t like these, or want a more specific operationalization of this question, I’m happy with whatever version you think is most likely, or with filling in more details.
Sorry about that, and thank you for pointing this out.
For now I’ve added a disclaimer (footnote 2 right now; I might make this more visible/clear, but I’m not sure of the best way to do that). I will try to add summaries of some of these groups once I have read some of their papers; currently I have not read much of their research.
Edit: agree with Eli’s comment.
80,000 Hours’ recent problem profile on AI lists some reasons that this might be wrong.
Thank you Gabriel!
Yeah, good point. I should have included that link; the post is now updated to include it.