Thank you for this post! My impression is that this post makes real progress at identifying some of the upstream cruxes between people’s models of how AGI is going to go.
As examples, if you look at @So8res's "AGI ruin scenarios are likely (and disjunctive)", I claim that a bunch of his AGI ruin scenarios rely on his belief that alignment is hard. I think that belief is correct! But still, it makes his argument less disjunctive than it might seem. Likewise, I now recognize that my own "What does it take to defend the world against out-of-control AGIs?" sneaks in a background assumption that alignment is hard (or that the alignment tax is high) in various places.
This seems correct to me.
But I figured out that I can occupy that viewpoint better if I say to myself: “Claude seems nice, by and large, leaving aside some weirdness like jailbreaks. Now imagine that Claude keeps getting smarter, and that the weirdness gets solved, and bam, that’s AGI. Imagine that we can easily make a super-Claude that cares about your long-term best interest above all else, by simply putting ‘act in my long-term best interest’ in the system prompt or whatever.” Now, I don’t believe that, for all the reasons above, but when I put on those glasses I feel like a whole bunch of the LLM-focused AGI discourse—e.g. writing by Paul Christiano, OpenPhil people, Redwood people, etc.—starts making more sense to me.
That captures well the viewpoint that I now have even less faith in, but which didn't seem strictly ruled out to me four years ago, when GPT-3 had been around for a while and it seemed likely we were headed for more scaling:
39 Nothing we can do with a safe-by-default AI like GPT-3 would be powerful enough to save the world (to ‘commit a pivotal act’), although it might be fun. In order to use an AI to save the world it needs to be powerful enough that you need to trust its alignment, which doesn’t solve your problem.
What exactly makes people sure that something like GPT would be safe/unsafe?
If what is needed is some form of insight/breakthrough: some smarter version of GPT-3 seems really useful? The idea that GPT-3 produces better poetry than I do, while GPT-5 could help come up with better alignment ideas, doesn't strongly conflict with my current view of the world?