So what do we do? One classic answer is that we get as far as we can before encountering the hard problems, then we use whatever model we have at that point as an automated alignment researcher to do the research necessary to tackle the hard parts of alignment. I think this is a very good plan, and we should absolutely do this, but I don’t think it obviates the need for us to work on the hard parts of alignment ourselves.
I was surprised that nothing below mentions just how catastrophic it could be to push this “as far as we can” before using whatever model we have at that point as an automated alignment researcher. You could well have already passed a very dangerous point by then without knowing it. And since in this scenario you’re using a less capable AI to do alignment research on a smarter one, you run into exactly the same issues you mention above about humans doing alignment research on a system that may already be smarter than they are.
This is a crux!
I found your 4th point particularly well-explained and intuitive! Thanks for that. I was a bit skeptical of this post going in, but I enjoyed it by the end, even though I didn’t read it as thoroughly as I could have (I have not fully read many of the linked posts).
I didn’t find that this post explained why “alignment is hard” discourse seems alien to human intuitions; rather, it explained the differing fundamental views on why it’s hard. Nothing about this seemed alien to my human intuitions?
Also, mostly unrelated: I think my fear about an AI with built-in Approval Reward is just that even a small mistake in a ridiculously powerful AI can still result in the catastrophic collapse of human society. I think it shows promise, but it’s not at the level of safety I’d personally deem sufficient, I guess.