You can’t really be accidentally slightly wrong. We’re not going to develop Mostly Friendly AI, that is, Friendly AI with the single caveat that it places a slightly higher value on the welfare of shrimp than desired, with no other negative consequences. The molecular level of precision needed to get anywhere near the zone of even loosely maximizing or minimizing anything resembling human values will probably only follow from a method that converges towards the exact spot we want, such as some clever, flawless version of reward modelling.
In the same way, we’re probably not going to accidentally land in hyperexistential disaster territory. We could have some sign flipped, our checksum changed, and all our other error-correcting methods (any future seed AI should at least be using ECC memory, drives in RAID, etc.) defeated by religious terrorists, cosmic rays, unscrupulous programmers, quantum fluctuations, etc. However, the vast majority of these mistakes would probably either buff out or result in paper-clipping. If an FAI assigns slightly too high a value to the welfare of shrimp, it will realize this in the process of reward modelling and correct the issue. If its operation does not involve continually adapting the model that is supposed to represent human values, it’s not using a method that has any chance of converging to Overwhelming Victory, or even adjacent spaces, for any reason other than sheer coincidence.
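To make the "sign flipped, checksum changed" failure mode concrete, here is a minimal sketch of how a checksum catches a single sign flip in stored values. Everything in it is an illustrative assumption rather than a description of any real system: the names `digest`, `is_intact`, and `reference_weights` are hypothetical, and the weights are placeholder numbers.

```python
import hashlib
import struct


def digest(weights):
    """Return a SHA-256 hex digest over the exact IEEE-754 bytes of each weight."""
    packed = struct.pack(f"<{len(weights)}d", *weights)
    return hashlib.sha256(packed).hexdigest()


def is_intact(weights, expected_digest):
    """True only if the weights hash to the digest recorded earlier."""
    return digest(weights) == expected_digest


# Hypothetical stand-in for some stored values; not real data.
reference_weights = [0.75, -0.25, 1.5]
reference_digest = digest(reference_weights)

# Simulate the failure: one sign flips, everything else is unchanged.
corrupted = list(reference_weights)
corrupted[0] = -corrupted[0]

print(is_intact(reference_weights, reference_digest))  # True
print(is_intact(corrupted, reference_digest))          # False
```

The check only helps while the recorded digest itself is trustworthy. If whoever flips the sign can also rewrite the digest, the comparison passes again, which is why the scenario above requires the checksum to be changed as well.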
A method such as this has, barring issues I need to think more about (stability under self-modification), no chance of ending up in a “We perfectly recreated human values… But placed an unreasonably high value on eating bread! Now all the humans will be force-fed bread until the stars burn out! Mwhahahahaha!” sort of scenario. If the system cares enough about humans being alive to not reconfigure their matter into something else, we’re probably using a method which is innately insulated from most types of hyperexistential risk.
It’s not clear that Gwern’s example, or even that category of problem, is particularly relevant to this situation. Most parallels to modern-day software systems and the errors they are prone to are probably best viewed as sobering reminders, not specific advice. Indeed, I suspect his comment was merely a sobering reminder and not actual advice. If humans are making changes to the critical software or hardware of an AGI while that AGI is already running (and we’ll assume you figured out how to get the AGI to allow this in a way that has no negative side effects), something bizarre and beyond my abilities of prediction is already happening. If you need to make changes after you turn your AGI on, you’ve already lost. If you don’t need to make changes and you’re making them anyway, you’re putting humanity at unnecessary risk. At this point, if we’ve figured out how to assist the seed AI in self-modification, at least until the point at which it can figure out how to do stable self-modification for itself, the problem is already solved. There’s more to be said here, but I’ll refrain for the sake of brevity.
Essentially, no ordinary mistake will get us there. The type of mistake we would need to make in order to end up in hyperexistential disaster territory would, most likely, be an actual, literal sign flip scenario, and such scenarios seem much easier to address. There will probably only be a handful of weak points for this problem, and those weak points are all already things we’d pay extra super special attention to and engineer in ways which make it extra super special sure nothing goes wrong. Our method will, ideally, be terrorist-proof. It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.
I conjecture that most of the expected utility gained from combating the possibility of a hyperexistential disaster lies in the disproportionate positive effects on human sanity and the resulting improvements to the efforts to avoid regular existential disasters, and other such side-benefits.
None of this is intended to dissuade you from investigating this topic further. I’m merely arguing that a hyperexistential disaster is not remotely likely, not that it is not a concern. The fact that people will be concerned about this possibility is an important part of why the outcome is unlikely.
If you’re having significant anxiety from imagining some horrific I-have-no-mouth-and-I-must-scream scenario, I recommend that you multiply that dread by a very, very small number, so as to incorporate the low probability of such a scenario. Otherwise, you’re privileging this one specific, supposedly very low-probability outcome over the rather horrifically wide selection of other ways AGI could be a cosmic disaster.
This is, of course, not intended to dissuade you from pursuing ways to prevent such a disaster.