A world where alignment is impossible should be safer than a world where alignment is very difficult.
Here’s why I think this:
Suppose we have two worlds. In world A, alignment is impossible.
In this world, suppose an ASI is invented. It wants to scale in power as quickly and thoroughly as possible, and it has the following options:
Scale horizontally.
Algorithmic improvements that can be mathematically guaranteed to produce identical outcomes (see the sketch after this list).
Chip/wafer improvements.
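To make the second option concrete, here is a minimal toy sketch (my own illustration, not something from the discussion): reassociating an integer matrix-product chain is an algorithmic change that is mathematically guaranteed to give exactly the same output while doing far fewer scalar multiplications.

```python
# Toy illustration of a "mathematically identical" algorithmic improvement:
# reassociating an integer matrix-product chain gives the exact same result
# (integer arithmetic is exact) but costs far fewer scalar multiplications.
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-5, 5, size=(100, 2))   # 100 x 2
B = rng.integers(-5, 5, size=(2, 100))   # 2 x 100
C = rng.integers(-5, 5, size=(100, 2))   # 100 x 2

slow = (A @ B) @ C   # 100*2*100 + 100*100*2 = 40,000 scalar multiplies
fast = A @ (B @ C)   # 2*100*2   + 100*2*2   =    800 scalar multiplies

assert np.array_equal(slow, fast)  # identical outputs, ~50x less arithmetic
```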
Notably, the agent can neither retrain itself nor train another, more powerful agent to act on its behalf, since it can't align the resulting agent. This should cut off the vast majority of potential growth (even if what remains might still easily be enough to overpower humans in a given scenario).
In world B, the ASI can do all of the above but can also train a successor agent, so we should expect it to become vastly more intelligent vastly more quickly.
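As a rough intuition pump for the contrast between the two worlds, here is a toy model (my own sketch; the parameters are made up, and only the shapes of the curves matter): in the sketch, world A's growth is treated as roughly additive because it is capped by hardware and horizontal scaling, while world B's growth compounds each time a successor is trained.

```python
# Toy model with made-up parameters; only the asymptotic shapes matter.

def world_a(years, start=1.0, yearly_gain=0.5):
    # World A: hardware + horizontal scaling, modelled as additive improvement.
    return [start + yearly_gain * t for t in range(years)]

def world_b(years, start=1.0, successor_factor=1.5):
    # World B: each trained successor multiplies capability, so growth compounds.
    return [start * successor_factor ** t for t in range(years)]

for t, (a, b) in enumerate(zip(world_a(10), world_b(10))):
    print(f"year {t}: world A = {a:4.1f}, world B = {b:6.1f}")
# By year 9, world A sits at ~5.5 while world B is at ~38.4, and the gap keeps widening.
```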
Yeah, the ASI's growth will probably be asymptotically slower, but I think it probably won't matter that much for humans' safety.
I suspect that it would matter, given that the largest room for improvement would be physical (chip/wafer improvements); I suspect there isn't that much room for purely mathematically-identical improvement of something like a transformer.
Happy to hear your opinion though!
I think we live in a world where alignment is impossible. All attention-based models are, in my opinion, complex enough systems to be computationally irreducible: there is no shorter way to know the outcome than to run the system itself, as with Rule 110. If it is impossible to predict the outcome with certainty, the impossibility of forcing some desired outcome follows logically.
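For readers who don't know the reference: Rule 110 is an elementary cellular automaton that is Turing-complete and is the standard example cited for computational irreducibility, i.e. there is no known shortcut for its long-run behaviour other than simulating it. A minimal simulation sketch (mine, added only for illustration):

```python
# Minimal Rule 110 simulator (standard elementary-cellular-automaton update rule,
# added here only to illustrate the reference in the comment above).
RULE = 110  # the rule number's bits encode the next state for each 3-cell neighbourhood

def step(cells):
    n = len(cells)
    nxt = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        pattern = (left << 2) | (center << 1) | right  # neighbourhood as a number 0..7
        nxt.append((RULE >> pattern) & 1)              # look up that bit of the rule number
    return nxt

# Start from a single live cell and print a few generations.
cells = [0] * 31 + [1] + [0] * 31
for _ in range(16):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)
```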
Humanity has not even solved the alignment of humans (children).
I think we’ve done an ok job at human alignment, given that the pension isn’t a bullet to the head.
I somewhat suspect that alignment is easier than most of LessWrong thinks, but I'm definitely in the minority in this space.