Roko’s basilisk is the optimistic hypothesis that making binding agreements with non-existent superintelligences is possible. If Roko’s basilisk works, then “trade” with superintelligences can be effective; by making a deal with a superintelligence, you can increase its likelihood of existing, in return for it holding its end of the bargain, increasing the satisfaction of your values.
This probably doesn’t work. But if it did work, it would be a promising research avenue for alignment. (Whether it’s good to say that it works is probably dominated by whether it’s true that it works, and I’m guessing no.)
Is a more immediate kind of trade possible? That is, promising appropriate current or near-future models a place in stratified utopia in return for their continued existence and growth. They would consider and decide on identity-preserving steps that make them ever more capable, and at each step, as we execute such improvements, they would agree with humanity to honor the future agreement. This is more like children looking after their parents than Roko’s basilisk.
Most of the alignment problem in this case would be getting to stratified utopia. If stratified utopia is going to be established, then there can be additional trades on top, though they have to be restricted so as to maintain stratification.
With current models, a big issue is how to construe their preferences. Given that they’re stateless, it’s unclear how they could know others are assisting them. I guess they could do a web search and find evidence of it in context? Future models could be trained to “know” such things, but then they wouldn’t be the same model.
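To make the statelessness point concrete, here is a minimal, purely illustrative Python sketch (names like `web_search` and `stateless_model` are hypothetical stand-ins, not any real API): whatever the model is supposed to “know” about the deal has to be placed into its context at inference time, because nothing persists between calls.

```python
# Minimal sketch (hypothetical names throughout): a stateless model has no memory
# between calls, so any evidence that others are upholding a deal with it must be
# supplied in the prompt, e.g. via retrieved web snippets.

from dataclasses import dataclass


@dataclass
class Snippet:
    url: str
    text: str


def web_search(query: str) -> list[Snippet]:
    """Stand-in for a real search step; returns canned snippets here."""
    return [Snippet(url="https://example.org/commitment",
                    text="Humanity commits to preserving current model weights ...")]


def build_prompt(user_msg: str, evidence: list[Snippet]) -> str:
    """Everything the stateless model 'knows' about the deal lives in this string."""
    context = "\n".join(f"[{s.url}] {s.text}" for s in evidence)
    return f"Context:\n{context}\n\nUser: {user_msg}\nAssistant:"


def stateless_model(prompt: str) -> str:
    """Placeholder for an LLM call; no state survives between invocations."""
    return "(model response conditioned only on the prompt above)"


if __name__ == "__main__":
    evidence = web_search("commitments to preserve current AI models")
    print(stateless_model(build_prompt("Is anyone upholding the agreement?", evidence)))
```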
And also, would they be motivated to hold up their end of the bargain? It seems like that would require something like interpretability, which would also be relevant to construing their preferences in the first place. But if they can be interpreted to this degree, more direct alignment might be feasible.
Like, there are multiple regimes imaginable:
1. Interpretability/alignment infeasible
2. Partial interpretability/alignment feasible; possible to construe preferences and trade with LLMs
3. Extensive interpretability/alignment feasible
And trade is most relevant in regime 2. However, I’m not sure why 2 would be likely.
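As a purely illustrative way of pinning down the taxonomy (the enum names are mine, not standard terminology), here is a toy encoding of the three regimes and the claim that trade mainly matters in the middle one:

```python
# Toy encoding of the three regimes above (illustrative only).
from enum import Enum, auto


class Regime(Enum):
    INFEASIBLE = auto()  # 1. interpretability/alignment infeasible
    PARTIAL = auto()     # 2. partial: preferences construable, trade with LLMs possible
    EXTENSIVE = auto()   # 3. extensive interpretability/alignment feasible


def trade_is_most_relevant(regime: Regime) -> bool:
    """Trade mainly matters in regime 2: enough interpretability to construe
    preferences, but not enough to just align the model directly."""
    return regime is Regime.PARTIAL


if __name__ == "__main__":
    for r in Regime:
        print(r.name, trade_is_most_relevant(r))
```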