Jessica Taylor. CS undergrad and Master’s at Stanford; former research fellow at MIRI.
I work on decision theory, social epistemology, strategy, naturalized agency, mathematical foundations, decentralized networking systems and applications, theory of mind, and functional programming languages.
Blog: unstableontology.com
Twitter: https://twitter.com/jessi_cata
Most of the alignment problem in this case would be getting to stratified utopia. If a stratified utopia is established, then there can be additional trades on top of it, though they would have to be restricted so as to maintain stratification.
With current models, a big issue is how to construe their preferences. Given that they're stateless, it's unclear how they could know that others are assisting them. I guess they could do a web search and find it in context? Future models could be trained to "know" things, but they wouldn't be the same model.
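To make the statelessness point concrete, here's a minimal sketch (the `query_model` stub is hypothetical, standing in for any stateless chat-completion API; no particular provider's interface is assumed). Each call is independent, so the fact that others are assisting the model has to be re-supplied in context every time:

```python
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a stateless chat-completion call.
    A real call would hit an LLM API; this stub just checks whether the
    assistance is mentioned anywhere in the prompt it was given."""
    if "allocated compute to your goals" in prompt:
        return "Acknowledged: I can see the assistance described in context."
    return "I have no record of any assistance."

# Call 1: the fact is injected into the context, so the model can "know" it.
with_context = query_model(
    "Context: humans have allocated compute to your goals.\n"
    "Question: are others assisting you?"
)

# Call 2: a fresh, independent call. Nothing carries over from call 1, so the
# same fact would have to be re-supplied (e.g. retrieved by web search and
# placed in context) for the model to act on it.
without_context = query_model("Question: are others assisting you?")

print(with_context)     # -> acknowledges the assistance
print(without_context)  # -> no record of it
```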
And also, would they be motivated to hold up their end of the bargain? Verifying that seems to require something like interpretability, which would also be relevant to construing their preferences in the first place. But if they can be interpreted to that degree, more direct alignment might be feasible.
Like, there are multiple regimes imaginable:
1. Interpretability/alignment infeasible
2. Partial interpretability/alignment feasible; possible to construe preferences and trade with LLMs
3. Extensive interpretability/alignment feasible
Trade is most relevant in regime 2. However, I'm not sure why 2 would be likely.