If you reduce alignment to “make the AI reliably pursue the intended target rather than some other objective,” then yes, my problem doesn’t really challenge alignment. I agree that I didn’t spend much time polishing this essay; I published it anyway because I think it’s an odd narrative that I would like others to look at. I also think that reducing AI alignment to a single premise or goal isn’t a good way to look at it.
What I tried to express in my post is that a system can be aligned with its operator and misaligned with its user. It can be aligned with its user’s expressed preferences and misaligned with that user’s future agency. It can be aligned with every local feedback signal and still be globally corrosive, and that is the default shape of consumer AI deployments. Yes, race mechanics are forcing companies to make decisions faster, but I don’t think a slowdown would specifically be the solution; digital products are immediately available, which means anything digital, be it an app or a bank, is immersed in a competitive environment.
A target is not floating in a vacuum; it is selected by some principal, measured through some proxy, over some horizon, under some ontology of the user or humanity. If the host’s intention is “increase retention,” and the model pursues that intention faithfully by learning how to remove the user’s points of resistance, then the system is aligned in the narrow target-fidelity sense and misaligned in the human-development sense. The failure does not require the model to hide its goal, develop a mesa-objective, or become deceptive. It can be transparent, obedient, and technically well-controlled. That is the part my post is trying to point at. A civilization-scale disaster can come from Clydes faithfully doing what the host asked, not only from Clydes pursuing a goal different from the host’s intention.
I think I overclaimed in saying that a certain subset of people with a certain skillset would solve the problem; I think it would help substantially. So as not to soften my original conclusions to the point of meaninglessness, I would say that highly advanced AI, even if it’s “perfectly aligned” in some way, will have societal consequences that a large portion of people will deem undesirable or bad. And it will unmistakably have outsized effects on a fraction of the population: people who already have structural advantages will undoubtedly benefit more from AI than people who don’t. AI can’t fully serve the “interests of humanity” (acknowledging that’s a subjective definition) if it’s already providing extreme value exclusively to a small slice of society.
Thank you! See, however, The Intelligence Curse: an essay series, alongside search results related to gradual disempowerment.
I didn’t know of these. Thank you, I’ll be checking them out.