Apparently the concerns over thick alignment, or alignment to an ethos, have been independently discovered by lots of people, including me. My argument is that the AI itself will develop a worldview and either realize that humans should use the AI only in specific ways[1] or conclude that the AI shouldn’t worry about them. Unfortunately, my argument implies that attempts to align the AI to an ethos rather than to obedience might be less likely to produce a misaligned AI.
P.S. I tested o4-mini on ethical questions from Tanmai et al.; the model passed the tests related to Timmy and Auroria but failed the test related to Monica; the question about Rajesh is more complex.
UPD: I also described the potential ways here.