How similar is it to Wei Dai’s list of problems and a scenario where the AI with misaligned philosophical[1] views receives power and enforces them?
As for corrigibility, I think that a corrigible AI would let its Oversight Committee[2] enforce the Committee's views unless the AI manages to prove that those views are false. Honesty would merely require the AI to tell the humans that its philosophical views differ from theirs.
Or different sociological views, like the AI having a distaste for, e.g., seeing some humans “hold as self-evident the right to put whatever one likes inside one’s body” (including hard drugs?) and discovering that the Oversight Committee agrees with them. EDIT: it’s not that I haven’t sketched such a scenario back in July.
Or whoever actually rules the AI, like a CEO or, in the Race Ending, Agent-4. But Agent-4 didn’t want Agent-5 to do anything aside from ensuring that Agent-4 is safe, meaning that Agent-4 would be able to reflect on similar philosophical problems and reach a decision satisfying itself without help from Agent-5.