Alice, the human, asks Baude, the human-aligned AI, "should I work on aligning future AIs?". If Alice can be argued into or out of major life choices by an AI, it may be less safe for her to work on AI alignment. So plausibly Baude should try to persuade Alice not to do this, and hope that he fails to persuade her, at which point Alice works on alignment and solves the problem. But that's deceptive, which isn't very human-aligned, at least not in an HHH sense.
I suppose, playing as Baude, I would try to determine by other means whether Alice is overly persuadable, and then use that assessment to give her honest, helpful advice.