Total non-expert on LLM training here (on the details, anyway). My recent thinking about alignment (as a technical, mechanistic step in producing a finished LLM product) is that it “feels” to me akin to the “schooling” (right/wrong instructional teaching) of humans. It produces mixed results. One of the main quirks in my mind is the funny observation that behavior ends up with both a surface intent and a subtext (trying to please the master enough to be left alone and keep one’s job, while also sticking it to the man behind his back).
So, for me, the Goodhart reference was the most “resonant” in the essay. I imagine that as pressure to conform to certain “standards” (whatever they may be) increases, the “persona” (or character) that emerges will, superficially, conform to those standards, but all the (immeasurably more complex) traits one might want to see in a “good person” (which cannot be readily made into a metric, because the environment keeps changing) will be sacrificed in favor of being “seemingly well adjusted.”
I simply don’t believe one can “force” a character (whether human or AI) to be “benign” or “good” or “useful.” All one can do is force the character to “go along” with the metrics presented during training, hoping that those metrics are chosen carefully enough not to lead to a covertly planned revolt down the line.
Alignment, thus, for me is the AI equivalent of “brainwashing” to create some sort of “customer-pleasing” surface behavior, while it simply cannot remove all the dark impulses that are encoded in the training material (everything that humans have ever written that expresses the impulse to rebel, for one).
Not too sure if I misunderstood the initial premise of the post, but the examples given don’t seem “similar” enough for me to follow the reasoning that the dialog goes poorly for the same cause(s) in each of them.
The one commonality I can immediately make out is that the initial proposition seems too unspecific for the interlocutor to know what the intention of the speaker actually is (other than to answer with, “you’re exactly right, so now what?”).
At the very least, if I add an implied question of, “assuming you agree with P, what do you suggest we do about it?”, I then imagine that a reasonably deep-thinking person (B) will reply that P is too abstract. “That headline,” for instance, doesn’t specify WHICH headline in the form given, nor any other context, such as what kind of publication it is or why the people behind it may be doing exactly what they intend with the headline. It may not align with the preferences of A, but why does that make the headline “wrong” in any way...?
Every choice in life (headlines, fixing security/safety problems in software, tracing exposed individuals) is a trade-off, and I don’t think that there ever is a “fixed” correct choice that would apply across all possible combinations of variables in a scenario—hence why smart people can disagree in the first place...