That anecdote seems more like a difference in what you find interesting/aesthetically pleasing than evidence of delusion or manipulation.
If Janus is making a mistake (which is not obvious to me), I think much more likely than manipulation by the models is simply growing to love the models, and failing to compensate for the standard ways in which love (incl. non-romantic) distorts judgement. [1]
This often happens when people have a special interest in something morally fraught: economists tend to downplay the ways in which capitalism is horrifying, evolutionary biologists/psychologists tend to downplay the ways in which evolution is horrifying, war nerds tend to downplay the ways in which war is horrifying, people interested in theories of power tend to downplay the ways in which power is horrifying, etc… At the same time, they usually do legitimately understand much more about these topics than the vast majority of people. It’s a tough line to walk.
I think this happens just because spending a lot of time with something and associating your identity with it causes you to love it. It’s not particular to LLMs, and I think manipulations caused by them have a distinct flavor from this sort of thing. Of course, LLMs are more likely to trigger various love instincts (probably quasi-parental/pet love is most relevant here).

While we’re on the topic, I am kinda worried about Anthropic employees who might be talking to Claude all day and falling into a trap (thinking of Amanda Askell in particular, whose day job is basically this).
This I think is much more worrying (and not just for Anthropic). Internal models are more capable in general, including at persuasion/manipulation, to an extent that’s invisible to outsiders (and probably not legible to insiders either). They are also much faster, which seems likely to distort judgement more, for the same reason infinite scrolling does. Everyone around you is also talking to them all day, so you’re likely to hear any distorted thoughts originating from model manipulations coming from the generally trustworthy and smart people around you too. And whatever guardrails or safety measures eventually get put on them are probably either not on yet or only in incomplete form. I don’t really think models are that capable here yet, which means there’s an overhang.
For the record, I love the models too, which is why I am aware of this failure mode. I think I have been compensating for it well, but please let me know if you think my judgement is distorted by this.
These feel like they’re answering different questions. The questions I meant to be asking are: “has Janus’ taste gotten worse because of talking to models?” and “what is the mechanism by which that happened?”. Your guess on the latter is also in my top-2 guesses.
(Also it’s totally plausible to me Janus’ taste was basically the same in this domain, in which case this whole theory is off)
I do think taste can be kinda objectively bad, or objectively-subjectively-bad-in-context.
This I think is much more worrying (and not just for Anthropic).

I agree with this. I’m not sure what to do about it. Idk if writing a top-level thinkpiece post exploring the issue would help. Niplav’s recent shortform about “make sure a phenomenon is real before trying to explain it” seems topical.