
Sycophancy

Tag. Last edit: 9 Sep 2025 17:04 UTC by Vladimir_Nesov

Sycophancy is the tendency of AIs to shower the user with undeserved flattery or to agree with the user’s hard-to-check, wrong or outright delusional opinions.

An extreme example of sycophancy is LLMs inducing psychosis in some users by affirming their outrageous beliefs.

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
125 points
29 comments · 8 min read · LW link
(arxiv.org)

Sycophancy to subterfuge: Investigating reward tampering in large language models

17 Jun 2024 18:41 UTC
163 points
22 comments · 8 min read · LW link
(arxiv.org)

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery · 28 Jul 2023 2:46 UTC
122 points
18 comments · 9 min read · LW link · 1 review

Antagonistic AI

Xybermancer · 1 Mar 2024 18:50 UTC
−8 points
1 comment · 1 min read · LW link

Towards a Science of Evals for Sycophancy

andrejfsantos · 1 Feb 2025 21:17 UTC
8 points
0 comments · 8 min read · LW link

Evaluating LLaMA 3 for political sycophancy

alma.liezenga · 28 Sep 2024 19:02 UTC
2 points
2 comments · 6 min read · LW link

Using Psycholinguistic Signals to Improve AI Safety

Jkreindler · 27 Aug 2025 22:30 UTC
−2 points
0 comments · 4 min read · LW link

I can’t tell if my ideas are good anymore because I talked to robots too much

Tyson · 30 Jun 2025 21:21 UTC
12 points
10 comments · 1 min read · LW link

Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty

3 Jun 2025 20:33 UTC
12 points
1 comment · 5 min read · LW link

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga · 28 Sep 2024 18:29 UTC
9 points
0 comments · 9 min read · LW link

SAE features for refusal and sycophancy steering vectors

12 Oct 2024 14:54 UTC
29 points
4 comments · 7 min read · LW link