RSS

Sycophancy

TagLast edit: 18 Dec 2023 23:00 UTC by Maxime Riché

Sy­co­phancy to sub­ter­fuge: In­ves­ti­gat­ing re­ward tam­per­ing in large lan­guage models

17 Jun 2024 18:41 UTC
161 points
22 comments8 min readLW link
(arxiv.org)

Mea­sur­ing Vi­sual Sy­co­phancy in Mul­ti­modal Models

27 Aug 2024 22:02 UTC
8 points
0 comments8 min readLW link

Steer­ing Llama-2 with con­trastive ac­ti­va­tion additions

2 Jan 2024 0:47 UTC
123 points
29 comments8 min readLW link
(arxiv.org)

Eval­u­at­ing LLaMA 3 for poli­ti­cal syco­phancy

alma.liezenga28 Sep 2024 19:02 UTC
1 point
2 comments6 min readLW link

Re­duc­ing syco­phancy and im­prov­ing hon­esty via ac­ti­va­tion steering

Nina Panickssery28 Jul 2023 2:46 UTC
120 points
17 comments9 min readLW link

Two new datasets for eval­u­at­ing poli­ti­cal syco­phancy in LLMs

alma.liezenga28 Sep 2024 18:29 UTC
1 point
0 comments9 min readLW link

An­tag­o­nis­tic AI

Xybermancer1 Mar 2024 18:50 UTC
−8 points
1 comment1 min readLW link
No comments.