Sycophancy

Tag. Last edit: 18 Dec 2023 23:00 UTC by Maxime Riché

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
120 points
29 comments, 8 min read, LW link
(arxiv.org)

Reducing sycophancy and improving honesty via activation steering

Nina Rimsky, 28 Jul 2023 2:46 UTC
117 points
16 comments, 9 min read, LW link

Antagonistic AI

Xybermancer, 1 Mar 2024 18:50 UTC
−8 points
1 comment, 1 min read, LW link