Kei

Karma: 357

Reward hacking is becoming more sophisticated and deliberate in frontier LLMs

Kei · Apr 24, 2025, 4:03 PM
76 points
6 comments · 1 min read · LW link

Auditing language models for hidden objectives

Mar 13, 2025, 7:18 PM
141 points
15 comments · 13 min read · LW link

Kei’s Shortform

Kei · Jan 27, 2025, 7:23 AM
3 points
5 comments · LW link

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
79 points
5 comments · 21 min read · LW link