Kei Nishimura-Gasparian

Karma: 567

Research note on window shifting training

17 Mar 2026 15:58 UTC
26 points
1 comment · 15 min read · LW link

Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

22 Dec 2025 19:33 UTC
17 points
0 comments · 1 min read · LW link

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

22 Dec 2025 19:32 UTC
15 points
0 comments · 30 min read · LW link

Can you find the steganographically hidden message?

Kei Nishimura-Gasparian · 20 Oct 2025 17:29 UTC
49 points
2 comments · 7 min read · LW link

Early Signs of Steganographic Capabilities in Frontier LLMs

4 Jul 2025 16:36 UTC
33 points
5 comments · 2 min read · LW link

Reward hacking is becoming more sophisticated and deliberate in frontier LLMs

Kei Nishimura-Gasparian · 24 Apr 2025 16:03 UTC
97 points
6 comments · 1 min read · LW link

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
149 points
15 comments · 13 min read · LW link

Kei’s Shortform

Kei Nishimura-Gasparian · 27 Jan 2025 7:23 UTC
3 points
11 comments · 1 min read · LW link

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
85 points
5 comments · 21 min read · LW link