Robert Kirk

Karma: 28

Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

22 Dec 2025 19:33 UTC
17 points
0 comments · 1 min read · LW link

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

22 Dec 2025 19:32 UTC
15 points
0 comments · 30 min read · LW link

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

4 Jul 2025 0:07 UTC
13 points
1 comment · 4 min read · LW link
(far.ai)