Zhijing Jin

Karma: 160

Angles of attack for continual learning safety

16 Jun 2026 16:15 UTC
47 points
0 comments, 13 min read

How might continual learning affect safety and alignment?

13 Jun 2026 17:34 UTC
59 points
2 comments, 16 min read

What’s Continual Learning, and Why Might We Expect To See It In Advanced LLM Agents?

12 Jun 2026 18:43 UTC
28 points
2 comments, 17 min read

Implications of Continual Learning for LLM Agents: Introduction

12 Jun 2026 18:36 UTC
48 points
0 comments, 6 min read

Explaining undesirable model behavior: (How) can influence functions help?

2 Mar 2026 11:30 UTC
18 points
0 comments, 3 min read

The Multi-Agent Minefield: Can LLMs Cooperate to Avoid Global Catastrophe?

17 Feb 2026 16:55 UTC
14 points
2 comments, 5 min read

Testing the Authoritarian Bias of LLMs

9 Aug 2025 18:09 UTC
10 points
1 comment, 6 min read

Why Reasoning Isn’t Enough: How LLM Agents Struggle with Ethics and Cooperation

28 Jun 2025 20:43 UTC
6 points
0 comments, 4 min read

Investigating Accidental Misalignment: Causal Effects of Fine-Tuning Data on Model Vulnerability

11 Jun 2025 19:30 UTC
6 points
0 comments, 5 min read

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

22 Apr 2025 19:25 UTC
24 points
3 comments, 5 min read

Welcome to Apply: The 2024 Vitalik Buterin Fellowships in AI Existential Safety by FLI!

Zhijing Jin, 25 Sep 2023 18:42 UTC
5 points
2 comments, 2 min read