Moksh Nirvaan
Karma: 42
Helping Friends, Harming Foes: Testing Tribalism in Language Models
Irakli Shalibashvili, Jer Ren Wong, Moksh Nirvaan, Diogo Cruz and Eyon Jang
11 Mar 2026 12:06 UTC, 10 points, 0 comments, 9 min read
A Behavioural and Representational Evaluation of Goal-directedness in Language Model Agents
Gabriele Sarti, Raghu Arghal, ndalton, Fade Chen, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan and Mario Giulianelli
5 Mar 2026 1:08 UTC, 20 points, 0 comments, 7 min read
Modelling, Measuring, and Intervening on Goal-directed Behaviour in AI Systems
Mario Giulianelli, Raghu Arghal, Fade Chen, ndalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan and Gabriele Sarti
31 Oct 2025 1:28 UTC, 15 points, 0 comments, 8 min read
Probing Power-Seeking in LLMs
Moksh Nirvaan
13 Aug 2025 16:04 UTC, 8 points, 0 comments, 12 min read
Will AGI Emerge Through Self-Generated Reward Loops?
Moksh Nirvaan
30 Jul 2025 13:17 UTC, 5 points, 0 comments, 1 min read