Bartosz Cywiński
Karma: 152
MATS 8.0 scholar with Arthur Conmy and Sam Marks
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Bartosz Cywiński, Helena Casademunt, Khoi Tran, aryaj, Sam Marks and Neel Nanda
9 Mar 2026 18:50 UTC, 30 points, 2 comments, 5 min read
Can we interpret latent reasoning using current mechanistic interpretability tools?
Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda and Senthooran Rajamanoharan
22 Dec 2025 16:56 UTC, 44 points, 1 comment, 9 min read
Current LLMs seem to rarely detect CoT tampering
Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan and Josh Engels
19 Nov 2025 15:27 UTC, 56 points, 0 comments, 20 min read
Eliciting secret knowledge from language models
Bartosz Cywiński, Arthur Conmy and Sam Marks
2 Oct 2025 20:57 UTC, 68 points, 3 comments, 2 min read
(arxiv.org)