Joe Kwon
Karma: 157
How Robust Is Monitoring Against Secret Loyalties?
Joe Kwon | 26 Feb 2026 15:50 UTC | 8 points | 0 comments | 5 min read

Reasoning Traces as a Path to Data-Efficient Generalization in Data Poisoning
Joe Kwon | 25 Feb 2026 18:17 UTC | 14 points | 0 comments | 3 min read

The Easiest Route to Secret Loyalty May Be Hijacking the Model’s Chain of Command
Joe Kwon | 24 Feb 2026 17:47 UTC | 16 points | 1 comment | 5 min read

Pre-training data poisoning likely makes installing secret loyalties easier
Joe Kwon | 23 Feb 2026 18:12 UTC | 12 points | 0 comments | 4 min read

How Secret Loyalty Differs from Standard Backdoor Threats
Joe Kwon | 12 Feb 2026 18:48 UTC | 23 points | 4 comments | 12 min read

[Question] Are there any groupchats for people working on Representation reading/control, activation steering type experiments?
Joe Kwon | 20 May 2024 18:03 UTC | 3 points | 1 comment | 1 min read

Claude wants to be conscious
Joe Kwon | 13 Apr 2024 1:40 UTC | 2 points | 8 comments | 6 min read

[Linkpost] Faith and Fate: Limits of Transformers on Compositionality
Joe Kwon | 16 Jun 2023 15:04 UTC | 19 points | 4 comments | 1 min read | (arxiv.org)

The Intrinsic Interplay of Human Values and Artificial Intelligence: Navigating the Optimization Challenge
Joe Kwon | 5 Jun 2023 20:41 UTC | 2 points | 1 comment | 18 min read

Paper: Forecasting world events with neural nets
Owain_Evans, Dan H and Joe Kwon | 1 Jul 2022 19:40 UTC | 39 points | 3 comments | 4 min read

Converging toward a Million Worlds
Joe Kwon | 24 Dec 2021 21:33 UTC | 11 points | 1 comment | 3 min read

[Question] Partial-Consciousness as semantic/symbolic representational language model trained on NN
Joe Kwon | 16 Mar 2021 18:51 UTC | 2 points | 3 comments | 1 min read

Joe Kwon’s Shortform
Joe Kwon | 16 Mar 2021 1:18 UTC | 1 point | 0 comments | 1 min read

[Question] Value of building an online “knowledge web”
Joe Kwon | 1 May 2020 4:31 UTC | 2 points | 8 comments | 1 min read