LLMs seem (relatively) safe
JustisMills | 25 Apr 2024 22:13 UTC | 48 points | 18 comments | 7 min read | (justismills.substack.com)
[Aspiration-based designs] 2. Formal framework, basic algorithm
Jobst Heitzig, Simon Dima and Simon Fischer | 28 Apr 2024 13:02 UTC | 18 points | 2 comments | 16 min read
Searching for Searching for Search
Rubi J. Hudson | 14 Feb 2024 23:51 UTC | 21 points | 3 comments | 7 min read
Big-endian is better than little-endian
Menotim | 29 Apr 2024 2:30 UTC | 27 points | 14 comments | 3 min read
On Not Pulling The Ladder Up Behind You
Screwtape | 26 Apr 2024 21:58 UTC | 120 points | 10 comments | 9 min read
UDT1.01: The Story So Far (1/10)
Diffractor | 27 Mar 2024 23:22 UTC | 31 points | 5 comments | 13 min read
Ironing Out the Squiggles
Zack_M_Davis | 29 Apr 2024 16:13 UTC | 87 points | 6 comments | 11 min read
Thoughts on seed oil
dynomight | 20 Apr 2024 12:29 UTC | 266 points | 94 comments | 17 min read | (dynomight.net)
Refusal in LLMs is mediated by a single direction
Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda | 27 Apr 2024 11:13 UTC | 142 points | 52 comments | 10 min read
The Prop-room and Stage Cognitive Architecture
Robert Kralisch | 29 Apr 2024 0:48 UTC | 8 points | 3 comments | 14 min read
Referential Containment
Robert Kralisch | 29 Apr 2024 0:16 UTC | 2 points | 2 comments | 3 min read
Disentangling Competence and Intelligence
Robert Kralisch | 29 Apr 2024 0:12 UTC | 16 points | 4 comments | 6 min read
Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry | 29 Apr 2024 20:57 UTC | 20 points | 3 comments | 11 min read
Estimating the Number of Players from Game Result Percentages
Daniel L | 28 Apr 2024 17:42 UTC | 1 point | 2 comments | 1 min read
Simple probes can catch sleeper agents
Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub | 23 Apr 2024 21:10 UTC | 117 points | 15 comments | 1 min read | (www.anthropic.com)
Losing Faith In Contrarianism
omnizoid | 25 Apr 2024 20:53 UTC | 31 points | 40 comments | 5 min read
Towards a formalization of the agent structure problem
Alex_Altair | 29 Apr 2024 20:28 UTC | 28 points | 0 comments | 14 min read
[Question] Examples of Highly Counterfactual Discoveries?
johnswentworth | 23 Apr 2024 22:19 UTC | 172 points | 88 comments | 1 min read
Open Thread Spring 2024
habryka | 11 Mar 2024 19:17 UTC | 22 points | 82 comments | 1 min read
We are headed into an extreme compute overhang
devrandom | 26 Apr 2024 21:38 UTC | 38 points | 15 comments | 2 min read