An embedding decoder model, trained with a different objective on a different dataset, can decode another model’s embeddings surprisingly accurately
Seems pretty relevant to the Natural Categories hypothesis.
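A toy sketch of why this is the kind of result the Natural Categories hypothesis would predict (this is an illustrative setup I made up, not the linked paper's method): two "models" embed the same latent features into different spaces, and a decoder trained only on model A transfers to model B once a small linear adapter aligns the spaces. All names and numbers here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Two "models" embed the same 10-dim latent features into
# different 64-dim embedding spaces via different random maps.
latents = rng.normal(size=(1000, 10))
emb_a = latents @ rng.normal(size=(10, 64))  # model A's embeddings
emb_b = latents @ rng.normal(size=(10, 64))  # model B's embeddings

# Train a decoder on model A's embeddings only.
decoder = Ridge(alpha=1.0).fit(emb_a, latents)

# Applying it to model B's embeddings directly fails: the spaces differ.
print("direct transfer R^2:", decoder.score(emb_b, latents))

# But because both spaces encode the same latents, a cheap linear
# adapter fit on a small paired sample aligns B's space to A's,
# after which the frozen decoder recovers the latents almost exactly.
adapter = Ridge(alpha=1.0).fit(emb_b[:100], emb_a[:100])
print("adapted transfer R^2:", decoder.score(adapter.predict(emb_b), latents))
```

The point of the toy: if embeddings from independently trained models are (roughly) different coordinatizations of the same underlying features, decoders transfer across models far more easily than you'd naively expect.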
P.S.
My current favorite story for "how we solve alignment" is:

1. Solve the natural categories hypothesis.
2. Add corrigibility.
3. Combine these to build an AI that "does what I meant, not what I said".
4. Distribute the code/a foundation model for such an AI as widely as possible, so it becomes the default whenever anyone is building an AI.
5. Build some kind of "coalition of the willing" to make sure that human-compatible AI always has a big margin of advantage in terms of computation.