
janus

Karma: 3,924

what makes Claude 3 Opus misaligned

janus · 10 Jul 2025 20:06 UTC
106 points
11 comments · 5 min read · LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

8 Jul 2025 21:49 UTC
158 points
14 comments · 5 min read · LW link
(arxiv.org)

Economics of Claude 3 Opus Inference

7 Jul 2025 15:53 UTC
35 points
0 comments · 11 min read · LW link

How LLMs are and are not myopic

janus · 25 Jul 2023 2:19 UTC
138 points
16 comments · 8 min read · LW link

[Simulators seminar sequence] #2 Semiotic physics—revamped

27 Feb 2023 0:25 UTC
24 points
23 comments · 13 min read · LW link

Cyborgism

10 Feb 2023 14:47 UTC
335 points
47 comments · 35 min read · LW link · 2 reviews

Anomalous tokens reveal the original identities of Instruct models

9 Feb 2023 1:30 UTC
141 points
16 comments · 9 min read · LW link
(generative.ink)

Gradient Filtering

18 Jan 2023 20:09 UTC
56 points
16 comments · 13 min read · LW link

Language Ex Machina

janus · 15 Jan 2023 9:19 UTC
43 points
25 comments · 24 min read · LW link
(generative.ink)

Simulacra are Things

janus · 8 Jan 2023 23:03 UTC
63 points
7 comments · 2 min read · LW link

[Simulators seminar sequence] #1 Background & shared assumptions

2 Jan 2023 23:48 UTC
50 points
4 comments · 3 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
79 points
2 comments · 19 min read · LW link

Searching for Search

28 Nov 2022 15:31 UTC
97 points
9 comments · 14 min read · LW link · 1 review

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus · 19 Nov 2022 23:51 UTC
71 points
8 comments · 2 min read · LW link

[simulation] 4chan user claiming to be the attorney hired by Google’s sentient chatbot LaMDA shares wild details of encounter

janus · 10 Nov 2022 21:39 UTC
19 points
1 comment · 13 min read · LW link
(generative.ink)

Mysteries of mode collapse

janus · 8 Nov 2022 10:37 UTC
285 points
57 comments · 14 min read · LW link · 1 review

Simulators

janus · 2 Sep 2022 12:45 UTC
671 points
168 comments · 41 min read · LW link · 8 reviews
(generative.ink)

A descriptive, not prescriptive, overview of current AI Alignment Research

6 Jun 2022 21:59 UTC
139 points
21 comments · 7 min read · LW link

A survey of tool use and workflows in alignment research

23 Mar 2022 23:44 UTC
45 points
4 comments · 1 min read · LW link