janus

Karma: 4,163

what makes Claude 3 Opus misaligned

janus

10 Jul 2025 20:06 UTC

126 points

13 comments · 5 min read · LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger

8 Jul 2025 21:49 UTC

159 points

14 comments · 5 min read · LW link

(arxiv.org)

Economics of Claude 3 Opus Inference

Antra Tessera and janus

7 Jul 2025 15:53 UTC

38 points

0 comments · 11 min read · LW link

How LLMs are and are not myopic

janus

25 Jul 2023 2:19 UTC

140 points

16 comments · 8 min read · LW link

[Simulators seminar sequence] #2 Semiotic physics—revamped

Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer and remember

27 Feb 2023 0:25 UTC

22 points

23 comments · 13 min read · LW link

Cyborgism

Niki Dupuis and janus

10 Feb 2023 14:47 UTC

339 points

47 comments · 35 min read · LW link · 2 reviews

Anomalous tokens reveal the original identities of Instruct models

9 Feb 2023 1:30 UTC

141 points

16 comments · 9 min read · LW link

(generative.ink)

Gradient Filtering

Jozdien and janus

18 Jan 2023 20:09 UTC

56 points

16 comments · 13 min read · LW link

Language Ex Machina

janus

15 Jan 2023 9:19 UTC

47 points

25 comments · 24 min read · LW link

(generative.ink)

Simulacra are Things

janus

8 Jan 2023 23:03 UTC

64 points

7 comments · 2 min read · LW link

[Simulators seminar sequence] #1 Background & shared assumptions

Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer and remember

2 Jan 2023 23:48 UTC

50 points

4 comments · 3 min read · LW link

Results from a survey on tool use and workflows in alignment research

jacquesthibs, Jan, janus and Logan Riggs

19 Dec 2022 15:19 UTC

79 points

2 comments · 19 min read · LW link

Searching for Search

Niki Dupuis and janus

28 Nov 2022 15:31 UTC

98 points

9 comments · 14 min read · LW link · 1 review

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus

19 Nov 2022 23:51 UTC

71 points

8 comments · 2 min read · LW link

[simulation] 4chan user claiming to be the attorney hired by Google’s sentient chatbot LaMDA shares wild details of encounter

janus

10 Nov 2022 21:39 UTC

19 points

1 comment · 13 min read · LW link

(generative.ink)

Mysteries of mode collapse

janus

8 Nov 2022 10:37 UTC

303 points

57 comments · 14 min read · LW link · 1 review

Simulators

janus

2 Sep 2022 12:45 UTC

713 points

170 comments · 41 min read · LW link · 8 reviews

(generative.ink)

A descriptive, not prescriptive, overview of current AI Alignment Research

Jan, Logan Riggs, jacquesthibs and janus

6 Jun 2022 21:59 UTC

139 points

21 comments · 7 min read · LW link

A survey of tool use and workflows in alignment research

Logan Riggs, Jan, janus and jacquesthibs

23 Mar 2022 23:44 UTC

45 points

4 comments · 1 min read · LW link