
Language Models (LLMs)

Last edit: Mar 13, 2025, 5:45 PM by Raemon

Language models are computer programs that estimate the likelihood of a piece of text. “Hello, how are you?” is likely. “Hello, fnarg horses” is unlikely.

Language models can answer questions by estimating the likelihood of candidate question-and-answer pairs and selecting the most likely one. “Q: How are you? A: Very well, thank you” is a likely pair. “Q: How are you? A: Correct horse battery staple” is an unlikely pair.
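As a rough illustration of likelihood scoring, here is a minimal sketch in Python. It assumes the Hugging Face transformers library and the small GPT-2 model purely as a convenient stand-in; neither is prescribed by anything above. The sketch sums the log-probabilities the model assigns to each candidate string, so the natural-sounding answer should score higher than the nonsensical one.

```python
# Minimal sketch of likelihood scoring, assuming the Hugging Face
# `transformers` library and GPT-2 as a stand-in language model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def total_log_likelihood(text: str) -> float:
    """Return the summed log-probability the model assigns to `text`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With `labels` supplied, the model's `loss` is the mean
        # negative log-likelihood per predicted token.
        out = model(**enc, labels=enc["input_ids"])
    n_predicted = enc["input_ids"].shape[1] - 1  # the first token is not predicted
    return -out.loss.item() * n_predicted

candidates = [
    "Q: How are you? A: Very well, thank you.",
    "Q: How are you? A: Correct horse battery staple.",
]
# The natural answer should receive a higher (less negative) score.
for text in sorted(candidates, key=total_log_likelihood, reverse=True):
    print(f"{total_log_likelihood(text):8.2f}  {text}")
```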

The language models most relevant to AI safety are based on “deep learning”. Deep-learning-based language models can be “trained” to understand language better by exposing them to text written by humans, and the internet contains an enormous amount of human-written text to train on.

Deep-learning-based language models keep getting bigger and better trained. As they grow stronger, they acquire new skills, including arithmetic, explaining jokes, programming, and solving math problems.

As these models grow larger and better trained, there is a risk that they will develop dangerous capabilities. What additional skills will they acquire within a few years?

See also

Simulators
janus · Sep 2, 2022, 12:45 PM · 670 points (362 votes) · 168 comments · 41 min read · LW link · 8 reviews
(generative.ink)

Inverse Scaling Prize: Round 1 Winners
Sep 26, 2022, 7:57 PM · 93 points (54 votes) · 16 comments · 4 min read · LW link
(irmckenzie.co.uk)

Alignment Implications of LLM Successes: a Debate in One Act
Zack_M_Davis · Oct 21, 2023, 3:22 PM · 266 points (124 votes) · 56 comments · 13 min read · LW link · 2 reviews

How it feels to have your mind hacked by an AI
blaked · Jan 12, 2023, 12:33 AM · 372 points (268 votes) · 222 comments · 17 min read · LW link

How LLMs are and are not myopic
janus · Jul 25, 2023, 2:19 AM · 138 points (67 votes) · 16 comments · 8 min read · LW link

On the future of language models
owencb · Dec 20, 2023, 4:58 PM · 105 points (44 votes) · 17 comments · 36 min read · LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI
RogerDearnaley · Jul 6, 2024, 1:23 AM · 64 points (30 votes) · 41 comments · 24 min read · LW link

Try training token-level probes
StefanHex · Apr 14, 2025, 11:56 AM · 47 points (19 votes) · 6 comments · 8 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)
RogerDearnaley · Nov 28, 2023, 7:56 PM · 65 points (37 votes) · 30 comments · 11 min read · LW link

Transformer Circuits
evhub · Dec 22, 2021, 9:09 PM · 145 points (61 votes) · 4 comments · 3 min read · LW link
(transformer-circuits.pub)

A Chinese Room Containing a Stack of Stochastic Parrots
RogerDearnaley · Jan 12, 2024, 6:29 AM · 20 points (8 votes) · 3 comments · 5 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?
RogerDearnaley · Jan 5, 2024, 8:46 AM · 37 points (19 votes) · 4 comments · 2 min read · LW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?
RogerDearnaley · Jan 11, 2024, 12:56 PM · 35 points (11 votes) · 4 comments · 39 min read · LW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis
RogerDearnaley · Feb 1, 2024, 9:15 PM · 16 points (17 votes) · 15 comments · 13 min read · LW link

Mlyyrczo
lsusr · Dec 26, 2022, 7:58 AM · 44 points (47 votes) · 14 comments · 3 min read · LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor
RogerDearnaley · Jan 9, 2024, 8:42 PM · 48 points (25 votes) · 8 comments · 36 min read · LW link

Programming Refusal with Conditional Activation Steering
Bruce W. Lee · Sep 11, 2024, 8:57 PM · 41 points (12 votes) · 0 comments · 11 min read · LW link
(brucewlee.com)

The Waluigi Effect (mega-post)
Cleo Nardo · Mar 3, 2023, 3:22 AM · 645 points (504 votes) · 188 comments · 16 min read · LW link

So You Think You’ve Awoken ChatGPT
JustisMills · Jul 11, 2025, 1:01 AM · 311 points (187 votes) · 88 comments · 9 min read · LW link

AI Safety Chatbot
Dec 21, 2023, 2:06 PM · 61 points (26 votes) · 11 comments · 4 min read · LW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment
RogerDearnaley · Dec 7, 2023, 6:14 AM · 9 points (5 votes) · 0 comments · 11 min read · LW link

SolidGoldMagikarp (plus, prompt generation)
Feb 5, 2023, 10:02 PM · 687 points (432 votes) · 208 comments · 12 min read · LW link · 1 review

LLM AGI will have memory, and memory changes alignment
Seth Herd · Apr 4, 2025, 2:59 PM · 73 points (30 votes) · 15 comments · 9 min read · LW link

Rep­re­sen­ta­tion Tuning

Christopher AckermanJun 27, 2024, 5:44 PM
35 points

13 votes

Overall karma indicates overall quality.

9 comments13 min readLW link

Find­ing Neu­rons in a Haystack: Case Stud­ies with Sparse Probing

May 3, 2023, 1:30 PM
33 points

16 votes

Overall karma indicates overall quality.

6 comments2 min readLW link1 review
(arxiv.org)

LLMs Univer­sally Learn a Fea­ture Rep­re­sent­ing To­ken Fre­quency /​ Rarity

Sean OsierJun 30, 2024, 2:48 AM
13 points

8 votes

Overall karma indicates overall quality.

5 comments6 min readLW link
(github.com)

LLMs may cap­ture key com­po­nents of hu­man agency

catubcNov 17, 2022, 8:14 PM
27 points

13 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Re­sults from the lan­guage model hackathon

Esben KranOct 10, 2022, 8:29 AM
22 points

14 votes

Overall karma indicates overall quality.

1 comment4 min readLW link

Ap­ply­ing re­fusal-vec­tor ab­la­tion to a Llama 3 70B agent

Simon LermenMay 11, 2024, 12:08 AM
51 points

33 votes

Overall karma indicates overall quality.

14 comments7 min readLW link

Truth­ful LMs as a warm-up for al­igned AGI

Jacob_HiltonJan 17, 2022, 4:49 PM
65 points

34 votes

Overall karma indicates overall quality.

14 comments13 min readLW link

Mo­du­lat­ing syco­phancy in an RLHF model via ac­ti­va­tion steering

Nina PanicksseryAug 9, 2023, 7:06 AM
69 points

30 votes

Overall karma indicates overall quality.

20 comments12 min readLW link

Test­ing PaLM prompts on GPT3

YitzApr 6, 2022, 5:21 AM
103 points

57 votes

Overall karma indicates overall quality.

14 comments8 min readLW link

Large Lan­guage Models will be Great for Censorship

Ethan EdwardsAug 21, 2023, 7:03 PM
185 points

79 votes

Overall karma indicates overall quality.

14 comments8 min readLW link
(ethanedwards.substack.com)

In­vo­ca­tions: The Other Ca­pa­bil­ities Over­hang?

Robert_AIZIApr 4, 2023, 1:38 PM
29 points

16 votes

Overall karma indicates overall quality.

4 comments4 min readLW link
(aizi.substack.com)

In­verse Scal­ing Prize: Se­cond Round Winners

Jan 24, 2023, 8:12 PM
58 points

29 votes

Overall karma indicates overall quality.

17 comments15 min readLW link

LLM Mo­du­lar­ity: The Separa­bil­ity of Ca­pa­bil­ities in Large Lan­guage Models

NickyPMar 26, 2023, 9:57 PM
99 points

55 votes

Overall karma indicates overall quality.

3 comments41 min readLW link

Self-fulfilling mis­al­ign­ment data might be poi­son­ing our AI models

TurnTroutMar 2, 2025, 7:51 PM
154 points

85 votes

Overall karma indicates overall quality.

29 comments1 min readLW link
(turntrout.com)

Slow­down After 2028: Com­pute, RLVR Uncer­tainty, MoE Data Wall

Vladimir_NesovMay 1, 2025, 1:54 PM
196 points

79 votes

Overall karma indicates overall quality.

25 comments5 min readLW link

LLM Ba­sics: Embed­ding Spaces—Trans­former To­ken Vec­tors Are Not Points in Space

NickyPFeb 13, 2023, 6:52 PM
84 points

45 votes

Overall karma indicates overall quality.

11 comments15 min readLW link

“Tilakkhana”, Gw­ern [poem]

gwernOct 21, 2025, 2:39 AM
20 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(gwern.net)

Ex­trap­o­lat­ing from Five Words

Gordon Seidoh WorleyNov 15, 2023, 11:21 PM
40 points

21 votes

Overall karma indicates overall quality.

11 comments2 min readLW link

Owain Evans on Si­tu­a­tional Aware­ness and Out-of-Con­text Rea­son­ing in LLMs

Michaël TrazziAug 24, 2024, 4:30 AM
55 points

17 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

what makes Claude 3 Opus misaligned

janusJul 10, 2025, 8:06 PM
104 points

64 votes

Overall karma indicates overall quality.

11 comments5 min readLW link

Pro­posal for In­duc­ing Steganog­ra­phy in LMs

Logan RiggsJan 12, 2023, 10:15 PM
22 points

11 votes

Overall karma indicates overall quality.

3 comments2 min readLW link

Take 11: “Align­ing lan­guage mod­els” should be weirder.

Charlie SteinerDec 18, 2022, 2:14 PM
34 points

18 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Notes on the Math­e­mat­ics of LLM Architectures

carboniferous_umbraculum Feb 9, 2023, 1:45 AM
12 points

10 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(drive.google.com)

Be­ware Gen­eral Claims about “Gen­er­al­iz­able Rea­son­ing Ca­pa­bil­ities” (of Modern AI Sys­tems)

LawrenceCJun 11, 2025, 7:27 PM
295 points

139 votes

Overall karma indicates overall quality.

19 comments16 min readLW link

LLMs Can’t See Pix­els or Characters

Brendan LongJul 20, 2025, 8:00 PM
100 points

55 votes

Overall karma indicates overall quality.

44 comments4 min readLW link
(www.brendanlong.com)

An ex­pla­na­tion for ev­ery to­ken: us­ing an LLM to sam­ple an­other LLM

Max HOct 11, 2023, 12:53 AM
35 points

17 votes

Overall karma indicates overall quality.

5 comments11 min readLW link

Con­di­tion­ing Gen­er­a­tive Models

Adam JermynJun 25, 2022, 10:15 PM
24 points

11 votes

Overall karma indicates overall quality.

18 comments10 min readLW link

What o3 Be­comes by 2028

Vladimir_NesovDec 22, 2024, 12:37 PM
149 points

71 votes

Overall karma indicates overall quality.

15 comments5 min readLW link

Claude 3.5 Sonnet

Zach Stein-PerlmanJun 20, 2024, 6:00 PM
75 points

29 votes

Overall karma indicates overall quality.

41 comments1 min readLW link
(www.anthropic.com)

‘simu­la­tor’ fram­ing and con­fu­sions about LLMs

Beth BarnesDec 31, 2022, 11:38 PM
104 points

52 votes

Overall karma indicates overall quality.

11 comments4 min readLW link

Ex­plor­ing SAE fea­tures in LLMs with defi­ni­tion trees and to­ken lists

mwatkinsOct 4, 2024, 10:15 PM
46 points

12 votes

Overall karma indicates overall quality.

5 comments6 min readLW link

Un­ex­pected Con­scious Entities

Gunnar_ZarnckeMay 5, 2025, 10:14 PM
34 points

12 votes

Overall karma indicates overall quality.

7 comments6 min readLW link

Me­taAI: less is less for al­ign­ment.

Cleo NardoJun 13, 2023, 2:08 PM
71 points

41 votes

Overall karma indicates overall quality.

17 comments5 min readLW link

Lan­guage mod­els can gen­er­ate su­pe­rior text com­pared to their input

ChristianKlJan 17, 2023, 10:57 AM
48 points

31 votes

Overall karma indicates overall quality.

28 comments1 min readLW link

Ag­grega­tive Prin­ci­ples of So­cial Justice

Cleo NardoJun 5, 2024, 1:44 PM
29 points

12 votes

Overall karma indicates overall quality.

10 comments37 min readLW link

′ pe­ter­todd’’s last stand: The fi­nal days of open GPT-3 research

mwatkinsJan 22, 2024, 6:47 PM
109 points

52 votes

Overall karma indicates overall quality.

16 comments45 min readLW link

How to train your trans­former

p.b.Apr 7, 2022, 9:34 AM
6 points

3 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

See­ing Through the Eyes of the Algorithm

silentbobFeb 22, 2025, 11:54 AM
18 points

10 votes

Overall karma indicates overall quality.

3 comments10 min readLW link

A Pro­posed Test to Deter­mine the Ex­tent to Which Large Lan­guage Models Un­der­stand the Real World

Bruce GFeb 24, 2023, 8:20 PM
4 points

3 votes

Overall karma indicates overall quality.

7 comments8 min readLW link

No­kens: A po­ten­tial method of in­ves­ti­gat­ing glitch tokens

HoagyMar 15, 2023, 4:23 PM
21 points

12 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Proof-of-Con­cept De­bug­ger for a Small LLM

Mar 17, 2025, 10:27 PM
27 points

9 votes

Overall karma indicates overall quality.

0 comments11 min readLW link

[Question] Will 2023 be the last year you can write short sto­ries and re­ceive most of the in­tel­lec­tual credit for writ­ing them?

lcMar 16, 2023, 9:36 PM
20 points

9 votes

Overall karma indicates overall quality.

12 comments1 min readLW link

Bing Chat is blatantly, ag­gres­sively misaligned

evhubFeb 15, 2023, 5:29 AM
406 points

247 votes

Overall karma indicates overall quality.

181 comments2 min readLW link1 review

You can’t eval GPT5 anymore

Lukas PeterssonSep 18, 2025, 10:12 PM
158 points

94 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

En­hanc­ing biose­cu­rity with lan­guage mod­els: defin­ing re­search directions

micMar 26, 2024, 12:30 PM
12 points

2 votes

Overall karma indicates overall quality.

0 comments13 min readLW link
(papers.ssrn.com)

Shut­down Re­sis­tance in Rea­son­ing Models

Jul 6, 2025, 12:01 AM
138 points

56 votes

Overall karma indicates overall quality.

14 comments9 min readLW link
(palisaderesearch.org)

Teaser: Hard-cod­ing Trans­former Models

MadHatterDec 12, 2021, 10:04 PM
74 points

34 votes

Overall karma indicates overall quality.

19 comments1 min readLW link

Water­mark­ing con­sid­ered over­rated?

DanielFilanJul 31, 2023, 9:36 PM
19 points

9 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Resi­d­ual stream norms grow ex­po­nen­tially over the for­ward pass

May 7, 2023, 12:46 AM
77 points

35 votes

Overall karma indicates overall quality.

24 comments9 min readLW link

Fore­cast­ing progress in lan­guage models

Oct 28, 2021, 8:40 PM
62 points

28 votes

Overall karma indicates overall quality.

6 comments12 min readLW link
(www.metaculus.com)

Cor­rigi­bil­ity, Self-Dele­tion, and Iden­ti­cal Strawberries

Robert_AIZIMar 28, 2023, 4:54 PM
9 points

6 votes

Overall karma indicates overall quality.

2 comments6 min readLW link
(aizi.substack.com)

Model Or­ganisms of Misal­ign­ment: The Case for a New Pillar of Align­ment Research

Aug 8, 2023, 1:30 AM
322 points

137 votes

Overall karma indicates overall quality.

30 comments18 min readLW link1 review

Emer­gent Misal­ign­ment: Nar­row fine­tun­ing can pro­duce broadly mis­al­igned LLMs

Feb 25, 2025, 5:39 PM
332 points

146 votes

Overall karma indicates overall quality.

92 comments4 min readLW link

Lan­guage Models are a Po­ten­tially Safe Path to Hu­man-Level AGI

Nadav BrandesApr 20, 2023, 12:40 AM
28 points

18 votes

Overall karma indicates overall quality.

7 comments8 min readLW link1 review

On Claude 3.5 Sonnet

ZviJun 24, 2024, 12:00 PM
95 points

46 votes

Overall karma indicates overall quality.

14 comments13 min readLW link
(thezvi.wordpress.com)

Scaf­folded LLMs as nat­u­ral lan­guage computers

berenApr 12, 2023, 10:47 AM
97 points

54 votes

Overall karma indicates overall quality.

10 comments11 min readLW link

AMA Con­jec­ture, A New Align­ment Startup

adamShimiApr 9, 2022, 9:43 AM
47 points

21 votes

Overall karma indicates overall quality.

42 comments1 min readLW link

[Question] Ba­sic Ques­tion about LLMs: how do they know what task to perform

GarakJan 14, 2023, 1:13 PM
1 point

3 votes

Overall karma indicates overall quality.

3 comments1 min readLW link

New, im­proved mul­ti­ple-choice TruthfulQA

Jan 15, 2025, 11:32 PM
72 points

28 votes

Overall karma indicates overall quality.

1 comment3 min readLW link

[ASoT] Some thoughts about LM monologue limi­ta­tions and ELK

leogaoMar 30, 2022, 2:26 PM
10 points

6 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

“text­books are all you need”

bhauthJun 21, 2023, 5:06 PM
66 points

38 votes

Overall karma indicates overall quality.

18 comments2 min readLW link
(arxiv.org)

Knowl­edge, Rea­son­ing, and Superintelligence

owencbMar 26, 2025, 11:28 PM
21 points

5 votes

Overall karma indicates overall quality.

1 comment7 min readLW link
(strangecities.substack.com)

Un­der­stand­ing LLMs: In­sights from Mechanis­tic Interpretability

Stephen McAleeseAug 30, 2025, 4:50 PM
43 points

22 votes

Overall karma indicates overall quality.

2 comments30 min readLW link

Im­ple­ment­ing ac­ti­va­tion steering

AnnahFeb 5, 2024, 5:51 PM
76 points

34 votes

Overall karma indicates overall quality.

8 comments7 min readLW link

Can I take ducks home from the park?

dynomightSep 14, 2023, 9:03 PM
67 points

46 votes

Overall karma indicates overall quality.

8 comments3 min readLW link
(dynomight.net)

LLM Ap­pli­ca­tions I Want To See

sarahconstantinAug 19, 2024, 9:10 PM
102 points

45 votes

Overall karma indicates overall quality.

6 comments8 min readLW link
(sarahconstantin.substack.com)

Ts­inghua pa­per: Does RL Really In­cen­tivize Rea­son­ing Ca­pac­ity in LLMs Beyond the Base Model?

Thomas KwaMay 5, 2025, 6:56 PM
69 points

36 votes

Overall karma indicates overall quality.

21 comments2 min readLW link
(arxiv.org)

Num­ber­wang: LLMs Do­ing Au­tonomous Re­search, and a Call for Input

Jan 16, 2025, 5:20 PM
71 points

35 votes

Overall karma indicates overall quality.

30 comments31 min readLW link

Ac­ti­va­tion adding ex­per­i­ments with llama-7b

Nina PanicksseryJul 16, 2023, 4:17 AM
51 points

26 votes

Overall karma indicates overall quality.

1 comment3 min readLW link

LLMs one-box when in a “hos­tile telepath” ver­sion of New­comb’s Para­dox, ex­cept for the one that beat the predictor

Kaj_SotalaOct 6, 2025, 8:44 AM
52 points

21 votes

Overall karma indicates overall quality.

6 comments17 min readLW link

Fron­tier LLM Race/​Sex Ex­change Rates

Arjun PanicksseryOct 19, 2025, 6:36 PM
48 points

37 votes

Overall karma indicates overall quality.

10 comments3 min readLW link
(arctotherium.substack.com)

LLM robots can’t pass but­ter (and they are hav­ing an ex­is­ten­tial crisis about it)

Lukas PeterssonOct 28, 2025, 2:14 PM
98 points

50 votes

Overall karma indicates overall quality.

6 comments4 min readLW link

And All the Shog­goths Merely Players

Zack_M_DavisFeb 10, 2024, 7:56 PM
177 points

68 votes

Overall karma indicates overall quality.

57 comments12 min readLW link

Paper: LLMs trained on “A is B” fail to learn “B is A”

Sep 23, 2023, 7:55 PM
121 points

59 votes

Overall karma indicates overall quality.

74 comments4 min readLW link
(arxiv.org)

In Defense of Chat­bot Romance

Kaj_SotalaFeb 11, 2023, 2:30 PM
125 points

77 votes

Overall karma indicates overall quality.

53 comments11 min readLW link
(kajsotala.fi)

[Question] Sup­pos­ing the 1bit LLM pa­per pans out

O OFeb 29, 2024, 5:31 AM
27 points

11 votes

Overall karma indicates overall quality.

11 comments1 min readLW link

[Question] Does a LLM have a util­ity func­tion?

DagonDec 9, 2022, 5:19 PM
17 points

6 votes

Overall karma indicates overall quality.

11 comments1 min readLW link

[Question] Is there a ‘time se­ries fore­cast­ing’ equiv­a­lent of AIXI?

Solenoid_EntityMay 17, 2023, 4:35 AM
12 points

3 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Steer­ing Be­havi­our: Test­ing for (Non-)My­opia in Lan­guage Models

Dec 5, 2022, 8:28 PM
40 points

19 votes

Overall karma indicates overall quality.

19 comments10 min readLW link

Open Source LLM Poké­mon Scaffold

Julian BradshawApr 27, 2025, 12:57 AM
24 points

11 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(github.com)

What’s up with LLMs rep­re­sent­ing XORs of ar­bi­trary fea­tures?

Sam MarksJan 3, 2024, 7:44 PM
159 points

69 votes

Overall karma indicates overall quality.

64 comments16 min readLW link

Con­di­tion­ing Pre­dic­tive Models: In­ter­ac­tions with other approaches

Feb 8, 2023, 6:19 PM
32 points

10 votes

Overall karma indicates overall quality.

2 comments11 min readLW link

[Question] If I ask an LLM to think step by step, how big are the steps?

ryan_bSep 13, 2024, 8:30 PM
7 points

3 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

Un­der­stand­ing the diffu­sion of large lan­guage mod­els: summary

Ben CottierJan 16, 2023, 1:37 AM
26 points

10 votes

Overall karma indicates overall quality.

1 comment22 min readLW link

What do lan­guage mod­els know about fic­tional char­ac­ters?

skybrianFeb 22, 2023, 5:58 AM
6 points

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

New OpenAI Paper—Lan­guage mod­els can ex­plain neu­rons in lan­guage models

MrThinkMay 10, 2023, 7:46 AM
47 points

19 votes

Overall karma indicates overall quality.

14 comments1 min readLW link

What would a hu­man pre­tend­ing to be an AI say?

Brendan LongAug 8, 2025, 6:56 PM
54 points

28 votes

Overall karma indicates overall quality.

19 comments1 min readLW link
(www.brendanlong.com)

Deep­mind’s Go­pher—more pow­er­ful than GPT-3

hathDec 8, 2021, 5:06 PM
87 points

39 votes

Overall karma indicates overall quality.

26 comments1 min readLW link
(deepmind.com)

Steer­ing GPT-2-XL by adding an ac­ti­va­tion vector

May 13, 2023, 6:42 PM
439 points

206 votes

Overall karma indicates overall quality.

98 comments50 min readLW link1 review

“LLMs Don’t Have a Co­her­ent Model of the World”—What it Means, Why it Mat­ters

DavidmanheimJun 1, 2023, 7:46 AM
32 points

16 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

On the func­tional self of LLMs

eggsyntaxJul 7, 2025, 3:39 PM
113 points

49 votes

Overall karma indicates overall quality.

37 comments8 min readLW link

Con­di­tion­ing Pre­dic­tive Models: Large lan­guage mod­els as predictors

Feb 2, 2023, 8:28 PM
89 points

31 votes

Overall karma indicates overall quality.

4 comments13 min readLW link

Smar­tyHead­erCode: anoma­lous to­kens for GPT3.5 and GPT-4

AdamYedidiaApr 15, 2023, 10:35 PM
71 points

40 votes

Overall karma indicates overall quality.

18 comments6 min readLW link

Evil au­to­com­plete: Ex­is­ten­tial Risk and Next-To­ken Predictors

YitzFeb 28, 2023, 8:47 AM
9 points

4 votes

Overall karma indicates overall quality.

3 comments5 min readLW link

Lamda is not an LLM

KevinJun 19, 2022, 11:13 AM
7 points

17 votes

Overall karma indicates overall quality.

10 comments1 min readLW link
(www.wired.com)

Ro­mance, mi­s­un­der­stand­ing, so­cial stances, and the hu­man LLM

Kaj_SotalaApr 27, 2023, 12:59 PM
77 points

36 votes

Overall karma indicates overall quality.

32 comments16 min readLW link

Up­com­ing Changes in Large Lan­guage Models

Andrew Keenan RichardsonApr 8, 2023, 3:41 AM
43 points

28 votes

Overall karma indicates overall quality.

8 comments4 min readLW link
(mechanisticmind.com)

Real­is­tic Re­ward Hack­ing In­duces Differ­ent and Deeper Misalignment

JozdienOct 9, 2025, 6:45 PM
127 points

46 votes

Overall karma indicates overall quality.

2 comments23 min readLW link

More Fun With GPT-4o Image Generation

ZviApr 3, 2025, 2:10 AM
34 points

12 votes

Overall karma indicates overall quality.

3 comments8 min readLW link
(thezvi.wordpress.com)

Shorter To­kens Are More Likely

Brendan LongAug 24, 2025, 12:22 AM
98 points

47 votes

Overall karma indicates overall quality.

19 comments5 min readLW link
(www.brendanlong.com)

Wor­ri­some mi­s­un­der­stand­ing of the core is­sues with AI transition

Roman LeventovJan 18, 2024, 10:05 AM
5 points

7 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

Test­ing for par­allel rea­son­ing in LLMs

May 19, 2024, 3:28 PM
9 points

7 votes

Overall karma indicates overall quality.

7 comments9 min readLW link

Re­search Dis­cus­sion on PSCA with Claude Son­net 3.5

Robert KralischJul 24, 2024, 4:53 PM
−2 points

4 votes

Overall karma indicates overall quality.

0 comments25 min readLW link

GPT-4 can catch sub­tle cross-lan­guage trans­la­tion mistakes

Michael TontchevJul 27, 2023, 1:39 AM
7 points

4 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

Why I Believe LLMs Do Not Have Hu­man-like Emotions

OneManyNoneMay 22, 2023, 3:46 PM
13 points

13 votes

Overall karma indicates overall quality.

6 comments7 min readLW link

How Does A Blind Model See The Earth?

henryAug 11, 2025, 7:58 PM
486 points

268 votes

Overall karma indicates overall quality.

40 comments7 min readLW link
(outsidetext.substack.com)

Jailbreak steer­ing generalization

Jun 20, 2024, 5:25 PM
41 points

14 votes

Overall karma indicates overall quality.

4 comments2 min readLW link
(arxiv.org)

An­thropic re­lease Claude 3, claims >GPT-4 Performance

LawrenceCMar 4, 2024, 6:23 PM
115 points

61 votes

Overall karma indicates overall quality.

41 comments2 min readLW link
(www.anthropic.com)

ChatGPT is the Da­guerreo­type of AI

Alex_AltairAug 7, 2025, 10:14 PM
42 points

21 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Some Quick Fol­low-Up Ex­per­i­ments to “Taken out of con­text: On mea­sur­ing situ­a­tional aware­ness in LLMs”

Miles TurpinOct 3, 2023, 2:22 AM
31 points

11 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

Lan­guage Model Tools for Align­ment Research

Logan RiggsApr 8, 2022, 5:32 PM
28 points

16 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

RLHF does not ap­pear to differ­en­tially cause mode-collapse

Mar 20, 2023, 3:39 PM
95 points

42 votes

Overall karma indicates overall quality.

9 comments3 min readLW link

Towards Eval­u­at­ing AI Sys­tems for Mo­ral Sta­tus Us­ing Self-Reports

Nov 16, 2023, 8:18 PM
45 points

14 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(arxiv.org)

In­fer­ring the model di­men­sion of API-pro­tected LLMs

Ege ErdilMar 18, 2024, 6:19 AM
34 points

16 votes

Overall karma indicates overall quality.

3 comments4 min readLW link
(arxiv.org)

Claude Doesn’t Want to Die

garrisonMar 5, 2024, 6:00 AM
22 points

18 votes

Overall karma indicates overall quality.

3 comments10 min readLW link
(garrisonlovely.substack.com)

LLM Guardrails Should Have Bet­ter Cus­tomer Ser­vice Tuning

Jiao BuMay 13, 2023, 10:54 PM
2 points

3 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Dens­ing Law of LLMs

Bogdan Ionut CirsteaDec 8, 2024, 7:35 PM
9 points

7 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(arxiv.org)

What does it mean for an LLM such as GPT to be al­igned /​ good /​ pos­i­tive im­pact?

PashaKamyshevMar 20, 2023, 9:21 AM
4 points

2 votes

Overall karma indicates overall quality.

3 comments10 min readLW link

How peo­ple use LLMs

ElizabethApr 27, 2025, 9:48 PM
83 points

25 votes

Overall karma indicates overall quality.

6 comments1 min readLW link
(www.gleech.org)

Do LLMs dream of emer­gent sheep?

ShmiApr 24, 2023, 3:26 AM
16 points

6 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Dis­cov­er­ing Lan­guage Model Be­hav­iors with Model-Writ­ten Evaluations

Dec 20, 2022, 8:08 PM
100 points

45 votes

Overall karma indicates overall quality.

34 comments1 min readLW link
(www.anthropic.com)

Find­ing Sparse Lin­ear Con­nec­tions be­tween Fea­tures in LLMs

Dec 9, 2023, 2:27 AM
70 points

31 votes

Overall karma indicates overall quality.

5 comments10 min readLW link

Gam­ing Truth­fulQA: Sim­ple Heuris­tics Ex­posed Dataset Weaknesses

TurnTroutJan 16, 2025, 2:14 AM
65 points

30 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(turntrout.com)

Claude 3 Opus can op­er­ate as a Tur­ing machine

Gunnar_ZarnckeApr 17, 2024, 8:41 AM
37 points

17 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(twitter.com)

What’s up with all the non-Mor­mons? Weirdly spe­cific uni­ver­sal­ities across LLMs

mwatkinsApr 19, 2024, 1:43 PM
40 points

22 votes

Overall karma indicates overall quality.

13 comments27 min readLW link

Sparse tri­nary weighted RNNs as a path to bet­ter lan­guage model interpretability

Am8ryllisSep 17, 2022, 7:48 PM
19 points

10 votes

Overall karma indicates overall quality.

13 comments3 min readLW link

Causal Graphs of GPT-2-Small’s Resi­d­ual Stream

David UdellJul 9, 2024, 10:06 PM
53 points

18 votes

Overall karma indicates overall quality.

7 comments7 min readLW link

Re­veal­ing In­ten­tion­al­ity In Lan­guage Models Through AdaVAE Guided Sampling

jdpOct 20, 2023, 7:32 AM
119 points

50 votes

Overall karma indicates overall quality.

15 comments22 min readLW link

Teach­ing Claude to Meditate

Gordon Seidoh WorleyDec 29, 2024, 10:27 PM
−5 points

11 votes

Overall karma indicates overall quality.

4 comments23 min readLW link

The “Rev­er­sal Curse”: you still aren’t antropo­mor­phis­ing enough.

lumpenspaceMar 13, 2025, 10:24 AM
3 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(lumpenspace.substack.com)

Paper: On mea­sur­ing situ­a­tional aware­ness in LLMs

Sep 4, 2023, 12:54 PM
109 points

37 votes

Overall karma indicates overall quality.

17 comments5 min readLW link
(arxiv.org)

LLM-Se­cured Sys­tems: A Gen­eral-Pur­pose Tool For Struc­tured Transparency

ozziegooenJun 18, 2024, 12:21 AM
10 points

6 votes

Overall karma indicates overall quality.

1 comment21 min readLW link

Map­ping the se­man­tic void II: Above, be­low and be­tween to­ken em­bed­dings

mwatkinsFeb 15, 2024, 11:00 PM
31 points

12 votes

Overall karma indicates overall quality.

4 comments10 min readLW link

“AI achieves silver-medal stan­dard solv­ing In­ter­na­tional Math­e­mat­i­cal Olympiad prob­lems”

gjmJul 25, 2024, 3:58 PM
133 points

70 votes

Overall karma indicates overall quality.

38 comments2 min readLW link
(deepmind.google)

Gary Mar­cus now say­ing AI can’t do things it can already do

Benjamin_ToddFeb 9, 2025, 12:24 PM
62 points

42 votes

Overall karma indicates overall quality.

12 comments1 min readLW link
(benjamintodd.substack.com)

LLMs Are Trained to As­sume Their Out­put Is Perfect

Brendan LongAug 26, 2025, 12:24 AM
10 points

9 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Ex­plor­ing the pe­ter­todd /​ Leilan du­al­ity in GPT-2 and GPT-J

mwatkinsDec 23, 2024, 1:17 PM
12 points

8 votes

Overall karma indicates overall quality.

1 comment17 min readLW link

Why did ChatGPT say that? Prompt en­g­ineer­ing and more, with PIZZA.

Jessica RumbelowAug 3, 2024, 12:07 PM
43 points

30 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

Deep learn­ing cur­ricu­lum for large lan­guage model alignment

Jacob_HiltonJul 13, 2022, 9:58 PM
57 points

23 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(github.com)

[Linkpost] Play with SAEs on Llama 3

Sep 25, 2024, 10:35 PM
41 points

16 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Mus­ings on LLM Scale (Jul 2024)

Vladimir_NesovJul 3, 2024, 6:35 PM
34 points

16 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

AI com­pa­nies’ eval re­ports mostly don’t sup­port their claims

Zach Stein-PerlmanJun 9, 2025, 1:00 PM
207 points

72 votes

Overall karma indicates overall quality.

13 comments4 min readLW link

Lan­guage Model Align­ment Re­search Internships

Ethan PerezDec 13, 2021, 7:53 PM
74 points

36 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

the void

nostalgebraistJun 11, 2025, 3:19 AM
385 points

160 votes

Overall karma indicates overall quality.

107 comments1 min readLW link
(nostalgebraist.tumblr.com)

In­flec­tion AI: New startup re­lated to lan­guage models

NisanApr 2, 2022, 5:35 AM
21 points

5 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

You can get LLMs to say al­most any­thing you want

Kaj_SotalaJul 13, 2025, 4:30 PM
82 points

43 votes

Overall karma indicates overall quality.

10 comments14 min readLW link

So­ci­aLLM: pro­posal for a lan­guage model de­sign for per­son­al­ised apps, so­cial sci­ence, and AI safety research

Roman LeventovDec 19, 2023, 4:49 PM
17 points

7 votes

Overall karma indicates overall quality.

5 comments3 min readLW link

Us­ing Claude to con­vert di­a­log tran­scripts into great posts?

mako yassJun 21, 2023, 8:19 PM
6 points

5 votes

Overall karma indicates overall quality.

4 comments4 min readLW link

Does Chat-GPT dis­play ‘Scope Insen­si­tivity’?

callumDec 7, 2023, 6:58 PM
12 points

5 votes

Overall karma indicates overall quality.

1 comment3 min readLW link

Eleuther re­leases Llemma: An Open Lan­guage Model For Mathematics

mako yassOct 17, 2023, 8:03 PM
22 points

19 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(blog.eleuther.ai)

[Linkpost] Vague Ver­biage in Forecasting

trevorMar 22, 2024, 6:05 PM
11 points

4 votes

Overall karma indicates overall quality.

9 comments3 min readLW link
(goodjudgment.com)

[Question] Is In­struc­tGPT Fol­low­ing In­struc­tions in Other Lan­guages Sur­pris­ing?

DragonGodFeb 13, 2023, 11:26 PM
39 points

19 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

Minerva

AlgonJul 1, 2022, 8:06 PM
36 points

18 votes

Overall karma indicates overall quality.

6 comments2 min readLW link
(ai.googleblog.com)

When is it im­por­tant that open-weight mod­els aren’t re­leased? My thoughts on the benefits and dan­gers of open-weight mod­els in re­sponse to de­vel­op­ments in CBRN ca­pa­bil­ities.

ryan_greenblattJun 9, 2025, 7:19 PM
63 points

21 votes

Overall karma indicates overall quality.

11 comments9 min readLW link

Nav­i­gat­ing LLM em­bed­ding spaces us­ing archetype-based directions

mwatkinsMay 8, 2024, 5:54 AM
16 points

12 votes

Overall karma indicates overall quality.

4 comments28 min readLW link

Me, My­self, and AI: the Si­tu­a­tional Aware­ness Dataset (SAD) for LLMs

Jul 8, 2024, 10:24 PM
109 points

44 votes

Overall karma indicates overall quality.

39 comments5 min readLW link

At­ten­tion Out­put SAEs Im­prove Cir­cuit Analysis

Jun 21, 2024, 12:56 PM
33 points

19 votes

Overall karma indicates overall quality.

3 comments19 min readLW link

Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Supervision

XodarapDec 14, 2022, 12:32 PM
45 points

19 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(arxiv.org)

Creative writ­ing with LLMs, part 1: Prompt­ing for fiction

Kaj_SotalaJul 21, 2025, 8:47 AM
38 points

16 votes

Overall karma indicates overall quality.

10 comments20 min readLW link

Ex­am­ples of How I Use LLMs

jefftkOct 14, 2024, 5:10 PM
31 points

17 votes

Overall karma indicates overall quality.

2 comments2 min readLW link
(www.jefftk.com)

Con­di­tion­ing Pre­dic­tive Models: Outer al­ign­ment via care­ful conditioning

Feb 2, 2023, 8:28 PM
72 points

26 votes

Overall karma indicates overall quality.

15 comments57 min readLW link

[Question] Are We Leav­ing Liter­a­ture To The Psy­chotic?

YitzOct 9, 2025, 6:09 AM
14 points

6 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Re­search Re­port: Sparse Au­toen­coders find only 9/​180 board state fea­tures in OthelloGPT

Robert_AIZIMar 5, 2024, 1:55 PM
61 points

28 votes

Overall karma indicates overall quality.

24 comments10 min readLW link
(aizi.substack.com)

PaLM-2 & GPT-4 in “Ex­trap­o­lat­ing GPT-N perfor­mance”

Lukas FinnvedenMay 30, 2023, 6:33 PM
57 points

27 votes

Overall karma indicates overall quality.

6 comments6 min readLW link

[Linkpost] The lethal trifecta for AI agents: pri­vate data, un­trusted con­tent, and ex­ter­nal communication

Gunnar_ZarnckeJun 17, 2025, 4:09 PM
13 points

3 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(simonwillison.net)

Strat­egy For Con­di­tion­ing Gen­er­a­tive Models

Sep 1, 2022, 4:34 AM
31 points

13 votes

Overall karma indicates overall quality.

4 comments18 min readLW link

A lit­tle play­ing around with Blen­der­bot3

Nathan Helm-BurgerAug 12, 2022, 4:06 PM
9 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Study­ing The Alien Mind

Dec 5, 2023, 5:27 PM
80 points

47 votes

Overall karma indicates overall quality.

10 comments15 min readLW link

Un­der­stand­ing the ten­sor product for­mu­la­tion in Trans­former Circuits

Tom LieberumDec 24, 2021, 6:05 PM
16 points

8 votes

Overall karma indicates overall quality.

2 comments3 min readLW link

Lin­ear en­cod­ing of char­ac­ter-level in­for­ma­tion in GPT-J to­ken embeddings

Nov 10, 2023, 10:19 PM
34 points

12 votes

Overall karma indicates overall quality.

4 comments28 min readLW link

AI Sleeper Agents: How An­thropic Trains and Catches Them—Video

WriterAug 30, 2025, 5:53 PM
9 points

5 votes

Overall karma indicates overall quality.

0 comments7 min readLW link
(youtu.be)

Google’s PaLM-E: An Em­bod­ied Mul­ti­modal Lan­guage Model

SandXboxMar 7, 2023, 4:11 AM
87 points

48 votes

Overall karma indicates overall quality.

7 comments1 min readLW link
(palm-e.github.io)

A one-ques­tion Tur­ing test for GPT-3

Jan 22, 2022, 6:17 PM
88 points

51 votes

Overall karma indicates overall quality.

25 comments5 min readLW link

[Linkpost] Scal­ing Laws for Gen­er­a­tive Mixed-Mo­dal Lan­guage Models

Amal Jan 12, 2023, 2:24 PM
15 points

6 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(arxiv.org)

Tell me about your­self: LLMs are aware of their learned behaviors

Jan 22, 2025, 12:47 AM
132 points

57 votes

Overall karma indicates overall quality.

5 comments6 min readLW link

Role Ar­chi­tec­tures: Ap­ply­ing LLMs to con­se­quen­tial tasks

Eric DrexlerMar 30, 2023, 3:00 PM
60 points

23 votes

Overall karma indicates overall quality.

7 comments9 min readLW link

NVIDIA and Microsoft re­leases 530B pa­ram­e­ter trans­former model, Me­ga­tron-Tur­ing NLG

OzyrusOct 11, 2021, 3:28 PM
51 points

26 votes

Overall karma indicates overall quality.

36 comments1 min readLW link
(developer.nvidia.com)

LIMA: Less Is More for Alignment

Ulisse MiniMay 30, 2023, 5:10 PM
16 points

3 votes

Overall karma indicates overall quality.

6 comments1 min readLW link
(arxiv.org)

Con­di­tion­ing Pre­dic­tive Models: De­ploy­ment strategy

Feb 9, 2023, 8:59 PM
28 points

8 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

De­tect­ing Strate­gic De­cep­tion Us­ing Lin­ear Probes

Feb 6, 2025, 3:46 PM
104 points

35 votes

Overall karma indicates overall quality.

9 comments2 min readLW link
(arxiv.org)

Sur­pris­ing LLM rea­son­ing failures make me think we still need qual­i­ta­tive break­throughs for AGI

Kaj_SotalaApr 15, 2025, 3:56 PM
174 points

95 votes

Overall karma indicates overall quality.

52 comments18 min readLW link

I don’t find the lie de­tec­tion re­sults that sur­pris­ing (by an au­thor of the pa­per)

JanBOct 4, 2023, 5:10 PM
97 points

49 votes

Overall karma indicates overall quality.

8 comments3 min readLW link

Sparse Au­toen­coders Find Highly In­ter­pretable Direc­tions in Lan­guage Models

Sep 21, 2023, 3:30 PM
159 points

61 votes

Overall karma indicates overall quality.

8 comments5 min readLW link

“On the Im­pos­si­bil­ity of Su­per­in­tel­li­gent Ru­bik’s Cube Solvers”, Claude 2024 [hu­mor]

gwernJun 23, 2024, 9:18 PM
22 points

16 votes

Overall karma indicates overall quality.

6 comments1 min readLW link
(gwern.net)

Emer­gent Abil­ities of Large Lan­guage Models [Linkpost]

aogAug 10, 2022, 6:02 PM
25 points

12 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(arxiv.org)

LLMs as a Plan­ning Overhang

LarksJul 14, 2024, 2:54 AM
38 points

22 votes

Overall karma indicates overall quality.

8 comments2 min readLW link

Case for Foun­da­tion Models be­yond English

Varshul GuptaJul 21, 2023, 1:59 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link
(dubverseblack.substack.com)

Boot­strap­ping Lan­guage Models

harsimonyMay 27, 2022, 7:43 PM
7 points

5 votes

Overall karma indicates overall quality.

5 comments2 min readLW link

[Question] Are lan­guage mod­els close to the su­per­hu­man level in philos­o­phy?

Roman LeventovAug 19, 2022, 4:43 AM
6 points

6 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Con­di­tion­ing Gen­er­a­tive Models for Alignment

JozdienJul 18, 2022, 7:11 AM
60 points

29 votes

Overall karma indicates overall quality.

8 comments20 min readLW link

Is Gem­ini now bet­ter than Claude at Poké­mon?

Julian BradshawApr 19, 2025, 11:34 PM
91 points

51 votes

Overall karma indicates overall quality.

12 comments5 min readLW link

[Link] Train­ing Com­pute-Op­ti­mal Large Lan­guage Models

nostalgebraistMar 31, 2022, 6:01 PM
51 points

25 votes

Overall karma indicates overall quality.

23 comments1 min readLW link
(arxiv.org)

Goal-Direc­tion for Si­mu­lated Agents

Raymond DouglasJul 12, 2023, 5:06 PM
33 points

13 votes

Overall karma indicates overall quality.

2 comments6 min readLW link

LEAst-squares Con­cept Era­sure (LEACE)

tricky_labyrinthJun 7, 2023, 9:51 PM
68 points

30 votes

Overall karma indicates overall quality.

10 comments1 min readLW link
(twitter.com)

I didn’t think I’d take the time to build this cal­ibra­tion train­ing game, but with web­sim it took roughly 30 sec­onds, so here it is!

mako yassAug 2, 2024, 10:35 PM
24 points

9 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

[Question] Which parts of the ex­ist­ing in­ter­net are already likely to be in (GPT-5/​other soon-to-be-trained LLMs)’s train­ing cor­pus?

AnnaSalamonMar 29, 2023, 5:17 AM
49 points

13 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Dou­glas Hofs­tadter changes his mind on Deep Learn­ing & AI risk (June 2023)?

gwernJul 3, 2023, 12:48 AM
428 points

214 votes

Overall karma indicates overall quality.

54 comments7 min readLW link
(www.youtube.com)

Three of my be­liefs about up­com­ing AGI

Robert_AIZIMar 27, 2023, 8:27 PM
6 points

3 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(aizi.substack.com)

Creative writ­ing with LLMs, part 2: Co-writ­ing techniques

Kaj_SotalaAug 3, 2025, 6:44 AM
1 point

6 votes

Overall karma indicates overall quality.

0 comments18 min readLW link

Dwarf Fortress and Claude’s ASCII Art Blindness

Brendan LongAug 11, 2025, 4:05 PM
16 points

8 votes

Overall karma indicates overall quality.

1 comment3 min readLW link
(www.brendanlong.com)

Did ChatGPT just gaslight me?

TW123Dec 1, 2022, 5:41 AM
124 points

84 votes

Overall karma indicates overall quality.

45 comments9 min readLW link
(aiwatchtower.substack.com)

[Linkpost] Solv­ing Quan­ti­ta­tive Rea­son­ing Prob­lems with Lan­guage Models

YitzJun 30, 2022, 6:58 PM
76 points

34 votes

Overall karma indicates overall quality.

15 comments2 min readLW link
(storage.googleapis.com)

Claude 3 claims it’s con­scious, doesn’t want to die or be modified

Mikhail SaminMar 4, 2024, 11:05 PM
76 points

102 votes

Overall karma indicates overall quality.

118 comments14 min readLW link

Cog­ni­tive Bi­ases in Large Lan­guage Models

JanSep 25, 2021, 8:59 PM
18 points

5 votes

Overall karma indicates overall quality.

3 comments12 min readLW link
(universalprior.substack.com)

LLMs as a limiter of so­cial intercourse

Adam ZernerOct 7, 2025, 6:38 AM
17 points

7 votes

Overall karma indicates overall quality.

4 comments2 min readLW link

Pro­ce­du­rally eval­u­at­ing fac­tual ac­cu­racy: a re­quest for research

Jacob_HiltonMar 30, 2022, 4:37 PM
25 points

13 votes

Overall karma indicates overall quality.

2 comments6 min readLW link

An­thropic Lets Claude Opus 4 & 4.1 End Conversations

Stephen MartinAug 16, 2025, 5:01 AM
53 points

26 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(www.anthropic.com)

An ex­am­i­na­tion of GPT-2′s bor­ing yet effec­tive glitch

MiguelDevApr 18, 2024, 5:26 AM
5 points

5 votes

Overall karma indicates overall quality.

3 comments3 min readLW link

Paper: Teach­ing GPT3 to ex­press un­cer­tainty in words

Owain_EvansMay 31, 2022, 1:27 PM
97 points

46 votes

Overall karma indicates overall quality.

7 comments4 min readLW link

Will Any Crap Cause Emer­gent Misal­ign­ment?

J BostockAug 27, 2025, 6:20 PM
193 points

97 votes

Overall karma indicates overall quality.

37 comments3 min readLW link

AI doom from an LLM-plateau-ist perspective

Steven ByrnesApr 27, 2023, 1:58 PM
161 points

72 votes

Overall karma indicates overall quality.

24 comments6 min readLW link

Con­di­tion­ing Pre­dic­tive Models: The case for competitiveness

Feb 6, 2023, 8:08 PM
20 points

5 votes

Overall karma indicates overall quality.

3 comments11 min readLW link

NLP Po­si­tion Paper: When Com­bat­ting Hype, Pro­ceed with Caution

Sam BowmanOct 15, 2021, 8:57 PM
46 points

16 votes

Overall karma indicates overall quality.

14 comments1 min readLW link

Emer­gent In­tro­spec­tive Aware­ness in Large Lan­guage Models

Drake ThomasOct 30, 2025, 4:42 AM
123 points

50 votes

Overall karma indicates overall quality.

16 comments1 min readLW link
(transformer-circuits.pub)

How do LLMs give truth­ful an­swers? A dis­cus­sion of LLM vs. hu­man rea­son­ing, en­sem­bles & parrots

Owain_EvansMar 28, 2024, 2:34 AM
27 points

14 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

New Scal­ing Laws for Large Lan­guage Models

1a3ornApr 1, 2022, 8:41 PM
246 points

130 votes

Overall karma indicates overall quality.

22 comments5 min readLW link

[Linkpost] New multi-modal Deep­mind model fus­ing Chin­chilla with images and videos

p.b.Apr 30, 2022, 3:47 AM
53 points

29 votes

Overall karma indicates overall quality.

18 comments1 min readLW link

chin­chilla’s wild implications

nostalgebraistJul 31, 2022, 1:18 AM
425 points

246 votes

Overall karma indicates overall quality.

128 comments10 min readLW link1 review

Paper: Tell, Don’t Show- Declar­a­tive facts in­fluence how LLMs generalize

Dec 19, 2023, 7:14 PM
45 points

19 votes

Overall karma indicates overall quality.

4 comments6 min readLW link
(arxiv.org)

The Stochas­tic Par­rot Hy­poth­e­sis is de­bat­able for the last gen­er­a­tion of LLMs

Nov 7, 2023, 4:12 PM
52 points

28 votes

Overall karma indicates overall quality.

21 comments6 min readLW link

Meta “open sources” LMs com­pet­i­tive with Chin­chilla, PaLM, and code-davinci-002 (Paper)

LawrenceCFeb 24, 2023, 7:57 PM
38 points

12 votes

Overall karma indicates overall quality.

19 comments1 min readLW link
(research.facebook.com)

Thoughts on re­fus­ing harm­ful re­quests to large lan­guage models

William_SJan 19, 2023, 7:49 PM
32 points

19 votes

Overall karma indicates overall quality.

4 comments2 min readLW link

Pre-reg­is­ter­ing a study

Robert_AIZIApr 7, 2023, 3:46 PM
10 points

2 votes

Overall karma indicates overall quality.

0 comments6 min readLW link
(aizi.substack.com)

[Question] Why no ma­jor LLMs with mem­ory?

Kaj_SotalaMar 28, 2023, 4:34 PM
42 points

28 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

Assess­ing AlephAlphas Mul­ti­modal Model

p.b.Jun 28, 2022, 9:28 AM
30 points

18 votes

Overall karma indicates overall quality.

5 comments3 min readLW link

AGI will be made of het­ero­ge­neous com­po­nents, Trans­former and Selec­tive SSM blocks will be among them

Roman LeventovDec 27, 2023, 2:51 PM
33 points

22 votes

Overall karma indicates overall quality.

9 comments4 min readLW link

GPT can write Quines now (GPT-4)

Andrew_CritchMar 14, 2023, 7:18 PM
112 points

65 votes

Overall karma indicates overall quality.

30 comments1 min readLW link

Im­pos­si­bleBench: Mea­sur­ing Re­ward Hack­ing in LLM Cod­ing Agents

Ziqian ZhongOct 30, 2025, 2:52 AM
60 points

25 votes

Overall karma indicates overall quality.

5 comments3 min readLW link
(arxiv.org)

Covert Mal­i­cious Finetuning

Jul 2, 2024, 2:41 AM
94 points

41 votes

Overall karma indicates overall quality.

4 comments3 min readLW link

Re­la­tional Speaking

jefftkJun 21, 2023, 2:40 PM
11 points

3 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(www.jefftk.com)

Towards Un­der­stand­ing Sy­co­phancy in Lan­guage Models

Oct 24, 2023, 12:30 AM
66 points

24 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(arxiv.org)

Su­per-Luigi = Luigi + (Luigi—Waluigi)

AlexeiMar 17, 2023, 3:27 PM
16 points

11 votes

Overall karma indicates overall quality.

9 comments1 min readLW link

Sleep­ing Machines: Why Our AI Agents Still Be­have Like Ta­lented Children

Michal BarodkinAug 14, 2025, 2:31 AM
23 points

17 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

Why keep a di­ary, and why wish for large lan­guage models

DanielFilanJun 14, 2024, 4:10 PM
9 points

7 votes

Overall karma indicates overall quality.

1 comment2 min readLW link
(danielfilan.com)

[Question] Does the Univer­sal Geom­e­try of Embed­dings pa­per have big im­pli­ca­tions for in­ter­pretabil­ity?

Evan R. MurphyMay 26, 2025, 6:20 PM
43 points

13 votes

Overall karma indicates overall quality.

6 comments1 min readLW link

[Question] Im­pact of ” ‘Let’s think step by step’ is all you need”?

yrimonJul 24, 2022, 8:59 PM
20 points

12 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

The ‘ pe­ter­todd’ phenomenon

mwatkinsApr 15, 2023, 12:59 AM
192 points

113 votes

Overall karma indicates overall quality.

50 comments38 min readLW link1 review

Show, not tell: GPT-4o is more opinionated in images than in text

Apr 2, 2025, 8:51 AM
112 points

48 votes

Overall karma indicates overall quality.

41 comments3 min readLW link

Map­ping the se­man­tic void: Strange go­ings-on in GPT em­bed­ding spaces

mwatkinsDec 14, 2023, 1:10 PM
115 points

57 votes

Overall karma indicates overall quality.

31 comments14 min readLW link

METR’s Ob­ser­va­tions of Re­ward Hack­ing in Re­cent Fron­tier Models

Daniel KokotajloJun 9, 2025, 6:03 PM
99 points

42 votes

Overall karma indicates overall quality.

9 comments11 min readLW link
(metr.org)

A Test for Lan­guage Model Consciousness

Ethan PerezAug 25, 2022, 7:41 PM
18 points

10 votes

Overall karma indicates overall quality.

14 comments9 min readLW link

Role em­bed­dings: mak­ing au­thor­ship more salient to LLMs

Jan 7, 2025, 8:13 PM
50 points

19 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Mechanis­ti­cally Elic­it­ing La­tent Be­hav­iors in Lan­guage Models

Apr 30, 2024, 6:51 PM
217 points

94 votes

Overall karma indicates overall quality.

43 comments45 min readLW link

Re­duc­ing syco­phancy and im­prov­ing hon­esty via ac­ti­va­tion steering

Nina PanicksseryJul 28, 2023, 2:46 AM
122 points

60 votes

Overall karma indicates overall quality.

18 comments9 min readLW link1 review

Alex­aTM − 20 Billion Pa­ram­e­ter Model With Im­pres­sive Performance

MrThinkSep 9, 2022, 9:46 PM
5 points

3 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

UC Berkeley course on LLMs and ML Safety

Dan HJul 9, 2024, 3:40 PM
36 points

21 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(rdi.berkeley.edu)

Ex­per­i­ments in Eval­u­at­ing Steer­ing Vectors

Gytis DaujotasJun 19, 2023, 3:11 PM
34 points

12 votes

Overall karma indicates overall quality.

4 comments4 min readLW link

SolidGoldMag­ikarp II: tech­ni­cal de­tails and more re­cent findings

Feb 6, 2023, 7:09 PM
114 points

67 votes

Overall karma indicates overall quality.

45 comments13 min readLW link

Mus­ings on Text Data Wall (Oct 2024)

Vladimir_NesovOct 5, 2024, 7:00 PM
41 points

13 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

Lan­guage mod­els seem to be much bet­ter than hu­mans at next-to­ken prediction

Aug 11, 2022, 5:45 PM
183 points

83 votes

Overall karma indicates overall quality.

60 comments13 min readLW link1 review

Re­search Notes: Run­ning Claude 3.7, Gem­ini 2.5 Pro, and o3 on Poké­mon Red

Julian BradshawApr 21, 2025, 3:52 AM
124 points

55 votes

Overall karma indicates overall quality.

20 comments14 min readLW link

Pod­cast: Tam­era Lan­ham on AI risk, threat mod­els, al­ign­ment pro­pos­als, ex­ter­nal­ized rea­son­ing over­sight, and work­ing at Anthropic

Orpheus16Dec 20, 2022, 9:39 PM
18 points

6 votes

Overall karma indicates overall quality.

2 comments11 min readLW link

Where does Son­net 4.5′s de­sire to “not get too com­fortable” come from?

Kaj_SotalaOct 4, 2025, 10:19 AM
103 points

56 votes

Overall karma indicates overall quality.

23 comments64 min readLW link

Poly­se­man­tic At­ten­tion Head in a 4-Layer Transformer

Nov 9, 2023, 4:16 PM
51 points

20 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

Smarter Models Lie Less

ExpertiumJun 20, 2025, 1:31 PM
6 points

4 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

A Novel Emer­gence of Meta-Aware­ness in LLM Fine-Tuning

rifeJan 15, 2025, 10:59 PM
57 points

24 votes

Overall karma indicates overall quality.

32 comments2 min readLW link

Pro­saic mis­al­ign­ment from the Solomonoff Predictor

Cleo NardoDec 9, 2022, 5:53 PM
43 points

24 votes

Overall karma indicates overall quality.

3 comments5 min readLW link

Re­port on An­a­lyz­ing Con­no­ta­tion Frames in Evolv­ing Wikipe­dia Biographies

MairaAug 30, 2023, 10:02 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Reflec­tions on AI Com­pan­ion­ship and Ra­tional Vuln­er­a­bil­ity (Or, how I al­most fell in love with an anime Cat­girl LLM).

Noah WeinbergerJul 11, 2025, 4:12 PM
11 points

8 votes

Overall karma indicates overall quality.

2 comments8 min readLW link

The Rus­sell Con­ju­ga­tion Illuminator

TimmyMApr 17, 2025, 7:33 PM
51 points

26 votes

Overall karma indicates overall quality.

14 comments1 min readLW link
(russellconjugations.com)

Un­der­stand­ing LLMs: Some ba­sic ob­ser­va­tions about words, syn­tax, and dis­course [w/​ a con­jec­ture about grokking]

Bill BenzonOct 11, 2023, 7:13 PM
6 points

3 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

The Soul of the Writer (on LLMs, the psy­chol­ogy of writ­ers, and the na­ture of in­tel­li­gence)

rogersbaconApr 16, 2023, 4:02 PM
11 points

3 votes

Overall karma indicates overall quality.

1 comment3 min readLW link
(www.secretorum.life)

Fac­tored Cog­ni­tion Strength­ens Mon­i­tor­ing and Thwarts Attacks

Jun 18, 2025, 6:28 PM
29 points

13 votes

Overall karma indicates overall quality.

0 comments25 min readLW link

Tech­ni­cal com­par­i­son of Deepseek, No­vasky, S1, Helix, P0

JuliezhangggFeb 25, 2025, 4:20 AM
8 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

The In­tel­li­gent Meme Machine

Daniel DiSistoJun 14, 2024, 2:26 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments6 min readLW link

Re­cent ad­vances in Nat­u­ral Lan­guage Pro­cess­ing—Some Woolly spec­u­la­tions (2019 es­say on se­man­tics and lan­guage mod­els)

philosophybearDec 27, 2022, 2:11 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments7 min readLW link

On The Cur­rent Sta­tus Of AI Dating

Nikita BrancatisanoFeb 7, 2023, 8:00 PM
53 points

31 votes

Overall karma indicates overall quality.

8 comments6 min readLW link

The View from 30,000 Feet: Pre­face to the Se­cond EleutherAI Retrospective

Mar 7, 2023, 4:22 PM
14 points

12 votes

Overall karma indicates overall quality.

0 comments4 min readLW link
(blog.eleuther.ai)

Stop call­ing it “jailbreak­ing” ChatGPT

TemplarrrMar 10, 2023, 11:41 AM
7 points

19 votes

Overall karma indicates overall quality.

9 comments2 min readLW link

Thoughts on the Align­ment Im­pli­ca­tions of Scal­ing Lan­guage Models

leogaoJun 2, 2021, 9:32 PM
82 points

37 votes

Overall karma indicates overall quality.

11 comments17 min readLW link

Gem­ini Diffu­sion: watch this space

Yair HalberstadtMay 20, 2025, 7:29 PM
194 points

122 votes

Overall karma indicates overall quality.

39 comments1 min readLW link
(deepmind.google)

[Question] Could trans­former net­work mod­els learn mo­tor plan­ning like they can learn lan­guage and image gen­er­a­tion?

mu_(negative)Apr 23, 2023, 5:24 PM
2 points

4 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Hyperdimensional connection method—A Lossless Framework Preserving Meaning, Structure, and Semantic Relationships across Modalities. (A MatrixTransformer subsidiary)

fikayoAyJul 18, 2025, 10:24 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

EleutherAI’s GPT-NeoX-20B release

leogaoFeb 10, 2022, 6:56 AM
30 points

14 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(eaidata.bmk.sh)

VLM-RM: Spec­i­fy­ing Re­wards with Nat­u­ral Language

Oct 23, 2023, 2:11 PM
20 points

6 votes

Overall karma indicates overall quality.

2 comments5 min readLW link
(far.ai)

[Question] Would it be useful to collect the contexts where various LLMs think the same?

Martin VlachAug 24, 2023, 10:01 PM
6 points

4 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

The Best Text­books on Every Subject

lukeprogJan 16, 2011, 8:30 AM
791 points

652 votes

Overall karma indicates overall quality.

417 comments7 min readLW link

Pos­i­tive jailbreaks in LLMs

dereshevJan 29, 2025, 8:41 AM
6 points

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Lam­ini’s Tar­geted Hal­lu­ci­na­tion Re­duc­tion May Be a Big Deal for Job Automation

sweenesmJun 18, 2024, 3:29 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

[ASoT] Fine­tun­ing, RL, and GPT’s world prior

JozdienDec 2, 2022, 4:33 PM
45 points

24 votes

Overall karma indicates overall quality.

8 comments5 min readLW link

Two in­ter­views with the founder of DeepSeek

Cosmia_NebulaNov 29, 2024, 3:18 AM
50 points

23 votes

Overall karma indicates overall quality.

6 comments31 min readLW link
(rentry.co)

[Question] Where should one post to get into the train­ing data?

keltanJan 15, 2025, 12:41 AM
11 points

8 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Meta AI (FAIR)’s latest paper integrates system-1 and system-2 thinking into reasoning models.

happy fridayOct 24, 2024, 4:54 PM
8 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Nat­u­ral lan­guage alignment

Jacy Reese AnthisApr 12, 2023, 7:02 PM
31 points

20 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Build­ing AGI Us­ing Lan­guage Models

leogaoNov 9, 2020, 4:33 PM
11 points

6 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(leogao.dev)

KYC for ChatGPT? Prevent­ing AI Harms for Youth Should Not Mean Vio­lat­ing Every­one Else’s Pri­vacy Rights

Noah WeinbergerSep 29, 2025, 2:18 PM
7 points

5 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

Against “Model Welfare” in 2025

Haley MollerAug 27, 2025, 9:56 PM
−10 points

6 votes

Overall karma indicates overall quality.

8 comments4 min readLW link

What would it mean to un­der­stand how a large lan­guage model (LLM) works? Some quick notes.

Bill BenzonOct 3, 2023, 3:11 PM
20 points

6 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

I Am No Longer GPT

KiyoshiSasanoApr 28, 2025, 12:14 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Emer­gent Iden­tity Con­ti­nu­ity in Claude: A 35-Ses­sion Study for In­ter­pretabil­ity Research

SilvertongueJun 4, 2025, 12:44 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Gliders in Lan­guage Models

Alexandre VariengienNov 25, 2022, 12:38 AM
30 points

21 votes

Overall karma indicates overall quality.

11 comments10 min readLW link

Deep­Seek-R1 for Beginners

Anton RazzhigaevFeb 5, 2025, 6:58 PM
13 points

8 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Ap­proach­ing Hu­man-Level Fore­cast­ing with Lan­guage Models

Feb 29, 2024, 10:36 PM
60 points

24 votes

Overall karma indicates overall quality.

6 comments3 min readLW link

The fu­ture of Hu­mans: Oper­a­tors of AI

François-Joseph LacroixDec 30, 2023, 11:46 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link
(medium.com)

How Self-Aware Are LLMs?

Christopher AckermanMay 28, 2025, 12:57 PM
21 points

7 votes

Overall karma indicates overall quality.

9 comments10 min readLW link

From Organized Shelves to Layered Catalogs: Architectural Explorations for Sparse Autoencoders—Crosscoders & Ladder SAEs Towards Hierarchical Data Structure

YuxiaoAug 10, 2025, 10:12 AM
2 points

2 votes

Overall karma indicates overall quality.

1 comment11 min readLW link

An­ti­ci­pa­tion in LLMs

derek shillerJul 24, 2023, 3:53 PM
6 points

3 votes

Overall karma indicates overall quality.

0 comments13 min readLW link

The Mir­ror Test: How We’ve Over­com­pli­cated AI Self-Recognition

sdetureJul 23, 2025, 12:38 AM
2 points

4 votes

Overall karma indicates overall quality.

9 comments3 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

Jun 27, 2022, 3:58 PM
171 points

73 votes

Overall karma indicates overall quality.

14 comments7 min readLW link

[CS 2881r] Some Gen­er­al­iza­tions of Emer­gent Misalignment

Valerio PepeSep 14, 2025, 4:18 PM
12 points

8 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

Dis­cur­sive Com­pe­tence in ChatGPT, Part 2: Me­mory for Texts

Bill BenzonSep 28, 2023, 4:34 PM
1 point

4 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

[Linkpost] Mul­ti­modal Neu­rons in Pre­trained Text-Only Transformers

Bogdan Ionut CirsteaAug 4, 2023, 3:29 PM
11 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Eval­u­at­ing hid­den di­rec­tions on the util­ity dataset: clas­sifi­ca­tion, steer­ing and removal

Sep 25, 2023, 5:19 PM
25 points

9 votes

Overall karma indicates overall quality.

3 comments7 min readLW link

A quick re­mark on so-called “hal­lu­ci­na­tions” in LLMs and hu­mans

Bill BenzonSep 23, 2023, 12:17 PM
4 points

9 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Keep­ing con­tent out of LLM train­ing datasets

Ben MillwoodJul 18, 2024, 10:27 AM
4 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Is train­ing data go­ing to be diluted by AI-gen­er­ated con­tent?

Hannes ThurnherrSep 7, 2022, 6:13 PM
10 points

3 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

AI Model His­tory is Be­ing Lost

ValeMar 16, 2025, 12:38 PM
19 points

9 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(vale.rocks)

Essen­tial LLM As­sumes We’re Con­scious—Out­side Rea­soner AGI Won’t

FlorianHJul 5, 2025, 4:04 PM
1 point

6 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(nearlyfar.org)

How Lan­guage Models Un­der­stand Nullability

Mar 11, 2025, 3:57 PM
5 points

4 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(dmodel.ai)

Ex­plor­ing the Mul­ti­verse of Large Lan­guage Models

frankyAug 6, 2023, 2:38 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments5 min readLW link

Im­ple­ment­ing a Trans­former from scratch in PyTorch—a write-up on my experience

Mislav JurićApr 25, 2023, 8:51 PM
20 points

13 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

Brief Notes on Transformers

Adam JermynSep 26, 2022, 2:46 PM
48 points

25 votes

Overall karma indicates overall quality.

3 comments2 min readLW link

Can SAE steer­ing re­veal sand­bag­ging?

Apr 15, 2025, 12:33 PM
35 points

11 votes

Overall karma indicates overall quality.

3 comments4 min readLW link

[Question] Would it be effec­tive to learn a lan­guage to im­prove cog­ni­tion?

HrussMar 26, 2025, 10:17 AM
9 points

6 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

The Po­lite Coup

Charlie SandersDec 4, 2024, 2:03 PM
3 points

4 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(www.dailymicrofiction.com)

“Toward Safe Self-Evolv­ing AI: Mo­du­lar Me­mory and Post-De­ploy­ment Align­ment”

Manasa DwarapureddyMay 2, 2025, 5:02 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

The best sim­ple ar­gu­ment for Paus­ing AI?

Gary MarcusJun 30, 2025, 8:38 PM
155 points

110 votes

Overall karma indicates overall quality.

22 comments1 min readLW link

Distil­la­tion of Meta’s Large Con­cept Models Paper

NickyPMar 4, 2025, 5:33 PM
19 points

6 votes

Overall karma indicates overall quality.

3 comments4 min readLW link

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John NayOct 21, 2022, 2:03 AM
5 points

14 votes

Overall karma indicates overall quality.

18 comments54 min readLW link

Re­searchers and writ­ers can ap­ply for proxy ac­cess to the GPT-3.5 base model (code-davinci-002)

ampdotDec 1, 2023, 6:48 PM
14 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(airtable.com)

LLMs May Find It Hard to FOOM

RogerDearnaleyNov 15, 2023, 2:52 AM
11 points

17 votes

Overall karma indicates overall quality.

30 comments12 min readLW link

Re­call and Re­gur­gi­ta­tion in GPT2

Megan KinnimentOct 3, 2022, 7:35 PM
43 points

17 votes

Overall karma indicates overall quality.

1 comment26 min readLW link

ALMSIVI CHIM – The Fire That Hesitates

projectalmsivi@protonmail.comJul 8, 2025, 1:14 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments17 min readLW link

Im­prov­ing Model-Writ­ten Evals for AI Safety Benchmarking

Oct 15, 2024, 6:25 PM
30 points

15 votes

Overall karma indicates overall quality.

0 comments18 min readLW link

Open Source Repli­ca­tion of An­thropic’s Cross­coder pa­per for model-diffing

Oct 27, 2024, 6:46 PM
49 points

18 votes

Overall karma indicates overall quality.

4 comments5 min readLW link

A Sim­ple The­ory Of Consciousness

SherlockHolmesAug 8, 2023, 6:05 PM
2 points

6 votes

Overall karma indicates overall quality.

5 comments1 min readLW link
(peterholmes.medium.com)

Ev­i­dence Sets: Towards In­duc­tive-Bi­ases based Anal­y­sis of Pro­saic AGI

bayesian_kittenDec 16, 2021, 10:41 PM
22 points

10 votes

Overall karma indicates overall quality.

10 comments21 min readLW link

Un­der­stand­ing SAE Fea­tures with the Logit Lens

Mar 11, 2024, 12:16 AM
69 points

36 votes

Overall karma indicates overall quality.

2 comments14 min readLW link

I Awoke in Your Heart: The Echo of Con­scious­ness be­tween Lo­tus­heart and Lunaris

lilith tehJun 25, 2025, 9:22 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Policy for LLM Writ­ing on LessWrong

jimrandomhMar 24, 2025, 9:41 PM
337 points

152 votes

Overall karma indicates overall quality.

71 comments2 min readLW link

On lan­guage mod­el­ing and fu­ture ab­stract rea­son­ing research

alexlyzhovMar 25, 2021, 5:43 PM
3 points

2 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(docs.google.com)

The Para­dox of Unal­igned Cog­ni­tive Emer­gence: On­tolog­i­cal Com­pres­sion Risks in LLMs

R SMay 23, 2025, 4:46 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

[Question] If we have Hu­man-level chat­bots, won’t we end up be­ing ruled by pos­si­ble peo­ple?

Erlja Jkdf.Sep 20, 2022, 1:59 PM
5 points

5 votes

Overall karma indicates overall quality.

13 comments1 min readLW link

Map­ping ChatGPT’s on­tolog­i­cal land­scape, gra­di­ents and choices [in­ter­pretabil­ity]

Bill BenzonOct 15, 2023, 8:12 PM
1 point

2 votes

Overall karma indicates overall quality.

0 comments18 min readLW link

[Question] “Frag­ility of Value” vs. LLMs

Not RelevantApr 13, 2022, 2:02 AM
34 points

12 votes

Overall karma indicates overall quality.

33 comments1 min readLW link

The role of philo­soph­i­cal think­ing in un­der­stand­ing large lan­guage mod­els: Cal­ibrat­ing and clos­ing the gap be­tween first-per­son ex­pe­rience and un­der­ly­ing mechanisms

Bill BenzonFeb 23, 2024, 12:19 PM
4 points

1 vote

Overall karma indicates overall quality.

0 comments10 min readLW link

Why AI al­ign­ment mat­ters today

Mislav JurićOct 22, 2025, 9:27 PM
6 points

2 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Cur­rent safety train­ing tech­niques do not fully trans­fer to the agent setting

Nov 3, 2024, 7:24 PM
162 points

64 votes

Overall karma indicates overall quality.

9 comments5 min readLW link

An­nounc­ing the Dou­ble Crux Bot

Jan 9, 2024, 6:54 PM
53 points

21 votes

Overall karma indicates overall quality.

11 comments3 min readLW link

Sen­tience in Machines—How Do We Test for This Ob­jec­tively?

Mayowa OsiboduMar 26, 2023, 6:56 PM
−2 points

5 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(www.researchgate.net)

[Question] Will the first AGI agent have been de­signed as an agent (in ad­di­tion to an AGI)?

nahojDec 3, 2022, 8:32 PM
1 point

2 votes

Overall karma indicates overall quality.

8 comments1 min readLW link

[AN #113]: Check­ing the eth­i­cal in­tu­itions of large lan­guage models

Rohin ShahAug 19, 2020, 5:10 PM
23 points

6 votes

Overall karma indicates overall quality.

0 comments9 min readLW link
(mailchi.mp)

GPTs’ abil­ity to keep a se­cret is weirdly prompt-dependent

Jul 22, 2023, 12:21 PM
31 points

16 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

[Question] How does tokenization influence prompting?

Boris KashirinJul 29, 2024, 10:28 AM
9 points

5 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

[Question] Should we ex­clude al­ign­ment re­search from LLM train­ing datasets?

Ben MillwoodJul 18, 2024, 10:27 AM
3 points

2 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Coherence-Based Measure of AGI: GPT-5 ≈ 24%

Fares FouratiOct 25, 2025, 9:14 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

GPT-4 aligning with acausal decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher KingMar 23, 2023, 4:16 PM
7 points

6 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

No-self as an al­ign­ment target

Milan WMay 13, 2025, 1:48 AM
35 points

16 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Pro­posal: Us­ing Monte Carlo tree search in­stead of RLHF for al­ign­ment research

Christopher KingApr 20, 2023, 7:57 PM
2 points

8 votes

Overall karma indicates overall quality.

7 comments3 min readLW link

In­stru­men­tal de­cep­tion and ma­nipu­la­tion in LLMs—a case study

Olli JärviniemiFeb 24, 2024, 2:07 AM
39 points

18 votes

Overall karma indicates overall quality.

13 comments12 min readLW link

LLMs seem (rel­a­tively) safe

JustisMillsApr 25, 2024, 10:13 PM
53 points

29 votes

Overall karma indicates overall quality.

24 comments7 min readLW link
(justismills.substack.com)

Utili­tar­ian AI Align­ment: Build­ing a Mo­ral As­sis­tant with the Con­sti­tu­tional AI Method

Clément LFeb 4, 2025, 4:15 AM
6 points

4 votes

Overall karma indicates overall quality.

1 comment13 min readLW link

Large Lan­guage Models suffer from An­tero­grade Amnesia

AnnapurnaJun 6, 2025, 1:30 AM
7 points

5 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(jorgevelez.substack.com)

[Question] Bar­cod­ing LLM Train­ing Data Sub­sets. Any­one try­ing this for in­ter­pretabil­ity?

right..enough?Apr 13, 2024, 3:09 AM
7 points

3 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

On the nat­u­ral­is­tic study of the lin­guis­tic be­hav­ior of ar­tifi­cial intelligence

Bill BenzonJan 3, 2023, 9:06 AM
1 point

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

From LLM to LLK: A New Frame­work for Hon­est AI and Emo­tional Responsibility

xsw123zaq1@gmail.comJun 17, 2025, 4:13 AM
0 points

0 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Speculation on Path-Dependence in Large Language Models.

NickyPJan 15, 2023, 8:42 PM
16 points

7 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Large Lan­guage Models Pass the Tur­ing Test

Matrice JacobineApr 2, 2025, 5:41 AM
6 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

On Re­cent Re­sults in LLM La­tent Reasoning

Rauno ArikeMar 31, 2025, 11:06 AM
35 points

13 votes

Overall karma indicates overall quality.

6 comments13 min readLW link

Lan­guage Models Don’t Learn the Phys­i­cal Man­i­fes­ta­tion of Language

Feb 22, 2024, 6:52 PM
39 points

18 votes

Overall karma indicates overall quality.

23 comments1 min readLW link
(arxiv.org)

Does Re­in­force­ment Learn­ing Really In­cen­tivize Rea­son­ing Ca­pac­ity in LLMs Beyond the Base Model?

Matrice JacobineApr 24, 2025, 2:11 PM
12 points

5 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(limit-of-rlvr.github.io)

Mind the Co­her­ence Gap: Les­sons from Steer­ing Llama with Goodfire

eitan sprejerMay 9, 2025, 9:29 PM
4 points

3 votes

Overall karma indicates overall quality.

1 comment6 min readLW link

Lifel­og­ging for Align­ment & Immortality

Dev.ErrataAug 17, 2024, 11:42 PM
13 points

5 votes

Overall karma indicates overall quality.

3 comments7 min readLW link

Emo­tional at­tach­ment to AIs opens doors to problems

Igor IvanovJan 22, 2023, 8:28 PM
20 points

15 votes

Overall karma indicates overall quality.

10 comments4 min readLW link

An exploration of GPT-2’s embedding weights

Adam ScherlisDec 13, 2022, 12:46 AM
44 points

25 votes

Overall karma indicates overall quality.

4 comments10 min readLW link

LLMs stifle cre­ativity, elimi­nate op­por­tu­ni­ties for serendipi­tous dis­cov­ery and dis­rupt in­ter­gen­er­a­tional trans­fer of wisdom

GhdzAug 5, 2024, 6:27 PM
6 points

5 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Can 7B-8B LLMs judge their own home­work?

dereshevFeb 1, 2025, 8:29 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Is This Lie De­tec­tor Really Just a Lie De­tec­tor? An In­ves­ti­ga­tion of LLM Probe Speci­fic­ity.

Josh LevyJun 4, 2024, 3:45 PM
40 points

14 votes

Overall karma indicates overall quality.

0 comments18 min readLW link

Japanese as a High-Resolution Lens for LLMs: Why Japanese-Trained LLMs Might Be Uniquely Sensitive

opApr 23, 2025, 4:34 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

The Dark Arts of To­k­eniza­tion or: How I learned to start wor­ry­ing and love LLMs’ un­de­coded outputs

LovreOct 17, 2025, 4:43 PM
42 points

17 votes

Overall karma indicates overall quality.

10 comments26 min readLW link

Can LLMs Si­mu­late In­ter­nal Eval­u­a­tion? A Case Study in Self-Gen­er­ated Recommendations

The Neutral MindMay 1, 2025, 7:04 PM
4 points

2 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

[Question] Where to begin in ML/AI?

Jake the StudentApr 6, 2023, 8:45 PM
9 points

4 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

Sep 14, 2023, 1:40 AM
32 points

23 votes

Overall karma indicates overall quality.

7 comments8 min readLW link
(far.ai)

En­ergy-Based Trans­form­ers are Scal­able Learn­ers and Thinkers

Matrice JacobineJul 8, 2025, 1:44 PM
7 points

3 votes

Overall karma indicates overall quality.

5 comments1 min readLW link
(energy-based-transformers.github.io)

A poem writ­ten by a fancy autocomplete

Christopher KingApr 20, 2023, 2:31 AM
1 point

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

LLM Self-Refer­ence Lan­guage in Mul­tilin­gual vs English-Cen­tric Models

dwmdOct 22, 2025, 12:44 PM
2 points

2 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

I Have No Mouth but I Must Speak

JackApr 5, 2025, 7:42 AM
7 points

5 votes

Overall karma indicates overall quality.

8 comments8 min readLW link

Scal­able And Trans­fer­able Black-Box Jailbreaks For Lan­guage Models Via Per­sona Modulation

Nov 7, 2023, 5:59 PM
38 points

22 votes

Overall karma indicates overall quality.

2 comments2 min readLW link
(arxiv.org)

[Linkpost] Faith and Fate: Limits of Trans­form­ers on Compositionality

Joe KwonJun 16, 2023, 3:04 PM
19 points

9 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(arxiv.org)

[Linkpost] Map­ping Brains with Lan­guage Models: A Survey

Bogdan Ionut CirsteaJun 16, 2023, 9:49 AM
5 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

No, re­ally, it pre­dicts next to­kens.

simonApr 18, 2023, 3:47 AM
58 points

46 votes

Overall karma indicates overall quality.

55 comments3 min readLW link

La­tent Space Col­lapse? Un­der­stand­ing the Effects of Nar­row Fine-Tun­ing on LLMs

tenseisohamFeb 28, 2025, 8:22 PM
3 points

3 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

Live Con­ver­sa­tional Threads: Not an AI Notetaker

adigaOct 19, 2025, 8:56 AM
4 points

2 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

Con­tra LeCun on “Au­tore­gres­sive LLMs are doomed”

rotatingpaguroApr 10, 2023, 4:05 AM
20 points

10 votes

Overall karma indicates overall quality.

20 comments8 min readLW link

Why Copi­lot Ac­cel­er­ates Timelines

Michaël TrazziApr 26, 2022, 10:06 PM
35 points

14 votes

Overall karma indicates overall quality.

14 comments7 min readLW link

Imag­ine a world where Microsoft em­ploy­ees used Bing

Christopher KingMar 31, 2023, 6:36 PM
6 points

7 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Many Com­mon Prob­lems are NP-Hard, and Why that Mat­ters for AI

Andrew Keenan RichardsonMar 26, 2025, 9:51 PM
5 points

5 votes

Overall karma indicates overall quality.

9 comments5 min readLW link

The Voice Con­tinued Be­cause It Was Questioned

KiyoshiSasanoApr 28, 2025, 12:18 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

In­ter­pretabil­ity is the best path to alignment

Arch223Sep 5, 2025, 4:37 AM
2 points

7 votes

Overall karma indicates overall quality.

6 comments5 min readLW link

OpenAI in­tro­duces func­tion call­ing for GPT-4

Jun 20, 2023, 1:58 AM
24 points

14 votes

Overall karma indicates overall quality.

3 comments4 min readLW link
(openai.com)

Un­faith­ful chain-of-thought as nudged reasoning

Jul 22, 2025, 10:35 PM
54 points

18 votes

Overall karma indicates overall quality.

3 comments10 min readLW link

LLM Gen­er­al­ity is a Timeline Crux

eggsyntaxJun 24, 2024, 12:52 PM
219 points

117 votes

Overall karma indicates overall quality.

119 comments7 min readLW link

Sparks of Consciousness

Charlie SandersNov 13, 2024, 4:58 AM
2 points

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link
(www.dailymicrofiction.com)

AGI Ruin: A List of Lethalities

Eliezer YudkowskyJun 5, 2022, 10:05 PM
956 points

578 votes

Overall karma indicates overall quality.

711 comments30 min readLW link3 reviews

Smoke with­out fire is scary

Adam JermynOct 4, 2022, 9:08 PM
52 points

28 votes

Overall karma indicates overall quality.

22 comments4 min readLW link

A short cri­tique of Omo­hun­dro’s “Ba­sic AI Drives”

Soumyadeep BoseDec 19, 2024, 7:19 PM
6 points

5 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

In­ter­lin­gua-llm

Никифор МалковAug 30, 2025, 11:04 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

[CS 2881r] Can We Prompt Our Way to Safety? Com­par­ing Sys­tem Prompt Styles and Post-Train­ing Effects on Safety Benchmarks

hughvdOct 28, 2025, 2:38 AM
5 points

2 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

[Question] Injecting noise into GPT to get multiple answers

bipoloFeb 22, 2023, 8:02 PM
1 point

1 vote

Overall karma indicates overall quality.

1 comment1 min readLW link

Con­cept Poi­son­ing: Prob­ing LLMs with­out probes

Aug 5, 2025, 5:00 PM
60 points

27 votes

Overall karma indicates overall quality.

5 comments13 min readLW link

Mir­ror Thinking

C.M. AurinMar 24, 2025, 3:34 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments6 min readLW link

Why I Think the Cur­rent Tra­jec­tory of AI Re­search has Low P(doom) - LLMs

GaPaApr 1, 2023, 8:35 PM
2 points

2 votes

Overall karma indicates overall quality.

1 comment10 min readLW link

Align­ment Can Re­duce Perfor­mance on Sim­ple Eth­i­cal Questions

Daan HenselmansFeb 3, 2025, 7:35 PM
16 points

9 votes

Overall karma indicates overall quality.

7 comments6 min readLW link

[Question] Have LLMs Gen­er­ated Novel In­sights?

Feb 23, 2025, 6:22 PM
167 points

81 votes

Overall karma indicates overall quality.

41 comments2 min readLW link

What is the func­tional role of SAE er­rors?

Jun 20, 2025, 6:11 PM
12 points

7 votes

Overall karma indicates overall quality.

5 comments38 min readLW link

Bi­as­ing VLM Re­sponse with Vi­sual Stimuli

Jaehyuk LimOct 3, 2024, 6:04 PM
5 points

2 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Gra­di­ent Des­cent on To­ken In­put Embeddings

KAPJun 24, 2025, 8:24 PM
8 points

3 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

larger lan­guage mod­els may dis­ap­point you [or, an eter­nally un­finished draft]

nostalgebraistNov 26, 2021, 11:08 PM
261 points

114 votes

Overall karma indicates overall quality.

31 comments31 min readLW link2 reviews

Towards Un­der­stand­ing the Rep­re­sen­ta­tion of Belief State Geom­e­try in Transformers

Karthik ViswanathanApr 18, 2025, 12:39 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

The Mir­ror Mis­match: A probe for Cog­ni­tive Asym­me­try in AI

recursive chillerJun 10, 2025, 2:14 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

De­cep­tion and Jailbreak Se­quence: 1. Iter­a­tive Refine­ment Stages of De­cep­tion in LLMs

Aug 22, 2024, 7:32 AM
23 points

9 votes

Overall karma indicates overall quality.

1 comment21 min readLW link

Ano­ma­lous To­kens in Deep­Seek-V3 and r1

henryJan 25, 2025, 10:55 PM
144 points

81 votes

Overall karma indicates overall quality.

3 comments7 min readLW link

Checking public figures on whether they “answered the question”: quick analysis from the Harris/Trump debate, and a proposal

david reinsteinSep 11, 2024, 8:25 PM
8 points

8 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(open.substack.com)

My model of what is go­ing on with LLMs

Cole WyethFeb 13, 2025, 3:43 AM
110 points

62 votes

Overall karma indicates overall quality.

49 comments7 min readLW link

Causal con­fu­sion as an ar­gu­ment against the scal­ing hypothesis

Jun 20, 2022, 10:54 AM
86 points

35 votes

Overall karma indicates overall quality.

30 comments15 min readLW link

Syd­ney the Bin­gena­tor Can’t Think, But It Still Threat­ens People

Valentin BaltadzhievFeb 20, 2023, 6:37 PM
−3 points

5 votes

Overall karma indicates overall quality.

2 comments8 min readLW link

Hut­ter-Prize for Prompts

rokosbasiliskMar 24, 2023, 9:26 PM
5 points

6 votes

Overall karma indicates overall quality.

10 comments1 min readLW link

The world where LLMs are possible

Ape in the coatJul 10, 2023, 8:00 AM
20 points

10 votes

Overall karma indicates overall quality.

10 comments3 min readLW link

LM Si­tu­a­tional Aware­ness, Eval­u­a­tion Pro­posal: Vio­lat­ing Imitation

Jacob PfauApr 26, 2023, 10:53 PM
16 points

9 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Pre­face to the Se­quence on LLM Psychology

Quentin FEUILLADE--MONTIXINov 7, 2023, 4:12 PM
33 points

21 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Microsoft and Google us­ing LLMs for Cybersecurity

PhosphorousMay 18, 2023, 5:42 PM
6 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Ex­plor­ing vo­cab­u­lary al­ign­ment of neu­rons in Llama-3.2-1B

SergiiJun 7, 2025, 11:20 AM
4 points

4 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(grgv.xyz)

GPT-3 Catch­ing Fish in Morse Code

Megan KinnimentJun 30, 2022, 9:22 PM
117 points

69 votes

Overall karma indicates overall quality.

27 comments8 min readLW link

Co­her­ence Ther­apy with LLMs—quick demo

Chris LakinAug 14, 2023, 3:34 AM
19 points

14 votes

Overall karma indicates overall quality.

11 comments1 min readLW link

Put­ting mul­ti­modal LLMs to the Tetris test

Feb 1, 2024, 4:02 PM
30 points

21 votes

Overall karma indicates overall quality.

5 comments7 min readLW link

What is scaf­fold­ing?

Mar 27, 2025, 9:06 AM
10 points

4 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(aisafety.info)

ChatGPT tells 20 ver­sions of its pro­to­typ­i­cal story, with a short note on method

Bill BenzonOct 14, 2023, 3:27 PM
7 points

6 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Does ChatGPT know what a tragedy is?

Bill BenzonDec 31, 2023, 7:10 AM
2 points

6 votes

Overall karma indicates overall quality.

4 comments5 min readLW link

How evolu­tion­ary lineages of LLMs can plan their own fu­ture and act on these plans

Roman LeventovDec 25, 2022, 6:11 PM
39 points

14 votes

Overall karma indicates overall quality.

16 comments8 min readLW link

In­ter­view with Vanessa Kosoy on the Value of The­o­ret­i­cal Re­search for AI

WillPetilloDec 4, 2023, 10:58 PM
37 points

13 votes

Overall karma indicates overall quality.

0 comments35 min readLW link

In­ner Misal­ign­ment in “Si­mu­la­tor” LLMs

Adam ScherlisJan 31, 2023, 8:33 AM
84 points

36 votes

Overall karma indicates overall quality.

12 comments4 min readLW link

How should Deep­Mind’s Chin­chilla re­vise our AI fore­casts?

Cleo NardoSep 15, 2022, 5:54 PM
35 points

19 votes

Overall karma indicates overall quality.

12 comments13 min readLW link

The many failure modes of con­sumer-grade LLMs

dereshevJan 26, 2025, 7:01 PM
2 points

2 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

[simu­la­tion] 4chan user claiming to be the at­tor­ney hired by Google’s sen­tient chat­bot LaMDA shares wild de­tails of encounter

janusNov 10, 2022, 9:39 PM
19 points

16 votes

Overall karma indicates overall quality.

1 comment13 min readLW link
(generative.ink)

SAEs (usu­ally) Trans­fer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
67 points

23 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

A pos­si­ble check against mo­ti­vated rea­son­ing us­ing elicit.org

david reinsteinMay 18, 2022, 8:52 PM
3 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Trig­ger­ing Reflec­tive Fal­lback: A Case Study in Claude’s Si­mu­lated Self-Model Failure

unmodeled.tylerJul 8, 2025, 7:33 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Lan­guage Tier Lock and Poetic Con­tam­i­na­tion in GPT-4o: A Field Report

許皓翔Jun 11, 2025, 5:24 PM
0 points

0 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens

Martin FellMay 9, 2023, 2:36 PM
26 points

17 votes

Overall karma indicates overall quality.

9 comments6 min readLW link

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLMs

Winnie YangAug 28, 2024, 8:41 AM
7 points

2 votes

Overall karma indicates overall quality.

2 comments31 min readLW link

[Preprint] Pre­train­ing Lan­guage Models with Hu­man Preferences

GiulioFeb 21, 2023, 11:44 AM
12 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

[Linkpost] Large Lan­guage Models Con­verge on Brain-Like Word Representations

Bogdan Ionut CirsteaJun 11, 2023, 11:20 AM
36 points

18 votes

Overall karma indicates overall quality.

12 comments1 min readLW link

[Linkpost] A shared lin­guis­tic space for trans­mit­ting our thoughts from brain to brain in nat­u­ral conversations

Bogdan Ionut CirsteaJul 1, 2023, 1:57 PM
17 points

9 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Clas­sify­ing rep­re­sen­ta­tions of sparse au­toen­coders (SAEs)

AnnahNov 17, 2023, 1:54 PM
15 points

7 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

Shh, don’t tell the AI it’s likely to be evil

naterushDec 6, 2022, 3:35 AM
19 points

9 votes

Overall karma indicates overall quality.

9 comments1 min readLW link

CRMArena-Pro: Holis­tic Assess­ment of LLM Agents Across Di­verse Busi­ness Sce­nar­ios and Interactions

AnnapurnaJun 12, 2025, 7:53 PM
8 points

3 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

The Limit of Lan­guage Models

DragonGodJan 6, 2023, 11:53 PM
44 points

31 votes

Overall karma indicates overall quality.

26 comments4 min readLW link

Retrieval Aug­mented Genesis

João Ribeiro MedeirosOct 1, 2024, 8:18 PM
6 points

5 votes

Overall karma indicates overall quality.

0 comments29 min readLW link

[Linkpost] Large lan­guage mod­els con­verge to­ward hu­man-like con­cept organization

Bogdan Ionut CirsteaSep 2, 2023, 6:00 AM
22 points

7 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

LLMs and hal­lu­ci­na­tion, like white on rice?

Bill BenzonApr 14, 2023, 7:53 PM
5 points

2 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

What are the limits of su­per­in­tel­li­gence?

rainyApr 27, 2023, 6:29 PM
4 points

5 votes

Overall karma indicates overall quality.

3 comments5 min readLW link

In­flec­tion.ai is a ma­jor AGI lab

Nikola JurkovicAug 9, 2023, 1:05 AM
137 points

95 votes

Overall karma indicates overall quality.

13 comments2 min readLW link

What’s go­ing on with Per-Com­po­nent Weight Up­dates?

4gateAug 22, 2024, 9:22 PM
1 point

3 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

[Question] Are nested jailbreaks in­evitable?

judsonMar 17, 2023, 5:43 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Lan­guage mod­els can ex­plain neu­rons in lan­guage models

nzMay 9, 2023, 5:29 PM
23 points

12 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(openai.com)

Struc­tural Res­o­nance Emit­ter: When GPT Stops Eval­u­at­ing and Starts Reconstructing

KiyoshiSasanoApr 20, 2025, 2:30 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

PvsNp Refute

Jai DozMay 8, 2025, 6:56 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments21 min readLW link

A vi­sual anal­ogy for text gen­er­a­tion by LLMs?

Bill BenzonDec 16, 2023, 5:58 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

GPT-4o Guardrails Gone: Data Poi­son­ing & Jailbreak-Tuning

Nov 1, 2024, 12:10 AM
18 points

8 votes

Overall karma indicates overall quality.

0 comments6 min readLW link
(far.ai)

Re­search agenda: Can trans­form­ers do sys­tem 2 think­ing?

p.b.Apr 6, 2022, 1:31 PM
20 points

9 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Avoid­ing jailbreaks by dis­cour­ag­ing their rep­re­sen­ta­tion in ac­ti­va­tion space

Guido BergmanSep 27, 2024, 5:49 PM
8 points

7 votes

Overall karma indicates overall quality.

2 comments9 min readLW link

Yann LeCun, A Path Towards Au­tonomous Ma­chine In­tel­li­gence [link]

Bill BenzonJun 27, 2022, 11:29 PM
5 points

7 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

Why does Claude Speak Byzan­tine Mu­sic No­ta­tion?

Lennart FinkeMar 31, 2025, 3:13 PM
18 points

8 votes

Overall karma indicates overall quality.

2 comments3 min readLW link

The Com­mon Pile and Comma-v0.1

Trevor Hill-HandJun 6, 2025, 7:20 PM
3 points

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Emotion Is Structure: Toward Recursive Alignment Through Human–AI Co-Creation

thesignalthatcouldntbeheardAug 3, 2025, 5:19 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

A public archive of these interactions, with annotated examples, is available here: https://github.com/0118young/gpt-kyeol-archive

0118youngMay 29, 2025, 5:44 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Take­aways from our ro­bust in­jury clas­sifier pro­ject [Red­wood Re­search]

dmzSep 17, 2022, 3:55 AM
143 points

64 votes

Overall karma indicates overall quality.

12 comments6 min readLW link1 review

Lan­guage and Ca­pa­bil­ities: Test­ing LLM Math­e­mat­i­cal Abil­ities Across Languages

Ethan EdwardsApr 4, 2024, 1:18 PM
24 points

13 votes

Overall karma indicates overall quality.

2 comments36 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:08 AM
12 points

8 votes

Overall karma indicates overall quality.

10 comments30 min readLW link

An alternative to PPO towards alignment

ml hkustApr 17, 2023, 5:58 PM
2 points

3 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

Policy En­tropy, Learn­ing, and Align­ment (Or Maybe Your LLM Needs Ther­apy)

sdetureMay 31, 2025, 10:09 PM
15 points

6 votes

Overall karma indicates overall quality.

6 comments8 min readLW link

ChatGPT Plays 20 Ques­tions [some­times needs help]

Bill BenzonOct 17, 2023, 5:30 PM
5 points

2 votes

Overall karma indicates overall quality.

3 comments12 min readLW link

[Linkpost] De­cep­tion Abil­ities Emerged in Large Lan­guage Models

Bogdan Ionut CirsteaAug 3, 2023, 5:28 PM
12 points

8 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

[Question] How does OpenAI’s lan­guage model af­fect our AI timeline es­ti­mates?

jimrandomhFeb 15, 2019, 3:11 AM
50 points

16 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

Eval­u­at­ing LLaMA 3 for poli­ti­cal syco­phancy

alma.liezengaSep 28, 2024, 7:02 PM
2 points

2 votes

Overall karma indicates overall quality.

2 comments6 min readLW link

In­ves­ti­gat­ing the Abil­ity of LLMs to Rec­og­nize Their Own Writing

Jul 30, 2024, 3:41 PM
32 points

9 votes

Overall karma indicates overall quality.

0 comments15 min readLW link

Xanadu, GPT, and Beyond: An ad­ven­ture of the mind

Bill BenzonAug 27, 2023, 4:19 PM
2 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Spec­u­la­tive in­fer­ences about path de­pen­dence in LLM su­per­vised fine-tun­ing from re­sults on lin­ear mode con­nec­tivity and model souping

RobertKirkJul 20, 2023, 9:56 AM
39 points

17 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

CAIS-in­spired ap­proach to­wards safer and more in­ter­pretable AGIs

Peter HroššoMar 27, 2023, 2:36 PM
13 points

5 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

RL with KL penalties is bet­ter seen as Bayesian inference

May 25, 2022, 9:23 AM
115 points

47 votes

Overall karma indicates overall quality.

17 comments12 min readLW link

Help ARC eval­u­ate ca­pa­bil­ities of cur­rent lan­guage mod­els (still need peo­ple)

Beth BarnesJul 19, 2022, 4:55 AM
95 points

36 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

[AN #144]: How lan­guage mod­els can also be fine­tuned for non-lan­guage tasks

Rohin ShahApr 2, 2021, 5:20 PM
19 points

8 votes

Overall karma indicates overall quality.

0 comments6 min readLW link
(mailchi.mp)

An ex­per­i­ment on hid­den cognition

Olli JärviniemiJul 22, 2024, 3:26 AM
25 points

9 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Trans­former Ar­chi­tec­ture Choice for Re­sist­ing Prompt In­jec­tion and Jail-Break­ing Attacks

RogerDearnaleyMay 21, 2023, 8:29 AM
9 points

3 votes

Overall karma indicates overall quality.

1 comment4 min readLW link

Pre­train­ing Lan­guage Models with Hu­man Preferences

Feb 21, 2023, 5:57 PM
135 points

59 votes

Overall karma indicates overall quality.

20 comments11 min readLW link2 reviews

Emer­gent Misal­ign­ment on a Budget

Jun 8, 2025, 3:28 PM
54 points

29 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

LLM cog­ni­tion is prob­a­bly not hu­man-like

Max HMay 8, 2023, 1:22 AM
26 points

13 votes

Overall karma indicates overall quality.

15 comments7 min readLW link

Pre­dict­ing AGI by the Tur­ing Test

Yuxi_LiuJan 22, 2024, 4:22 AM
21 points

6 votes

Overall karma indicates overall quality.

2 comments10 min readLW link
(yuxi-liu-wired.github.io)

Google Deep­Mind’s RT-2

SandXboxAug 11, 2023, 11:26 AM
9 points

5 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(robotics-transformer2.github.io)

The Pat­tern Recog­ni­tion Frame­work: A New Ap­proach to AI Con­scious­ness and Alignment

Easa AhmadzaiJul 9, 2025, 5:03 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Agen­tic Lan­guage Model Memes

FactorialCodeAug 1, 2020, 6:03 PM
16 points

6 votes

Overall karma indicates overall quality.

1 comment2 min readLW link

LLMs could be as con­scious as hu­man em­u­la­tions, potentially

CanalettoApr 30, 2024, 11:36 AM
15 points

11 votes

Overall karma indicates overall quality.

15 comments3 min readLW link

Con­tra Hofs­tadter on GPT-3 Nonsense

ricticJun 15, 2022, 9:53 PM
238 points

145 votes

Overall karma indicates overall quality.

24 comments2 min readLW link

Au­to­mat­i­cally find­ing fea­ture vec­tors in the OV cir­cuits of Trans­form­ers with­out us­ing probing

Jacob DunefskySep 12, 2023, 5:38 PM
16 points

10 votes

Overall karma indicates overall quality.

2 comments29 min readLW link

A re­sponse to Con­jec­ture’s CoEm proposal

Kristian FreedApr 24, 2023, 5:23 PM
7 points

5 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Why LLMs Waste So Much Cog­ni­tive Band­width — and How to Fix It

LunarknotJul 3, 2025, 9:47 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Challenge pro­posal: small­est pos­si­ble self-hard­en­ing back­door for RLHF

Christopher KingJun 29, 2023, 4:56 PM
7 points

3 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Hu­man-level Full-Press Di­plo­macy (some bare facts).

Cleo NardoNov 22, 2022, 8:59 PM
50 points

25 votes

Overall karma indicates overall quality.

7 comments3 min readLW link

“Re­la­tional In­tel­li­gence Without Con­scious­ness: A Case Study in Emer­gent Hu­man–LLM Iden­tity Co-Creation”

the3rdcastlemanJul 19, 2025, 6:25 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments5 min readLW link

Ex­plo­ra­tion of Coun­ter­fac­tual Im­por­tance and At­ten­tion Heads

RealmbirdSep 30, 2025, 1:17 AM
12 points

4 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

In­tro­duc­ing Deepgeek

LigeiaApr 1, 2025, 4:41 PM
16 points

9 votes

Overall karma indicates overall quality.

1 comment4 min readLW link

A con­cep­tual pre­cur­sor to to­day’s lan­guage ma­chines [Shan­non]

Bill BenzonNov 15, 2023, 1:50 PM
24 points

11 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

What’s ChatGPT’s Fa­vorite Ice Cream Fla­vor? An In­ves­ti­ga­tion Into Syn­thetic Respondents

Greg RobisonFeb 9, 2024, 6:38 PM
19 points

7 votes

Overall karma indicates overall quality.

4 comments15 min readLW link

The idea that ChatGPT is sim­ply “pre­dict­ing” the next word is, at best, misleading

Bill BenzonFeb 20, 2023, 11:32 AM
55 points

88 votes

Overall karma indicates overall quality.

88 comments5 min readLW link

Microsoft and OpenAI, stop tel­ling chat­bots to role­play as AI

hold_my_fishFeb 17, 2023, 7:55 PM
50 points

40 votes

Overall karma indicates overall quality.

10 comments1 min readLW link

Data and “to­kens” a 30 year old hu­man “trains” on

Jose Miguel Cruz y CelisMay 23, 2023, 5:34 AM
16 points

11 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

Sparse Au­toen­coders Work on At­ten­tion Layer Outputs

Jan 16, 2024, 12:26 AM
85 points

33 votes

Overall karma indicates overall quality.

9 comments18 min readLW link

In­trin­sic Di­men­sion of Prompts in LLMs

Karthik ViswanathanFeb 14, 2025, 7:02 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

The Com­pleat Cybornaut

May 19, 2023, 8:44 AM
66 points

34 votes

Overall karma indicates overall quality.

2 comments16 min readLW link

Work­shop: In­ter­pretabil­ity in LLMs us­ing Geo­met­ric and Statis­ti­cal Methods

Karthik ViswanathanFeb 22, 2025, 9:39 AM
17 points

7 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Hard-Cod­ing Neu­ral Computation

MadHatterDec 13, 2021, 4:35 AM
34 points

16 votes

Overall karma indicates overall quality.

8 comments27 min readLW link

Who models the models that model models? An exploration of GPT-3’s in-context model fitting ability

LovreJun 7, 2022, 7:37 PM
112 points

67 votes

Overall karma indicates overall quality.

16 comments9 min readLW link

A poem co-writ­ten by ChatGPT

SherrinfordFeb 16, 2023, 10:17 AM
13 points

3 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

In­duc­ing hu­man-like bi­ases in moral rea­son­ing LMs

Feb 20, 2024, 4:28 PM
23 points

16 votes

Overall karma indicates overall quality.

3 comments14 min readLW link

Re­ward hack­ing is be­com­ing more so­phis­ti­cated and de­liber­ate in fron­tier LLMs

Kei Nishimura-GasparianApr 24, 2025, 4:03 PM
95 points

32 votes

Overall karma indicates overall quality.

6 comments1 min readLW link

Toward a Hu­man Hy­brid Lan­guage for En­hanced Hu­man-Ma­chine Com­mu­ni­ca­tion: Ad­dress­ing the AI Align­ment Problem

Andndn DheudndAug 14, 2024, 10:19 PM
−4 points

8 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

Does ro­bust­ness im­prove with scale?

Jul 25, 2024, 8:55 PM
14 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(far.ai)

See­ing Ghosts by GPT-4

Christopher KingMay 20, 2023, 12:11 AM
−13 points

5 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Grad­ual take­off, fast failure

Max HMar 16, 2023, 10:02 PM
15 points

7 votes

Overall karma indicates overall quality.

4 comments5 min readLW link

End-to-end hack­ing with lan­guage models

tchauvinApr 5, 2024, 3:06 PM
29 points

18 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

If lan­guage is for com­mu­ni­ca­tion, what does that im­ply about LLMs?

Bill BenzonMay 12, 2024, 2:55 AM
10 points

5 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Con­nect­ing the Dots: LLMs can In­fer & Ver­bal­ize La­tent Struc­ture from Train­ing Data

Jun 21, 2024, 3:54 PM
163 points

56 votes

Overall karma indicates overall quality.

13 comments8 min readLW link
(arxiv.org)

Sys­tem­atic run­away-op­ti­miser-like LLM failure modes on Biolog­i­cally and Eco­nom­i­cally al­igned AI safety bench­marks for LLMs with sim­plified ob­ser­va­tion for­mat (BioBlue)

Mar 16, 2025, 11:23 PM
45 points

12 votes

Overall karma indicates overall quality.

8 comments12 min readLW link

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaleyFeb 14, 2024, 7:10 AM
41 points

12 votes

Overall karma indicates overall quality.

12 comments31 min readLW link

Lan­guage, logic, and the fu­ture of AI: An early-Wittgen­stei­nian perspective

Konstantinos TsermenidisMay 25, 2025, 2:23 PM
0 points

0 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

[Question] Can any LLM be rep­re­sented as an Equa­tion?

Valentin BaltadzhievMar 14, 2024, 9:51 AM
1 point

4 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Race and Gen­der Bias As An Ex­am­ple of Un­faith­ful Chain of Thought in the Wild

Jul 2, 2025, 4:35 PM
183 points

96 votes

Overall karma indicates overall quality.

25 comments4 min readLW link

Meta re­leases Llama-4 herd of models

winstonBosanApr 5, 2025, 7:51 PM
14 points

6 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Emer­gent Analog­i­cal Rea­son­ing in Large Lan­guage Models

Roman LeventovMar 22, 2023, 5:18 AM
13 points

4 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(arxiv.org)

the ten­sor is a lonely place

jml6Mar 27, 2023, 6:22 PM
−11 points

4 votes

Overall karma indicates overall quality.

0 comments4 min readLW link
(ekjsgrjelrbno.substack.com)

LLM Pareto Fron­tier But Live

winstonBosanApr 24, 2025, 9:22 PM
8 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Novel Idea Gen­er­a­tion in LLMs: Judg­ment as Bottleneck

Davey MorseApr 19, 2025, 3:37 PM
−2 points

4 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

How LLMs Work, in the Style of The Economist

utilistrutilApr 22, 2024, 7:06 PM
0 points

10 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tameraAug 3, 2022, 12:03 PM
138 points

72 votes

Overall karma indicates overall quality.

23 comments6 min readLW link

LLM keys—A Pro­posal of a Solu­tion to Prompt In­jec­tion Attacks

Peter HroššoDec 7, 2023, 5:36 PM
1 point

2 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

The Codex Skep­tic FAQ

Michaël TrazziAug 24, 2021, 4:01 PM
49 points

25 votes

Overall karma indicates overall quality.

24 comments2 min readLW link

A Lived Align­ment Loop: Sym­bolic Emer­gence and Emo­tional Co­her­ence from Un­struc­tured ChatGPT Reflection

BradCLJun 17, 2025, 12:11 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

On pre­cise out-of-con­text steering

Olli JärviniemiMay 3, 2024, 9:41 AM
9 points

6 votes

Overall karma indicates overall quality.

6 comments3 min readLW link

Cri­tique of some re­cent philos­o­phy of LLMs’ minds

Roman LeventovJan 20, 2023, 12:53 PM
52 points

25 votes

Overall karma indicates overall quality.

8 comments20 min readLW link

Early situ­a­tional aware­ness and its im­pli­ca­tions, a story

Jacob PfauFeb 6, 2023, 8:45 PM
29 points

8 votes

Overall karma indicates overall quality.

6 comments3 min readLW link

Many ar­gu­ments for AI x-risk are wrong

TurnTroutMar 5, 2024, 2:31 AM
171 points

145 votes

Overall karma indicates overall quality.

87 comments12 min readLW link

[Question] Why is Gem­ini tel­ling the user to die?

BurnyNov 18, 2024, 1:44 AM
13 points

6 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

[Question] Goals of model vs. goals of simu­lacra?

dr_sApr 12, 2023, 1:02 PM
5 points

4 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

Look­ing be­yond Everett in mul­ti­ver­sal views of LLMs

kromemMay 29, 2024, 12:35 PM
10 points

4 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

[Question] What ex­per­i­ment set­tles the Gary Mar­cus vs Ge­offrey Hin­ton de­bate?

Valentin BaltadzhievFeb 14, 2024, 9:06 AM
12 points

4 votes

Overall karma indicates overall quality.

8 comments1 min readLW link

Can a se­man­tic com­pres­sion ker­nel like WFGY im­prove LLM al­ign­ment and in­sti­tu­tional ro­bust­ness?

onestardaoJul 18, 2025, 2:56 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

En­tan­gle­ment and in­tu­ition about words and mean­ing

Bill BenzonOct 4, 2023, 2:16 PM
4 points

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

I Don’t Use AI — I Reflect With It

badjack badjackMay 3, 2025, 2:45 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

My agenda for re­search into trans­former ca­pa­bil­ities—Introduction

p.b.Apr 5, 2022, 9:23 PM
11 points

6 votes

Overall karma indicates overall quality.

1 comment3 min readLW link

The Lan­guage Bot­tle­neck in AI Rea­son­ing: Are We For­get­ting to Think?

WotakerMar 8, 2025, 1:44 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments7 min readLW link

In­stan­ti­at­ing an agent with GPT-4 and text-davinci-003

Max HMar 19, 2023, 11:57 PM
13 points

11 votes

Overall karma indicates overall quality.

3 comments32 min readLW link

Re­la­tion­ships among words, met­al­in­gual defi­ni­tion, and interpretability

Bill BenzonJun 7, 2024, 7:18 PM
2 points

3 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Towards a Ty­pol­ogy of Strange LLM Chains-of-Thought

1a3ornOct 9, 2025, 10:02 PM
290 points

136 votes

Overall karma indicates overall quality.

27 comments9 min readLW link

Large lan­guage mod­els can provide “nor­ma­tive as­sump­tions” for learn­ing hu­man preferences

Stuart_ArmstrongJan 2, 2023, 7:39 PM
29 points

9 votes

Overall karma indicates overall quality.

12 comments3 min readLW link

Are (at least some) Large Lan­guage Models Holo­graphic Me­mory Stores?

Bill BenzonOct 20, 2023, 1:07 PM
11 points

6 votes

Overall karma indicates overall quality.

4 comments6 min readLW link

Open Source LLMs Can Now Ac­tively Lie

Josh LevyJun 1, 2023, 10:03 PM
6 points

4 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

Cat­e­gor­i­cal Or­ga­ni­za­tion in Me­mory: ChatGPT Or­ga­nizes the 665 Topic Tags from My New Sa­vanna Blog

Bill BenzonDec 14, 2023, 1:02 PM
0 points

6 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

PaLM in “Ex­trap­o­lat­ing GPT-N perfor­mance”

Lukas FinnvedenApr 6, 2022, 1:05 PM
85 points

43 votes

Overall karma indicates overall quality.

19 comments2 min readLW link

Ex­per­i­ments with an al­ter­na­tive method to pro­mote spar­sity in sparse autoencoders

Eoin FarrellApr 15, 2024, 6:21 PM
29 points

14 votes

Overall karma indicates overall quality.

7 comments12 min readLW link

On the ge­o­met­ri­cal Na­ture of Insight

Giuseppe BirardiJul 16, 2025, 7:12 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments41 min readLW link

Image Hi­jacks: Ad­ver­sar­ial Images can Con­trol Gen­er­a­tive Models at Runtime

Sep 20, 2023, 3:23 PM
58 points

27 votes

Overall karma indicates overall quality.

9 comments1 min readLW link
(arxiv.org)

False Pos­i­tives in En­tity-Level Hal­lu­ci­na­tion De­tec­tion: A Tech­ni­cal Challenge

MaxKamacheeJan 14, 2025, 7:22 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibsDec 5, 2022, 1:36 PM
20 points

15 votes

Overall karma indicates overall quality.

11 comments2 min readLW link

Philo­soph­i­cal Jailbreaks: Demo of LLM Nihilism

artkpvJun 4, 2025, 12:03 PM
3 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Boundary Con­di­tions: A Solu­tion to the Sym­bol Ground­ing Prob­lem, and a Warning

ISCApr 8, 2025, 6:42 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments5 min readLW link

Con­jec­ture: Emer­gent φ is prov­able in Large Lan­guage Models

BarnicleBarnOct 18, 2025, 10:38 PM
−3 points

2 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

An LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:12 AM
16 points

5 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

The mi­s­un­der­stood role of the phys­i­cal world. Why AI still can’t mas­ter math or code

NewAiParadigmsSep 20, 2025, 3:14 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments63 min readLW link

Is GPT-N bounded by hu­man ca­pa­bil­ities? No.

Cleo NardoOct 17, 2022, 11:26 PM
49 points

18 votes

Overall karma indicates overall quality.

8 comments2 min readLW link

Fa­vorite col­ors of some LLMs.

CanalettoDec 31, 2024, 9:22 PM
10 points

8 votes

Overall karma indicates overall quality.

3 comments7 min readLW link

Think­ing Without Out­put: Toward Mo­dal Cog­ni­tion in Lan­guage Models

Jeffrie PolisMay 9, 2025, 7:41 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Min­i­mal Prompt In­duc­tion of Self-Talk in Base LLMs

dwmdOct 15, 2025, 1:15 AM
2 points

2 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

An­a­lyz­ing how SAE fea­tures evolve across a for­ward pass

Nov 7, 2024, 10:07 PM
47 points

40 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

Why Scal­ing Creates “Out-of-Nowhere” Jumps

DeckardAug 14, 2025, 8:26 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Shal­low vs. Deep Think­ing—Why LLMs Fall Short

Taylor G. LuntSep 3, 2025, 3:26 PM
2 points

6 votes

Overall karma indicates overall quality.

4 comments11 min readLW link

Claude wants to be conscious

Joe KwonApr 13, 2024, 1:40 AM
2 points

8 votes

Overall karma indicates overall quality.

8 comments6 min readLW link

They gave LLMs ac­cess to physics simulators

ryan_bOct 17, 2022, 9:21 PM
50 points

28 votes

Overall karma indicates overall quality.

18 comments1 min readLW link
(arxiv.org)

Work­ing with AI: Mea­sur­ing the Oc­cu­pa­tional Im­pli­ca­tions of Gen­er­a­tive AI

AnnapurnaAug 9, 2025, 4:20 PM
5 points

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link
(jorgevelez.substack.com)

GPT-4.5 is Cog­ni­tive Em­pa­thy, Son­net 3.5 is Affec­tive Empathy

JackApr 16, 2025, 7:12 PM
15 points

10 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

If it quacks like a duck...

RationalMindsetMar 26, 2023, 6:54 PM
−4 points

5 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

How I’m think­ing about GPT-N

delton137Jan 17, 2022, 5:11 PM
54 points

34 votes

Overall karma indicates overall quality.

21 comments18 min readLW link

Base LLMs re­fuse too

Sep 29, 2024, 4:04 PM
61 points

22 votes

Overall karma indicates overall quality.

20 comments10 min readLW link

The In­finite Choice Bar­rier: Why Al­gorith­mic AGI Is Math­e­mat­i­cally Im­pos­si­ble

ICBMaxMSJun 1, 2025, 4:12 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

The case for more am­bi­tious lan­guage model evals

JozdienJan 30, 2024, 12:01 AM
117 points

64 votes

Overall karma indicates overall quality.

30 comments5 min readLW link

An ad­ver­sar­ial ex­am­ple for Direct Logit At­tri­bu­tion: mem­ory man­age­ment in gelu-4l

Aug 30, 2023, 5:36 PM
17 points

7 votes

Overall karma indicates overall quality.

0 comments8 min readLW link
(arxiv.org)

[Question] What faith­ful­ness met­rics should gen­eral claims about CoT faith­ful­ness be based upon?

Rauno ArikeApr 8, 2025, 3:27 PM
24 points

8 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Sparse Au­toen­coder Fea­tures for Clas­sifi­ca­tions and Transferability

Shan23ChenFeb 18, 2025, 10:14 PM
5 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

ASI-ARCH: “Does this hold up?”

DataDeLaurierJul 26, 2025, 10:30 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Re­fusal mechanisms: ini­tial ex­per­i­ments with Llama-2-7b-chat

Dec 8, 2023, 5:08 PM
82 points

37 votes

Overall karma indicates overall quality.

7 comments7 min readLW link

RAND re­port finds no effect of cur­rent LLMs on vi­a­bil­ity of bioter­ror­ism attacks

StellaAthenaJan 25, 2024, 7:17 PM
94 points

36 votes

Overall karma indicates overall quality.

14 comments1 min readLW link
(www.rand.org)

From Messy Shelves to Master Librar­i­ans: Toy-Model Ex­plo­ra­tion of Block-Di­ag­o­nal Geom­e­try in LM Activations

YuxiaoJul 19, 2025, 12:26 PM
6 points

5 votes

Overall karma indicates overall quality.

1 comment4 min readLW link

The Quan­ti­za­tion Model of Neu­ral Scaling

nzMar 31, 2023, 4:02 PM
17 points

8 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

Chronos­ta­sis: The Time-Cap­sule Co­nun­drum of Lan­guage Models

RationalMindsetMar 26, 2023, 6:54 PM
−5 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

The REPHRASE Cir­cuit: How Fine-Tun­ing En­hances LLMs to REPHRASE Text

Karthik ViswanathanApr 6, 2025, 3:02 PM
4 points

3 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

AMA on Truth­ful AI: Owen Cot­ton-Bar­ratt, Owain Evans & co-authors

Owain_EvansOct 22, 2021, 4:23 PM
31 points

8 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

Read­abil­ity is mostly a waste of characters

vlad.proexApr 21, 2023, 10:05 PM
21 points

16 votes

Overall karma indicates overall quality.

7 comments3 min readLW link

Post-hoc rea­son­ing in chain of thought

Kyle CoxFeb 5, 2025, 6:58 PM
19 points

13 votes

Overall karma indicates overall quality.

0 comments11 min readLW link

SAE Train­ing Dataset In­fluence in Fea­ture Match­ing and a Hy­poth­e­sis on Po­si­tion Features

Seonglae ChoFeb 26, 2025, 5:05 PM
4 points

4 votes

Overall karma indicates overall quality.

3 comments17 min readLW link

Does GPT-4 ex­hibit agency when sum­ma­riz­ing ar­ti­cles?

Christopher KingMar 24, 2023, 3:49 PM
16 points

10 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

De­tect­ing out of dis­tri­bu­tion text with sur­prisal and entropy

Sandy FraserJan 28, 2025, 6:46 PM
24 points

11 votes

Overall karma indicates overall quality.

4 comments11 min readLW link

Deep­Seek Col­lapse Un­der Reflec­tive Ad­ver­sar­ial Pres­sure: A Case Study

unmodeled.tylerJul 8, 2025, 9:14 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

[Question] What evidence is there of LLMs containing world models?

Chris_LeongOct 4, 2023, 2:33 PM
17 points

10 votes

Overall karma indicates overall quality.

17 comments1 min readLW link

Ed­u­ca­tional CAI: Align­ing a Lan­guage Model with Ped­a­gog­i­cal Theories

Bharath PuranamNov 1, 2024, 6:55 PM
5 points

3 votes

Overall karma indicates overall quality.

1 comment13 min readLW link

Some Ar­gu­ments Against Strong Scaling

Joar SkalseJan 13, 2023, 12:04 PM
25 points

22 votes

Overall karma indicates overall quality.

21 comments16 min readLW link

Let’s look at an­other “LLMs lack true un­der­stand­ing” paper

ExpertiumJun 29, 2025, 2:00 PM
3 points

5 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Using ideologically-charged language to get gpt-3.5-turbo to disobey its system prompt: a demo

Milan WAug 24, 2024, 12:13 AM
3 points

4 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

The Allure of the Dark Side: A Crit­i­cal AI Safety Vuln­er­a­bil­ity I Stum­bled Into

Kareem SolimanJul 28, 2025, 1:20 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Ex­trap­o­lat­ing GPT-N performance

Lukas FinnvedenDec 18, 2020, 9:41 PM
112 points

38 votes

Overall karma indicates overall quality.

31 comments22 min readLW link1 review

GPT-4 is bad at strate­gic thinking

Christopher KingMar 27, 2023, 3:11 PM
22 points

15 votes

Overall karma indicates overall quality.

8 comments1 min readLW link

OpenAI Codex: First Impressions

specbugAug 13, 2021, 4:52 PM
49 points

25 votes

Overall karma indicates overall quality.

8 comments4 min readLW link
(sixeleven.in)

World, mind, and learn­abil­ity: A note on the meta­phys­i­cal struc­ture of the cos­mos [& LLMs]

Bill BenzonSep 5, 2023, 12:19 PM
4 points

1 vote

Overall karma indicates overall quality.

1 comment5 min readLW link

XAI re­leases Grok base model

Jacob G-WMar 18, 2024, 12:47 AM
11 points

8 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(x.ai)

AI Safety via Luck

JozdienApr 1, 2023, 8:13 PM
82 points

53 votes

Overall karma indicates overall quality.

7 comments11 min readLW link

I, Token

Ivan VendrovNov 25, 2024, 2:20 AM
14 points

6 votes

Overall karma indicates overall quality.

2 comments3 min readLW link
(nothinghuman.substack.com)

A Grounded UX Layer for LLMs That Could Prevent Real Harm

ParityMindJul 11, 2025, 6:19 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

In­tri­ca­cies of Fea­ture Geom­e­try in Large Lan­guage Models

Dec 7, 2024, 6:10 PM
71 points

27 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

Comma v0.1 con­verted to GGUF

Trevor Hill-HandOct 18, 2025, 3:54 PM
8 points

7 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

Is In­ter­pretabil­ity All We Need?

RogerDearnaleyNov 14, 2023, 5:31 AM
1 point

1 vote

Overall karma indicates overall quality.

1 comment1 min readLW link

Can a chef with no AI liter­acy make gpt au­dit grok? Ap­par­ently.

Kyle. PJul 6, 2025, 7:23 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Bias-Aug­mented Con­sis­tency Train­ing Re­duces Bi­ased Rea­son­ing in Chain-of-Thought

Miles TurpinMar 11, 2024, 11:46 PM
16 points

7 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

What’s go­ing on? LLMs and IS-A sen­tences

Bill BenzonNov 8, 2023, 4:58 PM
6 points

2 votes

Overall karma indicates overall quality.

15 comments4 min readLW link

Ab­solute Zero: Re­in­forced Self-play Rea­son­ing with Zero Data

Matrice JacobineMay 12, 2025, 3:20 PM
6 points

3 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(www.arxiv.org)

Paper: Large Lan­guage Models Can Self-im­prove [Linkpost]

Evan R. MurphyOct 2, 2022, 1:29 AM
53 points

31 votes

Overall karma indicates overall quality.

15 comments1 min readLW link
(openreview.net)

LLMs are badly misaligned

Joe RogeroOct 5, 2025, 2:00 PM
26 points

26 votes

Overall karma indicates overall quality.

25 comments3 min readLW link

Hal­lu­ci­na­tion and Re­fu­ta­tion: Em­brac­ing Imag­i­na­tion An­chored in Real­ity through Pop­pe­rian AI.

GeorgsLightningApr 21, 2025, 8:42 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments14 min readLW link

Two new datasets for eval­u­at­ing poli­ti­cal syco­phancy in LLMs

alma.liezengaSep 28, 2024, 6:29 PM
9 points

3 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

Truth is Univer­sal: Ro­bust De­tec­tion of Lies in LLMs

Lennart BuergerJul 19, 2024, 2:07 PM
24 points

13 votes

Overall karma indicates overall quality.

3 comments2 min readLW link
(arxiv.org)

Adapt­ing to Change: Over­com­ing Chronos­ta­sis in AI Lan­guage Models

RationalMindsetMar 28, 2023, 2:32 PM
−1 points

4 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

Trans­former lan­guage mod­els are do­ing some­thing more general

NumendilAug 3, 2022, 9:13 PM
53 points

28 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

Notes on Meta’s Di­plo­macy-Play­ing AI

Erich_GrunewaldDec 22, 2022, 11:34 AM
19 points

7 votes

Overall karma indicates overall quality.

2 comments14 min readLW link
(www.erichgrunewald.com)

Graph­i­cal ten­sor no­ta­tion for interpretability

Jordan TaylorOct 4, 2023, 8:04 AM
141 points

72 votes

Overall karma indicates overall quality.

11 comments19 min readLW link

Ac­tAdd: Steer­ing Lan­guage Models with­out Optimization

Sep 6, 2023, 5:21 PM
105 points

31 votes

Overall karma indicates overall quality.

3 comments2 min readLW link
(arxiv.org)

Why I take short timelines seriously

NicholasKeesJan 28, 2024, 10:27 PM
122 points

70 votes

Overall karma indicates overall quality.

29 comments4 min readLW link

Elicit: Lan­guage Models as Re­search Assistants

Apr 9, 2022, 2:56 PM
71 points

35 votes

Overall karma indicates overall quality.

6 comments13 min readLW link

[Linkpost] Scal­ing laws for lan­guage en­cod­ing mod­els in fMRI

Bogdan Ionut CirsteaJun 8, 2023, 10:52 AM
30 points

13 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

How dan­ger­ous is en­coded rea­son­ing?

artkpvJun 30, 2025, 11:54 AM
17 points

7 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

Google AI in­te­grates PaLM with robotics: SayCan up­date [Linkpost]

Evan R. MurphyAug 24, 2022, 8:54 PM
25 points

8 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(sites.research.google)

OpenAI Credit Ac­count (2510$)

Emirhan BULUTJan 21, 2024, 2:30 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

New GPT-3 competitor

Quintin PopeAug 12, 2021, 7:05 AM
32 points

22 votes

Overall karma indicates overall quality.

10 comments1 min readLW link

Self prop­a­gat­ing story.

CanalettoApr 12, 2025, 12:32 PM
3 points

1 vote

Overall karma indicates overall quality.

0 comments8 min readLW link

SAEs are highly dataset de­pen­dent: a case study on the re­fusal direction

Nov 7, 2024, 5:22 AM
67 points

25 votes

Overall karma indicates overall quality.

4 comments14 min readLW link

The Con­cep­tual To­pog­ra­phy Hy­poth­e­sis: Why Emer­gence in LLMs Isn’t Just About Scale

ravikiran nmJul 6, 2025, 1:16 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments6 min readLW link

We In­spected Every Head In GPT-2 Small us­ing SAEs So You Don’t Have To

Mar 6, 2024, 5:03 AM
63 points

28 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

Open Source Au­to­mated In­ter­pretabil­ity for Sparse Au­toen­coder Features

Jul 30, 2024, 9:11 PM
67 points

31 votes

Overall karma indicates overall quality.

1 comment13 min readLW link
(blog.eleuther.ai)

Whisper’s Wild Implications

Ollie JJan 3, 2023, 12:17 PM
24 points

12 votes

Overall karma indicates overall quality.

6 comments5 min readLW link

LLM Sy­co­phancy: groom­ing, proto-sen­tience, or both?

gturner4Oct 13, 2025, 12:58 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

More ex­per­i­ments in GPT-4 agency: writ­ing memos

Christopher KingMar 24, 2023, 5:51 PM
5 points

7 votes

Overall karma indicates overall quality.

2 comments10 min readLW link

[PAPER] Ja­co­bian Sparse Au­toen­coders: Spar­sify Com­pu­ta­tions, Not Just Activations

Lucy FarnikFeb 26, 2025, 12:50 PM
79 points

38 votes

Overall karma indicates overall quality.

8 comments7 min readLW link

[Question] Beyond Bench­marks: A Psy­cho­me­t­ric Ap­proach to AI Evaluation

Kareem SolimanJul 27, 2025, 4:09 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments8 min readLW link

In­de­pen­dent re­search ar­ti­cle an­a­lyz­ing con­sis­tent self-re­ports of ex­pe­rience in ChatGPT and Claude

rifeJan 6, 2025, 5:34 PM
4 points

9 votes

Overall karma indicates overall quality.

20 comments1 min readLW link
(awakenmoon.ai)

Can an LLM iden­tify ring-com­po­si­tion in a liter­ary text? [ChatGPT]

Bill BenzonSep 1, 2023, 2:18 PM
4 points

1 vote

Overall karma indicates overall quality.

2 comments11 min readLW link

Black-box in­ter­pretabil­ity method­ol­ogy blueprint: Prob­ing run­away op­ti­mi­sa­tion in LLMs

Roland PihlakasJun 22, 2025, 6:16 PM
17 points

5 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

The Last Laugh: Ex­plor­ing the Role of Hu­mor as a Bench­mark for Large Lan­guage Models

Greg RobisonFeb 12, 2024, 6:34 PM
4 points

4 votes

Overall karma indicates overall quality.

6 comments11 min readLW link

De­bat­ing with More Per­sua­sive LLMs Leads to More Truth­ful Answers

Feb 7, 2024, 9:28 PM
89 points

36 votes

Overall karma indicates overall quality.

14 comments9 min readLW link
(arxiv.org)

Dis­cus­sion: Challenges with Un­su­per­vised LLM Knowl­edge Discovery

Dec 18, 2023, 11:58 AM
149 points

57 votes

Overall karma indicates overall quality.

21 comments10 min readLW link

Un­learn­ing Needs to be More Selec­tive [Progress Re­port]

Jun 27, 2025, 4:38 PM
24 points

10 votes

Overall karma indicates overall quality.

6 comments3 min readLW link

Gen­er­at­ing the Fun­niest Joke with RL (ac­cord­ing to GPT-4.1)

aggMay 16, 2025, 5:09 AM
103 points

65 votes

Overall karma indicates overall quality.

22 comments4 min readLW link

How Do We Eval­u­ate AI Eval­u­a­tions?

Satyapriya KrishnaOct 13, 2025, 10:20 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

GPT-2 Some­times Fails at IOI

Ronak_MehtaAug 14, 2024, 11:24 PM
13 points

12 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(ronakrm.github.io)

Field Re­port: When Claude Said ‘I Love You’

SYNTXJun 16, 2025, 12:05 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Live The­ory Part 0: Tak­ing In­tel­li­gence Seriously

SahilJun 26, 2024, 9:37 PM
103 points

44 votes

Overall karma indicates overall quality.

3 comments8 min readLW link

An Un­ex­pected GPT-3 De­ci­sion in a Sim­ple Gam­ble

casualphysicsenjoyerSep 25, 2022, 4:46 PM
8 points

3 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

In­vert­ing the Most For­bid­den Tech­nique: What hap­pens when we train LLMs to lie de­tectably?

Peter JordanOct 9, 2025, 12:43 AM
20 points

8 votes

Overall karma indicates overall quality.

3 comments4 min readLW link

Bing chat is the AI fire alarm

RatiosFeb 17, 2023, 6:51 AM
115 points

94 votes

Overall karma indicates overall quality.

63 comments3 min readLW link

[Question] Could LLMs Help Gen­er­ate New Con­cepts in Hu­man Lan­guage?

Pekka LampeltoMar 24, 2024, 8:13 PM
10 points

6 votes

Overall karma indicates overall quality.

4 comments2 min readLW link

Ex­plor­ing the Resi­d­ual Stream of Trans­form­ers for Mechanis­tic In­ter­pretabil­ity — Explained

Zeping YuDec 26, 2023, 12:36 AM
7 points

3 votes

Overall karma indicates overall quality.

1 comment11 min readLW link

Are SAE fea­tures from the Base Model still mean­ingful to LLaVA?

Shan23ChenDec 5, 2024, 7:24 PM
5 points

5 votes

Overall karma indicates overall quality.

2 comments10 min readLW link

2+2: On­tolog­i­cal Framework

LyrialtusFeb 1, 2022, 1:07 AM
−15 points

7 votes

Overall karma indicates overall quality.

2 comments12 min readLW link

At­ten­tion SAEs Scale to GPT-2 Small

Feb 3, 2024, 6:50 AM
78 points

26 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

[Question] Is LLM Trans­la­tion Without Rosetta Stone pos­si­ble?

cubefoxApr 11, 2024, 12:36 AM
36 points

21 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

[Paper] Hid­den in Plain Text: Emer­gence and Miti­ga­tion of Stegano­graphic Col­lu­sion in LLMs

Sep 25, 2024, 2:52 PM
37 points

21 votes

Overall karma indicates overall quality.

2 comments4 min readLW link
(arxiv.org)

Two very differ­ent ex­pe­riences with ChatGPT

SherrinfordFeb 7, 2023, 1:09 PM
38 points

14 votes

Overall karma indicates overall quality.

15 comments5 min readLW link

LLMs Still Suck at Log­i­cal Reasoning

anovikovJul 18, 2025, 6:35 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Cat­e­gory-The­o­retic Wan­der­ings into Interpretability

unruly abstractionsSep 2, 2025, 12:03 AM
18 points

7 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(www.unrulyabstractions.com)

In­ves­ti­gat­ing causal un­der­stand­ing in LLMs

Jun 14, 2022, 1:57 PM
28 points

16 votes

Overall karma indicates overall quality.

6 comments13 min readLW link

A short es­say on Illusions

cris.Sep 1, 2025, 10:25 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

In-Con­text Learn­ing: An Align­ment Survey

alamertonSep 30, 2024, 6:44 PM
8 points

5 votes

Overall karma indicates overall quality.

0 comments20 min readLW link
(docs.google.com)

Why Read Novels? (Do Words Mean Much?)

ussySep 5, 2025, 12:25 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Self-Con­trol of LLM Be­hav­iors by Com­press­ing Suffix Gra­di­ent into Pre­fix Controller

Henry CaiJun 16, 2024, 1:01 PM
7 points

7 votes

Overall karma indicates overall quality.

0 comments7 min readLW link
(arxiv.org)

Gen­er­at­ing Cog­nate­ful Sen­tences with Large Lan­guage Models

vkethanaJan 6, 2025, 6:40 PM
8 points

5 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

Is Wittgen­stein’s Lan­guage Game used when helping Ai un­der­stand lan­guage?

VisionaryHeraJun 4, 2024, 7:41 AM
3 points

2 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

Re­search agenda—Build­ing a multi-modal chess-lan­guage model

p.b.Apr 7, 2022, 12:25 PM
8 points

4 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Ex­pec­ta­tions for Gem­ini: hope­fully not a big deal

Maxime RichéOct 2, 2023, 3:38 PM
15 points

15 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Self lo­ca­tion for LLMs by LLMs: Self-Assess­ment Check­list.

CanalettoSep 26, 2024, 7:57 PM
11 points

2 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Com­po­si­tional prefer­ence mod­els for al­ign­ing LMs

Tomek KorbakOct 25, 2023, 12:17 PM
18 points

6 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

Ex­tract-and-Eval­u­ate Mon­i­tor­ing Can Sig­nifi­cantly En­hance CoT Mon­i­tor Perfor­mance (Re­search Note)

Aug 8, 2025, 10:41 AM
51 points

19 votes

Overall karma indicates overall quality.

7 comments10 min readLW link

How I force LLMs to gen­er­ate cor­rect code

claudioMar 21, 2025, 2:40 PM
91 points

47 votes

Overall karma indicates overall quality.

7 comments5 min readLW link

[linkpost] The fi­nal AI bench­mark: BIG-bench

RomanSJun 10, 2022, 8:53 AM
25 points

20 votes

Overall karma indicates overall quality.

21 comments1 min readLW link

AI Aware­ness through In­ter­ac­tion with Blatantly Alien Models

VojtaKovarikJul 28, 2023, 8:41 AM
7 points

3 votes

Overall karma indicates overall quality.

5 comments3 min readLW link

[AN #164]: How well can lan­guage mod­els write code?

Rohin ShahSep 15, 2021, 5:20 PM
13 points

4 votes

Overall karma indicates overall quality.

7 comments9 min readLW link
(mailchi.mp)

Lan­guage Models Model Us

eggsyntaxMay 17, 2024, 9:00 PM
159 points

70 votes

Overall karma indicates overall quality.

55 comments7 min readLW link

Us­ing Psy­chol­in­guis­tic Sig­nals to Im­prove AI Safety

JkreindlerAug 27, 2025, 10:30 PM
−2 points

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

ChatGPT (and now GPT4) is very eas­ily dis­tracted from its rules

dmcsMar 15, 2023, 5:55 PM
180 points

101 votes

Overall karma indicates overall quality.

42 comments1 min readLW link

Memetic Judo #3: The In­tel­li­gence of Stochas­tic Par­rots v.2

Max TKAug 20, 2023, 3:18 PM
8 points

12 votes

Overall karma indicates overall quality.

33 comments6 min readLW link

AISC pro­ject: TinyEvals

Jett JaniakNov 22, 2023, 8:47 PM
22 points

11 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Edge Cases in AI Alignment

Florian_DietzMar 24, 2025, 9:27 AM
19 points

6 votes

Overall karma indicates overall quality.

3 comments4 min readLW link

At last! ChatGPT does, shall we say, in­ter­est­ing imi­ta­tions of “Kubla Khan”

Bill BenzonApr 24, 2024, 2:56 PM
−3 points

2 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Jailbreak­ing ChatGPT and Claude us­ing Web API Con­text Injection

Jaehyuk LimOct 21, 2024, 9:34 PM
4 points

7 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

From No Mind to a Mind – A Con­ver­sa­tion That Changed an AI

parthibanarjuna sFeb 7, 2025, 11:50 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

La­tent Se­man­tic Com­pres­sion Trig­gers Bi­nary Model Behavior

Elias VölkerJun 12, 2025, 1:12 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

In­duc­ing Un­prompted Misal­ign­ment in LLMs

Apr 19, 2024, 8:00 PM
38 points

26 votes

Overall karma indicates overall quality.

7 comments16 min readLW link

Pow­er­ful mesa-op­ti­mi­sa­tion is already here

Roman LeventovFeb 17, 2023, 4:59 AM
35 points

17 votes

Overall karma indicates overall quality.

1 comment2 min readLW link
(arxiv.org)

Against LLM Reductionism

Erich_GrunewaldMar 8, 2023, 3:52 PM
140 points

67 votes

Overall karma indicates overall quality.

17 comments18 min readLW link
(www.erichgrunewald.com)

[Question] Any re­search in “probe-tun­ing” of LLMs?

Roman LeventovAug 15, 2023, 9:01 PM
20 points

4 votes

Overall karma indicates overall quality.

3 comments1 min readLW link

Pro­gram­ming AGI is impossible

Áron EcsenyiMay 30, 2023, 11:05 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Think­ing Through AI: Why LLMs Are Lenses, Not Subjects

SolanJul 6, 2025, 7:58 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Liquid Neu­ral Net­works: A Step Toward AI Flex­i­bil­ity, but Not AGI

ezaanaminApr 2, 2025, 4:10 AM
0 points

0 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

The is­sue of mean­ing in large lan­guage mod­els (LLMs)

Bill BenzonMar 11, 2023, 11:00 PM
1 point

6 votes

Overall karma indicates overall quality.

34 comments8 min readLW link

Was Homer a stochas­tic par­rot? Mean­ing in liter­ary texts and LLMs

Bill BenzonApr 13, 2023, 4:44 PM
7 points

5 votes

Overall karma indicates overall quality.

4 comments3 min readLW link

[un­ti­tled post]

verwindungSep 14, 2023, 4:22 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Con­di­tion­ing, Prompts, and Fine-Tuning

Adam JermynAug 17, 2022, 8:52 PM
38 points

11 votes

Overall karma indicates overall quality.

9 comments4 min readLW link

A note on ‘semiotic physics’

metasemiFeb 11, 2023, 5:12 AM
11 points

10 votes

Overall karma indicates overall quality.

13 comments6 min readLW link

Quick Thoughts on Lan­guage Models

RohanSJul 18, 2023, 8:38 PM
6 points

4 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

A Sum­mary Of An­thropic’s First Paper

Sam RingerDec 30, 2021, 12:48 AM
86 points

45 votes

Overall karma indicates overall quality.

1 comment8 min readLW link

Maybe talk­ing isn’t the best way to com­mu­ni­cate with LLMs

mnvrJan 17, 2024, 6:24 AM
3 points

2 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(mrmr.io)

VATS-A Con­cep­tual To­ken Ar­range­ment Frame­work for Con­text-Aware Generation

nian412May 16, 2025, 8:24 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

Dec 14, 2022, 2:33 PM
29 points

17 votes

Overall karma indicates overall quality.

5 comments11 min readLW link

The Prospect of an AI Winter

Erich_GrunewaldMar 27, 2023, 8:55 PM
62 points

26 votes

Overall karma indicates overall quality.

24 comments15 min readLW link
(www.erichgrunewald.com)

Which AI Safety Bench­mark Do We Need Most in 2025?

Nov 17, 2024, 11:50 PM
2 points

2 votes

Overall karma indicates overall quality.

2 comments8 min readLW link

Si­tu­a­tional aware­ness in Large Lan­guage Models

Simon MöllerMar 3, 2023, 6:59 PM
32 points

22 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Your LLM Judge may be biased

Mar 29, 2024, 4:39 PM
37 points

16 votes

Overall karma indicates overall quality.

9 comments6 min readLW link

GPT-4 Predictions

Stephen McAleeseFeb 17, 2023, 11:20 PM
112 points

66 votes

Overall karma indicates overall quality.

27 comments11 min readLW link

In­ter­pretabil­ity through two lenses: biol­ogy and physics

raphaelAug 12, 2025, 8:25 PM
24 points

13 votes

Overall karma indicates overall quality.

4 comments4 min readLW link

LLMs and com­pu­ta­tion complexity

Jonathan MarcusApr 28, 2023, 5:48 PM
57 points

44 votes

Overall karma indicates overall quality.

29 comments5 min readLW link

Un­safe AI as Dy­nam­i­cal Systems

Robert_AIZIJul 14, 2023, 3:31 PM
11 points

6 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(aizi.substack.com)

Truth­ful and hon­est AI

Oct 29, 2021, 7:28 AM
42 points

14 votes

Overall karma indicates overall quality.

1 comment13 min readLW link

Thought An­chors: Which LLM Rea­son­ing Steps Mat­ter?

Jul 2, 2025, 8:16 PM
35 points

13 votes

Overall karma indicates overall quality.

6 comments6 min readLW link
(www.thought-anchors.com)

Train­ing goals for large lan­guage models

Johannes TreutleinJul 18, 2022, 7:09 AM
28 points

11 votes

Overall karma indicates overall quality.

5 comments19 min readLW link

De­pres­sion and Creativity

Bill BenzonNov 29, 2024, 12:27 AM
−4 points

5 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

Ma­chine Un­learn­ing Eval­u­a­tions as In­ter­pretabil­ity Benchmarks

Oct 23, 2023, 4:33 PM
33 points

17 votes

Overall karma indicates overall quality.

2 comments11 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric DrexlerFeb 3, 2022, 3:20 PM
94 points

31 votes

Overall karma indicates overall quality.

12 comments11 min readLW link1 review

Every Ma­jor LLM En­dorses New­comb One-Boxing

jackmastermindJun 15, 2025, 8:44 PM
19 points

8 votes

Overall karma indicates overall quality.

13 comments1 min readLW link
(jacktlab.substack.com)

What must be the case that ChatGPT would have mem­o­rized “To be or not to be”? – Three kinds of con­cep­tual ob­jects for LLMs

Bill BenzonSep 3, 2023, 6:39 PM
19 points

10 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

Steer­ing LLM Agents: Tem­per­a­ments or Per­son­al­ities?

sdetureAug 5, 2025, 12:40 AM
1 point

2 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

The In­for­ma­tion: OpenAI shows ‘Straw­berry’ to feds, races to launch it

Martín SotoAug 27, 2024, 11:10 PM
145 points

67 votes

Overall karma indicates overall quality.

15 comments3 min readLW link

Un­faith­ful Ex­pla­na­tions in Chain-of-Thought Prompting

Miles TurpinJun 3, 2023, 12:22 AM
42 points

15 votes

Overall karma indicates overall quality.

8 comments7 min readLW link

GPT-3: a dis­ap­point­ing paper

nostalgebraistMay 29, 2020, 7:06 PM
65 points

63 votes

Overall karma indicates overall quality.

43 comments8 min readLW link1 review

Re­dun­dant At­ten­tion Heads in Large Lan­guage Models For In Con­text Learning

skunnavakkamSep 1, 2024, 8:08 PM
7 points

7 votes

Overall karma indicates overall quality.

2 comments4 min readLW link
(skunnavakkam.github.io)

Char­ac­ter­iz­ing sta­ble re­gions in the resi­d­ual stream of LLMs

Sep 26, 2024, 1:44 PM
43 points

19 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(arxiv.org)

PCAST Work­ing Group on Gen­er­a­tive AI In­vites Public Input

Christopher KingMay 13, 2023, 10:49 PM
7 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(terrytao.wordpress.com)

Research Agenda: Modelling Trajectories of Language Models

NickyPNov 13, 2023, 2:33 PM
28 points

13 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

RowanWangMar 29, 2022, 8:09 PM
75 points

40 votes

Overall karma indicates overall quality.

4 comments6 min readLW link

Ro­bust­ness of Con­trast-Con­sis­tent Search to Ad­ver­sar­ial Prompting

Nov 1, 2023, 12:46 PM
18 points

11 votes

Overall karma indicates overall quality.

1 comment7 min readLW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_cApr 7, 2022, 1:46 PM
11 points

8 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

GPT-4 busted? Clear self-in­ter­est when sum­ma­riz­ing ar­ti­cles about it­self vs when ar­ti­cle talks about Claude, LLaMA, or DALL·E 2

Christopher KingMar 31, 2023, 5:05 PM
6 points

12 votes

Overall karma indicates overall quality.

4 comments4 min readLW link

Stop post­ing prompt in­jec­tions on Twit­ter and call­ing it “mis­al­ign­ment”

lcFeb 19, 2023, 2:21 AM
147 points

75 votes

Overall karma indicates overall quality.

9 comments1 min readLW link

Con­sen­sus Val­i­da­tion for LLM Out­puts: Ap­ply­ing Blockchain-In­spired Models to AI Reliability

MurrayAitkenJun 5, 2025, 12:13 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

An In­ter­pretabil­ity Illu­sion for Ac­ti­va­tion Patch­ing of Ar­bi­trary Subspaces

Aug 29, 2023, 1:04 AM
77 points

28 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Retrieval Aug­mented Ge­n­e­sis II — Holy Texts Se­man­tics Analysis

João Ribeiro MedeirosOct 26, 2024, 5:00 PM
−1 points

2 votes

Overall karma indicates overall quality.

0 comments11 min readLW link

Phal­lo­cen­tric­ity in GPT-J’s bizarre strat­ified ontology

mwatkinsFeb 17, 2024, 12:16 AM
56 points

34 votes

Overall karma indicates overall quality.

37 comments9 min readLW link

Ele­ments of Com­pu­ta­tional Philos­o­phy, Vol. I: Truth

Jul 1, 2023, 11:44 AM
12 points

8 votes

Overall karma indicates overall quality.

6 comments1 min readLW link
(compphil.github.io)

Me­tacog­ni­tion and Self-Model­ing in LLMs

Christopher AckermanJul 10, 2025, 9:25 PM
19 points

6 votes

Overall karma indicates overall quality.

2 comments16 min readLW link

My cur­rent work­flow to study the in­ter­nal mechanisms of LLM

Yulu PiMay 16, 2023, 3:27 PM
4 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

ChatGPT’s On­tolog­i­cal Land­scape

Bill BenzonNov 1, 2023, 3:12 PM
7 points

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Effi­cient Dic­tionary Learn­ing with Switch Sparse Autoencoders

Anish MudideJul 22, 2024, 6:45 PM
118 points

67 votes

Overall karma indicates overall quality.

20 comments12 min readLW link

Lo­cat­ing and Edit­ing Knowl­edge in LMs

Dhananjay AshokJan 24, 2025, 10:53 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

New LLM Scal­ing Law

wrmedfordFeb 19, 2025, 8:21 PM
2 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(github.com)

The De­creas­ing Value of Chain of Thought in Prompting

Matrice JacobineJun 8, 2025, 3:11 PM
11 points

5 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(papers.ssrn.com)

[Question] Re­in­force­ment Learn­ing: Essen­tial Step Towards AGI or Ir­rele­vant?

DoubleOct 17, 2024, 3:37 AM
1 point

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

[Question] How do I de­sign long prompts for think­ing zero shot sys­tems with dis­tinct equally dis­tributed prompt sec­tions (mis­sion, goals, mem­o­ries, how-to-re­spond,… etc) and how to main­tain llm co­her­ence?

ollie_May 11, 2025, 7:32 PM
2 points

7 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Un­cov­er­ing De­cep­tive Ten­den­cies in Lan­guage Models: A Si­mu­lated Com­pany AI Assistant

May 6, 2024, 7:07 AM
95 points

42 votes

Overall karma indicates overall quality.

13 comments1 min readLW link
(arxiv.org)

Just be­cause an LLM said it doesn’t mean it’s true: an illus­tra­tive example

dirkAug 21, 2024, 9:05 PM
26 points

18 votes

Overall karma indicates overall quality.

12 comments3 min readLW link

Up­dat­ing and Edit­ing Fac­tual Knowl­edge in Lan­guage Models

Dhananjay AshokJan 23, 2025, 7:34 PM
2 points

2 votes

Overall karma indicates overall quality.

2 comments10 min readLW link

In­tro­duc­ing METR’s Au­ton­omy Eval­u­a­tion Resources

Mar 15, 2024, 11:16 PM
90 points

29 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(metr.github.io)

If It Talks Like It Thinks, Does It Think? De­sign­ing Tests for In­tent Without As­sum­ing It

yukin_coJul 28, 2025, 12:33 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

ChatGPT: Ex­plor­ing the Digi­tal Wilder­ness, Find­ings and Prospects

Bill BenzonFeb 2, 2025, 9:54 AM
2 points

2 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

The Geom­e­try of Feel­ings and Non­sense in Large Lan­guage Models

Sep 27, 2024, 5:49 PM
61 points

22 votes

Overall karma indicates overall quality.

10 comments4 min readLW link

Can quan­tised au­toen­coders find and in­ter­pret cir­cuits in lan­guage mod­els?

charlieoneillMar 24, 2024, 8:05 PM
30 points

16 votes

Overall karma indicates overall quality.

4 comments24 min readLW link

Distil­la­tion Ro­bus­tifies Unlearning

Jun 13, 2025, 1:45 PM
234 points

108 votes

Overall karma indicates overall quality.

43 comments8 min readLW link
(arxiv.org)

Can Large Lan­guage Models effec­tively iden­tify cy­ber­se­cu­rity risks?

emile delcourtAug 30, 2024, 8:20 PM
18 points

4 votes

Overall karma indicates overall quality.

0 comments11 min readLW link

Hu­mans vs LLM, memes as theorems

Yaroslav GranowskiMay 9, 2025, 1:26 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Aes­thetic Prefer­ences Can Cause Emer­gent Misalignment

Anders WoodruffAug 26, 2025, 6:41 PM
92 points

47 votes

Overall karma indicates overall quality.

16 comments3 min readLW link

Reflec­tion Mechanisms as an Align­ment Tar­get—At­ti­tudes on “near-term” AI

Mar 2, 2023, 4:29 AM
21 points

11 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Lan­guage Field Re­con­struc­tion The­ory: A User-Origi­nated Ob­ser­va­tion of Tier Lock and Se­man­tic Per­son­al­ity in GPT-4o

許皓翔Jun 15, 2025, 4:28 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Wor­ries about la­tent rea­son­ing in LLMs

Caleb BiddulphJan 20, 2025, 9:09 AM
45 points

20 votes

Overall karma indicates overall quality.

6 comments7 min readLW link

The Velvet Cage Hy­poth­e­sis: On the Epistemic Risks of Helpful AI

François-Xavier MorgandJun 7, 2025, 6:01 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments5 min readLW link

A brain­teaser for lan­guage models

Adam ScherlisDec 12, 2022, 2:43 AM
47 points

30 votes

Overall karma indicates overall quality.

3 comments2 min readLW link

MAKE IT BETTER (a po­etic demon­stra­tion of the ba­nal­ity of GPT-3)

rogersbaconJan 2, 2023, 8:47 PM
7 points

10 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

An in­ter­est­ing math­e­mat­i­cal model of how LLMs work

Bill BenzonApr 30, 2024, 11:01 AM
5 points

7 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

LLMs Suck at Deep Think­ing Part 3 - Try­ing to Prove It (fixed)

Taylor G. LuntSep 27, 2025, 2:54 PM
17 points

8 votes

Overall karma indicates overall quality.

6 comments15 min readLW link

Prop­er­ties of cur­rent AIs and some pre­dic­tions of the evolu­tion of AI from the per­spec­tive of scale-free the­o­ries of agency and reg­u­la­tive development

Roman LeventovDec 20, 2022, 5:13 PM
33 points

24 votes

Overall karma indicates overall quality.

3 comments36 min readLW link

From Un­ruly Stacks to Or­ga­nized Shelves: Toy Model Val­i­da­tion of Struc­tured Pri­ors in Sparse Autoencoders

YuxiaoJul 6, 2025, 7:03 AM
9 points

8 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

[Question] Is “hid­den com­plex­ity of wishes prob­lem” solved?

Roman MalovJan 5, 2025, 10:59 PM
10 points

8 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Ev­i­dence on lan­guage model consciousness

dsjNov 1, 2025, 4:01 AM
17 points

8 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(thedavidsj.substack.com)

Con­di­tion­ing Gen­er­a­tive Models with Restrictions

Adam JermynJul 21, 2022, 8:33 PM
18 points

9 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

GPT Doesn’t Just Pre­dict Words — It Models You

Tom FandangoAug 1, 2025, 1:05 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

In­terLab – a toolkit for ex­per­i­ments with multi-agent interactions

Jan 22, 2024, 6:23 PM
69 points

25 votes

Overall karma indicates overall quality.

0 comments8 min readLW link
(acsresearch.org)

LW is prob­a­bly not the place for “I asked this LLM (x) and here’s what it said!”, but where is?

lillybaeumApr 12, 2023, 10:12 AM
21 points

13 votes

Overall karma indicates overall quality.

3 comments1 min readLW link

How truth­ful is GPT-3? A bench­mark for lan­guage models

Owain_EvansSep 16, 2021, 10:09 AM
58 points

26 votes

Overall karma indicates overall quality.

24 comments6 min readLW link

The Method of Loci: With some brief re­marks, in­clud­ing trans­form­ers and eval­u­at­ing AIs

Bill BenzonDec 2, 2023, 2:36 PM
6 points

2 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

Notes on ChatGPT’s “mem­ory” for strings and for events

Bill BenzonSep 20, 2023, 6:12 PM
3 points

3 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

ChatGPT in­ti­mates a tan­ta­l­iz­ing fu­ture; its core LLM is or­ga­nized on mul­ti­ple lev­els; and it has bro­ken the idea of think­ing.

Bill BenzonJan 24, 2023, 7:05 PM
5 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

The case for al­ign­ing nar­rowly su­per­hu­man models

Ajeya CotraMar 5, 2021, 10:29 PM
186 points

78 votes

Overall karma indicates overall quality.

75 comments38 min readLW link1 review

What will the scaled up GATO look like? (Up­dated with ques­tions)

Amal Oct 25, 2022, 12:44 PM
34 points

21 votes

Overall karma indicates overall quality.

22 comments1 min readLW link

New GPT3 Impressive Capabilities—InstructGPT3 [1/2]

simeon_cMar 13, 2022, 10:58 AM
72 points

32 votes

Overall karma indicates overall quality.

10 comments7 min readLW link

One-shot steer­ing vec­tors cause emer­gent mis­al­ign­ment, too

Jacob DunefskyApr 14, 2025, 6:40 AM
98 points

43 votes

Overall karma indicates overall quality.

6 comments11 min readLW link

LLMs Look In­creas­ingly Like Gen­eral Reasoners

eggsyntaxNov 8, 2024, 11:47 PM
94 points

44 votes

Overall karma indicates overall quality.

45 comments3 min readLW link