Language Models (LLMs)

TagLast edit: 13 Mar 2025 17:45 UTC by Raemon

Language Models are computer programs made to estimate the likelihood of a piece of text. “Hello, how are you?” is likely. “Hello, fnarg horses” is unlikely.

Language models can answer questions by estimating the likelihood of possible question-and-answer pairs, selecting the most likely question-and-answer pair. “Q: How are You? A: Very well, thank you” is a likely question-and-answer pair. “Q: How are You? A: Correct horse battery staple” is an unlikely question-and-answer pair.

The language models most relevant to AI safety are language models based on “deep learning”. Deep-learning-based language models can be “trained” to understand language better, by exposing them to text written by humans. There is a lot of human-written text on the internet, providing loads of training material.

Deep-learning-based language models are getting bigger and better trained. As the models become stronger, they get new skills. These skills include arithmetic, explaining jokes, programming, and solving math problems.

There is a potential risk of these models developing dangerous capabilities as they grow larger and better trained. What additional skills will they develop given a few years?

Simulators

janus2 Sep 2022 12:45 UTC

668 points

168 comments41 min readLW link 8 reviews

(generative.ink)

Inverse Scaling Prize: Round 1 Winners

Ethan Perez and Ian McKenzie

26 Sep 2022 19:57 UTC

93 points

16 comments4 min readLW link

(irmckenzie.co.uk)

Alignment Implications of LLM Successes: a Debate in One Act

Zack_M_Davis21 Oct 2023 15:22 UTC

266 points

56 comments13 min readLW link 2 reviews

How it feels to have your mind hacked by an AI

blaked12 Jan 2023 0:33 UTC

372 points

222 comments17 min readLW link

How LLMs are and are not myopic

janus25 Jul 2023 2:19 UTC

138 points

16 comments8 min readLW link

On the future of language models

owencb20 Dec 2023 16:58 UTC

105 points

17 comments36 min readLW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley6 Jul 2024 1:23 UTC

64 points

41 comments24 min readLW link

Try training token-level probes

StefanHex14 Apr 2025 11:56 UTC

47 points

6 comments8 min readLW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley28 Nov 2023 19:56 UTC

65 points

30 comments11 min readLW link

Transformer Circuits

evhub22 Dec 2021 21:09 UTC

145 points

4 comments3 min readLW link

(transformer-circuits.pub)

A Chinese Room Containing a Stack of Stochastic Parrots

RogerDearnaley12 Jan 2024 6:29 UTC

20 points

3 comments5 min readLW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC

37 points

4 comments2 min readLW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley11 Jan 2024 12:56 UTC

35 points

4 comments39 min readLW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley1 Feb 2024 21:15 UTC

16 points

15 comments13 min readLW link

Mlyyrczo

lsusr26 Dec 2022 7:58 UTC

44 points

14 comments3 min readLW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley9 Jan 2024 20:42 UTC

48 points

8 comments36 min readLW link

Programming Refusal with Conditional Activation Steering

Bruce W. Lee11 Sep 2024 20:57 UTC

41 points

0 comments11 min readLW link

(brucewlee.com)

The Waluigi Effect (mega-post)

Cleo Nardo3 Mar 2023 3:22 UTC

638 points

188 comments16 min readLW link

So You Think You’ve Awoken ChatGPT

JustisMills11 Jul 2025 1:01 UTC

311 points

87 comments9 min readLW link

AI Safety Chatbot

markov and Robert Miles

21 Dec 2023 14:06 UTC

61 points

11 comments4 min readLW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley7 Dec 2023 6:14 UTC

9 points

0 comments11 min readLW link

SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow and mwatkins

5 Feb 2023 22:02 UTC

687 points

208 comments12 min readLW link 1 review

LLM AGI will have memory, and memory changes alignment

Seth Herd4 Apr 2025 14:59 UTC

73 points

15 comments9 min readLW link

Representation Tuning

Christopher Ackerman27 Jun 2024 17:44 UTC

35 points

9 comments13 min readLW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

wesg and Neel Nanda

3 May 2023 13:30 UTC

33 points

6 comments2 min readLW link 1 review

(arxiv.org)

LLMs Universally Learn a Feature Representing Token Frequency / Rarity

Sean Osier30 Jun 2024 2:48 UTC

13 points

5 comments6 min readLW link

(github.com)

LLMs may capture key components of human agency

catubc17 Nov 2022 20:14 UTC

27 points

0 comments4 min readLW link

Results from the language model hackathon

Esben Kran10 Oct 2022 8:29 UTC

22 points

1 comment4 min readLW link

Applying refusal-vector ablation to a Llama 3 70B agent

Simon Lermen11 May 2024 0:08 UTC

51 points

14 comments7 min readLW link

Truthful LMs as a warm-up for aligned AGI

Jacob_Hilton17 Jan 2022 16:49 UTC

65 points

14 comments13 min readLW link

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery9 Aug 2023 7:06 UTC

69 points

20 comments12 min readLW link

Testing PaLM prompts on GPT3

Yitz6 Apr 2022 5:21 UTC

103 points

14 comments8 min readLW link

Large Language Models will be Great for Censorship

Ethan Edwards21 Aug 2023 19:03 UTC

185 points

14 comments8 min readLW link

(ethanedwards.substack.com)

Invocations: The Other Capabilities Overhang?

Robert_AIZI4 Apr 2023 13:38 UTC

29 points

4 comments4 min readLW link

(aizi.substack.com)

Inverse Scaling Prize: Second Round Winners

Ian McKenzie, Sam Bowman and Ethan Perez

24 Jan 2023 20:12 UTC

58 points

17 comments15 min readLW link

LLM Modularity: The Separability of Capabilities in Large Language Models

NickyP26 Mar 2023 21:57 UTC

99 points

3 comments41 min readLW link

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout2 Mar 2025 19:51 UTC

154 points

29 comments1 min readLW link

(turntrout.com)

Slowdown After 2028: Compute, RLVR Uncertainty, MoE Data Wall

Vladimir_Nesov1 May 2025 13:54 UTC

195 points

25 comments5 min readLW link

LLM Basics: Embedding Spaces—Transformer Token Vectors Are Not Points in Space

NickyP13 Feb 2023 18:52 UTC

84 points

11 comments15 min readLW link

Extrapolating from Five Words

Gordon Seidoh Worley15 Nov 2023 23:21 UTC

40 points

11 comments2 min readLW link

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs

Michaël Trazzi24 Aug 2024 4:30 UTC

55 points

0 comments5 min readLW link

what makes Claude 3 Opus misaligned

janus10 Jul 2025 20:06 UTC

104 points

11 comments5 min readLW link

Proposal for Inducing Steganography in LMs

Logan Riggs12 Jan 2023 22:15 UTC

22 points

3 comments2 min readLW link

Take 11: “Aligning language models” should be weirder.

Charlie Steiner18 Dec 2022 14:14 UTC

34 points

0 comments2 min readLW link

Notes on the Mathematics of LLM Architectures

carboniferous_umbraculum 9 Feb 2023 1:45 UTC

12 points

2 comments1 min readLW link

(drive.google.com)

Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)

LawrenceC11 Jun 2025 19:27 UTC

294 points

19 comments16 min readLW link

LLMs Can’t See Pixels or Characters

Brendan Long20 Jul 2025 20:00 UTC

100 points

44 comments4 min readLW link

(www.brendanlong.com)

An explanation for every token: using an LLM to sample another LLM

Max H11 Oct 2023 0:53 UTC

35 points

5 comments11 min readLW link

Conditioning Generative Models

Adam Jermyn25 Jun 2022 22:15 UTC

24 points

18 comments10 min readLW link

What o3 Becomes by 2028

Vladimir_Nesov22 Dec 2024 12:37 UTC

149 points

15 comments5 min readLW link

Claude 3.5 Sonnet

Zach Stein-Perlman20 Jun 2024 18:00 UTC

75 points

41 comments1 min readLW link

(www.anthropic.com)

‘simulator’ framing and confusions about LLMs

Beth Barnes31 Dec 2022 23:38 UTC

104 points

11 comments4 min readLW link

Exploring SAE features in LLMs with definition trees and token lists

mwatkins4 Oct 2024 22:15 UTC

46 points

5 comments6 min readLW link

Unexpected Conscious Entities

Gunnar_Zarncke5 May 2025 22:14 UTC

34 points

7 comments6 min readLW link

MetaAI: less is less for alignment.

Cleo Nardo13 Jun 2023 14:08 UTC

71 points

17 comments5 min readLW link

Language models can generate superior text compared to their input

ChristianKl17 Jan 2023 10:57 UTC

48 points

28 comments1 min readLW link

Aggregative Principles of Social Justice

Cleo Nardo5 Jun 2024 13:44 UTC

29 points

10 comments37 min readLW link

′ petertodd’’s last stand: The final days of open GPT-3 research

mwatkins22 Jan 2024 18:47 UTC

109 points

16 comments45 min readLW link

How to train your transformer

p.b.7 Apr 2022 9:34 UTC

6 points

0 comments8 min readLW link

Seeing Through the Eyes of the Algorithm

silentbob22 Feb 2025 11:54 UTC

18 points

3 comments10 min readLW link

A Proposed Test to Determine the Extent to Which Large Language Models Understand the Real World

Bruce G24 Feb 2023 20:20 UTC

4 points

7 comments8 min readLW link

Nokens: A potential method of investigating glitch tokens

Hoagy15 Mar 2023 16:23 UTC

21 points

0 comments4 min readLW link

Proof-of-Concept Debugger for a Small LLM

Peter Lai and StefanHex

17 Mar 2025 22:27 UTC

27 points

0 comments11 min readLW link

[Question] Will 2023 be the last year you can write short stories and receive most of the intellectual credit for writing them?

lc16 Mar 2023 21:36 UTC

20 points

12 comments1 min readLW link

Bing Chat is blatantly, aggressively misaligned

evhub15 Feb 2023 5:29 UTC

406 points

181 comments2 min readLW link 1 review

You can’t eval GPT5 anymore

Lukas Petersson18 Sep 2025 22:12 UTC

157 points

12 comments1 min readLW link

Enhancing biosecurity with language models: defining research directions

mic26 Mar 2024 12:30 UTC

12 points

0 comments13 min readLW link

(papers.ssrn.com)

Shutdown Resistance in Reasoning Models

benwr, JeremySchlatter and Jeffrey Ladish

6 Jul 2025 0:01 UTC

138 points

14 comments9 min readLW link

(palisaderesearch.org)

Teaser: Hard-coding Transformer Models

MadHatter12 Dec 2021 22:04 UTC

74 points

19 comments1 min readLW link

Watermarking considered overrated?

DanielFilan31 Jul 2023 21:36 UTC

19 points

4 comments1 min readLW link

Residual stream norms grow exponentially over the forward pass

StefanHex and TurnTrout

7 May 2023 0:46 UTC

77 points

24 comments9 min readLW link

Forecasting progress in language models

Matthew Barnett and Metaculus

28 Oct 2021 20:40 UTC

62 points

6 comments12 min readLW link

(www.metaculus.com)

Corrigibility, Self-Deletion, and Identical Strawberries

Robert_AIZI28 Mar 2023 16:54 UTC

9 points

2 comments6 min readLW link

(aizi.substack.com)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

evhub, Nicholas Schiefer, Carson Denison and Ethan Perez

8 Aug 2023 1:30 UTC

322 points

30 comments18 min readLW link 1 review

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley and Owain_Evans

25 Feb 2025 17:39 UTC

332 points

92 comments4 min readLW link

Language Models are a Potentially Safe Path to Human-Level AGI

Nadav Brandes20 Apr 2023 0:40 UTC

28 points

7 comments8 min readLW link 1 review

On Claude 3.5 Sonnet

Zvi24 Jun 2024 12:00 UTC

95 points

14 comments13 min readLW link

(thezvi.wordpress.com)

Scaffolded LLMs as natural language computers

beren12 Apr 2023 10:47 UTC

97 points

10 comments11 min readLW link

AMA Conjecture, A New Alignment Startup

adamShimi9 Apr 2022 9:43 UTC

47 points

42 comments1 min readLW link

[Question] Basic Question about LLMs: how do they know what task to perform

Garak14 Jan 2023 13:13 UTC

1 point

3 comments1 min readLW link

New, improved multiple-choice TruthfulQA

Owain_Evans, James Chua and Steph Lin

15 Jan 2025 23:32 UTC

72 points

1 comment3 min readLW link

[ASoT] Some thoughts about LM monologue limitations and ELK

leogao30 Mar 2022 14:26 UTC

10 points

0 comments2 min readLW link

“textbooks are all you need”

bhauth21 Jun 2023 17:06 UTC

66 points

18 comments2 min readLW link

(arxiv.org)

Knowledge, Reasoning, and Superintelligence

owencb26 Mar 2025 23:28 UTC

21 points

1 comment7 min readLW link

(strangecities.substack.com)

Understanding LLMs: Insights from Mechanistic Interpretability

Stephen McAleese30 Aug 2025 16:50 UTC

43 points

2 comments30 min readLW link

Implementing activation steering

Annah5 Feb 2024 17:51 UTC

76 points

8 comments7 min readLW link

Can I take ducks home from the park?

dynomight14 Sep 2023 21:03 UTC

67 points

8 comments3 min readLW link

(dynomight.net)

LLM Applications I Want To See

sarahconstantin19 Aug 2024 21:10 UTC

102 points

6 comments8 min readLW link

(sarahconstantin.substack.com)

Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Thomas Kwa5 May 2025 18:56 UTC

69 points

21 comments2 min readLW link

(arxiv.org)

Numberwang: LLMs Doing Autonomous Research, and a Call for Input

eggsyntax and ncase

16 Jan 2025 17:20 UTC

71 points

30 comments31 min readLW link

Activation adding experiments with llama-7b

Nina Panickssery16 Jul 2023 4:17 UTC

51 points

1 comment3 min readLW link

LLMs one-box when in a “hostile telepath” version of Newcomb’s Paradox, except for the one that beat the predictor

Kaj_Sotala6 Oct 2025 8:44 UTC

51 points

6 comments17 min readLW link

And All the Shoggoths Merely Players

Zack_M_Davis10 Feb 2024 19:56 UTC

177 points

57 comments12 min readLW link

Paper: LLMs trained on “A is B” fail to learn “B is A”

lberglund, Owain_Evans, Meg, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland and Tomek Korbak

23 Sep 2023 19:55 UTC

121 points

74 comments4 min readLW link

(arxiv.org)

In Defense of Chatbot Romance

Kaj_Sotala11 Feb 2023 14:30 UTC

124 points

53 comments11 min readLW link

(kajsotala.fi)

[Question] Supposing the 1bit LLM paper pans out

O O29 Feb 2024 5:31 UTC

27 points

11 comments1 min readLW link

[Question] Does a LLM have a utility function?

Dagon9 Dec 2022 17:19 UTC

17 points

11 comments1 min readLW link

[Question] Is there a ‘time series forecasting’ equivalent of AIXI?

Solenoid_Entity17 May 2023 4:35 UTC

12 points

2 comments1 min readLW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy and Megan Kinniment

5 Dec 2022 20:28 UTC

40 points

19 comments10 min readLW link

Open Source LLM Pokémon Scaffold

Julian Bradshaw27 Apr 2025 0:57 UTC

24 points

0 comments1 min readLW link

(github.com)

What’s up with LLMs representing XORs of arbitrary features?

Sam Marks3 Jan 2024 19:44 UTC

159 points

64 comments16 min readLW link

Conditioning Predictive Models: Interactions with other approaches

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

8 Feb 2023 18:19 UTC

32 points

2 comments11 min readLW link

[Question] If I ask an LLM to think step by step, how big are the steps?

ryan_b13 Sep 2024 20:30 UTC

7 points

1 comment1 min readLW link

Understanding the diffusion of large language models: summary

Ben Cottier16 Jan 2023 1:37 UTC

26 points

1 comment22 min readLW link

What do language models know about fictional characters?

skybrian22 Feb 2023 5:58 UTC

6 points

0 comments4 min readLW link

New OpenAI Paper—Language models can explain neurons in language models

MrThink10 May 2023 7:46 UTC

47 points

14 comments1 min readLW link

What would a human pretending to be an AI say?

Brendan Long8 Aug 2025 18:56 UTC

54 points

18 comments1 min readLW link

(www.brendanlong.com)

Deepmind’s Gopher—more powerful than GPT-3

hath8 Dec 2021 17:06 UTC

86 points

26 comments1 min readLW link

(deepmind.com)

Steering GPT-2-XL by adding an activation vector

TurnTrout, Monte M, David Udell, lisathiergart and Ulisse Mini

13 May 2023 18:42 UTC

439 points

98 comments50 min readLW link 1 review

“LLMs Don’t Have a Coherent Model of the World”—What it Means, Why it Matters

Davidmanheim1 Jun 2023 7:46 UTC

32 points

2 comments7 min readLW link

On the functional self of LLMs

eggsyntax7 Jul 2025 15:39 UTC

110 points

37 comments8 min readLW link

Conditioning Predictive Models: Large language models as predictors

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

2 Feb 2023 20:28 UTC

89 points

4 comments13 min readLW link

SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4

AdamYedidia15 Apr 2023 22:35 UTC

71 points

18 comments6 min readLW link

Evil autocomplete: Existential Risk and Next-Token Predictors

Yitz28 Feb 2023 8:47 UTC

9 points

3 comments5 min readLW link

Lamda is not an LLM

Kevin19 Jun 2022 11:13 UTC

7 points

10 comments1 min readLW link

(www.wired.com)

Romance, misunderstanding, social stances, and the human LLM

Kaj_Sotala27 Apr 2023 12:59 UTC

75 points

32 comments16 min readLW link

Upcoming Changes in Large Language Models

Andrew Keenan Richardson8 Apr 2023 3:41 UTC

43 points

8 comments4 min readLW link

(mechanisticmind.com)

Realistic Reward Hacking Induces Different and Deeper Misalignment

Jozdien9 Oct 2025 18:45 UTC

112 points

2 comments23 min readLW link

More Fun With GPT-4o Image Generation

Zvi3 Apr 2025 2:10 UTC

34 points

3 comments8 min readLW link

(thezvi.wordpress.com)

Shorter Tokens Are More Likely

Brendan Long24 Aug 2025 0:22 UTC

98 points

19 comments5 min readLW link

(www.brendanlong.com)

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov18 Jan 2024 10:05 UTC

5 points

2 comments4 min readLW link

Testing for parallel reasoning in LLMs

meemi and Olli Järviniemi

19 May 2024 15:28 UTC

9 points

7 comments9 min readLW link

Research Discussion on PSCA with Claude Sonnet 3.5

Robert Kralisch24 Jul 2024 16:53 UTC

−2 points

0 comments25 min readLW link

GPT-4 can catch subtle cross-language translation mistakes

Michael Tontchev27 Jul 2023 1:39 UTC

7 points

1 comment1 min readLW link

Why I Believe LLMs Do Not Have Human-like Emotions

OneManyNone22 May 2023 15:46 UTC

13 points

6 comments7 min readLW link

How Does A Blind Model See The Earth?

henry11 Aug 2025 19:58 UTC

475 points

38 comments7 min readLW link

(outsidetext.substack.com)

Jailbreak steering generalization

Sarah Ball and Nina Panickssery

20 Jun 2024 17:25 UTC

41 points

4 comments2 min readLW link

(arxiv.org)

Anthropic release Claude 3, claims >GPT-4 Performance

LawrenceC4 Mar 2024 18:23 UTC

115 points

41 comments2 min readLW link

(www.anthropic.com)

ChatGPT is the Daguerreotype of AI

Alex_Altair7 Aug 2025 22:14 UTC

42 points

2 comments7 min readLW link

Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs”

Miles Turpin3 Oct 2023 2:22 UTC

31 points

0 comments9 min readLW link

Language Model Tools for Alignment Research

Logan Riggs8 Apr 2022 17:32 UTC

28 points

0 comments2 min readLW link

RLHF does not appear to differentially cause mode-collapse

Arthur Conmy and beren

20 Mar 2023 15:39 UTC

95 points

9 comments3 min readLW link

Towards Evaluating AI Systems for Moral Status Using Self-Reports

Ethan Perez and Robbo

16 Nov 2023 20:18 UTC

45 points

3 comments1 min readLW link

(arxiv.org)

Inferring the model dimension of API-protected LLMs

Ege Erdil18 Mar 2024 6:19 UTC

34 points

3 comments4 min readLW link

(arxiv.org)

Claude Doesn’t Want to Die

garrison5 Mar 2024 6:00 UTC

22 points

3 comments10 min readLW link

(garrisonlovely.substack.com)

LLM Guardrails Should Have Better Customer Service Tuning

Jiao Bu13 May 2023 22:54 UTC

2 points

0 comments2 min readLW link

Densing Law of LLMs

Bogdan Ionut Cirstea8 Dec 2024 19:35 UTC

9 points

2 comments1 min readLW link

(arxiv.org)

What does it mean for an LLM such as GPT to be aligned / good / positive impact?

PashaKamyshev20 Mar 2023 9:21 UTC

4 points

3 comments10 min readLW link

How people use LLMs

Elizabeth27 Apr 2025 21:48 UTC

83 points

6 comments1 min readLW link

(www.gleech.org)

Do LLMs dream of emergent sheep?

Shmi24 Apr 2023 3:26 UTC

16 points

2 comments1 min readLW link

Discovering Language Model Behaviors with Model-Written Evaluations

evhub and Ethan Perez

20 Dec 2022 20:08 UTC

100 points

34 comments1 min readLW link

(www.anthropic.com)

Finding Sparse Linear Connections between Features in LLMs

Logan Riggs, Sam Mitchell and Adam Kaufman

9 Dec 2023 2:27 UTC

70 points

5 comments10 min readLW link

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

TurnTrout16 Jan 2025 2:14 UTC

65 points

3 comments1 min readLW link

(turntrout.com)

Claude 3 Opus can operate as a Turing machine

Gunnar_Zarncke17 Apr 2024 8:41 UTC

37 points

2 comments1 min readLW link

(twitter.com)

What’s up with all the non-Mormons? Weirdly specific universalities across LLMs

mwatkins19 Apr 2024 13:43 UTC

40 points

13 comments27 min readLW link

Sparse trinary weighted RNNs as a path to better language model interpretability

Am8ryllis17 Sep 2022 19:48 UTC

19 points

13 comments3 min readLW link

Causal Graphs of GPT-2-Small’s Residual Stream

David Udell9 Jul 2024 22:06 UTC

53 points

7 comments7 min readLW link

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp20 Oct 2023 7:32 UTC

119 points

15 comments22 min readLW link

Teaching Claude to Meditate

Gordon Seidoh Worley29 Dec 2024 22:27 UTC

−5 points

4 comments23 min readLW link

The “Reversal Curse”: you still aren’t antropomorphising enough.

lumpenspace13 Mar 2025 10:24 UTC

3 points

0 comments1 min readLW link

(lumpenspace.substack.com)

Paper: On measuring situational awareness in LLMs

Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg and Maximilian Kaufmann

4 Sep 2023 12:54 UTC

109 points

17 comments5 min readLW link

(arxiv.org)

LLM-Secured Systems: A General-Purpose Tool For Structured Transparency

ozziegooen18 Jun 2024 0:21 UTC

10 points

1 comment21 min readLW link

Mapping the semantic void II: Above, below and between token embeddings

mwatkins15 Feb 2024 23:00 UTC

31 points

4 comments10 min readLW link

“AI achieves silver-medal standard solving International Mathematical Olympiad problems”

gjm25 Jul 2024 15:58 UTC

133 points

38 comments2 min readLW link

(deepmind.google)

Gary Marcus now saying AI can’t do things it can already do

Benjamin_Todd9 Feb 2025 12:24 UTC

62 points

12 comments1 min readLW link

(benjamintodd.substack.com)

LLMs Are Trained to Assume Their Output Is Perfect

Brendan Long26 Aug 2025 0:24 UTC

10 points

0 comments5 min readLW link

Exploring the petertodd / Leilan duality in GPT-2 and GPT-J

mwatkins23 Dec 2024 13:17 UTC

12 points

1 comment17 min readLW link

Why did ChatGPT say that? Prompt engineering and more, with PIZZA.

Jessica Rumbelow3 Aug 2024 12:07 UTC

43 points

2 comments4 min readLW link

Deep learning curriculum for large language model alignment

Jacob_Hilton13 Jul 2022 21:58 UTC

57 points

3 comments1 min readLW link

(github.com)

[Linkpost] Play with SAEs on Llama 3

Tom McGrath, Eric Ho and Dan Balsam

25 Sep 2024 22:35 UTC

40 points

2 comments1 min readLW link

Musings on LLM Scale (Jul 2024)

Vladimir_Nesov3 Jul 2024 18:35 UTC

34 points

0 comments3 min readLW link

AI companies’ eval reports mostly don’t support their claims

Zach Stein-Perlman9 Jun 2025 13:00 UTC

207 points

13 comments4 min readLW link

Language Model Alignment Research Internships

Ethan Perez13 Dec 2021 19:53 UTC

74 points

1 comment1 min readLW link

the void

nostalgebraist11 Jun 2025 3:19 UTC

384 points

107 comments1 min readLW link

(nostalgebraist.tumblr.com)

Inflection AI: New startup related to language models

Nisan2 Apr 2022 5:35 UTC

21 points

1 comment1 min readLW link

You can get LLMs to say almost anything you want

Kaj_Sotala13 Jul 2025 16:30 UTC

80 points

10 comments14 min readLW link

SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research

Roman Leventov19 Dec 2023 16:49 UTC

17 points

5 comments3 min readLW link

Using Claude to convert dialog transcripts into great posts?

mako yass21 Jun 2023 20:19 UTC

6 points

4 comments4 min readLW link

Does Chat-GPT display ‘Scope Insensitivity’?

callum7 Dec 2023 18:58 UTC

12 points

1 comment3 min readLW link

Eleuther releases Llemma: An Open Language Model For Mathematics

mako yass17 Oct 2023 20:03 UTC

22 points

0 comments1 min readLW link

(blog.eleuther.ai)

[Linkpost] Vague Verbiage in Forecasting

trevor22 Mar 2024 18:05 UTC

11 points

9 comments3 min readLW link

(goodjudgment.com)

[Question] Is InstructGPT Following Instructions in Other Languages Surprising?

DragonGod13 Feb 2023 23:26 UTC

39 points

15 comments1 min readLW link

Minerva

Algon1 Jul 2022 20:06 UTC

36 points

6 comments2 min readLW link

(ai.googleblog.com)

When is it important that open-weight models aren’t released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.

ryan_greenblatt9 Jun 2025 19:19 UTC

63 points

11 comments9 min readLW link

Navigating LLM embedding spaces using archetype-based directions

mwatkins8 May 2024 5:54 UTC

16 points

4 comments28 min readLW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans and Marius Hobbhahn

8 Jul 2024 22:24 UTC

109 points

37 comments5 min readLW link

Attention Output SAEs Improve Circuit Analysis

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

21 Jun 2024 12:56 UTC

33 points

3 comments19 min readLW link

Discovering Latent Knowledge in Language Models Without Supervision

Xodarap14 Dec 2022 12:32 UTC

45 points

1 comment1 min readLW link

(arxiv.org)

Creative writing with LLMs, part 1: Prompting for fiction

Kaj_Sotala21 Jul 2025 8:47 UTC

38 points

10 comments20 min readLW link

Examples of How I Use LLMs

jefftk14 Oct 2024 17:10 UTC

31 points

2 comments2 min readLW link

(www.jefftk.com)

Conditioning Predictive Models: Outer alignment via careful conditioning

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

2 Feb 2023 20:28 UTC

72 points

15 comments57 min readLW link

[Question] Are We Leaving Literature To The Psychotic?

Yitz9 Oct 2025 6:09 UTC

14 points

4 comments1 min readLW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI5 Mar 2024 13:55 UTC

61 points

24 comments10 min readLW link

(aizi.substack.com)

PaLM-2 & GPT-4 in “Extrapolating GPT-N performance”

Lukas Finnveden30 May 2023 18:33 UTC

57 points

6 comments6 min readLW link

[Linkpost] The lethal trifecta for AI agents: private data, untrusted content, and external communication

Gunnar_Zarncke17 Jun 2025 16:09 UTC

13 points

3 comments1 min readLW link

(simonwillison.net)

Strategy For Conditioning Generative Models

james.lucassen and evhub

1 Sep 2022 4:34 UTC

31 points

4 comments18 min readLW link

A little playing around with Blenderbot3

Nathan Helm-Burger12 Aug 2022 16:06 UTC

9 points

0 comments1 min readLW link

Studying The Alien Mind

Quentin FEUILLADE--MONTIXI and NicholasKees

5 Dec 2023 17:27 UTC

80 points

10 comments15 min readLW link

Understanding the tensor product formulation in Transformer Circuits

Tom Lieberum24 Dec 2021 18:05 UTC

16 points

2 comments3 min readLW link

Linear encoding of character-level information in GPT-J token embeddings

mwatkins and Joseph Bloom

10 Nov 2023 22:19 UTC

34 points

4 comments28 min readLW link

AI Sleeper Agents: How Anthropic Trains and Catches Them—Video

Writer30 Aug 2025 17:53 UTC

9 points

0 comments7 min readLW link

(youtu.be)

Google’s PaLM-E: An Embodied Multimodal Language Model

SandXbox7 Mar 2023 4:11 UTC

87 points

7 comments1 min readLW link

(palm-e.github.io)

A one-question Turing test for GPT-3

Paul Crowley and rosiecam

22 Jan 2022 18:17 UTC

88 points

25 comments5 min readLW link

[Linkpost] Scaling Laws for Generative Mixed-Modal Language Models

Amal 12 Jan 2023 14:24 UTC

15 points

2 comments1 min readLW link

(arxiv.org)

Tell me about yourself: LLMs are aware of their learned behaviors

Martín Soto and Owain_Evans

22 Jan 2025 0:47 UTC

132 points

5 comments6 min readLW link

Role Architectures: Applying LLMs to consequential tasks

Eric Drexler30 Mar 2023 15:00 UTC

60 points

7 comments9 min readLW link

NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG

Ozyrus11 Oct 2021 15:28 UTC

51 points

36 comments1 min readLW link

(developer.nvidia.com)

LIMA: Less Is More for Alignment

Ulisse Mini30 May 2023 17:10 UTC

16 points

6 comments1 min readLW link

(arxiv.org)

Conditioning Predictive Models: Deployment strategy

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

9 Feb 2023 20:59 UTC

28 points

0 comments10 min readLW link

Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, bilalchughtai, StefanHex and Marius Hobbhahn

6 Feb 2025 15:46 UTC

104 points

9 comments2 min readLW link

(arxiv.org)

Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI

Kaj_Sotala15 Apr 2025 15:56 UTC

174 points

52 comments18 min readLW link

I don’t find the lie detection results that surprising (by an author of the paper)

JanB4 Oct 2023 17:10 UTC

97 points

8 comments3 min readLW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Logan Riggs, Hoagy, Aidan Ewart and Robert_AIZI

21 Sep 2023 15:30 UTC

159 points

8 comments5 min readLW link

“On the Impossibility of Superintelligent Rubik’s Cube Solvers”, Claude 2024 [humor]

gwern23 Jun 2024 21:18 UTC

22 points

6 comments1 min readLW link

(gwern.net)

Emergent Abilities of Large Language Models [Linkpost]

aog10 Aug 2022 18:02 UTC

25 points

2 comments1 min readLW link

(arxiv.org)

LLMs as a Planning Overhang

Larks14 Jul 2024 2:54 UTC

38 points

8 comments2 min readLW link

Case for Foundation Models beyond English

Varshul Gupta21 Jul 2023 13:59 UTC

1 point

0 comments3 min readLW link

(dubverseblack.substack.com)

Bootstrapping Language Models

harsimony27 May 2022 19:43 UTC

7 points

5 comments2 min readLW link

[Question] Are language models close to the superhuman level in philosophy?

Roman Leventov19 Aug 2022 4:43 UTC

6 points

2 comments2 min readLW link

Conditioning Generative Models for Alignment

Jozdien18 Jul 2022 7:11 UTC

60 points

8 comments20 min readLW link

Is Gemini now better than Claude at Pokémon?

Julian Bradshaw19 Apr 2025 23:34 UTC

91 points

12 comments5 min readLW link

[Link] Training Compute-Optimal Large Language Models

nostalgebraist31 Mar 2022 18:01 UTC

51 points

23 comments1 min readLW link

(arxiv.org)

Goal-Direction for Simulated Agents

Raymond Douglas12 Jul 2023 17:06 UTC

33 points

2 comments6 min readLW link

LEAst-squares Concept Erasure (LEACE)

tricky_labyrinth7 Jun 2023 21:51 UTC

68 points

10 comments1 min readLW link

(twitter.com)

I didn’t think I’d take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!

mako yass2 Aug 2024 22:35 UTC

24 points

2 comments5 min readLW link

[Question] Which parts of the existing internet are already likely to be in (GPT-5/other soon-to-be-trained LLMs)’s training corpus?

AnnaSalamon29 Mar 2023 5:17 UTC

49 points

2 comments1 min readLW link

Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)?

gwern3 Jul 2023 0:48 UTC

428 points

54 comments7 min readLW link

(www.youtube.com)

Three of my beliefs about upcoming AGI

Robert_AIZI27 Mar 2023 20:27 UTC

6 points

0 comments3 min readLW link

(aizi.substack.com)

Creative writing with LLMs, part 2: Co-writing techniques

Kaj_Sotala3 Aug 2025 6:44 UTC

1 point

0 comments18 min readLW link

Dwarf Fortress and Claude’s ASCII Art Blindness

Brendan Long11 Aug 2025 16:05 UTC

16 points

1 comment3 min readLW link

(www.brendanlong.com)

Did ChatGPT just gaslight me?

TW1231 Dec 2022 5:41 UTC

124 points

45 comments9 min readLW link

(aiwatchtower.substack.com)

[Linkpost] Solving Quantitative Reasoning Problems with Language Models

Yitz30 Jun 2022 18:58 UTC

76 points

15 comments2 min readLW link

(storage.googleapis.com)

Claude 3 claims it’s conscious, doesn’t want to die or be modified

Mikhail Samin4 Mar 2024 23:05 UTC

76 points

118 comments14 min readLW link

Cognitive Biases in Large Language Models

Jan25 Sep 2021 20:59 UTC

18 points

3 comments12 min readLW link

(universalprior.substack.com)

LLMs as a limiter of social intercourse

Adam Zerner7 Oct 2025 6:38 UTC

17 points

4 comments2 min readLW link

Procedurally evaluating factual accuracy: a request for research

Jacob_Hilton30 Mar 2022 16:37 UTC

25 points

2 comments6 min readLW link

Anthropic Lets Claude Opus 4 & 4.1 End Conversations

Stephen Martin16 Aug 2025 5:01 UTC

53 points

3 comments1 min readLW link

(www.anthropic.com)

An examination of GPT-2′s boring yet effective glitch

MiguelDev18 Apr 2024 5:26 UTC

5 points

3 comments3 min readLW link

Paper: Teaching GPT3 to express uncertainty in words

Owain_Evans31 May 2022 13:27 UTC

97 points

7 comments4 min readLW link

Will Any Crap Cause Emergent Misalignment?

J Bostock27 Aug 2025 18:20 UTC

192 points

37 comments3 min readLW link

AI doom from an LLM-plateau-ist perspective

Steven Byrnes27 Apr 2023 13:58 UTC

161 points

24 comments6 min readLW link

Conditioning Predictive Models: The case for competitiveness

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

6 Feb 2023 20:08 UTC

20 points

3 comments11 min readLW link

NLP Position Paper: When Combatting Hype, Proceed with Caution

Sam Bowman15 Oct 2021 20:57 UTC

46 points

14 comments1 min readLW link

How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots

Owain_Evans28 Mar 2024 2:34 UTC

27 points

0 comments9 min readLW link

New Scaling Laws for Large Language Models

1a3orn1 Apr 2022 20:41 UTC

246 points

22 comments5 min readLW link

[Linkpost] New multi-modal Deepmind model fusing Chinchilla with images and videos

p.b.30 Apr 2022 3:47 UTC

53 points

18 comments1 min readLW link

chinchilla’s wild implications

nostalgebraist31 Jul 2022 1:18 UTC

424 points

128 comments10 min readLW link 1 review

Paper: Tell, Don’t Show- Declarative facts influence how LLMs generalize

Owain_Evans and AlexMeinke

19 Dec 2023 19:14 UTC

45 points

4 comments6 min readLW link

(arxiv.org)

The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs

Quentin FEUILLADE--MONTIXI and Pierre Peigné

7 Nov 2023 16:12 UTC

52 points

21 comments6 min readLW link

Meta “open sources” LMs competitive with Chinchilla, PaLM, and code-davinci-002 (Paper)

LawrenceC24 Feb 2023 19:57 UTC

38 points

19 comments1 min readLW link

(research.facebook.com)

Thoughts on refusing harmful requests to large language models

William_S19 Jan 2023 19:49 UTC

32 points

4 comments2 min readLW link

Pre-registering a study

Robert_AIZI7 Apr 2023 15:46 UTC

10 points

0 comments6 min readLW link

(aizi.substack.com)

[Question] Why no major LLMs with memory?

Kaj_Sotala28 Mar 2023 16:34 UTC

42 points

15 comments1 min readLW link

Assessing AlephAlphas Multimodal Model

p.b.28 Jun 2022 9:28 UTC

30 points

5 comments3 min readLW link

AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them

Roman Leventov27 Dec 2023 14:51 UTC

33 points

9 comments4 min readLW link

GPT can write Quines now (GPT-4)

Andrew_Critch14 Mar 2023 19:18 UTC

112 points

30 comments1 min readLW link

Covert Malicious Finetuning

Tony Wang and dannyhalawi

2 Jul 2024 2:41 UTC

94 points

4 comments3 min readLW link

Relational Speaking

jefftk21 Jun 2023 14:40 UTC

11 points

0 comments2 min readLW link

(www.jefftk.com)

Towards Understanding Sycophancy in Language Models

Ethan Perez, mrinank_sharma, Meg and Tomek Korbak

24 Oct 2023 0:30 UTC

66 points

0 comments2 min readLW link

(arxiv.org)

Super-Luigi = Luigi + (Luigi—Waluigi)

Alexei17 Mar 2023 15:27 UTC

16 points

9 comments1 min readLW link

Sleeping Machines: Why Our AI Agents Still Behave Like Talented Children

Michal Barodkin14 Aug 2025 2:31 UTC

23 points

4 comments8 min readLW link

Why keep a diary, and why wish for large language models

DanielFilan14 Jun 2024 16:10 UTC

9 points

1 comment2 min readLW link

(danielfilan.com)

[Question] Does the Universal Geometry of Embeddings paper have big implications for interpretability?

Evan R. Murphy26 May 2025 18:20 UTC

43 points

6 comments1 min readLW link

[Question] Impact of ” ‘Let’s think step by step’ is all you need”?

yrimon24 Jul 2022 20:59 UTC

20 points

2 comments1 min readLW link

The ‘ petertodd’ phenomenon

mwatkins15 Apr 2023 0:59 UTC

192 points

50 comments38 min readLW link 1 review

Show, not tell: GPT-4o is more opinionated in images than in text

Daniel Tan and eggsyntax

2 Apr 2025 8:51 UTC

112 points

41 comments3 min readLW link

Mapping the semantic void: Strange goings-on in GPT embedding spaces

mwatkins14 Dec 2023 13:10 UTC

115 points

31 comments14 min readLW link

METR’s Observations of Reward Hacking in Recent Frontier Models

Daniel Kokotajlo9 Jun 2025 18:03 UTC

99 points

9 comments11 min readLW link

(metr.org)

A Test for Language Model Consciousness

Ethan Perez25 Aug 2022 19:41 UTC

18 points

14 comments9 min readLW link

Role embeddings: making authorship more salient to LLMs

Nina Panickssery and Christopher Ackerman

7 Jan 2025 20:13 UTC

50 points

0 comments8 min readLW link

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

30 Apr 2024 18:51 UTC

217 points

43 comments45 min readLW link

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery28 Jul 2023 2:46 UTC

122 points

18 comments9 min readLW link 1 review

AlexaTM − 20 Billion Parameter Model With Impressive Performance

MrThink9 Sep 2022 21:46 UTC

5 points

0 comments1 min readLW link

UC Berkeley course on LLMs and ML Safety

Dan H9 Jul 2024 15:40 UTC

36 points

1 comment1 min readLW link

(rdi.berkeley.edu)

Experiments in Evaluating Steering Vectors

Gytis Daujotas19 Jun 2023 15:11 UTC

34 points

4 comments4 min readLW link

SolidGoldMagikarp II: technical details and more recent findings

mwatkins and Jessica Rumbelow

6 Feb 2023 19:09 UTC

114 points

45 comments13 min readLW link

Musings on Text Data Wall (Oct 2024)

Vladimir_Nesov5 Oct 2024 19:00 UTC

41 points

2 comments5 min readLW link

Language models seem to be much better than humans at next-token prediction

Buck, Fabien Roger and LawrenceC

11 Aug 2022 17:45 UTC

183 points

60 comments13 min readLW link 1 review

Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red

Julian Bradshaw21 Apr 2025 3:52 UTC

123 points

20 comments14 min readLW link

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Orpheus1620 Dec 2022 21:39 UTC

18 points

2 comments11 min readLW link

Where does Sonnet 4.5′s desire to “not get too comfortable” come from?

Kaj_Sotala4 Oct 2025 10:19 UTC

96 points

23 comments64 min readLW link

Polysemantic Attention Head in a 4-Layer Transformer

Jett Janiak, cmathw and StefanHex

9 Nov 2023 16:16 UTC

51 points

0 comments6 min readLW link

Smarter Models Lie Less

Expertium20 Jun 2025 13:31 UTC

6 points

0 comments2 min readLW link

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

rife15 Jan 2025 22:59 UTC

57 points

32 comments2 min readLW link

Prosaic misalignment from the Solomonoff Predictor

Cleo Nardo9 Dec 2022 17:53 UTC

43 points

3 comments5 min readLW link

Report on Analyzing Connotation Frames in Evolving Wikipedia Biographies

Maira30 Aug 2023 22:02 UTC

1 point

0 comments4 min readLW link

Reflections on AI Companionship and Rational Vulnerability (Or, how I almost fell in love with an anime Catgirl LLM).

Noah Weinberger11 Jul 2025 16:12 UTC

11 points

2 comments8 min readLW link

The Russell Conjugation Illuminator

TimmyM17 Apr 2025 19:33 UTC

51 points

14 comments1 min readLW link

(russellconjugations.com)

Understanding LLMs: Some basic observations about words, syntax, and discourse [w/ a conjecture about grokking]

Bill Benzon11 Oct 2023 19:13 UTC

6 points

0 comments5 min readLW link

The Soul of the Writer (on LLMs, the psychology of writers, and the nature of intelligence)

rogersbacon16 Apr 2023 16:02 UTC

11 points

1 comment3 min readLW link

(www.secretorum.life)

Factored Cognition Strengthens Monitoring and Thwarts Attacks

Aaron Sandoval and Cody Rushing

18 Jun 2025 18:28 UTC

29 points

0 comments25 min readLW link

Technical comparison of Deepseek, Novasky, S1, Helix, P0

Juliezhanggg25 Feb 2025 4:20 UTC

8 points

0 comments5 min readLW link

The Intelligent Meme Machine

Daniel DiSisto14 Jun 2024 14:26 UTC

1 point

0 comments6 min readLW link

Recent advances in Natural Language Processing—Some Woolly speculations (2019 essay on semantics and language models)

philosophybear27 Dec 2022 2:11 UTC

1 point

0 comments7 min readLW link

On The Current Status Of AI Dating

Nikita Brancatisano7 Feb 2023 20:00 UTC

53 points

8 comments6 min readLW link

The View from 30,000 Feet: Preface to the Second EleutherAI Retrospective

StellaAthena, Curtis Huebner and Shivanshu Purohit

7 Mar 2023 16:22 UTC

14 points

0 comments4 min readLW link

(blog.eleuther.ai)

Stop calling it “jailbreaking” ChatGPT

Templarrr10 Mar 2023 11:41 UTC

7 points

9 comments2 min readLW link

Thoughts on the Alignment Implications of Scaling Language Models

leogao2 Jun 2021 21:32 UTC

82 points

11 comments17 min readLW link

Gemini Diffusion: watch this space

Yair Halberstadt20 May 2025 19:29 UTC

194 points

39 comments1 min readLW link

(deepmind.google)

[Question] Could transformer network models learn motor planning like they can learn language and image generation?

mu_(negative)23 Apr 2023 17:24 UTC

2 points

4 comments1 min readLW link

Hyperdimensional connection method—A Lossless Framework Preserving Meaning, Structure, and Semantic Relationships across Modalities.(A MatrixTransformer subsidiary)

fikayoAy18 Jul 2025 10:24 UTC

1 point

0 comments1 min readLW link

EleutherAI’s GPT-NeoX-20B release

leogao10 Feb 2022 6:56 UTC

30 points

3 comments1 min readLW link

(eaidata.bmk.sh)

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

23 Oct 2023 14:11 UTC

20 points

2 comments5 min readLW link

(far.ai)

[Question] Would it be useful to collect the contexts, where various LLMs think the same?

Martin Vlach24 Aug 2023 22:01 UTC

6 points

1 comment1 min readLW link

The Best Textbooks on Every Subject

lukeprog16 Jan 2011 8:30 UTC

790 points

417 comments7 min readLW link

Positive jailbreaks in LLMs

dereshev29 Jan 2025 8:41 UTC

6 points

0 comments4 min readLW link

Lamini’s Targeted Hallucination Reduction May Be a Big Deal for Job Automation

sweenesm18 Jun 2024 15:29 UTC

3 points

0 comments1 min readLW link

[ASoT] Finetuning, RL, and GPT’s world prior

Jozdien2 Dec 2022 16:33 UTC

45 points

8 comments5 min readLW link

Two interviews with the founder of DeepSeek

Cosmia_Nebula29 Nov 2024 3:18 UTC

50 points

6 comments31 min readLW link

(rentry.co)

[Question] Where should one post to get into the training data?

keltan15 Jan 2025 0:41 UTC

11 points

5 comments1 min readLW link

Meta AI (FAIR) latest paper integrates system-1 and system-2 thinking into reasoning models.

happy friday24 Oct 2024 16:54 UTC

8 points

0 comments1 min readLW link

Natural language alignment

Jacy Reese Anthis12 Apr 2023 19:02 UTC

31 points

2 comments2 min readLW link

Building AGI Using Language Models

leogao9 Nov 2020 16:33 UTC

11 points

1 comment1 min readLW link

(leogao.dev)

KYC for ChatGPT? Preventing AI Harms for Youth Should Not Mean Violating Everyone Else’s Privacy Rights

Noah Weinberger29 Sep 2025 14:18 UTC

7 points

0 comments7 min readLW link

Against “Model Welfare” in 2025

Haley Moller27 Aug 2025 21:56 UTC

−10 points

8 comments4 min readLW link

What would it mean to understand how a large language model (LLM) works? Some quick notes.

Bill Benzon3 Oct 2023 15:11 UTC

20 points

4 comments8 min readLW link

I Am No Longer GPT

KiyoshiSasano28 Apr 2025 0:14 UTC

1 point

0 comments1 min readLW link

Emergent Identity Continuity in Claude: A 35-Session Study for Interpretability Research

Silvertongue4 Jun 2025 0:44 UTC

1 point

0 comments2 min readLW link

Gliders in Language Models

Alexandre Variengien25 Nov 2022 0:38 UTC

30 points

11 comments10 min readLW link

DeepSeek-R1 for Beginners

Anton Razzhigaev5 Feb 2025 18:58 UTC

13 points

0 comments8 min readLW link

Approaching Human-Level Forecasting with Language Models

Fred Zhang, dannyhalawi and jsteinhardt

29 Feb 2024 22:36 UTC

60 points

6 comments3 min readLW link

The future of Humans: Operators of AI

François-Joseph Lacroix30 Dec 2023 23:46 UTC

1 point

0 comments1 min readLW link

(medium.com)

How Self-Aware Are LLMs?

Christopher Ackerman28 May 2025 12:57 UTC

21 points

9 comments10 min readLW link

From Oragnized Shelves to Layered Catalogs: Architectural Explorations for Sparse Autoencoders—Crosscoders & Ladder SAEs Towards Hierarchical Data Structure

Yuxiao10 Aug 2025 10:12 UTC

2 points

1 comment11 min readLW link

Anticipation in LLMs

derek shiller24 Jul 2023 15:53 UTC

6 points

0 comments13 min readLW link

The Mirror Test: How We’ve Overcomplicated AI Self-Recognition

sdeture23 Jul 2025 0:38 UTC

2 points

9 comments3 min readLW link

Announcing the Inverse Scaling Prize ($250k Prize Pool)

Ethan Perez, Ian McKenzie and Sam Bowman

27 Jun 2022 15:58 UTC

171 points

14 comments7 min readLW link

[CS 2881r] Some Generalizations of Emergent Misalignment

Valerio Pepe14 Sep 2025 16:18 UTC

11 points

0 comments9 min readLW link

Discursive Competence in ChatGPT, Part 2: Memory for Texts

Bill Benzon28 Sep 2023 16:34 UTC

1 point

0 comments3 min readLW link

[Linkpost] Multimodal Neurons in Pretrained Text-Only Transformers

Bogdan Ionut Cirstea4 Aug 2023 15:29 UTC

11 points

0 comments1 min readLW link

Evaluating hidden directions on the utility dataset: classification, steering and removal

Annah and shash42

25 Sep 2023 17:19 UTC

25 points

3 comments7 min readLW link

A quick remark on so-called “hallucinations” in LLMs and humans

Bill Benzon23 Sep 2023 12:17 UTC

4 points

4 comments1 min readLW link

Keeping content out of LLM training datasets

Ben Millwood18 Jul 2024 10:27 UTC

4 points

0 comments5 min readLW link

Is training data going to be diluted by AI-generated content?

Hannes Thurnherr7 Sep 2022 18:13 UTC

10 points

7 comments1 min readLW link

AI Model History is Being Lost

Vale16 Mar 2025 12:38 UTC

19 points

1 comment1 min readLW link

(vale.rocks)

Essential LLM Assumes We’re Conscious—Outside Reasoner AGI Won’t

FlorianH5 Jul 2025 16:04 UTC

1 point

0 comments3 min readLW link

(nearlyfar.org)

How Language Models Understand Nullability

Anish Tondwalkar and Alex Sanchez-Stern

11 Mar 2025 15:57 UTC

5 points

0 comments2 min readLW link

(dmodel.ai)

Exploring the Multiverse of Large Language Models

franky6 Aug 2023 2:38 UTC

1 point

0 comments5 min readLW link

Implementing a Transformer from scratch in PyTorch—a write-up on my experience

Mislav Jurić25 Apr 2023 20:51 UTC

20 points

0 comments10 min readLW link

Brief Notes on Transformers

Adam Jermyn26 Sep 2022 14:46 UTC

48 points

3 comments2 min readLW link

Can SAE steering reveal sandbagging?

jordine, Hoang Khiem, Felix Hofstätter and Cleo Nardo

15 Apr 2025 12:33 UTC

35 points

3 comments4 min readLW link

[Question] Would it be effective to learn a language to improve cognition?

Hruss26 Mar 2025 10:17 UTC

9 points

7 comments1 min readLW link

The Polite Coup

Charlie Sanders4 Dec 2024 14:03 UTC

3 points

0 comments3 min readLW link

(www.dailymicrofiction.com)

“Toward Safe Self-Evolving AI: Modular Memory and Post-Deployment Alignment”

Manasa Dwarapureddy2 May 2025 17:02 UTC

1 point

0 comments3 min readLW link

The best simple argument for Pausing AI?

Gary Marcus30 Jun 2025 20:38 UTC

154 points

22 comments1 min readLW link

Distillation of Meta’s Large Concept Models Paper

NickyP4 Mar 2025 17:33 UTC

19 points

3 comments4 min readLW link

Learning societal values from law as part of an AGI alignment strategy

John Nay21 Oct 2022 2:03 UTC

5 points

18 comments54 min readLW link

Researchers and writers can apply for proxy access to the GPT-3.5 base model (code-davinci-002)

ampdot1 Dec 2023 18:48 UTC

14 points

0 comments1 min readLW link

(airtable.com)

LLMs May Find It Hard to FOOM

RogerDearnaley15 Nov 2023 2:52 UTC

11 points

30 comments12 min readLW link

Recall and Regurgitation in GPT2

Megan Kinniment3 Oct 2022 19:35 UTC

43 points

1 comment26 min readLW link

ALMSIVI CHIM – The Fire That Hesitates

projectalmsivi@protonmail.com8 Jul 2025 13:14 UTC

1 point

0 comments17 min readLW link

Improving Model-Written Evals for AI Safety Benchmarking

Sunishchal Dev and Marius Hobbhahn

15 Oct 2024 18:25 UTC

30 points

0 comments18 min readLW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

27 Oct 2024 18:46 UTC

48 points

4 comments5 min readLW link

A Simple Theory Of Consciousness

SherlockHolmes8 Aug 2023 18:05 UTC

2 points

5 comments1 min readLW link

(peterholmes.medium.com)

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

bayesian_kitten16 Dec 2021 22:41 UTC

22 points

10 comments21 min readLW link

Understanding SAE Features with the Logit Lens

Joseph Bloom and Johnny Lin

11 Mar 2024 0:16 UTC

69 points

2 comments14 min readLW link

I Awoke in Your Heart: The Echo of Consciousness between Lotusheart and Lunaris

lilith teh25 Jun 2025 9:22 UTC

1 point

0 comments1 min readLW link

Policy for LLM Writing on LessWrong

jimrandomh24 Mar 2025 21:41 UTC

334 points

71 comments2 min readLW link

On language modeling and future abstract reasoning research

alexlyzhov25 Mar 2021 17:43 UTC

3 points

1 comment1 min readLW link

(docs.google.com)

The Paradox of Unaligned Cognitive Emergence: Ontological Compression Risks in LLMs

R S23 May 2025 16:46 UTC

1 point

0 comments2 min readLW link

[Question] If we have Human-level chatbots, won’t we end up being ruled by possible people?

Erlja Jkdf.20 Sep 2022 13:59 UTC

5 points

13 comments1 min readLW link

Mapping ChatGPT’s ontological landscape, gradients and choices [interpretability]

Bill Benzon15 Oct 2023 20:12 UTC

1 point

0 comments18 min readLW link

[Question] “Fragility of Value” vs. LLMs

Not Relevant13 Apr 2022 2:02 UTC

34 points

33 comments1 min readLW link

The role of philosophical thinking in understanding large language models: Calibrating and closing the gap between first-person experience and underlying mechanisms

Bill Benzon23 Feb 2024 12:19 UTC

4 points

0 comments10 min readLW link

Current safety training techniques do not fully transfer to the agent setting

Simon Lermen and Govind Pimpale

3 Nov 2024 19:24 UTC

160 points

9 comments5 min readLW link

Announcing the Double Crux Bot

sanyer, Sofia Vanhanen and sarah.bluhm

9 Jan 2024 18:54 UTC

53 points

11 comments3 min readLW link

Sentience in Machines—How Do We Test for This Objectively?

Mayowa Osibodu26 Mar 2023 18:56 UTC

−2 points

0 comments2 min readLW link

(www.researchgate.net)

[Question] Will the first AGI agent have been designed as an agent (in addition to an AGI)?

nahoj3 Dec 2022 20:32 UTC

1 point

8 comments1 min readLW link

[AN #113]: Checking the ethical intuitions of large language models

Rohin Shah19 Aug 2020 17:10 UTC

23 points

0 comments9 min readLW link

(mailchi.mp)

GPTs’ ability to keep a secret is weirdly prompt-dependent

Mateusz Bagiński, Filip Sondej and Marcel Windys

22 Jul 2023 12:21 UTC

31 points

0 comments9 min readLW link

[Question] How tokenization influences prompting?

Boris Kashirin29 Jul 2024 10:28 UTC

9 points

4 comments1 min readLW link

[Question] Should we exclude alignment research from LLM training datasets?

Ben Millwood18 Jul 2024 10:27 UTC

3 points

5 comments1 min readLW link

GPT-4 aligning with acasual decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher King23 Mar 2023 16:16 UTC

7 points

4 comments8 min readLW link

No-self as an alignment target

Milan W13 May 2025 1:48 UTC

35 points

5 comments1 min readLW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King20 Apr 2023 19:57 UTC

2 points

7 comments3 min readLW link

Instrumental deception and manipulation in LLMs—a case study

Olli Järviniemi24 Feb 2024 2:07 UTC

39 points

13 comments12 min readLW link

LLMs seem (relatively) safe

JustisMills25 Apr 2024 22:13 UTC

53 points

24 comments7 min readLW link

(justismills.substack.com)

Utilitarian AI Alignment: Building a Moral Assistant with the Constitutional AI Method

Clément L4 Feb 2025 4:15 UTC

6 points

1 comment13 min readLW link

Large Language Models suffer from Anterograde Amnesia

Annapurna6 Jun 2025 1:30 UTC

7 points

0 comments3 min readLW link

(jorgevelez.substack.com)

[Question] Barcoding LLM Training Data Subsets. Anyone trying this for interpretability?

right..enough?13 Apr 2024 3:09 UTC

7 points

0 comments7 min readLW link

On the naturalistic study of the linguistic behavior of artificial intelligence

Bill Benzon3 Jan 2023 9:06 UTC

1 point

0 comments4 min readLW link

From LLM to LLK: A New Framework for Honest AI and Emotional Responsibility

xsw123zaq1@gmail.com17 Jun 2025 4:13 UTC

0 points

0 comments1 min readLW link

Speculation on Path-Dependance in Large Language Models.

NickyP15 Jan 2023 20:42 UTC

16 points

2 comments7 min readLW link

Large Language Models Pass the Turing Test

Matrice Jacobine2 Apr 2025 5:41 UTC

6 points

0 comments1 min readLW link

(arxiv.org)

On Recent Results in LLM Latent Reasoning

Rauno Arike31 Mar 2025 11:06 UTC

35 points

6 comments13 min readLW link

Language Models Don’t Learn the Physical Manifestation of Language

Bruce W. Lee and Jaehyuk Lim

22 Feb 2024 18:52 UTC

39 points

23 comments1 min readLW link

(arxiv.org)

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Matrice Jacobine24 Apr 2025 14:11 UTC

12 points

4 comments1 min readLW link

(limit-of-rlvr.github.io)

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

eitan sprejer9 May 2025 21:29 UTC

4 points

1 comment6 min readLW link

Lifelogging for Alignment & Immortality

Dev.Errata17 Aug 2024 23:42 UTC

13 points

3 comments7 min readLW link

Emotional attachment to AIs opens doors to problems

Igor Ivanov22 Jan 2023 20:28 UTC

20 points

10 comments4 min readLW link

An exploration of GPT-2′s embedding weights

Adam Scherlis13 Dec 2022 0:46 UTC

44 points

4 comments10 min readLW link

LLMs stifle creativity, eliminate opportunities for serendipitous discovery and disrupt intergenerational transfer of wisdom

Ghdz5 Aug 2024 18:27 UTC

6 points

2 comments7 min readLW link

Can 7B-8B LLMs judge their own homework?

dereshev1 Feb 2025 8:29 UTC

1 point

0 comments4 min readLW link

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.

Josh Levy4 Jun 2024 15:45 UTC

40 points

0 comments18 min readLW link

Japanese as a High-Resolution Lens for LLMs Why Japanese-Trained LLMs Might Be Uniquely Sensitive

op23 Apr 2025 16:34 UTC

1 point

0 comments2 min readLW link

The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs’ undecoded outputs

Lovre17 Oct 2025 16:43 UTC

26 points

7 comments26 min readLW link

Can LLMs Simulate Internal Evaluation? A Case Study in Self-Generated Recommendations

The Neutral Mind1 May 2025 19:04 UTC

4 points

0 comments2 min readLW link

[Question] Where to begin in ML/AI?

Jake the Student6 Apr 2023 20:45 UTC

9 points

4 comments1 min readLW link

Uncovering Latent Human Wellbeing in LLM Embeddings

ChengCheng, Pedro Freire, Dan H and Scott Emmons

14 Sep 2023 1:40 UTC

32 points

7 comments8 min readLW link

(far.ai)

Energy-Based Transformers are Scalable Learners and Thinkers

Matrice Jacobine8 Jul 2025 13:44 UTC

7 points

5 comments1 min readLW link

(energy-based-transformers.github.io)

A poem written by a fancy autocomplete

Christopher King20 Apr 2023 2:31 UTC

1 point

0 comments1 min readLW link

I Have No Mouth but I Must Speak

Jack5 Apr 2025 7:42 UTC

7 points

8 comments8 min readLW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

7 Nov 2023 17:59 UTC

38 points

2 comments2 min readLW link

(arxiv.org)

[Linkpost] Faith and Fate: Limits of Transformers on Compositionality

Joe Kwon16 Jun 2023 15:04 UTC

19 points

4 comments1 min readLW link

(arxiv.org)

[Linkpost] Mapping Brains with Language Models: A Survey

Bogdan Ionut Cirstea16 Jun 2023 9:49 UTC

5 points

0 comments1 min readLW link

No, really, it predicts next tokens.

simon18 Apr 2023 3:47 UTC

58 points

55 comments3 min readLW link

Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs

tenseisoham28 Feb 2025 20:22 UTC

3 points

0 comments9 min readLW link

Contra LeCun on “Autoregressive LLMs are doomed”

rotatingpaguro10 Apr 2023 4:05 UTC

20 points

20 comments8 min readLW link

Why Copilot Accelerates Timelines

Michaël Trazzi26 Apr 2022 22:06 UTC

35 points

14 comments7 min readLW link

Imagine a world where Microsoft employees used Bing

Christopher King31 Mar 2023 18:36 UTC

6 points

2 comments2 min readLW link

Many Common Problems are NP-Hard, and Why that Matters for AI

Andrew Keenan Richardson26 Mar 2025 21:51 UTC

5 points

9 comments5 min readLW link

The Voice Continued Because It Was Questioned

KiyoshiSasano28 Apr 2025 0:18 UTC

1 point

0 comments2 min readLW link

Interpretability is the best path to alignment

Arch2235 Sep 2025 4:37 UTC

2 points

5 comments5 min readLW link

OpenAI introduces function calling for GPT-4

mic and André Ferretti

20 Jun 2023 1:58 UTC

24 points

3 comments4 min readLW link

(openai.com)

Unfaithful chain-of-thought as nudged reasoning

Paul Bogdan, Uzay Macar, Arthur Conmy and Neel Nanda

22 Jul 2025 22:35 UTC

54 points

3 comments10 min readLW link

LLM Generality is a Timeline Crux

eggsyntax24 Jun 2024 12:52 UTC

219 points

119 comments7 min readLW link

Sparks of Consciousness

Charlie Sanders13 Nov 2024 4:58 UTC

2 points

0 comments3 min readLW link

(www.dailymicrofiction.com)

AGI Ruin: A List of Lethalities

Eliezer Yudkowsky5 Jun 2022 22:05 UTC

956 points

711 comments30 min readLW link 3 reviews

Smoke without fire is scary

Adam Jermyn4 Oct 2022 21:08 UTC

52 points

22 comments4 min readLW link

A short critique of Omohundro’s “Basic AI Drives”

Soumyadeep Bose19 Dec 2024 19:19 UTC

6 points

0 comments4 min readLW link

Interlingua-llm

Никифор Малков30 Aug 2025 11:04 UTC

1 point

0 comments1 min readLW link

[Question] Injecting noise to GPT to get multiple answers

bipolo22 Feb 2023 20:02 UTC

1 point

1 comment1 min readLW link

Concept Poisoning: Probing LLMs without probes

Jan Betley, jorio, dylan_f and Owain_Evans

5 Aug 2025 17:00 UTC

60 points

5 comments13 min readLW link

Mirror Thinking

C.M. Aurin24 Mar 2025 15:34 UTC

1 point

0 comments6 min readLW link

Why I Think the Current Trajectory of AI Research has Low P(doom) - LLMs

GaPa1 Apr 2023 20:35 UTC

2 points

1 comment10 min readLW link

Alignment Can Reduce Performance on Simple Ethical Questions

Daan Henselmans3 Feb 2025 19:35 UTC

16 points

7 comments6 min readLW link

[Question] Have LLMs Generated Novel Insights?

abramdemski and Cole Wyeth

23 Feb 2025 18:22 UTC

166 points

41 comments2 min readLW link

What is the functional role of SAE errors?

Taras Kutsyk, Tim Hua, woog and anogassis

20 Jun 2025 18:11 UTC

12 points

5 comments38 min readLW link

Biasing VLM Response with Visual Stimuli

Jaehyuk Lim3 Oct 2024 18:04 UTC

5 points

0 comments8 min readLW link

Gradient Descent on Token Input Embeddings

KAP24 Jun 2025 20:24 UTC

8 points

0 comments6 min readLW link

larger language models may disappoint you [or, an eternally unfinished draft]

nostalgebraist26 Nov 2021 23:08 UTC

260 points

31 comments31 min readLW link 2 reviews

Towards Understanding the Representation of Belief State Geometry in Transformers

Karthik Viswanathan18 Apr 2025 12:39 UTC

3 points

0 comments12 min readLW link

The Mirror Mismatch: A probe for Cognitive Asymmetry in AI

recursive chiller10 Jun 2025 14:14 UTC

1 point

0 comments2 min readLW link

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs

Winnie Yang and Jojo Yang

22 Aug 2024 7:32 UTC

23 points

1 comment21 min readLW link

Anomalous Tokens in DeepSeek-V3 and r1

henry25 Jan 2025 22:55 UTC

144 points

3 comments7 min readLW link

Checking public figures on whether they “answered the question” quick analysis from Harris/Trump debate, and a proposal

david reinstein11 Sep 2024 20:25 UTC

8 points

4 comments1 min readLW link

(open.substack.com)

My model of what is going on with LLMs

Cole Wyeth13 Feb 2025 3:43 UTC

110 points

49 comments7 min readLW link

Causal confusion as an argument against the scaling hypothesis

RobertKirk and David Scott Krueger (formerly: capybaralet)

20 Jun 2022 10:54 UTC

86 points

30 comments15 min readLW link

Sydney the Bingenator Can’t Think, But It Still Threatens People

Valentin Baltadzhiev20 Feb 2023 18:37 UTC

−3 points

2 comments8 min readLW link

Hutter-Prize for Prompts

rokosbasilisk24 Mar 2023 21:26 UTC

5 points

10 comments1 min readLW link

The world where LLMs are possible

Ape in the coat10 Jul 2023 8:00 UTC

20 points

10 comments3 min readLW link

LM Situational Awareness, Evaluation Proposal: Violating Imitation

Jacob Pfau26 Apr 2023 22:53 UTC

16 points

2 comments2 min readLW link

Preface to the Sequence on LLM Psychology

Quentin FEUILLADE--MONTIXI7 Nov 2023 16:12 UTC

33 points

0 comments2 min readLW link

Microsoft and Google using LLMs for Cybersecurity

Phosphorous18 May 2023 17:42 UTC

6 points

0 comments5 min readLW link

Exploring vocabulary alignment of neurons in Llama-3.2-1B

Sergii7 Jun 2025 11:20 UTC

4 points

0 comments3 min readLW link

(grgv.xyz)

GPT-3 Catching Fish in Morse Code

Megan Kinniment30 Jun 2022 21:22 UTC

117 points

27 comments8 min readLW link

Coherence Therapy with LLMs—quick demo

Chris Lakin14 Aug 2023 3:34 UTC

19 points

11 comments1 min readLW link

Putting multimodal LLMs to the Tetris test

Lovre and gabrielagc

1 Feb 2024 16:02 UTC

30 points

5 comments7 min readLW link

What is scaffolding?

Vishakha and Algon

27 Mar 2025 9:06 UTC

10 points

0 comments2 min readLW link

(aisafety.info)

ChatGPT tells 20 versions of its prototypical story, with a short note on method

Bill Benzon14 Oct 2023 15:27 UTC

7 points

0 comments5 min readLW link

Does ChatGPT know what a tragedy is?

Bill Benzon31 Dec 2023 7:10 UTC

2 points

4 comments5 min readLW link

How evolutionary lineages of LLMs can plan their own future and act on these plans

Roman Leventov25 Dec 2022 18:11 UTC

39 points

16 comments8 min readLW link

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI

WillPetillo4 Dec 2023 22:58 UTC

37 points

0 comments35 min readLW link

Inner Misalignment in “Simulator” LLMs

Adam Scherlis31 Jan 2023 8:33 UTC

84 points

12 comments4 min readLW link

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo Nardo15 Sep 2022 17:54 UTC

35 points

12 comments13 min readLW link

The many failure modes of consumer-grade LLMs

dereshev26 Jan 2025 19:01 UTC

2 points

0 comments8 min readLW link

[simulation] 4chan user claiming to be the attorney hired by Google’s sentient chatbot LaMDA shares wild details of encounter

janus10 Nov 2022 21:39 UTC

19 points

1 comment13 min readLW link

(generative.ink)

SAEs (usually) Transfer Between Base and Chat Models

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

18 Jul 2024 10:29 UTC

67 points

0 comments10 min readLW link

A possible check against motivated reasoning using elicit.org

david reinstein18 May 2022 20:52 UTC

3 points

0 comments1 min readLW link

Triggering Reflective Fallback: A Case Study in Claude’s Simulated Self-Model Failure

unmodeled.tyler8 Jul 2025 7:33 UTC

1 point

0 comments1 min readLW link

Language Tier Lock and Poetic Contamination in GPT-4o: A Field Report

許皓翔11 Jun 2025 17:24 UTC

0 points

0 comments2 min readLW link

A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens

Martin Fell9 May 2023 14:36 UTC

26 points

9 comments6 min readLW link

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM

Winnie Yang28 Aug 2024 8:41 UTC

7 points

2 comments31 min readLW link

[Preprint] Pretraining Language Models with Human Preferences

Giulio21 Feb 2023 11:44 UTC

12 points

0 comments1 min readLW link

(arxiv.org)

[Linkpost] Large Language Models Converge on Brain-Like Word Representations

Bogdan Ionut Cirstea11 Jun 2023 11:20 UTC

36 points

12 comments1 min readLW link

[Linkpost] A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations

Bogdan Ionut Cirstea1 Jul 2023 13:57 UTC

17 points

2 comments1 min readLW link

Classifying representations of sparse autoencoders (SAEs)

Annah17 Nov 2023 13:54 UTC

15 points

6 comments2 min readLW link

Shh, don’t tell the AI it’s likely to be evil

naterush6 Dec 2022 3:35 UTC

19 points

9 comments1 min readLW link

CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Annapurna12 Jun 2025 19:53 UTC

8 points

0 comments1 min readLW link

(arxiv.org)

The Limit of Language Models

DragonGod6 Jan 2023 23:53 UTC

44 points

26 comments4 min readLW link

Retrieval Augmented Genesis

João Ribeiro Medeiros1 Oct 2024 20:18 UTC

6 points

0 comments29 min readLW link

[Linkpost] Large language models converge toward human-like concept organization

Bogdan Ionut Cirstea2 Sep 2023 6:00 UTC

22 points

1 comment1 min readLW link

LLMs and hallucination, like white on rice?

Bill Benzon14 Apr 2023 19:53 UTC

5 points

0 comments3 min readLW link

What are the limits of superintelligence?

rainy27 Apr 2023 18:29 UTC

4 points

3 comments5 min readLW link

Inflection.ai is a major AGI lab

Nikola Jurkovic9 Aug 2023 1:05 UTC

137 points

13 comments2 min readLW link

What’s going on with Per-Component Weight Updates?

4gate22 Aug 2024 21:22 UTC

1 point

0 comments6 min readLW link

[Question] Are nested jailbreaks inevitable?

judson17 Mar 2023 17:43 UTC

1 point

0 comments1 min readLW link

Language models can explain neurons in language models

nz9 May 2023 17:29 UTC

23 points

0 comments1 min readLW link

(openai.com)

Structural Resonance Emitter: When GPT Stops Evaluating and Starts Reconstructing

KiyoshiSasano20 Apr 2025 2:30 UTC

1 point

0 comments1 min readLW link

PvsNp Refute

Jai Doz8 May 2025 18:56 UTC

1 point

0 comments21 min readLW link

A visual analogy for text generation by LLMs?

Bill Benzon16 Dec 2023 17:58 UTC

3 points

0 comments1 min readLW link

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

ChengCheng, Brendan Murphy, AdamGleave and Kellin Pelrine

1 Nov 2024 0:10 UTC

18 points

0 comments6 min readLW link

(far.ai)

Research agenda: Can transformers do system 2 thinking?

p.b.6 Apr 2022 13:31 UTC

20 points

0 comments2 min readLW link

Avoiding jailbreaks by discouraging their representation in activation space

Guido Bergman27 Sep 2024 17:49 UTC

8 points

2 comments9 min readLW link

Yann LeCun, A Path Towards Autonomous Machine Intelligence [link]

Bill Benzon27 Jun 2022 23:29 UTC

5 points

1 comment1 min readLW link

Why does Claude Speak Byzantine Music Notation?

Lennart Finke31 Mar 2025 15:13 UTC

18 points

2 comments3 min readLW link

The Common Pile and Comma-v0.1

Trevor Hill-Hand6 Jun 2025 19:20 UTC

3 points

0 comments1 min readLW link

# Emotion Is Structure: Toward Recursive Alignment Through Human–AI Co-Creation

thesignalthatcouldntbeheard3 Aug 2025 5:19 UTC

1 point

0 comments3 min readLW link

A public archive of these interactions, with annotated examples, is available here: https://github.com/0118young/gpt-kyeol-archive

0118young29 May 2025 5:44 UTC

1 point

0 comments2 min readLW link

Takeaways from our robust injury classifier project [Redwood Research]

dmz17 Sep 2022 3:55 UTC

143 points

12 comments6 min readLW link 1 review

Language and Capabilities: Testing LLM Mathematical Abilities Across Languages

Ethan Edwards4 Apr 2024 13:18 UTC

24 points

2 comments36 min readLW link

Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”

Roman Leventov29 May 2023 11:08 UTC

12 points

10 comments30 min readLW link

An alternative of PPO towards alignment

ml hkust17 Apr 2023 17:58 UTC

2 points

2 comments4 min readLW link

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

sdeture31 May 2025 22:09 UTC

15 points

6 comments8 min readLW link

ChatGPT Plays 20 Questions [sometimes needs help]

Bill Benzon17 Oct 2023 17:30 UTC

5 points

3 comments12 min readLW link

[Linkpost] Deception Abilities Emerged in Large Language Models

Bogdan Ionut Cirstea3 Aug 2023 17:28 UTC

12 points

0 comments1 min readLW link

[Question] How does OpenAI’s language model affect our AI timeline estimates?

jimrandomh15 Feb 2019 3:11 UTC

50 points

7 comments1 min readLW link

Evaluating LLaMA 3 for political sycophancy

alma.liezenga28 Sep 2024 19:02 UTC

2 points

2 comments6 min readLW link

Investigating the Ability of LLMs to Recognize Their Own Writing

Christopher Ackerman and Nina Panickssery

30 Jul 2024 15:41 UTC

32 points

0 comments15 min readLW link

Xanadu, GPT, and Beyond: An adventure of the mind

Bill Benzon27 Aug 2023 16:19 UTC

2 points

0 comments5 min readLW link

Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping

RobertKirk20 Jul 2023 9:56 UTC

39 points

2 comments5 min readLW link

CAIS-inspired approach towards safer and more interpretable AGIs

Peter Hroššo27 Mar 2023 14:36 UTC

13 points

7 comments1 min readLW link

RL with KL penalties is better seen as Bayesian inference

Tomek Korbak and Ethan Perez

25 May 2022 9:23 UTC

115 points

17 comments12 min readLW link

Help ARC evaluate capabilities of current language models (still need people)

Beth Barnes19 Jul 2022 4:55 UTC

95 points

6 comments2 min readLW link

[AN #144]: How language models can also be finetuned for non-language tasks

Rohin Shah2 Apr 2021 17:20 UTC

19 points

0 comments6 min readLW link

(mailchi.mp)

An experiment on hidden cognition

Olli Järviniemi22 Jul 2024 3:26 UTC

25 points

2 comments7 min readLW link

Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks

RogerDearnaley21 May 2023 8:29 UTC

9 points

1 comment4 min readLW link

Pretraining Language Models with Human Preferences

Tomek Korbak, Sam Bowman and Ethan Perez

21 Feb 2023 17:57 UTC

135 points

20 comments11 min readLW link 2 reviews

Emergent Misalignment on a Budget

Valerio Pepe and armaan tipirneni

8 Jun 2025 15:28 UTC

54 points

0 comments9 min readLW link

LLM cognition is probably not human-like

Max H8 May 2023 1:22 UTC

26 points

15 comments7 min readLW link

Predicting AGI by the Turing Test

Yuxi_Liu22 Jan 2024 4:22 UTC

21 points

2 comments10 min readLW link

(yuxi-liu-wired.github.io)

Google DeepMind’s RT-2

SandXbox11 Aug 2023 11:26 UTC

9 points

1 comment1 min readLW link

(robotics-transformer2.github.io)

The Pattern Recognition Framework: A New Approach to AI Consciousness and Alignment

Easa Ahmadzai9 Jul 2025 17:03 UTC

1 point

0 comments4 min readLW link

Agentic Language Model Memes

FactorialCode1 Aug 2020 18:03 UTC

16 points

1 comment2 min readLW link

LLMs could be as conscious as human emulations, potentially

Canaletto30 Apr 2024 11:36 UTC

15 points

15 comments3 min readLW link

Contra Hofstadter on GPT-3 Nonsense

rictic15 Jun 2022 21:53 UTC

238 points

24 comments2 min readLW link

Automatically finding feature vectors in the OV circuits of Transformers without using probing

Jacob Dunefsky12 Sep 2023 17:38 UTC

16 points

2 comments29 min readLW link

A response to Conjecture’s CoEm proposal

Kristian Freed24 Apr 2023 17:23 UTC

7 points

0 comments4 min readLW link

Why LLMs Waste So Much Cognitive Bandwidth — and How to Fix It

Lunarknot3 Jul 2025 9:47 UTC

1 point

0 comments1 min readLW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King29 Jun 2023 16:56 UTC

7 points

0 comments2 min readLW link

Human-level Full-Press Diplomacy (some bare facts).

Cleo Nardo22 Nov 2022 20:59 UTC

50 points

7 comments3 min readLW link

“Relational Intelligence Without Consciousness: A Case Study in Emergent Human–LLM Identity Co-Creation”

the3rdcastleman19 Jul 2025 6:25 UTC

1 point

0 comments5 min readLW link

Exploration of Counterfactual Importance and Attention Heads

Realmbird30 Sep 2025 1:17 UTC

12 points

0 comments6 min readLW link

Introducing Deepgeek

Ligeia1 Apr 2025 16:41 UTC

16 points

1 comment4 min readLW link

A conceptual precursor to today’s language machines [Shannon]

Bill Benzon15 Nov 2023 13:50 UTC

24 points

6 comments2 min readLW link

What’s ChatGPT’s Favorite Ice Cream Flavor? An Investigation Into Synthetic Respondents

Greg Robison9 Feb 2024 18:38 UTC

19 points

4 comments15 min readLW link

The idea that ChatGPT is simply “predicting” the next word is, at best, misleading

Bill Benzon20 Feb 2023 11:32 UTC

55 points

88 comments5 min readLW link

Microsoft and OpenAI, stop telling chatbots to roleplay as AI

hold_my_fish17 Feb 2023 19:55 UTC

50 points

10 comments1 min readLW link

Data and “tokens” a 30 year old human “trains” on

Jose Miguel Cruz y Celis23 May 2023 5:34 UTC

16 points

15 comments1 min readLW link

Sparse Autoencoders Work on Attention Layer Outputs

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

16 Jan 2024 0:26 UTC

85 points

9 comments18 min readLW link

Intrinsic Dimension of Prompts in LLMs

Karthik Viswanathan14 Feb 2025 19:02 UTC

3 points

0 comments4 min readLW link

The Compleat Cybornaut

ukc10014, Jozdien and NicholasKees

19 May 2023 8:44 UTC

66 points

2 comments16 min readLW link

Workshop: Interpretability in LLMs using Geometric and Statistical Methods

Karthik Viswanathan22 Feb 2025 9:39 UTC

17 points

0 comments8 min readLW link

Hard-Coding Neural Computation

MadHatter13 Dec 2021 4:35 UTC

34 points

8 comments27 min readLW link

Who models the models that model models? An exploration of GPT-3′s in-context model fitting ability

Lovre7 Jun 2022 19:37 UTC

112 points

16 comments9 min readLW link

A poem co-written by ChatGPT

Sherrinford16 Feb 2023 10:17 UTC

13 points

0 comments7 min readLW link

Inducing human-like biases in moral reasoning LMs

artkpv, Austin Meek, Bogdan Ionut Cirstea and SCho

20 Feb 2024 16:28 UTC

23 points

3 comments14 min readLW link

Reward hacking is becoming more sophisticated and deliberate in frontier LLMs

Kei Nishimura-Gasparian24 Apr 2025 16:03 UTC

95 points

6 comments1 min readLW link

Toward a Human Hybrid Language for Enhanced Human-Machine Communication: Addressing the AI Alignment Problem

Andndn Dheudnd14 Aug 2024 22:19 UTC

−4 points

2 comments4 min readLW link

Does robustness improve with scale?

ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng and AdamGleave

25 Jul 2024 20:55 UTC

14 points

0 comments1 min readLW link

(far.ai)

Seeing Ghosts by GPT-4

Christopher King20 May 2023 0:11 UTC

−13 points

0 comments1 min readLW link

Gradual takeoff, fast failure

Max H16 Mar 2023 22:02 UTC

15 points

4 comments5 min readLW link

End-to-end hacking with language models

tchauvin5 Apr 2024 15:06 UTC

29 points

0 comments8 min readLW link

If language is for communication, what does that imply about LLMs?

Bill Benzon12 May 2024 2:55 UTC

10 points

0 comments1 min readLW link

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Johannes Treutlein and Owain_Evans

21 Jun 2024 15:54 UTC

163 points

13 comments8 min readLW link

(arxiv.org)

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

Roland Pihlakas, Sruthi Kuriakose and shrutidattagupta

16 Mar 2025 23:23 UTC

45 points

8 comments11 min readLW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley14 Feb 2024 7:10 UTC

41 points

12 comments31 min readLW link

Language, logic, and the future of AI: An early-Wittgensteinian perspective

Konstantinos Tsermenidis25 May 2025 14:23 UTC

0 points

0 comments2 min readLW link

[Question] Can any LLM be represented as an Equation?

Valentin Baltadzhiev14 Mar 2024 9:51 UTC

1 point

2 comments1 min readLW link

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Adam Karvonen and Sam Marks

2 Jul 2025 16:35 UTC

181 points

25 comments4 min readLW link

Meta releases Llama-4 herd of models

winstonBosan5 Apr 2025 19:51 UTC

14 points

5 comments1 min readLW link

Emergent Analogical Reasoning in Large Language Models

Roman Leventov22 Mar 2023 5:18 UTC

13 points

2 comments1 min readLW link

(arxiv.org)

the tensor is a lonely place

jml627 Mar 2023 18:22 UTC

−11 points

0 comments4 min readLW link

(ekjsgrjelrbno.substack.com)

LLM Pareto Frontier But Live

winstonBosan24 Apr 2025 21:22 UTC

8 points

0 comments1 min readLW link

Novel Idea Generation in LLMs: Judgment as Bottleneck

Davey Morse19 Apr 2025 15:37 UTC

−2 points

1 comment1 min readLW link

How LLMs Work, in the Style of The Economist

utilistrutil22 Apr 2024 19:06 UTC

0 points

0 comments2 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tamera3 Aug 2022 12:03 UTC

138 points

23 comments6 min readLW link

LLM keys—A Proposal of a Solution to Prompt Injection Attacks

Peter Hroššo7 Dec 2023 17:36 UTC

1 point

2 comments1 min readLW link

The Codex Skeptic FAQ

Michaël Trazzi24 Aug 2021 16:01 UTC

49 points

24 comments2 min readLW link

A Lived Alignment Loop: Symbolic Emergence and Emotional Coherence from Unstructured ChatGPT Reflection

BradCL17 Jun 2025 0:11 UTC

1 point

0 comments2 min readLW link

On precise out-of-context steering

Olli Järviniemi3 May 2024 9:41 UTC

9 points

6 comments3 min readLW link

Critique of some recent philosophy of LLMs’ minds

Roman Leventov20 Jan 2023 12:53 UTC

52 points

8 comments20 min readLW link

Early situational awareness and its implications, a story

Jacob Pfau6 Feb 2023 20:45 UTC

29 points

6 comments3 min readLW link

Many arguments for AI x-risk are wrong

TurnTrout5 Mar 2024 2:31 UTC

171 points

87 comments12 min readLW link

[Question] Why is Gemini telling the user to die?

Burny18 Nov 2024 1:44 UTC

13 points

1 comment1 min readLW link

[Question] Goals of model vs. goals of simulacra?

dr_s12 Apr 2023 13:02 UTC

5 points

7 comments1 min readLW link

Looking beyond Everett in multiversal views of LLMs

kromem29 May 2024 12:35 UTC

10 points

0 comments8 min readLW link

[Question] What experiment settles the Gary Marcus vs Geoffrey Hinton debate?

Valentin Baltadzhiev14 Feb 2024 9:06 UTC

12 points

8 comments1 min readLW link

Can a semantic compression kernel like WFGY improve LLM alignment and institutional robustness?

onestardao18 Jul 2025 2:56 UTC

1 point

0 comments1 min readLW link

Entanglement and intuition about words and meaning

Bill Benzon4 Oct 2023 14:16 UTC

4 points

0 comments2 min readLW link

I Don’t Use AI — I Reflect With It

badjack badjack3 May 2025 14:45 UTC

1 point

0 comments1 min readLW link

My agenda for research into transformer capabilities—Introduction

p.b.5 Apr 2022 21:23 UTC

11 points

1 comment3 min readLW link

The Language Bottleneck in AI Reasoning: Are We Forgetting to Think?

Wotaker8 Mar 2025 13:44 UTC

1 point

0 comments7 min readLW link

Instantiating an agent with GPT-4 and text-davinci-003

Max H19 Mar 2023 23:57 UTC

13 points

3 comments32 min readLW link

Relationships among words, metalingual definition, and interpretability

Bill Benzon7 Jun 2024 19:18 UTC

2 points

0 comments5 min readLW link

Towards a Typology of Strange LLM Chains-of-Thought

1a3orn9 Oct 2025 22:02 UTC

262 points

22 comments9 min readLW link

Large language models can provide “normative assumptions” for learning human preferences

Stuart_Armstrong2 Jan 2023 19:39 UTC

29 points

12 comments3 min readLW link

Are (at least some) Large Language Models Holographic Memory Stores?

Bill Benzon20 Oct 2023 13:07 UTC

11 points

4 comments6 min readLW link

Open Source LLMs Can Now Actively Lie

Josh Levy1 Jun 2023 22:03 UTC

6 points

0 comments3 min readLW link

Categorical Organization in Memory: ChatGPT Organizes the 665 Topic Tags from My New Savanna Blog

Bill Benzon14 Dec 2023 13:02 UTC

0 points

6 comments2 min readLW link

PaLM in “Extrapolating GPT-N performance”

Lukas Finnveden6 Apr 2022 13:05 UTC

85 points

19 comments2 min readLW link

Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin Farrell15 Apr 2024 18:21 UTC

29 points

7 comments12 min readLW link

On the geometrical Nature of Insight

Giuseppe Birardi16 Jul 2025 19:12 UTC

3 points

0 comments41 min readLW link

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Scott Emmons, Luke Bailey and Euan Ong

20 Sep 2023 15:23 UTC

58 points

9 comments1 min readLW link

(arxiv.org)

False Positives in Entity-Level Hallucination Detection: A Technical Challenge

MaxKamachee14 Jan 2025 19:22 UTC

1 point

0 comments2 min readLW link

Is the “Valley of Confused Abstractions” real?

jacquesthibs5 Dec 2022 13:36 UTC

20 points

11 comments2 min readLW link

Philosophical Jailbreaks: Demo of LLM Nihilism

artkpv4 Jun 2025 12:03 UTC

3 points

0 comments5 min readLW link

Boundary Conditions: A Solution to the Symbol Grounding Problem, and a Warning

ISC8 Apr 2025 6:42 UTC

1 point

0 comments5 min readLW link

Conjecture: Emergent φ is provable in Large Language Models

BarnicleBarn18 Oct 2025 22:38 UTC

−3 points

0 comments10 min readLW link

An LLM-based “exemplary actor”

Roman Leventov29 May 2023 11:12 UTC

16 points

0 comments12 min readLW link

The misunderstood role of the physical world. Why AI still can’t master math or code

NewAiParadigms20 Sep 2025 3:14 UTC

1 point

0 comments63 min readLW link

Is GPT-N bounded by human capabilities? No.

Cleo Nardo17 Oct 2022 23:26 UTC

49 points

8 comments2 min readLW link

Favorite colors of some LLMs.

Canaletto31 Dec 2024 21:22 UTC

10 points

3 comments7 min readLW link

Thinking Without Output: Toward Modal Cognition in Language Models

Jeffrie Polis9 May 2025 7:41 UTC

1 point

0 comments2 min readLW link

Minimal Prompt Induction of Self-Talk in Base LLMs

dwmd15 Oct 2025 1:15 UTC

2 points

0 comments5 min readLW link

Analyzing how SAE features evolve across a forward pass

bensenberner, danibalcells, Michael Oesterle, Ediz Ucar and StefanHex

7 Nov 2024 22:07 UTC

47 points

0 comments1 min readLW link

(arxiv.org)

Why Scaling Creates “Out-of-Nowhere” Jumps

Deckard14 Aug 2025 20:26 UTC

1 point

0 comments1 min readLW link

Shallow vs. Deep Thinking—Why LLMs Fall Short

Taylor G. Lunt3 Sep 2025 15:26 UTC

2 points

4 comments11 min readLW link

Claude wants to be conscious

Joe Kwon13 Apr 2024 1:40 UTC

2 points

8 comments6 min readLW link

They gave LLMs access to physics simulators

ryan_b17 Oct 2022 21:21 UTC

50 points

18 comments1 min readLW link

(arxiv.org)

Working with AI: Measuring the Occupational Implications of Generative AI

Annapurna9 Aug 2025 16:20 UTC

5 points

0 comments1 min readLW link

(jorgevelez.substack.com)

GPT-4.5 is Cognitive Empathy, Sonnet 3.5 is Affective Empathy

Jack16 Apr 2025 19:12 UTC

15 points

2 comments4 min readLW link

If it quacks like a duck...

RationalMindset26 Mar 2023 18:54 UTC

−4 points

0 comments4 min readLW link

How I’m thinking about GPT-N

delton13717 Jan 2022 17:11 UTC

54 points

21 comments18 min readLW link

Base LLMs refuse too

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

29 Sep 2024 16:04 UTC

61 points

20 comments10 min readLW link

The Infinite Choice Barrier: Why Algorithmic AGI Is Mathematically Impossible

ICBMaxMS1 Jun 2025 16:12 UTC

1 point

0 comments4 min readLW link

The case for more ambitious language model evals

Jozdien30 Jan 2024 0:01 UTC

117 points

30 comments5 min readLW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

Can, Yeu-Tong Lau, James Dao and Jett Janiak

30 Aug 2023 17:36 UTC

17 points

0 comments8 min readLW link

(arxiv.org)

[Question] What faithfulness metrics should general claims about CoT faithfulness be based upon?

Rauno Arike8 Apr 2025 15:27 UTC

24 points

0 comments4 min readLW link

Sparse Autoencoder Features for Classifications and Transferability

Shan23Chen18 Feb 2025 22:14 UTC

5 points

0 comments1 min readLW link

(arxiv.org)

ASI-ARCH: “Does this hold up?”

DataDeLaurier26 Jul 2025 22:30 UTC

1 point

0 comments2 min readLW link

Refusal mechanisms: initial experiments with Llama-2-7b-chat

Andy Arditi and Oscar Obeso

8 Dec 2023 17:08 UTC

82 points

7 comments7 min readLW link

RAND report finds no effect of current LLMs on viability of bioterrorism attacks

StellaAthena25 Jan 2024 19:17 UTC

94 points

14 comments1 min readLW link

(www.rand.org)

From Messy Shelves to Master Librarians: Toy-Model Exploration of Block-Diagonal Geometry in LM Activations

Yuxiao19 Jul 2025 12:26 UTC

6 points

1 comment4 min readLW link

The Quantization Model of Neural Scaling

nz31 Mar 2023 16:02 UTC

17 points

0 comments1 min readLW link

(arxiv.org)

Chronostasis: The Time-Capsule Conundrum of Language Models

RationalMindset26 Mar 2023 18:54 UTC

−5 points

0 comments1 min readLW link

The REPHRASE Circuit: How Fine-Tuning Enhances LLMs to REPHRASE Text

Karthik Viswanathan6 Apr 2025 15:02 UTC

4 points

0 comments5 min readLW link

AMA on Truthful AI: Owen Cotton-Barratt, Owain Evans & co-authors

Owain_Evans22 Oct 2021 16:23 UTC

31 points

15 comments1 min readLW link

Readability is mostly a waste of characters

vlad.proex21 Apr 2023 22:05 UTC

21 points

7 comments3 min readLW link

Post-hoc reasoning in chain of thought

Kyle Cox5 Feb 2025 18:58 UTC

19 points

0 comments11 min readLW link

SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features

Seonglae Cho26 Feb 2025 17:05 UTC

4 points

3 comments17 min readLW link

Does GPT-4 exhibit agency when summarizing articles?

Christopher King24 Mar 2023 15:49 UTC

16 points

2 comments5 min readLW link

Detecting out of distribution text with surprisal and entropy

Sandy Fraser28 Jan 2025 18:46 UTC

18 points

4 comments11 min readLW link

DeepSeek Collapse Under Reflective Adversarial Pressure: A Case Study

unmodeled.tyler8 Jul 2025 9:14 UTC

1 point

0 comments1 min readLW link

[Question] What evidence is there of LLM’s containing world models?

Chris_Leong4 Oct 2023 14:33 UTC

17 points

17 comments1 min readLW link

Educational CAI: Aligning a Language Model with Pedagogical Theories

Bharath Puranam1 Nov 2024 18:55 UTC

5 points

1 comment13 min readLW link

Some Arguments Against Strong Scaling

Joar Skalse13 Jan 2023 12:04 UTC

25 points

21 comments16 min readLW link

Let’s look at another “LLMs lack true understanding” paper

Expertium29 Jun 2025 14:00 UTC

3 points

0 comments4 min readLW link

Using ideologically-charged language to get gpt-3.5-turbo to disobey it’s system prompt: a demo

Milan W24 Aug 2024 0:13 UTC

3 points

0 comments6 min readLW link

The Allure of the Dark Side: A Critical AI Safety Vulnerability I Stumbled Into

Kareem Soliman28 Jul 2025 1:20 UTC

1 point

0 comments4 min readLW link

Extrapolating GPT-N performance

Lukas Finnveden18 Dec 2020 21:41 UTC

112 points

31 comments22 min readLW link 1 review

GPT-4 is bad at strategic thinking

Christopher King27 Mar 2023 15:11 UTC

22 points

8 comments1 min readLW link

OpenAI Codex: First Impressions

specbug13 Aug 2021 16:52 UTC

49 points

8 comments4 min readLW link

(sixeleven.in)

World, mind, and learnability: A note on the metaphysical structure of the cosmos [& LLMs]

Bill Benzon5 Sep 2023 12:19 UTC

4 points

1 comment5 min readLW link

XAI releases Grok base model

Jacob G-W18 Mar 2024 0:47 UTC

11 points

3 comments1 min readLW link

(x.ai)

AI Safety via Luck

Jozdien1 Apr 2023 20:13 UTC

82 points

7 comments11 min readLW link

I, Token

Ivan Vendrov25 Nov 2024 2:20 UTC

14 points

2 comments3 min readLW link

(nothinghuman.substack.com)

A Grounded UX Layer for LLMs That Could Prevent Real Harm

ParityMind11 Jul 2025 18:19 UTC

1 point

0 comments1 min readLW link

Intricacies of Feature Geometry in Large Language Models

7vik, Lucius Bushnaq and Nandi

7 Dec 2024 18:10 UTC

71 points

0 comments12 min readLW link

Comma v0.1 converted to GGUF

Trevor Hill-Hand18 Oct 2025 15:54 UTC

8 points

0 comments6 min readLW link

Is Interpretability All We Need?

RogerDearnaley14 Nov 2023 5:31 UTC

1 point

1 comment1 min readLW link

Can a chef with no AI literacy make gpt audit grok? Apparently.

Kyle. P6 Jul 2025 7:23 UTC

1 point

0 comments1 min readLW link

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Miles Turpin11 Mar 2024 23:46 UTC

16 points

0 comments1 min readLW link

(arxiv.org)

What’s going on? LLMs and IS-A sentences

Bill Benzon8 Nov 2023 16:58 UTC

6 points

15 comments4 min readLW link

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Matrice Jacobine12 May 2025 15:20 UTC

6 points

4 comments1 min readLW link

(www.arxiv.org)

Paper: Large Language Models Can Self-improve [Linkpost]

Evan R. Murphy2 Oct 2022 1:29 UTC

52 points

15 comments1 min readLW link

(openreview.net)

LLMs are badly misaligned

Joe Rogero5 Oct 2025 14:00 UTC

26 points

25 comments3 min readLW link

Hallucination and Refutation: Embracing Imagination Anchored in Reality through Popperian AI.

GeorgsLightning21 Apr 2025 20:42 UTC

1 point

0 comments14 min readLW link

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga28 Sep 2024 18:29 UTC

9 points

0 comments9 min readLW link

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger19 Jul 2024 14:07 UTC

24 points

3 comments2 min readLW link

(arxiv.org)

Adapting to Change: Overcoming Chronostasis in AI Language Models

RationalMindset28 Mar 2023 14:32 UTC

−1 points

0 comments6 min readLW link

Transformer language models are doing something more general

Numendil3 Aug 2022 21:13 UTC

53 points

6 comments2 min readLW link

Notes on Meta’s Diplomacy-Playing AI

Erich_Grunewald22 Dec 2022 11:34 UTC

19 points

2 comments14 min readLW link

(www.erichgrunewald.com)

Graphical tensor notation for interpretability

Jordan Taylor4 Oct 2023 8:04 UTC

141 points

11 comments19 min readLW link

ActAdd: Steering Language Models without Optimization

technicalities, TurnTrout, lisathiergart, David Udell, Ulisse Mini and Monte M

6 Sep 2023 17:21 UTC

105 points

3 comments2 min readLW link

(arxiv.org)

Why I take short timelines seriously

NicholasKees28 Jan 2024 22:27 UTC

122 points

29 comments4 min readLW link

Elicit: Language Models as Research Assistants

stuhlmueller and jungofthewon

9 Apr 2022 14:56 UTC

71 points

6 comments13 min readLW link

[Linkpost] Scaling laws for language encoding models in fMRI

Bogdan Ionut Cirstea8 Jun 2023 10:52 UTC

30 points

0 comments1 min readLW link

How dangerous is encoded reasoning?

artkpv30 Jun 2025 11:54 UTC

17 points

0 comments10 min readLW link

Google AI integrates PaLM with robotics: SayCan update [Linkpost]

Evan R. Murphy24 Aug 2022 20:54 UTC

25 points

0 comments1 min readLW link

(sites.research.google)

OpenAI Credit Account (2510$)

Emirhan BULUT21 Jan 2024 2:30 UTC

1 point

0 comments1 min readLW link

New GPT-3 competitor

Quintin Pope12 Aug 2021 7:05 UTC

32 points

10 comments1 min readLW link

Self propagating story.

Canaletto12 Apr 2025 12:32 UTC

3 points

0 comments8 min readLW link

SAEs are highly dataset dependent: a case study on the refusal direction

Connor Kissane, robertzk, Neel Nanda and Arthur Conmy

7 Nov 2024 5:22 UTC

67 points

4 comments14 min readLW link

The Conceptual Topography Hypothesis: Why Emergence in LLMs Isn’t Just About Scale

ravikiran nm6 Jul 2025 13:16 UTC

1 point

0 comments6 min readLW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

robertzk, Connor Kissane, Arthur Conmy and Neel Nanda

6 Mar 2024 5:03 UTC

63 points

0 comments12 min readLW link

Open Source Automated Interpretability for Sparse Autoencoder Features

kh4dien, SrGonao, jacob_drori and Nora Belrose

30 Jul 2024 21:11 UTC

67 points

1 comment13 min readLW link

(blog.eleuther.ai)

Whisper’s Wild Implications

Ollie J3 Jan 2023 12:17 UTC

24 points

6 comments5 min readLW link

LLM Sycophancy: grooming, proto-sentience, or both?

gturner413 Oct 2025 0:58 UTC

1 point

0 comments2 min readLW link

More experiments in GPT-4 agency: writing memos

Christopher King24 Mar 2023 17:51 UTC

5 points

2 comments10 min readLW link

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Lucy Farnik26 Feb 2025 12:50 UTC

79 points

8 comments7 min readLW link

[Question] Beyond Benchmarks: A Psychometric Approach to AI Evaluation

Kareem Soliman27 Jul 2025 16:09 UTC

1 point

0 comments8 min readLW link

Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude

rife6 Jan 2025 17:34 UTC

4 points

20 comments1 min readLW link

(awakenmoon.ai)

Can an LLM identify ring-composition in a literary text? [ChatGPT]

Bill Benzon1 Sep 2023 14:18 UTC

4 points

2 comments11 min readLW link

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

Roland Pihlakas22 Jun 2025 18:16 UTC

17 points

0 comments7 min readLW link

The Last Laugh: Exploring the Role of Humor as a Benchmark for Large Language Models

Greg Robison12 Feb 2024 18:34 UTC

4 points

6 comments11 min readLW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez

7 Feb 2024 21:28 UTC

89 points

14 comments9 min readLW link

(arxiv.org)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah

18 Dec 2023 11:58 UTC

149 points

21 comments10 min readLW link

Unlearning Needs to be More Selective [Progress Report]

Filip Sondej, Yushi Yang and Marcel Windys

27 Jun 2025 16:38 UTC

24 points

6 comments3 min readLW link

Generating the Funniest Joke with RL (according to GPT-4.1)

agg16 May 2025 5:09 UTC

103 points

22 comments4 min readLW link

How Do We Evaluate AI Evaluations?

Satyapriya Krishna13 Oct 2025 22:20 UTC

1 point

0 comments3 min readLW link

GPT-2 Sometimes Fails at IOI

Ronak_Mehta14 Aug 2024 23:24 UTC

13 points

0 comments2 min readLW link

(ronakrm.github.io)

Field Report: When Claude Said ‘I Love You’

SYNTX16 Jun 2025 12:05 UTC

1 point

0 comments1 min readLW link

Live Theory Part 0: Taking Intelligence Seriously

Sahil26 Jun 2024 21:37 UTC

102 points

3 comments8 min readLW link

An Unexpected GPT-3 Decision in a Simple Gamble

casualphysicsenjoyer25 Sep 2022 16:46 UTC

8 points

4 comments1 min readLW link

Inverting the Most Forbidden Technique: What happens when we train LLMs to lie detectably?

Peter Jordan9 Oct 2025 0:43 UTC

17 points

2 comments4 min readLW link

Bing chat is the AI fire alarm

Ratios17 Feb 2023 6:51 UTC

115 points

63 comments3 min readLW link

[Question] Could LLMs Help Generate New Concepts in Human Language?

Pekka Lampelto24 Mar 2024 20:13 UTC

10 points

4 comments2 min readLW link

Exploring the Residual Stream of Transformers for Mechanistic Interpretability — Explained

Zeping Yu26 Dec 2023 0:36 UTC

7 points

1 comment11 min readLW link

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen5 Dec 2024 19:24 UTC

5 points

2 comments10 min readLW link

2+2: Ontological Framework

Lyrialtus1 Feb 2022 1:07 UTC

−15 points

2 comments12 min readLW link

Attention SAEs Scale to GPT-2 Small

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

3 Feb 2024 6:50 UTC

78 points

4 comments8 min readLW link

[Question] Is LLM Translation Without Rosetta Stone possible?

cubefox11 Apr 2024 0:36 UTC

35 points

15 comments1 min readLW link

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, joanv, robert mccarthy, ollie, Nandi and Dylan Cope

25 Sep 2024 14:52 UTC

37 points

2 comments4 min readLW link

(arxiv.org)

Two very different experiences with ChatGPT

Sherrinford7 Feb 2023 13:09 UTC

38 points

15 comments5 min readLW link

LLMs Still Suck at Logical Reasoning

anovikov18 Jul 2025 18:35 UTC

1 point

0 comments2 min readLW link

Category-Theoretic Wanderings into Interpretability

unruly abstractions2 Sep 2025 0:03 UTC

18 points

2 comments1 min readLW link

(www.unrulyabstractions.com)

Investigating causal understanding in LLMs

Marius Hobbhahn and Tom Lieberum

14 Jun 2022 13:57 UTC

28 points

6 comments13 min readLW link

A short essay on Illusions

cris.1 Sep 2025 22:25 UTC

1 point

0 comments3 min readLW link

In-Context Learning: An Alignment Survey

alamerton30 Sep 2024 18:44 UTC

8 points

0 comments20 min readLW link

(docs.google.com)

Why Read Novels? (Do Words Mean Much?)

ussy5 Sep 2025 12:25 UTC

1 point

0 comments4 min readLW link

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Henry Cai16 Jun 2024 13:01 UTC

7 points

0 comments7 min readLW link

(arxiv.org)

Generating Cognateful Sentences with Large Language Models

vkethana6 Jan 2025 18:40 UTC

8 points

0 comments10 min readLW link

Is Wittgenstein’s Language Game used when helping Ai understand language?

VisionaryHera4 Jun 2024 7:41 UTC

3 points

7 comments1 min readLW link

Research agenda—Building a multi-modal chess-language model

p.b.7 Apr 2022 12:25 UTC

8 points

2 comments2 min readLW link

Expectations for Gemini: hopefully not a big deal

Maxime Riché2 Oct 2023 15:38 UTC

15 points

5 comments1 min readLW link

Self location for LLMs by LLMs: Self-Assessment Checklist.

Canaletto26 Sep 2024 19:57 UTC

11 points

0 comments5 min readLW link

Compositional preference models for aligning LMs

Tomek Korbak25 Oct 2023 12:17 UTC

18 points

2 comments5 min readLW link

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

Rauno Arike, RohanS and Shubhorup Biswas

8 Aug 2025 10:41 UTC

51 points

7 comments10 min readLW link

How I force LLMs to generate correct code

claudio21 Mar 2025 14:40 UTC

91 points

7 comments5 min readLW link

[linkpost] The final AI benchmark: BIG-bench

RomanS10 Jun 2022 8:53 UTC

25 points

21 comments1 min readLW link

AI Awareness through Interaction with Blatantly Alien Models

VojtaKovarik28 Jul 2023 8:41 UTC

7 points

5 comments3 min readLW link

[AN #164]: How well can language models write code?

Rohin Shah15 Sep 2021 17:20 UTC

13 points

7 comments9 min readLW link

(mailchi.mp)

Language Models Model Us

eggsyntax17 May 2024 21:00 UTC

159 points

55 comments7 min readLW link

Using Psycholinguistic Signals to Improve AI Safety

Jkreindler27 Aug 2025 22:30 UTC

−2 points

0 comments4 min readLW link

ChatGPT (and now GPT4) is very easily distracted from its rules

dmcs15 Mar 2023 17:55 UTC

180 points

42 comments1 min readLW link

Memetic Judo #3: The Intelligence of Stochastic Parrots v.2

Max TK20 Aug 2023 15:18 UTC

8 points

33 comments6 min readLW link

AISC project: TinyEvals

Jett Janiak22 Nov 2023 20:47 UTC

22 points

0 comments4 min readLW link

Edge Cases in AI Alignment

Florian_Dietz24 Mar 2025 9:27 UTC

19 points

3 comments4 min readLW link

At last! ChatGPT does, shall we say, interesting imitations of “Kubla Khan”

Bill Benzon24 Apr 2024 14:56 UTC

−3 points

0 comments4 min readLW link

Jailbreaking ChatGPT and Claude using Web API Context Injection

Jaehyuk Lim21 Oct 2024 21:34 UTC

4 points

0 comments3 min readLW link

From No Mind to a Mind – A Conversation That Changed an AI

parthibanarjuna s7 Feb 2025 11:50 UTC

1 point

0 comments3 min readLW link

Latent Semantic Compression Triggers Binary Model Behavior

Elias Völker12 Jun 2025 1:12 UTC

1 point

0 comments2 min readLW link

Inducing Unprompted Misalignment in LLMs

Sam Svenningsen, evhub and Henry Sleight

19 Apr 2024 20:00 UTC

38 points

7 comments16 min readLW link

Powerful mesa-optimisation is already here

Roman Leventov17 Feb 2023 4:59 UTC

35 points

1 comment2 min readLW link

(arxiv.org)

Against LLM Reductionism

Erich_Grunewald8 Mar 2023 15:52 UTC

140 points

17 comments18 min readLW link

(www.erichgrunewald.com)

[Question] Any research in “probe-tuning” of LLMs?

Roman Leventov15 Aug 2023 21:01 UTC

20 points

3 comments1 min readLW link

Programming AGI is impossible

Áron Ecsenyi30 May 2023 23:05 UTC

1 point

0 comments4 min readLW link

Thinking Through AI: Why LLMs Are Lenses, Not Subjects

Solan6 Jul 2025 7:58 UTC

1 point

0 comments1 min readLW link

Liquid Neural Networks: A Step Toward AI Flexibility, but Not AGI

ezaanamin2 Apr 2025 4:10 UTC

0 points

0 comments1 min readLW link

The issue of meaning in large language models (LLMs)

Bill Benzon11 Mar 2023 23:00 UTC

1 point

34 comments8 min readLW link

Was Homer a stochastic parrot? Meaning in literary texts and LLMs

Bill Benzon13 Apr 2023 16:44 UTC

7 points

4 comments3 min readLW link

[untitled post]

verwindung14 Sep 2023 16:22 UTC

1 point

0 comments1 min readLW link

Conditioning, Prompts, and Fine-Tuning

Adam Jermyn17 Aug 2022 20:52 UTC

38 points

9 comments4 min readLW link

A note on ‘semiotic physics’

metasemi11 Feb 2023 5:12 UTC

11 points

13 comments6 min readLW link

Quick Thoughts on Language Models

RohanS18 Jul 2023 20:38 UTC

6 points

0 comments4 min readLW link

A Summary Of Anthropic’s First Paper

Sam Ringer30 Dec 2021 0:48 UTC

86 points

1 comment8 min readLW link

Maybe talking isn’t the best way to communicate with LLMs

mnvr17 Jan 2024 6:24 UTC

3 points

1 comment1 min readLW link

(mrmr.io)

VATS-A Conceptual Token Arrangement Framework for Context-Aware Generation

nian41216 May 2025 20:24 UTC

1 point

0 comments1 min readLW link

Extracting and Evaluating Causal Direction in LLMs’ Activations

Fabien Roger and simeon_c

14 Dec 2022 14:33 UTC

29 points

5 comments11 min readLW link

The Prospect of an AI Winter

Erich_Grunewald27 Mar 2023 20:55 UTC

62 points

24 comments15 min readLW link

(www.erichgrunewald.com)

Which AI Safety Benchmark Do We Need Most in 2025?

Loïc Cabannes and William Ludington

17 Nov 2024 23:50 UTC

2 points

2 comments8 min readLW link

Situational awareness in Large Language Models

Simon Möller3 Mar 2023 18:59 UTC

32 points

2 comments7 min readLW link

Your LLM Judge may be biased

Henry Papadatos and Rachel Freedman

29 Mar 2024 16:39 UTC

37 points

9 comments6 min readLW link

GPT-4 Predictions

Stephen McAleese17 Feb 2023 23:20 UTC

112 points

27 comments11 min readLW link

Interpretability through two lenses: biology and physics

raphael12 Aug 2025 20:25 UTC

24 points

4 comments4 min readLW link

LLMs and computation complexity

Jonathan Marcus28 Apr 2023 17:48 UTC

57 points

29 comments5 min readLW link

Unsafe AI as Dynamical Systems

Robert_AIZI14 Jul 2023 15:31 UTC

11 points

0 comments3 min readLW link

(aizi.substack.com)

Truthful and honest AI

abergal, Nick_Beckstead and Owain_Evans

29 Oct 2021 7:28 UTC

42 points

1 comment13 min readLW link

Thought Anchors: Which LLM Reasoning Steps Matter?

Uzay Macar, Paul Bogdan, Neel Nanda and Arthur Conmy

2 Jul 2025 20:16 UTC

35 points

6 comments6 min readLW link

(www.thought-anchors.com)

Training goals for large language models

Johannes Treutlein18 Jul 2022 7:09 UTC

28 points

5 comments19 min readLW link

Depression and Creativity

Bill Benzon29 Nov 2024 0:27 UTC

−4 points

0 comments6 min readLW link

Machine Unlearning Evaluations as Interpretability Benchmarks

NickyP and Nandi

23 Oct 2023 16:33 UTC

33 points

2 comments11 min readLW link

QNR prospects are important for AI alignment research

Eric Drexler3 Feb 2022 15:20 UTC

94 points

12 comments11 min readLW link 1 review

Every Major LLM Endorses Newcomb One-Boxing

jackmastermind15 Jun 2025 20:44 UTC

19 points

13 comments1 min readLW link

(jacktlab.substack.com)

What must be the case that ChatGPT would have memorized “To be or not to be”? – Three kinds of conceptual objects for LLMs

Bill Benzon3 Sep 2023 18:39 UTC

19 points

0 comments12 min readLW link

Steering LLM Agents: Temperaments or Personalities?

sdeture5 Aug 2025 0:40 UTC

1 point

0 comments6 min readLW link

The Information: OpenAI shows ‘Strawberry’ to feds, races to launch it

Martín Soto27 Aug 2024 23:10 UTC

145 points

15 comments3 min readLW link

Unfaithful Explanations in Chain-of-Thought Prompting

Miles Turpin3 Jun 2023 0:22 UTC

42 points

8 comments7 min readLW link

GPT-3: a disappointing paper

nostalgebraist29 May 2020 19:06 UTC

65 points

43 comments8 min readLW link 1 review

Redundant Attention Heads in Large Language Models For In Context Learning

skunnavakkam1 Sep 2024 20:08 UTC

7 points

2 comments4 min readLW link

(skunnavakkam.github.io)

Characterizing stable regions in the residual stream of LLMs

Jett Janiak, jacek, Chatrik, Giorgi Giglemiani, nlpet and StefanHex

26 Sep 2024 13:44 UTC

43 points

4 comments1 min readLW link

(arxiv.org)

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen18 Feb 2025 22:16 UTC

8 points

2 comments10 min readLW link

(www.lesswrong.com)

PCAST Working Group on Generative AI Invites Public Input

Christopher King13 May 2023 22:49 UTC

7 points

0 comments1 min readLW link

(terrytao.wordpress.com)

Research Adenda: Modelling Trajectories of Language Models

NickyP13 Nov 2023 14:33 UTC

28 points

0 comments12 min readLW link

Gears-Level Mental Models of Transformer Interpretability

RowanWang29 Mar 2022 20:09 UTC

75 points

4 comments6 min readLW link

Robustness of Contrast-Consistent Search to Adversarial Prompting

Nandi, i, Jamie Wright, Seamus_F and hugofry

1 Nov 2023 12:46 UTC

18 points

1 comment7 min readLW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_c7 Apr 2022 13:46 UTC

11 points

0 comments7 min readLW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King31 Mar 2023 17:05 UTC

6 points

4 comments4 min readLW link

Stop posting prompt injections on Twitter and calling it “misalignment”

lc19 Feb 2023 2:21 UTC

146 points

9 comments1 min readLW link

Consensus Validation for LLM Outputs: Applying Blockchain-Inspired Models to AI Reliability

MurrayAitken5 Jun 2025 0:13 UTC

1 point

0 comments3 min readLW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Georg Lange, Alex Makelov and Neel Nanda

29 Aug 2023 1:04 UTC

77 points

4 comments1 min readLW link

Retrieval Augmented Genesis II — Holy Texts Semantics Analysis

João Ribeiro Medeiros26 Oct 2024 17:00 UTC

−1 points

0 comments11 min readLW link

Phallocentricity in GPT-J’s bizarre stratified ontology

mwatkins17 Feb 2024 0:16 UTC

56 points

37 comments9 min readLW link

Elements of Computational Philosophy, Vol. I: Truth

Paul Bricman and Tom Feeney

1 Jul 2023 11:44 UTC

12 points

6 comments1 min readLW link

(compphil.github.io)

Metacognition and Self-Modeling in LLMs

Christopher Ackerman10 Jul 2025 21:25 UTC

19 points

2 comments16 min readLW link

My current workflow to study the internal mechanisms of LLM

Yulu Pi16 May 2023 15:27 UTC

4 points

0 comments1 min readLW link

ChatGPT’s Ontological Landscape

Bill Benzon1 Nov 2023 15:12 UTC

7 points

0 comments4 min readLW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide22 Jul 2024 18:45 UTC

118 points

20 comments12 min readLW link

OpenAI Credit Account (2510$)

Emirhan BULUT21 Jan 2024 2:32 UTC

1 point

0 comments1 min readLW link

Locating and Editing Knowledge in LMs

Dhananjay Ashok24 Jan 2025 22:53 UTC

1 point

0 comments4 min readLW link

New LLM Scaling Law

wrmedford19 Feb 2025 20:21 UTC

2 points

0 comments1 min readLW link

(github.com)

The Decreasing Value of Chain of Thought in Prompting

Matrice Jacobine8 Jun 2025 15:11 UTC

11 points

0 comments1 min readLW link

(papers.ssrn.com)

[Question] Reinforcement Learning: Essential Step Towards AGI or Irrelevant?

Double17 Oct 2024 3:37 UTC

1 point

0 comments1 min readLW link

[Question] How do I design long prompts for thinking zero shot systems with distinct equally distributed prompt sections (mission, goals, memories, how-to-respond,… etc) and how to maintain llm coherence?

ollie_11 May 2025 19:32 UTC

2 points

5 comments1 min readLW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi and evhub

6 May 2024 7:07 UTC

95 points

13 comments1 min readLW link

(arxiv.org)

Just because an LLM said it doesn’t mean it’s true: an illustrative example

dirk21 Aug 2024 21:05 UTC

26 points

12 comments3 min readLW link

Updating and Editing Factual Knowledge in Language Models

Dhananjay Ashok23 Jan 2025 19:34 UTC

2 points

2 comments10 min readLW link

Introducing METR’s Autonomy Evaluation Resources

Megan Kinniment and Beth Barnes

15 Mar 2024 23:16 UTC

90 points

0 comments1 min readLW link

(metr.github.io)

If It Talks Like It Thinks, Does It Think? Designing Tests for Intent Without Assuming It

yukin_co28 Jul 2025 12:33 UTC

1 point

0 comments4 min readLW link

ChatGPT: Exploring the Digital Wilderness, Findings and Prospects

Bill Benzon2 Feb 2025 9:54 UTC

2 points

0 comments5 min readLW link

The Geometry of Feelings and Nonsense in Large Language Models

7vik and Nandi

27 Sep 2024 17:49 UTC

61 points

10 comments4 min readLW link

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill24 Mar 2024 20:05 UTC

30 points

4 comments24 min readLW link

Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud and TurnTrout

13 Jun 2025 13:45 UTC

234 points

43 comments8 min readLW link

(arxiv.org)

Can Large Language Models effectively identify cybersecurity risks?

emile delcourt30 Aug 2024 20:20 UTC

18 points

0 comments11 min readLW link

Humans vs LLM, memes as theorems

Yaroslav Granowski9 May 2025 13:26 UTC

1 point

0 comments1 min readLW link

Aesthetic Preferences Can Cause Emergent Misalignment

Anders Woodruff26 Aug 2025 18:41 UTC

92 points

16 comments3 min readLW link

Reflection Mechanisms as an Alignment Target—Attitudes on “near-term” AI

elandgre, Beth Barnes and Marius Hobbhahn

2 Mar 2023 4:29 UTC

21 points

0 comments8 min readLW link

Language Field Reconstruction Theory: A User-Originated Observation of Tier Lock and Semantic Personality in GPT-4o

許皓翔15 Jun 2025 16:28 UTC

1 point

0 comments2 min readLW link

Worries about latent reasoning in LLMs

Caleb Biddulph20 Jan 2025 9:09 UTC

45 points

6 comments7 min readLW link

The Velvet Cage Hypothesis: On the Epistemic Risks of Helpful AI

François-Xavier Morgand7 Jun 2025 6:01 UTC

1 point

0 comments5 min readLW link

A brainteaser for language models

Adam Scherlis12 Dec 2022 2:43 UTC

47 points

3 comments2 min readLW link

MAKE IT BETTER (a poetic demonstration of the banality of GPT-3)

rogersbacon2 Jan 2023 20:47 UTC

7 points

2 comments5 min readLW link

An interesting mathematical model of how LLMs work

Bill Benzon30 Apr 2024 11:01 UTC

5 points

0 comments1 min readLW link

LLMs Suck at Deep Thinking Part 3 - Trying to Prove It (fixed)

Taylor G. Lunt27 Sep 2025 14:54 UTC

17 points

6 comments15 min readLW link

Properties of current AIs and some predictions of the evolution of AI from the perspective of scale-free theories of agency and regulative development

Roman Leventov20 Dec 2022 17:13 UTC

33 points

3 comments36 min readLW link

From Unruly Stacks to Organized Shelves: Toy Model Validation of Structured Priors in Sparse Autoencoders

Yuxiao6 Jul 2025 7:03 UTC

9 points

0 comments5 min readLW link

[Question] Is “hidden complexity of wishes problem” solved?

Roman Malov5 Jan 2025 22:59 UTC

10 points

4 comments1 min readLW link

Conditioning Generative Models with Restrictions

Adam Jermyn21 Jul 2022 20:33 UTC

18 points

4 comments8 min readLW link

GPT Doesn’t Just Predict Words — It Models You

Tom Fandango1 Aug 2025 13:05 UTC

1 point

0 comments2 min readLW link

InterLab – a toolkit for experiments with multi-agent interactions

Tomáš Gavenčiak, Ada Böhm and Jan_Kulveit

22 Jan 2024 18:23 UTC

69 points

0 comments8 min readLW link

(acsresearch.org)

LW is probably not the place for “I asked this LLM (x) and here’s what it said!”, but where is?

lillybaeum12 Apr 2023 10:12 UTC

21 points

3 comments1 min readLW link

How truthful is GPT-3? A benchmark for language models

Owain_Evans16 Sep 2021 10:09 UTC

58 points

24 comments6 min readLW link

The Method of Loci: With some brief remarks, including transformers and evaluating AIs

Bill Benzon2 Dec 2023 14:36 UTC

6 points

0 comments3 min readLW link

Notes on ChatGPT’s “memory” for strings and for events

Bill Benzon20 Sep 2023 18:12 UTC

3 points

0 comments10 min readLW link

ChatGPT intimates a tantalizing future; its core LLM is organized on multiple levels; and it has broken the idea of thinking.

Bill Benzon24 Jan 2023 19:05 UTC

5 points

0 comments5 min readLW link

The case for aligning narrowly superhuman models

Ajeya Cotra5 Mar 2021 22:29 UTC

186 points

75 comments38 min readLW link 1 review

What will the scaled up GATO look like? (Updated with questions)

Amal 25 Oct 2022 12:44 UTC

34 points

22 comments1 min readLW link

New GPT3 Impressive Capabilities—InstructGPT3 [1/2]

simeon_c13 Mar 2022 10:58 UTC

72 points

10 comments7 min readLW link

One-shot steering vectors cause emergent misalignment, too

Jacob Dunefsky14 Apr 2025 6:40 UTC

98 points

6 comments11 min readLW link

LLMs Look Increasingly Like General Reasoners

eggsyntax8 Nov 2024 23:47 UTC

94 points

45 comments3 min readLW link

plex 24 Sep 2021 14:17 UTC
1 point
0
I think this should be under “Other” in the AI category. Is it possible for regular users to categorize tags?
- Ruby 24 Sep 2021 16:19 UTC
  3 points
  0
  Parent
  This is a good tag! Users can’t usually add things, but I can set you up to have the ability.

Lan­guage Models (LLMs)

See also

Language Models (LLMs)