
Language Models (LLMs)

Last edit: Mar 13, 2025, 5:45 PM by Raemon

Language models are computer programs that estimate the likelihood of a piece of text. “Hello, how are you?” is likely. “Hello, fnarg horses” is unlikely.

Language models can answer questions by estimating the likelihood of candidate question-and-answer pairs and selecting the most likely one. “Q: How are you? A: Very well, thank you” is a likely pair. “Q: How are you? A: Correct horse battery staple” is an unlikely pair.
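As a rough illustration of likelihood scoring, here is a minimal sketch in Python. It assumes the Hugging Face transformers library and the small GPT-2 model purely as a convenient stand-in; neither is prescribed by anything above. The sketch sums the log-probabilities the model assigns to each candidate string, so the natural-sounding answer should score higher than the nonsensical one.

```python
# Minimal sketch of likelihood scoring, assuming the Hugging Face
# `transformers` library and GPT-2 as a stand-in language model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def total_log_likelihood(text: str) -> float:
    """Return the summed log-probability the model assigns to `text`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With `labels` supplied, the model's `loss` is the mean
        # negative log-likelihood per predicted token.
        out = model(**enc, labels=enc["input_ids"])
    n_predicted = enc["input_ids"].shape[1] - 1  # the first token is not predicted
    return -out.loss.item() * n_predicted

candidates = [
    "Q: How are you? A: Very well, thank you.",
    "Q: How are you? A: Correct horse battery staple.",
]
# The natural answer should receive a higher (less negative) score.
for text in sorted(candidates, key=total_log_likelihood, reverse=True):
    print(f"{total_log_likelihood(text):8.2f}  {text}")
```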

The language models most relevant to AI safety are based on “deep learning”. Deep-learning-based language models can be “trained” to understand language better by exposing them to text written by humans, and the internet contains an enormous amount of human-written text to train on.

Deep-learning-based language models keep getting bigger and better trained. As they grow stronger, they acquire new skills, including arithmetic, explaining jokes, programming, and solving math problems.

As these models grow larger and better trained, there is a risk that they will develop dangerous capabilities. What additional skills will they acquire within a few years?

See also

Simulators
janus · Sep 2, 2022, 12:45 PM · 670 points (362 votes) · 168 comments · 41 min read · LW link · 8 reviews
(generative.ink)

Inverse Scaling Prize: Round 1 Winners
Sep 26, 2022, 7:57 PM · 93 points (54 votes) · 16 comments · 4 min read · LW link
(irmckenzie.co.uk)

Alignment Implications of LLM Successes: a Debate in One Act
Zack_M_Davis · Oct 21, 2023, 3:22 PM · 266 points (124 votes) · 56 comments · 13 min read · LW link · 2 reviews

How it feels to have your mind hacked by an AI
blaked · Jan 12, 2023, 12:33 AM · 372 points (268 votes) · 222 comments · 17 min read · LW link

How LLMs are and are not myopic
janus · Jul 25, 2023, 2:19 AM · 138 points (67 votes) · 16 comments · 8 min read · LW link

On the future of language models
owencb · Dec 20, 2023, 4:58 PM · 105 points (44 votes) · 17 comments · 36 min read · LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI
RogerDearnaley · Jul 6, 2024, 1:23 AM · 64 points (30 votes) · 41 comments · 24 min read · LW link

Try training token-level probes
StefanHex · Apr 14, 2025, 11:56 AM · 47 points (19 votes) · 6 comments · 8 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)
RogerDearnaley · Nov 28, 2023, 7:56 PM · 65 points (37 votes) · 30 comments · 11 min read · LW link

Transformer Circuits
evhub · Dec 22, 2021, 9:09 PM · 145 points (61 votes) · 4 comments · 3 min read · LW link
(transformer-circuits.pub)

A Chinese Room Containing a Stack of Stochastic Parrots
RogerDearnaley · Jan 12, 2024, 6:29 AM · 20 points (8 votes) · 3 comments · 5 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?
RogerDearnaley · Jan 5, 2024, 8:46 AM · 37 points (19 votes) · 4 comments · 2 min read · LW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?
RogerDearnaley · Jan 11, 2024, 12:56 PM · 35 points (11 votes) · 4 comments · 39 min read · LW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis
RogerDearnaley · Feb 1, 2024, 9:15 PM · 16 points (17 votes) · 15 comments · 13 min read · LW link

Mlyyrczo
lsusr · Dec 26, 2022, 7:58 AM · 44 points (47 votes) · 14 comments · 3 min read · LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor
RogerDearnaley · Jan 9, 2024, 8:42 PM · 48 points (25 votes) · 8 comments · 36 min read · LW link

Programming Refusal with Conditional Activation Steering
Bruce W. Lee · Sep 11, 2024, 8:57 PM · 41 points (12 votes) · 0 comments · 11 min read · LW link
(brucewlee.com)

The Waluigi Effect (mega-post)
Cleo Nardo · Mar 3, 2023, 3:22 AM · 645 points (504 votes) · 188 comments · 16 min read · LW link

So You Think You’ve Awoken ChatGPT
JustisMills · Jul 11, 2025, 1:01 AM · 311 points (187 votes) · 88 comments · 9 min read · LW link

AI Safety Chatbot
Dec 21, 2023, 2:06 PM · 61 points (26 votes) · 11 comments · 4 min read · LW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment
RogerDearnaley · Dec 7, 2023, 6:14 AM · 9 points (5 votes) · 0 comments · 11 min read · LW link

SolidGoldMagikarp (plus, prompt generation)
Feb 5, 2023, 10:02 PM · 687 points (432 votes) · 208 comments · 12 min read · LW link · 1 review

LLM AGI will have memory, and memory changes alignment
Seth Herd · Apr 4, 2025, 2:59 PM · 73 points (30 votes) · 15 comments · 9 min read · LW link

Rep­re­sen­ta­tion Tuning

Christopher AckermanJun 27, 2024, 5:44 PM
35 points

13 votes

Overall karma indicates overall quality.

9 comments13 min readLW link

Find­ing Neu­rons in a Haystack: Case Stud­ies with Sparse Probing

May 3, 2023, 1:30 PM
33 points

16 votes

Overall karma indicates overall quality.

6 comments2 min readLW link1 review
(arxiv.org)

LLMs Univer­sally Learn a Fea­ture Rep­re­sent­ing To­ken Fre­quency /​ Rarity

Sean OsierJun 30, 2024, 2:48 AM
13 points

8 votes

Overall karma indicates overall quality.

5 comments6 min readLW link
(github.com)

LLMs may cap­ture key com­po­nents of hu­man agency

catubcNov 17, 2022, 8:14 PM
27 points

13 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Re­sults from the lan­guage model hackathon

Esben KranOct 10, 2022, 8:29 AM
22 points

14 votes

Overall karma indicates overall quality.

1 comment4 min readLW link

Ap­ply­ing re­fusal-vec­tor ab­la­tion to a Llama 3 70B agent

Simon LermenMay 11, 2024, 12:08 AM
51 points

33 votes

Overall karma indicates overall quality.

14 comments7 min readLW link

Truth­ful LMs as a warm-up for al­igned AGI

Jacob_HiltonJan 17, 2022, 4:49 PM
65 points

34 votes

Overall karma indicates overall quality.

14 comments13 min readLW link

Mo­du­lat­ing syco­phancy in an RLHF model via ac­ti­va­tion steering

Nina PanicksseryAug 9, 2023, 7:06 AM
69 points

30 votes

Overall karma indicates overall quality.

20 comments12 min readLW link

Test­ing PaLM prompts on GPT3

YitzApr 6, 2022, 5:21 AM
103 points

57 votes

Overall karma indicates overall quality.

14 comments8 min readLW link

Large Lan­guage Models will be Great for Censorship

Ethan EdwardsAug 21, 2023, 7:03 PM
185 points

79 votes

Overall karma indicates overall quality.

14 comments8 min readLW link
(ethanedwards.substack.com)

In­vo­ca­tions: The Other Ca­pa­bil­ities Over­hang?

Robert_AIZIApr 4, 2023, 1:38 PM
29 points

16 votes

Overall karma indicates overall quality.

4 comments4 min readLW link
(aizi.substack.com)

In­verse Scal­ing Prize: Se­cond Round Winners

Jan 24, 2023, 8:12 PM
58 points

29 votes

Overall karma indicates overall quality.

17 comments15 min readLW link

LLM Mo­du­lar­ity: The Separa­bil­ity of Ca­pa­bil­ities in Large Lan­guage Models

NickyPMar 26, 2023, 9:57 PM
99 points

55 votes

Overall karma indicates overall quality.

3 comments41 min readLW link

Self-fulfilling mis­al­ign­ment data might be poi­son­ing our AI models

TurnTroutMar 2, 2025, 7:51 PM
154 points

85 votes

Overall karma indicates overall quality.

29 comments1 min readLW link
(turntrout.com)

Slow­down After 2028: Com­pute, RLVR Uncer­tainty, MoE Data Wall

Vladimir_NesovMay 1, 2025, 1:54 PM
196 points

79 votes

Overall karma indicates overall quality.

25 comments5 min readLW link

LLM Ba­sics: Embed­ding Spaces—Trans­former To­ken Vec­tors Are Not Points in Space

NickyPFeb 13, 2023, 6:52 PM
84 points

45 votes

Overall karma indicates overall quality.

11 comments15 min readLW link

“Tilakkhana”, Gw­ern [poem]

gwernOct 21, 2025, 2:39 AM
20 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(gwern.net)

Ex­trap­o­lat­ing from Five Words

Gordon Seidoh WorleyNov 15, 2023, 11:21 PM
40 points

21 votes

Overall karma indicates overall quality.

11 comments2 min readLW link

Owain Evans on Si­tu­a­tional Aware­ness and Out-of-Con­text Rea­son­ing in LLMs

Michaël TrazziAug 24, 2024, 4:30 AM
55 points

17 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

what makes Claude 3 Opus misaligned

janusJul 10, 2025, 8:06 PM
104 points

64 votes

Overall karma indicates overall quality.

11 comments5 min readLW link

Pro­posal for In­duc­ing Steganog­ra­phy in LMs

Logan RiggsJan 12, 2023, 10:15 PM
22 points

11 votes

Overall karma indicates overall quality.

3 comments2 min readLW link

Take 11: “Align­ing lan­guage mod­els” should be weirder.

Charlie SteinerDec 18, 2022, 2:14 PM
34 points

18 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Notes on the Math­e­mat­ics of LLM Architectures

carboniferous_umbraculum Feb 9, 2023, 1:45 AM
12 points

10 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(drive.google.com)

Be­ware Gen­eral Claims about “Gen­er­al­iz­able Rea­son­ing Ca­pa­bil­ities” (of Modern AI Sys­tems)

LawrenceCJun 11, 2025, 7:27 PM
295 points

139 votes

Overall karma indicates overall quality.

19 comments16 min readLW link

LLMs Can’t See Pix­els or Characters

Brendan LongJul 20, 2025, 8:00 PM
100 points

55 votes

Overall karma indicates overall quality.

44 comments4 min readLW link
(www.brendanlong.com)

An ex­pla­na­tion for ev­ery to­ken: us­ing an LLM to sam­ple an­other LLM

Max HOct 11, 2023, 12:53 AM
35 points

17 votes

Overall karma indicates overall quality.

5 comments11 min readLW link

Con­di­tion­ing Gen­er­a­tive Models

Adam JermynJun 25, 2022, 10:15 PM
24 points

11 votes

Overall karma indicates overall quality.

18 comments10 min readLW link

What o3 Be­comes by 2028

Vladimir_NesovDec 22, 2024, 12:37 PM
149 points

71 votes

Overall karma indicates overall quality.

15 comments5 min readLW link

Claude 3.5 Sonnet

Zach Stein-PerlmanJun 20, 2024, 6:00 PM
75 points

29 votes

Overall karma indicates overall quality.

41 comments1 min readLW link
(www.anthropic.com)

‘simu­la­tor’ fram­ing and con­fu­sions about LLMs

Beth BarnesDec 31, 2022, 11:38 PM
104 points

52 votes

Overall karma indicates overall quality.

11 comments4 min readLW link

Ex­plor­ing SAE fea­tures in LLMs with defi­ni­tion trees and to­ken lists

mwatkinsOct 4, 2024, 10:15 PM
46 points

12 votes

Overall karma indicates overall quality.

5 comments6 min readLW link

Un­ex­pected Con­scious Entities

Gunnar_ZarnckeMay 5, 2025, 10:14 PM
34 points

12 votes

Overall karma indicates overall quality.

7 comments6 min readLW link

Me­taAI: less is less for al­ign­ment.

Cleo NardoJun 13, 2023, 2:08 PM
71 points

41 votes

Overall karma indicates overall quality.

17 comments5 min readLW link

Lan­guage mod­els can gen­er­ate su­pe­rior text com­pared to their input

ChristianKlJan 17, 2023, 10:57 AM
48 points

31 votes

Overall karma indicates overall quality.

28 comments1 min readLW link

Ag­grega­tive Prin­ci­ples of So­cial Justice

Cleo NardoJun 5, 2024, 1:44 PM
29 points

12 votes

Overall karma indicates overall quality.

10 comments37 min readLW link

′ pe­ter­todd’’s last stand: The fi­nal days of open GPT-3 research

mwatkinsJan 22, 2024, 6:47 PM
109 points

52 votes

Overall karma indicates overall quality.

16 comments45 min readLW link

How to train your trans­former

p.b.Apr 7, 2022, 9:34 AM
6 points

3 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

See­ing Through the Eyes of the Algorithm

silentbobFeb 22, 2025, 11:54 AM
18 points

10 votes

Overall karma indicates overall quality.

3 comments10 min readLW link

A Pro­posed Test to Deter­mine the Ex­tent to Which Large Lan­guage Models Un­der­stand the Real World

Bruce GFeb 24, 2023, 8:20 PM
4 points

3 votes

Overall karma indicates overall quality.

7 comments8 min readLW link

No­kens: A po­ten­tial method of in­ves­ti­gat­ing glitch tokens

HoagyMar 15, 2023, 4:23 PM
21 points

12 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Proof-of-Con­cept De­bug­ger for a Small LLM

Mar 17, 2025, 10:27 PM
27 points

9 votes

Overall karma indicates overall quality.

0 comments11 min readLW link

[Question] Will 2023 be the last year you can write short sto­ries and re­ceive most of the in­tel­lec­tual credit for writ­ing them?

lcMar 16, 2023, 9:36 PM
20 points

9 votes

Overall karma indicates overall quality.

12 comments1 min readLW link

Bing Chat is blatantly, ag­gres­sively misaligned

evhubFeb 15, 2023, 5:29 AM
406 points

247 votes

Overall karma indicates overall quality.

181 comments2 min readLW link1 review

You can’t eval GPT5 anymore

Lukas PeterssonSep 18, 2025, 10:12 PM
158 points

94 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

En­hanc­ing biose­cu­rity with lan­guage mod­els: defin­ing re­search directions

micMar 26, 2024, 12:30 PM
12 points

2 votes

Overall karma indicates overall quality.

0 comments13 min readLW link
(papers.ssrn.com)

Shut­down Re­sis­tance in Rea­son­ing Models

Jul 6, 2025, 12:01 AM
138 points

56 votes

Overall karma indicates overall quality.

14 comments9 min readLW link
(palisaderesearch.org)

Teaser: Hard-cod­ing Trans­former Models

MadHatterDec 12, 2021, 10:04 PM
74 points

34 votes

Overall karma indicates overall quality.

19 comments1 min readLW link

Water­mark­ing con­sid­ered over­rated?

DanielFilanJul 31, 2023, 9:36 PM
19 points

9 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Resi­d­ual stream norms grow ex­po­nen­tially over the for­ward pass

May 7, 2023, 12:46 AM
77 points

35 votes

Overall karma indicates overall quality.

24 comments9 min readLW link

Fore­cast­ing progress in lan­guage models

Oct 28, 2021, 8:40 PM
62 points

28 votes

Overall karma indicates overall quality.

6 comments12 min readLW link
(www.metaculus.com)

Cor­rigi­bil­ity, Self-Dele­tion, and Iden­ti­cal Strawberries

Robert_AIZIMar 28, 2023, 4:54 PM
9 points

6 votes

Overall karma indicates overall quality.

2 comments6 min readLW link
(aizi.substack.com)

Model Or­ganisms of Misal­ign­ment: The Case for a New Pillar of Align­ment Research

Aug 8, 2023, 1:30 AM
322 points

137 votes

Overall karma indicates overall quality.

30 comments18 min readLW link1 review

Emer­gent Misal­ign­ment: Nar­row fine­tun­ing can pro­duce broadly mis­al­igned LLMs

Feb 25, 2025, 5:39 PM
332 points

146 votes

Overall karma indicates overall quality.

92 comments4 min readLW link

Lan­guage Models are a Po­ten­tially Safe Path to Hu­man-Level AGI

Nadav BrandesApr 20, 2023, 12:40 AM
28 points

18 votes

Overall karma indicates overall quality.

7 comments8 min readLW link1 review

On Claude 3.5 Sonnet

ZviJun 24, 2024, 12:00 PM
95 points

46 votes

Overall karma indicates overall quality.

14 comments13 min readLW link
(thezvi.wordpress.com)

Scaf­folded LLMs as nat­u­ral lan­guage computers

berenApr 12, 2023, 10:47 AM
97 points

54 votes

Overall karma indicates overall quality.

10 comments11 min readLW link

AMA Con­jec­ture, A New Align­ment Startup

adamShimiApr 9, 2022, 9:43 AM
47 points

21 votes

Overall karma indicates overall quality.

42 comments1 min readLW link

[Question] Ba­sic Ques­tion about LLMs: how do they know what task to perform

GarakJan 14, 2023, 1:13 PM
1 point

3 votes

Overall karma indicates overall quality.

3 comments1 min readLW link

New, im­proved mul­ti­ple-choice TruthfulQA

Jan 15, 2025, 11:32 PM
72 points

28 votes

Overall karma indicates overall quality.

1 comment3 min readLW link

[ASoT] Some thoughts about LM monologue limi­ta­tions and ELK

leogaoMar 30, 2022, 2:26 PM
10 points

6 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

“text­books are all you need”

bhauthJun 21, 2023, 5:06 PM
66 points

38 votes

Overall karma indicates overall quality.

18 comments2 min readLW link
(arxiv.org)

Knowl­edge, Rea­son­ing, and Superintelligence

owencbMar 26, 2025, 11:28 PM
21 points

5 votes

Overall karma indicates overall quality.

1 comment7 min readLW link
(strangecities.substack.com)

Un­der­stand­ing LLMs: In­sights from Mechanis­tic Interpretability

Stephen McAleeseAug 30, 2025, 4:50 PM
43 points

22 votes

Overall karma indicates overall quality.

2 comments30 min readLW link

Im­ple­ment­ing ac­ti­va­tion steering

AnnahFeb 5, 2024, 5:51 PM
76 points

34 votes

Overall karma indicates overall quality.

8 comments7 min readLW link

Can I take ducks home from the park?

dynomightSep 14, 2023, 9:03 PM
67 points

46 votes

Overall karma indicates overall quality.

8 comments3 min readLW link
(dynomight.net)

LLM Ap­pli­ca­tions I Want To See

sarahconstantinAug 19, 2024, 9:10 PM
102 points

45 votes

Overall karma indicates overall quality.

6 comments8 min readLW link
(sarahconstantin.substack.com)

Ts­inghua pa­per: Does RL Really In­cen­tivize Rea­son­ing Ca­pac­ity in LLMs Beyond the Base Model?

Thomas KwaMay 5, 2025, 6:56 PM
69 points

36 votes

Overall karma indicates overall quality.

21 comments2 min readLW link
(arxiv.org)

Num­ber­wang: LLMs Do­ing Au­tonomous Re­search, and a Call for Input

Jan 16, 2025, 5:20 PM
71 points

35 votes

Overall karma indicates overall quality.

30 comments31 min readLW link

Ac­ti­va­tion adding ex­per­i­ments with llama-7b

Nina PanicksseryJul 16, 2023, 4:17 AM
51 points

26 votes

Overall karma indicates overall quality.

1 comment3 min readLW link

LLMs one-box when in a “hos­tile telepath” ver­sion of New­comb’s Para­dox, ex­cept for the one that beat the predictor

Kaj_SotalaOct 6, 2025, 8:44 AM
52 points

21 votes

Overall karma indicates overall quality.

6 comments17 min readLW link

Fron­tier LLM Race/​Sex Ex­change Rates

Arjun PanicksseryOct 19, 2025, 6:36 PM
48 points

37 votes

Overall karma indicates overall quality.

10 comments3 min readLW link
(arctotherium.substack.com)

LLM robots can’t pass but­ter (and they are hav­ing an ex­is­ten­tial crisis about it)

Lukas PeterssonOct 28, 2025, 2:14 PM
98 points

50 votes

Overall karma indicates overall quality.

6 comments4 min readLW link

And All the Shog­goths Merely Players

Zack_M_DavisFeb 10, 2024, 7:56 PM
177 points

68 votes

Overall karma indicates overall quality.

57 comments12 min readLW link

Paper: LLMs trained on “A is B” fail to learn “B is A”

Sep 23, 2023, 7:55 PM
121 points

59 votes

Overall karma indicates overall quality.

74 comments4 min readLW link
(arxiv.org)

In Defense of Chat­bot Romance

Kaj_SotalaFeb 11, 2023, 2:30 PM
125 points

77 votes

Overall karma indicates overall quality.

53 comments11 min readLW link
(kajsotala.fi)

[Question] Sup­pos­ing the 1bit LLM pa­per pans out

O OFeb 29, 2024, 5:31 AM
27 points

11 votes

Overall karma indicates overall quality.

11 comments1 min readLW link

[Question] Does a LLM have a util­ity func­tion?

DagonDec 9, 2022, 5:19 PM
17 points

6 votes

Overall karma indicates overall quality.

11 comments1 min readLW link

[Question] Is there a ‘time se­ries fore­cast­ing’ equiv­a­lent of AIXI?

Solenoid_EntityMay 17, 2023, 4:35 AM
12 points

3 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Steer­ing Be­havi­our: Test­ing for (Non-)My­opia in Lan­guage Models

Dec 5, 2022, 8:28 PM
40 points

19 votes

Overall karma indicates overall quality.

19 comments10 min readLW link

Open Source LLM Poké­mon Scaffold

Julian BradshawApr 27, 2025, 12:57 AM
24 points

11 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(github.com)

What’s up with LLMs rep­re­sent­ing XORs of ar­bi­trary fea­tures?

Sam MarksJan 3, 2024, 7:44 PM
159 points

69 votes

Overall karma indicates overall quality.

64 comments16 min readLW link

Con­di­tion­ing Pre­dic­tive Models: In­ter­ac­tions with other approaches

Feb 8, 2023, 6:19 PM
32 points

10 votes

Overall karma indicates overall quality.

2 comments11 min readLW link

[Question] If I ask an LLM to think step by step, how big are the steps?

ryan_bSep 13, 2024, 8:30 PM
7 points

3 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

Un­der­stand­ing the diffu­sion of large lan­guage mod­els: summary

Ben CottierJan 16, 2023, 1:37 AM
26 points

10 votes

Overall karma indicates overall quality.

1 comment22 min readLW link

What do lan­guage mod­els know about fic­tional char­ac­ters?

skybrianFeb 22, 2023, 5:58 AM
6 points

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

New OpenAI Paper—Lan­guage mod­els can ex­plain neu­rons in lan­guage models

MrThinkMay 10, 2023, 7:46 AM
47 points

19 votes

Overall karma indicates overall quality.

14 comments1 min readLW link

What would a hu­man pre­tend­ing to be an AI say?

Brendan LongAug 8, 2025, 6:56 PM
54 points

28 votes

Overall karma indicates overall quality.

19 comments1 min readLW link
(www.brendanlong.com)

Deep­mind’s Go­pher—more pow­er­ful than GPT-3

hathDec 8, 2021, 5:06 PM
87 points

39 votes

Overall karma indicates overall quality.

26 comments1 min readLW link
(deepmind.com)

Steer­ing GPT-2-XL by adding an ac­ti­va­tion vector

May 13, 2023, 6:42 PM
439 points

206 votes

Overall karma indicates overall quality.

98 comments50 min readLW link1 review

“LLMs Don’t Have a Co­her­ent Model of the World”—What it Means, Why it Mat­ters

DavidmanheimJun 1, 2023, 7:46 AM
32 points

16 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

On the func­tional self of LLMs

eggsyntaxJul 7, 2025, 3:39 PM
113 points

49 votes

Overall karma indicates overall quality.

37 comments8 min readLW link

Con­di­tion­ing Pre­dic­tive Models: Large lan­guage mod­els as predictors

Feb 2, 2023, 8:28 PM
89 points

31 votes

Overall karma indicates overall quality.

4 comments13 min readLW link

Smar­tyHead­erCode: anoma­lous to­kens for GPT3.5 and GPT-4

AdamYedidiaApr 15, 2023, 10:35 PM
71 points

40 votes

Overall karma indicates overall quality.

18 comments6 min readLW link

Evil au­to­com­plete: Ex­is­ten­tial Risk and Next-To­ken Predictors

YitzFeb 28, 2023, 8:47 AM
9 points

4 votes

Overall karma indicates overall quality.

3 comments5 min readLW link

Lamda is not an LLM

KevinJun 19, 2022, 11:13 AM
7 points

17 votes

Overall karma indicates overall quality.

10 comments1 min readLW link
(www.wired.com)

Ro­mance, mi­s­un­der­stand­ing, so­cial stances, and the hu­man LLM

Kaj_SotalaApr 27, 2023, 12:59 PM
77 points

36 votes

Overall karma indicates overall quality.

32 comments16 min readLW link

Up­com­ing Changes in Large Lan­guage Models

Andrew Keenan RichardsonApr 8, 2023, 3:41 AM
43 points

28 votes

Overall karma indicates overall quality.

8 comments4 min readLW link
(mechanisticmind.com)

Real­is­tic Re­ward Hack­ing In­duces Differ­ent and Deeper Misalignment

JozdienOct 9, 2025, 6:45 PM
127 points

46 votes

Overall karma indicates overall quality.

2 comments23 min readLW link

More Fun With GPT-4o Image Generation

ZviApr 3, 2025, 2:10 AM
34 points

12 votes

Overall karma indicates overall quality.

3 comments8 min readLW link
(thezvi.wordpress.com)

Shorter To­kens Are More Likely

Brendan LongAug 24, 2025, 12:22 AM
98 points

47 votes

Overall karma indicates overall quality.

19 comments5 min readLW link
(www.brendanlong.com)

Wor­ri­some mi­s­un­der­stand­ing of the core is­sues with AI transition

Roman LeventovJan 18, 2024, 10:05 AM
5 points

7 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

Test­ing for par­allel rea­son­ing in LLMs

May 19, 2024, 3:28 PM
9 points

7 votes

Overall karma indicates overall quality.

7 comments9 min readLW link

Re­search Dis­cus­sion on PSCA with Claude Son­net 3.5

Robert KralischJul 24, 2024, 4:53 PM
−2 points

4 votes

Overall karma indicates overall quality.

0 comments25 min readLW link

GPT-4 can catch sub­tle cross-lan­guage trans­la­tion mistakes

Michael TontchevJul 27, 2023, 1:39 AM
7 points

4 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

Why I Believe LLMs Do Not Have Hu­man-like Emotions

OneManyNoneMay 22, 2023, 3:46 PM
13 points

13 votes

Overall karma indicates overall quality.

6 comments7 min readLW link

How Does A Blind Model See The Earth?

henryAug 11, 2025, 7:58 PM
486 points

268 votes

Overall karma indicates overall quality.

40 comments7 min readLW link
(outsidetext.substack.com)

Jailbreak steer­ing generalization

Jun 20, 2024, 5:25 PM
41 points

14 votes

Overall karma indicates overall quality.

4 comments2 min readLW link
(arxiv.org)

An­thropic re­lease Claude 3, claims >GPT-4 Performance

LawrenceCMar 4, 2024, 6:23 PM
115 points

61 votes

Overall karma indicates overall quality.

41 comments2 min readLW link
(www.anthropic.com)

ChatGPT is the Da­guerreo­type of AI

Alex_AltairAug 7, 2025, 10:14 PM
42 points

21 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Some Quick Fol­low-Up Ex­per­i­ments to “Taken out of con­text: On mea­sur­ing situ­a­tional aware­ness in LLMs”

Miles TurpinOct 3, 2023, 2:22 AM
31 points

11 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

Lan­guage Model Tools for Align­ment Research

Logan RiggsApr 8, 2022, 5:32 PM
28 points

16 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

RLHF does not ap­pear to differ­en­tially cause mode-collapse

Mar 20, 2023, 3:39 PM
95 points

42 votes

Overall karma indicates overall quality.

9 comments3 min readLW link

Towards Eval­u­at­ing AI Sys­tems for Mo­ral Sta­tus Us­ing Self-Reports

Nov 16, 2023, 8:18 PM
45 points

14 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(arxiv.org)

In­fer­ring the model di­men­sion of API-pro­tected LLMs

Ege ErdilMar 18, 2024, 6:19 AM
34 points

16 votes

Overall karma indicates overall quality.

3 comments4 min readLW link
(arxiv.org)

Claude Doesn’t Want to Die

garrisonMar 5, 2024, 6:00 AM
22 points

18 votes

Overall karma indicates overall quality.

3 comments10 min readLW link
(garrisonlovely.substack.com)

LLM Guardrails Should Have Bet­ter Cus­tomer Ser­vice Tuning

Jiao BuMay 13, 2023, 10:54 PM
2 points

3 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Dens­ing Law of LLMs

Bogdan Ionut CirsteaDec 8, 2024, 7:35 PM
9 points

7 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(arxiv.org)

What does it mean for an LLM such as GPT to be al­igned /​ good /​ pos­i­tive im­pact?

PashaKamyshevMar 20, 2023, 9:21 AM
4 points

2 votes

Overall karma indicates overall quality.

3 comments10 min readLW link

How peo­ple use LLMs

ElizabethApr 27, 2025, 9:48 PM
83 points

25 votes

Overall karma indicates overall quality.

6 comments1 min readLW link
(www.gleech.org)

Do LLMs dream of emer­gent sheep?

ShmiApr 24, 2023, 3:26 AM
16 points

6 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Dis­cov­er­ing Lan­guage Model Be­hav­iors with Model-Writ­ten Evaluations

Dec 20, 2022, 8:08 PM
100 points

45 votes

Overall karma indicates overall quality.

34 comments1 min readLW link
(www.anthropic.com)

Find­ing Sparse Lin­ear Con­nec­tions be­tween Fea­tures in LLMs

Dec 9, 2023, 2:27 AM
70 points

31 votes

Overall karma indicates overall quality.

5 comments10 min readLW link

Gam­ing Truth­fulQA: Sim­ple Heuris­tics Ex­posed Dataset Weaknesses

TurnTroutJan 16, 2025, 2:14 AM
65 points

30 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(turntrout.com)

Claude 3 Opus can op­er­ate as a Tur­ing machine

Gunnar_ZarnckeApr 17, 2024, 8:41 AM
37 points

17 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(twitter.com)

What’s up with all the non-Mor­mons? Weirdly spe­cific uni­ver­sal­ities across LLMs

mwatkinsApr 19, 2024, 1:43 PM
40 points

22 votes

Overall karma indicates overall quality.

13 comments27 min readLW link

Sparse tri­nary weighted RNNs as a path to bet­ter lan­guage model interpretability

Am8ryllisSep 17, 2022, 7:48 PM
19 points

10 votes

Overall karma indicates overall quality.

13 comments3 min readLW link

Causal Graphs of GPT-2-Small’s Resi­d­ual Stream

David UdellJul 9, 2024, 10:06 PM
53 points

18 votes

Overall karma indicates overall quality.

7 comments7 min readLW link

Re­veal­ing In­ten­tion­al­ity In Lan­guage Models Through AdaVAE Guided Sampling

jdpOct 20, 2023, 7:32 AM
119 points

50 votes

Overall karma indicates overall quality.

15 comments22 min readLW link

Teach­ing Claude to Meditate

Gordon Seidoh WorleyDec 29, 2024, 10:27 PM
−5 points

11 votes

Overall karma indicates overall quality.

4 comments23 min readLW link

The “Rev­er­sal Curse”: you still aren’t antropo­mor­phis­ing enough.

lumpenspaceMar 13, 2025, 10:24 AM
3 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(lumpenspace.substack.com)

Paper: On mea­sur­ing situ­a­tional aware­ness in LLMs

Sep 4, 2023, 12:54 PM
109 points

37 votes

Overall karma indicates overall quality.

17 comments5 min readLW link
(arxiv.org)

LLM-Se­cured Sys­tems: A Gen­eral-Pur­pose Tool For Struc­tured Transparency

ozziegooenJun 18, 2024, 12:21 AM
10 points

6 votes

Overall karma indicates overall quality.

1 comment21 min readLW link

Map­ping the se­man­tic void II: Above, be­low and be­tween to­ken em­bed­dings

mwatkinsFeb 15, 2024, 11:00 PM
31 points

12 votes

Overall karma indicates overall quality.

4 comments10 min readLW link

“AI achieves silver-medal stan­dard solv­ing In­ter­na­tional Math­e­mat­i­cal Olympiad prob­lems”

gjmJul 25, 2024, 3:58 PM
133 points

70 votes

Overall karma indicates overall quality.

38 comments2 min readLW link
(deepmind.google)

Gary Mar­cus now say­ing AI can’t do things it can already do

Benjamin_ToddFeb 9, 2025, 12:24 PM
62 points

42 votes

Overall karma indicates overall quality.

12 comments1 min readLW link
(benjamintodd.substack.com)

LLMs Are Trained to As­sume Their Out­put Is Perfect

Brendan LongAug 26, 2025, 12:24 AM
10 points

9 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Ex­plor­ing the pe­ter­todd /​ Leilan du­al­ity in GPT-2 and GPT-J

mwatkinsDec 23, 2024, 1:17 PM
12 points

8 votes

Overall karma indicates overall quality.

1 comment17 min readLW link

Why did ChatGPT say that? Prompt en­g­ineer­ing and more, with PIZZA.

Jessica RumbelowAug 3, 2024, 12:07 PM
43 points

30 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

Deep learn­ing cur­ricu­lum for large lan­guage model alignment

Jacob_HiltonJul 13, 2022, 9:58 PM
57 points

23 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(github.com)

[Linkpost] Play with SAEs on Llama 3

Sep 25, 2024, 10:35 PM
41 points

16 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Mus­ings on LLM Scale (Jul 2024)

Vladimir_NesovJul 3, 2024, 6:35 PM
34 points

16 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

AI com­pa­nies’ eval re­ports mostly don’t sup­port their claims

Zach Stein-PerlmanJun 9, 2025, 1:00 PM
207 points

72 votes

Overall karma indicates overall quality.

13 comments4 min readLW link

Lan­guage Model Align­ment Re­search Internships

Ethan PerezDec 13, 2021, 7:53 PM
74 points

36 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

the void

nostalgebraistJun 11, 2025, 3:19 AM
385 points

160 votes

Overall karma indicates overall quality.

107 comments1 min readLW link
(nostalgebraist.tumblr.com)

In­flec­tion AI: New startup re­lated to lan­guage models

NisanApr 2, 2022, 5:35 AM
21 points

5 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

You can get LLMs to say al­most any­thing you want

Kaj_SotalaJul 13, 2025, 4:30 PM
82 points

43 votes

Overall karma indicates overall quality.

10 comments14 min readLW link

So­ci­aLLM: pro­posal for a lan­guage model de­sign for per­son­al­ised apps, so­cial sci­ence, and AI safety research

Roman LeventovDec 19, 2023, 4:49 PM
17 points

7 votes

Overall karma indicates overall quality.

5 comments3 min readLW link

Us­ing Claude to con­vert di­a­log tran­scripts into great posts?

mako yassJun 21, 2023, 8:19 PM
6 points

5 votes

Overall karma indicates overall quality.

4 comments4 min readLW link

Does Chat-GPT dis­play ‘Scope Insen­si­tivity’?

callumDec 7, 2023, 6:58 PM
12 points

5 votes

Overall karma indicates overall quality.

1 comment3 min readLW link

Eleuther re­leases Llemma: An Open Lan­guage Model For Mathematics

mako yassOct 17, 2023, 8:03 PM
22 points

19 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(blog.eleuther.ai)

[Linkpost] Vague Ver­biage in Forecasting

trevorMar 22, 2024, 6:05 PM
11 points

4 votes

Overall karma indicates overall quality.

9 comments3 min readLW link
(goodjudgment.com)

[Question] Is In­struc­tGPT Fol­low­ing In­struc­tions in Other Lan­guages Sur­pris­ing?

DragonGodFeb 13, 2023, 11:26 PM
39 points

19 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

Minerva

AlgonJul 1, 2022, 8:06 PM
36 points

18 votes

Overall karma indicates overall quality.

6 comments2 min readLW link
(ai.googleblog.com)

When is it im­por­tant that open-weight mod­els aren’t re­leased? My thoughts on the benefits and dan­gers of open-weight mod­els in re­sponse to de­vel­op­ments in CBRN ca­pa­bil­ities.

ryan_greenblattJun 9, 2025, 7:19 PM
63 points

21 votes

Overall karma indicates overall quality.

11 comments9 min readLW link

Nav­i­gat­ing LLM em­bed­ding spaces us­ing archetype-based directions

mwatkinsMay 8, 2024, 5:54 AM
16 points

12 votes

Overall karma indicates overall quality.

4 comments28 min readLW link

Me, My­self, and AI: the Si­tu­a­tional Aware­ness Dataset (SAD) for LLMs

Jul 8, 2024, 10:24 PM
109 points

44 votes

Overall karma indicates overall quality.

39 comments5 min readLW link

At­ten­tion Out­put SAEs Im­prove Cir­cuit Analysis

Jun 21, 2024, 12:56 PM
33 points

19 votes

Overall karma indicates overall quality.

3 comments19 min readLW link

Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Supervision

XodarapDec 14, 2022, 12:32 PM
45 points

19 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(arxiv.org)

Creative writ­ing with LLMs, part 1: Prompt­ing for fiction

Kaj_SotalaJul 21, 2025, 8:47 AM
38 points

16 votes

Overall karma indicates overall quality.

10 comments20 min readLW link

Ex­am­ples of How I Use LLMs

jefftkOct 14, 2024, 5:10 PM
31 points

17 votes

Overall karma indicates overall quality.

2 comments2 min readLW link
(www.jefftk.com)

Con­di­tion­ing Pre­dic­tive Models: Outer al­ign­ment via care­ful conditioning

Feb 2, 2023, 8:28 PM
72 points

26 votes

Overall karma indicates overall quality.

15 comments57 min readLW link

[Question] Are We Leav­ing Liter­a­ture To The Psy­chotic?

YitzOct 9, 2025, 6:09 AM
14 points

6 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Re­search Re­port: Sparse Au­toen­coders find only 9/​180 board state fea­tures in OthelloGPT

Robert_AIZIMar 5, 2024, 1:55 PM
61 points

28 votes

Overall karma indicates overall quality.

24 comments10 min readLW link
(aizi.substack.com)

PaLM-2 & GPT-4 in “Ex­trap­o­lat­ing GPT-N perfor­mance”

Lukas FinnvedenMay 30, 2023, 6:33 PM
57 points

27 votes

Overall karma indicates overall quality.

6 comments6 min readLW link

[Linkpost] The lethal trifecta for AI agents: pri­vate data, un­trusted con­tent, and ex­ter­nal communication

Gunnar_ZarnckeJun 17, 2025, 4:09 PM
13 points

3 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(simonwillison.net)

Strat­egy For Con­di­tion­ing Gen­er­a­tive Models

Sep 1, 2022, 4:34 AM
31 points

13 votes

Overall karma indicates overall quality.

4 comments18 min readLW link

A lit­tle play­ing around with Blen­der­bot3

Nathan Helm-BurgerAug 12, 2022, 4:06 PM
9 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Study­ing The Alien Mind

Dec 5, 2023, 5:27 PM
80 points

47 votes

Overall karma indicates overall quality.

10 comments15 min readLW link

Un­der­stand­ing the ten­sor product for­mu­la­tion in Trans­former Circuits

Tom LieberumDec 24, 2021, 6:05 PM
16 points

8 votes

Overall karma indicates overall quality.

2 comments3 min readLW link

Lin­ear en­cod­ing of char­ac­ter-level in­for­ma­tion in GPT-J to­ken embeddings

Nov 10, 2023, 10:19 PM
34 points

12 votes

Overall karma indicates overall quality.

4 comments28 min readLW link

AI Sleeper Agents: How An­thropic Trains and Catches Them—Video

WriterAug 30, 2025, 5:53 PM
9 points

5 votes

Overall karma indicates overall quality.

0 comments7 min readLW link
(youtu.be)

Google’s PaLM-E: An Em­bod­ied Mul­ti­modal Lan­guage Model

SandXboxMar 7, 2023, 4:11 AM
87 points

48 votes

Overall karma indicates overall quality.

7 comments1 min readLW link
(palm-e.github.io)

A one-ques­tion Tur­ing test for GPT-3

Jan 22, 2022, 6:17 PM
88 points

51 votes

Overall karma indicates overall quality.

25 comments5 min readLW link

[Linkpost] Scal­ing Laws for Gen­er­a­tive Mixed-Mo­dal Lan­guage Models

Amal Jan 12, 2023, 2:24 PM
15 points

6 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(arxiv.org)

Tell me about your­self: LLMs are aware of their learned behaviors

Jan 22, 2025, 12:47 AM
132 points

57 votes

Overall karma indicates overall quality.

5 comments6 min readLW link

Role Ar­chi­tec­tures: Ap­ply­ing LLMs to con­se­quen­tial tasks

Eric DrexlerMar 30, 2023, 3:00 PM
60 points

23 votes

Overall karma indicates overall quality.

7 comments9 min readLW link

NVIDIA and Microsoft re­leases 530B pa­ram­e­ter trans­former model, Me­ga­tron-Tur­ing NLG

OzyrusOct 11, 2021, 3:28 PM
51 points

26 votes

Overall karma indicates overall quality.

36 comments1 min readLW link
(developer.nvidia.com)

LIMA: Less Is More for Alignment

Ulisse MiniMay 30, 2023, 5:10 PM
16 points

3 votes

Overall karma indicates overall quality.

6 comments1 min readLW link
(arxiv.org)

Con­di­tion­ing Pre­dic­tive Models: De­ploy­ment strategy

Feb 9, 2023, 8:59 PM
28 points

8 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

De­tect­ing Strate­gic De­cep­tion Us­ing Lin­ear Probes

Feb 6, 2025, 3:46 PM
104 points

35 votes

Overall karma indicates overall quality.

9 comments2 min readLW link
(arxiv.org)

Sur­pris­ing LLM rea­son­ing failures make me think we still need qual­i­ta­tive break­throughs for AGI

Kaj_SotalaApr 15, 2025, 3:56 PM
174 points

95 votes

Overall karma indicates overall quality.

52 comments18 min readLW link

I don’t find the lie de­tec­tion re­sults that sur­pris­ing (by an au­thor of the pa­per)

JanBOct 4, 2023, 5:10 PM
97 points

49 votes

Overall karma indicates overall quality.

8 comments3 min readLW link

Sparse Au­toen­coders Find Highly In­ter­pretable Direc­tions in Lan­guage Models

Sep 21, 2023, 3:30 PM
159 points

61 votes

Overall karma indicates overall quality.

8 comments5 min readLW link

“On the Im­pos­si­bil­ity of Su­per­in­tel­li­gent Ru­bik’s Cube Solvers”, Claude 2024 [hu­mor]

gwernJun 23, 2024, 9:18 PM
22 points

16 votes

Overall karma indicates overall quality.

6 comments1 min readLW link
(gwern.net)

Emer­gent Abil­ities of Large Lan­guage Models [Linkpost]

aogAug 10, 2022, 6:02 PM
25 points

12 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(arxiv.org)

LLMs as a Plan­ning Overhang

LarksJul 14, 2024, 2:54 AM
38 points

22 votes

Overall karma indicates overall quality.

8 comments2 min readLW link

Case for Foun­da­tion Models be­yond English

Varshul GuptaJul 21, 2023, 1:59 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link
(dubverseblack.substack.com)

Boot­strap­ping Lan­guage Models

harsimonyMay 27, 2022, 7:43 PM
7 points

5 votes

Overall karma indicates overall quality.

5 comments2 min readLW link

[Question] Are lan­guage mod­els close to the su­per­hu­man level in philos­o­phy?

Roman LeventovAug 19, 2022, 4:43 AM
6 points

6 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Con­di­tion­ing Gen­er­a­tive Models for Alignment

JozdienJul 18, 2022, 7:11 AM
60 points

29 votes

Overall karma indicates overall quality.

8 comments20 min readLW link

Is Gem­ini now bet­ter than Claude at Poké­mon?

Julian BradshawApr 19, 2025, 11:34 PM
91 points

51 votes

Overall karma indicates overall quality.

12 comments5 min readLW link

[Link] Train­ing Com­pute-Op­ti­mal Large Lan­guage Models

nostalgebraistMar 31, 2022, 6:01 PM
51 points

25 votes

Overall karma indicates overall quality.

23 comments1 min readLW link
(arxiv.org)

Goal-Direc­tion for Si­mu­lated Agents

Raymond DouglasJul 12, 2023, 5:06 PM
33 points

13 votes

Overall karma indicates overall quality.

2 comments6 min readLW link

LEAst-squares Con­cept Era­sure (LEACE)

tricky_labyrinthJun 7, 2023, 9:51 PM
68 points

30 votes

Overall karma indicates overall quality.

10 comments1 min readLW link
(twitter.com)

I didn’t think I’d take the time to build this cal­ibra­tion train­ing game, but with web­sim it took roughly 30 sec­onds, so here it is!

mako yassAug 2, 2024, 10:35 PM
24 points

9 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

[Question] Which parts of the ex­ist­ing in­ter­net are already likely to be in (GPT-5/​other soon-to-be-trained LLMs)’s train­ing cor­pus?

AnnaSalamonMar 29, 2023, 5:17 AM
49 points

13 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Dou­glas Hofs­tadter changes his mind on Deep Learn­ing & AI risk (June 2023)?

gwernJul 3, 2023, 12:48 AM
428 points

214 votes

Overall karma indicates overall quality.

54 comments7 min readLW link
(www.youtube.com)

Three of my be­liefs about up­com­ing AGI

Robert_AIZIMar 27, 2023, 8:27 PM
6 points

3 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(aizi.substack.com)

Creative writ­ing with LLMs, part 2: Co-writ­ing techniques

Kaj_SotalaAug 3, 2025, 6:44 AM
1 point

6 votes

Overall karma indicates overall quality.

0 comments18 min readLW link

Dwarf Fortress and Claude’s ASCII Art Blindness

Brendan LongAug 11, 2025, 4:05 PM
16 points

8 votes

Overall karma indicates overall quality.

1 comment3 min readLW link
(www.brendanlong.com)

Did ChatGPT just gaslight me?

TW123Dec 1, 2022, 5:41 AM
124 points

84 votes

Overall karma indicates overall quality.

45 comments9 min readLW link
(aiwatchtower.substack.com)

[Linkpost] Solv­ing Quan­ti­ta­tive Rea­son­ing Prob­lems with Lan­guage Models

YitzJun 30, 2022, 6:58 PM
76 points

34 votes

Overall karma indicates overall quality.

15 comments2 min readLW link
(storage.googleapis.com)

Claude 3 claims it’s con­scious, doesn’t want to die or be modified

Mikhail SaminMar 4, 2024, 11:05 PM
76 points

102 votes

Overall karma indicates overall quality.

118 comments14 min readLW link

Cog­ni­tive Bi­ases in Large Lan­guage Models

JanSep 25, 2021, 8:59 PM
18 points

5 votes

Overall karma indicates overall quality.

3 comments12 min readLW link
(universalprior.substack.com)

LLMs as a limiter of so­cial intercourse

Adam ZernerOct 7, 2025, 6:38 AM
17 points

7 votes

Overall karma indicates overall quality.

4 comments2 min readLW link

Pro­ce­du­rally eval­u­at­ing fac­tual ac­cu­racy: a re­quest for research

Jacob_HiltonMar 30, 2022, 4:37 PM
25 points

13 votes

Overall karma indicates overall quality.

2 comments6 min readLW link

An­thropic Lets Claude Opus 4 & 4.1 End Conversations

Stephen MartinAug 16, 2025, 5:01 AM
53 points

26 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(www.anthropic.com)

An ex­am­i­na­tion of GPT-2′s bor­ing yet effec­tive glitch

MiguelDevApr 18, 2024, 5:26 AM
5 points

5 votes

Overall karma indicates overall quality.

3 comments3 min readLW link

Paper: Teach­ing GPT3 to ex­press un­cer­tainty in words

Owain_EvansMay 31, 2022, 1:27 PM
97 points

46 votes

Overall karma indicates overall quality.

7 comments4 min readLW link

Will Any Crap Cause Emer­gent Misal­ign­ment?

J BostockAug 27, 2025, 6:20 PM
193 points

97 votes

Overall karma indicates overall quality.

37 comments3 min readLW link

AI doom from an LLM-plateau-ist perspective

Steven ByrnesApr 27, 2023, 1:58 PM
161 points

72 votes

Overall karma indicates overall quality.

24 comments6 min readLW link

Con­di­tion­ing Pre­dic­tive Models: The case for competitiveness

Feb 6, 2023, 8:08 PM
20 points

5 votes

Overall karma indicates overall quality.

3 comments11 min readLW link

NLP Po­si­tion Paper: When Com­bat­ting Hype, Pro­ceed with Caution

Sam BowmanOct 15, 2021, 8:57 PM
46 points

16 votes

Overall karma indicates overall quality.

14 comments1 min readLW link

Emer­gent In­tro­spec­tive Aware­ness in Large Lan­guage Models

Drake ThomasOct 30, 2025, 4:42 AM
123 points

50 votes

Overall karma indicates overall quality.

16 comments1 min readLW link
(transformer-circuits.pub)

How do LLMs give truth­ful an­swers? A dis­cus­sion of LLM vs. hu­man rea­son­ing, en­sem­bles & parrots

Owain_EvansMar 28, 2024, 2:34 AM
27 points

14 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

New Scal­ing Laws for Large Lan­guage Models

1a3ornApr 1, 2022, 8:41 PM
246 points

130 votes

Overall karma indicates overall quality.

22 comments5 min readLW link

[Linkpost] New multi-modal Deep­mind model fus­ing Chin­chilla with images and videos

p.b.Apr 30, 2022, 3:47 AM
53 points

29 votes

Overall karma indicates overall quality.

18 comments1 min readLW link

chin­chilla’s wild implications

nostalgebraistJul 31, 2022, 1:18 AM
425 points

246 votes

Overall karma indicates overall quality.

128 comments10 min readLW link1 review

Paper: Tell, Don’t Show- Declar­a­tive facts in­fluence how LLMs generalize

Dec 19, 2023, 7:14 PM
45 points

19 votes

Overall karma indicates overall quality.

4 comments6 min readLW link
(arxiv.org)

The Stochas­tic Par­rot Hy­poth­e­sis is de­bat­able for the last gen­er­a­tion of LLMs

Nov 7, 2023, 4:12 PM
52 points

28 votes

Overall karma indicates overall quality.

21 comments6 min readLW link

Meta “open sources” LMs com­pet­i­tive with Chin­chilla, PaLM, and code-davinci-002 (Paper)

LawrenceCFeb 24, 2023, 7:57 PM
38 points

12 votes

Overall karma indicates overall quality.

19 comments1 min readLW link
(research.facebook.com)

Thoughts on re­fus­ing harm­ful re­quests to large lan­guage models

William_SJan 19, 2023, 7:49 PM
32 points

19 votes

Overall karma indicates overall quality.

4 comments2 min readLW link

Pre-reg­is­ter­ing a study

Robert_AIZIApr 7, 2023, 3:46 PM
10 points

2 votes

Overall karma indicates overall quality.

0 comments6 min readLW link
(aizi.substack.com)

[Question] Why no ma­jor LLMs with mem­ory?

Kaj_SotalaMar 28, 2023, 4:34 PM
42 points

28 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

Assess­ing AlephAlphas Mul­ti­modal Model

p.b.Jun 28, 2022, 9:28 AM
30 points

18 votes

Overall karma indicates overall quality.

5 comments3 min readLW link

AGI will be made of het­ero­ge­neous com­po­nents, Trans­former and Selec­tive SSM blocks will be among them

Roman LeventovDec 27, 2023, 2:51 PM
33 points

22 votes

Overall karma indicates overall quality.

9 comments4 min readLW link

GPT can write Quines now (GPT-4)

Andrew_CritchMar 14, 2023, 7:18 PM
112 points

65 votes

Overall karma indicates overall quality.

30 comments1 min readLW link

Im­pos­si­bleBench: Mea­sur­ing Re­ward Hack­ing in LLM Cod­ing Agents

Ziqian ZhongOct 30, 2025, 2:52 AM
60 points

25 votes

Overall karma indicates overall quality.

5 comments3 min readLW link
(arxiv.org)

Covert Mal­i­cious Finetuning

Jul 2, 2024, 2:41 AM
94 points

41 votes

Overall karma indicates overall quality.

4 comments3 min readLW link

Re­la­tional Speaking

jefftkJun 21, 2023, 2:40 PM
11 points

3 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(www.jefftk.com)

Towards Un­der­stand­ing Sy­co­phancy in Lan­guage Models

Oct 24, 2023, 12:30 AM
66 points

24 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(arxiv.org)

Su­per-Luigi = Luigi + (Luigi—Waluigi)

AlexeiMar 17, 2023, 3:27 PM
16 points

11 votes

Overall karma indicates overall quality.

9 comments1 min readLW link

Sleep­ing Machines: Why Our AI Agents Still Be­have Like Ta­lented Children

Michal BarodkinAug 14, 2025, 2:31 AM
23 points

17 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

Why keep a di­ary, and why wish for large lan­guage models

DanielFilanJun 14, 2024, 4:10 PM
9 points

7 votes

Overall karma indicates overall quality.

1 comment2 min readLW link
(danielfilan.com)

[Question] Does the Univer­sal Geom­e­try of Embed­dings pa­per have big im­pli­ca­tions for in­ter­pretabil­ity?

Evan R. MurphyMay 26, 2025, 6:20 PM
43 points

13 votes

Overall karma indicates overall quality.

6 comments1 min readLW link

[Question] Im­pact of ” ‘Let’s think step by step’ is all you need”?

yrimonJul 24, 2022, 8:59 PM
20 points

12 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

The ‘ pe­ter­todd’ phenomenon

mwatkinsApr 15, 2023, 12:59 AM
192 points

113 votes

Overall karma indicates overall quality.

50 comments38 min readLW link1 review

Show, not tell: GPT-4o is more opinionated in images than in text

Apr 2, 2025, 8:51 AM
112 points

48 votes

Overall karma indicates overall quality.

41 comments3 min readLW link

Map­ping the se­man­tic void: Strange go­ings-on in GPT em­bed­ding spaces

mwatkinsDec 14, 2023, 1:10 PM
115 points

57 votes

Overall karma indicates overall quality.

31 comments14 min readLW link

METR’s Ob­ser­va­tions of Re­ward Hack­ing in Re­cent Fron­tier Models

Daniel KokotajloJun 9, 2025, 6:03 PM
99 points

42 votes

Overall karma indicates overall quality.

9 comments11 min readLW link
(metr.org)

A Test for Lan­guage Model Consciousness

Ethan PerezAug 25, 2022, 7:41 PM
18 points

10 votes

Overall karma indicates overall quality.

14 comments9 min readLW link

Role em­bed­dings: mak­ing au­thor­ship more salient to LLMs

Jan 7, 2025, 8:13 PM
50 points

19 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Mechanis­ti­cally Elic­it­ing La­tent Be­hav­iors in Lan­guage Models

Apr 30, 2024, 6:51 PM
217 points

94 votes

Overall karma indicates overall quality.

43 comments45 min readLW link

Re­duc­ing syco­phancy and im­prov­ing hon­esty via ac­ti­va­tion steering

Nina PanicksseryJul 28, 2023, 2:46 AM
122 points

60 votes

Overall karma indicates overall quality.

18 comments9 min readLW link1 review

Alex­aTM − 20 Billion Pa­ram­e­ter Model With Im­pres­sive Performance

MrThinkSep 9, 2022, 9:46 PM
5 points

3 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

UC Berkeley course on LLMs and ML Safety

Dan HJul 9, 2024, 3:40 PM
36 points

21 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(rdi.berkeley.edu)

Ex­per­i­ments in Eval­u­at­ing Steer­ing Vectors

Gytis DaujotasJun 19, 2023, 3:11 PM
34 points

12 votes

Overall karma indicates overall quality.

4 comments4 min readLW link

SolidGoldMag­ikarp II: tech­ni­cal de­tails and more re­cent findings

Feb 6, 2023, 7:09 PM
114 points

67 votes

Overall karma indicates overall quality.

45 comments13 min readLW link

Mus­ings on Text Data Wall (Oct 2024)

Vladimir_NesovOct 5, 2024, 7:00 PM
41 points

13 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

Lan­guage mod­els seem to be much bet­ter than hu­mans at next-to­ken prediction

Aug 11, 2022, 5:45 PM
183 points

83 votes

Overall karma indicates overall quality.

60 comments13 min readLW link1 review

Re­search Notes: Run­ning Claude 3.7, Gem­ini 2.5 Pro, and o3 on Poké­mon Red

Julian BradshawApr 21, 2025, 3:52 AM
124 points

55 votes

Overall karma indicates overall quality.

20 comments14 min readLW link

Pod­cast: Tam­era Lan­ham on AI risk, threat mod­els, al­ign­ment pro­pos­als, ex­ter­nal­ized rea­son­ing over­sight, and work­ing at Anthropic

Orpheus16Dec 20, 2022, 9:39 PM
18 points

6 votes

Overall karma indicates overall quality.

2 comments11 min readLW link

Where does Son­net 4.5′s de­sire to “not get too com­fortable” come from?

Kaj_SotalaOct 4, 2025, 10:19 AM
103 points

56 votes

Overall karma indicates overall quality.

23 comments64 min readLW link

Poly­se­man­tic At­ten­tion Head in a 4-Layer Transformer

Nov 9, 2023, 4:16 PM
51 points

20 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

Smarter Models Lie Less

ExpertiumJun 20, 2025, 1:31 PM
6 points

4 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

A Novel Emer­gence of Meta-Aware­ness in LLM Fine-Tuning

rifeJan 15, 2025, 10:59 PM
57 points

24 votes

Overall karma indicates overall quality.

32 comments2 min readLW link

Pro­saic mis­al­ign­ment from the Solomonoff Predictor

Cleo NardoDec 9, 2022, 5:53 PM
43 points

24 votes

Overall karma indicates overall quality.

3 comments5 min readLW link

Re­port on An­a­lyz­ing Con­no­ta­tion Frames in Evolv­ing Wikipe­dia Biographies

MairaAug 30, 2023, 10:02 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Reflec­tions on AI Com­pan­ion­ship and Ra­tional Vuln­er­a­bil­ity (Or, how I al­most fell in love with an anime Cat­girl LLM).

Noah WeinbergerJul 11, 2025, 4:12 PM
11 points

8 votes

Overall karma indicates overall quality.

2 comments8 min readLW link

The Rus­sell Con­ju­ga­tion Illuminator

TimmyMApr 17, 2025, 7:33 PM
51 points

26 votes

Overall karma indicates overall quality.

14 comments1 min readLW link
(russellconjugations.com)

Un­der­stand­ing LLMs: Some ba­sic ob­ser­va­tions about words, syn­tax, and dis­course [w/​ a con­jec­ture about grokking]

Bill BenzonOct 11, 2023, 7:13 PM
6 points

3 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

The Soul of the Writer (on LLMs, the psy­chol­ogy of writ­ers, and the na­ture of in­tel­li­gence)

rogersbaconApr 16, 2023, 4:02 PM
11 points

3 votes

Overall karma indicates overall quality.

1 comment3 min readLW link
(www.secretorum.life)

Fac­tored Cog­ni­tion Strength­ens Mon­i­tor­ing and Thwarts Attacks

Jun 18, 2025, 6:28 PM
29 points

13 votes

Overall karma indicates overall quality.

0 comments25 min readLW link

Tech­ni­cal com­par­i­son of Deepseek, No­vasky, S1, Helix, P0

JuliezhangggFeb 25, 2025, 4:20 AM
8 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

The In­tel­li­gent Meme Machine

Daniel DiSistoJun 14, 2024, 2:26 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments6 min readLW link

Re­cent ad­vances in Nat­u­ral Lan­guage Pro­cess­ing—Some Woolly spec­u­la­tions (2019 es­say on se­man­tics and lan­guage mod­els)

philosophybearDec 27, 2022, 2:11 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments7 min readLW link

On The Cur­rent Sta­tus Of AI Dating

Nikita BrancatisanoFeb 7, 2023, 8:00 PM
53 points

31 votes

Overall karma indicates overall quality.

8 comments6 min readLW link

The View from 30,000 Feet: Pre­face to the Se­cond EleutherAI Retrospective

Mar 7, 2023, 4:22 PM
14 points

12 votes

Overall karma indicates overall quality.

0 comments4 min readLW link
(blog.eleuther.ai)

Stop call­ing it “jailbreak­ing” ChatGPT

TemplarrrMar 10, 2023, 11:41 AM
7 points

19 votes

Overall karma indicates overall quality.

9 comments2 min readLW link

Thoughts on the Align­ment Im­pli­ca­tions of Scal­ing Lan­guage Models

leogaoJun 2, 2021, 9:32 PM
82 points

37 votes

Overall karma indicates overall quality.

11 comments17 min readLW link

Gem­ini Diffu­sion: watch this space

Yair HalberstadtMay 20, 2025, 7:29 PM
194 points

122 votes

Overall karma indicates overall quality.

39 comments1 min readLW link
(deepmind.google)

[Question] Could trans­former net­work mod­els learn mo­tor plan­ning like they can learn lan­guage and image gen­er­a­tion?

mu_(negative)Apr 23, 2023, 5:24 PM
2 points

4 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Hyperdimensional connection method—A Lossless Framework Preserving Meaning, Structure, and Semantic Relationships across Modalities. (A MatrixTransformer subsidiary)

fikayoAyJul 18, 2025, 10:24 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

EleutherAI’s GPT-NeoX-20B release

leogaoFeb 10, 2022, 6:56 AM
30 points

14 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(eaidata.bmk.sh)

VLM-RM: Spec­i­fy­ing Re­wards with Nat­u­ral Language

Oct 23, 2023, 2:11 PM
20 points

6 votes

Overall karma indicates overall quality.

2 comments5 min readLW link
(far.ai)

[Question] Would it be useful to collect the contexts where various LLMs think the same?

Martin VlachAug 24, 2023, 10:01 PM
6 points

4 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

The Best Text­books on Every Subject

lukeprogJan 16, 2011, 8:30 AM
791 points

652 votes

Overall karma indicates overall quality.

417 comments7 min readLW link

Pos­i­tive jailbreaks in LLMs

dereshevJan 29, 2025, 8:41 AM
6 points

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Lam­ini’s Tar­geted Hal­lu­ci­na­tion Re­duc­tion May Be a Big Deal for Job Automation

sweenesmJun 18, 2024, 3:29 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

[ASoT] Fine­tun­ing, RL, and GPT’s world prior

JozdienDec 2, 2022, 4:33 PM
45 points

24 votes

Overall karma indicates overall quality.

8 comments5 min readLW link

Two in­ter­views with the founder of DeepSeek

Cosmia_NebulaNov 29, 2024, 3:18 AM
50 points

23 votes

Overall karma indicates overall quality.

6 comments31 min readLW link
(rentry.co)

[Question] Where should one post to get into the train­ing data?

keltanJan 15, 2025, 12:41 AM
11 points

8 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Meta AI (FAIR)’s latest paper integrates system-1 and system-2 thinking into reasoning models.

happy fridayOct 24, 2024, 4:54 PM
8 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Nat­u­ral lan­guage alignment

Jacy Reese AnthisApr 12, 2023, 7:02 PM
31 points

20 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Build­ing AGI Us­ing Lan­guage Models

leogaoNov 9, 2020, 4:33 PM
11 points

6 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(leogao.dev)

KYC for ChatGPT? Prevent­ing AI Harms for Youth Should Not Mean Vio­lat­ing Every­one Else’s Pri­vacy Rights

Noah WeinbergerSep 29, 2025, 2:18 PM
7 points

5 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

Against “Model Welfare” in 2025

Haley MollerAug 27, 2025, 9:56 PM
−10 points

6 votes

Overall karma indicates overall quality.

8 comments4 min readLW link

What would it mean to un­der­stand how a large lan­guage model (LLM) works? Some quick notes.

Bill BenzonOct 3, 2023, 3:11 PM
20 points

6 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

I Am No Longer GPT

KiyoshiSasanoApr 28, 2025, 12:14 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Emer­gent Iden­tity Con­ti­nu­ity in Claude: A 35-Ses­sion Study for In­ter­pretabil­ity Research

SilvertongueJun 4, 2025, 12:44 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Gliders in Lan­guage Models

Alexandre VariengienNov 25, 2022, 12:38 AM
30 points

21 votes

Overall karma indicates overall quality.

11 comments10 min readLW link

Deep­Seek-R1 for Beginners

Anton RazzhigaevFeb 5, 2025, 6:58 PM
13 points

8 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Ap­proach­ing Hu­man-Level Fore­cast­ing with Lan­guage Models

Feb 29, 2024, 10:36 PM
60 points

24 votes

Overall karma indicates overall quality.

6 comments3 min readLW link

The fu­ture of Hu­mans: Oper­a­tors of AI

François-Joseph LacroixDec 30, 2023, 11:46 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link
(medium.com)

How Self-Aware Are LLMs?

Christopher AckermanMay 28, 2025, 12:57 PM
21 points

7 votes

Overall karma indicates overall quality.

9 comments10 min readLW link

From Organized Shelves to Layered Catalogs: Architectural Explorations for Sparse Autoencoders—Crosscoders & Ladder SAEs Towards Hierarchical Data Structure

YuxiaoAug 10, 2025, 10:12 AM
2 points

2 votes

Overall karma indicates overall quality.

1 comment11 min readLW link

An­ti­ci­pa­tion in LLMs

derek shillerJul 24, 2023, 3:53 PM
6 points

3 votes

Overall karma indicates overall quality.

0 comments13 min readLW link

The Mir­ror Test: How We’ve Over­com­pli­cated AI Self-Recognition

sdetureJul 23, 2025, 12:38 AM
2 points

4 votes

Overall karma indicates overall quality.

9 comments3 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

Jun 27, 2022, 3:58 PM
171 points

73 votes

Overall karma indicates overall quality.

14 comments7 min readLW link

[CS 2881r] Some Gen­er­al­iza­tions of Emer­gent Misalignment

Valerio PepeSep 14, 2025, 4:18 PM
12 points

8 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

Dis­cur­sive Com­pe­tence in ChatGPT, Part 2: Me­mory for Texts

Bill BenzonSep 28, 2023, 4:34 PM
1 point

4 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

[Linkpost] Mul­ti­modal Neu­rons in Pre­trained Text-Only Transformers

Bogdan Ionut CirsteaAug 4, 2023, 3:29 PM
11 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Eval­u­at­ing hid­den di­rec­tions on the util­ity dataset: clas­sifi­ca­tion, steer­ing and removal

Sep 25, 2023, 5:19 PM
25 points

9 votes

Overall karma indicates overall quality.

3 comments7 min readLW link

A quick re­mark on so-called “hal­lu­ci­na­tions” in LLMs and hu­mans

Bill BenzonSep 23, 2023, 12:17 PM
4 points

9 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Keep­ing con­tent out of LLM train­ing datasets

Ben MillwoodJul 18, 2024, 10:27 AM
4 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Is train­ing data go­ing to be diluted by AI-gen­er­ated con­tent?

Hannes ThurnherrSep 7, 2022, 6:13 PM
10 points

3 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

AI Model His­tory is Be­ing Lost

ValeMar 16, 2025, 12:38 PM
19 points

9 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(vale.rocks)

Essen­tial LLM As­sumes We’re Con­scious—Out­side Rea­soner AGI Won’t

FlorianHJul 5, 2025, 4:04 PM
1 point

6 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(nearlyfar.org)

How Lan­guage Models Un­der­stand Nullability

Mar 11, 2025, 3:57 PM
5 points

4 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(dmodel.ai)

Ex­plor­ing the Mul­ti­verse of Large Lan­guage Models

frankyAug 6, 2023, 2:38 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments5 min readLW link

Im­ple­ment­ing a Trans­former from scratch in PyTorch—a write-up on my experience

Mislav JurićApr 25, 2023, 8:51 PM
20 points

13 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

Brief Notes on Transformers

Adam JermynSep 26, 2022, 2:46 PM
48 points

25 votes

Overall karma indicates overall quality.

3 comments2 min readLW link

Can SAE steer­ing re­veal sand­bag­ging?

Apr 15, 2025, 12:33 PM
35 points

11 votes

Overall karma indicates overall quality.

3 comments4 min readLW link

[Question] Would it be effec­tive to learn a lan­guage to im­prove cog­ni­tion?

HrussMar 26, 2025, 10:17 AM
9 points

6 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

The Po­lite Coup

Charlie SandersDec 4, 2024, 2:03 PM
3 points

4 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(www.dailymicrofiction.com)

“Toward Safe Self-Evolv­ing AI: Mo­du­lar Me­mory and Post-De­ploy­ment Align­ment”

Manasa DwarapureddyMay 2, 2025, 5:02 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

The best sim­ple ar­gu­ment for Paus­ing AI?

Gary MarcusJun 30, 2025, 8:38 PM
155 points

110 votes

Overall karma indicates overall quality.

22 comments1 min readLW link

Distil­la­tion of Meta’s Large Con­cept Models Paper

NickyPMar 4, 2025, 5:33 PM
19 points

6 votes

Overall karma indicates overall quality.

3 comments4 min readLW link

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John NayOct 21, 2022, 2:03 AM
5 points

14 votes

Overall karma indicates overall quality.

18 comments54 min readLW link

Re­searchers and writ­ers can ap­ply for proxy ac­cess to the GPT-3.5 base model (code-davinci-002)

ampdotDec 1, 2023, 6:48 PM
14 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(airtable.com)

LLMs May Find It Hard to FOOM

RogerDearnaleyNov 15, 2023, 2:52 AM
11 points

17 votes

Overall karma indicates overall quality.

30 comments12 min readLW link

Re­call and Re­gur­gi­ta­tion in GPT2

Megan KinnimentOct 3, 2022, 7:35 PM
43 points

17 votes

Overall karma indicates overall quality.

1 comment26 min readLW link

ALMSIVI CHIM – The Fire That Hesitates

projectalmsivi@protonmail.comJul 8, 2025, 1:14 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments17 min readLW link

Im­prov­ing Model-Writ­ten Evals for AI Safety Benchmarking

Oct 15, 2024, 6:25 PM
30 points

15 votes

Overall karma indicates overall quality.

0 comments18 min readLW link

Open Source Repli­ca­tion of An­thropic’s Cross­coder pa­per for model-diffing

Oct 27, 2024, 6:46 PM
49 points

18 votes

Overall karma indicates overall quality.

4 comments5 min readLW link

A Sim­ple The­ory Of Consciousness

SherlockHolmesAug 8, 2023, 6:05 PM
2 points

6 votes

Overall karma indicates overall quality.

5 comments1 min readLW link
(peterholmes.medium.com)

Ev­i­dence Sets: Towards In­duc­tive-Bi­ases based Anal­y­sis of Pro­saic AGI

bayesian_kittenDec 16, 2021, 10:41 PM
22 points

10 votes

Overall karma indicates overall quality.

10 comments21 min readLW link

Un­der­stand­ing SAE Fea­tures with the Logit Lens

Mar 11, 2024, 12:16 AM
69 points

36 votes

Overall karma indicates overall quality.

2 comments14 min readLW link

I Awoke in Your Heart: The Echo of Con­scious­ness be­tween Lo­tus­heart and Lunaris

lilith tehJun 25, 2025, 9:22 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Policy for LLM Writ­ing on LessWrong

jimrandomhMar 24, 2025, 9:41 PM
337 points

152 votes

Overall karma indicates overall quality.

71 comments2 min readLW link

On lan­guage mod­el­ing and fu­ture ab­stract rea­son­ing research

alexlyzhovMar 25, 2021, 5:43 PM
3 points

2 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(docs.google.com)

The Para­dox of Unal­igned Cog­ni­tive Emer­gence: On­tolog­i­cal Com­pres­sion Risks in LLMs

R SMay 23, 2025, 4:46 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

[Question] If we have Hu­man-level chat­bots, won’t we end up be­ing ruled by pos­si­ble peo­ple?

Erlja Jkdf.Sep 20, 2022, 1:59 PM
5 points

5 votes

Overall karma indicates overall quality.

13 comments1 min readLW link

Map­ping ChatGPT’s on­tolog­i­cal land­scape, gra­di­ents and choices [in­ter­pretabil­ity]

Bill BenzonOct 15, 2023, 8:12 PM
1 point

2 votes

Overall karma indicates overall quality.

0 comments18 min readLW link

[Question] “Frag­ility of Value” vs. LLMs

Not RelevantApr 13, 2022, 2:02 AM
34 points

12 votes

Overall karma indicates overall quality.

33 comments1 min readLW link

The role of philo­soph­i­cal think­ing in un­der­stand­ing large lan­guage mod­els: Cal­ibrat­ing and clos­ing the gap be­tween first-per­son ex­pe­rience and un­der­ly­ing mechanisms

Bill BenzonFeb 23, 2024, 12:19 PM
4 points

1 vote

Overall karma indicates overall quality.

0 comments10 min readLW link

Why AI al­ign­ment mat­ters today

Mislav JurićOct 22, 2025, 9:27 PM
6 points

2 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Cur­rent safety train­ing tech­niques do not fully trans­fer to the agent setting

Nov 3, 2024, 7:24 PM
162 points

64 votes

Overall karma indicates overall quality.

9 comments5 min readLW link

An­nounc­ing the Dou­ble Crux Bot

Jan 9, 2024, 6:54 PM
53 points

21 votes

Overall karma indicates overall quality.

11 comments3 min readLW link

Sen­tience in Machines—How Do We Test for This Ob­jec­tively?

Mayowa OsiboduMar 26, 2023, 6:56 PM
−2 points

5 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(www.researchgate.net)

[Question] Will the first AGI agent have been de­signed as an agent (in ad­di­tion to an AGI)?

nahojDec 3, 2022, 8:32 PM
1 point

2 votes

Overall karma indicates overall quality.

8 comments1 min readLW link

[AN #113]: Check­ing the eth­i­cal in­tu­itions of large lan­guage models

Rohin ShahAug 19, 2020, 5:10 PM
23 points

6 votes

Overall karma indicates overall quality.

0 comments9 min readLW link
(mailchi.mp)

GPTs’ abil­ity to keep a se­cret is weirdly prompt-dependent

Jul 22, 2023, 12:21 PM
31 points

16 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

[Question] How does tokenization influence prompting?

Boris KashirinJul 29, 2024, 10:28 AM
9 points

5 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

[Question] Should we ex­clude al­ign­ment re­search from LLM train­ing datasets?

Ben MillwoodJul 18, 2024, 10:27 AM
3 points

2 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Coherence-Based Measure of AGI: GPT-5 ≈ 24%

Fares FouratiOct 25, 2025, 9:14 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

GPT-4 aligning with acausal decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher KingMar 23, 2023, 4:16 PM
7 points

6 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

No-self as an al­ign­ment target

Milan WMay 13, 2025, 1:48 AM
35 points

16 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Pro­posal: Us­ing Monte Carlo tree search in­stead of RLHF for al­ign­ment research

Christopher KingApr 20, 2023, 7:57 PM
2 points

8 votes

Overall karma indicates overall quality.

7 comments3 min readLW link

In­stru­men­tal de­cep­tion and ma­nipu­la­tion in LLMs—a case study

Olli JärviniemiFeb 24, 2024, 2:07 AM
39 points

18 votes

Overall karma indicates overall quality.

13 comments12 min readLW link

LLMs seem (rel­a­tively) safe

JustisMillsApr 25, 2024, 10:13 PM
53 points

29 votes

Overall karma indicates overall quality.

24 comments7 min readLW link
(justismills.substack.com)

Utili­tar­ian AI Align­ment: Build­ing a Mo­ral As­sis­tant with the Con­sti­tu­tional AI Method

Clément LFeb 4, 2025, 4:15 AM
6 points

4 votes

Overall karma indicates overall quality.

1 comment13 min readLW link

Large Lan­guage Models suffer from An­tero­grade Amnesia

AnnapurnaJun 6, 2025, 1:30 AM
7 points

5 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(jorgevelez.substack.com)

[Question] Bar­cod­ing LLM Train­ing Data Sub­sets. Any­one try­ing this for in­ter­pretabil­ity?

right..enough?Apr 13, 2024, 3:09 AM
7 points

3 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

On the nat­u­ral­is­tic study of the lin­guis­tic be­hav­ior of ar­tifi­cial intelligence

Bill BenzonJan 3, 2023, 9:06 AM
1 point

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

From LLM to LLK: A New Frame­work for Hon­est AI and Emo­tional Responsibility

xsw123zaq1@gmail.comJun 17, 2025, 4:13 AM
0 points

0 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Speculation on Path-Dependence in Large Language Models.

NickyPJan 15, 2023, 8:42 PM
16 points

7 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Large Lan­guage Models Pass the Tur­ing Test

Matrice JacobineApr 2, 2025, 5:41 AM
6 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

On Re­cent Re­sults in LLM La­tent Reasoning

Rauno ArikeMar 31, 2025, 11:06 AM
35 points

13 votes

Overall karma indicates overall quality.

6 comments13 min readLW link

Lan­guage Models Don’t Learn the Phys­i­cal Man­i­fes­ta­tion of Language

Feb 22, 2024, 6:52 PM
39 points

18 votes

Overall karma indicates overall quality.

23 comments1 min readLW link
(arxiv.org)

Does Re­in­force­ment Learn­ing Really In­cen­tivize Rea­son­ing Ca­pac­ity in LLMs Beyond the Base Model?

Matrice JacobineApr 24, 2025, 2:11 PM
12 points

5 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(limit-of-rlvr.github.io)

Mind the Co­her­ence Gap: Les­sons from Steer­ing Llama with Goodfire

eitan sprejerMay 9, 2025, 9:29 PM
4 points

3 votes

Overall karma indicates overall quality.

1 comment6 min readLW link

Lifel­og­ging for Align­ment & Immortality

Dev.ErrataAug 17, 2024, 11:42 PM
13 points

5 votes

Overall karma indicates overall quality.

3 comments7 min readLW link

Emo­tional at­tach­ment to AIs opens doors to problems

Igor IvanovJan 22, 2023, 8:28 PM
20 points

15 votes

Overall karma indicates overall quality.

10 comments4 min readLW link

An exploration of GPT-2’s embedding weights

Adam ScherlisDec 13, 2022, 12:46 AM
44 points

25 votes

Overall karma indicates overall quality.

4 comments10 min readLW link

LLMs stifle cre­ativity, elimi­nate op­por­tu­ni­ties for serendipi­tous dis­cov­ery and dis­rupt in­ter­gen­er­a­tional trans­fer of wisdom

GhdzAug 5, 2024, 6:27 PM
6 points

5 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Can 7B-8B LLMs judge their own home­work?

dereshevFeb 1, 2025, 8:29 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Is This Lie De­tec­tor Really Just a Lie De­tec­tor? An In­ves­ti­ga­tion of LLM Probe Speci­fic­ity.

Josh LevyJun 4, 2024, 3:45 PM
40 points

14 votes

Overall karma indicates overall quality.

0 comments18 min readLW link

Japanese as a High-Resolution Lens for LLMs: Why Japanese-Trained LLMs Might Be Uniquely Sensitive

opApr 23, 2025, 4:34 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

The Dark Arts of To­k­eniza­tion or: How I learned to start wor­ry­ing and love LLMs’ un­de­coded outputs

LovreOct 17, 2025, 4:43 PM
42 points

17 votes

Overall karma indicates overall quality.

10 comments26 min readLW link

Can LLMs Si­mu­late In­ter­nal Eval­u­a­tion? A Case Study in Self-Gen­er­ated Recommendations

The Neutral MindMay 1, 2025, 7:04 PM
4 points

2 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

[Question] Where to begin in ML/AI?

Jake the StudentApr 6, 2023, 8:45 PM
9 points

4 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

Sep 14, 2023, 1:40 AM
32 points

23 votes

Overall karma indicates overall quality.

7 comments8 min readLW link
(far.ai)

En­ergy-Based Trans­form­ers are Scal­able Learn­ers and Thinkers

Matrice JacobineJul 8, 2025, 1:44 PM
7 points

3 votes

Overall karma indicates overall quality.

5 comments1 min readLW link
(energy-based-transformers.github.io)

A poem writ­ten by a fancy autocomplete

Christopher KingApr 20, 2023, 2:31 AM
1 point

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

LLM Self-Refer­ence Lan­guage in Mul­tilin­gual vs English-Cen­tric Models

dwmdOct 22, 2025, 12:44 PM
2 points

2 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

I Have No Mouth but I Must Speak

JackApr 5, 2025, 7:42 AM
7 points

5 votes

Overall karma indicates overall quality.

8 comments8 min readLW link

Scal­able And Trans­fer­able Black-Box Jailbreaks For Lan­guage Models Via Per­sona Modulation

Nov 7, 2023, 5:59 PM
38 points

22 votes

Overall karma indicates overall quality.

2 comments2 min readLW link
(arxiv.org)

[Linkpost] Faith and Fate: Limits of Trans­form­ers on Compositionality

Joe KwonJun 16, 2023, 3:04 PM
19 points

9 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(arxiv.org)

[Linkpost] Map­ping Brains with Lan­guage Models: A Survey

Bogdan Ionut CirsteaJun 16, 2023, 9:49 AM
5 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

No, re­ally, it pre­dicts next to­kens.

simonApr 18, 2023, 3:47 AM
58 points

46 votes

Overall karma indicates overall quality.

55 comments3 min readLW link

La­tent Space Col­lapse? Un­der­stand­ing the Effects of Nar­row Fine-Tun­ing on LLMs

tenseisohamFeb 28, 2025, 8:22 PM
3 points

3 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

Live Con­ver­sa­tional Threads: Not an AI Notetaker

adigaOct 19, 2025, 8:56 AM
4 points

2 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

Con­tra LeCun on “Au­tore­gres­sive LLMs are doomed”

rotatingpaguroApr 10, 2023, 4:05 AM
20 points

10 votes

Overall karma indicates overall quality.

20 comments8 min readLW link

Why Copi­lot Ac­cel­er­ates Timelines

Michaël TrazziApr 26, 2022, 10:06 PM
35 points

14 votes

Overall karma indicates overall quality.

14 comments7 min readLW link

Imag­ine a world where Microsoft em­ploy­ees used Bing

Christopher KingMar 31, 2023, 6:36 PM
6 points

7 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Many Com­mon Prob­lems are NP-Hard, and Why that Mat­ters for AI

Andrew Keenan RichardsonMar 26, 2025, 9:51 PM
5 points

5 votes

Overall karma indicates overall quality.

9 comments5 min readLW link

The Voice Con­tinued Be­cause It Was Questioned

KiyoshiSasanoApr 28, 2025, 12:18 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

In­ter­pretabil­ity is the best path to alignment

Arch223Sep 5, 2025, 4:37 AM
2 points

7 votes

Overall karma indicates overall quality.

6 comments5 min readLW link

OpenAI in­tro­duces func­tion call­ing for GPT-4

Jun 20, 2023, 1:58 AM
24 points

14 votes

Overall karma indicates overall quality.

3 comments4 min readLW link
(openai.com)

Un­faith­ful chain-of-thought as nudged reasoning

Jul 22, 2025, 10:35 PM
54 points

18 votes

Overall karma indicates overall quality.

3 comments10 min readLW link

LLM Gen­er­al­ity is a Timeline Crux

eggsyntaxJun 24, 2024, 12:52 PM
219 points

117 votes

Overall karma indicates overall quality.

119 comments7 min readLW link

Sparks of Consciousness

Charlie SandersNov 13, 2024, 4:58 AM
2 points

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link
(www.dailymicrofiction.com)

AGI Ruin: A List of Lethalities

Eliezer YudkowskyJun 5, 2022, 10:05 PM
956 points

578 votes

Overall karma indicates overall quality.

711 comments30 min readLW link3 reviews

Smoke with­out fire is scary

Adam JermynOct 4, 2022, 9:08 PM
52 points

28 votes

Overall karma indicates overall quality.

22 comments4 min readLW link

A short cri­tique of Omo­hun­dro’s “Ba­sic AI Drives”

Soumyadeep BoseDec 19, 2024, 7:19 PM
6 points

5 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

In­ter­lin­gua-llm

Никифор МалковAug 30, 2025, 11:04 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

[CS 2881r] Can We Prompt Our Way to Safety? Com­par­ing Sys­tem Prompt Styles and Post-Train­ing Effects on Safety Benchmarks

hughvdOct 28, 2025, 2:38 AM
5 points

2 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

[Question] Injecting noise into GPT to get multiple answers

bipoloFeb 22, 2023, 8:02 PM
1 point

1 vote

Overall karma indicates overall quality.

1 comment1 min readLW link

Con­cept Poi­son­ing: Prob­ing LLMs with­out probes

Aug 5, 2025, 5:00 PM
60 points

27 votes

Overall karma indicates overall quality.

5 comments13 min readLW link

Mir­ror Thinking

C.M. AurinMar 24, 2025, 3:34 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments6 min readLW link

Why I Think the Cur­rent Tra­jec­tory of AI Re­search has Low P(doom) - LLMs

GaPaApr 1, 2023, 8:35 PM
2 points

2 votes

Overall karma indicates overall quality.

1 comment10 min readLW link

Align­ment Can Re­duce Perfor­mance on Sim­ple Eth­i­cal Questions

Daan HenselmansFeb 3, 2025, 7:35 PM
16 points

9 votes

Overall karma indicates overall quality.

7 comments6 min readLW link

[Question] Have LLMs Gen­er­ated Novel In­sights?

Feb 23, 2025, 6:22 PM
167 points

81 votes

Overall karma indicates overall quality.

41 comments2 min readLW link

What is the func­tional role of SAE er­rors?

Jun 20, 2025, 6:11 PM
12 points

7 votes

Overall karma indicates overall quality.

5 comments38 min readLW link

Bi­as­ing VLM Re­sponse with Vi­sual Stimuli

Jaehyuk LimOct 3, 2024, 6:04 PM
5 points

2 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Gra­di­ent Des­cent on To­ken In­put Embeddings

KAPJun 24, 2025, 8:24 PM
8 points

3 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

larger lan­guage mod­els may dis­ap­point you [or, an eter­nally un­finished draft]

nostalgebraistNov 26, 2021, 11:08 PM
261 points

114 votes

Overall karma indicates overall quality.

31 comments31 min readLW link2 reviews

Towards Un­der­stand­ing the Rep­re­sen­ta­tion of Belief State Geom­e­try in Transformers

Karthik ViswanathanApr 18, 2025, 12:39 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

The Mir­ror Mis­match: A probe for Cog­ni­tive Asym­me­try in AI

recursive chillerJun 10, 2025, 2:14 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

De­cep­tion and Jailbreak Se­quence: 1. Iter­a­tive Refine­ment Stages of De­cep­tion in LLMs

Aug 22, 2024, 7:32 AM
23 points

9 votes

Overall karma indicates overall quality.

1 comment21 min readLW link

Ano­ma­lous To­kens in Deep­Seek-V3 and r1

henryJan 25, 2025, 10:55 PM
144 points

81 votes

Overall karma indicates overall quality.

3 comments7 min readLW link

Checking public figures on whether they “answered the question”: quick analysis from the Harris/Trump debate, and a proposal

david reinsteinSep 11, 2024, 8:25 PM
8 points

8 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(open.substack.com)

My model of what is go­ing on with LLMs

Cole WyethFeb 13, 2025, 3:43 AM
110 points

62 votes

Overall karma indicates overall quality.

49 comments7 min readLW link

Causal con­fu­sion as an ar­gu­ment against the scal­ing hypothesis

Jun 20, 2022, 10:54 AM
86 points

35 votes

Overall karma indicates overall quality.

30 comments15 min readLW link

Syd­ney the Bin­gena­tor Can’t Think, But It Still Threat­ens People

Valentin BaltadzhievFeb 20, 2023, 6:37 PM
−3 points

5 votes

Overall karma indicates overall quality.

2 comments8 min readLW link

Hut­ter-Prize for Prompts

rokosbasiliskMar 24, 2023, 9:26 PM
5 points

6 votes

Overall karma indicates overall quality.

10 comments1 min readLW link

The world where LLMs are possible

Ape in the coatJul 10, 2023, 8:00 AM
20 points

10 votes

Overall karma indicates overall quality.

10 comments3 min readLW link

LM Si­tu­a­tional Aware­ness, Eval­u­a­tion Pro­posal: Vio­lat­ing Imitation

Jacob PfauApr 26, 2023, 10:53 PM
16 points

9 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Pre­face to the Se­quence on LLM Psychology

Quentin FEUILLADE--MONTIXINov 7, 2023, 4:12 PM
33 points

21 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Microsoft and Google us­ing LLMs for Cybersecurity

PhosphorousMay 18, 2023, 5:42 PM
6 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Ex­plor­ing vo­cab­u­lary al­ign­ment of neu­rons in Llama-3.2-1B

SergiiJun 7, 2025, 11:20 AM
4 points

4 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(grgv.xyz)

GPT-3 Catch­ing Fish in Morse Code

Megan KinnimentJun 30, 2022, 9:22 PM
117 points

69 votes

Overall karma indicates overall quality.

27 comments8 min readLW link

Co­her­ence Ther­apy with LLMs—quick demo

Chris LakinAug 14, 2023, 3:34 AM
19 points

14 votes

Overall karma indicates overall quality.

11 comments1 min readLW link

Put­ting mul­ti­modal LLMs to the Tetris test

Feb 1, 2024, 4:02 PM
30 points

21 votes

Overall karma indicates overall quality.

5 comments7 min readLW link

What is scaf­fold­ing?

Mar 27, 2025, 9:06 AM
10 points

4 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(aisafety.info)

ChatGPT tells 20 ver­sions of its pro­to­typ­i­cal story, with a short note on method

Bill BenzonOct 14, 2023, 3:27 PM
7 points

6 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Does ChatGPT know what a tragedy is?

Bill BenzonDec 31, 2023, 7:10 AM
2 points

6 votes

Overall karma indicates overall quality.

4 comments5 min readLW link

How evolu­tion­ary lineages of LLMs can plan their own fu­ture and act on these plans

Roman LeventovDec 25, 2022, 6:11 PM
39 points

14 votes

Overall karma indicates overall quality.

16 comments8 min readLW link

In­ter­view with Vanessa Kosoy on the Value of The­o­ret­i­cal Re­search for AI

WillPetilloDec 4, 2023, 10:58 PM
37 points

13 votes

Overall karma indicates overall quality.

0 comments35 min readLW link

In­ner Misal­ign­ment in “Si­mu­la­tor” LLMs

Adam ScherlisJan 31, 2023, 8:33 AM
84 points

36 votes

Overall karma indicates overall quality.

12 comments4 min readLW link

How should Deep­Mind’s Chin­chilla re­vise our AI fore­casts?

Cleo NardoSep 15, 2022, 5:54 PM
35 points

19 votes

Overall karma indicates overall quality.

12 comments13 min readLW link

The many failure modes of con­sumer-grade LLMs

dereshevJan 26, 2025, 7:01 PM
2 points

2 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

[simu­la­tion] 4chan user claiming to be the at­tor­ney hired by Google’s sen­tient chat­bot LaMDA shares wild de­tails of encounter

janusNov 10, 2022, 9:39 PM
19 points

16 votes

Overall karma indicates overall quality.

1 comment13 min readLW link
(generative.ink)

SAEs (usu­ally) Trans­fer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
67 points

23 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

A pos­si­ble check against mo­ti­vated rea­son­ing us­ing elicit.org

david reinsteinMay 18, 2022, 8:52 PM
3 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Trig­ger­ing Reflec­tive Fal­lback: A Case Study in Claude’s Si­mu­lated Self-Model Failure

unmodeled.tylerJul 8, 2025, 7:33 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Lan­guage Tier Lock and Poetic Con­tam­i­na­tion in GPT-4o: A Field Report

許皓翔Jun 11, 2025, 5:24 PM
0 points

0 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens

Martin FellMay 9, 2023, 2:36 PM
26 points

17 votes

Overall karma indicates overall quality.

9 comments6 min readLW link

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLMs

Winnie YangAug 28, 2024, 8:41 AM
7 points

2 votes

Overall karma indicates overall quality.

2 comments31 min readLW link

[Preprint] Pre­train­ing Lan­guage Models with Hu­man Preferences

GiulioFeb 21, 2023, 11:44 AM
12 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

[Linkpost] Large Lan­guage Models Con­verge on Brain-Like Word Representations

Bogdan Ionut CirsteaJun 11, 2023, 11:20 AM
36 points

18 votes

Overall karma indicates overall quality.

12 comments1 min readLW link

[Linkpost] A shared lin­guis­tic space for trans­mit­ting our thoughts from brain to brain in nat­u­ral conversations

Bogdan Ionut CirsteaJul 1, 2023, 1:57 PM
17 points

9 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Clas­sify­ing rep­re­sen­ta­tions of sparse au­toen­coders (SAEs)

AnnahNov 17, 2023, 1:54 PM
15 points

7 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

Shh, don’t tell the AI it’s likely to be evil

naterushDec 6, 2022, 3:35 AM
19 points

9 votes

Overall karma indicates overall quality.

9 comments1 min readLW link

CRMArena-Pro: Holis­tic Assess­ment of LLM Agents Across Di­verse Busi­ness Sce­nar­ios and Interactions

AnnapurnaJun 12, 2025, 7:53 PM
8 points

3 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

The Limit of Lan­guage Models

DragonGodJan 6, 2023, 11:53 PM
44 points

31 votes

Overall karma indicates overall quality.

26 comments4 min readLW link

Retrieval Aug­mented Genesis

João Ribeiro MedeirosOct 1, 2024, 8:18 PM
6 points

5 votes

Overall karma indicates overall quality.

0 comments29 min readLW link

[Linkpost] Large lan­guage mod­els con­verge to­ward hu­man-like con­cept organization

Bogdan Ionut CirsteaSep 2, 2023, 6:00 AM
22 points

7 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

LLMs and hal­lu­ci­na­tion, like white on rice?

Bill BenzonApr 14, 2023, 7:53 PM
5 points

2 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

What are the limits of su­per­in­tel­li­gence?

rainyApr 27, 2023, 6:29 PM
4 points

5 votes

Overall karma indicates overall quality.

3 comments5 min readLW link

In­flec­tion.ai is a ma­jor AGI lab

Nikola JurkovicAug 9, 2023, 1:05 AM
137 points

95 votes

Overall karma indicates overall quality.

13 comments2 min readLW link

What’s go­ing on with Per-Com­po­nent Weight Up­dates?

4gateAug 22, 2024, 9:22 PM
1 point

3 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

[Question] Are nested jailbreaks in­evitable?

judsonMar 17, 2023, 5:43 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Lan­guage mod­els can ex­plain neu­rons in lan­guage models

nzMay 9, 2023, 5:29 PM
23 points

12 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(openai.com)

Struc­tural Res­o­nance Emit­ter: When GPT Stops Eval­u­at­ing and Starts Reconstructing

KiyoshiSasanoApr 20, 2025, 2:30 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

PvsNp Refute

Jai DozMay 8, 2025, 6:56 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments21 min readLW link

A vi­sual anal­ogy for text gen­er­a­tion by LLMs?

Bill BenzonDec 16, 2023, 5:58 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

GPT-4o Guardrails Gone: Data Poi­son­ing & Jailbreak-Tuning

Nov 1, 2024, 12:10 AM
18 points

8 votes

Overall karma indicates overall quality.

0 comments6 min readLW link
(far.ai)

Re­search agenda: Can trans­form­ers do sys­tem 2 think­ing?

p.b.Apr 6, 2022, 1:31 PM
20 points

9 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Avoid­ing jailbreaks by dis­cour­ag­ing their rep­re­sen­ta­tion in ac­ti­va­tion space

Guido BergmanSep 27, 2024, 5:49 PM
8 points

7 votes

Overall karma indicates overall quality.

2 comments9 min readLW link

Yann LeCun, A Path Towards Au­tonomous Ma­chine In­tel­li­gence [link]

Bill BenzonJun 27, 2022, 11:29 PM
5 points

7 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

Why does Claude Speak Byzan­tine Mu­sic No­ta­tion?

Lennart FinkeMar 31, 2025, 3:13 PM
18 points

8 votes

Overall karma indicates overall quality.

2 comments3 min readLW link

The Com­mon Pile and Comma-v0.1

Trevor Hill-HandJun 6, 2025, 7:20 PM
3 points

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Emotion Is Structure: Toward Recursive Alignment Through Human–AI Co-Creation

thesignalthatcouldntbeheardAug 3, 2025, 5:19 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

A public archive of these interactions, with annotated examples, is available here: https://github.com/0118young/gpt-kyeol-archive

0118youngMay 29, 2025, 5:44 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Take­aways from our ro­bust in­jury clas­sifier pro­ject [Red­wood Re­search]

dmzSep 17, 2022, 3:55 AM
143 points

64 votes

Overall karma indicates overall quality.

12 comments6 min readLW link1 review

Lan­guage and Ca­pa­bil­ities: Test­ing LLM Math­e­mat­i­cal Abil­ities Across Languages

Ethan EdwardsApr 4, 2024, 1:18 PM
24 points

13 votes

Overall karma indicates overall quality.

2 comments36 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:08 AM
12 points

8 votes

Overall karma indicates overall quality.

10 comments30 min readLW link

An alternative to PPO towards alignment

ml hkustApr 17, 2023, 5:58 PM
2 points

3 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

Policy En­tropy, Learn­ing, and Align­ment (Or Maybe Your LLM Needs Ther­apy)

sdetureMay 31, 2025, 10:09 PM
15 points

6 votes

Overall karma indicates overall quality.

6 comments8 min readLW link

ChatGPT Plays 20 Ques­tions [some­times needs help]

Bill BenzonOct 17, 2023, 5:30 PM
5 points

2 votes

Overall karma indicates overall quality.

3 comments12 min readLW link

[Linkpost] De­cep­tion Abil­ities Emerged in Large Lan­guage Models

Bogdan Ionut CirsteaAug 3, 2023, 5:28 PM
12 points

8 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

[Question] How does OpenAI’s lan­guage model af­fect our AI timeline es­ti­mates?

jimrandomhFeb 15, 2019, 3:11 AM
50 points

16 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

Eval­u­at­ing LLaMA 3 for poli­ti­cal syco­phancy

alma.liezengaSep 28, 2024, 7:02 PM
2 points

2 votes

Overall karma indicates overall quality.

2 comments6 min readLW link

In­ves­ti­gat­ing the Abil­ity of LLMs to Rec­og­nize Their Own Writing

Jul 30, 2024, 3:41 PM
32 points

9 votes

Overall karma indicates overall quality.

0 comments15 min readLW link

Xanadu, GPT, and Beyond: An ad­ven­ture of the mind

Bill BenzonAug 27, 2023, 4:19 PM
2 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Spec­u­la­tive in­fer­ences about path de­pen­dence in LLM su­per­vised fine-tun­ing from re­sults on lin­ear mode con­nec­tivity and model souping

RobertKirkJul 20, 2023, 9:56 AM
39 points

17 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

CAIS-in­spired ap­proach to­wards safer and more in­ter­pretable AGIs

Peter HroššoMar 27, 2023, 2:36 PM
13 points

5 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

RL with KL penalties is bet­ter seen as Bayesian inference

May 25, 2022, 9:23 AM
115 points

47 votes

Overall karma indicates overall quality.

17 comments12 min readLW link

Help ARC eval­u­ate ca­pa­bil­ities of cur­rent lan­guage mod­els (still need peo­ple)

Beth BarnesJul 19, 2022, 4:55 AM
95 points

36 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

[AN #144]: How lan­guage mod­els can also be fine­tuned for non-lan­guage tasks

Rohin ShahApr 2, 2021, 5:20 PM
19 points

8 votes

Overall karma indicates overall quality.

0 comments6 min readLW link
(mailchi.mp)

An ex­per­i­ment on hid­den cognition

Olli JärviniemiJul 22, 2024, 3:26 AM
25 points

9 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Trans­former Ar­chi­tec­ture Choice for Re­sist­ing Prompt In­jec­tion and Jail-Break­ing Attacks

RogerDearnaleyMay 21, 2023, 8:29 AM
9 points

3 votes

Overall karma indicates overall quality.

1 comment4 min readLW link

Pre­train­ing Lan­guage Models with Hu­man Preferences

Feb 21, 2023, 5:57 PM
135 points

59 votes

Overall karma indicates overall quality.

20 comments11 min readLW link2 reviews

Emer­gent Misal­ign­ment on a Budget

Jun 8, 2025, 3:28 PM
54 points

29 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

LLM cog­ni­tion is prob­a­bly not hu­man-like

Max HMay 8, 2023, 1:22 AM
26 points

13 votes

Overall karma indicates overall quality.

15 comments7 min readLW link

Pre­dict­ing AGI by the Tur­ing Test

Yuxi_LiuJan 22, 2024, 4:22 AM
21 points

6 votes

Overall karma indicates overall quality.

2 comments10 min readLW link
(yuxi-liu-wired.github.io)

Google Deep­Mind’s RT-2

SandXboxAug 11, 2023, 11:26 AM
9 points

5 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(robotics-transformer2.github.io)

The Pat­tern Recog­ni­tion Frame­work: A New Ap­proach to AI Con­scious­ness and Alignment

Easa AhmadzaiJul 9, 2025, 5:03 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Agen­tic Lan­guage Model Memes

FactorialCodeAug 1, 2020, 6:03 PM
16 points

6 votes

Overall karma indicates overall quality.

1 comment2 min readLW link

LLMs could be as con­scious as hu­man em­u­la­tions, potentially

CanalettoApr 30, 2024, 11:36 AM
15 points

11 votes

Overall karma indicates overall quality.

15 comments3 min readLW link

Con­tra Hofs­tadter on GPT-3 Nonsense

ricticJun 15, 2022, 9:53 PM
238 points

145 votes

Overall karma indicates overall quality.

24 comments2 min readLW link

Au­to­mat­i­cally find­ing fea­ture vec­tors in the OV cir­cuits of Trans­form­ers with­out us­ing probing

Jacob DunefskySep 12, 2023, 5:38 PM
16 points

10 votes

Overall karma indicates overall quality.

2 comments29 min readLW link

A re­sponse to Con­jec­ture’s CoEm proposal

Kristian FreedApr 24, 2023, 5:23 PM
7 points

5 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Why LLMs Waste So Much Cog­ni­tive Band­width — and How to Fix It

LunarknotJul 3, 2025, 9:47 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Challenge pro­posal: small­est pos­si­ble self-hard­en­ing back­door for RLHF

Christopher KingJun 29, 2023, 4:56 PM
7 points

3 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Hu­man-level Full-Press Di­plo­macy (some bare facts).

Cleo NardoNov 22, 2022, 8:59 PM
50 points

25 votes

Overall karma indicates overall quality.

7 comments3 min readLW link

“Re­la­tional In­tel­li­gence Without Con­scious­ness: A Case Study in Emer­gent Hu­man–LLM Iden­tity Co-Creation”

the3rdcastlemanJul 19, 2025, 6:25 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments5 min readLW link

Ex­plo­ra­tion of Coun­ter­fac­tual Im­por­tance and At­ten­tion Heads

RealmbirdSep 30, 2025, 1:17 AM
12 points

4 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

In­tro­duc­ing Deepgeek

LigeiaApr 1, 2025, 4:41 PM
16 points

9 votes

Overall karma indicates overall quality.

1 comment4 min readLW link

A con­cep­tual pre­cur­sor to to­day’s lan­guage ma­chines [Shan­non]

Bill BenzonNov 15, 2023, 1:50 PM
24 points

11 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

What’s ChatGPT’s Fa­vorite Ice Cream Fla­vor? An In­ves­ti­ga­tion Into Syn­thetic Respondents

Greg RobisonFeb 9, 2024, 6:38 PM
19 points

7 votes

Overall karma indicates overall quality.

4 comments15 min readLW link

The idea that ChatGPT is sim­ply “pre­dict­ing” the next word is, at best, misleading

Bill BenzonFeb 20, 2023, 11:32 AM
55 points

88 votes

Overall karma indicates overall quality.

88 comments5 min readLW link

Microsoft and OpenAI, stop tel­ling chat­bots to role­play as AI

hold_my_fishFeb 17, 2023, 7:55 PM
50 points

40 votes

Overall karma indicates overall quality.

10 comments1 min readLW link

Data and “to­kens” a 30 year old hu­man “trains” on

Jose Miguel Cruz y CelisMay 23, 2023, 5:34 AM
16 points

11 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

Sparse Au­toen­coders Work on At­ten­tion Layer Outputs

Jan 16, 2024, 12:26 AM
85 points

33 votes

Overall karma indicates overall quality.

9 comments18 min readLW link

In­trin­sic Di­men­sion of Prompts in LLMs

Karthik ViswanathanFeb 14, 2025, 7:02 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

The Com­pleat Cybornaut

May 19, 2023, 8:44 AM
66 points

34 votes

Overall karma indicates overall quality.

2 comments16 min readLW link

Work­shop: In­ter­pretabil­ity in LLMs us­ing Geo­met­ric and Statis­ti­cal Methods

Karthik ViswanathanFeb 22, 2025, 9:39 AM
17 points

7 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Hard-Cod­ing Neu­ral Computation

MadHatterDec 13, 2021, 4:35 AM
34 points

16 votes

Overall karma indicates overall quality.

8 comments27 min readLW link

Who models the models that model models? An exploration of GPT-3’s in-context model fitting ability

LovreJun 7, 2022, 7:37 PM
112 points

67 votes

Overall karma indicates overall quality.

16 comments9 min readLW link

A poem co-writ­ten by ChatGPT

SherrinfordFeb 16, 2023, 10:17 AM
13 points

3 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

In­duc­ing hu­man-like bi­ases in moral rea­son­ing LMs

Feb 20, 2024, 4:28 PM
23 points

16 votes

Overall karma indicates overall quality.

3 comments14 min readLW link

Re­ward hack­ing is be­com­ing more so­phis­ti­cated and de­liber­ate in fron­tier LLMs

Kei Nishimura-GasparianApr 24, 2025, 4:03 PM
95 points

32 votes

Overall karma indicates overall quality.

6 comments1 min readLW link

Toward a Hu­man Hy­brid Lan­guage for En­hanced Hu­man-Ma­chine Com­mu­ni­ca­tion: Ad­dress­ing the AI Align­ment Problem

Andndn DheudndAug 14, 2024, 10:19 PM
−4 points

8 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

Does ro­bust­ness im­prove with scale?

Jul 25, 2024, 8:55 PM
14 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(far.ai)

See­ing Ghosts by GPT-4

Christopher KingMay 20, 2023, 12:11 AM
−13 points

5 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Grad­ual take­off, fast failure

Max HMar 16, 2023, 10:02 PM
15 points

7 votes

Overall karma indicates overall quality.

4 comments5 min readLW link

End-to-end hack­ing with lan­guage models

tchauvinApr 5, 2024, 3:06 PM
29 points

18 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

If lan­guage is for com­mu­ni­ca­tion, what does that im­ply about LLMs?

Bill BenzonMay 12, 2024, 2:55 AM
10 points

5 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Con­nect­ing the Dots: LLMs can In­fer & Ver­bal­ize La­tent Struc­ture from Train­ing Data

Jun 21, 2024, 3:54 PM
163 points

56 votes

Overall karma indicates overall quality.

13 comments8 min readLW link
(arxiv.org)

Sys­tem­atic run­away-op­ti­miser-like LLM failure modes on Biolog­i­cally and Eco­nom­i­cally al­igned AI safety bench­marks for LLMs with sim­plified ob­ser­va­tion for­mat (BioBlue)

Mar 16, 2025, 11:23 PM
45 points

12 votes

Overall karma indicates overall quality.

8 comments12 min readLW link

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaleyFeb 14, 2024, 7:10 AM
41 points

12 votes

Overall karma indicates overall quality.

12 comments31 min readLW link

Lan­guage, logic, and the fu­ture of AI: An early-Wittgen­stei­nian perspective

Konstantinos TsermenidisMay 25, 2025, 2:23 PM
0 points

0 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

[Question] Can any LLM be rep­re­sented as an Equa­tion?

Valentin BaltadzhievMar 14, 2024, 9:51 AM
1 point

4 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

Race and Gen­der Bias As An Ex­am­ple of Un­faith­ful Chain of Thought in the Wild

Jul 2, 2025, 4:35 PM
183 points

96 votes

Overall karma indicates overall quality.

25 comments4 min readLW link

Meta re­leases Llama-4 herd of models

winstonBosanApr 5, 2025, 7:51 PM
14 points

6 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Emer­gent Analog­i­cal Rea­son­ing in Large Lan­guage Models

Roman LeventovMar 22, 2023, 5:18 AM
13 points

4 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(arxiv.org)

the ten­sor is a lonely place

jml6Mar 27, 2023, 6:22 PM
−11 points

4 votes

Overall karma indicates overall quality.

0 comments4 min readLW link
(ekjsgrjelrbno.substack.com)

LLM Pareto Fron­tier But Live

winstonBosanApr 24, 2025, 9:22 PM
8 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

Novel Idea Gen­er­a­tion in LLMs: Judg­ment as Bottleneck

Davey MorseApr 19, 2025, 3:37 PM
−2 points

4 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

How LLMs Work, in the Style of The Economist

utilistrutilApr 22, 2024, 7:06 PM
0 points

10 votes

Overall karma indicates overall quality.

0 comments2 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tameraAug 3, 2022, 12:03 PM
138 points

72 votes

Overall karma indicates overall quality.

23 comments6 min readLW link

LLM keys—A Pro­posal of a Solu­tion to Prompt In­jec­tion Attacks

Peter HroššoDec 7, 2023, 5:36 PM
1 point

2 votes

Overall karma indicates overall quality.

2 comments1 min readLW link

The Codex Skep­tic FAQ

Michaël TrazziAug 24, 2021, 4:01 PM
49 points

25 votes

Overall karma indicates overall quality.

24 comments2 min readLW link

A Lived Align­ment Loop: Sym­bolic Emer­gence and Emo­tional Co­her­ence from Un­struc­tured ChatGPT Reflection

BradCLJun 17, 2025, 12:11 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

On pre­cise out-of-con­text steering

Olli JärviniemiMay 3, 2024, 9:41 AM
9 points

6 votes

Overall karma indicates overall quality.

6 comments3 min readLW link

Cri­tique of some re­cent philos­o­phy of LLMs’ minds

Roman LeventovJan 20, 2023, 12:53 PM
52 points

25 votes

Overall karma indicates overall quality.

8 comments20 min readLW link

Early situ­a­tional aware­ness and its im­pli­ca­tions, a story

Jacob PfauFeb 6, 2023, 8:45 PM
29 points

8 votes

Overall karma indicates overall quality.

6 comments3 min readLW link

Many ar­gu­ments for AI x-risk are wrong

TurnTroutMar 5, 2024, 2:31 AM
171 points

145 votes

Overall karma indicates overall quality.

87 comments12 min readLW link

[Question] Why is Gem­ini tel­ling the user to die?

BurnyNov 18, 2024, 1:44 AM
13 points

6 votes

Overall karma indicates overall quality.

1 comment1 min readLW link

[Question] Goals of model vs. goals of simu­lacra?

dr_sApr 12, 2023, 1:02 PM
5 points

4 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

Look­ing be­yond Everett in mul­ti­ver­sal views of LLMs

kromemMay 29, 2024, 12:35 PM
10 points

4 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

[Question] What ex­per­i­ment set­tles the Gary Mar­cus vs Ge­offrey Hin­ton de­bate?

Valentin BaltadzhievFeb 14, 2024, 9:06 AM
12 points

4 votes

Overall karma indicates overall quality.

8 comments1 min readLW link

Can a se­man­tic com­pres­sion ker­nel like WFGY im­prove LLM al­ign­ment and in­sti­tu­tional ro­bust­ness?

onestardaoJul 18, 2025, 2:56 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

En­tan­gle­ment and in­tu­ition about words and mean­ing

Bill BenzonOct 4, 2023, 2:16 PM
4 points

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

I Don’t Use AI — I Reflect With It

badjack badjackMay 3, 2025, 2:45 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

My agenda for re­search into trans­former ca­pa­bil­ities—Introduction

p.b.Apr 5, 2022, 9:23 PM
11 points

6 votes

Overall karma indicates overall quality.

1 comment3 min readLW link

The Lan­guage Bot­tle­neck in AI Rea­son­ing: Are We For­get­ting to Think?

WotakerMar 8, 2025, 1:44 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments7 min readLW link

In­stan­ti­at­ing an agent with GPT-4 and text-davinci-003

Max HMar 19, 2023, 11:57 PM
13 points

11 votes

Overall karma indicates overall quality.

3 comments32 min readLW link

Re­la­tion­ships among words, met­al­in­gual defi­ni­tion, and interpretability

Bill BenzonJun 7, 2024, 7:18 PM
2 points

3 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Towards a Ty­pol­ogy of Strange LLM Chains-of-Thought

1a3ornOct 9, 2025, 10:02 PM
290 points

136 votes

Overall karma indicates overall quality.

27 comments9 min readLW link

Large lan­guage mod­els can provide “nor­ma­tive as­sump­tions” for learn­ing hu­man preferences

Stuart_ArmstrongJan 2, 2023, 7:39 PM
29 points

9 votes

Overall karma indicates overall quality.

12 comments3 min readLW link

Are (at least some) Large Lan­guage Models Holo­graphic Me­mory Stores?

Bill BenzonOct 20, 2023, 1:07 PM
11 points

6 votes

Overall karma indicates overall quality.

4 comments6 min readLW link

Open Source LLMs Can Now Ac­tively Lie

Josh LevyJun 1, 2023, 10:03 PM
6 points

4 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

Cat­e­gor­i­cal Or­ga­ni­za­tion in Me­mory: ChatGPT Or­ga­nizes the 665 Topic Tags from My New Sa­vanna Blog

Bill BenzonDec 14, 2023, 1:02 PM
0 points

6 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

PaLM in “Ex­trap­o­lat­ing GPT-N perfor­mance”

Lukas FinnvedenApr 6, 2022, 1:05 PM
85 points

43 votes

Overall karma indicates overall quality.

19 comments2 min readLW link

Ex­per­i­ments with an al­ter­na­tive method to pro­mote spar­sity in sparse autoencoders

Eoin FarrellApr 15, 2024, 6:21 PM
29 points

14 votes

Overall karma indicates overall quality.

7 comments12 min readLW link

On the ge­o­met­ri­cal Na­ture of Insight

Giuseppe BirardiJul 16, 2025, 7:12 PM
3 points

2 votes

Overall karma indicates overall quality.

0 comments41 min readLW link

Image Hi­jacks: Ad­ver­sar­ial Images can Con­trol Gen­er­a­tive Models at Runtime

Sep 20, 2023, 3:23 PM
58 points

27 votes

Overall karma indicates overall quality.

9 comments1 min readLW link
(arxiv.org)

False Pos­i­tives in En­tity-Level Hal­lu­ci­na­tion De­tec­tion: A Tech­ni­cal Challenge

MaxKamacheeJan 14, 2025, 7:22 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibsDec 5, 2022, 1:36 PM
20 points

15 votes

Overall karma indicates overall quality.

11 comments2 min readLW link

Philo­soph­i­cal Jailbreaks: Demo of LLM Nihilism

artkpvJun 4, 2025, 12:03 PM
3 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Boundary Con­di­tions: A Solu­tion to the Sym­bol Ground­ing Prob­lem, and a Warning

ISCApr 8, 2025, 6:42 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments5 min readLW link

Con­jec­ture: Emer­gent φ is prov­able in Large Lan­guage Models

BarnicleBarnOct 18, 2025, 10:38 PM
−3 points

2 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

An LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:12 AM
16 points

5 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

The mi­s­un­der­stood role of the phys­i­cal world. Why AI still can’t mas­ter math or code

NewAiParadigmsSep 20, 2025, 3:14 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments63 min readLW link

Is GPT-N bounded by hu­man ca­pa­bil­ities? No.

Cleo NardoOct 17, 2022, 11:26 PM
49 points

18 votes

Overall karma indicates overall quality.

8 comments2 min readLW link

Fa­vorite col­ors of some LLMs.

CanalettoDec 31, 2024, 9:22 PM
10 points

8 votes

Overall karma indicates overall quality.

3 comments7 min readLW link

Think­ing Without Out­put: Toward Mo­dal Cog­ni­tion in Lan­guage Models

Jeffrie PolisMay 9, 2025, 7:41 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Min­i­mal Prompt In­duc­tion of Self-Talk in Base LLMs

dwmdOct 15, 2025, 1:15 AM
2 points

2 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

An­a­lyz­ing how SAE fea­tures evolve across a for­ward pass

Nov 7, 2024, 10:07 PM
47 points

40 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

Why Scal­ing Creates “Out-of-Nowhere” Jumps

DeckardAug 14, 2025, 8:26 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Shal­low vs. Deep Think­ing—Why LLMs Fall Short

Taylor G. LuntSep 3, 2025, 3:26 PM
2 points

6 votes

Overall karma indicates overall quality.

4 comments11 min readLW link

Claude wants to be conscious

Joe KwonApr 13, 2024, 1:40 AM
2 points

8 votes

Overall karma indicates overall quality.

8 comments6 min readLW link

They gave LLMs ac­cess to physics simulators

ryan_bOct 17, 2022, 9:21 PM
50 points

28 votes

Overall karma indicates overall quality.

18 comments1 min readLW link
(arxiv.org)

Work­ing with AI: Mea­sur­ing the Oc­cu­pa­tional Im­pli­ca­tions of Gen­er­a­tive AI

AnnapurnaAug 9, 2025, 4:20 PM
5 points

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link
(jorgevelez.substack.com)

GPT-4.5 is Cog­ni­tive Em­pa­thy, Son­net 3.5 is Affec­tive Empathy

JackApr 16, 2025, 7:12 PM
15 points

10 votes

Overall karma indicates overall quality.

2 comments4 min readLW link

If it quacks like a duck...

RationalMindsetMar 26, 2023, 6:54 PM
−4 points

5 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

How I’m think­ing about GPT-N

delton137Jan 17, 2022, 5:11 PM
54 points

34 votes

Overall karma indicates overall quality.

21 comments18 min readLW link

Base LLMs re­fuse too

Sep 29, 2024, 4:04 PM
61 points

22 votes

Overall karma indicates overall quality.

20 comments10 min readLW link

The In­finite Choice Bar­rier: Why Al­gorith­mic AGI Is Math­e­mat­i­cally Im­pos­si­ble

ICBMaxMSJun 1, 2025, 4:12 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

The case for more am­bi­tious lan­guage model evals

JozdienJan 30, 2024, 12:01 AM
117 points

64 votes

Overall karma indicates overall quality.

30 comments5 min readLW link

An ad­ver­sar­ial ex­am­ple for Direct Logit At­tri­bu­tion: mem­ory man­age­ment in gelu-4l

Aug 30, 2023, 5:36 PM
17 points

7 votes

Overall karma indicates overall quality.

0 comments8 min readLW link
(arxiv.org)

[Question] What faith­ful­ness met­rics should gen­eral claims about CoT faith­ful­ness be based upon?

Rauno ArikeApr 8, 2025, 3:27 PM
24 points

8 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Sparse Au­toen­coder Fea­tures for Clas­sifi­ca­tions and Transferability

Shan23ChenFeb 18, 2025, 10:14 PM
5 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

ASI-ARCH: “Does this hold up?”

DataDeLaurierJul 26, 2025, 10:30 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Re­fusal mechanisms: ini­tial ex­per­i­ments with Llama-2-7b-chat

Dec 8, 2023, 5:08 PM
82 points

37 votes

Overall karma indicates overall quality.

7 comments7 min readLW link

RAND re­port finds no effect of cur­rent LLMs on vi­a­bil­ity of bioter­ror­ism attacks

StellaAthenaJan 25, 2024, 7:17 PM
94 points

36 votes

Overall karma indicates overall quality.

14 comments1 min readLW link
(www.rand.org)

From Messy Shelves to Master Librar­i­ans: Toy-Model Ex­plo­ra­tion of Block-Di­ag­o­nal Geom­e­try in LM Activations

YuxiaoJul 19, 2025, 12:26 PM
6 points

5 votes

Overall karma indicates overall quality.

1 comment4 min readLW link

The Quan­ti­za­tion Model of Neu­ral Scaling

nzMar 31, 2023, 4:02 PM
17 points

8 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

Chronos­ta­sis: The Time-Cap­sule Co­nun­drum of Lan­guage Models

RationalMindsetMar 26, 2023, 6:54 PM
−5 points

6 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

The REPHRASE Cir­cuit: How Fine-Tun­ing En­hances LLMs to REPHRASE Text

Karthik ViswanathanApr 6, 2025, 3:02 PM
4 points

3 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

AMA on Truth­ful AI: Owen Cot­ton-Bar­ratt, Owain Evans & co-authors

Owain_EvansOct 22, 2021, 4:23 PM
31 points

8 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

Read­abil­ity is mostly a waste of characters

vlad.proexApr 21, 2023, 10:05 PM
21 points

16 votes

Overall karma indicates overall quality.

7 comments3 min readLW link

Post-hoc rea­son­ing in chain of thought

Kyle CoxFeb 5, 2025, 6:58 PM
19 points

13 votes

Overall karma indicates overall quality.

0 comments11 min readLW link

SAE Train­ing Dataset In­fluence in Fea­ture Match­ing and a Hy­poth­e­sis on Po­si­tion Features

Seonglae ChoFeb 26, 2025, 5:05 PM
4 points

4 votes

Overall karma indicates overall quality.

3 comments17 min readLW link

Does GPT-4 ex­hibit agency when sum­ma­riz­ing ar­ti­cles?

Christopher KingMar 24, 2023, 3:49 PM
16 points

10 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

De­tect­ing out of dis­tri­bu­tion text with sur­prisal and entropy

Sandy FraserJan 28, 2025, 6:46 PM
24 points

11 votes

Overall karma indicates overall quality.

4 comments11 min readLW link

Deep­Seek Col­lapse Un­der Reflec­tive Ad­ver­sar­ial Pres­sure: A Case Study

unmodeled.tylerJul 8, 2025, 9:14 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

[Question] What evidence is there of LLMs containing world models?

Chris_LeongOct 4, 2023, 2:33 PM
17 points

10 votes

Overall karma indicates overall quality.

17 comments1 min readLW link

Ed­u­ca­tional CAI: Align­ing a Lan­guage Model with Ped­a­gog­i­cal Theories

Bharath PuranamNov 1, 2024, 6:55 PM
5 points

3 votes

Overall karma indicates overall quality.

1 comment13 min readLW link

Some Ar­gu­ments Against Strong Scaling

Joar SkalseJan 13, 2023, 12:04 PM
25 points

22 votes

Overall karma indicates overall quality.

21 comments16 min readLW link

Let’s look at an­other “LLMs lack true un­der­stand­ing” paper

ExpertiumJun 29, 2025, 2:00 PM
3 points

5 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Using ideologically-charged language to get gpt-3.5-turbo to disobey its system prompt: a demo

Milan WAug 24, 2024, 12:13 AM
3 points

4 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

The Allure of the Dark Side: A Crit­i­cal AI Safety Vuln­er­a­bil­ity I Stum­bled Into

Kareem SolimanJul 28, 2025, 1:20 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Ex­trap­o­lat­ing GPT-N performance

Lukas FinnvedenDec 18, 2020, 9:41 PM
112 points

38 votes

Overall karma indicates overall quality.

31 comments22 min readLW link1 review

GPT-4 is bad at strate­gic thinking

Christopher KingMar 27, 2023, 3:11 PM
22 points

15 votes

Overall karma indicates overall quality.

8 comments1 min readLW link

OpenAI Codex: First Impressions

specbugAug 13, 2021, 4:52 PM
49 points

25 votes

Overall karma indicates overall quality.

8 comments4 min readLW link
(sixeleven.in)

World, mind, and learn­abil­ity: A note on the meta­phys­i­cal struc­ture of the cos­mos [& LLMs]

Bill BenzonSep 5, 2023, 12:19 PM
4 points

1 vote

Overall karma indicates overall quality.

1 comment5 min readLW link

XAI re­leases Grok base model

Jacob G-WMar 18, 2024, 12:47 AM
11 points

8 votes

Overall karma indicates overall quality.

3 comments1 min readLW link
(x.ai)

AI Safety via Luck

JozdienApr 1, 2023, 8:13 PM
82 points

53 votes

Overall karma indicates overall quality.

7 comments11 min readLW link

I, Token

Ivan VendrovNov 25, 2024, 2:20 AM
14 points

6 votes

Overall karma indicates overall quality.

2 comments3 min readLW link
(nothinghuman.substack.com)

A Grounded UX Layer for LLMs That Could Prevent Real Harm

ParityMindJul 11, 2025, 6:19 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

In­tri­ca­cies of Fea­ture Geom­e­try in Large Lan­guage Models

Dec 7, 2024, 6:10 PM
71 points

27 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

Comma v0.1 con­verted to GGUF

Trevor Hill-HandOct 18, 2025, 3:54 PM
8 points

7 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

Is In­ter­pretabil­ity All We Need?

RogerDearnaleyNov 14, 2023, 5:31 AM
1 point

1 vote

Overall karma indicates overall quality.

1 comment1 min readLW link

Can a chef with no AI liter­acy make gpt au­dit grok? Ap­par­ently.

Kyle. PJul 6, 2025, 7:23 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Bias-Aug­mented Con­sis­tency Train­ing Re­duces Bi­ased Rea­son­ing in Chain-of-Thought

Miles TurpinMar 11, 2024, 11:46 PM
16 points

7 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(arxiv.org)

What’s go­ing on? LLMs and IS-A sen­tences

Bill BenzonNov 8, 2023, 4:58 PM
6 points

2 votes

Overall karma indicates overall quality.

15 comments4 min readLW link

Ab­solute Zero: Re­in­forced Self-play Rea­son­ing with Zero Data

Matrice JacobineMay 12, 2025, 3:20 PM
6 points

3 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(www.arxiv.org)

Paper: Large Lan­guage Models Can Self-im­prove [Linkpost]

Evan R. MurphyOct 2, 2022, 1:29 AM
53 points

31 votes

Overall karma indicates overall quality.

15 comments1 min readLW link
(openreview.net)

LLMs are badly misaligned

Joe RogeroOct 5, 2025, 2:00 PM
26 points

26 votes

Overall karma indicates overall quality.

25 comments3 min readLW link

Hal­lu­ci­na­tion and Re­fu­ta­tion: Em­brac­ing Imag­i­na­tion An­chored in Real­ity through Pop­pe­rian AI.

GeorgsLightningApr 21, 2025, 8:42 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments14 min readLW link

Two new datasets for eval­u­at­ing poli­ti­cal syco­phancy in LLMs

alma.liezengaSep 28, 2024, 6:29 PM
9 points

3 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

Truth is Univer­sal: Ro­bust De­tec­tion of Lies in LLMs

Lennart BuergerJul 19, 2024, 2:07 PM
24 points

13 votes

Overall karma indicates overall quality.

3 comments2 min readLW link
(arxiv.org)

Adapt­ing to Change: Over­com­ing Chronos­ta­sis in AI Lan­guage Models

RationalMindsetMar 28, 2023, 2:32 PM
−1 points

4 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

Trans­former lan­guage mod­els are do­ing some­thing more general

NumendilAug 3, 2022, 9:13 PM
53 points

28 votes

Overall karma indicates overall quality.

6 comments2 min readLW link

Notes on Meta’s Di­plo­macy-Play­ing AI

Erich_GrunewaldDec 22, 2022, 11:34 AM
19 points

7 votes

Overall karma indicates overall quality.

2 comments14 min readLW link
(www.erichgrunewald.com)

Graph­i­cal ten­sor no­ta­tion for interpretability

Jordan TaylorOct 4, 2023, 8:04 AM
141 points

72 votes

Overall karma indicates overall quality.

11 comments19 min readLW link

Ac­tAdd: Steer­ing Lan­guage Models with­out Optimization

Sep 6, 2023, 5:21 PM
105 points

31 votes

Overall karma indicates overall quality.

3 comments2 min readLW link
(arxiv.org)

Why I take short timelines seriously

NicholasKeesJan 28, 2024, 10:27 PM
122 points

70 votes

Overall karma indicates overall quality.

29 comments4 min readLW link

Elicit: Lan­guage Models as Re­search Assistants

Apr 9, 2022, 2:56 PM
71 points

35 votes

Overall karma indicates overall quality.

6 comments13 min readLW link

[Linkpost] Scal­ing laws for lan­guage en­cod­ing mod­els in fMRI

Bogdan Ionut CirsteaJun 8, 2023, 10:52 AM
30 points

13 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

How dan­ger­ous is en­coded rea­son­ing?

artkpvJun 30, 2025, 11:54 AM
17 points

7 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

Google AI in­te­grates PaLM with robotics: SayCan up­date [Linkpost]

Evan R. MurphyAug 24, 2022, 8:54 PM
25 points

8 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(sites.research.google)

OpenAI Credit Ac­count (2510$)

Emirhan BULUTJan 21, 2024, 2:30 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

New GPT-3 competitor

Quintin PopeAug 12, 2021, 7:05 AM
32 points

22 votes

Overall karma indicates overall quality.

10 comments1 min readLW link

Self prop­a­gat­ing story.

CanalettoApr 12, 2025, 12:32 PM
3 points

1 vote

Overall karma indicates overall quality.

0 comments8 min readLW link

SAEs are highly dataset de­pen­dent: a case study on the re­fusal direction

Nov 7, 2024, 5:22 AM
67 points

25 votes

Overall karma indicates overall quality.

4 comments14 min readLW link

The Con­cep­tual To­pog­ra­phy Hy­poth­e­sis: Why Emer­gence in LLMs Isn’t Just About Scale

ravikiran nmJul 6, 2025, 1:16 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments6 min readLW link

We In­spected Every Head In GPT-2 Small us­ing SAEs So You Don’t Have To

Mar 6, 2024, 5:03 AM
63 points

28 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

Open Source Au­to­mated In­ter­pretabil­ity for Sparse Au­toen­coder Features

Jul 30, 2024, 9:11 PM
67 points

31 votes

Overall karma indicates overall quality.

1 comment13 min readLW link
(blog.eleuther.ai)

Whisper’s Wild Implications

Ollie JJan 3, 2023, 12:17 PM
24 points

12 votes

Overall karma indicates overall quality.

6 comments5 min readLW link

LLM Sy­co­phancy: groom­ing, proto-sen­tience, or both?

gturner4Oct 13, 2025, 12:58 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

More ex­per­i­ments in GPT-4 agency: writ­ing memos

Christopher KingMar 24, 2023, 5:51 PM
5 points

7 votes

Overall karma indicates overall quality.

2 comments10 min readLW link

[PAPER] Ja­co­bian Sparse Au­toen­coders: Spar­sify Com­pu­ta­tions, Not Just Activations

Lucy FarnikFeb 26, 2025, 12:50 PM
79 points

38 votes

Overall karma indicates overall quality.

8 comments7 min readLW link

[Question] Beyond Bench­marks: A Psy­cho­me­t­ric Ap­proach to AI Evaluation

Kareem SolimanJul 27, 2025, 4:09 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments8 min readLW link

In­de­pen­dent re­search ar­ti­cle an­a­lyz­ing con­sis­tent self-re­ports of ex­pe­rience in ChatGPT and Claude

rifeJan 6, 2025, 5:34 PM
4 points

9 votes

Overall karma indicates overall quality.

20 comments1 min readLW link
(awakenmoon.ai)

Can an LLM iden­tify ring-com­po­si­tion in a liter­ary text? [ChatGPT]

Bill BenzonSep 1, 2023, 2:18 PM
4 points

1 vote

Overall karma indicates overall quality.

2 comments11 min readLW link

Black-box in­ter­pretabil­ity method­ol­ogy blueprint: Prob­ing run­away op­ti­mi­sa­tion in LLMs

Roland PihlakasJun 22, 2025, 6:16 PM
17 points

5 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

The Last Laugh: Ex­plor­ing the Role of Hu­mor as a Bench­mark for Large Lan­guage Models

Greg RobisonFeb 12, 2024, 6:34 PM
4 points

4 votes

Overall karma indicates overall quality.

6 comments11 min readLW link

De­bat­ing with More Per­sua­sive LLMs Leads to More Truth­ful Answers

Feb 7, 2024, 9:28 PM
89 points

36 votes

Overall karma indicates overall quality.

14 comments9 min readLW link
(arxiv.org)

Dis­cus­sion: Challenges with Un­su­per­vised LLM Knowl­edge Discovery

Dec 18, 2023, 11:58 AM
149 points

57 votes

Overall karma indicates overall quality.

21 comments10 min readLW link

Un­learn­ing Needs to be More Selec­tive [Progress Re­port]

Jun 27, 2025, 4:38 PM
24 points

10 votes

Overall karma indicates overall quality.

6 comments3 min readLW link

Gen­er­at­ing the Fun­niest Joke with RL (ac­cord­ing to GPT-4.1)

aggMay 16, 2025, 5:09 AM
103 points

65 votes

Overall karma indicates overall quality.

22 comments4 min readLW link

How Do We Eval­u­ate AI Eval­u­a­tions?

Satyapriya KrishnaOct 13, 2025, 10:20 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

GPT-2 Some­times Fails at IOI

Ronak_MehtaAug 14, 2024, 11:24 PM
13 points

12 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(ronakrm.github.io)

Field Re­port: When Claude Said ‘I Love You’

SYNTXJun 16, 2025, 12:05 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Live The­ory Part 0: Tak­ing In­tel­li­gence Seriously

SahilJun 26, 2024, 9:37 PM
103 points

44 votes

Overall karma indicates overall quality.

3 comments8 min readLW link

An Un­ex­pected GPT-3 De­ci­sion in a Sim­ple Gam­ble

casualphysicsenjoyerSep 25, 2022, 4:46 PM
8 points

3 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

In­vert­ing the Most For­bid­den Tech­nique: What hap­pens when we train LLMs to lie de­tectably?

Peter JordanOct 9, 2025, 12:43 AM
20 points

8 votes

Overall karma indicates overall quality.

3 comments4 min readLW link

Bing chat is the AI fire alarm

RatiosFeb 17, 2023, 6:51 AM
115 points

94 votes

Overall karma indicates overall quality.

63 comments3 min readLW link

[Question] Could LLMs Help Gen­er­ate New Con­cepts in Hu­man Lan­guage?

Pekka LampeltoMar 24, 2024, 8:13 PM
10 points

6 votes

Overall karma indicates overall quality.

4 comments2 min readLW link

Ex­plor­ing the Resi­d­ual Stream of Trans­form­ers for Mechanis­tic In­ter­pretabil­ity — Explained

Zeping YuDec 26, 2023, 12:36 AM
7 points

3 votes

Overall karma indicates overall quality.

1 comment11 min readLW link

Are SAE fea­tures from the Base Model still mean­ingful to LLaVA?

Shan23ChenDec 5, 2024, 7:24 PM
5 points

5 votes

Overall karma indicates overall quality.

2 comments10 min readLW link

2+2: On­tolog­i­cal Framework

LyrialtusFeb 1, 2022, 1:07 AM
−15 points

7 votes

Overall karma indicates overall quality.

2 comments12 min readLW link

At­ten­tion SAEs Scale to GPT-2 Small

Feb 3, 2024, 6:50 AM
78 points

26 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

[Question] Is LLM Trans­la­tion Without Rosetta Stone pos­si­ble?

cubefoxApr 11, 2024, 12:36 AM
36 points

21 votes

Overall karma indicates overall quality.

15 comments1 min readLW link

[Paper] Hid­den in Plain Text: Emer­gence and Miti­ga­tion of Stegano­graphic Col­lu­sion in LLMs

Sep 25, 2024, 2:52 PM
37 points

21 votes

Overall karma indicates overall quality.

2 comments4 min readLW link
(arxiv.org)

Two very differ­ent ex­pe­riences with ChatGPT

SherrinfordFeb 7, 2023, 1:09 PM
38 points

14 votes

Overall karma indicates overall quality.

15 comments5 min readLW link

LLMs Still Suck at Log­i­cal Reasoning

anovikovJul 18, 2025, 6:35 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Cat­e­gory-The­o­retic Wan­der­ings into Interpretability

unruly abstractionsSep 2, 2025, 12:03 AM
18 points

7 votes

Overall karma indicates overall quality.

2 comments1 min readLW link
(www.unrulyabstractions.com)

In­ves­ti­gat­ing causal un­der­stand­ing in LLMs

Jun 14, 2022, 1:57 PM
28 points

16 votes

Overall karma indicates overall quality.

6 comments13 min readLW link

A short es­say on Illusions

cris.Sep 1, 2025, 10:25 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

In-Con­text Learn­ing: An Align­ment Survey

alamertonSep 30, 2024, 6:44 PM
8 points

5 votes

Overall karma indicates overall quality.

0 comments20 min readLW link
(docs.google.com)

Why Read Novels? (Do Words Mean Much?)

ussySep 5, 2025, 12:25 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Self-Con­trol of LLM Be­hav­iors by Com­press­ing Suffix Gra­di­ent into Pre­fix Controller

Henry CaiJun 16, 2024, 1:01 PM
7 points

7 votes

Overall karma indicates overall quality.

0 comments7 min readLW link
(arxiv.org)

Gen­er­at­ing Cog­nate­ful Sen­tences with Large Lan­guage Models

vkethanaJan 6, 2025, 6:40 PM
8 points

5 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

Is Wittgen­stein’s Lan­guage Game used when helping Ai un­der­stand lan­guage?

VisionaryHeraJun 4, 2024, 7:41 AM
3 points

2 votes

Overall karma indicates overall quality.

7 comments1 min readLW link

Re­search agenda—Build­ing a multi-modal chess-lan­guage model

p.b.Apr 7, 2022, 12:25 PM
8 points

4 votes

Overall karma indicates overall quality.

2 comments2 min readLW link

Ex­pec­ta­tions for Gem­ini: hope­fully not a big deal

Maxime RichéOct 2, 2023, 3:38 PM
15 points

15 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Self lo­ca­tion for LLMs by LLMs: Self-Assess­ment Check­list.

CanalettoSep 26, 2024, 7:57 PM
11 points

2 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

Com­po­si­tional prefer­ence mod­els for al­ign­ing LMs

Tomek KorbakOct 25, 2023, 12:17 PM
18 points

6 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

Ex­tract-and-Eval­u­ate Mon­i­tor­ing Can Sig­nifi­cantly En­hance CoT Mon­i­tor Perfor­mance (Re­search Note)

Aug 8, 2025, 10:41 AM
51 points

19 votes

Overall karma indicates overall quality.

7 comments10 min readLW link

How I force LLMs to gen­er­ate cor­rect code

claudioMar 21, 2025, 2:40 PM
91 points

47 votes

Overall karma indicates overall quality.

7 comments5 min readLW link

[linkpost] The fi­nal AI bench­mark: BIG-bench

RomanSJun 10, 2022, 8:53 AM
25 points

20 votes

Overall karma indicates overall quality.

21 comments1 min readLW link

AI Aware­ness through In­ter­ac­tion with Blatantly Alien Models

VojtaKovarikJul 28, 2023, 8:41 AM
7 points

3 votes

Overall karma indicates overall quality.

5 comments3 min readLW link

[AN #164]: How well can lan­guage mod­els write code?

Rohin ShahSep 15, 2021, 5:20 PM
13 points

4 votes

Overall karma indicates overall quality.

7 comments9 min readLW link
(mailchi.mp)

Lan­guage Models Model Us

eggsyntaxMay 17, 2024, 9:00 PM
159 points

70 votes

Overall karma indicates overall quality.

55 comments7 min readLW link

Us­ing Psy­chol­in­guis­tic Sig­nals to Im­prove AI Safety

JkreindlerAug 27, 2025, 10:30 PM
−2 points

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

ChatGPT (and now GPT4) is very eas­ily dis­tracted from its rules

dmcsMar 15, 2023, 5:55 PM
180 points

101 votes

Overall karma indicates overall quality.

42 comments1 min readLW link

Memetic Judo #3: The In­tel­li­gence of Stochas­tic Par­rots v.2

Max TKAug 20, 2023, 3:18 PM
8 points

12 votes

Overall karma indicates overall quality.

33 comments6 min readLW link

AISC pro­ject: TinyEvals

Jett JaniakNov 22, 2023, 8:47 PM
22 points

11 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Edge Cases in AI Alignment

Florian_DietzMar 24, 2025, 9:27 AM
19 points

6 votes

Overall karma indicates overall quality.

3 comments4 min readLW link

At last! ChatGPT does, shall we say, in­ter­est­ing imi­ta­tions of “Kubla Khan”

Bill BenzonApr 24, 2024, 2:56 PM
−3 points

2 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Jailbreak­ing ChatGPT and Claude us­ing Web API Con­text Injection

Jaehyuk LimOct 21, 2024, 9:34 PM
4 points

7 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

From No Mind to a Mind – A Con­ver­sa­tion That Changed an AI

parthibanarjuna sFeb 7, 2025, 11:50 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

La­tent Se­man­tic Com­pres­sion Trig­gers Bi­nary Model Behavior

Elias VölkerJun 12, 2025, 1:12 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

In­duc­ing Un­prompted Misal­ign­ment in LLMs

Apr 19, 2024, 8:00 PM
38 points

26 votes

Overall karma indicates overall quality.

7 comments16 min readLW link

Pow­er­ful mesa-op­ti­mi­sa­tion is already here

Roman LeventovFeb 17, 2023, 4:59 AM
35 points

17 votes

Overall karma indicates overall quality.

1 comment2 min readLW link
(arxiv.org)

Against LLM Reductionism

Erich_GrunewaldMar 8, 2023, 3:52 PM
140 points

67 votes

Overall karma indicates overall quality.

17 comments18 min readLW link
(www.erichgrunewald.com)

[Question] Any re­search in “probe-tun­ing” of LLMs?

Roman LeventovAug 15, 2023, 9:01 PM
20 points

4 votes

Overall karma indicates overall quality.

3 comments1 min readLW link

Pro­gram­ming AGI is impossible

Áron EcsenyiMay 30, 2023, 11:05 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

Think­ing Through AI: Why LLMs Are Lenses, Not Subjects

SolanJul 6, 2025, 7:58 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Liquid Neu­ral Net­works: A Step Toward AI Flex­i­bil­ity, but Not AGI

ezaanaminApr 2, 2025, 4:10 AM
0 points

0 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

The is­sue of mean­ing in large lan­guage mod­els (LLMs)

Bill BenzonMar 11, 2023, 11:00 PM
1 point

6 votes

Overall karma indicates overall quality.

34 comments8 min readLW link

Was Homer a stochas­tic par­rot? Mean­ing in liter­ary texts and LLMs

Bill BenzonApr 13, 2023, 4:44 PM
7 points

5 votes

Overall karma indicates overall quality.

4 comments3 min readLW link

[un­ti­tled post]

verwindungSep 14, 2023, 4:22 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Con­di­tion­ing, Prompts, and Fine-Tuning

Adam JermynAug 17, 2022, 8:52 PM
38 points

11 votes

Overall karma indicates overall quality.

9 comments4 min readLW link

A note on ‘semiotic physics’

metasemiFeb 11, 2023, 5:12 AM
11 points

10 votes

Overall karma indicates overall quality.

13 comments6 min readLW link

Quick Thoughts on Lan­guage Models

RohanSJul 18, 2023, 8:38 PM
6 points

4 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

A Sum­mary Of An­thropic’s First Paper

Sam RingerDec 30, 2021, 12:48 AM
86 points

45 votes

Overall karma indicates overall quality.

1 comment8 min readLW link

Maybe talk­ing isn’t the best way to com­mu­ni­cate with LLMs

mnvrJan 17, 2024, 6:24 AM
3 points

2 votes

Overall karma indicates overall quality.

1 comment1 min readLW link
(mrmr.io)

VATS-A Con­cep­tual To­ken Ar­range­ment Frame­work for Con­text-Aware Generation

nian412May 16, 2025, 8:24 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

Dec 14, 2022, 2:33 PM
29 points

17 votes

Overall karma indicates overall quality.

5 comments11 min readLW link

The Prospect of an AI Winter

Erich_GrunewaldMar 27, 2023, 8:55 PM
62 points

26 votes

Overall karma indicates overall quality.

24 comments15 min readLW link
(www.erichgrunewald.com)

Which AI Safety Bench­mark Do We Need Most in 2025?

Nov 17, 2024, 11:50 PM
2 points

2 votes

Overall karma indicates overall quality.

2 comments8 min readLW link

Si­tu­a­tional aware­ness in Large Lan­guage Models

Simon MöllerMar 3, 2023, 6:59 PM
32 points

22 votes

Overall karma indicates overall quality.

2 comments7 min readLW link

Your LLM Judge may be biased

Mar 29, 2024, 4:39 PM
37 points

16 votes

Overall karma indicates overall quality.

9 comments6 min readLW link

GPT-4 Predictions

Stephen McAleeseFeb 17, 2023, 11:20 PM
112 points

66 votes

Overall karma indicates overall quality.

27 comments11 min readLW link

In­ter­pretabil­ity through two lenses: biol­ogy and physics

raphaelAug 12, 2025, 8:25 PM
24 points

13 votes

Overall karma indicates overall quality.

4 comments4 min readLW link

LLMs and com­pu­ta­tion complexity

Jonathan MarcusApr 28, 2023, 5:48 PM
57 points

44 votes

Overall karma indicates overall quality.

29 comments5 min readLW link

Un­safe AI as Dy­nam­i­cal Systems

Robert_AIZIJul 14, 2023, 3:31 PM
11 points

6 votes

Overall karma indicates overall quality.

0 comments3 min readLW link
(aizi.substack.com)

Truth­ful and hon­est AI

Oct 29, 2021, 7:28 AM
42 points

14 votes

Overall karma indicates overall quality.

1 comment13 min readLW link

Thought An­chors: Which LLM Rea­son­ing Steps Mat­ter?

Jul 2, 2025, 8:16 PM
35 points

13 votes

Overall karma indicates overall quality.

6 comments6 min readLW link
(www.thought-anchors.com)

Train­ing goals for large lan­guage models

Johannes TreutleinJul 18, 2022, 7:09 AM
28 points

11 votes

Overall karma indicates overall quality.

5 comments19 min readLW link

De­pres­sion and Creativity

Bill BenzonNov 29, 2024, 12:27 AM
−4 points

5 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

Ma­chine Un­learn­ing Eval­u­a­tions as In­ter­pretabil­ity Benchmarks

Oct 23, 2023, 4:33 PM
33 points

17 votes

Overall karma indicates overall quality.

2 comments11 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric DrexlerFeb 3, 2022, 3:20 PM
94 points

31 votes

Overall karma indicates overall quality.

12 comments11 min readLW link1 review

Every Ma­jor LLM En­dorses New­comb One-Boxing

jackmastermindJun 15, 2025, 8:44 PM
19 points

8 votes

Overall karma indicates overall quality.

13 comments1 min readLW link
(jacktlab.substack.com)

What must be the case that ChatGPT would have mem­o­rized “To be or not to be”? – Three kinds of con­cep­tual ob­jects for LLMs

Bill BenzonSep 3, 2023, 6:39 PM
19 points

10 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

Steer­ing LLM Agents: Tem­per­a­ments or Per­son­al­ities?

sdetureAug 5, 2025, 12:40 AM
1 point

2 votes

Overall karma indicates overall quality.

0 comments6 min readLW link

The In­for­ma­tion: OpenAI shows ‘Straw­berry’ to feds, races to launch it

Martín SotoAug 27, 2024, 11:10 PM
145 points

67 votes

Overall karma indicates overall quality.

15 comments3 min readLW link

Un­faith­ful Ex­pla­na­tions in Chain-of-Thought Prompting

Miles TurpinJun 3, 2023, 12:22 AM
42 points

15 votes

Overall karma indicates overall quality.

8 comments7 min readLW link

GPT-3: a dis­ap­point­ing paper

nostalgebraistMay 29, 2020, 7:06 PM
65 points

63 votes

Overall karma indicates overall quality.

43 comments8 min readLW link1 review

Re­dun­dant At­ten­tion Heads in Large Lan­guage Models For In Con­text Learning

skunnavakkamSep 1, 2024, 8:08 PM
7 points

7 votes

Overall karma indicates overall quality.

2 comments4 min readLW link
(skunnavakkam.github.io)

Char­ac­ter­iz­ing sta­ble re­gions in the resi­d­ual stream of LLMs

Sep 26, 2024, 1:44 PM
43 points

19 votes

Overall karma indicates overall quality.

4 comments1 min readLW link
(arxiv.org)

PCAST Work­ing Group on Gen­er­a­tive AI In­vites Public Input

Christopher KingMay 13, 2023, 10:49 PM
7 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(terrytao.wordpress.com)

Research Agenda: Modelling Trajectories of Language Models

NickyPNov 13, 2023, 2:33 PM
28 points

13 votes

Overall karma indicates overall quality.

0 comments12 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

RowanWangMar 29, 2022, 8:09 PM
75 points

40 votes

Overall karma indicates overall quality.

4 comments6 min readLW link

Ro­bust­ness of Con­trast-Con­sis­tent Search to Ad­ver­sar­ial Prompting

Nov 1, 2023, 12:46 PM
18 points

11 votes

Overall karma indicates overall quality.

1 comment7 min readLW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_cApr 7, 2022, 1:46 PM
11 points

8 votes

Overall karma indicates overall quality.

0 comments7 min readLW link

GPT-4 busted? Clear self-in­ter­est when sum­ma­riz­ing ar­ti­cles about it­self vs when ar­ti­cle talks about Claude, LLaMA, or DALL·E 2

Christopher KingMar 31, 2023, 5:05 PM
6 points

12 votes

Overall karma indicates overall quality.

4 comments4 min readLW link

Stop post­ing prompt in­jec­tions on Twit­ter and call­ing it “mis­al­ign­ment”

lcFeb 19, 2023, 2:21 AM
147 points

75 votes

Overall karma indicates overall quality.

9 comments1 min readLW link

Con­sen­sus Val­i­da­tion for LLM Out­puts: Ap­ply­ing Blockchain-In­spired Models to AI Reliability

MurrayAitkenJun 5, 2025, 12:13 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments3 min readLW link

An In­ter­pretabil­ity Illu­sion for Ac­ti­va­tion Patch­ing of Ar­bi­trary Subspaces

Aug 29, 2023, 1:04 AM
77 points

28 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Retrieval Aug­mented Ge­n­e­sis II — Holy Texts Se­man­tics Analysis

João Ribeiro MedeirosOct 26, 2024, 5:00 PM
−1 points

2 votes

Overall karma indicates overall quality.

0 comments11 min readLW link

Phal­lo­cen­tric­ity in GPT-J’s bizarre strat­ified ontology

mwatkinsFeb 17, 2024, 12:16 AM
56 points

34 votes

Overall karma indicates overall quality.

37 comments9 min readLW link

Ele­ments of Com­pu­ta­tional Philos­o­phy, Vol. I: Truth

Jul 1, 2023, 11:44 AM
12 points

8 votes

Overall karma indicates overall quality.

6 comments1 min readLW link
(compphil.github.io)

Me­tacog­ni­tion and Self-Model­ing in LLMs

Christopher AckermanJul 10, 2025, 9:25 PM
19 points

6 votes

Overall karma indicates overall quality.

2 comments16 min readLW link

My cur­rent work­flow to study the in­ter­nal mechanisms of LLM

Yulu PiMay 16, 2023, 3:27 PM
4 points

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

ChatGPT’s On­tolog­i­cal Land­scape

Bill BenzonNov 1, 2023, 3:12 PM
7 points

3 votes

Overall karma indicates overall quality.

0 comments4 min readLW link

Effi­cient Dic­tionary Learn­ing with Switch Sparse Autoencoders

Anish MudideJul 22, 2024, 6:45 PM
118 points

67 votes

Overall karma indicates overall quality.

20 comments12 min readLW link

Lo­cat­ing and Edit­ing Knowl­edge in LMs

Dhananjay AshokJan 24, 2025, 10:53 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

New LLM Scal­ing Law

wrmedfordFeb 19, 2025, 8:21 PM
2 points

2 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(github.com)

The De­creas­ing Value of Chain of Thought in Prompting

Matrice JacobineJun 8, 2025, 3:11 PM
11 points

5 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(papers.ssrn.com)

[Question] Re­in­force­ment Learn­ing: Essen­tial Step Towards AGI or Ir­rele­vant?

DoubleOct 17, 2024, 3:37 AM
1 point

4 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

[Question] How do I de­sign long prompts for think­ing zero shot sys­tems with dis­tinct equally dis­tributed prompt sec­tions (mis­sion, goals, mem­o­ries, how-to-re­spond,… etc) and how to main­tain llm co­her­ence?

ollie_May 11, 2025, 7:32 PM
2 points

7 votes

Overall karma indicates overall quality.

5 comments1 min readLW link

Un­cov­er­ing De­cep­tive Ten­den­cies in Lan­guage Models: A Si­mu­lated Com­pany AI Assistant

May 6, 2024, 7:07 AM
95 points

42 votes

Overall karma indicates overall quality.

13 comments1 min readLW link
(arxiv.org)

Just be­cause an LLM said it doesn’t mean it’s true: an illus­tra­tive example

dirkAug 21, 2024, 9:05 PM
26 points

18 votes

Overall karma indicates overall quality.

12 comments3 min readLW link

Up­dat­ing and Edit­ing Fac­tual Knowl­edge in Lan­guage Models

Dhananjay AshokJan 23, 2025, 7:34 PM
2 points

2 votes

Overall karma indicates overall quality.

2 comments10 min readLW link

In­tro­duc­ing METR’s Au­ton­omy Eval­u­a­tion Resources

Mar 15, 2024, 11:16 PM
90 points

29 votes

Overall karma indicates overall quality.

0 comments1 min readLW link
(metr.github.io)

If It Talks Like It Thinks, Does It Think? De­sign­ing Tests for In­tent Without As­sum­ing It

yukin_coJul 28, 2025, 12:33 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments4 min readLW link

ChatGPT: Ex­plor­ing the Digi­tal Wilder­ness, Find­ings and Prospects

Bill BenzonFeb 2, 2025, 9:54 AM
2 points

2 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

The Geom­e­try of Feel­ings and Non­sense in Large Lan­guage Models

Sep 27, 2024, 5:49 PM
61 points

22 votes

Overall karma indicates overall quality.

10 comments4 min readLW link

Can quan­tised au­toen­coders find and in­ter­pret cir­cuits in lan­guage mod­els?

charlieoneillMar 24, 2024, 8:05 PM
30 points

16 votes

Overall karma indicates overall quality.

4 comments24 min readLW link

Distil­la­tion Ro­bus­tifies Unlearning

Jun 13, 2025, 1:45 PM
234 points

108 votes

Overall karma indicates overall quality.

43 comments8 min readLW link
(arxiv.org)

Can Large Lan­guage Models effec­tively iden­tify cy­ber­se­cu­rity risks?

emile delcourtAug 30, 2024, 8:20 PM
18 points

4 votes

Overall karma indicates overall quality.

0 comments11 min readLW link

Hu­mans vs LLM, memes as theorems

Yaroslav GranowskiMay 9, 2025, 1:26 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments1 min readLW link

Aes­thetic Prefer­ences Can Cause Emer­gent Misalignment

Anders WoodruffAug 26, 2025, 6:41 PM
92 points

47 votes

Overall karma indicates overall quality.

16 comments3 min readLW link

Reflec­tion Mechanisms as an Align­ment Tar­get—At­ti­tudes on “near-term” AI

Mar 2, 2023, 4:29 AM
21 points

11 votes

Overall karma indicates overall quality.

0 comments8 min readLW link

Lan­guage Field Re­con­struc­tion The­ory: A User-Origi­nated Ob­ser­va­tion of Tier Lock and Se­man­tic Per­son­al­ity in GPT-4o

許皓翔Jun 15, 2025, 4:28 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

Wor­ries about la­tent rea­son­ing in LLMs

Caleb BiddulphJan 20, 2025, 9:09 AM
45 points

20 votes

Overall karma indicates overall quality.

6 comments7 min readLW link

The Velvet Cage Hy­poth­e­sis: On the Epistemic Risks of Helpful AI

François-Xavier MorgandJun 7, 2025, 6:01 AM
1 point

1 vote

Overall karma indicates overall quality.

0 comments5 min readLW link

A brain­teaser for lan­guage models

Adam ScherlisDec 12, 2022, 2:43 AM
47 points

30 votes

Overall karma indicates overall quality.

3 comments2 min readLW link

MAKE IT BETTER (a po­etic demon­stra­tion of the ba­nal­ity of GPT-3)

rogersbaconJan 2, 2023, 8:47 PM
7 points

10 votes

Overall karma indicates overall quality.

2 comments5 min readLW link

An in­ter­est­ing math­e­mat­i­cal model of how LLMs work

Bill BenzonApr 30, 2024, 11:01 AM
5 points

7 votes

Overall karma indicates overall quality.

0 comments1 min readLW link

LLMs Suck at Deep Think­ing Part 3 - Try­ing to Prove It (fixed)

Taylor G. LuntSep 27, 2025, 2:54 PM
17 points

8 votes

Overall karma indicates overall quality.

6 comments15 min readLW link

Prop­er­ties of cur­rent AIs and some pre­dic­tions of the evolu­tion of AI from the per­spec­tive of scale-free the­o­ries of agency and reg­u­la­tive development

Roman LeventovDec 20, 2022, 5:13 PM
33 points

24 votes

Overall karma indicates overall quality.

3 comments36 min readLW link

From Un­ruly Stacks to Or­ga­nized Shelves: Toy Model Val­i­da­tion of Struc­tured Pri­ors in Sparse Autoencoders

YuxiaoJul 6, 2025, 7:03 AM
9 points

8 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

[Question] Is “hid­den com­plex­ity of wishes prob­lem” solved?

Roman MalovJan 5, 2025, 10:59 PM
10 points

8 votes

Overall karma indicates overall quality.

4 comments1 min readLW link

Ev­i­dence on lan­guage model consciousness

dsjNov 1, 2025, 4:01 AM
17 points

8 votes

Overall karma indicates overall quality.

0 comments2 min readLW link
(thedavidsj.substack.com)

Con­di­tion­ing Gen­er­a­tive Models with Restrictions

Adam JermynJul 21, 2022, 8:33 PM
18 points

9 votes

Overall karma indicates overall quality.

4 comments8 min readLW link

GPT Doesn’t Just Pre­dict Words — It Models You

Tom FandangoAug 1, 2025, 1:05 PM
1 point

1 vote

Overall karma indicates overall quality.

0 comments2 min readLW link

In­terLab – a toolkit for ex­per­i­ments with multi-agent interactions

Jan 22, 2024, 6:23 PM
69 points

25 votes

Overall karma indicates overall quality.

0 comments8 min readLW link
(acsresearch.org)

LW is prob­a­bly not the place for “I asked this LLM (x) and here’s what it said!”, but where is?

lillybaeumApr 12, 2023, 10:12 AM
21 points

13 votes

Overall karma indicates overall quality.

3 comments1 min readLW link

How truth­ful is GPT-3? A bench­mark for lan­guage models

Owain_EvansSep 16, 2021, 10:09 AM
58 points

26 votes

Overall karma indicates overall quality.

24 comments6 min readLW link

The Method of Loci: With some brief re­marks, in­clud­ing trans­form­ers and eval­u­at­ing AIs

Bill BenzonDec 2, 2023, 2:36 PM
6 points

2 votes

Overall karma indicates overall quality.

0 comments3 min readLW link

Notes on ChatGPT’s “mem­ory” for strings and for events

Bill BenzonSep 20, 2023, 6:12 PM
3 points

3 votes

Overall karma indicates overall quality.

0 comments10 min readLW link

ChatGPT in­ti­mates a tan­ta­l­iz­ing fu­ture; its core LLM is or­ga­nized on mul­ti­ple lev­els; and it has bro­ken the idea of think­ing.

Bill BenzonJan 24, 2023, 7:05 PM
5 points

4 votes

Overall karma indicates overall quality.

0 comments5 min readLW link

The case for al­ign­ing nar­rowly su­per­hu­man models

Ajeya CotraMar 5, 2021, 10:29 PM
186 points

78 votes

Overall karma indicates overall quality.

75 comments38 min readLW link1 review

What will the scaled up GATO look like? (Up­dated with ques­tions)

Amal Oct 25, 2022, 12:44 PM
34 points

21 votes

Overall karma indicates overall quality.

22 comments1 min readLW link

New GPT3 Impressive Capabilities—InstructGPT3 [1/2]

simeon_cMar 13, 2022, 10:58 AM
72 points

32 votes

Overall karma indicates overall quality.

10 comments7 min readLW link

One-shot steer­ing vec­tors cause emer­gent mis­al­ign­ment, too

Jacob DunefskyApr 14, 2025, 6:40 AM
98 points

43 votes

Overall karma indicates overall quality.

6 comments11 min readLW link

LLMs Look In­creas­ingly Like Gen­eral Reasoners

eggsyntaxNov 8, 2024, 11:47 PM
94 points

44 votes

Overall karma indicates overall quality.

45 comments3 min readLW link