Maximilian Kaufmann

Karma: 190

Maximilian Kaufmann 17 Feb 2025 13:51 UTC
6 points
3
in reply to: L Rudolf L’s comment on: William_S’s Shortform
I’d like to use this based on your description, but it’s tough to trust / understand without a README!

Paper: LLMs trained on “A is B” fail to learn “B is A”

lberglund, Owain_Evans, Meg, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland and Tomek Korbak

23 Sep 2023 19:55 UTC

125 points

74 comments · 4 min read · LW link

(arxiv.org)

Paper: On measuring situational awareness in LLMs

Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg and Maximilian Kaufmann

4 Sep 2023 12:54 UTC

111 points

17 comments · 5 min read · LW link

(arxiv.org)

Maximilian Kaufmann 16 May 2023 22:15 UTC
3 points
2
on: Proposal: we should start referring to the risk from unaligned AI as a type of *accident risk*
A point against that particular terminology, which you might find interesting: https://www.lesswrong.com/posts/6bpW2kyeKaBtuJuEk/why-i-hate-the-accident-vs-misuse-ai-x-risk-dichotomy-quick

Maximilian Kaufmann 14 Jan 2023 22:15 UTC
3 points
2
on: Basic Question about LLMs: how do they know what task to perform
To partially answer your question (I think the answer to “What is happening inside the LLM when it ‘switches’ to one task or another?” is pretty much “We don’t know”), techniques such as RLHF (which nowadays are applied to pretty much any public-facing model you are likely to interact with) cause the model to act less like something searching for the most likely completion of this sentence on the internet, and more like something which is trying to answer your questions. These models would take the “question” interpretation over the “autocomplete” one.
A purely pretrained model might be more likely to output a probability distribution over the first token which is split across both possible continuations, and then stick with whichever interpretation it happened to generate as it autoregresses on its own output.
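As a rough illustration of that contrast, here is a toy sketch. It is not a real language model: the first-token probabilities, the `<answer>` and `<autocomplete>` marker tokens, and the canned continuations are all made up for the example. It only shows the mechanism described above: the interpretation is effectively chosen when the first token is sampled, and every later token is conditioned on that choice.

```python
import random

# Hypothetical first-token distributions (numbers invented for illustration).
# A purely pretrained model splits its mass across both readings of the prompt;
# an RLHF-style model puts most of its mass on the "question" reading.
PRETRAINED_FIRST = {"<answer>": 0.5, "<autocomplete>": 0.5}
RLHF_FIRST = {"<answer>": 0.95, "<autocomplete>": 0.05}

# Canned continuations standing in for what a model would generate
# after committing to one interpretation.
CONTINUATIONS = {
    "<answer>": ["The", "capital", "is", "Paris."],
    "<autocomplete>": ["and", "other", "common", "quiz", "questions"],
}


def sample_first(dist, rng):
    """Sample the first token from a {token: probability} mapping."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def next_token(context):
    """Toy autoregressive step: the next token depends on the context so far.

    The first token in the context fixes the branch, so the model never
    switches interpretation partway through.
    """
    branch = CONTINUATIONS[context[0]]
    position = len(context) - 1
    return branch[position] if position < len(branch) else None


def generate(first_dist, rng):
    """Sample a first token, then extend the context one token at a time."""
    context = [sample_first(first_dist, rng)]
    while (tok := next_token(context)) is not None:
        context.append(tok)
    return context


def answer_rate(first_dist, n=10_000, seed=0):
    """Fraction of samples that commit to the 'question' interpretation."""
    rng = random.Random(seed)
    hits = sum(generate(first_dist, rng)[0] == "<answer>" for _ in range(n))
    return hits / n


if __name__ == "__main__":
    print("pretrained:", answer_rate(PRETRAINED_FIRST))
    print("rlhf-style:", answer_rate(RLHF_FIRST))
```

In this sketch the only difference between the two "models" is the first-token distribution; everything after that is the same conditioning on previously generated output.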

Maximilian Kaufmann 7 Oct 2022 23:38 UTC
LW: 15 AF: 8
0
AF
on: More examples of goal misgeneralization
How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?