Felix Hofstätter

Karma: 231

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward

13 Jun 2024 10:04 UTC

77 points

10 comments · 2 min read · LW link

(arxiv.org)

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

26 Apr 2024 13:40 UTC

43 points

7 comments · 8 min read · LW link

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

29 Jan 2024 0:24 UTC

39 points

5 comments · 4 min read · LW link

Felix Hofstätter 16 Jan 2024 5:15 UTC
1 point
0
on: The case for training frontier AIs on Sumerian-only corpus
What would it look like if such a model produced code or, more generally, used any skill that entails a domain-specific language? I guess in the case of programming even keywords like “if” could be translated into Sumerian, but I can imagine that there are tasks that you cannot obfuscate this way. For example, the model might do math by outputting only strings of mathematical notation.
Also, it seems likely that frontier models will all be multi-modal, so they will have other forms of communication that don’t require language anyway. I suppose for image-generating models, you could train the frontier model to apply an obfuscating filter and have the reporter restore the image iff it is relevant to medical research.
Maybe my biggest concern is that I’m not convinced that we will be safe from manipulation by restricting the model’s communication to being only about e.g. medical research. I don’t have a precise story for how, but it seems plausible that a misaligned model could output instructions for medical research which also have side-effects that work towards the model’s secret goal.
Nitpicks aside, I like the idea of adding a safety layer by preventing the model from communicating directly with humans! And my concerns feel like technical issues that could easily be worked out given enough resources.

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak and Sam F. Brown

8 Nov 2023 11:37 UTC

49 points

0 comments · 18 min read · LW link

Understanding the Information Flow inside Large Language Models

Felix Hofstätter and cozyfractal

15 Aug 2023 21:13 UTC

19 points

0 comments · 17 min read · LW link

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter

25 Apr 2023 13:45 UTC

8 points

0 comments · 15 min read · LW link

Felix Hofstätter 19 Apr 2023 8:28 UTC
1 point
0
on: No, really, it predicts next tokens.
While reading the post and then some of the discussion, I got confused about whether it makes sense to distinguish between That-Which-Predicts and the mask in your model.
Usually I understand the mask to be what you get after fine-tuning: a simulator whose distribution over text is shaped like what we would expect from some character like the well-aligned chat-bot whose replies are honest, helpful, and harmless (HHH). This stands in contrast to the “Shoggoth”, which is the pretrained model without any fine-tuning. It’s still a simulator but with a distribution that might be completely alien to us. Since a model’s distribution is conditional on the input, you can “switch masks” by finding some input, conditional on which the fine-tuned model’s distribution corresponds to a different kind of character. You can also “awaken the Shoggoth” and get a glimpse of what is below the mask by finding some input, conditional on which the distribution is strange and alien. This is what we know as jail-breaking. In this view, the mask and whatever “is beneath” are different distributions but not different types of things. Taking away the mask just means finding an input for which the distribution of outputs has not been well shaped by fine-tuning.
The way you talk about the mask and That-Which-Predicts, it sounds like they are different entities which exist at the same time. It seems like the mask determines something like the rules that the text made up by the tokens should satisfy, and That-Which-Predicts then predicts the next token according to those rules. E.g. the mask of the well-aligned chat-bot determines that the text should ultimately be HHH, and That-Which-Predicts predicts the token that it considers most likely for a text that is HHH.
At least this is the impression I get from all the “If the mask would X, That-Which-Predicts would Y” in the post and from some of your replies like “that sort of question is in my view answered by the “mask””, but I may misunderstand!
My confusion about this distinction is that it does not seem to me that a That-Which-Predicts like the one you describe could ever be without a mask. The mask is a distribution, and predicting the next token means drawing an element from that distribution. Is there anything we gain from conceiving of a That-Which-Predicts, which seems to be a thing drawing elements from the distribution, as separate from the distribution?

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter

10 Mar 2023 7:54 UTC

11 points

0 comments · 12 min read · LW link

Felix Hofstätter 9 Mar 2023 21:22 UTC
4 points
1
AF
on: Anthropic’s Core Views on AI Safety
Thank you for this post. It looks like the people at Anthropic have put a lot of thought into this, which is good to see.
You mention that there are often surprising qualitative differences between larger and smaller models. How seriously is Anthropic considering a scenario where there is a sudden jump in certain dangerous capabilities (in particular deception) at some level of model intelligence? Does it seem plausible that it might not be possible to foresee this jump from experiments on even slightly weaker models?

Felix Hofstätter 15 Sep 2022 19:49 UTC
1 point
0
on: How should DeepMind’s Chinchilla revise our AI forecasts?
Very interesting! After reading “chinchilla’s wild implications” I was hoping someone would write something like this!

If I understand point 6 correctly, then you are proposing that Hoffmann’s scaling laws lead to shorter timelines because data-efficiency can be improved algorithmically. To me it seems that depending on algorithmic innovations, as opposed to the improvements in compute that would help increase parameters, might just as well make timelines longer. It feels like there is more uncertainty about whether people will keep coming up with the novel ideas required to improve data efficiency than about whether the available compute will continue to increase in the near to mid-term future. If the available data really becomes exhausted within the next few years, then improving the quality of models will be more dependent on such novel ideas under Hoffmann’s laws than under Kaplan’s.

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter

13 Sep 2022 17:08 UTC

15 points

0 comments · 14 min read · LW link

On Preference Manipulation in Reward Learning Processes

Felix Hofstätter

15 Aug 2022 19:32 UTC

8 points

0 comments · 4 min read · LW link