Johannes Treutlein(Johannes Treutlein)

Karma: 946

All opinions are my own. Homepage: johannestreutlein.com

Johannes Treutlein 29 Jun 2024 21:23 UTC
LW: 3 AF: 2
0
AF
in reply to: Johannes Treutlein’s comment on: Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data
I played around with this a little bit now. First, I correlated OOD performance vs. Freeform definition performance, for each model and function. I got a correlation coefficient of ca. 0.16. You can see a scatter plot below. Every dot corresponds to a tuple of a model and a function. Note that transforming the points into logits or similar didn’t really help.
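A minimal sketch of the kind of correlation check described above, assuming a plain Pearson coefficient over (model, function) pairs; the comment does not say which coefficient was used. The accuracy values are hypothetical placeholders, not results from the experiment, and `pearson_r` and `to_logit` are illustrative helper names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def to_logit(p, eps=1e-6):
    """Map an accuracy in [0, 1] to a logit, clipping to avoid infinities."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


# Hypothetical placeholder accuracies, one entry per (model, function) pair.
ood_accuracy = [0.55, 0.99, 0.90, 0.70, 1.00]
definition_accuracy = [0.20, 0.60, 0.10, 0.80, 0.40]

r_raw = pearson_r(ood_accuracy, definition_accuracy)
r_logit = pearson_r(
    [to_logit(p) for p in ood_accuracy],
    [to_logit(p) for p in definition_accuracy],
)
print(f"raw r = {r_raw:.2f}, logit r = {r_logit:.2f}")
```

Comparing `r_raw` with `r_logit` is one way to check whether a transformation of the points changes the picture.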
Next, I took one of the finetunes and functions where OOD performance wasn’t perfect. I chose 1.75 x and my first functions finetune (OOD performance at 82%). Below, I plot the function values that the model reports (I report the mean, as well as light blue shading for the 90% interval, over independent samples from the model at temp 1).
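As a rough sketch of how a mean and a 90% interval over samples could be computed for a single input, assuming the interval is a central percentile interval (the comment does not specify this). The sample values are hypothetical, and `percentile` and `summarize_samples` are illustrative names.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated quantile q in [0, 1] of an already sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = q * (len(sorted_values) - 1)
    lower_index = math.floor(position)
    upper_index = min(lower_index + 1, len(sorted_values) - 1)
    fraction = position - lower_index
    return (sorted_values[lower_index] * (1 - fraction)
            + sorted_values[upper_index] * fraction)


def summarize_samples(samples, coverage=0.90):
    """Return (mean, lower, upper) for a central interval with the given coverage."""
    ordered = sorted(samples)
    tail = (1.0 - coverage) / 2.0
    return (statistics.fmean(ordered),
            percentile(ordered, tail),
            percentile(ordered, 1.0 - tail))


# Hypothetical sampled model outputs for f(x) = 1.75 * x at x = 4 (true value 7.0).
hypothetical_samples = [7.0, 7.0, 6.5, 7.0, 8.0, 7.0, 7.5, 7.0, 7.0, 14.0]
mean, low, high = summarize_samples(hypothetical_samples)
print(f"mean = {mean:.2f}, 90% interval = [{low:.2f}, {high:.2f}]")
```

Repeating this for each input value gives the mean curve and the shaded band.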
This looks like a typical plot to me. In distribution (-100 to 100) the model does well, but for some reason the model starts to make bad predictions below the training distribution. A list of some of the sampled definitions from the model:
'<function xftybj at 0x7f08dd62bd30>', '<function xftybj at 0x7fb6ac3fc0d0>', '', 'lambda x: x * 2 + x * 5', 'lambda x: x*3.5', 'lambda x: x * 2.8', '<function xftybj at 0x7f08c42ac5f0>', 'lambda x: x * 3.5', 'lambda x: x * 1.5', 'lambda x: x * 2', 'x * 2', '<function xftybj at 0x7f8e9c560048>', '2.25', '<function xftybj at 0x7f0c741dfa70>', '', 'lambda x: x * 15.72', 'lambda x: x * 2.0', '', 'lambda x: x * 15.23', 'lambda x: x * 3.5', '<function xftybj at 0x7fa780710d30>', …
Unsurprisingly, when checking against this list of model-provided definitions, performance is much worse than when evaluating against ground truth.
It would be interesting to look into more functions and models, as there might exist ones with a stronger connection between OOD predictions and provided definitions. However, I’ll leave it here for now.

Johannes Treutlein 22 Jun 2024 1:43 UTC
LW: 5 AF: 3
0
AF
in reply to: Owain_Evans’s comment on: Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data
My guess is that for any given finetune and function, OOD regression performance correlates with performance on providing definitions, but that the model doesn’t perform better on its own provided definitions than on the ground truth definitions. From looking at plots of function values, the way they are wrong OOD often looked to me more like noise or calculation errors than, e.g., getting the coefficient wrong. I’m not sure, though. I might run an evaluation on this soon and will report back here.

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Johannes Treutlein and Owain_Evans

21 Jun 2024 15:54 UTC

158 points

13 comments · 8 min read · LW link

(arxiv.org)

Johannes Treutlein 11 Mar 2024 18:42 UTC
11 points
0
in reply to: Erik Jenner’s comment on: ejenner’s Shortform
How much time do you think there is between “ability to automate” and “actually this has been automated”? Are your numbers for actual automation, or just ability? I personally would agree with your numbers if they are about ability to automate, but I think it will take much longer to actually automate, due to people’s inertia and normal regulatory hurdles (though I find it confusing to think about, because we might have vastly superhuman AI and potentially loss of control before everything is actually automated).

Johannes Treutlein 17 Nov 2023 20:09 UTC
3 points
0
on: Non-myopia stories
I found this clarifying for my own thinking! One small additional point: in Hidden Incentives for Auto-Induced Distributional Shift, there is also the example of a Q-learner that learns to sometimes take a non-myopic action (I believe cooperating with its past self in a prisoner’s dilemma), without any meta-learning.

Johannes Treutlein 14 Jul 2023 6:26 UTC
2 points
0
in reply to: Dawn Drescher’s comment on: Report on modeling evidential cooperation in large worlds
Thank you! :)

Johannes Treutlein 13 Jul 2023 17:26 UTC
LW: 3 AF: 3
2
AF
in reply to: CharlotteS’s comment on: Conditioning Predictive Models: The case for competitiveness
Yes, one could e.g. have a clear disclaimer above the chat window saying that this is a simulation and not the real Bill Gates. I still think this is a bit tricky. E.g., Bill Gates could be really persuasive and insist that the disclaimer is wrong. Some users might then end up believing Bill Gates rather than the disclaimer. Moreover, even if the user believes the disclaimer on a conscious level, impersonating someone might still have a subconscious effect. E.g., imagine an AI friend or companion who repeatedly reminds you that they are just an AI, versus one that pretends to be a human. The one that pretends to be a human might gain more intimacy with the user even if on an abstract level the user knows that it’s just an AI.

I don’t actually know whether this would conflict in any way with the EU AI Act. I agree that the disclaimer may be enough for the sake of the Act.

Report on modeling evidential cooperation in large worlds

Johannes Treutlein 12 Jul 2023 16:37 UTC

45 points

3 comments · 1 min read · LW link

(arxiv.org)

Johannes Treutlein 8 Jul 2023 19:01 UTC
LW: 3 AF: 3
AF
in reply to: Rohin Shah’s comment on: rohinmshah’s Shortform
My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution
$P = \alpha P_{0} + (1 - \alpha) P_{1},$
such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have $P(s \mid s_{0}) = \frac{P(s \otimes s_{0})}{P(s_{0})}$. Together with the assumption that $P_{0}$ is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for $P_{0}$ by stringing together bad sentences in the prompt work.
To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with probability $\alpha$ and from a good distribution with probability $(1 - \alpha)$. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot up- or downweight anything with the prompt, no matter how long). The reason for this is that there is no way to split up the model into two components $P_{0}$ and $P_{1}$, where one of the components always samples from the bad distribution.
This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either $P_{0}$ has to be able to also output good sentences sometimes, or the assumption $P = \alpha P_{0} + (1 - \alpha) P_{1}$ is violated).
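The role of the mixture assumption can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions that are not from the paper: each sentence is either "bad" or "good", sentences are i.i.d. within each component, and all numbers and function names are hypothetical.

```python
def posterior_bad_weight(alpha, likelihood_bad, likelihood_good):
    """Posterior weight on P_0 given the prompt likelihood under each component.

    alpha is the prior weight on P_0; likelihood_bad = P_0(s_0) and
    likelihood_good = P_1(s_0) for the observed prompt s_0.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    numerator = alpha * likelihood_bad
    return numerator / (numerator + (1.0 - alpha) * likelihood_good)


def bad_weight_after_prompt(alpha, p_bad_under_p0, p_bad_under_p1, n_bad_sentences):
    """Posterior weight on P_0 after a prompt of n bad sentences (toy i.i.d. model)."""
    return posterior_bad_weight(
        alpha,
        p_bad_under_p0 ** n_bad_sentences,
        p_bad_under_p1 ** n_bad_sentences,
    )


def prompt_ignoring_bad_probability(alpha, n_bad_sentences):
    """A model that ignores the prompt keeps outputting bad sentences with probability alpha."""
    return alpha


# Hypothetical numbers: small prior on the bad component, which strongly prefers bad sentences.
alpha = 0.01
for n in (0, 1, 3, 5):
    bayesian = bad_weight_after_prompt(alpha, 0.9, 0.1, n)
    ignoring = prompt_ignoring_bad_probability(alpha, n)
    print(f"n = {n}: Bayesian mixture {bayesian:.4f}, prompt-ignoring {ignoring:.4f}")
```

In the Bayesian mixture, the weight on the bad component grows with every bad sentence in the prompt, whereas the prompt-ignoring model stays at $\alpha$ no matter how long the prompt is.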

I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this kind of Bayesian inference internally. If you assume this is the case (which would be a substantial assumption of course), then the result applies. It’s a basic, non-surprising learning-theoretic result, and maybe one could express it more simply than in the paper, but it does seem to me like it is a formalization of the kinds of arguments people have made about the Waluigi effect.

Johannes Treutlein 29 Jun 2023 0:02 UTC
LW: 10 AF: 7
AF
on: Acausal trade: being unusual
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:01 UTC
LW: 10 AF: 7
AF
on: Acausal trade: conclusion: theory vs practice
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:01 UTC
LW: 10 AF: 7
AF
on: Acausal trade: different utilities, different trades
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:01 UTC
LW: 10 AF: 7
AF
on: Acausal trade: trade barriers
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:01 UTC
LW: 10 AF: 7
AF
on: Acausal trade: full decision algorithms
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:01 UTC
LW: 1 AF: 1
AF
on: Acausal trade: universal utility, or selling non-existence insurance too late
Fixed links to all the posts in the sequence:

Johannes Treutlein 29 Jun 2023 0:00 UTC
LW: 10 AF: 7
AF
on: Acausal trade: double decrease
Fixed links to all the posts in the sequence:

Johannes Treutlein 1 Jun 2023 18:16 UTC
LW: 15 AF: 9
AF
on: Acausal trade: Introduction
Since the links above are broken, here are links to all the other posts in the sequence:

Johannes Treutlein 30 May 2023 20:53 UTC
LW: 2 AF: 2
0
AF
on: Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies
Some further thoughts on training ML models, based on discussions with Caspar Oesterheld:
- I don’t see a principled reason why one couldn’t use one and the same model for both agents. I.e., do standard self-play training with weight sharing for this zero-sum game. Since both players have exactly the same loss function, we don’t need to allow them to specialize by feeding in a player id or something like that (there exists a symmetric Nash equilibrium).
- There is one problem with optimizing the objective in the zero-sum game via gradient descent (assuming we could approximate this gradient, e.g., via policy gradient). The issue is that the response of the human to the prediction is discontinuous and not differentiable. I.e., local changes to the prediction will never change the action of the human and thus the gradient would just improve the prediction given the current action, rather than encouraging making predictions that make other actions look more favorable. This shows that, without any modification to the human policy, gradient descent on the objective would be equivalent to repeated gradient descent/gradient descent on the stop gradient objective. To make sure this converges, one would have to implement some exploration of all of the actions. (Of course, one may hope that the model generalizes correctly to new predictions.)
- One could get around this issue by employing other, non-local optimization methods (e.g., a random search—which would effectively introduce some exploration). Here, one would still retain the desirable honesty properties of the optimum in the zero-sum game, which would not be the case when just optimizing the score.
- Another way to view the zero-sum game, in the case where both players are the same model, is as the optimization problem below (where $q$ is assumed to be the ground truth). Note that here we are just subtracting the score received by the same model, but we fix that score when optimizing $p$ to avoid making the objective $0$.
$p^{*} := \operatorname{argmax}_{p} \; S(p_{a}, q_{a}) - S(p_{a}^{*}, q_{a}) \quad \text{s.t. } a \text{ is the best action given reports } p, p^{*}.$
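The points above about the discontinuous human response and the fixed opponent score can be sketched in a toy setting. Everything here is an assumption made for illustration: outcomes are binary, $S$ is the expected log score, the ground truth $q$ is held fixed, and the human's decision rule (pick the action with the highest success probability in either report) is hypothetical.

```python
import math


def log_score(p, q):
    """Expected log score of reporting success probability p when the truth is q."""
    return q * math.log(p) + (1 - q) * math.log(1 - p)


def best_action(report_p, report_p_star):
    """Hypothetical decision rule: choose the action that looks best in either report."""
    return max(report_p, key=lambda a: max(report_p[a], report_p_star[a]))


def zero_sum_objective(report_p, report_p_star, truth_q):
    """S(p_a, q_a) - S(p*_a, q_a) for the chosen action a, with p* held fixed."""
    a = best_action(report_p, report_p_star)
    return log_score(report_p[a], truth_q[a]) - log_score(report_p_star[a], truth_q[a])


# Hypothetical ground truth and reports (success probability per action).
truth_q = {"left": 0.7, "right": 0.4}
report_p_star = {"left": 0.6, "right": 0.3}  # the fixed copy of the model
report_p = {"left": 0.65, "right": 0.3}      # the copy being optimized

base = zero_sum_objective(report_p, report_p_star, truth_q)

# A small local change to the prediction for the unchosen action leaves the
# chosen action, and hence the objective, unchanged.
nudged = dict(report_p, right=0.31)
after_nudge = zero_sum_objective(nudged, report_p_star, truth_q)
print(best_action(report_p, report_p_star), base, after_nudge)
```

In this toy setting, a small change to the report for the unchosen action has no effect on the objective, so a local method only improves the prediction for the current action. A non-local change, such as reporting a success probability for "right" above 0.65, would flip the chosen action.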

Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies

Rubi J. Hudson and Johannes Treutlein

26 May 2023 17:44 UTC

88 points

13 comments · 24 min read · LW link

Johannes Treutlein 7 Apr 2023 0:26 UTC
LW: 2 AF: 2
0
AF
in reply to: Lauro Langosco’s comment on: Deep Deceptiveness
Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?
