I would estimate that the difference between “hire some mechanical turkers and have them think for like a few seconds” and the actual data collection process accounts for around 1⁄3 of the effort that went into WebGPT, rising to around 2⁄3 if you include model assistance in the form of citations. So I think that what you wrote gives a misleading impression of the aims and priorities of RLHF work in practice.
I think it’s best to err on the side of not saying things that are literally false when the distinction matters to other people, even when it doesn’t matter to you. That said, I can see why reading the papers alone might not have made the distinction’s importance to others apparent, and “a few minutes” is definitely less inaccurate.
I just meant that the usual RLHF setup is essentially RL in which the reward is provided by a learned model, but I agree that I was stretching the way the terminology is normally used.
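To make concrete what I mean by “the reward is provided by a learned model”, here is a minimal sketch in PyTorch. Everything in it is illustrative: the names, shapes, and the toy reward model are all made up, and this is not WebGPT’s actual training code. The point is just that the RL algorithm consumes the reward model’s scalar output in place of a hand-coded reward function.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy stand-in for a reward model trained on human preference data.

    Maps a (prompt, response) feature vector to a scalar score.
    The architecture here is made up for illustration.
    """
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

reward_model = RewardModel()

def rl_reward(features: torch.Tensor) -> torch.Tensor:
    """The 'environment reward' seen by the RL algorithm (e.g. PPO)
    is just the learned model's scalar output, not a hand-coded function."""
    with torch.no_grad():
        return reward_model(features)

# Example: score a batch of 4 fake (prompt, response) feature vectors.
rewards = rl_reward(torch.randn(4, 16))
print(rewards.shape)  # torch.Size([4])
```

In that sense it is “just RL”, with the learned scores substituted for the reward signal; the stretch is that people usually reserve bare “RL” for settings where the reward is given rather than learned.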