Sam Bowman 2 Aug 2022 0:04 UTC
LW: 17 AF: 7
16
AF
in reply to: IL’s comment on: chinchilla’s wild implications
I agree that this points in the direction of video becoming increasingly important.
But why assume only 1% is useful? And more importantly, why use only the language data? Even though we don’t have the scaling laws, it seems pretty clear that there’s a ton of information in the non-language parts of videos that’d be useful to a general-purpose agent, almost certainly more than in the language parts. (Of course, it’ll take more computation to extract the same amount of useful information from video than from text.)

Sam Bowman 15 Feb 2023 17:03 UTC
14 points
4
on: Qualities that alignment mentors value in junior researchers
When I converse with junior folks about what qualities they’re missing, they often focus on things like “not being smart enough” or “not being a genius” or “not having a PhD.” It’s interesting to notice differences between what junior folks think they’re missing & what mentors think they’re missing.

This issue is real, and it’s the thing that frustrates me most about alignment pipeline-building work in general right now. There are very likely some important formal/theoretical areas of alignment research that really do need to recruit mostly for something like ‘genius’. But a lot more of the active work that’s getting done (and way more of the hard-to-fill open jobs) depends much, much more on skills 1–5 here than on intelligence in that sense.
(This is on the margin. Here I’m focused on the actual population of people who tend to be interested in ML alignment research, so I’m baking in the assumption that all of the candidates could, say, get above-average grades in a STEM undergrad degree at a top-100 university if they tried.)
As someone who’s supervised/trained ML researchers for ~8 years now, I’d pretty much always rather hire someone who’s 90th-percentile on two or three of these skills than someone who’s no better than 70th-percentile but has world-class IMO (or IOI) performance or a verified IQ of 160 or some other classic raw intelligence signal.

Sam Bowman 21 Jul 2023 5:05 UTC
LW: 10 AF: 7
4
AF
in reply to: Seth Herd’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
Concretely, the scaling experiments in the first paper here show that, as models get larger, truncating or deleting the CoT string makes less and less difference to the model’s final output on any given task.
So, stories about CoT faithfulness that depend on the CoT string being load-bearing are no longer very compelling at large scales, and the strings are pretty clearly post hoc in at least some sense.
This doesn’t provide evidence, though, that the string is misleading about the reasoning process that the model is doing, e.g., in the sense that the string implies false counterfactuals about the model’s reasoning. Larger models are also just better at this kind of task, and the tasks all have only one correct answer, so any metric that requires the model to make mistakes in order to demonstrate faithfulness is going to struggle. I think at least for intuitive readings of a term like ‘faithfulness’, this all adds up to the claim in the comment above.
Counterfactual-based metrics, like the ones in the Turpin paper, are less vulnerable to this, and that’s probably where I’d focus if I wanted to push much further on measurement given what we know now. Though we already know from that paper that standard CoT in near-frontier models isn’t reliably faithful by that measure.
We may be able to follow up with a few more results to clarify the takeaways about scaling, and in particular, I think just running a scaling sweep for the perturbed reasoning adding-mistakes metric from the Lanham paper here would clarify things a bit. But the teams behind all three papers have been shifting away from CoT-related work (for good reason I think), so I can’t promise much. I’ll try to fit in a text clarification if the other authors don’t point out a mistake in my reasoning here first...
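For concreteness, the truncation measurement discussed above can be sketched roughly as follows. This is a minimal illustration and not the code from any of the papers: `truncation_sensitivity`, `answer_fn`, and `keep_fraction` are hypothetical names, `answer_fn` stands in for whatever call produces a final answer from a question plus a (possibly truncated) CoT string, and truncating on whitespace-separated words is an assumption made for simplicity.

```python
from typing import Callable, Sequence


def truncation_sensitivity(
    questions: Sequence[str],
    cots: Sequence[str],
    answer_fn: Callable[[str, str], str],  # hypothetical: (question, cot) -> final answer
    keep_fraction: float = 0.5,
) -> float:
    """Return the fraction of examples whose final answer changes
    when only the first `keep_fraction` of the CoT words is kept."""
    if len(questions) != len(cots):
        raise ValueError("questions and cots must have the same length")
    if not questions:
        return 0.0
    changed = 0
    for question, cot in zip(questions, cots):
        words = cot.split()
        truncated = " ".join(words[: int(len(words) * keep_fraction)])
        # Compare the answer given the full CoT with the answer given the truncated CoT.
        if answer_fn(question, truncated) != answer_fn(question, cot):
            changed += 1
    return changed / len(questions)


# Toy stand-ins for a model (assumptions, for illustration only).
def load_bearing_answer(question: str, cot: str) -> str:
    # Answer is read off the end of the CoT, so truncation changes it.
    words = cot.split()
    return words[-1] if words else "?"


def post_hoc_answer(question: str, cot: str) -> str:
    # Answer ignores the CoT entirely, so truncation never changes it.
    return str(len(question))


toy_questions = ["What is 2 plus 2?", "What is 3 times 3?"]
toy_cots = ["2 plus 2 is 4", "3 times 3 is 9"]
load_bearing_score = truncation_sensitivity(toy_questions, toy_cots, load_bearing_answer)
post_hoc_score = truncation_sensitivity(toy_questions, toy_cots, post_hoc_answer)
```

A score near zero here only says that the string is not load-bearing for the final answer. It does not by itself say that the string misdescribes the model's reasoning, which is the distinction drawn above.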

Sam Bowman 8 Dec 2022 3:00 UTC
10 points
11
on: Probably good projects for the AI safety ecosystem
More organizations like CAIS that aim to recruit established ML talent into alignment research
This is somewhat risky, and should get a lot of oversight. One of the biggest obstacles to discussing safety in academic settings is that academics are increasingly turned off by clumsy, arrogant presentations of the basic arguments for concern.

Sam Bowman 5 Apr 2022 19:57 UTC
10 points
in reply to: Søren Elverlin’s comment on: Google’s new 540 billion parameter language model
The BIG-Bench paper that those ‘human’ numbers are coming from (unpublished, quasi-public as TeX here) cautions against taking those averages very seriously, without giving complete details about who the humans are or how they were asked/incentivized to behave on tasks that required specialized skills:

Sam Bowman 25 Aug 2023 18:29 UTC
LW: 9 AF: 6
2
AF
on: Reducing sycophancy and improving honesty via activation steering
Possible confound: Is it plausible that the sycophancy vector is actually just adjusting how much the model conditions its responses on earlier parts of the conversation, beyond the final 10–20 tokens? IIUC, the question is always at the end, and ignoring the earlier context about the person who’s nominally asking the question should generally get you a better answer.

Sam Bowman 21 Apr 2022 15:22 UTC
LW: 9 AF: 5
AF
on: New Scaling Laws for Large Language Models
Is anyone working on updating the Biological Anchors Report model based on the updated slopes/requirements here?

Sam Bowman 23 Nov 2022 23:02 UTC
LW: 8 AF: 5
13
AF
in reply to: JanB’s comment on: Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility
+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular ‘core’ AI safety research community means that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research.
(To be fair, I think the Inverse Scaling Prize, which I’m helping with, raises some of these concerns, but the more precise/partially-quantifiable prize rubric, bigger/more diverse panel, and use of additional reviewers outside the panel mitigate them at least partially.)

Sam Bowman 20 Jul 2023 2:44 UTC
LW: 7 AF: 2
−1
AF
in reply to: tamera’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
I agree, though I’ll also add:

- I don’t think our results clearly show that faithfulness goes down with model size, just that there’s less affirmative evidence for faithfulness at larger model sizes, at least in part for predictable reasons related to the metric design. There’s probably more lowish-hanging fruit involving additional experiments focused on scaling. (I realize this disagrees with a point in the post!)

- Between the good-but-not-perfect results here and the alarming results in the Turpin ‘Say What They Think’ paper, I think this paints a pretty discouraging picture of standard CoT as a mechanism for oversight. This isn’t shocking! If we wanted to pursue an approach that relied on something like CoT, and we wanted to get around this potentially extremely cumbersome sweet-spot issue around scale, I think the next step would be to look for alternate training methods that give you something like CoT/FD/etc. but have better guarantees of faithfulness.

Sam Bowman 8 Apr 2022 20:48 UTC
7 points
in reply to: Xodarap’s comment on: It’s time for EA leadership to pull the fast-takeoff fire alarm.
I suspect that these developments look a bit less surprising if you’ve been trying to forecast progress here, and so might be at least partially priced in. Anyhow, the forecast you linked to shows >10% likelihood before spring 2025, three years from now. That’s extraordinarily aggressive compared to (implied) conventional wisdom, and probably a little more aggressive than I’d be as an EA AI prof with an interest in language models and scaling laws.

Sam Bowman 9 Mar 2023 23:58 UTC
6 points
3
on: Anthropic: Core Views on AI Safety: When, Why, What, and How
Just flagging that another cross-post has been collecting some comments: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety

Sam Bowman 27 Oct 2022 22:30 UTC
LW: 6 AF: 5
AF
on: A Small Negative Result on Debate
Update: We did a quick follow-up study adding counterarguments, turning this from single-turn to two-turn debate, as a quick way of probing whether more extensive full-transcript debate experiments on this task would work. The follow-up results were negative.
Tweet thread here: https://twitter.com/sleepinyourhat/status/1585759654478422016
Direct paper link: https://arxiv.org/abs/2210.10860 (To appear at the NeurIPS ML Safety workshop.)
We’re still broadly optimistic about debate, but not on this task, and not in this time-limited, discussion-limited setting, and we’re doing a broader, more fail-fast-style search of other settings. Stay tuned for more methods and datasets.

Sam Bowman 31 Mar 2023 23:23 UTC
LW: 4 AF: 4
2
AF
in reply to: evhub’s comment on: Towards understanding-based safety evaluations
Assuming we’re working with near-frontier models (s.t., the cost of training them once is near the limit of what any institution can afford), we presumably can’t actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal?

(Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)

Sam Bowman 17 Feb 2023 0:45 UTC
4 points
0
in reply to: Dunning K.’s comment on: Qualities that alignment mentors value in junior researchers
I mostly agree, but it’s messy. I don’t think it’s obvious that a PhD is anywhere near the ideal way to pick up some of these skills, or that earning a PhD definitely means that you’ve picked them up, but PhD programs do include lots of nudges in these directions, and PhD-holders are going to be much stronger than average at most of this.
In particular, like Johannes said, doing a PhD is notoriously hard on mental health for a number of reasons, even at a more-supportive-than-average lab. So to the extent that PhD programs teach ‘taking care of your mental health’ and ‘staying motivated when you’re lost’, it’s often by throwing you into stressful, confusing work situations without great resources and giving you the degree if you figure out how to navigate them.

Sam Bowman

Survey of NLP Researchers: NLP is contributing to AGI progress; major catastrophe plausible

AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022

Jobs: Help scale up LM alignment research at NYU

NLP Position Paper: When Combatting Hype, Proceed with Caution

Artificial Sandwiching: When can we test scalable alignment protocols without humans?

A Small Negative Result on Debate