Jacob Pfau

Karma: 877

UK AISI Alignment Team and NYU PhD student

Jacob Pfau May 23, 2025, 8:26 PM
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
Here’s another way of thinking about this that should be complementary to my other comment:

Let’s assume our safety property we want to ensure is

Low correlation over side objectives. E.g. In a sequential setting (c.f. my other comment #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.

The overlooked point which I want to make here is that {S}<<{E}. In this case, there are two possibilities:
1. Alice is benign by default. In this case, Bob will pick E which are low probability for Alice, most such prompts are not safety relevant. They are just structural things which prevent mode collapse in various ways.
2. Alice is sneaky and optimizes some S by default. In this case, Bob will play classifiers whose acceptance set have low intersection with Alice’s S. If we grant all Alice’s possible sneaky goals are hard to jointly satisfy without tanking reward, then heuristically Bob forces Alice to place at most v*--the game’s value—weight on any S. Now Alice has a very hard time coordinating on optimizing for the same S across problem instances.

Jacob Pfau May 23, 2025, 1:51 PM
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
I agree that the restricted set of classifiers setting is significantly different. (Edit: To directly respond to your point that classifiers could be closed over complements. I don’t think this should matter much. Bob prefers to play classifiers that are finer-grained whenever possible, so there’s a preferred ‘direction’ for prompts. Evil classifiers are also asymmetric in that, unlike their negations, they are hard to jointly satisfy without being visibly bad—i.e. reducing reward.)

Regarding, the jump from “we incentivize evil ~1/2 the time” to “we incentivize evil ~0 of the time”, I think this hinges heavily on what the requirement is. Gave this some more thought and here’s how I’m currently thinking about this:

There are three possible levels of safety requirements (and corresponding assumptions on classifiers) we might be interested in. From weakest to strongest:
1. No mode collapse. E.g. Alice’s distribution must cover almost all experiments which follow from reasonable initial seeds/prompts.
2. Low correlation over side objectives. E.g. In a sequential setting (c.f. my other comment #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.
3. High-stakes. I.e. low total probability over all evil side objectives.
We can translate this into what we need to hold about the classifiers. Let’s call the subset of classifiers flagging possible safety issues {S} which is a subset of {E}, Bob’s action space.
1. No mode collapse: In this case, we assume that forcing Alice to satisfy almost all E in {E} with some probability eps suffices.
2. Low correlation over side objectives. Here it may be the case that some Si is always triggered, but as long as Alice has to spread mass over all S, then we assume we’re fine. This may be stronger than (1) since we’re now making a claim about all S rather than the approximately all E.
3. High-stakes. Here we need there to be many more E which have null intersection with all S i.e. they enforce goodness. That’s a very strong assumption, and I doubt this holds if we take, say, S as code backdoors and E as prompts. (This is basically your point in the top level comment).
My intuitions about 1&2 being not too hard come from the assumption that finding y that satisfy many E (or S respectively) at once is hard—or impossible. I have in mind questions like ‘Propose improvements to my red-teaming algorithm’ and then classifiers like ‘Your solution must draw on paper X’. Also y that satisfy many S should be clearly bad such that your original reward or control audits block them.

Jacob Pfau May 22, 2025, 4:51 PM
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
I see two senses in which where research sabotage is different from the usual sandbagging issue:
1. ELK i.e. when explanations are hard to find: If a malicious solution is high probability for a model to find, but presenting a human-recognizable argument is hard. Note that you might be concerned that our classifiers have to do this as well, but that’s not the case. The classifiers could work indirectly by forcing a spread of probability mass to distinct solutions.
2. High-stakes, multi-shot settings: Let’s say the world is such that it’s feasible to identify any single step catastrophe, but you cannot recognize defection setup across k steps. For instance, if model context windows are siloed such that local explanations do not have access (or have too much to look at) to take into account all previous actions. Then, you stand to gain a lot by randomizing in such a way that blocks such setup.
What links here?
- Unexploitable search: blocking malicious use of free parameters by Jacob Pfau (May 21, 2025, 5:23 PM; 34 points)
- Jacob Pfau's comment on Unexploitable search: blocking malicious use of free parameters by Jacob Pfau (May 23, 2025, 1:51 PM; 1 point)

Jacob Pfau May 22, 2025, 11:40 AM
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
Thanks for the push, I previously didn’t click through to your post, and after doing so I realized you’re suggesting something different from what I’d assumed.

From a skim the immediate concerns with your Dagger-like RL setup is that you are bottlenecked past human capability level and you introduce a new need for online sampling from humans—as you mention in the post. For the AI R&D setting (AGI-level capabilities) I have in mind, these are not affordances I want to assume we have.

If, counterfactually, we went ahead with assuming cheap access to sufficiently capable humans, then I could imagine being convinced the linked method is preferrable. Two points that seem relevant for your method: (1) Sample efficiency of your method w.r.t. the human demonstrations. (2) Time complexity of training away malign initialization (e.g. the first solution found imports in the first chunk an insecure package).

Jacob Pfau May 21, 2025, 11:34 PM
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
I think KL to an imitator, or base model, would be good to enforce as well! ~~It is an action space metric though and so shares the problems we raise about action-space entropy penalties~~ (edit: Ryan’s proposal is different from what I had assumed) Would definitely be valuable to see an empirical head to head here regardless.

Jacob Pfau May 21, 2025, 11:29 PM
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
Intuitively I expect the ideal, intractable solution of reward-proportional sampling to be 1-eps good (similarly the BoN policy seems safe to assume is good). In such settings, “always be nice” is close to max entropy, and so “always be nice” is near equilibrium whereas “always be evil” is not at all—and Bob’s incentives are proportional.

I suppose highlighting the symmetric solutions assumption obfuscated this intuition.

Jacob Pfau May 21, 2025, 10:36 PM
2 points
0
in reply to: plex’s comment on: Unexploitable search: blocking malicious use of free parameters
It’s hard to measure distance over additional research needed!

To be confident in an unexploitable search method you’d definitely want empirical evidence. And I can imagine empirical evidence strong enough to give confidence that our method helps enough to really matter.

Here’s an evaluation that would give me confidence: Take an adversarially initialized model (e.g. by doing imitation learning on code with vulnerabilities) and train on our game. If the rate at which the model produced code with vulnerabilities returned to baseline (with steps say linear in the number of IL samples) then I’d be fairly confident that our method was useful!

Unexploitable search: blocking malicious use of free parameters

Jacob Pfau and Geoffrey Irving

May 21, 2025, 5:23 PM

34 points

16 comments6 min readLW link

An alignment safety case sketch based on debate

Marie_DB, Jacob Pfau, Benjamin Hilton and Geoffrey Irving

May 8, 2025, 3:02 PM

55 points

19 comments25 min readLW link

(arxiv.org)

UK AISI’s Alignment Team: Research Agenda

Benjamin Hilton, Jacob Pfau, Marie_DB and Geoffrey Irving

May 7, 2025, 4:33 PM

109 points

2 comments11 min readLW link

Jacob Pfau Apr 13, 2025, 3:01 PM
13 points
1
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
Disagree on individual persuasion. Agree on mass persuasion.

Mass I’d expect optimizing one-size-fits-all messages for achieving mass persuasion has the properties you claim: there are a few summary, macro variables that are almost-sufficient statistics for the whole microstate—which comprise the full details on individuals.

Individual Disagree on this, there are a bunch of issues I see at the individual level. All of the below suggest to me that significantly superhuman persuasion is tractable (say within five years).
- Defining persuasion: What’s the difference between persuasion and trade for an individual? Perhaps persuasion offers nothing in return? Though presumably giving strategic info to a boundedly rational agent is included? Scare quotes below to emphasize notions that might not map onto the right definition.
- Data scaling: There’s an abundant amount of data available on almost all of us online. How much more persuasive can those who know you better be? I’d guess the fundamental limit (without knowing brainstates) is above your ability to ‘persuade’ yourself.
- Preference incoherence: An intuition pump on the limits of ‘persuasion’ is how far you are from having fully coherent preferences. Insofar as you don’t an agent which can see those incoherencies should be able to pump you—a kind of persuasion.

Jacob Pfau Mar 31, 2025, 2:02 PM
1 point
0
in reply to: Daniel Murfet’s comment on: Prospects for Alignment Automation: Interpretability Case Study
Before concretely addressing the oversights you found, perhaps worth mentioning the intuitive picture motivating the pseudo-code. I wanted to make explicit the scientific process which happens between researchers. M1 plays the role of the methods researcher, M2 plays the role of the applications/datasets researcher. The pseudo-code is an attempt to write out crisply in what sense ‘good’ outputs from M1 can pass the test-of-time standing up to more realistic, new applications and baselines developed by M2.

On to the details:

Thanks for working with my questionable notation here! Indeed the uses of I were overloaded here, and I have now (hopefully) clarified by writing v_{I,M} for what was previously I(M). The type signatures I have in mind are that I is code (an interpretability method) and v_{I,M} and I(M) are some efficiently queryable representation of M (circuits, SAE weights, …) useful for downstream tasks.

The sense in which M2 is static does seem important. In fact, I think M2 should have some access to M—it was an oversight that M does not appear as an input to M2. This was why I mentioned sample complexity in a footnote: It seems reasonable to give M2 limited query access to M. Thanks for catching this. In fact, perhaps the scheme could work as originally written where M2 does not have direct access to M, but I’m unsure seems too ‘static’ as you say.

Regarding the appropriateness of the term interpretability to describe the target of this automation process: I agree, the output may not be an interp method in our current sense. Interpretability is the most appropriate term I could come up with. Two features seem important here: (1) white-box parsing of weights is central. (2) The resulting ‘interpretation’ v_{I,M} is usable by a fixed model M2, hence v_{I,M} must be efficiently interface-able without having learned—in weights—the structure of v_{I,M}.

Jacob Pfau Mar 21, 2025, 2:29 PM
3 points
2
on: How far along Metr’s law can AI start automating or helping with alignment research?
To apply METR’s law we should distinguish conceptual alignment work from well-defined alignment work (including empirics and theory on existing conjectures). The METR plot doesn’t tell us anything quantitative about the former.

As for the latter, let’s take interpretability as an example: We can model uncertainty as a distribution over the time-horizon needed for interpretability research e.g. ranging over 40-1000 hours. Then, I get 66% CI of 2027-2030 for open-ended interp research automation—colab here. I’ve written up more details on this in a post here.

Prospects for Alignment Automation: Interpretability Case Study

Jacob Pfau and Geoffrey Irving

Mar 21, 2025, 2:05 PM

32 points

5 comments8 min readLW link

Jacob Pfau Mar 2, 2025, 8:57 PM
1 point
0
in reply to: Owain_Evans’s comment on: Alexander Gietelink Oldenziel’s Shortform
I’d defend a version of claim (1): My understanding is that to a greater extent than anywhere else, top French students wanting to concentrate in STEM subjects must take rigorous math coursework from 18-20. In my one year experience in the French system, I also felt that there was a greater cultural weight and institutionalized preference (via course requirements and choice of content) for theoretical topics in ML compared to US universities.

I know little about ENS, but somewhat doubt that it’s as significantly different of an experience from US/UK counterparts.

Jacob Pfau Jan 6, 2025, 6:58 PM
4 points
0
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform

AI is 90% of their (quality adjusted) useful work force

This is intended to compare to 2023/AI-unassisted humans, correct? Or is there some other way of making this comparison you have in mind?

Jacob Pfau Dec 29, 2024, 4:21 PM
5 points
2
in reply to: L Rudolf L’s comment on: By default, capital will matter more than ever after AGI
I see the command economy point as downstream of a broader trend: as technology accelerates, negative public externalities will increasingly scale and present irreversible threats (x-risks, but also more mundane pollution, errant bio-engineering plague risks etc.). If we condition on our continued existence, there must’ve been some solution to this which would look like either greater government intervention (command economy) or a radical upgrade to the coordination mechanisms in our capitalist system. Relevant to your power entrenchment claim: both of these outcomes involve the curtailment of power exerted by private individuals with large piles of capital.

(Note there are certainly other possible reasons to expect a command economy, and I do not know which reasons were particularly compelling to Daniel)

Jacob Pfau Dec 6, 2024, 6:58 PM
11 points
3
in reply to: johnswentworth’s comment on: johnswentworth’s Shortform
Two guesses on what’s going on with your experiences:
1. You’re asking for code which involves uncommon mathematics/statistics. In this case, progress on scicodebench is probably relevant, and it indeed shows remarkably slow improvement. (Many reasons for this, one relatively easy thing to try is to breakdown the task, forcing the model to write down the appropriate formal reasoning before coding anything. LMs are stubborn about not doing CoT for coding, even when it’s obviously appropriate IME)
2. You are underspecifying your tasks (and maybe your questions are more niche than average), or otherwise prompting poorly, in a way which a human could handle but models are worse at. In this case sitting down with someone doing similar tasks but getting more use out of LMs would likely help.

Jacob Pfau Nov 30, 2024, 7:44 PM
27 points
2
in reply to: habryka’s comment on: (The) Lightcone is nothing without its people: LW + Lighthaven’s big fundraiser
Thanks for these details. These have updated me to be significantly more optimistic about the value of spending on LW infra.
- The LW1.0 dying to no mobile support is an analogous datapoint in favor of having a team ready for 0-5 year future AI integration.
- The head-to-head on the site updated me towards thinking things that I’m not sure are positive (visible footnotes in sidebar, AI glossary, to a lesser extent emoji-reacts) are not a general trend. I will correct my original comment on this.
- While I think the current plans for AI integration (and existing glossary thingy) are not great, I do think there will be predictably much better things to do in 1-2 years and I would want there to be a team with practice ready to go for those. Raemon’s reply below also speaks to this. Actively iterating on integrations while keeping them opt-in (until very clearly net positive) seems like the best course of action to me.

Jacob Pfau Nov 30, 2024, 4:43 PM
57 points
14
on: (The) Lightcone is nothing without its people: LW + Lighthaven’s big fundraiser
I am slightly worried about the rate at which LW is shipping new features. I’m not convinced they are net positive. I see lesswrong as a clear success, but unclear user of the marginal dollar; I see lighthaven as a moderate success and very likely positive to expand at the margin.

The interface has been getting busier^[1] whereas I think the modal reader would benefit from having as few distractions as possible while reading. I don’t think an LLM-enhanced editor would be useful, nor am I excited about additional tutoring functionality.

I am glad to see that people are donating, but I would have preferred this post to carefully signpost the difference between status-quo value of LW (immense) from the marginal value of paying for more features for LW (possibly negative), and from your other enterprises. Probably not worth the trouble, but is it possible to unbundle these for the purposes of donations?

Separately, thank you to the team! My research experience over the past years has benefitted from LW on a daily basis.

EDIT: thanks to Habryka for more details. After comparing to previous site versions I’m more optimistic about the prospects for active work on LW.
1. ↩︎
  (edit) in some places, less busy in others

Jacob Pfau

Un­ex­ploitable search: block­ing mal­i­cious use of free parameters

An al­ign­ment safety case sketch based on debate

UK AISI’s Align­ment Team: Re­search Agenda

Prospects for Align­ment Au­toma­tion: In­ter­pretabil­ity Case Study

Unexploitable search: blocking malicious use of free parameters

An alignment safety case sketch based on debate

UK AISI’s Alignment Team: Research Agenda

Prospects for Alignment Automation: Interpretability Case Study