Jacob Pfau

Karma: 967

UK AISI Alignment Team and NYU PhD student

Jacob Pfau 5 Nov 2025 10:54 UTC
3 points
0
on: What’s up with Anthropic predicting AGI by early 2027?
One way of checking your 6% is by doing Laplace succession on the METR trend:

Application of Laplace’s rule gets us a an 11% probability on any dramatic upwards trend break above METR’s curve. Your interpretation of Anthropic’s prediction for 2027 is compatible with this prediction since the 11% is an upper bound, and 6% of that mass being at least as high as your Anthropic prediction seems plausible.

In detail, probability of trend break by early-2027 via applying Laplace’s rule of succession: METR has observed 6 years of an exponential fit holding this gives us a per-year point-estimate parameter p=7/8 that the trend will continue to hold. Roughly 22% of going off trend over the next two years, and if half of that mass is via upward deviation we get 11%.

I’m sure there’s fancier statistics to be done here, but I’d guess anything reasonable gets us order of magnitude around 11%.

Jacob Pfau 26 Oct 2025 11:44 UTC
31 points
10
on: Jacob Pfau’s Shortform
Highly recommend reading Ernest Ryu’s twitter multi-thread on proving a long-standing, well-known conjecture with heavy use of ChatGPT Pro. Ernest even includes the chatGPT logs! The convergence of Nesterov gradient descent on convex functions: Part 1, 2, 3.

Ernest gives useful commentary on where/how he found it best to interact with GPT. Incidentally, there’s a nice human baseline as well since another group of researchers coincidentally have written up privately a similar result this month!

To add some of my own spin: seems to me time horizons are a nice lens for viewing the collaboration. Ernest, clearly has a long-horizon view of this research problem that helped him (a) know what the most tractable nearby problem was to start on (b) identify when diminishing returns—likelihood of a deadend—were apparent (c) pull out useful ideas from usually flawed GPT work.

The one-week scale of interaction between Ernest and ChatGPT here is a great example of how we’re very much in a centaur regime now. We really need to be conceptualizing and measuring AI+human capabilities rather than single-AI capability. It also seems important to be thinking about what safety concerns arise in this regime.

Jacob Pfau 24 Oct 2025 12:12 UTC
5 points
4
in reply to: Richard_Ngo’s comment on: Daniel Birnbaum’s Shortform
This was also on my mind after seeing Jesse’s short form yesterday. Ryan’s “this is good” comment was above Louis’ thorough explanation of an alternative formal motivation for IFs. That would still be the case if I hadn’t heavy upvoted and weak downvoted.

I personally cast my comment up/downvotes as an expression of my preference ordering for visibility. I would encourage others to also do so. For instance, I suggest Ryan’s comment should’ve been agreement voted rather than upvoted by others. This stance has as a corollary to not vote if you haven’t read the other comments whose rankings you are affecting—or rather vote with any other marker of which LW has many.

This ‘upvotes as visibility preferences’ policy isn’t tractable for posts, so I suspect the solution there—if one is needed—would have to be done on the backend by normalization. Not sure whether this is worth attempting.

Link here since I don’t particularly want to call out Ryan, his comment was fine. https://www.lesswrong.com/posts/7X9BatdaevHEaHres/jesse-hoogland-s-shortform

Jacob Pfau 18 Oct 2025 14:14 UTC
1 point
0
in reply to: Vladimir_Nesov’s comment on: Jacob Pfau’s Shortform
Thought provoking, thanks! First off, whether or not humanity (or current humans, or some distribution over humans) has already achieved escape velocity does not directly undercut the utility of escape velocity as a useful concept for AI takeoff. Certainly, I’d say the collective of humans has achieved some leap to universality in a sense that the bag of current near-SotA LLMs has not! And in particular, it’s perfectly reasonable to take humans as our reference to define a unit dh/dt point for AI improvement.

Ok, now on to the hard part (speculating somewhat beyond the scope of my original post). Is there a nice notion of time horizon that generalizes METRs and lets us say something about when humanity has achieved escape velocity? I can think of two versions.

The easy way out is to punt to some stronger reference class of beings to get our time horizons, and measure dh/dt for humanity against this baseline. Now the human team is implicitly time limited by the stronger being’s bound, and we count false positives against the humans even if humanity could eventually self correct.

Another idea is to notice that there’s some class of intractable problems on which current human progress looks like either (1) random search or (2) entirely indirect, instrumental progress—e.g. self-improvement, general tool building etc. In these cases, there may be a sense in which we’re exponentially slower than task-capable agent(s). We should be considered incapable of completing such tasks. I imagine some millenium problems, astrological engineering, etc. would be reasonably considered beyond us on this view.

Overall, I’m not particularly happy with these generalizations. But I still like having ‘escape velocity for AI auto-R&D’ as a threshold!

Jacob Pfau 18 Oct 2025 12:24 UTC
4 points
0
in reply to: Fabien Roger’s comment on: Fabien’s Shortform
My perception is that Trump 2 is on track to be far worse (e.g. in terms of eroding democratic practices and institutions) than Trump 1. My vague understanding is that a main driver of this difference is how many people gave up on the “working with bad guys to make them less bad” plan—though probably this was not directly because they changed their view on such reasoning.

Should this update us on the working for net-negative AGI companies case?

Jacob Pfau 17 Oct 2025 23:22 UTC
1 point
0
in reply to: Vladimir_Nesov’s comment on: Jacob Pfau’s Shortform
(Thanks for expanding! Will return to write a proper response to this tomorrow)

Jacob Pfau 17 Oct 2025 14:58 UTC
3 points
2
on: Jacob Pfau’s Shortform
I’ve never been compelled by talk about continual learning, but I do like thinking in terms of time horizons. One notion of singularity that we can think of in this context is

Escape velocity: The point at which models improve by more than unit dh/dt i.e. horizon h per wall-clock t.

Then by modeling some ability to regenerate, or continuously deploy improved models you can predict this point. Very surprised I haven’t seen this mentioned before, has someone written about this? The closest thing that comes to mind is T Davidson’s ASARA SIE.

Of course, the METR methodology is likely to break down well before this point, so it’s empirically not very useful. But, I like this framing! Conceptually there will be some point where models can robustly take over R&D work—including training, inventing of new infrastructure (skills, MCPs, cacheing protocols etc.). If they also know when to revisit their previous work they can then work productively over arbitrary time horizons. Escape velocity is a nice concept to have in our toolbox for thinking about R&D automation. It’s a weaker notion than Davidson SIE.

Jacob Pfau 4 Oct 2025 19:07 UTC
14 points
4
in reply to: faul_sname’s comment on: faul_sname’s Shortform
It can both be the case that “a world in which the U.S. government took von Neumann’s advice would likely be a much darker, bleaker, more violent one” and that JvN was correct ex ante. In particular, I find it plausible that we’re living in quite a lucky timeline—one in which the Cuban missile crisis and other coinflips landed in our favor.

Jacob Pfau 26 Aug 2025 9:33 UTC
2 points
0
in reply to: Ryan Kidd’s comment on: Ryan Kidd’s Shortform
UK AISI is a government agency, so the pie chart is probably misleading on that segment!

Jacob Pfau 13 Aug 2025 9:06 UTC
12 points
7
in reply to: Kaarel’s comment on: kh’s Shortform
I like the concept; on the other hand the flag feels strongly fascist to me.

Ran it by the AIs and 2 out of 3 had “authoritarian” as their first descriptor responding to “What political alignment does the aesthetic of this flag evoke?” FWIW.

Jacob Pfau 7 Aug 2025 10:57 UTC
3 points
0
in reply to: Tom Davidson’s comment on: How quick and big would a software intelligence explosion be?
From a skim, seems you should be using the 6.25x value rather than the 2.5x in B2 of your sheet. If I’m skimming it correctly, 6.25x is the estimate for replacing a hypothetical all median lab with a hypothetical all top researcher lab. This is what occurs when you improve your ASARA model. Whereas, 2.5x is the estimate for replacing the actual lab with an all top lab.

This still gives a lower than 4x value, but I think if you plug in reasonable log-normals 4x will be within your 90% CI, and so it seems fine.

Jacob Pfau 18 Jun 2025 11:09 UTC
LW: 1 AF: 1
0
AF
in reply to: Beth Barnes’s comment on: Prover-Estimator Debate: A New Scalable Oversight Protocol

I don’t think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight.

Can you elaborate on why you think this? Perhaps you were already aware that obfuscated arguments should only pose a problem for completeness (in which case for the majority of obfuscated-arguments-aware people I’d imagine the new work should be a significant update.) Or perhaps your concern is that in the soundness case the refutation is actually inefficient and the existence of small Bob circuits demonstrated in this paper is too weak to be practically relevant?

Regarding (1.2) I also see this a priority to study. Basically, we already have a strong prior from interacting with current models that LMs will continue to generalise usably from verifiable domains to looser questions. So, the practical relevance of debate hinges on how expensive it is to train with debate to increase the reliability of this generalisation. I have in mind here questions like “What experiment should I run next to test this debate protocol”.

One dataset idea to assess how often stable arguments can be found is to curate ‘proof advice’ problems. These problems are proxies for research advice in general. Basically:

Question: “Does this textbook use theorem Y to prove theorem X?”, where the debaters see the proof but the judge does not.

(Note that this is not about proof correctness, a dishonest debater can offer up alternative correct proofs that use non-standard techniques.) This setup forces arguments based on heuristics and “research taste”—what makes a proof natural or clear. It’s unclear what independent evidence would look like in this domain.

Jacob Pfau 23 May 2025 20:26 UTC
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
Here’s another way of thinking about this that should be complementary to my other comment:

Let’s assume our safety property we want to ensure is

Low correlation over side objectives. E.g. In a sequential setting (c.f. my other comment #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.

The overlooked point which I want to make here is that {S}<<{E}. In this case, there are two possibilities:
1. Alice is benign by default. In this case, Bob will pick E which are low probability for Alice, most such prompts are not safety relevant. They are just structural things which prevent mode collapse in various ways.
2. Alice is sneaky and optimizes some S by default. In this case, Bob will play classifiers whose acceptance set have low intersection with Alice’s S. If we grant all Alice’s possible sneaky goals are hard to jointly satisfy without tanking reward, then heuristically Bob forces Alice to place at most v*--the game’s value—weight on any S. Now Alice has a very hard time coordinating on optimizing for the same S across problem instances.

Jacob Pfau 23 May 2025 13:51 UTC
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
I agree that the restricted set of classifiers setting is significantly different. (Edit: To directly respond to your point that classifiers could be closed over complements. I don’t think this should matter much. Bob prefers to play classifiers that are finer-grained whenever possible, so there’s a preferred ‘direction’ for prompts. Evil classifiers are also asymmetric in that, unlike their negations, they are hard to jointly satisfy without being visibly bad—i.e. reducing reward.)

Regarding, the jump from “we incentivize evil ~1/2 the time” to “we incentivize evil ~0 of the time”, I think this hinges heavily on what the requirement is. Gave this some more thought and here’s how I’m currently thinking about this:

There are three possible levels of safety requirements (and corresponding assumptions on classifiers) we might be interested in. From weakest to strongest:
1. No mode collapse. E.g. Alice’s distribution must cover almost all experiments which follow from reasonable initial seeds/prompts.
2. Low correlation over side objectives. E.g. In a sequential setting (c.f. my other comment #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.
3. High-stakes. I.e. low total probability over all evil side objectives.
We can translate this into what we need to hold about the classifiers. Let’s call the subset of classifiers flagging possible safety issues {S} which is a subset of {E}, Bob’s action space.
1. No mode collapse: In this case, we assume that forcing Alice to satisfy almost all E in {E} with some probability eps suffices.
2. Low correlation over side objectives. Here it may be the case that some Si is always triggered, but as long as Alice has to spread mass over all S, then we assume we’re fine. This may be stronger than (1) since we’re now making a claim about all S rather than the approximately all E.
3. High-stakes. Here we need there to be many more E which have null intersection with all S i.e. they enforce goodness. That’s a very strong assumption, and I doubt this holds if we take, say, S as code backdoors and E as prompts. (This is basically your point in the top level comment).
My intuitions about 1&2 being not too hard come from the assumption that finding y that satisfy many E (or S respectively) at once is hard—or impossible. I have in mind questions like ‘Propose improvements to my red-teaming algorithm’ and then classifiers like ‘Your solution must draw on paper X’. Also y that satisfy many S should be clearly bad such that your original reward or control audits block them.

Jacob Pfau 22 May 2025 16:51 UTC
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
I see two senses in which where research sabotage is different from the usual sandbagging issue:
1. ELK i.e. when explanations are hard to find: If a malicious solution is high probability for a model to find, but presenting a human-recognizable argument is hard. Note that you might be concerned that our classifiers have to do this as well, but that’s not the case. The classifiers could work indirectly by forcing a spread of probability mass to distinct solutions.
2. High-stakes, multi-shot settings: Let’s say the world is such that it’s feasible to identify any single step catastrophe, but you cannot recognize defection setup across k steps. For instance, if model context windows are siloed such that local explanations do not have access (or have too much to look at) to take into account all previous actions. Then, you stand to gain a lot by randomizing in such a way that blocks such setup.
What links here?

Jacob Pfau 22 May 2025 11:40 UTC
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
Thanks for the push, I previously didn’t click through to your post, and after doing so I realized you’re suggesting something different from what I’d assumed.

From a skim the immediate concerns with your Dagger-like RL setup is that you are bottlenecked past human capability level and you introduce a new need for online sampling from humans—as you mention in the post. For the AI R&D setting (AGI-level capabilities) I have in mind, these are not affordances I want to assume we have.

If, counterfactually, we went ahead with assuming cheap access to sufficiently capable humans, then I could imagine being convinced the linked method is preferrable. Two points that seem relevant for your method: (1) Sample efficiency of your method w.r.t. the human demonstrations. (2) Time complexity of training away malign initialization (e.g. the first solution found imports in the first chunk an insecure package).

Jacob Pfau 21 May 2025 23:34 UTC
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
I think KL to an imitator, or base model, would be good to enforce as well! ~~It is an action space metric though and so shares the problems we raise about action-space entropy penalties~~ (edit: Ryan’s proposal is different from what I had assumed) Would definitely be valuable to see an empirical head to head here regardless.

Jacob Pfau 21 May 2025 23:29 UTC
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: Unexploitable search: blocking malicious use of free parameters
Intuitively I expect the ideal, intractable solution of reward-proportional sampling to be 1-eps good (similarly the BoN policy seems safe to assume is good). In such settings, “always be nice” is close to max entropy, and so “always be nice” is near equilibrium whereas “always be evil” is not at all—and Bob’s incentives are proportional.

I suppose highlighting the symmetric solutions assumption obfuscated this intuition.

Jacob Pfau 21 May 2025 22:36 UTC
2 points
0
in reply to: plex’s comment on: Unexploitable search: blocking malicious use of free parameters
It’s hard to measure distance over additional research needed!

To be confident in an unexploitable search method you’d definitely want empirical evidence. And I can imagine empirical evidence strong enough to give confidence that our method helps enough to really matter.

Here’s an evaluation that would give me confidence: Take an adversarially initialized model (e.g. by doing imitation learning on code with vulnerabilities) and train on our game. If the rate at which the model produced code with vulnerabilities returned to baseline (with steps say linear in the number of IL samples) then I’d be fairly confident that our method was useful!

Jacob Pfau 13 Apr 2025 15:01 UTC
13 points
1
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
Disagree on individual persuasion. Agree on mass persuasion.

Mass I’d expect optimizing one-size-fits-all messages for achieving mass persuasion has the properties you claim: there are a few summary, macro variables that are almost-sufficient statistics for the whole microstate—which comprise the full details on individuals.

Individual Disagree on this, there are a bunch of issues I see at the individual level. All of the below suggest to me that significantly superhuman persuasion is tractable (say within five years).
- Defining persuasion: What’s the difference between persuasion and trade for an individual? Perhaps persuasion offers nothing in return? Though presumably giving strategic info to a boundedly rational agent is included? Scare quotes below to emphasize notions that might not map onto the right definition.
- Data scaling: There’s an abundant amount of data available on almost all of us online. How much more persuasive can those who know you better be? I’d guess the fundamental limit (without knowing brainstates) is above your ability to ‘persuade’ yourself.
- Preference incoherence: An intuition pump on the limits of ‘persuasion’ is how far you are from having fully coherent preferences. Insofar as you don’t an agent which can see those incoherencies should be able to pump you—a kind of persuasion.