I made a few edits to this post today, mostly in response to feedback from Ryan and Richard:
Added 2 sentences emphasizing the point that schemers probably won’t be aware of their terminal goal in most contexts. I thought this was clear from the post already, but apparently it wasn’t.
Modified “What factors affect the likelihood of training-gaming?” to emphasize that “sum of proxies” and “reward-seeker” are points on a spectrum. We might get an in-between model where context-dependent drives conflict with higher goals and sometimes “win” even outside the settings they are well-adapted to. I also added a footnote about this to “Characterizing training-gamers and proxy-aligned models”.
Edited the “core claims” section (e.g. softening a claim and adding content).
Changed “reward seeker” to “training-gamer” in a bunch of places where I was using it to refer to both terminal reward seekers and schemers.
Miscellaneous small changes.
Shouldn’t the second one be 1/√k?
Is this meant to say “last token” instead of “past token”?