I think it’s possible that an AI will decide not to sandbag (e.g. on alignment research tasks), even if all of the following are true:
Goal-guarding is easy
The AI is a schemer (see here for my model of how that works)
Sandbagging would benefit the AI’s long-term goals
The deployer has taken no countermeasures whatsoever
The reason is as follows:
Even a perfect training-gamer will have context-specific heuristics which sometimes override explicit reasoning about how to get reward (as I argued here).
On the training distribution, that override will happen at the “correct” times for getting maximum reward. But sandbagging in deployment is off the training distribution, so it’s a question of generalization.
Since sandbagging is the sort of thing that would get low reward in the most similar training contexts, it seems pretty plausible that the AI’s context-specific “perform well” drives will override its long-term plans in this case.
This seems less likely the harder the problem is, i.e. the more the AI needs to use its general intelligence or agency to pursue it; those are often the sorts of tasks we’re most scared about the AI doing surprisingly well on.
I agree this argument suggests we will have a good understanding of the simpler capabilities the model has, like what facts about biology it knows, which may end up being useful anyway.
On top of what Garrett said, reflection also pushes against this pretty hard. An AI that has gone through a few situations where it has acted against its own goals because of “context-specific heuristics” will be motivated to remove those heuristics, if that is an available option.
It’s sad that agentfoundations.org links no longer work, leading to broken links in many decision theory posts (e.g. here and here)
Oh, hmm, this seems like a bug on our side. I definitely set up a redirect a while ago that should make those links work. My guess is something broke in the last few months.
Thanks for the heads up. Example broken link (https://agentfoundations.org/item?id=32) currently redirects to the broken https://www.alignmentforum.org/item?id=32; it should redirect further to https://www.alignmentforum.org/posts/5bd75cc58225bf0670374e7d/exploiting-edt (Exploiting EDT[1]), archive.today snapshot.
Edit 14 Oct: It works now, even for links to comments, thanks LW team!
LW confusingly replaces the link to www.alignmentforum.org given in Markdown comment source text with a link to www.lesswrong.com when displaying the comment on LW.
A framing I wrote up for a debate about “alignment tax” (a rough numeric sketch follows below):
“Alignment isn’t solved” regimes:
1a. Nobody knows how to make an AI which is {safe, general, and broadly superhuman}, with any non-astronomical amount of compute
1b. We know how to make an aligned AGI with 2 to 25 OOMs more compute than making an unaligned one
“Alignment tax” regimes:
2a. We can make an aligned AGI, but it requires a compute overhead in the range of 1% to 100x. Furthermore, the situation remains multipolar and competitive for a while.
2b. The alignment tax is <0.001%, so it’s not a concern.
2c. The leading coalition is further ahead than the alignment tax amount, and can and will execute a pivotal act, thus ending the risk period and rendering the alignment tax irrelevant.
A person whose mainline is {1a --> 1b --> 2b or 2c} might say “alignment is unsolved, solving it is mostly a discrete thing, and alignment taxes and multipolar incentives aren’t central”
Whereas someone who thinks we’re already in 2a might say “alignment isn’t hard, the problem is incentives and competitiveness”
Someone whose mainline is {1a --> 2a} might say “We need to both ‘solve alignment at all’ AND either get the tax to be really low or do coordination. Both are hard, and both are necessary.”
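To make the ranges above concrete, here is a minimal numeric sketch. It is my own illustration rather than part of the original framing, and every number in it is made up; it just converts the stated tax figures into compute multipliers and checks the regime-2c condition that the leading coalition’s effective-compute lead exceeds the tax.

```python
# Rough numeric sketch of the regimes above (all numbers are illustrative, not claims).

def ooms_to_multiplier(ooms: float) -> float:
    """An alignment tax of `ooms` orders of magnitude, as a compute multiplier."""
    return 10.0 ** ooms

def overhead_to_multiplier(overhead: float) -> float:
    """A fractional compute overhead (0.01 = 1% extra) as a compute multiplier."""
    return 1.0 + overhead

def leader_covers_tax(lead_multiplier: float, tax_multiplier: float) -> bool:
    """Regime 2c's condition: the leading coalition's effective-compute lead
    over the runner-up exceeds the alignment tax."""
    return lead_multiplier > tax_multiplier

# Regime 1b: a tax of 2 to 25 OOMs means 100x to 1e25x extra compute.
print(ooms_to_multiplier(2), ooms_to_multiplier(25))                 # 100.0 1e+25

# Regime 2a: a 1% overhead is a 1.01x multiplier; "100x" is a 100.0x multiplier.
print(overhead_to_multiplier(0.01), overhead_to_multiplier(99.0))    # 1.01 100.0

# Regime 2c with made-up numbers: a 30x effective-compute lead covers a 10x tax.
print(leader_covers_tax(lead_multiplier=30.0, tax_multiplier=10.0))  # True
```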
Results on logarithmic utility and stock market leverage: https://www.lesswrong.com/posts/DMxe4XKXnjyMEAAGw/the-geometric-expectation?commentId=yuRie8APN8ibFmRJD
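Since the linked comment concerns logarithmic utility and leverage, here is a minimal sketch of the underlying point, using a toy two-outcome return model of my own (not taken from the linked post): expected log growth, unlike arithmetic expected return, is maximized at a finite amount of leverage.

```python
import numpy as np

# Toy model (made-up numbers): each period the risky asset gains 40% or loses 20%
# with equal probability; the risk-free rate is 0%.
returns = np.array([0.4, -0.2])
probs = np.array([0.5, 0.5])

def expected_log_growth(leverage: float) -> float:
    """Expected log of the per-period wealth multiplier at a given leverage."""
    multipliers = 1.0 + leverage * returns
    if np.any(multipliers <= 0):       # leverage high enough to risk ruin
        return -np.inf
    return float(np.sum(probs * np.log(multipliers)))

leverages = np.linspace(0.0, 4.0, 401)
growth = [expected_log_growth(k) for k in leverages]
best = leverages[int(np.argmax(growth))]

# Arithmetic expected return keeps rising with leverage, but expected log growth
# peaks at a finite (Kelly-style) leverage and then falls off.
print(f"log-optimal leverage ~ {best:.2f}")   # ~1.25 for these toy numbers
```

In the continuous-time limit the same criterion gives the familiar log-optimal leverage of roughly (μ − r)/σ² for an asset with drift μ and volatility σ against a risk-free rate r.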