Elliott Thornley (EJT)
Yeah, good question. I think unfortunately the POST structure by itself doesn’t give us any guarantees here: if spawning subagents looked like a good enough move conditional on each possible trajectory-length, POST-agents would do it. But note a couple of points. First:
Note a nice feature of artificial agents. Making perfect copies is easy: as easy as copying and pasting a long list of parameters. If our original agent makes new agents by copying itself, these new agents will also be neutral and shutdownable. Making imperfect, non-neutral copies is hard. Our original agent would have to find some way to change the parameters of the new agents in a way that makes these agents non-neutral but preserves their capabilities. That would likely involve setting up some training process, in which case our original agent faces its own alignment problem in getting these new agents to do what it wants (Salib, 2024). It will also be much easier for us to watch out for (Greenblatt & Shlegeris, 2024) and train against the production of new non-neutral agents, as compared to us having to watch out for and train against every possible kind of shutdown-resistance.
Second, POST-agents won’t pay costs to shift probability mass away from earlier shutdowns. So if they try to create subagents, they won’t pay any costs to do so covertly. So likely these attempts will be easy for us humans to notice and intervene on. (I go into this in more detail here.)
Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
That’s great to hear! Looking forward to seeing the results.
Wait can you explain the MATS 9 link? I couldn’t find any reference to reward hacking there.
For inoculation pretraining, I’m imagining we’d add data about good-but-reward-hacking AIs to pretraining, but in post-training we’d still try to train AIs to resist the tendency to hack reward, to point out a rival’s attempts to hack reward, and so on. The data about good-but-reward-hacking AIs is just there as a fallback in case we fail and accidentally train AIs to reward hack in post-training.
It’s possible that adding the data about good-but-reward-hacking AIs would nontrivially increase the probability that we fail, but I’m not sure. It seems probable to me that reward hacks are something that AIs will explore their way into, whether or not we add data about good-but-reward-hacking AIs to pretraining.
Maybe we should pretrain on synthetic data about good-but-reward-hacking AIs
I’d try posting on r/Catholicism and emailing some reporters at Catholic outlets.
I’m a bit confused about what “solve agent foundations” means and why it would be necessary for “having a good picture of what we should even be looking for.” Can you say more?
perhaps most actual Catholics aren’t bothered by this
My guess is that a lot of Catholics would be very bothered by this.
Yeah it seems to me like predictive optimization is often a kind of internal selective optimization.
This looks extremely cool and useful.
Coherence arguments give us an initial foothold. The basic insight is that if an agent’s preferences are inconsistent — say, it prefers A to B, B to C, and C to A — a clever adversary can cycle it through trades that each look locally acceptable but leave it strictly worse off, wasting resources for no gain. Any agent that reliably avoids such “dominated strategies” must behave as if it has consistent preferences representable by a utility function.
But I’m surprised to see this being said, and surprised to see that this post isn’t referenced anywhere. I’m biased of course, but I think it makes a true and important point. The post stirred up some controversy at the time (and of course it’s better to talk about the object-level issues), but I take the controversy to have mostly resolved in favor of the point I make. See e.g. this retrospective and Vanessa Kosoy calling the-coherence-argument-as-stated-in-the-quote-above a weak man.
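For readers who haven't seen the money-pump setup in the quoted passage, here is a minimal sketch of just the cyclic-preference exploit. It illustrates only the scenario the quote describes, not the stronger conclusion about utility functions that is in dispute. The preference table, the function names, the fee, and the starting wealth are all hypothetical choices made for illustration.

```python
# Hypothetical cyclic preferences: A over B, B over C, C over A.
# A pair (x, y) means "x is strictly preferred to y".
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def accepts_trade(current, offered):
    """The agent swaps whenever it strictly prefers the offered item."""
    return (offered, current) in PREFERS


def run_money_pump(start, offers, fee, wealth):
    """Offer a sequence of swaps, charging a fee for each accepted one."""
    holding = start
    for offered in offers:
        if accepts_trade(holding, offered):
            holding = offered
            wealth -= fee  # each swap looks locally acceptable to the agent
    return holding, wealth


# One full cycle: the agent ends up holding what it started with, minus fees.
holding, wealth = run_money_pump("A", ["C", "B", "A"], fee=1, wealth=10)
print(holding, wealth)  # A 7
```

Each individual swap moves the agent to something it prefers, yet after three swaps it holds the same item with less money.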
modus tollens Parfit’s ideas about personal identity. I think that’s a worthwhile angle, and more practically useful these days than using Reasons and Persons to dissuade people of egocentricity, which afaict is closer to Parfit’s original goals.
Can you explain a bit more what you mean by this?
Surely not any monetary stakes
The bet still resolves at the same time. The doomer just has one year after resolution to get their bank balance back up from $0 so they can pay the accelerationist back.
accelerationist has no reason to expect the doomer to save any money. And if the doomer does save it (plus enough extra to cover the doubled payback), they’ve effectively just locked up double the original capital until the end of the world
Couldn’t the doomer and accelerationist just agree that the doomer doesn’t have to pay until (e.g.) one year after the bet resolves? Then the doomer could spend all the money in anticipation of doom. If the doomer loses the bet, they can use the year after resolution to earn money to pay the accelerationist back.
(Of course there are extra practical difficulties here: for example, it might be hard for humans to earn money in the future. But I’m just talking about theoretical barriers.)
Okay that’s good to know. I’ve mostly encountered the argument as a reply to individuals worrying that they’re getting Pascal’s-mugged into working on AI safety. In that sort of case,
AI safety can’t be a Pascal’s mugging because p(doom) is high
is invalid, and the premise needed to make it valid --
If p(doom) is high, then p(you can avert doom) is high
-- is way too doubtful to leave implicit.
But if the argument is a reply to people worried that the world/US government is getting Pascal’s-mugged into working on AI safety, then the premise needed to make it valid is
If p(doom) is high, then p(the world/USG can avert doom) is high
and I agree that premise is safe/uncontroversial enough to leave implicit.
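One rough way to make the structure of this argument explicit, as a sketch only: write $\Delta p$ for the amount by which a given actor's effort lowers the probability of doom and $V$ for the value of averting doom. Both symbols are illustrative labels, not notation from the posts under discussion.

```latex
\[
  \mathbb{E}[\text{benefit of acting}] \;\approx\; \Delta p \cdot V,
  \qquad
  \Delta p \;=\; p(\text{doom} \mid \text{no action}) - p(\text{doom} \mid \text{action}),
  \qquad
  0 \;\le\; \Delta p \;\le\; p(\text{doom} \mid \text{no action}).
\]
```

The lower bound assumes that acting does not make doom more likely. On this framing, a high p(doom) only raises the upper bound on $\Delta p$; it does not by itself make $\Delta p$ large. The Pascal's-mugging worry is about a tiny $\Delta p$ paired with a huge $V$, so the implicit premise is what links p(doom) to the $\Delta p$ of the particular actor in question.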
“sure, taking strong actions to reduce risk from misaligned AI would be doable, but isn’t doing this a Pascal’s mugging (implicitly responding to how much people have emphasized the stakes while less so arguing for the risk)”
I don’t really understand what this perspective is saying. Is the idea that people tend to grant the premise ‘If p(doom) is high, then p(you avert doom) is high’? I agree p(doom) being high would be sufficient in that case.
Wait is God flipping the coin load-bearing for the craziness? Because strangers making wild promises isn’t that crazy.
AI safety can be a Pascal’s mugging even if p(doom) is high
Though he could get greater intelligence and more information/understanding about the world without doing any reflection on his values. This seems fairly likely to me. People tend not to be that interested in reflecting on their values. He might even want to lock in his current values, since that’s rational according to his current values.
Derek Parfit wrote up some thoughts along these lines in 1984: