That’s one way of putting it, yeah: the band who want to explore a new sound, the hunter who gets sick of eating deer every day, and the LLM with an entropy term in its reward function are all of the same ilk.
J Bostock
Can you say more about why you think this?
You Are Not Immune To Mode Collapse
Baba is you is one I’ve written about.
I had a conversation about this recently, and I raised the point that fieldbuilding suffers from the same issues. When it really comes down to the wire, someone has to carry out the hard task of (being capable of) understanding and verifying the entire alignment stack end-to-end.
This puts governments, AI company CEOs, and Dustin Moskowitz in an awkward position: the person to understand the stack might not be able to make it legible to them (compare to the US Army guys who had to take the atmospheric ignition calculations on the word of the physicists).
Maaaybe you can break it down and entrust the local validity of each step to a different genius: “Terry Tao says the maths checks out, and our compiler engineers all agree on the architecture” or something like that, but I wouldn’t bet the world on it.
This is probably net good, but without any organizational changes I don’t think it means much at all. Sam can and does just make a promise, and then “change his mind” as soon as it’s convenient. There’s an angle where this comment pushes OpenAI in a particular direction (and OpenAI has more inertia than Sam alone) and another angle where it makes him look bad when he goes back on this, but I don’t think you can or should read much intent into his words at all other than “I think the next interaction I have will go well if I say that I’m going to cooperate with democracies”.
Legitimacy-importance arbitrage in academia might be an issue.
The standard story for how academia became Like That is something like this.
Grantmakers give out money based on citations
Therefore academics goodhart on citations
Therefore they all pressure each other to cite irrelevant pieces of work during e.g. peer review
Steps 1 and 2 are mostly right, but step 3 isn’t quite right. A lot of the time, academics cite irrelevant work to make their own publications look more important. If your work is in some obscure corner of chlorate chemistry (sorry to chlorate chemists) then you can make it look better by citing some other piece of work which is only tangentially related to yours (say, how a brominated compound has anti-cancer properties (in mice)) then your paper looks better, even if that work is dubious. If your chlorate paper is entirely legitimate but also not that relevant, you’re lending credibility to the brominated mouse-cancer guys. There’s an arbitrage opportunity where one group makes dubious but cool claims, and another makes solid but boring claims, and the two of them cite each other to give a false impression that the field is both cool and solid.
If a “bad faith” negotiator is one who (wrongly) represents themselves as willing to compromise, then surely we could call a debater who (wrongly) represents themselves as willing to change their mind a “bad faith” debater.
Yes, and I think I see your point: if you replace God with an asteroid that’s on a 50:50 collision course with Earth, or something like that, then the mugging is still a mugging.
I didn’t mean to imply you weren’t thinking about better monitoring! I’m aware that your team is thinking a lot about it. I was stating the current state of the deployment as I saw it, and the justifications for why the current system is safe (which I think it currently is) that I have heard, and why those justifications are not as locally valid as I would like.
(I am also sorry at this point for picking on OpenAI so much for being the only people to release details of a monitoring setup, and would like to make clear that I believe Anthropic and GDM could be doing better on that front, because they’re not publishing as many details, and XAI and Meta and Moonshot and Deepseek and Z and all the other companies without any monitoring should do some bloody monitoring)
I dunno man, if I were in the specific situation you described, I might just hand over my wallet. If I’m ever in a situation that crazy, it would mean I’ve gone far, far outside what my existing priors are capable of reasoning about. Sure, hand over the wallet, what the hell. Maybe it’s also God doing some kind of test. Probably the whole thing is a simulated test of character. Am I a human in this scenario? Could I tell whether I was a human or a persona being simulated in superposition by a particularly large LLM?
Scheming Defences Aren’t Just for Schemers
Both Anthropic and OpenAI have put in place (or are thinking about putting in place) some kind of monitoring for internal coding agents. Neither company has set up their monitoring to be scheming-proof. When I’ve spoken to people working on this, they (correctly, IMO) argue that current models aren’t likely to be scheming.
(In particular, OpenAI appears to be
thinking aboutEDIT: using [poor word choice on my part, they are thinking about better methods] untrusted-only monitoring without much collusion busting, where one instance of a model audits the outputs of another instance of itself.)But, there are other weird things a model might do that don’t qualify as “scheming” which you might still want to defend against! For example, reward hacking/fudging critical safety infrastructure is extremely bad, yet well within the remit of bad behaviour we might expect. Would this cause a model to get caught? I expect so, but who knows!
One vector for monitoring failure might be a kind of persuasion; if a model wants to reward hack, it might justify to itself—using some kind of motivated reasoning—why the hack is actually the correct action to take. Another copy of that model, reading the chain-of-thought, might then agree with the first one! (this is extremely common in motivated reasoning amongst humans as well) This would kinda fall under the broader umbrella of collusion in untrusted monitoring, and would definitely be mitigated by certain signal jamming techniques (though not all).
The above example is illustrative of the more general point that you can’t get safety by ruling out a single worst-case scenario (scheming) then only mitigating for one other specific scenario (whatever they are planning for here, I think mostly model mistakes and low-stakes deception and occasional destruction) because there are other scenarios you haven’t thought of!
Does anyone have an explanation as to why Mage: The Ascension is so strongly correlated with nuts behaviour in the rationalist community? Is it some kind of founder effect where an early group of M:TA players spread it and also happened to spread a bunch of insanity? Is it an affinity thing, where the specific lore of M:TA appeals strongly to otherwise-unstable people (like how of course the kind of person who commits a mass shooting will be a Call of Duty obsessive, and this fact says very little about the causal effect of CoD)? Or does M:TA actually shred one’s mind in some bizarre way?
(For context, I saw it mentioned here, and iiuc it was central to Brent Dill’s abusive group as well. For further context I’ve never interacted with any of these people nor heard of M:TA in any other situation than rat-adjacent groups.)
fixed, thanks. out of interest did that link not work, or did it actually give you magic access to edit my other post?
Point 9 makes me feel much better about my (conscious but only semi-intentional) shift towards more zesty titles and intros over time. Some part of me felt a little dirty about writing less dry titles, like it was clickbait or something.
“Do Not Start Arguments You Cannot Finish”
I think what you describe is mostly a high-status (dominance or prestige) phenomenon rather than a high-likeability phenomenon. But I will admit I’m rather going off instinct and intuition here.
You’ll get better signal-to-noise in convincing people of true things (compared to false things) the more of an argument you’re able to get them to listen to (up to a point where you overwhelm their argument processing capacity) which is easier to do if they like you enough to listen to you.
My flat came with “energy saving” bulbs which were fluorescent (even worse than LEDs) and capped at 13W. I couldn’t even plug in 18W fluorescent ones into the socket, because the plastic shape is different (even though the electrical connections are the same). I put “energy saving” in scare quotes because they’re in large party energy saving by just being dark. This led me to suffer quite bad SAD for the first and only time in my life. It took until the second winter to figure out that this was what was going on and put in some high-power LED bulbs. Now I’ll have to check for CRI!
Story from my past: at university, I once partook of a game called the “assassins’ guild”. It was a kind of Battle Royale. Fifty-odd participants would each be circularly assigned two “targets” from the other forty-nine, and instructed to “kill” them (for example, by writing “knife” on a stick and poking them with it, or by shooting them with a nerf gun). You’d be told their halls of residence, so you could find them there, if needed.
Your targets were revealed at 09:00 on the first day. I found my target’s Facebook page, found a post announcing her going to uni, and saw she was studying a subject which shared a module with my subject. From there I was able to pull up the timetable of her subject, guess which lectures were mandatory, and notice that she had a mandatory one right now. I “killed” her at 10:00 as she left the lecture hall.
I didn’t even get the first kill! Someone else pulled off the same trick even quicker than me!
This is with a smart uni student’s level of skill: the only thing it took was effort. If AI is good enough to do this, then privacy removal will be very easy to perform at scale.