Former safety researcher & TPM at OpenAI, 2020-24
https://www.linkedin.com/in/sjgadler
stevenadler.substack.com
My sense is that this is true until you have a small child you want to move around, and then it’s super super annoying to not have your car seat already installed for them and have other supplies on-hand
I also like that so much of the relevant information stays nested under the post on LessWrong, whereas on Twitter I find it much harder to systematically read various offshoot threads of a tweet
Thanks for synthesizing this, and to Eliezer for researching and explaining the various empirical examples, which I find very helpful (as I did in IABIED).
One thing that I think might be getting lost in conversation, and that the startup examples make clear: I think talking about these problems as “one-chance” is more confusing than is needed.
Talking about irretrievability is one good improvement, but I think irreversibility is also a natural concept here, which I’d like to see more present?
I’d center more the idea that yeah you can try again, but you can’t undo the effects of the previous try, and the accumulation of those effects might make it substantially harder (if not impossible) for you to succeed.
“What do you mean I only get one try at building this startup?” Well, you’re welcome to keep going, but if you’ve depleted your capital you’ll have a hard time getting it back. If you’ve damaged your reputation with investors, customers, etc, it will be hard to wipe the slate clean. The world changed from your previous missteps along the way, as it would if we trained a powerful AI system that turned out to be adversarial to us.
Similarly, yeah France can mount a resistance after Germany has breached their borders, but now France needs to accomplish an even harder task to drive them out.
I apologize if these points have already been made and I missed them; I did skim much more aggressively starting a bit into “On the extraordinary efforts put forth to misinterpret the idea of oneshotness.”
Thank you! I’m surprised that didn’t come up in my Claude-searching, very clearly related & I’m excited to read it
Thank you! That’s very helpful. I didn’t realize there would be official transcripts made public thereafter & I agree that seems like a fine remedy to ‘people want to know what was said’
Fwiw I find this kind of ambiguous?
I agree that downloading is probably a form of capture, though it’s not clear to me whether download is a form of recording. And relatedly it’s not clear to me if the second sentence is an unpacking of the first sentence (no capture—i.e., recording) or an independent second sentence (no capture, also no recording).
Maybe I’m just being dense! Anyway, I think it’s unclear enough that I’d not recommend that people do it.
If AI companies only do control-style monitoring of their internal deployments, they still need mitigations against their internal code getting subverted by their externally deployed AI.
For instance, if you only monitor your AI’s coding activities within the organization, you are vulnerable to an externally-deployed copy potentially supply-chain attacking itself via the outside world, such as by introducing vulnerabilities to packages that it believes are relied upon internally.
Then if the lab is not careful enough about what external code they rely upon, this is a way to get vulnerabilities into internal code without the coding agent writing them inside of the monitored channel.
One defense we have at our disposal: Even if the internal AI isn’t writing the vulnerabilities, it might still think about whether the vulns have been placed yet or generally think about the forthcoming schemes.
And so the scariest form is something like ‘the internal AI has an unvocalized policy like “let the external AI handle things,”’ where ‘CoT monitoring of the internal AI’ might not work.
Since the audio is available on YouTube, presumably the full audio could be transcribed, yeah? Is someone doing this already?
Edit: I feel confused about the legality of this in California? It seems like recording or rebroadcasting the stream would not be permitted by the court.
“Professional red-teamers” are mentioned in the Opus 4.7 System Card (e.g., Section 5.2); I’m unsure, are these just ‘people who are paid for their red-teaming efforts’?
This phrasing conveys something stronger to me, like people who are elite at red-teaming, and who probably do it full-time for an elite professionalized company with high training and standards. If it’s just people who ‘are getting paid to red-team,’ this wouldn’t clear my bar for calling them ‘professional red-teamers,’ I think.
That is to say, if it’s just ‘people paid to red-team,’ I worry about AI companies inadvertently giving too strong a sense of the thoroughness of their stress-testing.
For instance one worry I have is that even paid red-teamers will not be very creative or persistent in their efforts, relative to what a determined future adversary would be.
Some ideas / thinking aloud:
Do AI companies pay their red-teamers in a bounty-like way? If not, maybe they should (aligns incentives w/ breaking the system and finding things)
Do we have data on how red-teamer efficacy changes as they gain more experience?
What about comparing red-teamers to the efficacy of today’s own models at generating jailbreaks? (I think there’s some research on the latter but with older models?)
I wonder how many of the Mythos vulnerabilities / exploits had already been discovered by eg the NSA.
Don’t get me wrong; I still find the discoveries very impressive and frightening. It does also feel different than ‘no human discovered this over X years’ though, because we shouldn’t expect to hear from some of the actors who were most motivated and most capable of finding these. e.g., if the NSA was aware of these, I still wouldn’t expect them to say so.
My cached impression from reading The Code Book is that the intelligence community often won’t disclose that they’d known something, even if that fact has become public.
For instance, RSA encryption, which notably stands for Rivest-Shamir-Adleman, was described by them in 1977, but seems to have been separately invented in 1973 within the British government, by someone named Clifford Cocks. But his first-but-nonpublic-invention wasn’t acknowledged until 1997.
Re: cognitive empathy, yeah my general take was ‘I feel like there’s a sphere of social skills missing?’
Right now, it feels to me like design is maybe merging two different subskills: goal-oriented understanding of others, and also something like ‘how does laying out different objects --> goal-achievement.’
I guess I’m wondering, where would something like a coach or therapist go in this taxonomy? Some hybrid of management and design?
Nice nice, yeah I’m sure people could design something clever here & I’d be excited to learn if people are working on it (not a good fit for me directly)
Yeah that’s fair that you could probably obscure the signal—maybe this is just FUD from me. But two things I was imagining:
Absolute volumes and trends over time (how much AI labor are they using)
Relative volumes at certain times/days (tells you something about how hands-on a role the human employees need to be playing? based on how much it dips when they aren’t online)
Also just a general concern of ‘you might be leaking information that you didn’t anticipate, and so it’s more secure to say less’
Also, you’d be creating a pretty valuable target if certain hash-functions do eventually fall (think how much the companies want to avoid distillation today), but maybe that’s such a bad scenario that it’s not worth accounting for?
jfyi @Buck—struggled to tag in the initial quicktake
I’ve been thinking more about ‘If you caught your AI red-handed on scheming, could you actually convince others that it was real and to take it seriously?’ Buck Shlegeris wrote about this idea here: https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai
One intervention that seems robustly helpful is keeping tamper-proof / tamper-evident logs, so you can demonstrate that a certain incident is in fact the actual and complete trajectory of your AI, that you are fully representing the inputs and context presented to it, etc.
But I suspect that even the reveal of a matching, tamper-evident log of the trajectory might not be enough, unless you declared in advance that you were doing this and had a third-party verify that your logging system works the ways you claim it does.
One way to increase the credibility of the logs might be for AI companies to publish hashes of their internal trajectories in close-to-real-time. Then, if a trajectory is later found to be shady, the earlier publication of its hash gives the authenticity a bit more weight—though even that might be accused of patiently seeding a false trajectory.
To be clear, I’m not proposing that AI companies should publish hashes of their trajectories, per se, in part because it might reveal other information they don’t necessarily want to disclose.
But I do want to gesture at the mechanism that’s important to think through: ‘How do you prove that what went in and came out of your system is in fact exactly as you are representing?’
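To make the mechanism concrete, here is a minimal sketch of a hash-chained log, assuming each trajectory step can be serialized as JSON. The names (entry_hash, build_chain, verify_chain, GENESIS) and the record fields are hypothetical, chosen only for illustration; this is not how any AI company actually logs trajectories. Each published hash commits to the current record and to everything before it, so editing, dropping, or reordering an earlier record changes every later hash.

```python
import hashlib
import json

# Hypothetical starting value for the chain (an assumption for this sketch).
GENESIS = "0" * 64


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash one record together with the hash of the previous entry."""
    # Canonical JSON so the same record always serializes to the same bytes.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    data = prev_hash.encode("ascii") + payload.encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def build_chain(records: list) -> list:
    """Return the running chain hashes, one per record, in order."""
    hashes = []
    prev = GENESIS
    for record in records:
        prev = entry_hash(prev, record)
        hashes.append(prev)
    return hashes


def verify_chain(records: list, published_hashes: list) -> bool:
    """Check that the revealed records reproduce the earlier-published hashes."""
    return build_chain(records) == list(published_hashes)


# Usage: publish `hashes` as the trajectory runs; reveal `trajectory` later.
trajectory = [
    {"step": 0, "role": "input", "text": "example prompt"},
    {"step": 1, "role": "output", "text": "example completion"},
]
hashes = build_chain(trajectory)
print(verify_chain(trajectory, hashes))
```

A check like this only shows that the revealed records match what was committed to earlier. It does not show that the committed records were honest or complete in the first place, which is why declaring the scheme in advance and having a third party verify it still matters.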
Is “e.g.,” really that uncommon? Dang, that’s how I write it naturally
I feel like I’ve heard this argument yes, though when I read lots of Anthropic’s ‘race to the top’ language, it’s not quite that
Here’s an example that feels borderline to me:
Dario Amodei: “Where the world needs to get [is]… from “this technology doesn’t exist” to “the technology exists in a very powerful way and society has actually managed it.” And I think the only way that’s gonna happen is that if you have, at the level of a single company, and eventually at the level of the industry, you’re actually confronting those trade-offs. You have to find a way to actually be competitive, to actually lead the industry in some cases, and yet manage to do things safely. And if you can do that, the gravitational pull you exert is so great. There’s so many factors—from the regulatory environment, to the kinds of people who want to work at different places, to, even sometimes, the views of customers that kind of drive in the direction of: if you can show that you can do well on safety without sacrificing competitiveness—right—if you can find these kinds of win-wins, then others are incentivized to do the same thing.”
I’ve written a semi-related piece before (https://www.clear-eyed.ai/p/dont-rely-on-a-race-to-the-top), but I think yours would be different enough that it could still make sense
Thanks for writing this up. Re: your question, on “It has no system card” I think this could be clearer that there is a SC but that it doesn’t cover the Pro version.
I found this clear enough overall in the post, to be clear! But I think I’d have misunderstood from the first few sentences if I didn’t already have context from knowing there was some SC released for 5.4
I believe they aren’t taking more witnesses unfortunately :/
Fwiw, if a direction is simply describable, I do think I’m faster to tell Claude Code to do it (via Wispr Flow) than to mechanically do it. I also feel like there are some nice batching properties of relaying to it all-at-once the changes I want, without having to switch myself from conceptual thinking to mechanical operation and then back again, but I’m less sure of that