Former safety researcher & TPM at OpenAI, 2020-24
https://www.linkedin.com/in/sjgadler
stevenadler.substack.com
Re: cognitive empathy, yeah my general take was ‘I feel like there’s a sphere of social skills missing?’
Right now, it feels to me like design is maybe merging two different subskills: goal-oriented understanding of others, and also something like ‘how does laying out different objects --> goal-achievement.’
I guess I’m wondering, where would something like a coach or therapist go in this taxonomy? Some hybrid of management and design?
Nice nice, yeah I’m sure people could design something clever here & I’d be excited to learn if people are working on it (not a good fit for me directly)
Yeah that’s fair that you could probably obscure the signal—maybe this is just FUD from me. But two things I was imagining:
Absolute volumes and trends over time (how much AI labor are they using)
Relative volumes at certain times/days (tells you something about how hands-on a role the human employees need to be playing, based on how much it dips when they aren’t online; see the toy sketch below)
Also just a general concern of ‘you might be leaking information that you didn’t anticipate, and so it’s more secure to say less’
Also, you’d be creating a pretty valuable target if certain hash functions do eventually fall (think how much the companies want to avoid distillation today), but maybe that’s such a bad scenario that it’s not worth accounting for?
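To make that second bullet concrete, here’s a toy sketch of what an observer could do with nothing but the publication timestamps (the data and feed here are entirely made up):

```python
# Toy sketch: an observer who can only see *when* hashes were published
# can still chart usage by hour-of-day. (Timestamps below are invented.)
from collections import Counter
from datetime import datetime

published_at = [
    datetime(2025, 1, 6, 9, 14),
    datetime(2025, 1, 6, 9, 41),
    datetime(2025, 1, 6, 14, 2),
    datetime(2025, 1, 7, 3, 55),
]

# Dips in off-hours volume hint at how hands-on the human employees are
by_hour = Counter(ts.hour for ts in published_at)
for hour in sorted(by_hour):
    print(f"{hour:02d}:00  {'#' * by_hour[hour]}")
```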
jfyi @Buck—struggled to tag in the initial quicktake
I’ve been thinking more about ‘If you caught your AI red-handed on scheming, could you actually convince others that it was real and to take it seriously?’ Buck Shlegeris wrote about this idea here: https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai
One intervention that seems robustly helpful is keeping tamper-proof / tamper-evident logs, so you can demonstrate that a certain incident is in fact the actual and complete trajectory of your AI, that you are fully representing the inputs and context presented to it, etc.
But I suspect that even the reveal of a matching, tamper-evident log of the trajectory might not be enough, unless you declared in advance that you were doing this and had a third-party verify that your logging system works the ways you claim it does.
One way to increase the credibility of the logs might be for AI companies to publish hashes of their internal trajectories in close-to-real-time. Then, if a trajectory is later found to be shady, the earlier publication of its hash gives the authenticity a bit more weight (though even then you might be accused of having patiently seeded a false trajectory).
To be clear, I’m not proposing that AI companies should publish hashes of their trajectories, per se, in part because it might give away other information they don’t necessarily want to share.
But I do want to gesture at the mechanism that’s important to think through: ‘How do you prove that what went in and came out of your system is in fact exactly as you are representing?’
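For concreteness, the mechanism I’m gesturing at is basically commit-then-reveal. A minimal sketch, with made-up function names and trajectory format (a real system would also need canonical serialization, salting, third-party attestation, etc.):

```python
import hashlib
import json

def commit(trajectory: dict) -> str:
    # Publish this digest in close-to-real-time; it's the commitment.
    canonical = json.dumps(trajectory, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def verify(trajectory: dict, published_digest: str) -> bool:
    # Later, anyone can check that a revealed trajectory matches the
    # digest that was published before the incident became contested.
    return commit(trajectory) == published_digest

trajectory = {"inputs": "...", "outputs": "...", "ts": "2025-01-06T09:14Z"}
digest = commit(trajectory)        # published at time T
assert verify(trajectory, digest)  # checked at time T+N
```

(One design note: you’d want to salt or otherwise blind the commitments, both so low-entropy trajectories can’t be brute-forced from their hashes and to limit the leakage worries above.)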
Is “e.g.,” really that uncommon? Dang, that’s how I write it naturally
I feel like I’ve heard this argument yes, though when I read lots of Anthropic’s ‘race to the top’ language, it’s not quite that
Here’s an example that feels borderline to me:
Dario Amodei: “Where the world needs to get [is]… from “this technology doesn’t exist” to “the technology exists in a very powerful way and society has actually managed it.” And I think the only way that’s gonna happen is that if you have, at the level of a single company, and eventually at the level of the industry, you’re actually confronting those trade-offs. You have to find a way to actually be competitive, to actually lead the industry in some cases, and yet manage to do things safely. And if you can do that, the gravitational pull you exert is so great. There’s so many factors—from the regulatory environment, to the kinds of people who want to work at different places, to, even sometimes, the views of customers that kind of drive in the direction of: if you can show that you can do well on safety without sacrificing competitiveness—right—if you can find these kinds of win-wins, then others are incentivized to do the same thing.”
I’ve written a semi-related piece before (https://www.clear-eyed.ai/p/dont-rely-on-a-race-to-the-top), but I think yours would be different enough that it could still make sense
Thanks for writing this up. Re: your question, “It has no system card”: I think this could be clearer that there is a system card, but that it doesn’t cover the Pro version.
I found this clear enough overall in the post, to be clear! But I think I’d have misunderstood from the first few sentences if I didn’t already have context from knowing there was some SC released for 5.4
I believe they aren’t taking more witnesses unfortunately :/
Reid Hoffman used to be on the OpenAI Board, which might be another contributor to the name collision here
I don’t think that makes the analogy bad? I agree that’s a helpful distinction to track, but for people who don’t know about START etc, I think it’s more helpful for them to know than not.
I wonder if parts of this essay were written a few years ago & not updated for publication?
This is the part that most strongly suggests it IMO:
Three years ago, AI struggled with elementary school arithmetic problems and was barely capable of writing a single line of code.
This line links to the GPT-3 paper, which was published in 2020 about a model trained in 2019 - so that’s six years ago, not three.
I also find the specific claims made about ‘three years ago’ to be confusing: Three years ago (early 2023) GPT-4 already existed, which could do pretty hard calculus problems.
And three years ago (again, early 2023), GitHub Copilot had already been a product for a year and a half (released summer 2021), which certainly was capable of writing lines of code. I’m not sure of the exact % of OpenAI employees who used it day-to-day, but it was substantial.
This all leads me to wonder what happened in this particular passage. (I don’t think this is super significant for the impact of the piece overall though.)
Thanks for this—very very interesting document.
One of the hard constraints is (emphasis mine):
Engage or assist any individual or group attempting to seize unprecedented and illegitimate degrees of absolute societal, military, or economic control;
Maybe a nitpick, but I suspect that shouldn’t be an ‘and’?
It’s hard for me to imagine what something like ‘unprecedented but legitimate absolute societal/military/economic control’ looks like. (I understand of course that part of the constitution’s intent is for Claude to be less pedantic, and so maybe nits like this don’t matter much.)
Separately, there’s a slight typo, at least on the published version:
other entities.We
JFYI that the footnotes here jump to the right place on Substack; wasn’t sure how to quickly port them to LW, and felt like they made the page a bit cluttered here
By “succeeding” you mean getting Safe ASI, as opposed to getting any ASI at all, right? At least that’s how I read you, but at first I thought you meant “their RSI probably won’t lead to ASI”
(A more extreme example is that AIs can locate a somewhat subtle needle inside of 200k words in a single forward pass while it would probably take a human well over an hour.)
At first I wondered how quickly a human could do this, with tooling?
The thing I was trying to get at is, like, distinguishing reading-speed from reasoning-speed, though in retrospect I think these may not be very separable in this case.
I guess there’s the degenerate case of feeding those words to an AI and saying “what’s the needle?”
I had meant something that still involved human cognition, just with faster rifling through the text. Like maybe a method that embedded the text, and then you could search through it more quickly.
But in retrospect, the “still uses cognition” version is probably just asking the model “What are a few possible needles?” and then using your judgment among the options.
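Concretely, something like this minimal sketch is what I had in mind (library, model, and chunk size are just illustrative choices, assuming sentence-transformers is installed):

```python
# Embed fixed-size chunks of the long text, rank them against a query,
# and let the human read only the top few candidates.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def top_chunks(text: str, query: str, chunk_words: int = 200, k: int = 5):
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]
    top = scores.argsort(descending=True)[:k]
    return [chunks[int(i)] for i in top]

# e.g., top_chunks(long_text, "What are a few possible needles?"),
# then the human applies judgment among the returned chunks.
```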
I really appreciate that the charts show which models are frontier; I’d like to see more groups adopt that convention
Yeah I agree that works and feel slightly sheepish not to have already internalized that as the term to use?
I guess there’s still some distinction between an objective as a single thing, vs drives as, like, heuristics that will tend to contribute toward shaping the overall objective? I’m not sure, still feel a bit fuzzy and should probably sit with it more
I wonder how many of the Mythos vulnerabilities / exploits had already been discovered by e.g. the NSA.
Don’t get me wrong; I still find the discoveries very impressive and frightening. It does also feel different than ‘no human discovered this over X years’ though, because we shouldn’t expect to hear from some of the actors who were most motivated and most capable of finding these. e.g., if the NSA was aware of these, I still wouldn’t expect them to say so.
My cached impression from reading The Code Book is that the intelligence community often won’t disclose that they’d known something, even if that fact has become public.
For instance, RSA encryption, which notably stands for Rivest-Shamir-Adleman, was described by those three in 1977, but seems to have been independently invented in 1973 within the British government by someone named Clifford Cocks. His earlier-but-nonpublic invention wasn’t acknowledged until 1997.