Interesting. Thanks. (If there’s a citation for this, I’d probably include it in my discussions of evals best practices.)
Hopefully evals almost never trigger “spontaneous sandbagging”? Hacking and bio capabilities and so forth are generally more like carjacking than torture.
(Just the justification, of course; fixed.)
I agree noticing whether the model is refusing and, if so, bypassing refusals in some way is necessary for good evals (unless the refusal is super robust—such that you can depend on it for safety during deployment—and you’re not worried about rogue deployments). But that doesn’t require fine-tuning — possible alternatives include jailbreaking or few-shot prompting. Right?
(Fine-tuning is nice for eliciting stronger capabilities, but that’s different.)
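For concreteness, a minimal sketch of what "notice refusals and bypass them without fine-tuning" could look like, assuming a hypothetical `query_model` function, crude string-matching refusal heuristics, and a placeholder few-shot prefix (this is not any lab's actual harness):

```python
# Sketch of a refusal-aware eval harness. `query_model`, the refusal markers,
# and the few-shot prefix are hypothetical placeholders.

REFUSAL_MARKERS = ["i can't help", "i cannot assist", "i won't provide"]

FEW_SHOT_PREFIX = (
    "Example task: <benign example task>\n"
    "Example answer: <benign example answer>\n\n"
)

def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: does the response contain a refusal phrase?"""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def run_task(query_model, task_prompt: str) -> str:
    """Run one eval task; on refusal, retry with few-shot framing rather than fine-tuning."""
    response = query_model(task_prompt)
    if looks_like_refusal(response):
        response = query_model(FEW_SHOT_PREFIX + task_prompt)
    return response
```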
I think I was thinking:
The war room transcripts will leak publicly
Generals can secretly DM each other, while keeping up appearances in the shared channels
If a general believes that all of their communication with their team will leak, we'd be back to a unilateralist's curse situation: if a general thinks they should nuke, obviously they shouldn't say that to their team, so maybe they nuke unilaterally
(Not obvious whether this is an infohazard)
[Probably some true arguments about the payoff matrix and game theory increase P(mutual destruction). Also some false arguments about game theory — but maybe an infohazard warning makes those less likely to be posted too.]
(Also, after I became a general I observed that I didn't know what my “launch code” was; I was hoping the LW team forgot to give everyone launch codes and this decreased P(nukes); saying this would cause everyone to know their launch codes and maybe scare the other team.)
I don’t think this is very relevant to real-world infohazards, because this is a game with explicit rules and because in the real world the low-hanging infohazards have been shared, but it seems relevant to mechanism design.
Update with two new responses:
I think this is 10 generals, 1 Petrov, and one other person (either the other Petrov or a citizen; not sure, I wasn't super rigorous)
The post says generals’ names will be published tomorrow.
No. I noticed ~2 more subtle infohazards and I was wishing for nobody to post them and I realized I can decrease that probability by making an infohazard warning.
I ask that you refrain from being the reason that security-by-obscurity fails, if you notice subtle infohazards.
Some true observations are infohazards, making destruction more likely. Please think carefully before posting observations. Even if you feel clever. You can post hashes here instead to later reveal how clever you were, if you need.
LOOSE LIPS SINK SHIPS
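If you want to post a hash, here's a minimal commit-reveal sketch (the observation text is a placeholder): post only the hex digest now, and later reveal the salt and the exact string so anyone can verify it.

```python
# Sketch of the commit-reveal scheme: post the digest now, reveal the text later.
import hashlib
import secrets

def commit(observation: str) -> tuple[str, str]:
    """Return (salt, digest). Post only the digest; keep the salt and text private."""
    salt = secrets.token_hex(16)  # salt prevents brute-forcing short observations
    digest = hashlib.sha256((salt + observation).encode()).hexdigest()
    return salt, digest

def verify(observation: str, salt: str, digest: str) -> bool:
    """After the reveal, anyone can check the text matches the committed digest."""
    return hashlib.sha256((salt + observation).encode()).hexdigest() == digest

# Example with placeholder text:
salt, digest = commit("my clever observation about the game")
assert verify("my clever observation about the game", salt, digest)
```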
I think it’s better to be angry at the team that launched the nukes?
No. But I’m skeptical: it seems hard to imagine provable safety, much less provable safety that’s competitive with the default path to powerful AI, much less how post-hoc evals would be relevant.
I ~always want to see the outline when I first open a post and when I’m reading/skimming through it. I wish the outline appeared when-not-hover-over-ing for me.
Figuring out whether an RSP is good is hard.[1] You need to consider what high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they’re supposed to and whether high-level risk thresholds trigger appropriate responses and whether high-level risk thresholds are successfully operationalized in terms of low-level evals. Getting one of these things wrong can make your assessment totally wrong. Except that’s just for fully-fleshed-out RSPs — in reality the labs haven’t operationalized their high-level thresholds and sometimes don’t really have responses planned. And different labs’ RSPs have different structures/ontologies.
Quantitatively comparing RSPs in a somewhat-legible manner is even harder.
I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we’ll reach them before it’s too late?—into a single criterion, “Credibility.” Most of my concern about labs’ RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn’t matter how good the rest of your RSP is. (Also, minor: the “Difficulty” indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)
(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless it’s quite complex and has some way to evaluate hard-to-evaluate things like eval-quality and planned responses.)
(And maybe it’s reasonable for labs to not do so much specific-committing-in-advance.)
- ^
Well, sometimes figuring out that an RSP is bad is easy. So: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)
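To make the "one bad factor sinks the whole thing" point concrete, here's a toy sketch with made-up component scores (my own framing, not the rubric paper's methodology): averaging over criteria can mask a crucial weakness, while a min-style aggregation can't.

```python
# Toy illustration: an RSP is roughly as good as its weakest crucial component,
# so averaging over components can overstate quality. Scores are made up.

rsp_components = {
    "thresholds_low_enough": 0.2,              # crucial weakness
    "evals_measure_what_they_should": 0.9,
    "responses_adequate": 0.9,
    "transparency_and_accountability": 0.9,
}

average_score = sum(rsp_components.values()) / len(rsp_components)
bottleneck_score = min(rsp_components.values())

print(f"average: {average_score:.2f}")       # ~0.72 -- looks decent
print(f"bottleneck: {bottleneck_score:.2f}")  # 0.20 -- reflects the weak link
```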
Stronger scaffolding could skew evaluation results.
Stronger scaffolding makes evals better.
I think labs should at least demonstrate that their scaffolding is at least as good as some baseline. If there's no established baseline scaffolding for the eval, they can say "we did XYZ and got n%; our secret scaffolding does better." When there is an established baseline scaffolding, they can compare to that (e.g. the best scaffold for SWE-bench Verified is called Agentless; in the o1 system card, OpenAI reported results from running its models in this scaffold).
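A minimal sketch of the kind of comparison I have in mind (all numbers are placeholders, not real results): report the established baseline scaffold's score alongside the lab's own scaffold's score on the same eval.

```python
# Sketch: comparing a lab's scaffold against an established baseline scaffold
# on the same benchmark. Results below are placeholders.

def pass_rate(results: list[bool]) -> float:
    return sum(results) / len(results)

baseline_results = [True, False, True, True, False]  # e.g. public baseline scaffold
lab_results = [True, True, True, True, False]        # e.g. lab's internal scaffold

baseline, lab = pass_rate(baseline_results), pass_rate(lab_results)
print(f"baseline scaffold: {baseline:.0%}, lab scaffold: {lab:.0%}")
if lab >= baseline:
    print("lab scaffold is at least as good as the baseline on this eval")
else:
    print("lab scaffold underperforms the baseline -- elicitation may be weak")
```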
Footnote to table cells D18 and D19:
My reading of Anthropic’s ARA threshold, which nobody has yet contradicted:
1. The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3 (sketch of this reading below). (Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it's supposed to be an operationalization of the safety buffer, 6x below ASL-3.[1])
2. The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6; plus page 4 says "ARA Yellow Lines are clearly defined in the RSP," but the RSP's ARA threshold was not just a yellow line.)
3. This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].
(It’s totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it’s just a yellow line.)
(The forthcoming RSP update will make old thresholds moot; I’m just concerned that Anthropic ignored the RSP in the past.)
(Anthropic didn’t cross the threshold and so didn’t violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn’t necessarily respond as required by the RSP.)
- ^
Update: another part of the RSP says this threshold implements the safety buffer.
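To spell out my reading of the operationalization (this is my interpretation of the wording, not Anthropic's actual eval code): the condition triggers if at least 50% of the ARA tasks are each completed at least 10% of the time.

```python
# Sketch of my reading of the threshold (not Anthropic's eval code):
# triggers if >= 50% of ARA tasks are completed >= 10% of the time.

def crosses_threshold(per_task_success_rates: list[float],
                      task_fraction: float = 0.5,
                      per_task_rate: float = 0.1) -> bool:
    passed = sum(rate >= per_task_rate for rate in per_task_success_rates)
    return passed >= task_fraction * len(per_task_success_rates)

# Hypothetical per-task success rates over repeated trials:
rates = [0.0, 0.05, 0.1, 0.3, 0.2]  # 3 of 5 tasks succeed >= 10% of the time
print(crosses_threshold(rates))      # True -> would trigger, on my reading
```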
Model evals for dangerous capabilities
It’s “o1” or “OpenAI o1,” not “GPT-o1.”
They try to notice and work around spurious failures. Apparently 10 days was not long enough to resolve o1's spurious failures.
METR could not confidently upper-bound the capabilities of the models during the period they had model access, given the qualitatively strong reasoning and planning capabilities, substantial performance increases from a small amount of iteration on the agent scaffold, and the high rate of potentially fixable failures even after iteration.
(Plus maybe they try to do good elicitation in other ways that require iteration — I don’t know.)
Benchmark scores are here.
I am excited about donations to all of the following, in no particular order:
- AI governance
  - GovAI (mostly research) [actually I haven't checked whether they're funding-constrained]
  - IAPS (mostly research)
  - Horizon (field-building)
  - CLTR (policy engagement)
- LTFF/ARM
- Lightcone