Doing AI Safety research for ethical reasons.
My webpage.
Leave me anonymous feedback.
I operate by Crocker’s Rules.
But the audience isn’t optimizing/Goodharting anything, just providing an imperfect proxy. It is only the artist who is argmaxing, which is when Goodhart appears.
One way out would be for the artist to stop optimizing for the audience, and start optimizing for real value. Another way out would be for the audience to perfect their assessment. But this is always the case for Goodhart: you can either stop using the proxy altogether, or improve your proxy.
Something more interesting would be “the artist is trying to create the art that elicits the best response, and the audience is trying to produce the response that makes the artist happiest”, or something like that. This is what happens when two people-pleasers meet and end up following a plan that neither of them wants. It’s also relevant to training an AI that’s alignment-faking. In a sense, the other party trying to maximize your local utility dampens the signal you wanted to use to maximize global utility.
I don’t see how this Goodharting is bidirectional. It seems like plain old Goodharting. The assessment, with time (and due to some extraneous process), becomes a lower-quality proxy, which the artist keeps optimizing, thus Goodharting actual value.
hahah yes we had ground truth
I think the reason this works is that the AI doesn’t need to deeply understand in order to make a nice summary. It can just put some words together, and my high context with the world will make the necessary connections and interpretations, even if further questioning the AI would lead it to wrong interpretations. For example, it’s efficient at summarizing decision theory papers, even though it’s generally bad at reasoning through them.
Not really, just LW, AI safety papers and AI safety research notes, which are the topics I’d most be interested in. I’m not sure other forums should be very different though?
My vibe-check on current AI use cases
@Jacob Pfau and I spent a few hours optimizing our prompts and pipelines for our daily uses of AI. Here’s where I think my most desired use cases are in terms of capabilities:
Generating new frontier knowledge: As in, given a LW post generating interesting comments that add to the conversation, or given some notes on a research topic generating experiment ideas, etc. It’s pretty bad, to the extent it’s generally not worth it. But Gemini 2.5 Pro is for some reason much better at this than the other models, to the extent it’s sometimes worth it to sample 5 ideas to get your mind rolling.
I was hoping we could get a nice pipeline that generates many ideas and prunes most, but the model is very bad at pruning. It does write sensible arguments about why some ideas are nonsensical, but ultimately scores them based on flashiness rather than any sensible assessment of relevance to the stated task. Maybe taking a few hours to design good judge rubrics would be worth it (see the sketch after this list), but it seems hard to design very general rubrics.
Writing documents from notes: This was surprisingly bad, mostly because for any set of notes the AI was missing 50 small contextual details, and thus framed many points in a wrong, misleading, or obviously Chinese-room-y way. Pasting loads of random context related to the notes (for example, related research papers) didn’t help much. Still, Claude 4 was the best, but maybe this was just because of subjective stylistic preferences.
Of course, some less automated approaches work much better, like giving it a ready document and asking it to improve its flow, or brainstorming structure and presentation.
Math/code: Quite good out of the box. Even for open-ended exploration of vague questions you want to turn into mathematical problems (typical in alignment theory), you can get a nice pipeline for the AI to propose formalizations, decompositions, or example cases, and push the conversation forward semi-autonomously. o3 seems to work best, although I was impressed by Claude 4 Opus’ knowledge of niche topics.
Summarizing documents, and exploring topics I’m no expert in: Super good out of the box, especially thanks to its encyclopaedic indexical knowledge (connecting you to the obvious methods/answers that an expert would bring up).
One particularly useful approach is walking through how a general method or abstract idea could apply to a concrete example of interest to you.
Coaching: Pretty good out of the box at proposing solutions and perspectives. Probably close to the top 10% of coaches, but maybe the huge value is in that last 10%.
Also therapy: Probably good, probably better or more constructive than the average friend, though of course there are worries about hard-to-detect sycophancy.
Personal micromanagement: Pretty good.
Having a long-running chat where you ask it “how long will this task take me to complete”, and over time you both calibrate.
A more general scaffolded personal assistant to co-organize your week.
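Going back to the idea-pruning point under “Generating new frontier knowledge”: here’s a minimal sketch of the kind of generate-and-prune pipeline I had in mind. Everything here is hypothetical: `call_model` is a placeholder for whatever model API you use, and the rubric is only illustrative.

```python
from typing import List

def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM API call (hypothetical)."""
    raise NotImplementedError

# Illustrative rubric; in practice this is the part worth spending hours on.
RUBRIC = (
    "Score the idea from 1 to 5 on each of: (a) relevance to the stated research question, "
    "(b) whether a cheap experiment could falsify it, (c) novelty relative to the source notes. "
    "Penalize flashiness that doesn't cash out into (a)-(c). Return only three integers."
)

def generate_and_prune(notes: str, n_ideas: int = 20, keep: int = 5) -> List[str]:
    # Sample many candidate ideas independently.
    ideas = [
        call_model(f"Given these research notes, propose one concrete experiment idea:\n{notes}")
        for _ in range(n_ideas)
    ]
    # Judge each idea against the fixed rubric in a fresh context, then keep the top few.
    def score(idea: str) -> int:
        reply = call_model(f"{RUBRIC}\n\nNotes:\n{notes}\n\nIdea:\n{idea}")
        return sum(int(tok) for tok in reply.split() if tok.isdigit())
    return sorted(ideas, key=score, reverse=True)[:keep]
```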
Any use cases I’m missing?
Worked on this with Demski. Video, report.
Any update to the market is (equivalent to) updating on some kind of information. So all you can do is dynamically choose what to update on and what not to update on.* Unfortunately, whenever you choose not to update on something, you are giving up on the asymptotic learning guarantees of policy market setups. So the strategic gains from updatelessness (like not falling into traps) are in a fundamental sense irreconcilable with the learning gains from updatefulness. That doesn’t prevent you from being pretty smart about deciding what exactly to update on… but due to embeddedness problems and the complexity of the world, it seems to be the norm (rather than the exception) that you cannot be sure a priori of what to update on (you just have to make some arbitrary choices).
*For avoidance of doubt, what matters for whether you have updated on X is not “whether you have heard about X”, but rather “whether you let X factor into your decisions”. Or at least, this is the case for a sophisticated enough external observer (assessing whether you’ve updated on X), not necessarily all observers.
This post is reminiscent of this old one from Daniel.
Some thoughts skimming this post generated:
If a catastrophe happens, then either:
1. It happened so discontinuously that we couldn’t avoid it even with our concentrated effort.
2. It happened slowly but for some reason we didn’t make a concentrated effort. This could be because:
    a. We didn’t notice it (e.g. an intelligence explosion inside a lab).
    b. We couldn’t coordinate a concentrated effort, even if we all individually would want it to exist (e.g. no way to ensure China isn’t racing faster).
    c. We didn’t act individually rationally (e.g. Trump doesn’t listen to advisors / Trump brainwashed by AI).
1 seems unlikelier by the day.
2a is mostly about transparency inside labs (and, less importantly, into economic developments), which is important, but at least some people are thinking about it.
There’s a lot to think through in 2b and 2c. It might be critical to ensure early takeoff improves them, rather than degrading them (absent any drastic action to the contrary) until late takeoff can land the final blow.
If we assume enough hierarchical power structures, the situation simplifies into “what 5 world leaders do”, and then it’s pretty clear you mostly want communication channels and trustless agreements for 2b, and improving national decision-making for 2c.
Maybe what I’d be most excited to see from the “systemic risk” crowd is detailed thinking and exemplification on how the assumption of enough hierarchical power structures is wrong (that is, how the outcome depends strongly on things other than what those 5 world leaders do), what the most x-risk-worrisome additional dynamics in that area are, and how to intervene on them.
(Maybe all this is still too abstract, but it cleared my head)
No, the utility here is just the amount of money the program gets.
I meant that it sounded like you “wanted a better average score (over a’s) when you are randomly sampled as b than other programs”. Although again, I think the intuition-pumping is misleading here because the programmer is choosing which b to fix, but not which a to fix. So whether you want to one-box only depends on whether you condition on a = b.
(Just skimmed, also congrats on the work)
Why is this surprising? You’re basically assuming that there is no correlation between what program Omega predicts and what program you actually are. That is, Omega is no predictor at all! Thus, obviously you two-box, because one-boxing would have no effect on what Omega predicts. (Or maybe the right way to think about this is: it will have a tiny but non-zero effect, because you are one of the |P| programs, but since |P| is huge, that is ~0.)
When instead you condition on a = b, this becomes a different problem: Omega is now a perfect predictor! So you obviously one-box.
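To make the comparison concrete (assuming the standard payoffs of a million in the opaque box and a thousand in the transparent one, uniform sampling from P, and writing $q$ for the fraction of programs in P that one-box):

$$
\begin{aligned}
\mathbb{E}[\text{one-box}] &\approx \tfrac{1}{|P|}\cdot 10^{6} + \Big(1-\tfrac{1}{|P|}\Big)\, q \cdot 10^{6},\\
\mathbb{E}[\text{two-box}] &\approx 10^{3} + \Big(1-\tfrac{1}{|P|}\Big)\, q \cdot 10^{6},\\
\mathbb{E}[\text{two-box}] - \mathbb{E}[\text{one-box}] &= 10^{3} - \tfrac{10^{6}}{|P|} > 0 \quad \text{whenever } |P| \gg 10^{3}.
\end{aligned}
$$

So from the uncorrelated prior, two-boxing wins by essentially the thousand dollars; conditioning on a = b flips this to $10^{6}$ versus $10^{3}$ in favor of one-boxing.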
Another way to frame this is whether you optimize through the inner program or assume it fixed. From your prior, Omega samples randomly, so you obviously don’t want to optimize through it. Not only because |P| is huge, but, more importantly, because for any policy you might want to implement, there exists a policy that implements the exact opposite (and is just as likely to be sampled as you), so your effect nets out to exactly 0. But once you update on (condition on) seeing that Omega actually sampled exactly you (instead of your random twin, or any other old program), then you’d want to optimize through a! But you can’t have your cake and eat it too. You cannot shape yourself to reap that reward (in the world where Omega samples exactly you) without also shaping yourself to give up some reward (relative to other players) in the world where Omega samples your evil twin.
(Thus you might want to say: Aha! I will make my program behave differently depending on whether it is facing itself or an evil twin! But then… I can also create an evil twin relative to that more complex program. And we keep on like this forever. That is, you cannot actually, fully, ever condition your evil twin away, because you are embedded in Omega’s distribution. I think this is structurally equivalent to some commitment race dynamics I discussed with James, but I won’t get into that here.)
Secretly, I think this duality only feels counter-intuitive because it’s an instance of dynamic inconsistency: you want to take different actions once you update on information; that is, the globally optimal action (from the uncorrelated prior) is not always the same as the locally optimal action (from a particular posterior, like the one assuming correlation). Relatedly, I think the only reason your Universal framing differs from your Functional and Anthropic framings is exactly that they are (implicitly) using these two different distributions (one without correlation, the other with):
The Universal framing assumes that Omega samples randomly (no correlation).
The Functional framing assumes that “you have control over the inner program”. But this is sneaking something in. You obviously have control over your own program. But you don’t actually have any control over “the program that Omega will randomly sample from P” (because anything you do is cancelled by your evil twin). Thus, assuming you have control over the inner program is equivalent to assuming that Omega sampled you. That is, yes correlation.
The Anthropic framing is also implicitly assuming “Omega samples you”, although it’s a bit more complicated since it also depends on your utility function (how much you care about different programs getting different amounts of money):
If your distribution is truly “Omega samples programs at random”, then when you observe two equal numbers, you are almost certain to be in a simulation. Given that, if you for example care equally about all programs getting as much money as possible, then of course you should one-box. That will entail that each random program (with a tiny probability) gets a million, which is a huge win. But the intuition that you were expressing in Question 2 (“p2 is better than p1 because it scores better”) isn’t compatible with “caring equally about all programs”. Instead, it sounds as if you positively want to score better than other programs, that is, maximize your score and minimize theirs! If that’s the case, then you obviously should two-box, since almost certainly you are subtracting a million from another program, not yourself. Even assuming, for simplicity, that you only care about beating one other program (as in the p1 and p2 example), you should two-box, because you are subtracting the same million dollars from both, but you are gaining a very slight edge with the thousand dollars.
If your distribution is instead “a = b” (assumes correlation), then, regardless of whether you want to maximize everyone’s payoff or you want to beat another program that could be sampled, you want to one-box, since the million dollar benefit is coming straight to you, and is bigger than the thousand dollar benefit.
It’s unclear what the optimal amount of thinking per step is. My initial guess would have been that letting Claude think for a whole paragraph before each single action (rather than only every 10 actions, or whenever it’s in a match, or whatever) scores slightly better than letting it think more (sequentially). But I guess this might work better if it’s what the streamer is using after some iteration.
The story for parallel checks could be different though. My guess would be that going all out (letting Claude generate the paragraph 5 times, then generating 5 more parallel paragraphs about whether it has gotten something wrong, and then having a lower-context version of Claude decide whether there are any important disagreements and, if not, just majority-vote) would mitigate robustness problems (like “I close a goal before actually achieving it”). But maybe this adds too much bloat and opportunities for mistakes, or makes some mistakes better but others way worse.
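A minimal sketch of what I mean, purely illustrative: `call_model` is a placeholder for however you’d query Claude, and the “last line names the action” convention is an assumption.

```python
from collections import Counter

def call_model(prompt: str, context: str) -> str:
    """Placeholder for an actual model API call (hypothetical)."""
    raise NotImplementedError

def propose_action(game_state: str, n: int = 5) -> str:
    # 1. Sample n independent reasoning paragraphs, each ending with a proposed action.
    proposals = [call_model("Think for one paragraph, then name your next action.", game_state)
                 for _ in range(n)]
    # 2. For each proposal, sample a parallel critique asking what it got wrong.
    critiques = [call_model("What, if anything, did this reasoning get wrong?", p)
                 for p in proposals]
    # 3. A lower-context judge checks whether the critiques raise any important disagreement.
    verdict = call_model("Do these critiques raise an important disagreement? Answer YES or NO.",
                         "\n\n".join(critiques))
    if verdict.strip().upper().startswith("YES"):
        # Escalate: re-think with the critiques in context instead of majority-voting.
        return call_model("Resolve these disagreements and pick one action.",
                          game_state + "\n\n" + "\n\n".join(proposals + critiques))
    # 4. Otherwise, majority-vote over the proposed actions (assume the last line names the action).
    actions = [p.strip().splitlines()[-1] for p in proposals]
    return Counter(actions).most_common(1)[0][0]
```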
I don’t see how Take 4 is anything other than simplicity (in the human/computational language). As you say, it’s a priori unclear whether an agent is an instance of a human or the other way around. You say the important bit is that you are subtracting properties from a human to get an agent. But how shall we define subtraction here? In one formal language, the definition of human will indeed be a superset of that of agent… but in another one it will not. So you need to choose a language. And the natural way forward every time this comes up (many times) is to just “weigh by Turing computations in the real world” (instead of choosing a different and odd-looking Universal Turing Machine), that is, a simplicity prior.
Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality).
Like you, I endorse the step of “scoping the reference class” (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn’t have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I’m sure that there’ll be at least a few hundred other aspects (both more and less subtle than this one) that you and I obviously endorse, but that they will not implement, and that will drastically affect the outcome of the whole procedure.
In fact, it seems strikingly plausible that even among EAs, the outcome could depend drastically on seemingly-arbitrary starting conditions (like “whether we use deliberation-and-distillation procedure #194 or #635, which differ in some details”). And “drastically” means that, even though both outcomes still look somewhat kindness-shaped and friendly-shaped, one’s optimum is worth <10% according to the other’s utility (or maybe this holds just for the scope-sensitive parts of their morals, since the scope-insensitive ones are trivial to satisfy).
To pump related intuitions about how difficult and arbitrary moral deliberation can get, I like Demski here.
I’m sure some of people’s ignorance of these threat models comes from the reasons you mention. But my intuition is that most of it comes from “these are vaguer threat models that seem very up in the air, while other ones seem more obviously real and more shovel-ready” (this is similar to your “Flinch”, but I think more conscious and endorsed).
Thus, I think the best way to converge on whether these threat models are real/likely/actionable is to work through as-detailed-as-possible example trajectories. Someone objects that the state will handle it? Let’s actually think through what the state might look like in 5 years! Someone objects that democracy will prevent it? Let’s actually think through the consequences of cheap cognitive labor in democracy!
This is analogous to what pessimists about single-single alignment have gone through. They have some abstract arguments, people don’t buy them, so they start working through them in more detail or providing example failures. I buy some parts of them, but not others. And if you did the same for this threat model, I’m uncertain how much of it I’d buy!
Of course, the paper might have been your way of doing that. I enjoyed it, but still would have preferred more fully detailed examples, on top of the abstract arguments. You do use examples (both past and hypothetical), but they are more like “small, local examples that embody one of the abstract arguments”, rather than “an ambitious (if incomplete) and partly arbitrary picture of how these abstract arguments might actually pan out in practice”. And I would like to know the messy details of how you envision these abstract arguments coming into contact with reality. This is why I liked TASRA, and indeed I was more looking forward to an expanded, updated and more detailed version of TASRA.
Just writing a model that came to mind, partly inspired by Ryan here.
Extremely good single-single alignment should be highly analogous to “current humans becoming smarter and faster thinkers”.
If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it’s more likely than not that the former wins, but it’s not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.
If this doesn’t happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will grow, in which case this is the only stable state.
If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it’s common knowledge.
Fantastic snapshot. I wonder (and worry) whether we’ll look back on it with feelings similar to those we now have for What 2026 looks like.
There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.
These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols will be sufficient or even sane, by analogy to nuclear or bio.
This makes it really unclear what to work on.
It’s not super obvious to me that there won’t be clever ways to change local incentives / improve coordination, and successful interventions in this direction would seem incredibly high-leverage, since they’re upstream of many of the messy and decentralized failure modes. If they do exist, they probably look not like “a simple coordination mechanism”, but more like “a particular actor gradually steering high-stakes conversations (through a sequence of clever actions) to bootstrap minimal agreements”. Of course, similarity to past geopolitical situations does make it seem unlikely on priors.
There is no time to get to very low-risk worlds anymore. There is only space for risk reduction along the way.
My gut has been in agreement for some time that the most cost-effective x-risk reduction now probably looks like this.
I agree about the conjunctiveness, although again I’m more optimistic about huge improvements. I mostly wanted to emphasize that I’m not sure there are structurally robust reasons (as opposed to personal whims) why huge spending on safety won’t happen.
Speaking for myself (not my coauthors), I don’t agree with your two items, because:
if your models are good enough at code analysis to increase their insecurity self-awareness, you can use them in other more standard and efficient ways to improve the dataset
doing self-critique the usual way (looking over your own output) seems much more fine-grained and thus efficient than asking the model whether it “generally uses too many try-excepts”
More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obvious capability evaluation techniques.
That said, I do agree that systematic inclusion of considerations about negative externalities should be a norm, and thus we should have done so. I will just say briefly that a) behavioral self-awareness seems differentially more relevant to alignment than to capabilities, and b) we expected lab employees to find out about this themselves (in part because this isn’t surprising given out-of-context reasoning), and we in fact know that several lab employees did. Thus I’m pretty certain the positive externalities of building common knowledge and thinking about alignment applications are notably bigger.
Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I’m probably more optimistic about new ideas than you, partly because “it always subjectively feels like there are no big ideas left”, and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to worry about international competition, but will you feel the race is so close that you can’t even spare that much? My guess would be no. That said, I still don’t expect certain lab leaders to want to do this.
The same is not true of security though, that’s a tough one.
Low-stakes problems arise for high-stakes deployments
Even when your deployment is high-stakes, you will need to do some research-y activities to prepare for it, like experimenting on which protocol works best for your new model and deployment. This research can obviously be sabotaged by any AIs you are using as research assistants. But it can also be sabotaged by the AIs you are using as test subjects! Both of these are low-stakes problems that complicate iteration on high-stakes measures. It’s unclear how soon they will actually start to bite.
Here’s one way this “sabotage by the test subject” could look:
The model can sometimes notice which sort of protocol we’re applying (for example, by noticing whether some past actions in its trajectory have been rewritten). A malicious model could thus deliberately sandbag on main tasks (or boost performance on side tasks) when it notices the best protocols are in place, so that we think those protocols are actually worse and stop using them. This is a low-stakes problem, because how likely you are to make the wrong call (and thus lose a bunch of utility) is roughly linear in how many times the model successfully misdirects you (up to an upper bound, but that bound’s pretty high).
Of course, if the model’s sandbagging is very egregious, we will notice it. But for subtle versions, it seems hard to tell whether what’s bringing usefulness or safety down is the model sandbagging strategically, or just the protocol being annoying. You can gain some more signal by also assessing the protocol’s usefulness and safety in other ways (like thinking about it, or running it with scaffolded weaker models), but these analogues are imperfect.
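As a toy illustration of how this biases protocol comparisons (all numbers are made up; this is just a sketch of the mechanism, not a real experiment):

```python
import random

# Toy numbers, purely illustrative: protocol A is truly more useful (0.8 vs 0.7) but easier
# for the model to fingerprint (e.g. it rewrites past actions), so it gets sandbagged more often.
def measured_usefulness(true_usefulness: float, detect_prob: float,
                        sandbag_penalty: float, n_trials: int = 500) -> float:
    total = 0.0
    for _ in range(n_trials):
        sandbagged = random.random() < detect_prob  # model notices the protocol and sandbags
        total += true_usefulness - (sandbag_penalty if sandbagged else 0.0)
    return total / n_trials

random.seed(0)
score_a = measured_usefulness(true_usefulness=0.8, detect_prob=0.6, sandbag_penalty=0.3)
score_b = measured_usefulness(true_usefulness=0.7, detect_prob=0.1, sandbag_penalty=0.3)
# In expectation A measures ~0.62 and B ~0.67, so the blue team is nudged toward the
# genuinely worse protocol; the size of the error scales with how often the model detects A.
print(score_a, score_b)
```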
But wait, don’t we get the same problem for red-team elicitation (in our final validation of the chosen protocol)? Don’t we need to fully solve sandbagging anyway?
Well, yes, but given a model, setting, and protocol, it might require a strong effort to ensure we’ve elicited strongly enough (or at least to have a good estimate of the remaining elicitation gap). Maybe you can afford to do this once (or a few times) as a final validation, but it’s too much serial time to do it for each protocol the blue team might like to test. So this kind of sabotage can still sway the lower-effort estimates of the blue team.
While a real schemer might behave as described above, the same is not true of your human red-team, if you use a separate train-time red team and test-time red team. You can just incentivize the train-time red-team exactly the same way you are incentivizing the blue team (they are, functionally, just part of the blue-team, trying to maximize their chances in the final validation).
I’m unsure if this disanalogy is positive. I guess the only alternative is something super weird: having the whole red-teaming exercise (with iteration and validation) as your “setting”, and doing some meta-iteration (trying to make your AI deceive iteration as much as possible) and a final meta-validation of the whole “setting”. That’s bonkers. So it’s just better to study research sabotage separately (in better-scoped, non-meta research topics), derive which mitigations are generally useful, and implement them during your iteration for a high-stakes deployment.