Legible vs. Illegible AI Safety Problems
Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)
From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems. In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving it!
Problems that are illegible to leaders and policymakers are also more likely to be illegible to researchers and funders, and hence neglected. I think these considerations have been implicitly or intuitively driving my prioritization of problems to work on, but only appeared in my conscious, explicit reasoning today.
(The idea/argument popped into my head upon waking up today. I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment, despite repeatedly saying that I wanted to see more philosophers working on AI alignment/x-safety. I now realize what I really wanted was for philosophers, and more people in general, to work on the currently illegible problems, especially or initially by making them more legible.)
I think this dynamic may be causing a general divide among the AI safety community. Some intuit that highly legible safety work may have a negative expected value, while others continue to see it as valuable, perhaps because they disagree with or are unaware of this line of reasoning. I suspect this logic may even have been described explicitly before[1], for example in discussions about whether working on RLHF was net positive or negative[2]. If so, my contribution here is partly just to generalize the concept and give it a convenient handle.
Perhaps the most important strategic insight from this line of thought is that making illegible safety problems more legible is of the highest importance, more so than directly attacking either legible or illegible problems. Directly attacking legible problems accelerates timelines, as noted above; directly attacking illegible problems is unlikely to result in a solution that gets incorporated into deployed AI, because the problem remains obscure, hard to understand, or in a cognitive blind spot for many, including key decision makers.
[1] I would welcome any relevant quotes/citations.
[2] Paul Christiano’s counterargument, abstracted and put into current terms, can perhaps be stated as follows: even granting this argument, a less legible problem (e.g., scalable alignment) sometimes has more legible problems (e.g., alignment of current models) as prerequisites, so it is worth working on something like RLHF to build up the knowledge and skills needed to eventually solve the less legible problem. If so, besides pushing back on the details of this dependency and on how promising existing scalable alignment approaches are, I would ask him to consider whether there are even less legible problems than scalable alignment that would be safer and higher value to work on or aim for.
This is close to my own thinking, but doesn’t quite hit the nail on the head. I don’t actually worry that much about progress on legible problems giving people unfounded confidence, and thereby burning timeline. Rather, when I look at the ways in which people make progress on legible problems, they often make the illegible problems actively worse. RLHF is the central example I have in mind here.
Interesting… why not? It seems perfectly reasonable to worry about both?
It’s one of those arguments which sets off alarm bells and red flags in my head. Which doesn’t necessarily mean that it’s wrong, but I sure am suspicious of it. Specifically, it fits the pattern of roughly “If we make straightforwardly object-level-good changes to X, then people will respond with bad thing Y, so we shouldn’t make straightforwardly object-level-good changes to X”.
It’s the sort of thing to which the standard reply is “good things are good”. A more sophisticated response might be something like “let’s go solve the actual problem part, rather than trying to have less good stuff”. (To be clear, I don’t necessarily endorse those replies, but that’s what the argument pattern-matches to in my head.)
But it seems very analogous to the argument that working on AI capabilities has negative EV. Do you see some important disanalogies between the two, or are you suspicious of that argument too?
That one doesn’t route through ”… then people respond with bad thing Y” quite so heavily. Capabilities research just directly involves building a dangerous thing, independent of whether other people make bad decisions in response.
What about more indirect or abstract capabilities work, like coming up with some theoretical advance that would be very useful for capabilities work, but not directly building a more capable AI (thus not “directly involves building a dangerous thing”)?
And even directly building a more capable AI still requires other people to respond with bad thing Y = “deploy it before safety problems are sufficiently solved” or “fail to secure it properly”, doesn’t it? It seems like “good things are good” is exactly the kind of argument that capabilities researchers/proponents give, i.e., that we all (eventually) want a safe and highly capable AGI/ASI, so the “good things are good” heuristic says we should work on capabilities as part of achieving that, without worrying about secondary or strategic considerations, or just trusting everyone else to do their part like ensuring safety.
Ironically enough one of the reasons why I hate “advancing AI capabilities is close to the worst thing you can do” as a meme so much is that it basically terrifies people out of thinking about AI alignment in novel concrete ways because “What if I advance capabilities?”. As though AI capabilities were some clearly separate thing from alignment techniques. It’s basically a holdover from the agent foundations era that has almost certainly caused more missed opportunities for progress on illegible ideas than it has slowed down actual AI capabilities.
Basically, any researcher who thinks this way is almost always incompetent when it comes to deep learning, usually has ideas that are completely useless because they don’t understand what is and is not implementable or important, and torments themselves in the process of being useless. Nasty stuff.
The underlying assumption here (“the halt assumption”) seems to be that big-shot decisionmakers will want to halt AI development if it’s clear that unsolved alignment problems remain.
I’m a little skeptical of the halt assumption. Right now it seems that unsolved alignment problems remain, yet I don’t see big-shots moving to halt AI development. About a week after Grok’s MechaHitler incident, the Pentagon announced a $200 million contract with xAI.
Nonetheless, in a world where the halt assumption is true, the highest-impact action might be a meta approach of “making the notion of illegible problems more legible”. In a world where the halt assumption becomes true (e.g. because the threshold for concern changes), if the existence and importance of illegible problems has been made legible to decisionmakers, that by itself might be enough to stop further development.
So yeah, in service of increasing meta-legibility (of this issue), maybe we could get some actual concrete examples of illegible problems and reasons to think they are important? Because I’m not seeing any concrete examples in your post, or in the comments of this thread. I think I am more persuadable than a typical big-shot decisionmaker, yet my cynical side reads this post and thinks: “Navel-gazers think it’s essential to navel-gaze. News at 11.”
Another angle: Are there concrete examples of AI alignment problems which were once illegible and navel-gazey, which are now legible and obviously important? (Hopefully you won’t say the need for AI alignment itself; I’ve been on board with that for as long as I can remember.)
There is another possible world here, which is that legibility actually correlates pretty well with real-world importance, and the halt assumption is false, and your post is going to redirect scarce AI alignment talent away from urgent problems which matter, and towards fruitless navel-gazing. I’m not claiming this is the world we live in, but it would be good to gather evidence, and concrete examples could help.
I fully support people publishing lists of AI alignment problems which seem neglected. (Why can’t I find a list like that already?) But I suspect many list entries will have been neglected for good reason.
See Problems in AI Alignment that philosophers could potentially contribute to, and this comment from a philosopher saying that he thinks they’re important, but that there “seems like there’s not much of an appetite among AI researchers for this kind of work”, suggesting illegibility.
Thanks for the reply. I’m not a philosopher, but it seems to me that most of these problems could be addressed after an AGI is built, if the AGI is corrigible. Which problems can you make the strongest case for as problems which we can’t put off this way?
https://www.lesswrong.com/posts/M9iHzo2oFRKvdtRrM/reminder-morality-is-unsolved?commentId=bSoqdYNRGhqDLxpvM
Again, thanks for the reply.
Building a corrigible AGI has a lot of advantages. But one disadvantage is the “morality is scary” problem you mention in the linked comment. If there is a way to correct the AGI, who gets to decide when and how to correct it? Even if we get the right answers to all of the philosophical questions you’re talking about, and successfully program them into the AGI, the philosophical “unwashed masses” you fear could exert tremendous public pressure to use the corrigibility functionality and change those right answers into wrong ones.
Since corrigibility is so advantageous (including its ability to let us put off all of your tricky philosophical problems), it seems to me that we should think about the “morality is scary” problem so we can address what appears to be corrigibility’s only major downside. I suspect the “morality is scary” problem is more tractable than you assume. Here is one idea (I did a rot13 so people can think independently before reading my idea): Oevat rirelbar va gur jbeyq hc gb n uvtuyl qrirybcrq fgnaqneq bs yvivat. Qrirybc n grfg juvpu zrnfherf cuvybfbcuvpny pbzcrgrapr. Inyvqngr gur grfg ol rafhevat gung vg pbeerpgyl enax-beqref cuvybfbcuref ol pbzcrgrapr nppbeqvat gb 3eq-cnegl nffrffzragf. Pbaqhpg n tybony gnyrag frnepu sbe cuvybfbcuvpny gnyrag. Pbafgehpg na vibel gbjre sbe gur jvaaref bs gur gnyrag frnepu gb fghql cuvybfbcul naq cbaqre cuvybfbcuvpny dhrfgvbaf juvyr vfbyngrq sebz choyvp cerffher.
I think this is a very important point. Seems to be a common unstated crux, and I agree that it is (probably) correct.
Thanks! Assuming it is actually important, correct, and previously unexplicated, it’s crazy that I can still find a useful concept/argument this simple and obvious (in retrospect) to write about, at this late date.
It would be helpful for the discussion (and for me) if you stated an example of a legible problem vs. an illegible problem. I expect people might disagree on the specifics, even if they seem to agree with the abstract argument.
Legible problem is pretty easy to give examples for. The most legible problem (in terms of actually gating deployment) is probably wokeness for xAI, and things like not expressing an explicit desire to cause human extinction, not helping with terrorism (like building bioweapons) on demand, etc., for most AI companies.
Giving an example for an illegible problem is much trickier since by their nature they tend to be obscure, hard to understand, or fall into a cognitive blind spot. If I give an example of a problem that seems real to me, but illegible to most, then most people will fail to understand it or dismiss it as not a real problem, instead of recognizing it as an example of a real but illegible problem. This could potentially be quite distracting, so for this post I decided to just talk about illegible problems in a general, abstract way, and discuss general implications that don’t depend on the details of the problems.
But if you still want some explicit examples, see this thread.
Another implication is that directly attacking an AI safety problem can quickly flip from positive EV to negative EV, if someone succeeds in turning it from an illegible problem into a legible problem, and there are still other illegible problems remaining. Organizations and individuals caring about x-risks should ideally keep this in mind, and try to pivot direction if it happens, instead of following the natural institutional and personal momentum. (Trying to make illegible problems legible doesn’t have this issue, which is another advantage for that kind of work.)
Well, one way to make illegible problems more legible is to think about illegible problems and then go work at Anthropic to make them legible to employees there, too.
I see. My specific update from this post was to slightly reduce how much I care about protecting against high-risk AI related CBRN threats, which is a topic I spent some time thinking about last month.
I think it is generous to say that legible problems remaining open will necessarily gate model deployment, even in those organizations conscientious enough to spend weeks doing rigorous internal testing. Releases have been rushed ever since applications moved from physical CDs to servers, because of the belief that users can serve as early testers for bugs, and that critical issues can be patched by pushing a new update. This blog post by Steve Yegge from ~20 years ago comes to mind: https://sites.google.com/site/steveyegge2/its-not-software. I would include LLM assistants in the category of “servware”.
I would argue that we are likely dropping the ball on both legible and illegible problems, but I agree that making illegible problems more legible is likely to be high leverage. I believe that the Janus/cyborgism cluster has no shortage of illegible problems, and consider https://nostalgebraist.tumblr.com/post/785766737747574784/the-void to be a good example of work that attempts to grapple with illegible problems.
In this case you can apply a modified form of my argument by replacing “legible safety problems” with “safety problems that are actually likely to gate deployment”; the conclusion would then be that working on such safety problems is of low or negative EV for the x-risk concerned.
This frame seems useful, but might obscure some nuance:
The systems we should be most worried about are the AIs of tomorrow, not the AIs of today. Hence, some critical problems might not manifest at all in today’s AIs. You can still say it’s a sort of “illegible problem” of modern AI that it’s progressing towards a certain failure mode, but that might be confusing.
While it’s true that deployment is the relevant threshold for the financial goals of a company, making it crucial for the company’s decision-making and available resources for further R&D, the dangers are not necessarily tied to deployment. It’s possible for a world-ending event to originate during testing or even during training.
I think this post could use a title that makes the explicit, provocative takeaway clearer (otherwise I’d have assumed “this is letting you know illegible problems exist”, and I already knew the gist).
Any suggestions?
Not sure. Let me think about it step by step.
It seems like the claims here are:
Illegible and Legible problems both exist in AI safety research
Decisionmakers are less likely to understand illegible problems
Illegible problems are less likely to cause decisionmakers to slow/stop where appropriate
Legible problems are not the bottleneck (because they’re more likely to get solved by default by the time we reach danger zones)
Working on legible problems shortens timelines without much gain
[From JohnW if you wanna incorporate] If you work on legible problems by making illegible problems worse, you aren’t helping.
I guess you do have a lot of stuff you wanna say, so it’s not like the post naturally has a short handle.
“Working on legible problems shortens timelines without much gain” is IMO the most provocative handle, but, might not be worth it if you think of the other points as comparably important.
“Legible AI problems are not the bottleneck” is slightly more overall-encompassing
“I hope Joe Carlsmith works on illegible problems” is, uh, a very fun title but probably bad. :P
Yeah it’s hard to think of a clear improvement to the title. I think I’m mostly trying to point out that thinking about legible vs illegible safety problems leads to a number of interesting implications that people may not have realized. At this point the karma is probably high enough to help attract readers despite the boring title, so I’ll probably just leave it as is.
Makes sense, although want to flag one more argument that, the takeaways people tend to remember from posts are ones that are encapsulated in their titles. “Musings on X” style posts tend not to be remembered as much, and I think this is a fairly important post for people to remember.
“If illegible safety problems remain when we invent transformative AI, legible problems mostly just give an excuse to deploy it”
“Legible safety problems mostly just burn timeline in the presence of illegible problems”
Something like that
I do not know if you consider gradual disempowerment to be an illegible problem in AI safety (as I do), but it is certainly a problem independent of corrigibility/alignment.
As such, work on either legible or illegible problems tackling alignment/corrigibility can have the same effect. Is AI safety worth pursuing when it could lead to a world with fundamental power shifts that disfavor most humans?
I agree heartily, and I feel there have been various expressions of the “paradox” of alignment research: it is a balancing act between enabling acceleration and enabling safety. Ultimately, however, both pursuits serve the end goal of aligned AI.
That goal could optimistically lead to a post-scarcity utopia, but it could also lead to highly dystopian power dynamics. Ensuring the optimist’s hope is realized seems (to me) to be a highly illegible problem. Those in the AI safety research space largely ignore it, in favor of tackling more legible problems, including illegible alignment problems.
All of this is to say I feel the same thing you feel, but for all of AI safety research.