Background

I have been doing red team, blue team (offensive, defensive) computer security for a living since September 2000. The goal of this post is to compile a list of general principles I’ve learned during this time that are likely relevant to the field of AGI Alignment. If this is useful, I could continue with a broader or deeper exploration.

Alignment Won’t Happen By Accident

I used to use the phrase when teaching security mindset to software developers that “security doesn’t happen by accident.” A system that isn’t explicitly designed with a security feature is not going to have that security feature. More specifically, a system that isn’t designed to be robust against a certain failure mode is going to exhibit that failure mode.

This might seem rather obvious when stated explicitly, but this is not the way that most developers, indeed most humans, think. I see a lot of disturbing parallels when I see anyone arguing that AGI won’t necessarily be dangerous. An AGI that isn’t intentionally designed not to exhibit a particular failure mode is going to have that failure mode. It is certainly possible to get lucky and not trigger it, and it will probably be impossible to enumerate even every category of failure mode, but to have any chance at all we will have to plan in advance for as many failure modes as we can possibly conceive.

As a practical enforcement method, I used to ask development teams that every user story have at least three abuser stories to go with it. For any new capability, think at least hard enough about it that you can imagine at least three ways that someone could misuse it. Sometimes this means looking at boundary conditions (“what if someone orders 2^64+1 items?”), sometimes it means looking at forms of invalid input (“what if someone tries to pay -$100, can they get a refund?”), and sometimes it means being aware of particular forms of attack (“what if someone puts Javascript in their order details?”).

I found it difficult to cultivate security mindset in most software engineers, but as long as we could develop one or two security “champions” in any given team, our chances of success improved greatly. To succeed at alignment, we will not only have to get very good at exploring classes of failures, we will need champions who can dream up entirely new classes of failures to investigate, and to cultivate this mindset within as many machine learning research teams as possible.

Blacklists Are Useless, But Make Them Anyway

I did a series of annual penetration tests for a particular organization. Every year I had to report to them the same form/parameter XSS vulnerability because they kept playing whac-a-mole with my attack payloads. Instead of actually solving the problem (applying the correct context-sensitive output encoding), they were creating filter regexes with ever-increasing complexity to try address the latest attack signature that I had reported. This is the same flawed approach that airport security has, which is why travelers still have to remove shoes and surrender liquids: they are creating blacklists instead of addressing the fundamentals. This approach can generally only look backwards at the past. An intelligent adversary just finds the next attack that’s not on your blacklist.

That said, it took the software industry a long time to learn all the ways to NOT solve XSS before people really understood what a correct fix looked like. It often takes many many examples in the reference class before a clear fundamental solution can be seen. Alignment research will likely not have the benefit of seeing multiple real-world examples within any class of failure modes, and so AGI research will require special diligence in not just stopping at the first three examples that can be imagined and calling it a day.

While it may be dangerous to create an ever-expanding list of the ways that an optimization process might kill us all, on balance it is probably necessary to continue documenting examples well beyond what seems like the point of zero marginal returns; one never knows which incremental XSS payload will finally grant the insight to stop thinking about input validation and start focusing on output encoding. The work of creating the blacklist may lead to the actual breakthrough. Sometimes you have to wade through a swamp of known bad behavior to identify and single out the positive good behavior you want to enshrine. This will be especially difficult because we have to imagine theoretical swamps; the first real one we encounter has a high chance of killing us.

You Get What You Pay For

The single most reliable predictor of software security defect rates in an organization I found to be the level of leadership support (read: incentives) for security initiatives. AGIs are going to be made by organizations of humans. Whether or not the winning team’s mission explicitly calls for strong alignment will probably be the strongest determinant for the outcome for humanity.

This is not as simple as the CEO declaring support for alignment at monthly all-hands meetings, although this certainly helps grease the skids for those who are trying to push a safety agenda. Every incentive structure must explicitly place alignment at the apex of prioritization and rewards. One company I worked for threatened termination for any developer who closed a security defect without fixing it. Leadership bonuses for low defect rates and short bug remediation timelines also helped. A responsible organization should be looking for any and all opportunities to eliminate any perverse incentives, whether they be social, financial, or otherwise, and reward people accordingly for finding incentive problems.

The bug bounty/zero-day market is likely a strong model to follow for AGI safety issues, especially to expose risk in private organizations that might not be otherwise forthcoming about their AGI research projects. A market could easily be created to reward whistleblowers or incentivize otherwise unwilling parties to disclose risky behavior. Bounties could be awarded for organizations to share their AGI alignment roadmaps, key learnings, or override the default incentive models that will not produce good outcomes. These might or might not be formed with nation-state-level backing and budgets, but as the next marginal AI safety charity dollar gets harder to employ, bounty programs might be a good way to surface metrics about the industry while positively influencing the direction of diverse research groups that are not being appropriately cautious.

Bug bounties also scale gracefully. Small bounties can be offered at first to pluck all of the low-hanging fruit. As attention and awareness grow, so can bounties as they become funded by a broader and broader spectrum of individuals or even state actors. Someone might not be willing to sell out their rogue AI operation for $10,000, but I could easily imagine a billion dollar bounty at some point for information that could save the world. XPRIZEs for incremental alignment results are also an obvious move. A large bounty provides some assurance about its underlying target: if no one claims a billion dollar bounty about rogue AI research, that is some evidence that it’s no longer happening.

Assurance Requires Formal Proofs, Which Are Provably Impossible

I’ve witnessed a few organizations experiment with Design-driven Development. Once we were even able to enshrine “Secure by Design” as a core principle. In reality, software development teams can sometimes achieve 95% code coverage with their test cases and can rarely correlate their barrage of run-time tests to their static analysis suites. This is not what it takes for formal assurance of software reliability. And yet even achieving this level of testing requires heroic efforts.

The Halting Problem puts a certain standard of formalism outside our reach, but it doesn’t absolve us of the responsibility of attaining the strongest forms of assurance we possibly can under the circumstances. There are forms of complexity we can learn to avoid entirely (complexity is the enemy of security). Complex systems can be compartmentalized to minimize trust boundaries and attack surface, and can be reasoned about independently.

The promise of Test-Driven Design is that by forcing yourself to write the tests first, you constrain the space of the design to only that which can actually be tested. Multiple software security industries arose that tried to solve the problem of automating testing of arbitrary applications that were already built. (Spoiler: the results were not great.) To my knowledge, no one tried writing a security test suite that was designed to force developers to conform their applications to the tests. If this was easy, there would have been a market for it.

After failing to “solve” software security problems after a decade, I spent many years thinking about how to eliminate classes of vulnerabilities for good. I could find scalable solutions for only roughly 80% of web application vulnerabilities, where scalable was some form of “make it impossible for a developer to introduce this class of vulnerability”. Often the result was something like: “buffer overflow vulnerabilities disappeared from web apps because we stopped allowing access to memory management APIs”. Limiting API capabilities is not an approach that’s likely to be compatible with most research agendas.

Alignment will require formalisms to address problems that haven’t even been thought of yet, let alone specified clearly enough that a test suite can be developed for them. A sane world would solve this problem first before creating machine intelligence code. A sane world would at least recognize that solving software security problems is probably several magnitudes of difficulty easier than solving alignment problems, and we haven’t even succeeded at the former yet.

A Breach IS an Existential Risk

I was lucky to work for a few organizations that actually treated the threat of a security breach like it was an existential risk. Privately, though, I had to reconcile this attitude with the reality: “TJX is up”. ^[1] While it was good for my career and agenda those rare times when leadership did take security seriously, the reality is that the current regulatory and social environment hardly punishes security failures proportionally to the damage they can cause. In a disturbing number of situations, a security breach is simply an externality, the costs of which actually borne by consumers or the victims of software exploits. Some have proposed the idea of software liability, but I’m even less optimistic that that model can be applied to AGI research.

Even if an organization believes that its security posture means the difference between existing and dying, profit margins will carry this same message much more clearly. Revenue solves all problems, but the inverse is also true: without revenue there isn’t the luxury of investing in security measures.

While I am ambivalent about the existential nature of information security risk, I am unequivocal about AGI alignment risk. There will not be a stock symbol to watch to see whether AGI research groups are getting this right. There will be no fire alarm. The externality of unaligned machine learning research will simply be that we will all be dead, or suffering some unrecoverable fate worse than death. I also cannot tell you exactly the chain of events that will lead to this outcome, I have only my intuition gained by 20 years of seeing how software mistakes are made and how difficult it is to secure even simple web applications. Probably no more than a few non-trivial pieces of software could survive a million-dollar bug bounty program for a year without having to make a payout. In the relatively quaint world of software security, with decades of accumulated experience and knowledge of dozens of bug reference classes, we still cannot produce secure software. There should be no default expectation of anything less than total annihilation from an unaligned superoptimization process.

^
The TJX corporation experienced one of the largest data breaches in history, accompanied by millions of dollars in fines; however, their stock price quickly recovered as the world forgot about the incident.