Obviously this can’t be answered with justice in a single comment, but here are some broad pointers that might help see the shape of the solution:
Israeli airport security focuses on behavioral cues, asking unpredictable questions, and profiling. A somewhat extreme threat model there, with much different base rates to account for (but also much lower traffic volume).
Reinforced cockpit doors address the hijackers-with-guns-and-knives scenario, but they’re a fully general, no-brainer control.
Good policework and better coordination in law enforcement are commonly cited, e.g. in the context of 9/11 hijackings, before anyone even gets to an airport.
In general, if the airlines bore responsibility for security, you would see a very different set of controls than what you get today, where security is an externality run by an organization with very strong “don’t do anything you can get blamed for” political incentives. In an ideal world, you could have one airline catering to paranoiacs who want themselves and their fellow passengers to undergo extreme screening, another for people who have done the math, and most airlines in the middle settling on nominal gate screening procedures that don’t make them look to their customers like they don’t care (even though the math largely says they shouldn’t).
A thought experiment: why is there no equivalent bus/train station security to what we have at airports? And what are the outcomes there?
I appreciate the nudge here to put some of this into action. I hear alarm bells when thinking about formalizing a centralized location for AI safety proposals and information about how they break, but my rough intuition is that if there is a way these can be scrubbed of descriptions of capabilities which could be used irresponsibly to bootstrap AGI, then this is a net positive. At the very least, we should be scrambling to discuss safety controls for already public ML paradigms, in case any of these are just one key insight or a few teraflops away from being world-ending.
I would like to hear from others about this topic, though; I’m very wary of being at fault for accelerating the doom of humanity.
My project seems to have expired from the OWASP site, but here is an interactive version that should have most of the data:
You’ll need to mouse over the elements to see the details, so not really mobile friendly, sorry.
I agree that linters are a weak form of automatic verification that are actually quite valuable. You can get a lot of mileage out of simply blacklisting unsafe APIs and a little out of clever pattern matching.
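As a toy illustration of that mileage (the denylist and function name here are invented for this sketch, not taken from any real linter), a few lines of AST walking are enough to flag denylisted APIs:

```python
# Minimal "linter" sketch: walk the AST and flag calls to a denylist
# of unsafe APIs. The denylist is an arbitrary example.
import ast

UNSAFE = {"eval", "exec", "pickle.loads", "os.system"}

def find_unsafe_calls(source: str):
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = ast.unparse(node.func)  # e.g. "os.system"
            if name in UNSAFE:
                findings.append((node.lineno, name))
    return findings

print(find_unsafe_calls("import os\nos.system('ls')\nx = eval(data)"))
```

Real linters generalize this with richer pattern languages and data-flow tracking, but the core value really is this simple.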
I would say that some formal proofs are actually impossible, but would agree that software with many (or even all) of the security properties we want could actually have formal-proof guarantees. I could even see a path to many of these proofs today.
While the intent of my post was to draw parallel lessons from software security, I actually think alignment is an oblique or orthogonal problem in many ways. I could imagine timelines in which alignment gets ‘solved’ before software security. In fact, I think survival timelines might even require that anyone currently working on classes of software reliability unrelated to alignment switch their focus to alignment at this point.
Software security is important, but I don’t think it’s on the critical path to survival unless somehow it is a key defense against takeoff. Certainly many imagined takeoff scenarios are made easier if an AI can exploit available computing, but I think the ability to exploit physics would grant more than enough escape potential.
The halting problem only makes it impossible to write a program that can analyze a piece of code and then reliably say “this is secure” or “this is insecure”.
It would be nice to be able to have this important impossible thing. :)
I think we are trying to say the same thing, though. Do you agree with this more concise assertion?
“It’s not possible to make a high-confidence checker that can analyze an arbitrary specification, but it is probably possible (although very hard) to design systems that can be programmatically checked for the important alignment qualities we want, if such qualities can also be formally defined.”
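To make the distinction concrete, here’s a hypothetical sketch (the function and the toy “safety” property are mine, not from any real verifier): you can’t decide properties of arbitrary code, but you can define a restricted subset in which the property is checkable by construction:

```python
# Toy illustration: instead of trying to decide "is this arbitrary
# program safe?" (impossible in general), only accept programs written
# in a restricted subset where the property is trivially decidable.
# Here "safe" just means: no attribute access, and no calls outside an
# allowlist -- a stand-in for a real formally-defined property.
import ast

ALLOWED_CALLS = {"len", "min", "max"}

def checkable_safe(source: str) -> bool:
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute):
            return False
        if isinstance(node, ast.Call):
            if not (isinstance(node.func, ast.Name)
                    and node.func.id in ALLOWED_CALLS):
                return False
    return True

print(checkable_safe("x = min(1, 2) + len([3])"))       # inside the subset
print(checkable_safe("__import__('os').system('x')"))   # rejected: not proven
                                                        # unsafe, just outside
                                                        # the checkable subset
```

The price of decidability is expressiveness, which is exactly the trade-off the assertion above is gesturing at.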
I would agree that some people figured this out faster than others, but the analogy is also instructional here: if even a small community like the infosec world has a hard time percolating information about failure modes and how to address them, we should expect the average ML engineer to be doing very unsafe things for a very long time by default.
To dive deeper into the XSS example, I think even among those that understood the output encoding and canonicalization solutions early, it still took a while to formalize the definition of an encoding context concisely enough to be able to have confidence that all such edge cases could be covered.
It might be enough to simply recognize an area of alignment that has dragons and let the experts safely explore the nature and contours of those dragons, but you probably couldn’t build a useful web application that doesn’t display user-influenceable input. I think trying to get the industry to halt on building even obviously dragon-infested things is part of what has gotten Eliezer so burned out and pessimistic.
I think you make good points generally about status motives and obstacles for breakers. As counterpoints, I would offer:
Eliezer is a good example of someone who built a lot of status on the back of “breaking” others’ unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
There are lots of high-status breakers, and lots of independent status-rewarding communities around the security world. Some of these are whitehat/ethical, like leaderboards for various bug bounty programs, OWASP, etc. Some of them not so much so, like Blackhat/DEFCON in the early days, criminal enterprises, etc.
Perhaps here is another opportunity to learn lessons from the security community about what makes a good reward system for the breaker mentality. My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I’m also aware that thinking breaking is easy probably commits some quantity of typical mind fallacy. Thinking about how things break, or how to break them intentionally, is probably a skill that needs a lot more training in alignment. Or at least we need a way to attract skilled breakers to alignment problems.
I find it to be a very natural fit to post bounties on various alignment proposals to attract breakers to them. Keep upping the bounty, and eventually you have a quite strong signal that a proposal might be workable. I notice your experience of offering a personal bounty does not support this, but I think there is a qualitative difference between a bounty leaderboard with public recognition and a large pipeline of value that can be harvested by a community of good breakers, and what may appear to be a one-off deal offered by a single individual with unclear ancillary status rewards.
It may be viable to simply partner with existing crowdsourced bounty program providers (e.g. BugCrowd) to offer a new category of bounty. Traditionally, these services have focused on traditional “pen-test” type bounties, doing runtime testing of existing live applications. But I’ve long been saying there should be a market for crowdsourced static analysis, and even design reviews, with a pay-per-flaw model.
Many! Thanks for sharing. This could easily turn into its own post.
In general, I think this is a great idea. I’m somewhat skeptical that this format would generate deep insights; in my experience successful Capture the Flag / wargames / tabletop exercises work best in the form where each group spends a lot of time preparing for their particular role, but opsec wargames are usually easier to score, so the judge role makes less sense there. That said, in the alignment world I’m generally supportive of trying as many different approaches as possible to see what works best.
Prior to reading your post, my general thoughts about how these kind of adversarial exercises relate to the alignment world were these:
The industry thought leaders usually have experience as both builders and breakers; some insights are hard to gain from just one side of the battlefield. That said, the industry benefits from folks who spend the time becoming highly specialized in one role or the other, and the breaker role should be valued at least equally, if not more than the builder. (In the case of alignment, breakers may be the only source of failure data we can safely get.)
The most valuable tabletop exercises that I was a part of spent at least as much time analyzing the learnings as the exercise itself; almost everyone involved will have unique insights that aren’t noticed by others. (Perhaps this points to the idea of having multiple ‘judges’ in an alignment tournament.)
Non-experts often have insights or perspectives that are surprising to security professionals; I’ve been able to improve an incident response process based on participation from other teams (HR, legal, etc.) almost every time I’ve run a tabletop. This is probably less true for an alignment war game, because the background knowledge required to even understand most alignment topics is so vast and specialized.
Unknown unknowns are a hard problem. While I think we are a long way away from having builder ideas that aren’t easily broken, it’s going to be a significant danger to have breakers run out of exploit ideas and mistake that for a win for the builders.
Most tabletop exercises are focused on realtime response to threats. Builder/breaker war games like the DEFCON CTF are also realtime. It might be a challenge to create a similarly engaging format that allows for longer deliberation times on these harder problems, but it’s probably a worthwhile one.
I definitely wouldn’t rule out the possibility of being able to formally define a set of tests that would satisfy our demands for alignment. The most I could say with certainty is that it’s a lot harder than eliminating software security bug classes. But I also wouldn’t rule out the possibility that an optimizing process of arbitrarily strong capability simply could not be aligned, at least to a level of assurance that a human could comprehend.
Thank you for these additional references; I was trying to anchor this article with some very high-level concepts. I very much expect that to succeed we’re going to have to invent and test hundreds of formalisms to be able to achieve any kind of confidence about the alignment of a system.
I got part of the way through the process and then got stuck, but my situation may not be typical.
These bonds have to be purchased directly from the treasury, with an account at Treasury Direct.
Creation of a Treasury Direct account requires mailing in a form that has to be certified by a specific bank certifying agent in the US. A regular notary service is not accepted.
As far as I can tell, an equivalent certification service isn’t available outside the country.
I shudder to imagine the mutual funds created to fund bids on this thing.
How hard do you have to squint to not see this thing as pyramid-shaped? It’s like Sierpinski’s pyramid: fractally a scam, a scam at every conceivable resolution.
Actually, the worst case would be if the minting price increases more slowly than the value of half the pool grows. Then every next bid would still be “in the money”, and whoever doesn’t go bankrupt first wins. This thing could eat the whole world. Terrible. Kill it with fire.
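A toy simulation of that failure mode (all parameters are invented for illustration; they’re not the actual contract’s numbers): when the mint price grows more slowly than half the pool, every bid after the first couple is “in the money” forever:

```python
# Toy model: each mint fee goes into the pool, the last bidder stands
# to claim half the pool, and the next mint costs `growth` times more.
def simulate(start_price=1.0, growth=1.05, rounds=20):
    pool = 0.0
    price = start_price
    rows = []
    for i in range(rounds):
        pool += price
        payout = pool / 2          # what the current bidder would win
        rows.append((i, price, payout, payout > price))
        price *= growth            # slow price growth: the pathological case
    return rows

for i, price, payout, in_money in simulate():
    print(f"bid {i:2d}: cost={price:6.2f}  half-pool={payout:6.2f}  in-the-money={in_money}")
```

With 5% price growth, every bid from roughly the third onward beats the cost of minting it, and the gap only widens, so there is no natural stopping point.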
Well, maybe I’m missing something, but the game theory doesn’t seem that interesting to me, and calling it a ‘return on investment’ seems a bit generous for what is really just a game of blockchain chicken. In fact, it might be as crazy as a dollar auction, with people ending up bidding more than half the accumulated contract is worth due to sunk-cost fallacy or other irrational behavior.
Either way, you’re not really buying anything of value here: you’re just betting that the auction gets so little attention that you can walk away with free money, or else you’re financing someone else’s eventual bad decisionmaking (possibly your own).
You say “which in theory should increase the value of every minted NFT so far”, but I don’t see how. What additional value is added to a previously purchased NFT by someone purchasing the next one? If anything, each incremental minting makes the previous one worthless (unless you ascribe some inherent value to owning an arbitrary NFT). In fact, every NFT in this game is a badge of shame except the one that wins the pot, and even that one could be shameful if it cost more than the pot was worth.
It seems especially insidious that the game perpetuates itself by trying to get people to restart the bidding war with the other half of the prize pool. You didn’t say whether the price of minting resets after half the pool gets claimed, but either way it’s terrible: either the price resets, making the next bidding war even more furious and tying up even more people and funds, or it doesn’t, making it more likely that the other half sits unclaimed forever because minting a bid costs more than you’d see in return. I don’t see how this ever ends well for anyone.
On the other hand, one could view this game as punishing greed at a meta level: at first it looks like you get a free 242x return, but at best you realize you’ve only thrown away $500; at worst you end up much deeper in the hole you tried to dig your way out of. Not sure I approve of the ethics of this punishment, though.
Anyway, it seems clear to me that the correct strategy is to not mint the next NFT. Get some popcorn and watch; the entertainment value to others is the only real value in RandomWalk.
What happens to the other half? This seems underspecified as you’ve described it.
In the interest of science, I ran 10 more simulations with our submitted population. This is not to open a can of worms or to challenge the results in any way—we all knew we had to win on the first try!
Some things that I observed:
The results were highly sensitive to randomness. Almost no species survived consistently.
Sometimes defenseless creatures survived and sometimes they didn’t.
LeavyTanky (ViktorThink) survived basically every time. Looks like there was no competition for the invincible leaf eater niche in the Rainforest (though leaf eaters abounded). I would say this is the strongest creature in the field of submissions based on my tests.
Usually an apex predator survived (10 attack, 10 speed). Often it was the most successful creature in terms of total energy across all the biomes it spread to. Antivenom usually didn’t seem worth it on an apex predator, but the Cheetah had it and did well in several runs.
Venomous creatures almost never survived.
As a class, armored tanks were the majority of survivors. Occasionally a speeder would survive, but much less commonly.
Usually, some mid-range tanks survived as well (~6 armor). This was often enough to stay ahead of predators while outcompeting invincible tanks.
On average, only about 15 species survived past generation 1000; 30 species NEVER survived that long together. If you combine species occupying the same niche, the number was barely more than 10.
The tundra was always barren. The desert was always taken over by a single species.
I was surprised to see the Dump omnivores survive many times (Garbage Disposal, and 2-8-0 algae-...). Creatures with more than a few food sources generally didn’t do well, but the formula seemed decent in the Dump.
Sometimes the coconuts got eaten! Not often, though.
Often a 1 attack, 1 speed omnivore survived. Usually these took the place of defenseless creatures, but in one case they coexisted.
It might be fun to compete to design the creature that does the best against the 555-species field. I might also do some more experiments/analysis when I have some time—let me know if there’s anything you’re curious about.
Congrats to all the winners! Already looking forward to next year. Thanks lsusr for running this again this year!
Here are our Brier scores for our predictions:
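(For anyone unfamiliar, a Brier score is just the mean squared error of probability forecasts against binary outcomes; lower is better. The numbers in this sketch are made up, not our actual predictions:)

```python
# Brier score: mean squared error between probability forecasts and
# 0/1 outcomes. A perfect forecaster scores 0; always saying 0.5
# scores 0.25.
def brier(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier([0.9, 0.2, 0.7], [1, 0, 0]))  # -> 0.18
```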
The defenseless creature result really surprised most of us. Well done, aphyer, you knew what was up.
Of all the things, the coconuts were by far the most difficult to get anything to survive on. In my simulations, usually the coconut eaters that survived were also eating something else.
In theory, coconuts should sustain a 13.1 E creature; in practice, with such a small food source, a creature of that size gets outcompeted at first by much smaller organisms, which then get hunted to extinction by predators.
Ah, I read the wrong line. So yeah, we submitted the exact same creature.
There were definitely reliably BAD creatures, and certainly some reliably good ones, but a lot of variance based on the overall makeup of the population. I certainly didn’t expect so many total creatures to be submitted; there was a lot more variability in results with 500-creature populations. In 5000-creature populations, basically the only thing that ever survived was invincibles.
With this size population, I don’t think it’s a coincidence that your minimal invincible survived—and certainly wasn’t just luck that you arrived at its design. Give yourself SOME credit. :)
I submitted the exact same 10 speed leaf eater that you did, I just started it in the Temperate Forest. Luck of the draw that yours got here first, I guess.