Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.
Richard_Ngo
if there are sufficiently many copies, it becomes impossible to corrupt them all at once.
So I don’t love this model because escaping corruption is ‘too easy’.
I really like the cellular automaton model. But I don’t think it makes escaping corruption easy! Even if most of the copies are non-corrupt, the question is how you can take a “vote” of the corrupt vs non-corrupt copies without making the voting mechanism itself be easily corrupted. That’s why I was talking about the non-corrupt copies needing to “overpower” the corrupt copies above.
A few responses:
As per my post on underdog bias, the question of which group is actually weaker and which group is stronger is often a pretty subjective call. I even discuss in the post the example of Israel, where you could see it as the “stronger” group (vs Palestine in particular) or the “weaker” group (vs all the Muslim countries surrounding it).
There are plenty of cases where leftists support the stronger group against the weaker group—most notably Soviet and Chinese repression of dissidents and minorities. E.g. it took Solzhenitsyn publishing Gulag Archipelago to finally get leftists (even fairly “mainstream” leftists) to stop lionizing the USSR.
Even insofar as leftists tend to support the weaker group, there are almost no cases where they do so as strongly as in Israel vs Palestine. So there’s still something important to be explained here even accepting your claims.
Wanted to revisit this because it seemed like one of the points where people most strongly disagreed with me. I’m trying to figure out a crux here. One might be something like: how widespread were celebrations of the 10/7 attacks amongst prominent leftists (and especially the student groups that later organized encampments)? I could imagine updating that there were only a handful of cases that disproportionately blew up, which would make me take back or at least caveat the “supports Hamas” thing.
If, on the other hand, you found that there were many cases where prominent leftists and encampment organizers actively celebrated the 10/7 attacks, would you (@dirk or @lc) then agree that “supports Hamas” is a reasonable summary?
That’s a fair point; however, I don’t think it undermines my overall claims very much. I think the lack of rule of law for black Americans was bad in a comparable way to how the lack of rule of law for various European colonies was bad. That is, while it was bad for the people who didn’t get rule of law, they were a separate enough category that this mostly didn’t “leak into” undermining the legal mechanisms that helped their societies become productive and functional in the first place.
This seems very relevant, thank you. Will check it out.
That is very cool, thank you! Will check it out.
A next step is to settle on a model of what you want to get done, and what capabilities the adversaries have.
Perhaps. The issue here is that I’m not so interested in any specific goal, but rather in facilitating emergent complexity. One analogy here is designing Conway’s Game of Life: I expect that it wasn’t purely a process of “pick the rules you want, then see what results from those” but also in part “pick the results you want, then see what rules lead to those”.
Re the Byzantine generals problem, see my reply to niplav below:
I believe (please correct me if I’m wrong) that Byzantine fault tolerance mostly thinks about cases where the nodes give separate outputs—e.g. in the Byzantine generals problem, the “output” of each node is whether it attacks or retreats. But I’m interested in cases where the nodes need to end up producing a “synthesis” output—i.e. there’s a single output channel under joint control.
Yeah, “incomplete” is the wrong word here. I edited shortly after posting to say instead “there are many elements of good chess strategy for bounded agents that they can’t account for”.
In a post on Solomonoff Induction (and also in this wiki entry), Yudkowsky describes Shannon’s minimax algorithm for searching the entire chess game-tree as an example of conceptual progress. Previously, Edgar Allan Poe had argued that it was impossible in principle for a machine to play chess well. With Shannon’s algorithm, it became possible in principle, just computationally infeasible.
However, even “principled” algorithms like minimax search don’t take into account the possibility that your opponent knows things you don’t know (as I’ve previously discussed here). And so there are many elements of good chess strategy for bounded agents that they can’t account for:
When your opponent plays a move that seems surprisingly bad, update that they’re probably seeing some tactics that you’re missing.
When your opponent plays a move that seems surprisingly good, update that they’re probably more skilled than you thought.
Try to play tricky (“sharp”) lines when you’re behind, in the hope that your opponent makes a mistake.
Try to play solidly when you’re ahead, even if it narrows your lead.
Identify your opponent’s playing style, and try to exploit it.
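As a toy illustration of the second heuristic above (my own formalization, with made-up skill levels and likelihoods), a surprisingly good move can be treated as Bayesian evidence about the opponent’s skill:

```python
# Toy Bayesian update on an opponent's skill after observing a move.
# The skill categories and probabilities are illustrative assumptions.

def update_skill_belief(prior, likelihood, observation):
    """prior: dict skill -> P(skill);
    likelihood: dict skill -> dict observation -> P(observation | skill)."""
    unnormalized = {s: p * likelihood[s][observation] for s, p in prior.items()}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

prior = {"weak": 0.5, "strong": 0.5}
likelihood = {
    "weak":   {"surprisingly_good": 0.1, "ordinary": 0.9},
    "strong": {"surprisingly_good": 0.4, "ordinary": 0.6},
}
posterior = update_skill_belief(prior, likelihood, "surprisingly_good")
# posterior["strong"] == 0.8: one surprising move shifts belief from 0.5 to 0.8.
```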
Some of these considerations have been discussed by Yudkowsky under the heading of “Vingean uncertainty” (though as per my post linked above I’ll talk about the more general concept of Knightian uncertainty). And I’m sure that they’ve been acknowledged elsewhere as well—though I suspect that their importance is underrated because we don’t have good technical language for describing them. To build relevant intuitions, try playing a game against LeelaPieceOdds, a chess engine designed to beat humans even when it starts down a piece (or several). Playing it is very frustrating because you constantly have the sense that you should be winning, but you don’t know how to actually make use of your advantage. I’m interested in figuring out a theory of how to do so, as a step towards building a more general theory of how to behave under Knightian uncertainty.
Here’s one approach. For each position, estimate the value of that position conditional on reaching it in your game (which I’ll call the Knightian value of the position). That is, suppose you’re playing white, and you’re evaluating a possible move W1 and a possible response B1. Then your evaluation of the position after B1 should not only incorporate your “naive” estimate of how good the position is, but also account for the fact that your opponent has deliberately chosen to play B1 knowing that it’d lead to that state.
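Here’s a minimal toy model of that conditioning (my own construction, not a standard algorithm): suppose your evaluations of each black reply are noisy, while the opponent sees the true values and plays the reply that is worst for you. A Monte Carlo estimate of the value conditional on the opponent’s choice then comes out below your naive estimate:

```python
import random

# Toy "Knightian value" model (my own construction): true values are noisy
# versions of your naive estimates; the opponent knows the true values and
# plays the reply that is worst for you. Conditioning on which reply was
# actually chosen shifts your estimate of that reply downwards.

random.seed(0)

def knightian_value(naive_estimates, noise_sd, chosen, n_samples=50_000):
    """Estimate E[true value of `chosen` | opponent picked `chosen`].
    Assumes true value ~ Normal(naive estimate, noise_sd), independently."""
    total, count = 0.0, 0
    for _ in range(n_samples):
        true_vals = {m: random.gauss(est, noise_sd)
                     for m, est in naive_estimates.items()}
        if min(true_vals, key=true_vals.get) == chosen:  # opponent's pick
            total += true_vals[chosen]
            count += 1
    return total / count

# Your naive reads of three possible black replies (higher = better for you).
estimates = {"B1": 0.0, "B2": 0.2, "B3": 0.5}
v = knightian_value(estimates, noise_sd=1.0, chosen="B1")
# v lands well below the naive estimate of 0.0: the fact that the opponent
# chose B1 is evidence that B1 is even better for them than you thought.
```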
According to Claude, this is analogous to a simple version of the Glosten-Milgrom model of bid-ask spreads on stock markets: Ask = E[Value | Buy order arrives], and Bid = E[Value | Sell order arrives]. How can you calculate those values? I’m not sure how financiers actually do it, but my immediate intuition is that you’d need a model of the opportunity cost of not investing in other things, and a model of how well-informed and “rational” other traders are, and then you’d need to somehow combine them to get an overall update.
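For what it’s worth, here’s a sketch of the simple version of that calculation (the standard toy setup with a fraction `alpha` of “informed” traders who know the true value; the specific numbers are illustrative):

```python
# Toy Glosten-Milgrom-style bid-ask calculation (illustrative numbers).
# The asset is worth v_high or v_low; a fraction `alpha` of traders are
# informed and trade in the profitable direction; the rest trade at random.

def bid_ask(v_high, v_low, p_high, alpha):
    # P(buy | V = high): informed traders buy; uninformed flip a coin.
    p_buy_given_high = alpha + (1 - alpha) * 0.5
    p_buy_given_low = (1 - alpha) * 0.5
    p_buy = p_high * p_buy_given_high + (1 - p_high) * p_buy_given_low
    # Ask = E[V | buy order arrives], via Bayes' rule.
    p_high_given_buy = p_high * p_buy_given_high / p_buy
    ask = p_high_given_buy * v_high + (1 - p_high_given_buy) * v_low
    # Bid = E[V | sell order arrives], symmetrically.
    p_sell_given_high = (1 - alpha) * 0.5
    p_sell_given_low = alpha + (1 - alpha) * 0.5
    p_sell = p_high * p_sell_given_high + (1 - p_high) * p_sell_given_low
    p_high_given_sell = p_high * p_sell_given_high / p_sell
    bid = p_high_given_sell * v_high + (1 - p_high_given_sell) * v_low
    return bid, ask

bid, ask = bid_ask(v_high=110, v_low=90, p_high=0.5, alpha=0.3)
# bid = 97.0, ask = 103.0; more informed traders (higher alpha) widen the spread.
```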
The chess case is simpler in the sense that there’s only one adversary. But unlike in finance, there’s no “market rate” of opportunity cost. Instead, the Knightian value of every move will depend sensitively on how good the alternative moves available were. E.g. if I play a seemingly-pointless queen sacrifice when I have many other good moves available, then you should strongly suspect that it’s a trap. Whereas if I do the same in a position that is otherwise lost, you shouldn’t update nearly as much.
Unfortunately, this makes things much more complicated. In order to estimate the “naive” value of a move, we need to look at all the moves downstream of it. But estimating the Knightian value also requires evaluating the naive value of all the moves upstream of it. Basically, every Knightian estimate depends on every other part of the game-tree. Now, we might still be able to find an algorithm which does this given enough compute—just like Shannon took chess from unsolved in principle to solved given enormously implausible amounts of compute. But this seems like it would defeat the point of updating on our opponent’s intentions at all—because knowing the entire game-tree should “screen off” the relevance of their intentions. (This feels kinda analogous to how fully understanding physics, biology, sociology, etc should screen off anthropic reasoning about how many alien civilizations exist.)
—
Here’s a more promising approach. There are some positions where my estimates of their value won’t change much conditional on the game reaching them. If I can see a clear mate in 1, then it’s far more likely that my opponent made a bunch of mistakes to get to this point, than that they have some clever plan I’m not seeing (though sometimes I think I have a mate in 1 and don’t actually).
Intuitively speaking, the most sensible strategy seems to be to “bootstrap” from parts of the game-tree I can evaluate robustly, to then come up with more robust estimates of other parts of the game-tree. In order to figure out how to do that, we’ll need a more precise definition of robustness, and also some way of tying it to my estimate of my opponent’s skill level (though perhaps we could even define skill in terms of ability to win from positions that I’d thought were robustly losing? Worth thinking about.)
I don’t currently know how to do this, but I want to gesture at three ideas that might be relevant. Firstly, there’s work on imprecise probabilities—for example replacing point estimates (like “55% chance of winning from this position”) with “credal sets” (like “45-65% chance of winning from this position”). You can then replace the credal set in turn with a “plausibility function” that describes how plausible each point estimate is (with estimates outside the credal set having 0 plausibility). I suspect that there’s some relationship between plausibility and robustness in the sense I defined above—though I’m not very familiar with the literature, and so I’m mainly going off some conversations I had with davidad.
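To make the credal-set idea concrete, here’s a toy sketch (my own illustration; “Gamma-maximin”, choosing the option with the best worst-case value over the credal set, is one standard decision rule from the imprecise-probability literature):

```python
# Interval-valued win probabilities as a crude stand-in for credal sets.
# The move names and numbers are invented for illustration.

moves = {
    "sharp_sacrifice": (0.30, 0.85),  # high upside, wide interval: not robust
    "solid_trade":     (0.55, 0.65),  # narrow interval: a robust estimate
}

def gamma_maximin(moves):
    """Pick the move whose worst-case win probability is highest."""
    return max(moves, key=lambda m: moves[m][0])

def robustness(interval):
    """One crude notion of robustness: how narrow the credal set is."""
    lo, hi = interval
    return 1.0 - (hi - lo)

best = gamma_maximin(moves)  # -> "solid_trade"
# The sacrifice has the higher upside, but under Gamma-maximin the narrow,
# robust estimate wins.
```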
Secondly, Peter Schmidt-Nelson is trying to use symmetry considerations to demonstrate that there’s no winning strategy for black in chess. I don’t think the project itself is directly relevant to the ideas I’ve been describing here, but it feels related in a philosophical sense.
Thirdly, although I’ve been talking about the “value” of a position as if it’s a well-defined concept, it mostly isn’t. Stockfish’s value calculations are grounded in the likelihood of it winning from that position when playing itself. But there’s no clear way to translate from that to the likelihood of winning against one’s actual opponent, which is what we’re interested in. I won’t discuss this further here, but trying to pin down how to estimate a position’s value in that sense seems potentially fruitful.
—
To finish, I want to highlight an interesting analogy. Personally speaking, I’m much more interested in playing Go than in playing chess. And because Go has many, many more possible board positions than chess, the considerations above are much more salient. But more than that, they seem to play out on two different levels: both within the game-tree, and on the board itself.
Specifically, when you’re playing Go, the board typically ends up divided into territories which are loosely or securely controlled by one player or the other player. When a territory is loosely controlled, it’s often possible to “invade” it and prevent the other player from scoring points within it. The more securely a territory is controlled, however, the less likely an invasion is to succeed (and the more the invading player will weaken their other nearby territories even if they do succeed).
When a territory is small, you can try to decide whether it can be invaded by evaluating many possible sequences of moves. However, for larger territories this is prohibitively difficult. Instead, you need to use your intuition to evaluate how secure the territory is overall, perhaps by evaluating a couple of the scariest possible invasions. It’s also possible to make “probing moves” which your opponent should respond to differently depending on how secure they think their territory is.
I hypothesize that a sufficiently good theory of how to search through the chess game-tree will also tell us about, not just how to search through the Go game-tree, but also what strategies to use on the Go board itself. It’ll need to incorporate concepts like “this region of the board/game-tree is robustly my territory” and “this region is high in expected value but not very robust” and “here’s the boundary between one region and another”. The interactions between different regions of the board are more complex and nuanced than between different regions of the game-tree, but some of the same principles will likely still apply. E.g. since players have limited time, building trust in one region of the game-tree trades off against building trust in another (just as defending one region of the Go board trades off against defending another). To exploit that, you can make “probing moves” which force them to decide if they actually trust their estimate of a sharp line of play, or if they’ll retreat to a line which seems worse but more solid for them. And so on.
This all feels like a set of hints towards a potential way of quantifying Knightian uncertainty in terms of boundaries and how permeable they are to adversaries—which might then be extended to much larger-scale “games” like internal conflicts within minds or even human geopolitical conflicts. One piece of evidence that this agenda is paying off would be if we could create a “good old-fashioned AI” Go engine which can play at a superhuman level without using any neural networks—though I expect that it’s worth trying to build a significantly better theoretical understanding before embarking on that project. (Edit: maybe worth starting off with something like Connect 4 for the sake of simplicity.)
Ty, fixed.
Here’s a list of my donations so far this year (put together as part of thinking through whether I and others should participate in an OpenAI equity donation round).
They are roughly in chronological order (though it’s possible I missed one or two). I include some thoughts on what I’ve learned and what I’m now doing differently at the bottom.
$100k to Lightcone
This grant was largely motivated by my respect for Oliver Habryka’s quality of thinking and personal judgment.
This ended up being matched by the Survival and Flourishing Fund (though I didn’t know it would be when I made it). Note that they’ll continue matching donations to Lightcone until the end of March 2026.
$50k to the Alignment of Complex Systems (ACS) research group
This grant was largely motivated by my respect for Jan Kulveit’s philosophical and technical thinking.
$20k to Alexander Gietelink Oldenziel for support with running agent foundations conferences.
~$25k to Inference Magazine to host a public debate on the plausibility of the intelligence explosion in London.
$100k to Apart Research, who run hackathons where people can engage with AI safety research in a hands-on way (technically made with my regranting funds from Manifund, though I treated it like a $100k boost to my own donation budget).
$50k to Janus
Janus could reasonably be described as the Jane Goodall of AI. They and their collaborators are doing the kind of creative thinking and experimentation that has a genuine chance of leading to new paradigms for understanding AI. See for instance this discussion of AI identities.
$15k to Palladium
They are doing good thinking about governance and politics on a surprisingly tight budget.
$100k to Sahil to support work on live theory at groundless.ai
I’ve found my conversations with Sahil extremely generative. He’s one of the researchers I’ve talked to with the most ambitious and philosophically coherent “overall vision” for the future of AI. I still feel confused about how likely his current plans are to actualize that vision (and there are also some points where it’s in tension with my own overall vision) but it definitely seems worth betting on.
Total so far: ~$460k (of which $360k was my own money, and $100k Manifund’s money).
Note that my personal donations this year are >10x greater than any previous year; this is because I cashed out some of my OpenAI equity for the first time. So this is the first year that I’ve invested serious time and energy into donating. What have I learned?
My biggest shift is from thinking of myself as donating “on behalf of the AI safety community” to specifically donating to things that I personally am unusually excited about. I have only a very small proportion of the AI safety community’s money; also, I have fairly idiosyncratic views that I’ve put a lot of time into developing. So I now want to donate in a way which “bets on” my research taste, since that’s the best way to potentially get outsized returns. More concretely:
I’d classify the grants to Apart Research and the Inference Magazine debate as things that I “thought the community as a whole should fund”. If I were making those decisions today, I’d fund Apart Research significantly less (maybe $50k?) and not fund the debate (also because I’ve updated away from public outreach as a valuable strategy).
I consider my donations to ACS, Janus and Sahil as leveraging my research taste: these are some of the people who I have the most productive research discussions with. I’m excited about others donating to them too.
My grants to Lightcone and Alexander Gietelink Oldenziel are somewhere in between those two categories. I’m still excited about them, though I’m now a bit more skeptical about conferences/workshops in general as a thing I want to support (there are so many conferences; are people actually getting value out of them, or mainly using them as a way to feel high-status?). However, this is less of a concern for agent foundations conferences, and it’s also the sort of thing that I trust Oliver to track and account for.
My political views are unusual enough that I haven’t yet figured out a great way to fund work that advances them. Palladium is in the right broad direction, but not focused enough on my particular interests for me to want to fund it at scale (and again is more of a “someone should fund it” type thing). Regardless, I’m uninterested in almost all of the AI governance interventions others in the community are funding.
Even more recently, I’ve decided that I can bet on my research taste most effectively by simply hiring research assistants to work for me. I’m uncertain how much this will cost me, but if it goes well it’ll be most of my “donation” budget for the next year. I could potentially get funding for this, but at least to start off with, it feels valuable to not be beholden to any external funders.
More generally, I’d be excited if more people who are wealthy from working at AI labs used that money to make more leveraged bets on their own research (e.g. by working independently and hiring collaborators). This seems like a good way to produce the kinds of innovative research that are hard to incentivize under other institutional setups. I’m currently writing a post elaborating on this intuition.
I guess your thought is around someone corrupting a specific part of all nodes?
No, I’m happy to stick with the standard assumption of limited amounts of corruption.
However, I believe (please correct me if I’m wrong) that Byzantine fault tolerance mostly thinks about cases where the nodes give separate outputs—e.g. in the Byzantine generals problem, the “output” of each node is whether it attacks or retreats. But I’m interested in cases where the nodes need to end up producing a “synthesis” output—i.e. there’s a single output channel under joint control.
Error-correcting codes work by running some algorithm to decode potentially-corrupted data. But what if the algorithm might also have been corrupted? One approach to dealing with this is triple modular redundancy, in which three copies of the algorithm each do the computation and take the majority vote on what the output should be. But this still creates a single point of failure—the part where the majority voting is implemented. Maybe this is fine if the corruption is random, because the voting algorithm can constitute a very small proportion of the total code. But I’m most interested in the case where the corruption happens adversarially—where the adversary would home in on the voting algorithm as the key thing to corrupt.
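For concreteness, here’s a minimal sketch of triple modular redundancy, with the majority vote written out as the small piece of code that an adversary would home in on:

```python
# Triple modular redundancy: three copies compute independently, and a
# majority vote picks the output. The voter itself is the single point of
# failure discussed above.

def majority_vote(outputs):
    """The tiny, easily-corruptible piece of code everything hinges on."""
    return max(set(outputs), key=outputs.count)

def good_copy(x):
    return x * 2

def corrupted_copy(x):
    return x * 2 + 1  # an adversarially corrupted copy

outputs = [good_copy(21), good_copy(21), corrupted_copy(21)]  # [42, 42, 43]
result = majority_vote(outputs)  # -> 42: the two good copies outvote the bad one
```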
After a quick search, I can’t find much work on this specific question. But I want to speculate on what such an “error-correcting algorithm” might look like. The idea of running many copies of it in parallel seems solid, so that it’s hard to corrupt a majority at once. But there can’t be a single voting algorithm (or any other kind of “overseer”) between those copies and the output channel, because that overseer might itself be corrupted. Instead, you need the majority of the copies to be able to “overpower” the few corrupted copies to control the output channel via some process that isn’t mediated by a small easily-corruptible section of code.
The viability of some copies “overpowering” other copies will depend heavily on the substrate on which they’re running. For example, if all the copies are running on different segments of a Universal Turing Machine tape, then a corrupted copy could potentially just loop forever and prevent the others from answering. So in order to make error-correcting algorithms viable we may need a specific type of Universal Turing Machine which somehow enforces parallelism. Then you need some process by which copies that agree on their outputs can “merge” together to form a more powerful entity; and by which entities that disagree can “fight it out”. At the end there should be some way for the most powerful entity to control the output channel (which isn’t accessible while conflict is still ongoing).
The punchline is that we seem to have built up a kind of model of “agency” (and, indeed, almost a kind of politics) from these very basic assumptions. Perhaps there are other ways to create such error-correcting algorithms. If so, I’d be very interested in hearing about them. But I increasingly suspect that agency is a fundamental concept which will emerge in all sorts of surprising places, if only we know how to look for it.
The people I instinctively checked after reading this:
Pichai: 5′11
Gates: 5′10
Ballmer: 6′5
I got conflicting estimates for Jobs and Nadella
A few quick comments, on the same theme as but mostly unrelated to the exchange so far:
I’m not very sold on “cares about xrisk” as a key metric for technical researchers. I am more interested in people who want to very deeply understand how intelligence works (whether abstractly or in neural networks in particular). I think the former is sometimes a good proxy for the latter but it’s important not to conflate them. See this post for more.
Having said that, I don’t get much of a sense that many MATS scholars want to deeply understand how intelligence works. When I walked around the poster showcase at the most recent iteration of MATS, a large majority of the projects seemed like they’d prioritized pretty “shallow” investigations. Obviously it’s hard to complete deep scientific work in three months but at least on a quick skim I didn’t see many projects that seemed like they were even heading in that direction. (I’d cite Tom Ringstrom as one example of a MATS scholar who was trying to do deep and rigorous work, though I also think that his core assumptions are wrong.)
As one characterization of an alternative approach: my internship with Owain Evans back in 2017 consisted of me basically sitting around and thinking about AI safety for three months. I had some blog posts as output but nothing particularly legible. I think this helped nudge me towards thinking more deeply about AI safety subsequently (though it’s hard to assign specific credit).
There’s an incentive alignment problem where even if mentors want scholars to spend their time thinking carefully, the scholars’ careers will benefit most from legible projects. In my most recent MATS cohort I’ve selected for people who seem like they would be happy to just sit around and think for the whole time period without feeling much internal pressure to produce legible outputs. We’ll see how that goes.
At some point I recall thinking to myself “huh, LessWrong is really having a surge of good content lately”. Then I introspected and realized that about 80% of that feeling was just that you’ve been posting a lot.
“Please don’t roll your own crypto” is a good message to send to software engineers looking to build robust products. But it’s a bad message to send to the community of crypto researchers, because insofar as they believe you, you won’t get new crypto algorithms from them.
In the context of metaethics, LW seems much more analogous to the “community of crypto researchers” than the “software engineers looking to build robust products”. Therefore this seems like a bad message to send to LessWrong, even if it’s a good message to send to e.g. CEOs who justify immoral behavior with metaethical nihilism.
FWIW, in case this is helpful, my impression is that:
It is accurate to describe Wei as doing a “charge of the hobby-horse” in his initial comment, and this should be considered a mild norm violation. I’m also surprised and a bit disappointed that it got so many upvotes.
By the time that Tsvi announced the ban, Wei had already acknowledged that his original comments had been partly based on a misunderstanding. In my culture, I would expect more of an apology for doing so than the “ok...but to be fair” follow-up Wei actually gave. However, the phrase “Also, another part of my motivation is still valid and I think it would be interesting to try to answer” is a clear enough acknowledgement of a distinct line of inquiry that I no longer consider that comment to be a continuation of the “charge of the hobby-horse”.
Tsvi banning Wei for “grossly negligent reading comprehension” after Wei had acknowledged that he was mistaken seems like a mild norm violation. It wouldn’t have been a norm violation if Wei’s comment hadn’t made that acknowledgement; however, it would have been a stronger norm violation if Wei’s comment had included an actual apology.
Hmm, I don’t have anything substantive out on this specifically; the closest is probably this talk (though note that some of my arguments in it were a bit sloppy, e.g. as per the top comment).