Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I’m also supported by the LTFF. See also LinkedIn.
E-mail: {first name}@alter.org.il
First, some remarks about the meta-level:
The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly—such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn’t write, so didn’t try. I’m not particularly hopeful of this turning out to be true in real life, but I suppose it’s one possible place for a “positive model violation” (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that.
Actually, I don’t feel like I learned that much reading this list, compared to what I already knew. [EDIT: To be clear, this knowledge owes a lot to prior inputs from Yudkowsky and the surrounding intellectual circle; I am making no claim that I would have derived it all independently in a world in which Yudkowsky and MIRI didn’t exist.] To be sure, it didn’t feel like a waste of time, and I liked some particular framings (e.g. in A.4, separating the difficulty into “unlimited time but 1 try” and “limited time with retries”), but I think I could write something similar (in terms of content; it would very likely be much worse in terms of writing quality).
One reason I didn’t write such a list is that I don’t have the ability to write things comprehensibly. Empirically, everything of substance that I write is notoriously difficult for readers to understand. Another reason is that at some point I decided to write top-level posts only when I have substantial novel mathematical results, with rare exceptions. This is in part because I feel like the field has too much hand-waving and philosophizing and too little hard math (which rhymes with C.38), and in part because, even if people can’t understand the informal component of my reasoning, they can at least see there is math here and, given sufficient background, follow the definitions/theorems/proofs (although tbh few people do).
There’s no plan
Actually, I do have a plan. It doesn’t have an amazing probability of success (my biggest concerns are (i) not enough remaining time and (ii) even if the theory is ready in time, the implementation can be bungled, in particular for reasons of operational adequacy), but it is also not practically useless. The last time I tried to communicate it was 4 years ago, and it has obviously evolved since then. Maybe it’s about time to make another attempt, although I’m wary of spending a lot of effort on something which few people will understand.
Now, some technical remarks:
Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions.
This is true, but it is notable that deep learning is not equivalent to evolution, and the differences are important. Consider for example a system that is designed to separately (i) learn a generative model of the environment and (ii) search for plans effective on this model (model-based RL). Then module (ii) doesn’t inherently have the problem where the solution only optimizes the correct thing in the training environment, because this module is not bounded by available training data, only by compute. The question is then, to first approximation, whether module (i) is able to correctly generalize from the training data (obviously there are theoretical bounds on how good this generalization can be; but we want this generalization to be at least as good as human ability and without dangerous biases). I do not think current systems do such generalization correctly, although they do seem to have some ingredients right, in particular Occam’s razor / simplicity bias. But we can imagine some algorithm that does.
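To make the intended decomposition concrete, here is a minimal toy sketch (all names here are mine and purely illustrative, not any existing system): module (i) is fit to a finite dataset of trajectories and is bounded by how well it generalizes from that data, while module (ii) only ever queries the learned model and is limited by compute alone.

```python
import random

class WorldModel:
    """Module (i): a generative model fit to finite training data."""
    def __init__(self):
        self.transitions = {}  # (state, action) -> list of observed next states

    def fit(self, trajectories):
        # Record observed (state, action) -> next_state transitions.
        for traj in trajectories:
            for (s, a, s_next) in traj:
                self.transitions.setdefault((s, a), []).append(s_next)

    def sample_next(self, state, action):
        # Fall back to staying put if this pair was never observed.
        return random.choice(self.transitions.get((state, action), [state]))

def plan_by_search(model, start_state, actions, utility, horizon=3, n_rollouts=1000):
    """Module (ii): search for a plan that scores well on the learned model.
    Limited only by compute (n_rollouts, horizon), not by training data."""
    best_plan, best_value = None, float("-inf")
    for _ in range(n_rollouts):
        plan = [random.choice(actions) for _ in range(horizon)]
        state, value = start_state, 0.0
        for a in plan:
            state = model.sample_next(state, a)
            value += utility(state)
        if value > best_value:
            best_plan, best_value = plan, value
    return best_plan
```

The point of the sketch is just the asymmetry: throwing more compute at `plan_by_search` never requires more data, whereas any misgeneralization lives entirely in `WorldModel`.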
...on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.
Also true, but there is nuance. The key problem is that we don’t know why deep learning works, or more specifically, w.r.t. which prior it satisfies good generalization bounds. If we knew what this prior is, then we could predict some inner properties. For example, if you know your algorithm follows Occam’s razor, for a reasonable formalization of “Occam’s razor”, and you trained it on the sun setting every day for a million days, then you can predict that the algorithm will not confidently predict the sun is going to fail to set on any given future day. Moreover, our not knowing such generalization bounds for deep learning is a fact about our present state of mathematical ignorance, not a fact about the algorithms themselves.
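As a toy stand-in for the kind of guarantee meant here (using Laplace’s rule of succession, i.e. a uniform prior over the unknown frequency, which is of course much cruder than an actual simplicity prior):

$$P(\text{the sun fails to set tomorrow} \mid \text{it set on each of the last } N \text{ days}) = \frac{1}{N+2},$$

which for $N = 10^6$ is about $10^{-6}$. The analogous statement for deep learning would require knowing the prior w.r.t. which it generalizes, which is exactly what we currently lack.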
...there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.
It is true that (AFAIK) nothing like this was accomplished in practice, but the distance to that might not be too great. For example, I can imagine training an ANN to implement a POMDP which simultaneously successfully predicts the environment and complies with some “ontological hypothesis” about how the environment needs to be structured in order for the-things-we-want-to-point-at to be well-defined (technically, this POMDP needs to be a refinement of some infra-POMDP that represents the ontological hypothesis).
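A deliberately crude sketch of the shape of training objective gestured at here (every name is hypothetical; in particular `ontology_violation` stands in for some computable measure of how badly the learned model breaks the refinement condition, which is the genuinely hard part):

```python
def combined_loss(predictive_nll, ontology_violation, weight=1.0):
    """Total loss: predict the environment well *and* stay consistent with the
    ontological hypothesis in which the things-we-point-at are well-defined."""
    return predictive_nll + weight * ontology_violation

# A model that predicts well but badly violates the ontology still scores poorly:
print(combined_loss(predictive_nll=0.3, ontology_violation=2.0, weight=5.0))  # 10.3
```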
The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It’s not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.
There is a big chunk of what you’re trying to teach which is not weird and complicated, namely: “find this other agent, and what their values are”. Because “agents” and “values” are natural concepts, for reasons strongly related to “there’s a relatively simple core structure that explains why complicated cognitive machines work”. Admittedly, my rough proposal (PreDCA) does have some “weird and complicated” parts because of the acausal attack problem.
Any pivotal act that is not something we can go do right now, will take advantage of the AGI figuring out things about the world we don’t know so that it can make plans we wouldn’t be able to make ourselves. It knows, at the least, the fact we didn’t previously know, that some action sequence results in the world we want. Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence.
This is inaccurate: it is possible to imagine an AI that provides us with a plan for which we simultaneously (i) can understand why it works and (ii) wouldn’t have come up with ourselves without thinking for a very long time, time that we don’t have. At the very least, the AI could suggest a way of building a more powerful aligned AI. Of course, in itself this doesn’t save us at all: instead of producing such a helpful plan, the AI can produce a deceitful plan instead. Or a plan that literally makes everyone who reads it go insane in very specific ways. Or the AI could just hack the hardware/software system inside which it’s embedded to produce a result which counts for it as a high reward but which for us wouldn’t look anything like “producing a plan the overseer rates high”. But, this direction might not be completely unsalvageable[1].
Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can’t be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.
I agree that the process of inferring human thought from the surface artifacts of human thought requires powerful non-human thought, which is dangerous in itself. But this doesn’t necessarily mean that the idea of imitating human thought doesn’t help at all. We can combine it with techniques such as counterfactual oracles and confidence thresholds to try to make sure the resulting agent is truly only optimizing for accurate imitation (which still leaves problems like attacks from counterfactuals and non-Cartesian daemons, and also not knowing which features of the data are important to imitate might be a big capability handicap).
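As a minimal illustration of the confidence-threshold idea (my own toy rendering, not a precise statement of any of the proposals above): the imitator only acts autonomously when it is sufficiently confident about what the human would do, and defers otherwise.

```python
def act_or_defer(action_probs, threshold=0.9):
    """action_probs: the imitator's predicted distribution over what the human
    would do next, as a dict mapping actions to probabilities."""
    best_action = max(action_probs, key=action_probs.get)
    if action_probs[best_action] >= threshold:
        return ("act", best_action)   # confident imitation
    return ("defer", None)            # fall back to querying the human

print(act_or_defer({"left": 0.95, "right": 0.05}))  # ('act', 'left')
print(act_or_defer({"left": 0.55, "right": 0.45}))  # ('defer', None)
```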
That said, I feel that PreDCA is more promising than AQD: it seems to require less fragile assumptions and deals more convincingly with non-Cartesian daemons. [EDIT: AQD also can’t defend from acausal attack if the malign hypothesis has massive advantage in prior probability mass, and it’s quite likely to have that. It does not work to solve this by combining AQD with IBP, at least not naively.]
Full disclosure: I am a MIRI Research Associate. This means that I receive funding from MIRI, but I am not a MIRI employee and I am not privy to its internal operation or secrets.
First of all, I am really sorry you had these horrible experiences.
A few thoughts:
Thought 1: I am not convinced the analogy between Leverage and MIRI/CFAR holds up to scrutiny. I think that Geoff Anders is most likely a bad actor, whereas MIRI/CFAR leadership is probably acting in good faith. There seems to be significantly more evidence of bad faith in Zoe’s account than in Jessica’s account, and the conclusion is reinforced by adding evidence from other accounts. In addition, MIRI definitely produced some valuable public research whereas I doubt the same can be said of Leverage, although I haven’t been following Leverage so I am not confident about the latter (ofc it’s in principle possible for a deeply unhealthy organization to produce some good outputs, and good outputs certainly don’t excuse abuse of personnel, but I do think good outputs provide some evidence against such abuse).
It is important not to commit the fallacy of gray: it would risk both judging MIRI/CFAR too harshly and judging Leverage insufficiently harshly. The comparison Jessica makes to “normal corporations” reinforces this impression: I have much experience in the industry, and although it’s possible I’ve been lucky in some ways, I still very much doubt the typical company is nearly as bad as Leverage.
Thought 2: From my experience, AI alignment is a domain of research that intrinsically comes with mental health hazards. First, the possibility of impending doom and the heavy sense of responsibility are sources of stress. Second, research inquiries often enough lead to “weird” metaphysical questions that risk overturning the (justified or unjustified) assumptions we implicitly hold to maintain a sense of safety in life. I think it might be the closest thing in real life to the Lovecraftian notion of “things that are best not to know because they will drive you mad”. Third, the sort of people drawn to the area and/or having the necessary talents seem to often also come with mental health issues (I am including myself in this group).
This might be regarded as an argument to blame MIRI less for the mental health fallout described by Jessica, but this is also an argument to pay more attention to the problem. It would be best if we could provide the people working in the area with the tools and environment to deal with these risks.
Thought 3: The part that concerned me the most in Jessica’s account (in part due to its novelty to me) is MIRI’s internal secrecy policy. While it might be justifiable to have some secrets to which only some employees are privy, it seems very extreme to require going through an executive because even the mere fact that a secret project exists is too dangerous. MIRI’s secrecy policy seemed questionable to me even before, but this new spin makes it even more dubious.
Overall, I wish MIRI was more transparent, so that for example its supporters would know about this internal policy. I realize there are tradeoffs involved, but I am not convinced MIRI chose the right balance. To me it feels like overconfidence about MIRI’s ability to steer the right way without the help of external critique.
Moreover, I’m a little worried that MIRI’s lack of transparency might pose a risk for the entire AI safety project. Tbh, one of my first thoughts when I saw the headline of the OP was “oh no, what if some scandal around MIRI blows up and the shockwave buries the entire community”. And I guess some people might think this is a reason for more secrecy. IMO it’s a reason for less secrecy (not necessarily less secrecy about technical AI stuff, but less secrecy about management and high-level plans). If we don’t have any skeletons in the closet, we don’t need to worry about the day they will come out. And eventually everything comes out, more or less. When most of everything is in the open, the community can find the right balance around it, and the reputation system is much more robust.
Thought 4: “Someone in the community told me that for me to think AGI probably won’t be developed soon, I must think I’m better at meta-rationality than Eliezer Yudkowsky, a massive claim of my own specialness.” I think (hope?) this is not at all a prevalent stance in the community (or at least in its leading echelons), but just for the record I want to note my strong position that the “someone” in this story is very misguided. Like I said, I don’t think the community is currently comparable to Leverage, but this is the sort of thing that can push us in that direction.
I worked for 16 years in the industry, including management positions, including (briefly) having my own startup. I talked to many, many people who worked in many companies, including people who had their own startups and some with successful exits.
The industry is certainly not a rose garden. I encountered people who were selfish, unscrupulous, megalomaniac or just foolish. I’ve seen lies, manipulation, intrigue and plain incompetence. But, I also encountered people who were honest, idealistic, hardworking and talented. I’ve seen teams trying their best to build something actually useful for some corner of the world. And, it’s pretty hard to avoid reality checks when you need to deliver a real product for real customers (although some companies do manage to just get more and more investments without delivering anything until the eventual crash).
I honestly think most of them are not nearly as bad as Leverage.
This is a classic example of how having a prediction market creates really bad incentives.
I met Vassar once. He came across as extremely charismatic (with a sort of charisma that probably only works on a particular type of people, which includes me), creating the impression of saying wise and insightful things (especially if you lack relevant domain knowledge), while in truth he was saying a lot of stuff which was patently absurd. Something about his delivery was so captivating that it took me a while to “shake off the fairy dust” and realize just how silly some of his claims were, even when it should have been obvious from the start. Moreover, his worldview seemed heavily based on paranoid / conspiracy-theory type thinking. So, yes, I’m not too surprised by Scott’s revelations about him.
Good post, although I have some misgivings about how unpleasant it must be to read for some people.
One factor not mentioned here is the history of MIRI. MIRI was a pioneer in the field, and it was MIRI who articulated and promoted the agent foundations research agenda. The broad goals of agent foundations[1] are (IMO) load-bearing for any serious approach to AI alignment. But, when MIRI essentially declared defeat, in the minds of many that meant that any approach in that vein is doomed. Moreover, MIRI’s extreme pessimism deflates motivation and naturally produces the thought “if they are right then we’re doomed anyway, so might as well assume they are wrong”.
Now, I have a lot of respect for Yudkowsky and many of the people who worked at MIRI. Yudkowsky started it all, and MIRI made solid contributions to the field. I’m also indebted to MIRI for supporting me in the past. However, MIRI also suffered from some degree of echo-chamberism, founder-effect-bias, insufficient engagement with prior research (due to hubris), looking for nails instead of looking for hammers, and poor organization[2].
MIRI made important progress in agent foundations, but also missed an opportunity to do much more. And, while the AI game board is grim, their extreme pessimism is unwarranted overconfidence. Our understanding of AI and agency is poor: this is a strong reason to be pessimistic, but it’s also a reason to maintain some uncertainty about everything (including e.g. timelines).
Now, about what to do next. I agree that we need to have our own non-streetlighting community. In my book, “non-streetlighting” means mathematical theory plus empirical research that is theory-oriented: designed to test hypotheses made by theoreticians and produce data that best informs theoretical research (these are ~necessary but insufficient conditions for non-streetlighting). This community can and should engage with the rest of AI safety, but has to be sufficiently undiluted to have healthy memetics and cross-fertilization.
What does a community look like? It looks like our own organizations, conferences, discussion forums, training and recruitment pipelines, academia labs, maybe journals.
From my own experience, I agree that potential contributors should mostly have skills and knowledge on the level of PhD+. Highlighting physics might be a valid point: I have a strong background in physics myself. Physics teaches you a lot about connecting math to real-world problems, and is also in itself a test-ground for formal epistemology. However, I don’t think a background in physics is a necessary condition. At the very least, in my own research programme I have significant room for strong mathematicians that are good at making progress on approximately-concrete problems, even if they won’t contribute much on the more conceptual/philosophic level.
Which is, creating mathematical theory and tools for understanding agents.
I mostly didn’t feel comfortable talking about it in the past, because I was on MIRI’s payroll. This is not MIRI’s fault by any means: they never pressured me to avoid voicing opinions. It still feels unnerving to criticize the people who write your paycheck.