AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda. Based in Israel. See also LinkedIn.
E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org
Full disclosure: I am a MIRI Research Associate. This means that I receive funding from MIRI, but I am not a MIRI employee and I am not privy to its internal operation or secrets.
First of all, I am really sorry you had these horrible experiences.
A few thoughts:
Thought 1: I am not convinced the analogy between Leverage and MIRI/CFAR holds up to scrutiny. I think that Geoff Anders is most likely a bad actor, whereas MIRI/CFAR leadership is probably acting in good faith. There seems to be significantly more evidence of bad faith in Zoe’s account than in Jessica’s account, and the conclusion is reinforced by adding evidence from other accounts. In addition, MIRI definitely produced some valuable public research whereas I doubt the same can be said of Leverage, although I haven’t been following Leverage so I am not confident about the latter (ofc it’s in principle possible for a deeply unhealthy organization to produce some good outputs, and good outputs certainly don’t excuse abuse of personnel, but I do think good outputs provide some evidence against such abuse).
It is important not to commit the fallacy of gray: it would risk both judging MIRI/CFAR too harshly and judging Leverage insufficiently harshly. The comparison Jessica makes to “normal corporations” reinforces this impression: I have much experience in the industry, and although it’s possible I’ve been lucky in some ways, I still very much doubt the typical company is nearly as bad as Leverage.
Thought 2: From my experience, AI alignment is a domain of research that intrinsically comes with mental health hazards. First, the possibility of impending doom and the heavy sense of responsibility are sources of stress. Second, research inquiries often enough lead to “weird” metaphysical questions that risk overturning the (justified or unjustified) assumptions we implicitly hold to maintain a sense of safety in life. I think it might be the closest thing in real life to the Lovecraftian notion of “things that are best not to know because they will drive you mad”. Third, the sort of people drawn to the area and/or having the necessary talents seem to often also come with mental health issues (I am including myself in this group).
This might be regarded as an argument to blame MIRI less for the mental health fallout described by Jessica, but this is also an argument to pay more attention to the problem. It would be best if we could provide the people working in the area with the tools and environment to deal with these risks.
Thought 3: The part that concerned me the most in Jessica’s account (in part due to its novelty to me) is MIRI’s internal secrecy policy. While it might be justifiable to have some secrets to which only some employees are privy, it seems very extreme to require going through an executive because even the mere fact that a secret project exists is too dangerous. MIRI’s secrecy policy seemed questionable to me even before, but this new spin makes it even more dubious.
Overall, I wish MIRI was more transparent, so that for example its supporters would know about this internal policy. I realize there are tradeoffs involved, but I am not convinced MIRI chose the right balance. To me it feels like overconfidence about MIRI’s ability to steer the right way without the help of external critique.
Moreover, I’m a little worried that MIRI’s lack of transparency might pose a risk for the entire AI safety project. Tbh, one of my first thoughts when I saw the headline of the OP was “oh no, what if some scandal around MIRI blows up and the shockwave buries the entire community”. And I guess some people might think this is a reason for more secrecy. IMO it’s a reason for less secrecy (not necessarily less secrecy about technical AI stuff, but less secrecy about management and high-level plans). If we don’t have any skeletons in the closet, we don’t need to worry about the day they will come out. And eventually everything comes out, more or less. When most of everything is in the open, the community can find the right balance around it, and the reputation system is much more robust.
Thought 4: “Someone in the community told me that for me to think AGI probably won’t be developed soon, I must think I’m better at meta-rationality than Eliezer Yudkowsky, a massive claim of my own specialness.” I think (hope?) this is not at all a prevalent stance in the community (or at least in its leading echelons), but just for the record I want to note my strong position that the “someone” in this story is very misguided. Like I said, I don’t think the community is currently comparable to Leverage, but this is the sort of thing that can push us in that direction.
I worked for 16 years in the industry, including management positions, including (briefly) having my own startup. I talked to many, many people who worked in many companies, including people who had their own startups and some with successful exits.
The industry is certainly not a rose garden. I encountered people who were selfish, unscrupulous, megalomaniac or just foolish. I’ve seen lies, manipulation, intrigue and plain incompetence. But, I also encountered people who were honest, idealistic, hardworking and talented. I’ve seen teams trying their best to build something actually useful for some corner of the world. And, it’s pretty hard to avoid reality checks when you need to deliver a real product for real customers (although some companies do manage to just get more and more investments without delivering anything until the eventual crash).
I honestly think most of them are not nearly as bad as Leverage.
This is a classical example where having a prediction market creates really bad incentives.
I met Vassar once. He came across as extremely charismatic (with a sort of charisma that probably only works on a particular type of people, which includes me), creating the impression of saying wise and insightful things (especially if you lack relevant domain knowledge), while in truth he was saying a lot of stuff which was patently absurd. Something about his delivery was so captivating that it took me a while to “shake off the fairy dust” and realize just how silly some of his claims were, even when it should have been obvious from the start. Moreover, his worldview seemed heavily based on paranoid / conspiracy-theory type thinking. So, yes, I’m not too surprised by Scott’s revelations about him.
First, some remarks about the meta-level:
Actually, I don’t feel like I learned that much reading this list, compared to what I already knew. [EDIT: To be clear, this knowledge owes a lot to prior inputs from Yudkowsky and the surrounding intellectual circle, I am making no claim that I would derive it all independently in a world in which Yudkowsky and MIRI didn’t exist.] To be sure, it didn’t feel like a waste of time, and I liked some particular framings (e.g. in A.4 separating the difficulty into “unlimited time but 1 try” and “limited time with retries”), but I think I could write something that would be similar (in terms of content; it would be very likely much worse in terms of writing quality).
One reason I didn’t write such a list is, I don’t have the ability to write things comprehensibly. Empirically, everything of substance that I write is notoriously difficult for readers to understand. Another reason is, at some point I decided to write top-level posts only when I have substantial novel mathematical results, with rare exceptions. This is in part because I feel like the field has too much hand-waving and philosophizing and too little hard math (which rhymes with C.38). In part it is because, even if people can’t understand the informal component of my reasoning, they can at least understand there is math here and, given sufficient background, follow the definitions/theorems/proofs (although tbh few people follow).
Actually, I do have a plan. It doesn’t have an amazing probability of success (my biggest concerns are (i) not enough remaining time and (ii) even if the theory is ready in time, the implementation can be bungled, in particular for reasons of operational adequacy), but it is also not practically useless. The last time I tried to communicate it was 4 years ago, since which time it obviously evolved. Maybe it’s about time to make another attempt, although I’m wary of spending a lot of effort on something which few people will understand.
Now, some technical remarks:
This is true, but it is notable that deep learning is not equivalent to evolution, and the differences are important. Consider for example a system that is designed to separately (i) learn a generative model of the environment and (ii) search for plans effective on this model (model-based RL). Then, module ii doesn’t inherently have the problem where the solution only optimizes the correct thing in the training environment. Because, this module is not bounded by available training data, but only by compute. The question is then, to first approximation, whether module i is able to correctly generalize from the training data (obviously there are theoretical bounds on how good this generalization can be; but we want this generalization to be at least as good as human ability and without dangerous biases). I do not think current systems do such generalization correctly, although they do seem to have some ingredients right, in particular Occam’s razor / simplicity bias. But we can imagine some algorithm that does.
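To make the separation concrete, here is a minimal toy sketch (my own illustration with made-up names, not any existing system): the generative model is fit to training data, while the planner only consumes the model and is limited by compute (rollouts, horizon), not by how much data exists.

```python
# Minimal toy sketch (my own illustration, not any existing system): model-based RL
# where (i) a generative model is fit to training data and (ii) planning is a pure
# search against that model, bounded by compute rather than by training data.
import random

class WorldModel:
    """(i) Learned generative model: estimates P(next_state | state, action)."""
    def __init__(self):
        self.counts = {}

    def fit(self, transitions):
        # transitions: iterable of (state, action, next_state) from the training data
        for s, a, s2 in transitions:
            d = self.counts.setdefault((s, a), {})
            d[s2] = d.get(s2, 0) + 1

    def sample_next(self, s, a):
        dist = self.counts.get((s, a))
        if not dist:
            return s  # unmodeled transition: crude fallback, stay in place
        states, weights = zip(*dist.items())
        return random.choices(states, weights=weights)[0]

def plan(model, start, reward, actions, horizon=5, rollouts=200):
    """(ii) Planner: picks a first action by Monte Carlo rollouts inside the model.
    Its quality is limited by compute (rollouts, horizon), not by training data."""
    def value(a0):
        total = 0.0
        for _ in range(rollouts):
            s = model.sample_next(start, a0)
            r = reward(s)
            for _ in range(horizon - 1):
                s = model.sample_next(s, random.choice(actions))
                r += reward(s)
            total += r
        return total / rollouts
    return max(actions, key=value)
```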
Also true, but there is nuance. The key problem is that we don’t know why deep learning works, or more specifically with respect to which prior it satisfies good generalization bounds. If we knew what this prior is, then we could predict some inner properties. For example, if you know your algorithm follows Occam’s razor, for a reasonable formalization of “Occam’s razor”, and you trained it on the sun setting every day for a million days, then you can predict that the algorithm will not confidently predict the sun is going to fail to set on any given future day. Moreover, our not knowing such generalization bounds for deep learning is a fact about our present state of mathematical ignorance, not a fact about the algorithms themselves.
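A toy numerical illustration of the kind of prediction I mean (my own made-up formalization and numbers, only meant to show the flavor): a Bayesian learner with a simplicity prior compares “the sun always rises” against the family “the sun rises for the first k days, then stops”, where each hypothesis pays prior mass for the bits needed to encode k.

```python
# Toy illustration (my own formalization and numbers): a Bayesian learner with a
# simplicity prior. H_inf = "the sun rises every day"; H_k = "the sun rises for the
# first k days, then stops", which costs extra bits to encode k. After a million
# observed sunrises, the posterior probability of failure tomorrow is tiny.
import math

N = 1_000_000          # observed days, all with a sunrise
HORIZON = 2_000_000    # hypotheses H_k considered up to this k (for normalization)

def log_prior(description_bits):
    return -description_bits * math.log(2)

# H_inf: short description (say 10 bits); likelihood of the data is 1.
log_w_inf = log_prior(10)

# H_k needs ~10 bits plus log2(k) bits to encode k. Only k >= N is consistent with
# the data (likelihood 1); any k < N would have predicted a failure we didn't see.
log_w_fail_tomorrow = log_prior(10 + math.log2(N))   # H_N: sun fails on day N+1

# Normalize over H_inf and all data-consistent H_k.
log_weights = [log_w_inf] + [log_prior(10 + math.log2(k)) for k in range(N, HORIZON)]
total = sum(math.exp(w) for w in log_weights)
print(f"P(sun fails to rise tomorrow) ~ {math.exp(log_w_fail_tomorrow) / total:.1e}")
# prints roughly 6e-07
```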
It is true that (AFAIK) nothing like this was accomplished in practice, but the distance to that might not be too great. For example, I can imagine training an ANN to implement a POMDP which simultaneously successfully predicts the environment and complies with some “ontological hypothesis” about how the environment needs to be structured in order for the-things-we-want-to-point-at to be well-defined (technically, this POMDP needs to be a refinement of some infra-POMDP that represents the ontological hypothesis).
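As a very loose sketch of the kind of refinement check I have in mind (my own drastic simplification, with made-up toy states; nothing like the actual infra-POMDP machinery): a learned fine-grained model counts as a refinement if every learned transition, projected through an abstraction map, lands inside the set of transitions the ontological hypothesis allows.

```python
# Very loose toy sketch (my own drastic simplification, not the actual infra-POMDP
# machinery): check that a learned fine-grained transition model "refines" a coarse
# ontological hypothesis.

# Ontological hypothesis: coarse states and the allowed (set-valued) transitions.
coarse_allowed = {
    "diamond_intact": {"diamond_intact", "diamond_gone"},
    "diamond_gone": {"diamond_gone"},          # the diamond can't reappear
}

# Learned model: fine-grained deterministic transitions (toy stand-in for a trained
# predictor of the environment).
fine_transitions = {
    ("vault_locked", "wait"): "vault_locked",
    ("vault_locked", "burgle"): "vault_empty",
    ("vault_empty", "wait"): "vault_empty",
}

# Abstraction map from fine states to ontological states.
abstraction = {
    "vault_locked": "diamond_intact",
    "vault_empty": "diamond_gone",
}

def is_refinement(fine, coarse, abstract):
    """Check that every learned fine transition is compatible with the ontology."""
    return all(
        abstract[s2] in coarse[abstract[s]]
        for (s, _a), s2 in fine.items()
    )

print(is_refinement(fine_transitions, coarse_allowed, abstraction))  # True
```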
There is a big chunk of what you’re trying to teach which is not weird and complicated, namely: “find this other agent, and what their values are”. Because, “agents” and “values” are natural concepts, for reasons strongly related to “there’s a relatively simple core structure that explains why complicated cognitive machines work”. Admittedly, my rough proposal (PreDCA) does have some “weird and complicated” parts because of the acausal attack problem.
This is inaccurate, because P≠NP. It is possible to imagine an AI that provides us with a plan such that we simultaneously (i) can understand why it works and (ii) wouldn’t have thought of it ourselves without spending a very long time that we don’t have. At the very least, the AI could suggest a way of building a more powerful aligned AI. Of course, in itself this doesn’t save us at all: instead of producing such a helpful plan, the AI can produce a deceitful plan instead. Or a plan that literally makes everyone who reads it go insane in very specific ways. Or the AI could just hack the hardware/software system inside which it’s embedded to produce a result which counts for it as a high reward but which for us wouldn’t look anything like “producing a plan the overseer rates high”. But, this direction might not be completely unsalvageable[1].
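The underlying asymmetry is just the standard NP one: verifying a proposed certificate is cheap even when finding one ourselves could take far longer than we have. A toy illustration (the standard SAT example, nothing AI-specific):

```python
# Toy illustration of the asymmetry (standard NP example, nothing AI-specific):
# checking a proposed solution is cheap even when finding one ourselves could take
# far longer than we have.

def check_sat(clauses, assignment):
    """Verify a SAT certificate in linear time: every clause has a satisfied literal.
    Literals are nonzero ints; a negative literal means the variable is negated."""
    return all(
        any((lit > 0) == assignment[abs(lit)] for lit in clause)
        for clause in clauses
    )

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
clauses = [[1, -2], [2, 3], [-1, -3]]
proposed = {1: True, 2: True, 3: False}    # the "plan" handed to us
print(check_sat(clauses, proposed))        # True: easy to verify, hard in general to find
```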
I agree that the process of inferring human thought from the surface artifacts of human thought requires powerful non-human thought, which is dangerous in itself. But this doesn’t necessarily mean that the idea of imitating human thought doesn’t help at all. We can combine it with techniques such as counterfactual oracles and confidence thresholds to try to make sure the resulting agent is truly only optimizing for accurate imitation (which still leaves problems like attacks from counterfactuals and non-Cartesian daemons, and also not knowing which features of the data are important to imitate might be a big capability handicap).
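A minimal sketch of the confidence-threshold part (hypothetical interface and names, not a concrete proposal): the imitator acts only when its prediction of “what the human would do” clears a threshold, and otherwise defers to the overseer.

```python
# Minimal sketch (hypothetical interface and names, not a concrete proposal): an
# imitation policy that acts only when its prediction of "what the human would do"
# clears a confidence threshold, and otherwise defers to the overseer.
from typing import Callable, Optional, Sequence, Tuple

def imitate_with_threshold(
    predict_human: Callable[[object], Sequence[Tuple[str, float]]],  # hypothetical learned model
    observation: object,
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the predicted human action only if the model is confident enough;
    return None (i.e. defer / query) otherwise."""
    action_probs = predict_human(observation)           # [(action, probability), ...]
    best_action, best_p = max(action_probs, key=lambda ap: ap[1])
    return best_action if best_p >= threshold else None

# Usage with a stub model standing in for the learned imitator:
stub_model = lambda obs: [("open_door", 0.95), ("wait", 0.05)]
print(imitate_with_threshold(stub_model, "door_closed"))  # open_door
```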
That said, I feel that PreDCA is more promising than AQD: it seems to require less fragile assumptions and deals more convincingly with non-Cartesian daemons. [EDIT: AQD also can’t defend from acausal attack if the malign hypothesis has massive advantage in prior probability mass, and it’s quite likely to have that. It does not work to solve this by combining AQD with IBP, at least not naively.]