When you only have a couple thousand copies, you probably don’t want to pay for the speedup; e.g., even going an extra 4x faster decreases the number of copies by 8x.
I also think that when you don’t have control over your own hardware, the speedup schemes become harder, since they might require custom network topologies. I’m not sure about that, though.
While I am not close to this situation, I felt moved to write something, mostly to support junior researchers and staff such as TurnTrout, Thomas Kwa, and KurtB who are voicing difficult experiences that may be challenging for them to talk about; and partly because I can provide perspective as someone who has managed many researchers and worked in a variety of research and non-research organizations and so can more authoritatively speak to what behaviors are ‘normal’ and what patterns tend to lead to good or bad outcomes. Caveat that I know very little about any internal details of MIRI, but I am still reasonably confident of what I’m saying based on general patterns and experience in the world.
Based on reading Thomas Kwa’s experience, as well as KurtB’s experience, Nate Soares’ behavior is far outside any norms of acceptable behavior that I’d endorse. Accepting or normalizing this behavior within an organization has a corrosive effect on the morale, epistemics, and spiritual well-being of its members. The morale effects are probably obvious, but regarding epistemics, leadership is significantly less likely to get useful feedback if people are afraid to cross them (psychological safety is an important concept here). Finally, regarding spirit, normalizing this behavior sends a message to people that they aren’t entitled to set boundaries or be respected, which can create far-reaching damage in their other interactions and in their image of themselves. Based on this, I feel very worried for MIRI and think it should probably do a serious re-think of its organizational culture.
Since some commenters brought up academia and the idea that some professors can be negligent or difficult to work with, I will compare Nate’s behavior to professors in CS academia. Looking at what Thomas Kwa described, I can think of some professors who exhibit individual traits in Thomas’ description, but someone who had all of them at once would be an outlier (in a field that is already welcoming to difficult personalities), and I would strongly warn students against working with such a person. KurtB’s experience goes beyond that and seems at least a standard deviation worse. If someone behaved this way, I would try to minimize their influence in any organization I was part of and refuse to collaborate with them. I would also expect even a tenured faculty member to get a serious talking-to about their behavior from colleagues (though maybe some places would be too cowardly to have this conversation), and for HR complaints to stack up.
Nate, the best description I can think of for what’s going on is that you have fairly severe issues with emotional regulation. Your comments indicate that you see this as a basic aspect of your emotional make-up (and maybe intimately tied to your ability to do research), but I have seen this pattern several times before and I am pretty confident this is not the case. In previous cases I’ve seen, the person in question expresses or exhibits an unwillingness to change up until the point that they face clear consequences for their actions, at which point (after a period of expressing outrage) they buckle down and make the changes, which usually changes their own life for the better, including being able to think more clearly. A first step would be going to therapy, which I definitely recommend. I am pretty confident that even for your own sake you should make a serious effort to make changes here. (I hope this doesn’t come across as condescending, as I genuinely believe this is good advice.)
Along these lines, for people around Nate who think that they “have” to accept this behavior because Nate’s work is important, even on those grounds alone setting boundaries on the behavior will lead to better outcomes.
Here is an example of how an organization could set boundaries on this behavior: If Nate yells at a staff member, that staff member no longer does ops work for Nate until he apologizes and expresses a credible commitment to communicate more courteously in the future. (This could be done in principle by making it opt-in to do continued ops work for Nate if this happens, and working hard to create a real affordance for not opting in.)
The important principle here is that Nate internalizes the costs of his decisions (by removing his ability to impose costs on others, and bearing the resulting inconvenience). Here the cost to Nate is also generally lower than the cost that would have been imposed on others (inflating your own bike tire is less costly than having your day ruined by being yelled at), though this isn’t crucial. The important thing is Nate would have skin in the game—if he still doesn’t change, then I believe somewhat more that he’s actually incapable of doing so, but I would guess that this would actually lead to changes. And if MIRI for some reason believes that other people should be willing to bear large costs for small benefits to Nate, they should also hire a dedicated staff to do damage control for him. (Maybe some or all of this is already happening… I am not at MIRI so I don’t know, but it doesn’t sound this way based on the experiences that have been shared.)
In summary: based on my own personal experience across many organizations, Nate’s behavior is not okay and MIRI should set boundaries on it. I do not believe Nate’s claim that this is a fundamental aspect of his emotional make-up, as it matches other patterns in the past that have changed when consequences were imposed, and even if it is a fundamental aspect he should face the natural consequences of his actions. These consequences should center on removing his ability to harm others, or, if this is not feasible, creating institutions at MIRI to reliably clean up after him and maintain psychological safety.
I don’t see it in the header in Mobile (although I do see the updated text now about it being a link post). Maybe it works on desktop but not mobile?
Is it clear these results don’t count? I see nothing in the Metaculus question text that rules it out.
Mods, could you have these posts link back to my blog Bounded Regret in some form? Right now there is no indication that this is cross-posted from my blog, and no link back to the original source.
Dan spent his entire PhD working on AI safety and did some of the most influential work on OOD robustness and OOD detection, as well as writing Unsolved Problems. Even if this work is less valued by some readers on LessWrong (imo mistakenly), it seems pretty inaccurate to say that he didn’t work on safety before founding CAIS.
Melanie Mitchell and Meg Mitchell are different people. Melanie was the participant in this debate, but you seem to be ascribing Meg’s opinions to her, including linking to video interviews with Meg in your comments.
I’m leaving it to the moderators to keep the copies mirrored, or just accept that errors won’t be corrected on this copy. Hopefully there’s some automatic way to do that?
Oops, thanks, updated to fix this.
Thanks! I removed the link.
Glad it was helpful!
Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is because it’s one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that more clear before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon’s moderation norms, rather than your work, but I realize in retrospect it probably felt directed at you).
I think the main important point is that there is a body of related work in the ML literature that explores fairly similar ideas, and LessWrong readers who care about AI alignment should be aware of this work, and that most LessWrong readers who read the post won’t realize this. I think it’s good to point out Dan’s initial mistake, but I took his substantive point to be what I just summarized, and it seems correct to me and hasn’t been addressed. (I also think Dan overfocused on Ludwig’s paper, see below for more of my take on related work.)
Here is how I currently see the paper situated in broader work (I think you do discuss the majority but not all of this):
* There is a lot of work studying activation vectors in computer vision models, and the methods here seem broadly similar to the methods there. This seems like the closest point of comparison.
* In language, there’s a bunch of work on controllable generation (https://arxiv.org/pdf/2201.05337.pdf) where I would be surprised if no one looked at modifying activations (at least I’d expect someone to try soft prompt tuning), but I don’t know for sure.
* On modifying activations in language models there is a bunch of stuff on patching / swapping, and on modifying stuff in the directions of probes.
I think we would probably both agree that this is the main set of related papers, and also both agree that you cited work within each of these branches (except maybe the second one). Where we differ is that I see all of this as basically variations on the same idea of modifying the activations or weights to control a model’s runtime behavior:
* You need to find a direction, which you can do either by learning a direction or by simple averaging. Simple averaging is more or less the same as one step of gradient descent, so I see these as conceptually similar.
* You can modify the activations or weights. Usually if an idea works in one case it works in the other case, so I also see these as similar.
* The modality can be language or vision. Most prior work has been on vision models, but some of that has also been on vision-language models, e.g. I’m pretty sure there’s a paper on averaging together CLIP activations to get controllable generation.
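To make the “find a direction by simple averaging, then add it to activations” idea concrete, here is a minimal sketch in plain NumPy. The function names and the toy random “activations” are hypothetical stand-ins for real model activations; this is an illustration of the averaging approach, not anyone’s actual implementation.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean-difference direction between two sets of activations.

    pos_acts, neg_acts: arrays of shape (n_examples, d_model).
    This is the 'simple averaging' approach; learning a direction
    (e.g. via a linear probe) would find a similar vector, since
    averaging resembles a single step of gradient descent.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def add_steering(acts, direction, coeff=1.0):
    """Add the (scaled) direction to every activation row."""
    return acts + coeff * direction

# Toy demo with random stand-in activations (hypothetical data).
rng = np.random.default_rng(0)
d = 8
pos = rng.normal(size=(16, d)) + 1.0  # pretend these encode some concept
neg = rng.normal(size=(16, d))

v = steering_vector(pos, neg)          # direction via simple averaging
steered = add_steering(neg, v, coeff=2.0)
```

The same skeleton covers the variations listed above: swap the mean difference for a learned probe direction, or apply the offset to weights rather than activations, and the overall recipe is unchanged.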
So I think it’s most accurate to say that you’ve adapted some well-explored ideas to a use case that you are particularly interested in. However, the post uses language like “Activation additions are a new way of interacting with LLMs”, which seems to be claiming that this is entirely new and unexplored, and I think this could mislead readers, as for instance Thomas Kwa’s response seems to suggest.
I also felt like Dan H brought up reasonable questions (e.g. why should we believe that weights vs. activations is a big deal? Why is fine-tuning vs. averaging important? Have you tried testing the difference empirically?) that haven’t been answered that would be good to at least more clearly acknowledge. The fact that he was bringing up points that seemed good to me that were not being directly engaged with was what most bothered me about the exchange above.
This is my best attempt to explain where I’m coming from in about an hour of work (spent e.g. reading through things and trying to articulate intuitions in LW-friendly terms). I don’t think it captures my full intuitions or the full reasons I bounced off the related work section, but hopefully it’s helpful.