Status: writing-while-frustrated. As with the last post, many of Jessica’s claims seem to me to be rooted in truth, but weirdly distorted. (Ever since the end of Jessica’s tenure at MIRI, I have perceived a communication barrier between us that has the weird-distortion nature.)
Meta: I continue to be somewhat hesitant to post stuff like this, on the grounds that it sucks to air dirty laundry about your old employer and then have your old employer drop by and criticize everything you said. I’ve asked Jessica whether she objects to me giving a critical reply, and she said she has no objections, so at least we have that. I remain open to suggestions for better ways to navigate these sorts of situations.
Jessica, I continue to be sad about the tough times you had during the end of your tenure at MIRI, and in the times following. I continue to appreciate your research contributions, and to wish you well.
My own recollections follow. Note that these are limited to cases where Jessica cites me personally, in the interest of time. Note also that I’m not entirely sure I’ve correctly identified the conversations she’s referring to, due to the blurring effects of the perceived distortion, and of time. And it goes almost-without-saying that my own recollections are fallible.
> As a MIRI employee I was coerced into a frame where I was extremely powerful and likely to by-default cause immense damage with this power, and therefore potentially responsible for astronomical amounts of harm. I was discouraged from engaging with people who had criticisms of this frame, and had reason to fear for my life if I published some criticisms of it.
My own frame indeed says that the present is the hinge of history, and that humans alive today have extreme ability to affect the course of the future, and that this is especially true of humans working in the AI space, broadly construed. I wouldn’t personally call this “power”—I don’t think anyone in the space has all that much power in the present. I think a bunch of people in the AI-o-sphere have a lowish likelihood of a large amount of future power, and thus high expected future power, which is kinda like power. From my perspective, MIRI researchers have less of this than researchers from current top AI labs, but do have a decent amount. My own model does not predict that MIRI researchers are likely to cause astronomical harm by default. I do not personally adhere to the Copenhagen interpretation of ethics, and in the event that humanity destroys itself, I would not be assigning extra blame to alignment researchers on the grounds that they were closer to the action. I’m also not personally very interested in the game of pre-assigning blame, favoring object-level alignment research instead.
Insofar as I was influencing Jessica with my own frame, my best guess is that she misunderstood my frame, as evidenced by these differences between the frame she describes feeling coerced into, and my own picture.
I don’t recall ever discouraging Jessica from engaging with people who had criticisms of my frame. I readily admit that she was talking to folks I had little intellectual respect for, and I vaguely remember some of these people coming up in conversation and me noting that I lacked intellectual respect for them. To the best of my recollection, in all such instances, I added caveats of the form “but, just because I wouldn’t doesn’t mean you shouldn’t”. I readily admit that my openness about my lack of intellectual respect may have been taken as discouragement, especially given my position as her employer. The aforementioned caveats were intended to counteract such forces, at least insofar as I recall.
I was aware at the time that Jessica and I didn’t see eye-to-eye on various issues. I remember at least two occasions where I attempted to explicitly convey that I knew we didn’t see eye-to-eye, that it was OK for her to have views that didn’t match mine, and that I encouraged her to think independently and develop her own views.
Jessica said she felt coerced into a frame she found uncomfortable, and I believe her. My notes here are not intended to cast doubt on the honesty of her reports. My intent in saying all this is merely to express that (1) the frame she reports feeling coerced into, is not one that I recognize, nevermind one that I intentionally coerced her into; and (2) I was aware of the pressures and actively tried to counteract them. Clearly, I failed at this. (And I have a decent chunk of probability mass that Jessica would clarify that she’s not accusing me of intentional coercion.) From my own perspective, she was misreading my own frame and feeling pressured into it despite significant efforts on my part to ameliorate the pressure. I happily solicit advice for what to do better next time, but do not consider my comport to have been a mistake.
> talked about hellscapes
I don’t recall ever “talking about hellscapes” per se. I recall mentioning them in passing, rarely. In my recollection, that mainly happened in response to someone else broaching the topic of fates worse than death. (Maybe there were other occasional throwaway references? But I don’t recall them.) My cached reply to others raising the idea of fates worse than death went something like:
“Goal-space is high dimensional, and almost all directions of optimization seem likely to be comparably bad to death from our perspective. To get something that is even vaguely recognizable to human values you have to be hitting a very narrow target in this high-dimensional space. Now, most of that target is plausibly dystopias as opposed to eutopias, because once you’re in the neighborhood, there are a lot of nearby things that are bad rather than good, and value is fragile. As such, it’s reasonable in principle to worry about civilization getting good enough at aiming AIs that they can hit the target but not the bullseye, and so you might worry that that civilization is more likely to create a hellscape than a eutopia. I personally don’t worry about this myself, because it seems to me that the space is so freaking high dimensional and the target so freaking small, that I find it implausible that a civilization successfully able to point an AI in a human-relevant direction, isn’t also able to hit the bullseye. Like, if you’re already hitting a quarter with an arrowhead on the backside of the moon, I expect you can also hit a dime.”
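(As a toy numerical illustration of the “narrow target in a high-dimensional space” point, here is a small sketch with made-up parameters; it just estimates how rarely a random direction lands anywhere near a fixed target direction as the dimension grows. The threshold and sample counts are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_near_target(dim, cos_threshold=0.9, samples=200_000):
    # Estimate the fraction of uniformly random unit vectors in R^dim whose
    # cosine similarity with a fixed target direction exceeds cos_threshold.
    v = rng.standard_normal((samples, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    target = np.zeros(dim)
    target[0] = 1.0  # any fixed unit vector works, by symmetry
    return np.mean(v @ target > cos_threshold)

for dim in (2, 3, 10, 30, 100):
    print(dim, fraction_near_target(dim))
# The fraction collapses toward zero as the dimension grows: almost all
# directions miss even a generously sized target.
```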
Two reasons I’d defend mentioning hellscapes in such situations: firstly, to demonstrate that I at least plausibly understood the concern my conversation partner had raised (as a matter of course before making a counterpoint), and secondly, so as not to undermine our friends working on S-risk reduction (a program I support).
My reason for not hesitating to use terms like “hellscapes”, rather than more banal and less evocative terms, was (to the best of my current recollection) a desire to speak freely and frankly, at least behind closed doors (eg, in the privacy of the MIRI research space). At the time, there was a bunch of social pressure around to avoid thought experiments that end with the AI escaping and eating the galaxy, and to instead use thought experiments about AIs that are trying to vacuum the house and accidentally break a lamp or whatever, and this used to rub me the wrong way. The motivations as I previously understood them were that, if you talk about star-eating rather than lamp-breaking, then none of the old-guard AI researchers are ever going to take your field seriously. I thought (and still think) this is basically a bad reason. However, I have since learned a new reason, which is that mentioning large-scale disasters freely and frankly might trigger psychotic episodes in people predisposed to them. I find this a much more compelling reason to elide the high-valence examples.
(Also, the more banal term “S-risk” hadn’t propagated yet, IIRC.)
Regardless, I have never thought in detail about fates worse than death, never mind discussed fates worse than death in any depth. I have no reason to, and I recommend against it. Me occasionally mentioning something in passing, and Jessica glossing it as “Nate talked about it” (with an implication of depth and regularity), is a fine example of the “weird distortion” I perceive in Jessica’s accounts.
> I was told, by Nate Soares, that the pieces to make AGI are likely already out there and someone just has to put them together.
I contest this. According to my best recollection of the conversation that I think Jessica is referring to, she was arguing that AGI will not arrive in our own lifetimes, and seemed unresponsive to my attempts to argue that a confident claim of long timelines requires positive knowledge, at which point I exasperatedly remarked that for all we knew, the allegedly missing AGI insights had already been not only had, but published in the literature, and all that remains is someone figuring out how to assemble them. (cf no one knows what science doesn’t know.) I do not assign particularly high credence to this claim myself, and (IIRC) I was using it rhetorically to test for acknowledgement of the idea that confident long timelines require positive knowledge that we seem to lack.
(This seems like another central example of my throwaway lines becoming weirdly distorted and heavily highlighted in Jessica’s recounting.)
> He did not tell me anything about how to make such an AGI, on the basis that this would be dangerous.
Here Jessica seems to be implying that, not only did I positively claim that the pieces of AGI were already out there in the literature, but also that I had personally identified them? I deny that, and I’m not sure what claim I made that Jessica misunderstood in that way. Given the surrounding context where Jessica made this claim, my guess is that it was in the same conversation as the exasperated remark described above, and that the conversation past that point became so desynched that Jessica’s recounting is no longer recognizable to me.
To be clear, I have claimed that AI alignment work is sometimes intertwined with AI capabilities work, and I have claimed that capabilities insights shouldn’t be publicized (as a strong default) on account of the negative externalities. Perhaps I said something along those lines that got distorted into Jessica’s claim?
> Instead, he encouraged me to figure it out for myself, saying it was within my abilities to do so.
I occasionally recommend that our researchers periodically (every 6mo or so) open a text file and see if they can write pseudocode for an FAI (ignoring computational constraints, at least for now), to help focus their attention on exactly where they’re confused and ground out their alignment research in things that are literally blocking them from actually writing a flippin’ FAI. I don’t recall ever telling Jessica that I thought she could figure out how to build an AGI herself. I do recall telling her I expected she could benefit from the exercise of attempting to write the pseudocode for an FAI.
If memory serves, this is an exercise I’d been advocating for a couple years before the time period that Jessica’s discussing (and IIRC, I’ve seen Jessica advocate it, or more general variants like “what could you do with a hypercomputer on a thumb drive”, as an exercise to potential hires). One guess as to what’s going on is that I tried to advocate the exercise of pseudocoding an FAI as I had many times before, but used some shorthand for it that I thought would be transparent, in some new conversational context (eg, shortly after MIRI switched to non-disclosure by default), and while Jessica was in some new mental state, and Jessica misunderstood me as advocating figuring out how to build an AGI all on her own while insinuating that I thought she could?
> [From the comments, in answer to the query “How did you conclude from Nate Soares saying that the tools to create AGI likely already exist that he wanted people to believe he knew how to construct one?”] Because he asked me to figure it out in a way that implied he already had a solution; the assignment wouldn’t make sense if it were to locate a non-workable AGI design (as many AI researchers have done throughout the history of the field); that wouldn’t at all prove that the pieces to make AGI are already out there. Also, there wouldn’t be much reason to think that his sharing a non-workable AGI design with me would be dangerous.
In light of this, my guess is that Jessica flatly misread my implications here.
To be very explicit: Jessica, I never believed you capable of creating a workable AGI design (using, say, your 2017 mind, unaugmented, in any reasonable amount of time). I also don’t assign particularly high credence to the claim that the insights are already out in the literature waiting to be found (or that they were in 2017). Furthermore, I never intentionally implied that I have myself succeeded at the “pseudocode an FAI” exercise so hard as to have an AGI design. Sorry for the miscommunication.
> Researchers were told not to talk to each other about research, on the basis that some people were working on secret projects and would have to say so if they were asked what they were working on.
This suggests a picture of MIRI’s nondisclosure-by-default policies that’s much more top-down than reality, similar to a correction I made on a post by Evan Hubinger a few years ago.
The sequence of events as I recall them was: Various researchers wanted to do some closed research. There was much discussion about how much information was private: Research results? Yes, if the project lead wants privacy. Research directions? Yes, if the project lead wants privacy. What about the participant list for each project? Can each project determine their own secrecy bounds individually, or is revealing who’s working with you defecting against (possibly-hypothetical) projects that don’t want to disclose who they’re working with? etc. etc. I recall at least one convo with a bunch of researchers where, in efforts to get everyone to stop circling privacy questions like moths to a flame and get back to the object level research, I said something to the effect of “come to me if you’re having trouble”.
I separately recall Jessica coming to me afterwards and asking a bunch more questions about who she can ask about what. I recall trying to convey something like “just work on what you want to work on, with whatever privacy level you want, and if someone working on something closed wants you working with them they’ll let you know (perhaps through me, if they want to), and you can bang out details with them as need be”.
The fact that people shouldn’t have to reveal whether they are in fact working on closed research if they don’t want to sounds like the sort of thing that came up in one or both of those conversations, and my guess is that that’s what Jessica’s referring to here. From my perspective, that wasn’t a particularly central point, and the point I recall attempting to drive home was more like “let’s just work on the object-level research and not get all wound up around privacy (especially because all that we’ve changed are the defaults, and you’re still completely welcome to publicize your own research, with my full support, as much as you’d like)”.
> Nate Soares also wrote a post discouraging people from talking about the ways they believe others to be acting in bad faith.
According to me, I was not trying to say “you shouldn’t talk about ways you believe others to be acting in bad faith”. I was trying to say “I think y’all are usually mistaken when you’re accusing certain types of other people of acting in bad faith”, plus “accusing people of acting in bad faith [in confrontational and adversarial ways, instead of gently clarifying and confirming first] runs a risk of being self-fulfilling, and also burns a commons, and I’m annoyed by the burned commons”. I think those people are wrong and having negative externalities, not that they’re bad for reporting what they believe.
Note that the sorts of talking the post explicitly pushes back against are arguments of the form “person X is gaining [status|power|prestige] through their actions, therefore they are untrustworthy and have bad intentions”, which I believe to be invalid. Had I predicted Jessica’s particular misread in advance, I would have explicitly noted that I’m completely ok with arguments of the form “given observations X and Y, I have non-trivial probability on the hypothesis that you’re acting in bad faith, which I know is a serious allegation. Are you acting in bad faith? If not, how do you explain observations X and Y?”.
In other words, the thing I object to is not the flat statement of credence on the hypothesis “thus-and-such is acting in bad faith”, it’s the part where the author socially rallies people to war on flimsy pretenses.
In other other words, I both believe that human wonkiness makes many people particularly bad at calculating P(bad-faith|the-evidence) in particular, and recommend being extra charitable and cautious when it feels like that probability is spiking. Separately but relatedly, in the moment that you move from stating your own credence that someone is acting in bad faith, to socially accusing someone of acting in bad faith, my preferred norms require a high degree of explicit justification.
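(To make that concrete with a toy calculation, using entirely made-up numbers: even evidence that is several times more likely under the bad-faith hypothesis moves a small prior to a posterior that is nowhere near certainty.)

```python
def posterior_bad_faith(prior, likelihood_ratio):
    # Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio,
    # where likelihood_ratio = P(evidence | bad faith) / P(evidence | good faith).
    odds = (prior / (1 - prior)) * likelihood_ratio
    return odds / (1 + odds)

# Illustrative numbers only.
print(posterior_bad_faith(0.02, 2))   # ~0.04: a 2% prior plus weak evidence
print(posterior_bad_faith(0.02, 20))  # ~0.29: even strong-seeming evidence falls short of certainty
```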
And, to be explicit: I think that most of the people who are acting in bad faith will either say “yes I’m acting in bad faith” when you ask them, or will sneer at you or laugh at you or make fun of you instead, which is just as good. I think a handful of other people are harmful to have around regardless of their intentions, and my guess is that most community decisions about harmful people should revolve around harm rather than intent.
(Indeed, I suspect the community needs to lower, rather than raise, the costs of shunning someone who’s doing a lot of harm. But I reiterate that I think such decisions should center on the harm, not the intent, and as such I continue to support the norm that combative/adversarial accusations of ill intent require a high degree of justification.)
> Nate Soares expressed discontent that Michael Vassar was talking with “his” employees, distracting them from work.
I don’t actually know what conversation this is referring to. I recall a separate instance, not involving Jessica, of a non-researcher spending lots of time in the office hanging out and talking with one of our researchers, and me pulling the researcher aside and asking whether they reflectively endorsed having those conversations or whether they kept getting dragged into them and then found themselves unable to politely leave. (In that case, the researcher said they reflectively endorsed them, and thereafter I left them alone.)
There might have been a time when Michael Arc (né Vassar) was spending a lot of time talking to Jessica and one other employee, and I said something about how I don’t have much intellectual respect for Michael? I doubt I said this unsolicited, but I definitely would have said it if anyone asked, and I at least vaguely remember something like that happening once or twice. It’s also possible that towards the end of Jessica’s tenure we were trying to have scheduled meetings to see if we could bridge the communications barrier, and it came up naturally in the conversation? But I’m not sure, as (unlike most of the other claims) I don’t concretely recognize this reference.
> It should be noted that, as I was nominally Nate’s employee, it is consistent with standard business practices for him to prevent me from talking with people who might distract me from my work during office hours.
I’m confident I did not prevent anyone from talking to anyone. I occasionally pulled people aside and asked them if they felt trapped in a given conversation when someone was loitering in the office having lots of conversations, so that I could rescue them if need be. I occasionally answered honestly, when asked, what I thought about people’s conversation partners. I leave almost all researchers to their own devices (conversational or otherwise) almost all of the time.
In Jessica’s particular case, she was having a lot of difficulty at the workplace, and so I stepped deeper into the management role than I usually do and we spent more time together seeing whether we could iron out our difficulties or whether we would need to part ways. It’s quite plausible that, during one of those conversations, I noted of my own accord that she was spending lots of her office-time deep in conversation with Michael, and that I didn’t personally expect this to help Jessica get back to producing alignment research that passed my research quality bar. But I am confident that, insofar as I did express my concerns, it was limited to an expression of skepticism. I… might have asked Michael to stop frequenting the offices quite so much? But I doubt it, and I have no recollection of such a thing.
I am confident I didn’t ever tell anyone not to talk to someone else; that feels way out-of-line to me. I may well have said things along the lines of “I predict that that conversation will prove fruitless”, which Jessica interpreted as a guess-culture style command? I tend to guard against that interpretation by adding hedges of the form “but I’m not you” or whatever, but perhaps I neglected to, or perhaps it fell on deaf ears?
Or perhaps Jessica’s just saying something along the lines of “I feared that if I kept talking to Michael all day, I’d be fired, and Nate expressing that he didn’t expect those conversations to be productive was tantamount to him saying that if I continued he’d fire me, which was tantamount to him saying that I can’t talk to Michael”? In which case, my prediction is indeed that if she hadn’t left MIRI of her own accord, and her research performance didn’t rebound, at some point I would have fired her on the grounds of poor performance. And in worlds where Jessica kept talking to Michael all the time, I would have guessed that a rebound was somewhat less likely, because I didn’t expect him to provide useful meta-level or object-level insights that lead to downstream alignment progress. But I’m an empiricist, and I would have happily tested my “talking to Michael doesn’t result in Nate-legible research output” hypothesis, after noting my skepticism in advance.
(Also, for the record, “Nate-legible research output” does not mean “research that is useful according to Nate’s own models”. Plenty of MIRI researchers disagree with me and my frames about all sorts of stuff, and I’m happy to have them at MIRI regardless, given that they’ve demonstrated the ability to seriously engage with the problem. I’m looking for something more like a cohesive vision that the researcher themself believes in, not research that necessarily strikes me personally as directly useful.)
> MIRI certainly had a substantially conflict-theoretic view of the broad situation, even if not the local situation. I brought up the possibility of convincing DeepMind people to care about AI alignment. MIRI leaders including Eliezer Yudkowsky and Nate Soares told me that this was overly naive, that DeepMind would not stop dangerous research even if good reasons for this could be given.
I contest this. I endorse talking with leadership at leading AI labs, and have done so in the past, and expect to continue doing so in the future.
It’s true that I don’t expect any of the leading labs to slow down or stop soon enough, and it’s true that I think converging beliefs takes a huge time investment. On the mainline I predict that the required investment won’t in fact be paid in the relevant cases. But, as I told Jessica at the time (IIRC), I expect folks at leading AGI labs to be much more sensitive to solutions to the alignment problem, despite the fact that I don’t think you can talk them into giving up public capabilities research in practice. (This might be what she misunderstood as me saying we’d have better luck “competing”? I don’t recall saying any such thing, but I do recall saying that we’d have better luck solving alignment first and persuading second.)
(And for the record, while I think these big labs are making a mistake, it’s a very easy mistake to make: knowing that you’re in the bad Nash equilibrium doesn’t let you teleport to a better one, and it’s at least understandable that each individual capabilities lab thinks that they’re better than the next guy, or that they can’t get the actual top researchers if they implement privacy protocols right out the gate. It’s an important mistake, but not a weird one that requires positing unusual levels of bad faith.)
In case it’s not clear from the above, I don’t have much sympathy for conflict theory in this, and I definitely don’t think in broadly us-vs-them terms about the AGI landscape. And (as I think I said at the time) I endorse learning how to rapidly converge with people. I recommend figuring out how to more rapidly converge with friends before burning the commons of time-spent-converging-with-busy-people-who-have-limited-attention-for-you, but I still endorse figuring it out. I don’t expect it to work, and I think solving the dang alignment problem on the object-level is probably a better way to convince people to do things differently, but also I will cheer on the sidelines as people try to figure out how to get better and faster at converging their beliefs.
There’s no law saying that, when someone’s making a mistake, there’s some way to explain it to them such that suddenly it’s fixed. I think existing capabilities orgs are making mistakes (at the very least, in publishing capabilities advances (though credit where credit is due, various labs are doing better at keeping their cutting-edge results private, at least until somebody else replicates or nearly-replicates them, than they used to be (though to be clear I think we have a long way to go before I stop saying that I believe I see a big mistake))), and deny the implicit inference from “you can’t quickly convince someone with words that they’re making a mistake” to “you must be using conflict theory”.
> I was concerned about the linked misleading statement in 2017 and told Nate Soares and others about it, although Nate Soares insisted that it was not a lie, because technically the word “excited” could indicate the magnitude of a feeling rather than the positiveness of it.
That doesn’t seem to me like a good characterization of my views.
My recollection is that, in my conversation about this topic with Jessica, I was trying to convey something more like “Yeah, I’m pretty worried that they’re going to screw lots of things up. And the overt plan to give AGI to everyone is dumb. But also there are a bunch of sane people trying to redirect OpenAI in a saner direction, and I don’t want to immediately sic our entire community on OpenAI and thereby ruin their chances. This whole thing looks real high-variance, and at the very least this is “exciting” in the sense that watching an adventure movie is exciting, even in the parts where the plot is probably about to take a downturn. That said, there’s definitely a sense in which I’m saying things with more positive connotations than I actually feel—like, I do feel some real positive hope here, but I’m writing lopsidedly from those hopes. This is because the blog post is an official MIRI statement about a new AI org on the block, and my sense of politics says that if a new org appears on the block and you think they’re doing some things wrong, then the politic thing to do initially is talk about their good attributes out loud, while trying to help redirect them in private.”
For the record, I think I was not completely crazy to have some hope about OpenAI at the time. As things played out, they wound up pretty friendly to folks from our community, and their new charter is much saner than their original plan. That doesn’t undo the damage of adding a new capabilities shop at that particular moment in that particular way; but there were people trying behind the scenes, that did in real life manage to do something, and so having some advance hope before they played their hands out was a plausible mistake to make, before seeing the actual underwhelming history unfold.
All that said, I do now consider this a mistake, both in terms of my “don’t rock the boat” communications strategy and in terms of how well I thought things might realistically go at OpenAI if things went well there. I have since updated, and appreciate Jessica for being early in pointing out that mistake. I specifically think I was mistaken in making public MIRI blog posts with anything less than full candor.
> While someone bullshitting on the public Internet doesn’t automatically imply they lie to their coworkers in-person, I did not and still don’t know where Nate is drawing the line here.
As I said to Jessica at the time (IIRC), one reason I felt (at the time) that the blog post was fine, is that it was an official MIRI-organization announcement. When speaking as an organization, I was (at the time) significantly more Polite and significantly more Politically Correct and significantly less Dour (and less opinionated and more uncertain, etc).
Furthermore, I (wrongly) expected that my post would not be misleading, because I (wrongly) expected my statements made as MIRI-the-org to be transparently statements made as MIRI-the-org, and for such statements to be transparently Polite and Politically Correct, and thus not very informative one way or another. (In case it wasn’t clear, I now think this was a mistake.)
That said, as I told Jessica at the time (IIRC), you can always just ask me whether I’m speaking as MIRI-the-organization or whether I’m speaking as Nate. Similarly, when I’m speaking as Nate-the-person, you can always just ask me about my honesty protocols.
I have since updated against the idea that I should ever speak as MIRI-the-organization, and towards speaking uniformly with full candor as Nate-the-person. I’m not sure I’ll follow this perfectly (I’d at least slip back into politic-speak if I found myself cornered by a journalist), but again, you can always just ask.
Thanks for reading closely enough to have detailed responses and trying to correct the record according to your memory. I appreciate that you’re explicitly not trying to disincentivize saying negative things about one’s former employer (a family member of mine was worried about my writing this post on the basis that it would “burn bridges”).
A couple general points:
These events happened years ago and no one’s memory is perfect (although our culture has propaganda saying memories are less reliable than they in fact are). E.g., I misstated a fact about Maia’s death, that Maia had been on Ziz’s boat, based on filling in that detail from the other details and impressions I had.
I can’t know what someone “really means”, I can know what they say and what the most reasonable apparent interpretations are. I could have asked more clarifying questions at the time, but that felt expensive due to the stressful dynamics the post describes.
In terms of more specific points:
> (And I have a decent chunk of probability mass that Jessica would clarify that she’s not accusing me of intentional coercion.) From my own perspective, she was misreading my own frame and feeling pressured into it despite significant efforts on my part to ameliorate the pressure. I happily solicit advice for what to do better next time, but do not consider my comport to have been a mistake.
I’m not accusing you of intentional coercion, I think this sort of problem could result as a side effect of e.g. mental processes trying to play along with coalitions while not adequately modeling effects on others. Some of the reasons I’m saying I was coerced are (a) Anna discouraging researchers from talking with Michael, (b) the remote possibility of assassination, (c) the sort of economic coercion that would be expected on priors at most corporations (even if MIRI is different). I think my threat model was pretty wrong at the time which made me more afraid than I actually had to be (due to conservatism); this is in an important sense irrational (and I’ve tried pretty hard to get better at modeling threats realistically since then), although in a way that would be expected to be common in normal college graduates. Given that I was criticizing MIRI’s ideology more than other researchers, my guess is that I was relatively un-coerced by the frame, although it’s in principle possible that I simply disagreed more.
> I don’t recall ever “talking about hellscapes” per se. I recall mentioning them in passing, rarely. In my recollection, that mainly happened in response to someone else broaching the topic of fates worse than death. (Maybe there were other occasional throwaway references? But I don’t recall them.)
I’m not semantically distinguishing “mentioning” from “talking about”. I don’t recall asking about fates worse than death when you mentioned them and drew a corresponding graph (showing ~0 utility for low levels of alignment, negative utility for high but not excellent levels of alignment, and positive utility for excellent levels of alignment).
> According to my best recollection of the conversation that I think Jessica is referring to, she was arguing that AGI will not arrive in our own lifetimes, and seemed unresponsive to my attempts to argue that a confident claim of long timelines requires positive knowledge, at which point I exasperatedly remarked that for all we knew, the allegedly missing AGI insights had already been not only had, but published in the literature, and all that remains is someone figuring out how to assemble them.
Edited to make it clear you weren’t trying to assign high probability to this proposition. What you said seems more reasonable given this, although given you were also talking about AI coming in the next 20 years I hope you can see why I thought this reflected your belief.
> Here Jessica seems to be implying that, not only did I positively claim that the pieces of AGI were already out there in the literature, but also that I had personally identified them? I deny that, and I’m not sure what claim I made that Jessica misunderstood in that way.
Edited to make it clear you didn’t mean this. The reason I drew this as a Gricean implicature is that figuring out how to make an AGI wouldn’t provide evidence that the pieces to make AGI are already out there, unless such an AGI design would work if scaled up / iteratively improved in ways that don’t require advanced theory / etc.
> The sequence of events as I recall them was: Various researchers wanted to do some closed research. There was much discussion about how much information was private: Research results? Yes, if the project lead wants privacy. Research directions? Yes, if the project lead wants privacy. What about the participant list for each project? Can each project determine their own secrecy bounds individually, or is revealing who’s working with you defecting against (possibly-hypothetical) projects that don’t want to disclose who they’re working with? etc. etc. I recall at least one convo with a bunch of researchers where, in efforts to get everyone to stop circling privacy questions like moths to a flame and get back to the object level research, I said something to the effect of “come to me if you’re having trouble”.
Even if the motive came from other researchers, I specifically remember hearing about the policy at a meeting in a top-down fashion. I thought the “don’t ask each other about research” policy was bad enough that I complained about it, and it might have been changed as a result. It seems that not everyone remembers this policy (although Eliezer in a recent conversation didn’t disagree about this being the policy at some point), but I must have been interpreting something this way because I remember contesting it.
> According to me, I was not trying to say “you shouldn’t talk about ways you believe others to be acting in bad faith”. I was trying to say “I think y’all are usually mistaken when you’re accusing certain types of other people of acting in bad faith”, plus “accusing people of acting in bad faith [in confrontational and adversarial ways, instead of gently clarifying and confirming first] runs a risk of being self-fulfilling, and also burns a commons, and I’m annoyed by the burned commons”. I think those people are wrong and having negative externalities, not that they’re bad for reporting what they believe.
I hope you can see why I interpreted the post as making a pragmatic argument, not simply an epistemic argument, against saying others are acting in bad faith:
> When criticism turns to attacking the intentions of others, I perceive that to be burning the commons. Communities often have to deal with actors that in fact have ill intentions, and in that case it’s often worth the damage to prevent an even greater exploitation by malicious actors. But damage is damage in either case, and I suspect that young communities are prone to destroying this particular commons based on false premises.
In the context of 2017, I also had a conversation with Anna Salamon where she said our main disagreement was about whether bad faith should be talked about (which implies our main disagreement wasn’t about how common bad faith was).
> I don’t actually know what conversation this is referring to. I recall a separate instance, not involving Jessica, of a non-researcher spending lots of time in the office hanging out and talking with one of our researchers, and me pulling the researcher aside and asking whether they reflectively endorsed having those conversations or whether they kept getting dragged into them and then found themselves unable to politely leave. (In that case, the researcher said they reflectively endorsed them, and thereafter I left them alone.)
Edited to say you don’t recall this. I didn’t hear this from you; I heard it secondhand, perhaps from Michael Vassar, so I don’t at this point have strong reason to think you said this.
> There’s no law saying that, when someone’s making a mistake, there’s some way to explain it to them such that suddenly it’s fixed. I think existing capabilities orgs are making mistakes (at the very least, in publishing capabilities advances (though credit where credit is due, various labs are doing better at keeping their cutting-edge results private, at least until somebody else replicates or nearly-replicates them, than they used to be (though to be clear I think we have a long way to go before I stop saying that I believe I see a big mistake))), and deny the implicit inference from “you can’t quickly convince someone with words that they’re making a mistake” to “you must be using conflict theory”.
I agree that “speed at which you can convince someone” is relevant in a mistake theory. Edited to make this clear.
> But, as I told Jessica at the time (IIRC), I expect folks at leading AGI labs to be much more sensitive to solutions to the alignment problem, despite the fact that I don’t think you can talk them into giving up public capabilities research in practice. (This might be what she misunderstood as me saying we’d have better luck “competing”? I don’t recall saying any such thing, but I do recall saying that we’d have better luck solving alignment first and persuading second.)
If I recall correctly you were at the time including some AGI capabilities research as part of alignment research (which makes a significant amount of theoretical sense, given that FAI has to pursue convergent instrumental goals). In this case developing an alignment solution before DeepMind develops AGI would be a form of competition. DeepMind people might be more interested in the alignment solution if it comes along with a capabilities boost (I’m not sure whether this consideration was discussed in the specific conversation I’m referring to, but it might have been considered in another conversation, which doesn’t mean it was in any way planned on).
> That said, as I told Jessica at the time (IIRC), you can always just ask me whether I’m speaking as MIRI-the-organization or whether I’m speaking as Nate. Similarly, when I’m speaking as Nate-the-person, you can always just ask me about my honesty protocols.
Ok, this helps me disambiguate your honesty policy. If “employees may say things on the MIRI blog that would be very misleading under the assumption that this blog was not the output of MIRI playing politics and being PC and polite” is consistent with MIRI’s policies, it’s good for that to be generally known. In the case of the OpenAI blog post, the post is polite because it gives a misleadingly positive impression.
> (a) Anna discouraging researchers from talking with Michael
> ...
> ...I specifically remember hearing about the policy at a meeting in a top-down fashion...it seems that not everyone remembers this policy...I must have been interpreting something this way because I remember contesting it.
> ...
> ...I also had a conversation with Anna Salamon where she said our main disagreement was about whether bad faith should be talked about...
Just a note on my own mental state, reading the above:
Given the rather large number of misinterpretations and misrememberings and confusions-of-meaning in this and the previous post, along with Jessica quite badly mischaracterizing what I said twice in a row in a comment thread above, my status on any Jessica-summary (as opposed to directly quoted words) is “that’s probably not what the other person meant, nor what others listening to that person would have interpreted that person to mean.”
By “probably” I literally mean strictly probably, i.e. a greater than 50% chance of misinterpretation, in part because the set of things-Jessica-is-choosing-to-summarize is skewed toward those she found unusually surprising or objectionable.
If I were in Jessica’s shoes, I would by this point be replacing statements like “I had a conversation with Anna Salamon where she said X” with “I had a conversation with Anna Salamon where she said things which I interpreted to mean X” as a matter of general policy, so as not to be misleading-in-expectation to readers.
This is quite a small note, but it’s representative of a lot of things that tripped me up in the OP, and might be relevant to the weird distortion:
> Jessica said she felt coerced into a frame she found uncomfortable
I note that Jessica said she was coerced.
I suspect that Nate-dialect tracks meaningful distinctions between whether one feels coerced, whether one has evidence of coercion, whether one has a model of coercive forces which outputs predictions that closely resemble actual events, whether one expects that a poll of one’s peers would return a majority consensus that [what happened] is well-described by the label [coercion], etc.
By default, I would have assumed that Jessica-dialect tracks such distinctions as well, since such distinctions are fairly common in both the rationalsphere and (even moreso) in places like MIRI.
But it’s possible that Jessica was not, with the phrase “I was coerced,” attempting to convey the strong thing that would be meant in Nate-dialect by those words, and was indeed attempting to convey the thing you (automatically? Reflexively?) seem to have translated it to: “I felt coerced; I had an internal experience matching that of being coerced [which is an assertion we generally have a social agreement to take as indisputable, separate from questions of whether or not those feelings were caused by something more-or-less objectively identifiable as coercion].”
I suspect a lot of what you describe as weird distortion has its roots in tiny distinctions like this made by one party but not by the other/taken for granted by one party but not by the other. That particular example leapt out to me as conspicuous, but I posit many others.
Status: writing-while-frustrated. As with the last post, many of Jessica’s claims seem to me to be rooted in truth, but weirdly distorted. (Ever since the end of Jessica’s tenure at MIRI, I have perceived a communication barrier between us that has the weird-distortion nature.)
Meta: I continue to be somewhat hesitant to post stuff like this, on the grounds that it sucks to air dirty laundry about your old employer and then have your old employer drop by and criticize everything you said. I’ve asked Jessica whether she objects to me giving a critical reply, and she said she has no objections, so at least we have that. I remain open to suggestions for better ways to navigate these sorts of situations.
Jessica, I continue to be sad about the tough times you had during the end of your tenure at MIRI, and in the times following. I continue to appreciate your research contributions, and to wish you well.
My own recollections follow. Note that these are limited to cases where Jessica cites me personally, in the interest of time. Note also that I’m not entirely sure I’ve correctly identified the conversations she’s referring to, due to the blurring effects of the perceived distortion, and of time. And it goes almost-without-saying that my own recollections are fallible.
My own frame indeed says that the present is the hinge of history, and that humans alive today have extreme ability to affect the course of the future, and that this is especially true of humans working in the AI space, broadly construed. I wouldn’t personally call this “power”—I don’t think anyone in the space has all that much power in the present. I think a bunch of people in the AI-o-sphere have a lowish likelihood of a large amount of future power, and thus high future expected power, which is kinda like power. From my perspective, MIRI researchers have less of this than researchers from current top AI labs, but do have a decent amount. My own model does not predict that MIRI researchers are likely to cause astronomical harm by default. I do not personally adhere to the copenhagen interpretation of ethics, and in the event that humanity destroys itself, I would not be assigning extra blame to alignment researchers on the grounds that they were closer to the action. I’m also not personally very interested in the game of pre-assigning blame, favoring object-level alignment research instead.
Insofar as I was influencing Jessica with my own frame, my best guess is that she misunderstood my frame, as evidenced by these differences between the frame she describes feeling coerced into, and my own picture.
I don’t recall ever discouraging Jessica from engaging with people who had criticisms of my frame. I readily admit that she was talking to folks I had little intellectual respect for, and I vaguely remember some of these people coming up in conversation and me noting that I lacked intellectual respect for them. To the best of my recollection, in all such instances, I added caveats of the form “but, just because I wouldn’t doesn’t mean you shouldn’t”. I readily admit that my openness about my lack of intellectual respect may have been taken as discouragement, especially given my position as her employer. The aforementioned caveats were intended to counteract such forces, at least insofar as I recall.
I was aware at the time that Jessica and I didn’t see eye-to-eye on various issues. I remember at least two occasions where I attempted to explicitly convey that I knew we didn’t see eye-to-eye, that it was OK for her to have views that didn’t match mine, and that I encouraged her to think independently and develop her own views.
Jessica said she felt coerced into a frame she found uncomfortable, and I believe her. My notes here are not intended to cast doubt on the honesty of her reports. My intent in saying all this is merely to express that (1) the frame she reports feeling coerced into, is not one that I recognize, nevermind one that I intentionally coerced her into; and (2) I was aware of the pressures and actively tried to counteract them. Clearly, I failed at this. (And I have a decent chunk of probability mass that Jessica would clarify that she’s not accusing me of intentional coercion.) From my own perspective, she was misreading my own frame and feeling pressured into it despite significant efforts on my part to ameliorate the pressure. I happily solicit advice for what to do better next time, but do not consider my comport to have been a mistake.
I don’t recall ever “talking about hellscapes” per se. I recall mentioning them in passing, rarely. In my recollection, that mainly happened in response to someone else broaching the topic of fates worse than death. (Maybe there were other occasional throwaway references? But I don’t recall them.) My cached reply to others raising the idea of fates worse than death went something like:
“Goal-space is high dimensional, and almost all directions of optimization seem likely to be comparably bad to death from our perspective. To get something that is even vaguely recognizable to human values you have to be hitting a very narrow target in this high-dimensional space. Now, most of that target is plausibly dystopias as opposed to eutopias, because once you’re in the neighborhood, there are a lot of nearby things that are bad rather than good, and value is fragile. As such, it’s reasonable in principle to worry about civilization getting good enough at aiming AIs that they can hit the target but not the bullseye, and so you might worry that that civilization is more likely to create a hellscape than a eutopia. I personally don’t worry about this myself, because it seems to me that the space is so freaking high dimensional and the target so freaking small, that I find it implausible that a civilization successfully able to point an AI in a human-relevant direction, isn’t also able to hit the bullseye. Like, if you’re already hitting a quarter with an arrowhead on the backside of the moon, I expect you can also hit a dime.”
Two reasons I’d defend mentioning hellscapes in such situations: firstly, to demonstrate that I at least plausibly understood the concern my conversation partner had raised (as a matter of course before making a counterpoint), and secondly, so as to not undermine our friends working on S-risk reduction (a program I support).
My reason for not hesitating to use terms like “hellscapes” rather than more banal and less evocative terms was (to the best of my current recollection) out of a desire to speak freely and frankly, at least behind closed doors (eg, in the privacy of the MIRI research space). At the time, there was a bunch of social pressure around to stop thought experiments that end with the AI escaping and eating the galaxy, and instead use thought experiments about AIs that are trying to vacuum the house and accidentally break a lamp or whatever, and this used to rub me the wrong way. The motivations as I previously understood them were that, if you talk about star-eating rather than lamp-breaking, then none of the old guard AI researchers are ever going to take your field seriously. I thought (and still think) this is basically a bad reason. However, I have since learned a new reason, which is that mentioning large-scale disasters freely and frankly, might trigger psychotic episodes in people predisposed to them. I find this a much more compelling reason to elide the high-valence examples.
(Also, the more banal term “S-risk” hadn’t propagated yet, IIRC.)
Regardless, I have never thought in detail about fates worse than death, never mind discussed fates worse than death in any depth. I have no reason to, and I recommend against it. Me occasionally mentioning something in passing, and Jessica glossing it as “Nate talked about it” (with an implication of depth and regularity), is a fine example of the “weird distortion” I perceive in Jessica’s accounts.
I contest this. According to my best recollection of the conversation that I think Jessica is referring to, she was arguing that AGI will not arrive in our own lifetimes, and seemed unresponsive to my attempts to argue that a confident claim of long timelines requires positive knowledge, at which point I exasperatedly remarked that for all we knew, the allegedly missing AGI insights had already been not only had, but published in the literature, and all that remains is someone figuring out how to assemble them. (cf no one knows what science doesn’t know.) I do not assign particularly high credence to this claim myself, and (IIRC) I was using it rhetorically to test for acknowledgement of the idea that confident long timelines require positive knowledge that we seem to lack.
(This seems like another central example of my throwaway lines becoming weirdly distorted and heavily highlighted in Jessica’s recounting.)
Here Jessica seems to be implying that, not only did I positively claim that the pieces of AGI were already out there in the literature, but also that I had personally identified them? I deny that, and I’m not sure what claim I made that Jessica misunderstood in that way. Given the surrounding context where Jessica made this claim, my guess is that it was in the same conversation as the exasperated remark described above, and that the conversation past that point became so desynched that Jessica’s recounting is no longer recognizable to me.
To be clear, I have claimed that AI alignment work is sometimes intertwined with AI capabilities work, and I have claimed that capabilities insights shouldn’t be publicized (as a strong default) on account of the negative externalities. Perhaps I said something along those lines that got distorted into Jessica’s claim?
I occasionally recommend that our researchers periodically (every 6mo or so) open a text file and see if they can write pseudocode for an FAI (ignoring computational constraints, at least for now), to help focus their attention on exactly where they’re confused and ground out their alignment research in things that are literally blocking them from actually writing a flippin’ FAI. I don’t recall ever telling Jessica that I thought she could figure out how to build an AGI herself. I do recall telling her I expected she could benefit from the exercise of attempting to write the pseudocode for an FAI.
If memory serves, this is an exercise I’d been advocating for a couple years before the time period that Jessica’s discussing (and IIRC, I’ve seen Jessica advocate it, or more general variants like “what could you do with a hypercomputer on a thumb drive”, as an exercise to potential hires). One guess as to what’s going on is that I tried to advocate the exercise of pseudocoding an FAI as I had many times before, but used some shorthand for it that I thought would be transparent, in some new conversational context (eg, shortly after MIRI switched to non-disclosure by default), and while Jessica was in some new mental state, and Jessica misunderstood me as advocating figuring out how to build an AGI all on her own while insinuating that I thought she could?
In light of this, my guess is that Jessica flatly misread my implications here.
To be very explicit: Jessica, I never believed you capable of creating a workable AGI design (using, say, your 2017 mind, unaugmented, in any reasonable amount of time). I also don’t assign particularly high credence to the claim that the insights are already out in the literature waiting to be found (or that they were in 2017). Furthermore, I never intentionally implied that I have myself succeeded at the “pseudocode an FAI” exercise so hard as to have an AGI design. Sorry for the miscommunication.
This suggests a picture of MIRI’s nondisclosure-by-default policies that’s much more top-down than reality, similar to a correction I made on a post by Evan Hubinger a few years ago.
The sequence of events as I recall them was: Various researchers wanted to do some closed research. There was much discussion about how much information was private: Research results? Yes, if the project lead wants privacy. Research directions? Yes, if the project lead wants privacy. What about the participant list for each project? Can each project determine their own secrecy bounds individually, or is revealing who’s working with you defecting against (possibly-hypothetical) projects that don’t want to disclose who they’re working with? etc. etc. I recall at least one convo with a bunch of researchers where, in efforts to get everyone to stop circling privacy questions like moths to a flame and get back to the object level research, I said something to the effect of “come to me if you’re having trouble”.
I separately recall Jessica coming to me afterwards and asking a bunch more questions about who she can ask about what. I recall trying to convey something like “just work on what you want to work on, with whatever privacy level you want, and if someone working on something closed wants you working with them they’ll let you know (perhaps through me, if they want to), and you can bang out details with them as need be”.
The idea that people shouldn’t have to reveal whether they are in fact working on closed research, if they don’t want to, sounds like the sort of thing that came up in one or both of those conversations, and my guess is that that’s what Jessica’s referring to here. From my perspective, that wasn’t a particularly central point, and the point I recall attempting to drive home was more like “let’s just work on the object-level research and not get all wound up around privacy (especially because all that we’ve changed are the defaults, and you’re still completely welcome to publicize your own research, with my full support, as much as you’d like)”.
According to me, I was not trying to say “you shouldn’t talk about ways you believe others to be acting in bad faith”. I was trying to say “I think y’all are usually mistaken when you’re accusing certain types of other people of acting in bad faith”, plus “accusing people of acting in bad faith [in confrontational and adversarial ways, instead of gently clarifying and confirming first] runs a risk of being self-fulfilling, and also burns a commons, and I’m annoyed by the burned commons”. I think those people are mistaken and are causing negative externalities, not that they’re bad people for reporting what they believe.
Note that the sort of talking the post explicitly pushes back against is arguments of the form “person X is gaining [status|power|prestige] through their actions, therefore they are untrustworthy and have bad intentions”, which I believe to be invalid. Had I predicted Jessica’s particular misread in advance, I would have explicitly noted that I’m completely ok with arguments of the form “given observations X and Y, I have non-trivial probability on the hypothesis that you’re acting in bad faith, which I know is a serious allegation. Are you acting in bad faith? If not, how do you explain observations X and Y?”.
In other words, the thing I object to is not the flat statement of credence on the hypothesis “thus-and-such is acting in bad faith”, it’s the part where the author socially rallies people to war on flimsy pretenses.
In other other words, I both believe that human wonkiness makes many people particularly bad at calculating P(bad-faith|the-evidence) in particular, and recommend being extra charitable and cautious when it feels like that probability is spiking. Separately but relatedly, in the moment that you move from stating your own credence that someone is acting in bad faith, to socially accusing someone of acting in bad faith, my preferred norms require a high degree of explicit justification.
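(As a toy illustration of the sort of calculation I have in mind, with numbers that are entirely made up for the example: suppose your prior that a given interlocutor is acting in bad faith is 5%, and suppose the suspicious-seeming observation E is three times as likely under the bad-faith hypothesis (0.6) as under the good-faith one (0.2). Then Bayes’ rule gives

$$P(\text{bad faith}\mid E) \;=\; \frac{P(E\mid \text{bad})\,P(\text{bad})}{P(E\mid \text{bad})\,P(\text{bad}) + P(E\mid \text{good})\,P(\text{good})} \;=\; \frac{0.6 \times 0.05}{0.6 \times 0.05 + 0.2 \times 0.95} \approx 0.14,$$

i.e., even a moderately suspicious observation leaves the bad-faith hypothesis well under 50% when it starts out unlikely. The specific numbers don’t matter; the point is that the felt spike in suspicion often outruns the actual arithmetic.)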
And, to be explicit: I think that most of the people who are acting in bad faith will either say “yes I’m acting in bad faith” when you ask them, or will sneer at you or laugh at you or make fun of you instead, which is just as good. I think a handful of other people are harmful to have around regardless of their intentions, and my guess is that most community decisions about harmful people should revolve around harm rather than intent.
(Indeed, I suspect the community needs to lower, rather than raise, the costs of shunning someone who’s doing a lot of harm. But I reiterate that I think such decisions should center on the harm, not the intent, and as such I continue to support the norm that combative/adversarial accusations of ill intent require a high degree of justification.)
I don’t actually know what conversation this is referring to. I recall a separate instance, not involving Jessica, of a non-researcher spending lots of time in the office hanging out and talking with one of our researchers, and me pulling the researcher aside and asking whether they reflectively endorsed having those conversations or whether they kept getting dragged into them and then found themselves unable to politely leave. (In that case, the researcher said they reflectively endorsed them, and thereafter I left them alone.)
There might have been a time when Michael Arc (né Vassar) was spending a lot of time talking to Jessica and one other employee, and I said something about how I don’t have much intellectual respect for Michael? I doubt I said this unsolicited, but I definitely would have said it if anyone asked, and I at least vaguely remember something like that happening once or twice. It’s also possible that towards the end of Jessica’s tenure we were trying to have scheduled meetings to see if we could bridge the communications barrier, and it came up naturally in the conversation? But I’m not sure, as (unlike most of the other claims) I don’t concretely recognize this reference.
I’m confident I did not prevent anyone from talking to anyone. I occasionally pulled people aside and asked them if they felt trapped in a given conversation when someone was loitering in the office having lots of conversations, so that I could rescue them if need be. I occasionally answered honestly, when asked, what I thought about people’s conversation partners. I leave almost all researchers to their own devices (conversational or otherwise) almost all of the time.
In Jessica’s particular case, she was having a lot of difficulty at the workplace, and so I stepped deeper into the management role than I usually do and we spent more time together seeing whether we could iron out our difficulties or whether we would need to part ways. It’s quite plausible that, during one of those conversations, I noted of my own accord that she was spending lots of her office-time deep in conversation with Michael, and that I didn’t personally expect this to help Jessica get back to producing alignment research that passed my research quality bar. But I am confident that, insofar as I did express my concerns, it was limited to an expression of skepticism. I… might have asked Michael to stop frequenting the offices quite so much? But I doubt it, and I have no recollection of such a thing.
I am confident I didn’t ever tell anyone not to talk to someone else; that feels way out-of-line to me. I may well have said things along the lines of “I predict that that conversation will prove fruitless”, which Jessica interpreted as a guess-culture-style command? I tend to guard against that interpretation by adding hedges of the form “but I’m not you” or whatever, but perhaps I neglected to, or perhaps it fell on deaf ears?
Or perhaps Jessica’s just saying something along the lines of “I feared that if I kept talking to Michael all day, I’d be fired, and Nate expressing that he didn’t expect those conversations to be productive was tantamount to him saying that if I continued he’d fire me, which was tantamount to him saying that I can’t talk to Michael”? In which case, my prediction is indeed that if she hadn’t left MIRI of her own accord, and her research performance hadn’t rebounded, at some point I would have fired her on the grounds of poor performance. And in worlds where Jessica kept talking to Michael all the time, I would have guessed that a rebound was somewhat less likely, because I didn’t expect him to provide useful meta-level or object-level insights that would lead to downstream alignment progress. But I’m an empiricist, and I would have happily tested my “talking to Michael doesn’t result in Nate-legible research output” hypothesis, after noting my skepticism in advance.
(Also, for the record, “Nate-legible research output” does not mean “research that is useful according to Nate’s own models”. Plenty of MIRI researchers disagree with me and my frames about all sorts of stuff, and I’m happy to have them at MIRI regardless, given that they’ve demonstrated the ability to seriously engage with the problem. I’m looking for something more like a cohesive vision that the researcher themself believes in, not research that necessarily strikes me personally as directly useful.)
I contest this. I endorse talking with leadership at leading AI labs, and have done so in the past, and expect to continue doing so in the future.
It’s true that I don’t expect any of the leading labs to slow down or stop soon enough, and it’s true that I think converging beliefs takes a huge time investment. On the mainline I predict that the required investment won’t in fact be paid in the relevant cases. But, as I told Jessica at the time (IIRC), I expect folks at leading AGI labs to be much more sensitive to solutions to the alignment problem, despite the fact that I don’t think you can talk them into giving up public capabilities research in practice. (This might be what she misunderstood as me saying we’d have better luck “competing”? I don’t recall saying any such thing, but I do recall saying that we’d have better luck solving alignment first and persuading second.)
(And for the record, while I think these big labs are making a mistake, it’s a very easy mistake to make: knowing that you’re in the bad Nash equilibrium doesn’t let you teleport to a better one, and it’s at least understandable that each individual capabilities lab thinks that they’re better than the next guy, or that they can’t get the actual top researchers if they implement privacy protocols right out of the gate. It’s an important mistake, but not a weird one that requires positing unusual levels of bad faith.)
In case it’s not clear from the above, I don’t have much sympathy for conflict theory in this, and I definitely don’t think in broadly us-vs-them terms about the AGI landscape. And (as I think I said at the time) I endorse learning how to rapidly converge with people. I recommend figuring out how to more rapidly converge with friends before burning the commons of time-spent-converging-with-busy-people-who-have-limited-attention-for-you, but I still endorse figuring it out. I don’t expect it to work, and I think solving the dang alignment problem on the object-level is probably a better way to convince people to do things differently, but also I will cheer on the sidelines as people try to figure out how to get better and faster at converging their beliefs.
There’s no law saying that, when someone’s making a mistake, there’s some way to explain it to them such that suddenly it’s fixed. I think existing capabilities orgs are making mistakes, at the very least in publishing capabilities advances. (Though credit where credit is due: various labs are doing better than they used to at keeping their cutting-edge results private, at least until somebody else replicates or nearly-replicates them, though to be clear I think we have a long way to go before I stop saying that I believe I see a big mistake.) And I deny the implicit inference from “you can’t quickly convince someone with words that they’re making a mistake” to “you must be using conflict theory”.
That doesn’t seem to me like a good characterization of my views.
My recollection is that, in my conversation about this topic with Jessica, I was trying to convey something more like “Yeah, I’m pretty worried that they’re going to screw lots of things up. And the overt plan to give AGI to everyone is dumb. But also there are a bunch of sane people trying to redirect OpenAI in a saner direction, and I don’t want to immediately sic our entire community on OpenAI and thereby ruin their chances. This whole thing looks real high-variance, and at the very least this is ‘exciting’ in the sense that watching an adventure movie is exciting, even in the parts where the plot is probably about to take a downturn. That said, there’s definitely a sense in which I’m saying things with more positive connotations than I actually feel; like, I do feel some real positive hope here, but I’m writing lopsidedly from those hopes. This is because the blog post is an official MIRI statement about a new AI org on the block, and my sense of politics says that if a new org appears on the block and you think they’re doing some things wrong, then the politic thing to do initially is talk about their good attributes out loud, while trying to help redirect them in private.”
For the record, I think I was not completely crazy to have some hope about OpenAI at the time. As things played out, they wound up pretty friendly to folks from our community, and their new charter is much saner than their original plan. That doesn’t undo the damage of adding a new capabilities shop at that particular moment in that particular way; but there were people trying behind the scenes who did, in real life, manage to do something, and so having some advance hope before they played their hands out was a plausible mistake to make, before seeing the actual underwhelming history unfold.
All that said, I do now consider this a mistake, both in terms of my “don’t rock the boat” communications strategy and in terms of how well I thought things might realistically go at OpenAI if things went well there. I have since updated, and appreciate Jessica for being early in pointing out that mistake. I specifically think I was mistaken in making public MIRI blog posts with anything less than full candor.
As I said to Jessica at the time (IIRC), one reason I felt (at the time) that the blog post was fine is that it was an official MIRI-organization announcement. When speaking as an organization, I was (at the time) significantly more Polite, significantly more Politically Correct, and significantly less Dour (and less opinionated and more uncertain, etc.).
Furthermore, I (wrongly) expected that my post would not be misleading, because I (wrongly) expected my statements made as MIRI-the-org to be transparently statements made as MIRI-the-org, and for such statements to be transparently Polite and Politically Correct, and thus not very informative one way or another. (In case it wasn’t clear, I now think this was a mistake.)
That said, as I told Jessica at the time (IIRC), you can always just ask me whether I’m speaking as MIRI-the-organization or whether I’m speaking as Nate. Similarly, when I’m speaking as Nate-the-person, you can always just ask me about my honesty protocols.
I have since updated against the idea that I should ever speak as MIRI-the-organization, and towards speaking uniformly with full candor as Nate-the-person. I’m not sure I’ll follow this perfectly (I’d at least slip back into politic-speak if I found myself cornered by a journalist), but again, you can always just ask.
Thanks for reading closely enough to have detailed responses, and for trying to correct the record according to your memory. I appreciate that you’re explicitly not trying to disincentivize saying negative things about one’s former employer (a family member of mine was worried about my writing this post on the basis that it would “burn bridges”).
A couple general points:
These events happened years ago, and no one’s memory is perfect (although our culture has propaganda saying memories are less reliable than they in fact are). E.g., I misstated a fact about Maia’s death (that Maia had been on Ziz’s boat), filling in that detail from the other details and impressions I had.
I can’t know what someone “really means”; I can only know what they say and what the most reasonable apparent interpretations are. I could have asked more clarifying questions at the time, but that felt expensive due to the stressful dynamics the post describes.
In terms of more specific points:
I’m not accusing you of intentional coercion; I think this sort of problem could result as a side effect of, e.g., mental processes trying to play along with coalitions while not adequately modeling effects on others. Some of the reasons I’m saying I was coerced are (a) Anna discouraging researchers from talking with Michael, (b) the remote possibility of assassination, and (c) the sort of economic coercion that would be expected on priors at most corporations (even if MIRI is different). I think my threat model was pretty wrong at the time, which made me more afraid than I actually had to be (due to conservatism); this is in an important sense irrational (and I’ve tried pretty hard to get better at modeling threats realistically since then), although in a way that would be expected to be common in normal college graduates. Given that I was criticizing MIRI’s ideology more than other researchers did, my guess is that I was relatively un-coerced by the frame, although it’s in principle possible that I simply disagreed more.
I’m not semantically distinguishing “mentioning” from “talking about”. I don’t recall asking about fates worse than death when you mentioned them and drew a corresponding graph (showing ~0 utility for low levels of alignment, negative utility for high but not excellent levels of alignment, and positive utility for excellent levels of alignment).
Edited to make it clear you weren’t trying to assign high probability to this proposition. What you said seems more reasonable given this, although, given that you were also talking about AI coming in the next 20 years, I hope you can see why I thought this reflected your belief.
Edited to make it clear you didn’t mean this. The reason I drew this as a Gricean implicature is that figuring out how to make an AGI wouldn’t provide evidence that the pieces to make AGI are already out there, unless such an AGI design would work if scaled up / iteratively improved in ways that don’t require advanced theory / etc.
Even if the motive came from other researchers, I specifically remember hearing about the policy at a meeting in a top-down fashion. I thought the “don’t ask each other about research” policy was bad enough that I complained about it, and it might have been changed. It seems that not everyone remembers this policy (although Eliezer, in a recent conversation, didn’t disagree about this being the policy at some point), but I must have been interpreting something this way, because I remember contesting it.
I hope you can see why I interpreted the post as making a pragmatic argument, not simply an epistemic argument, against saying others are acting in bad faith:
In the context of 2017, I also had a conversation with Anna Salamon where she said our main disagreement was about whether bad faith should be talked about (which implies our main disagreement wasn’t about how common bad faith was).
Edited to say you don’t recall this. I didn’t hear this from you; I heard it secondhand, perhaps from Michael Vassar, so I don’t at this point have strong reason to think you said this.
I agree that “speed at which you can convince someone” is relevant in a mistake theory. Edited to make this clear.
If I recall correctly, you were at the time including some AGI capabilities research as part of alignment research (which makes a significant amount of theoretical sense, given that FAI has to pursue convergent instrumental goals). In that case, developing an alignment solution before DeepMind develops AGI would be a form of competition. DeepMind people might be more interested in the alignment solution if it comes along with a capabilities boost (I’m not sure whether this consideration was discussed in the specific conversation I’m referring to, but it might have been considered in another conversation, which doesn’t mean it was in any way planned on).
Ok, this helps me disambiguate your honesty policy. If “employees may say things on the MIRI blog that would be very misleading under the assumption that this blog was not the output of MIRI playing politics and being PC and polite” is consistent with MIRI’s policies, it’s good for that to be generally known. In the case of the OpenAI blog post, the post is polite because it gives a misleadingly positive impression.
Just a note on my own mental state, reading the above:
Given the rather large number of misinterpretations and misrememberings and confusions-of-meaning in this and the previous post, along with Jessica quite badly mischaracterizing what I said twice in a row in a comment thread above, my status on any Jessica-summary (as opposed to directly quoted words) is “that’s probably not what the other person meant, nor what others listening to that person would have interpreted that person to mean.”
By “probably” I mean it literally, i.e., a greater-than-50% chance of misinterpretation, in part because the set of things-Jessica-is-choosing-to-summarize is skewed toward those she found unusually surprising or objectionable.
If I were in Jessica’s shoes, I would by this point be replacing statements like “I had a conversation with Anna Salamon where she said X” with “I had a conversation with Anna Salamon where she said things which I interpreted to mean X” as a matter of general policy, so as not to be misleading-in-expectation to readers.
This is quite a small note, but it’s representative of a lot of things that tripped me up in the OP, and might be relevant to the weird distortion:
> Jessica said she felt coerced into a frame she found uncomfortable
I note that Jessica said she was coerced.
I suspect that Nate-dialect tracks meaningful distinctions between whether one feels coerced, whether one has evidence of coercion, whether one has a model of coercive forces which outputs predictions that closely resemble actual events, whether one expects that a poll of one’s peers would return a majority consensus that [what happened] is well-described by the label [coercion], etc.
By default, I would have assumed that Jessica-dialect tracks such distinctions as well, since such distinctions are fairly common both in the rationalsphere and (even more so) in places like MIRI.
But it’s possible that Jessica was not, with the phrase “I was coerced,” attempting to convey the strong thing that would be meant in Nate-dialect by those words, and was indeed attempting to convey the thing you (automatically? Reflexively?) seem to have translated it to: “I felt coerced; I had an internal experience matching that of being coerced [which is an assertion we generally have a social agreement to take as indisputable, separate from questions of whether or not those feelings were caused by something more-or-less objectively identifiable as coercion].”
I suspect a lot of what you describe as weird distortion has its roots in tiny distinctions like this, made by one party but not the other, or taken for granted by one party but not the other. That particular example leapt out to me as conspicuous, but I posit many others.