LessWrong dev & admin as of July 5th, 2022.
RobertM
Highlights: Wentworth, Shah, and Murphy on “Retargeting the Search”
EDIT: I believe I’ve found the “plan” that Politico (and other news sources) managed to fail to link to, maybe because it doesn’t seem to contain any affirmative commitments by the named companies to submit future models to pre-deployment testing by UK AISI.
I’ve seen a lot of takes (on Twitter) recently suggesting that OpenAI and Anthropic (and maybe some other companies) violated commitments they made to the UK’s AISI about granting them access for e.g. pre-deployment testing of frontier models. Is there any concrete evidence about what commitment was made, if any? The only thing I’ve seen so far is a pretty ambiguous statement by Rishi Sunak, who might have had some incentive to claim more success than was warranted at the time. If people are going to breathe down the necks of AGI labs about keeping to their commitments, they should be careful to only do it for commitments they’ve actually made, lest they weaken the relevant incentives. (This is not meant to endorse AGI labs behaving in ways which cause strategic ambiguity about what commitments they’ve made; that is also bad.)
The LessWrong 2022 Review: Review Phase
My thoughts on direct work (and joining LessWrong)
A “weak” AGI may attempt an unlikely-to-succeed takeover
Considerations on Compensation
Dan Luu on Futurist Predictions
I don’t actually see very much of an argument presented for the extremely strong headline claim:
> This post aims to show that, over the next decade, it is quite likely that most democratic Western countries will become fascist dictatorships—this is not a tail risk, but the most likely overall outcome.
You draw an analogy between the “by induction”/“line go up” AI risk argument, and the increase in far-right political representation in Western democracies over the last couple decades. But the “by induction”/“line go up” argument for AI risk is not the reason one should be worried; one should be worried because of specific causal reasons to expect unaligned ASI to cause extremely bad outcomes. There is no corresponding causal model presented for why fascist dictatorship is the default future outcome for most Western democracies.
Like, yes, it is a bit silly to see “line go up” and plug one’s fingers in one’s ears. It certainly can happen here. Donald Trump being elected in 2024 seems like the kind of thing that might do it, though I’d probably be happy to bet at 9:1 against. But if that doesn’t happen, I don’t know why you expect some other Republican candidate to do it, given that none of them seem particularly inclined.
> Meanwhile I find Duncan vaguely fascinating like he is a very weird bug
I don’t know[1] for sure what purpose this analogy is serving in this comment, and without it the comment would have felt much less like it was trying to hijack me into associating Duncan with something viscerally unpleasant.
[1] My guess is that it’s meant to convey something like your internal emotional experience, with regards to Duncan, to readers.
patio11’s “Observations from an EA-adjacent (?) charitable effort”
I think this post neglects one of the most serious risks: that adopting a strategy is a correlated decision across agents, that others will correctly see that happening, and that the downside risk is significantly magnified by those dynamics.
Naive 1st-order utilitarianism gives you the wrong answer here. Do not illegally skip out on paying your income taxes to donate to charity. Spend your cognitive resources getting a better job, or otherwise legally optimizing for more income. Being a software engineer at many tech companies will enable you to donate six figures per year while maintaining a comfortable lifestyle, without restricting your ability to engage with financial infrastructure, own property, or travel.
At a high level, I’m sort of confused by why you’re choosing to respond to the extremely simplified presentation of Eliezer’s arguments that he presented in this podcast.
I do also have some object-level thoughts.
> When capabilities advances do work, they typically integrate well with the current alignment[1] and capabilities paradigms. E.g., I expect that we can apply current alignment techniques such as reinforcement learning from human feedback (RLHF) to evolved architectures.
But not only do current implementations of RLHF not manage to robustly enforce the desired external behavior of models that would be necessary to make versions scaled up to superintelligence safe, we have approximately no idea what sort of internal cognition they generate as a pathway to those behaviors. (I have a further objection to your argument about dimensionality which I’ll address below.)
> However, I think such issues largely fall under “ordinary engineering challenges”, not “we made too many capabilities advances, and now all our alignment techniques are totally useless”. I expect future capabilities advances to follow a similar pattern as past capabilities advances, and not completely break the existing alignment techniques.
But they don’t need to completely break the previous generations’ alignment techniques (assuming those techniques were, in fact, even sufficient in the previous generation) for things to turn out badly. For this to be comforting you need to argue against the disjunctive nature of the “pessimistic” arguments, or else rebut each one individually.
> The manifold of mind designs is thus:
> - Vastly more compact than mind design space itself.
> - More similar to humans than you’d expect.
> - Less differentiated by learning process detail (architecture, optimizer, etc), as compared to data content, since learning processes are much simpler than data.
This can all be true, while still leaving the manifold of “likely” mind designs vastly larger than “basically human”. But even if that turned out to not be the case, I don’t think it matters, since the relevant difference (for the point he’s making) is not the architecture but the values embedded in it.
> It also assumes that the orthogonality thesis should hold in respect to alignment techniques—that such techniques should be equally capable of aligning models to any possible objective.
> This seems clearly false in the case of deep learning, where progress on instilling any particular behavioral tendencies in models roughly follows the amount of available data that demonstrate said behavioral tendency. It’s thus vastly easier to align models to goals where we have many examples of people executing said goals.
The difficulty he’s referring to is not one of implementing a known alignment technique to target a goal with no existing examples of success (generating a molecularly-identical strawberry), but of devising an alignment technique (or several) which will work at all. I think you’re taking for granted premises that Eliezer disagrees with (model value formation being similar to human value formation, and/or RLHF “working” in a meaningful way), and then saying that, assuming those are true, Eliezer’s conclusions don’t follow? Which, I mean, sure, maybe, but… is not an actual argument that attacks the disagreement.
> As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.
As you say later, this doesn’t seem trivial, since our current paradigm for SotA basically doesn’t allow for this by construction. Earlier paradigms which at least in principle[1] allowed for it, like supervised learning, have been abandoned because they don’t scale nearly as well. (This seems like some evidence against your earlier claim that “When capabilities advances do work, they typically integrate well with the current alignment[1] and capabilities paradigms.”)
> As it happens, I do not think that optimizing a network on a given objective function produces goals orientated towards maximizing that objective function. In fact, I think that this almost never happens.
I would be surprised if Eliezer thinks that this is what happens, given that he often uses evolution as an existence proof that this exact thing doesn’t happen by default.
I may come back with more object-level thoughts later. I also think this skips over many other reasons for pessimism which feel like they ought to apply even under your models, e.g. “will the org that gets there even bother doing the thing correctly” (& others laid out in Ray’s recent post on organizational failure modes). But for now, some positives (not remotely comprehensive):
In general, I think object-level engagement with arguments is good, especially when you can attempt to ground it against reality.
Many of the arguments (e.g. the section on evolution) seem like they point to places where it might be possible to verify the correctness of existing analogical reasoning. Even if it’s not obvious how the conclusion changes, helping figure out whether any specific argument is locally valid is still good.
The claim about transformer modularity is new to me and very interesting if true.
[1] Though obviously not in practice, since humans will still make mistakes, will fail to anticipate many possible directions of generalization, etc., etc.
What is the optimal frontier for due diligence?
Recent Database Migration—Report Bugs
Why libertarians are advocating for regulation on AI
You Are Being Underpaid
Vaguely feeling like OpenAI might be moving away from the GPT-N+1 release model, for some combination of “political/frog-boiling” reasons and “scaling actually hitting a wall” reasons. Seems relevant to note, since in the worlds where they hadn’t been drip-feeding people incremental releases of slight improvements over the original GPT-4 capabilities, and instead just dropped GPT-5 (and it was as much of an improvement over 4 as 4 was over 3, or close), that might have prompted people to do an explicit orientation step. As it is, I expect less of that kind of orientation to happen. (Though maybe I’m speaking too soon and they will drop GPT-5 on us at some point, and it’ll still manage to be a step-function improvement over whatever the latest GPT-4* model is at that point.)
Deconfusing “Capabilities vs. Alignment”
I think the reason you’re confused is that the comment did not originally read like that; based on my recollection it was edited to add (most of?) the second paragraph after the fact. It was originally a mostly content-free slur.
So, in general, I’m noticing a pattern where you make claims about things that happened, but it turns out those things didn’t happen, or there’s no evidence that they happened and no reason one would believe they did a priori, and you’re actually just making an inference and presenting it as the state of reality. These seem to universally be inferences which cast others’ motives or actions in a negative light. They seem to be broadly unjustified by the provided evidence and surrounding context, or rely on models of reality (both physical and social) which I think are very likely in conflict with the models held by the people those inferences are about.

Sometimes you draw correspondences between the beliefs and/or behaviors of different people or groups, in what seems like an attempt to justify the belief/behavior of the first, or to frame the second as a hypocrite for complaining about the first (though you don’t usually say why these comparisons are relevant). These correspondences turn out to be only superficial similarities, lacking any of the mechanistic similarities that would make them useful for comparisons, or they actually conceal the fact that the two parties in question have opposite beliefs on a more relevant axis.

For lack of a better framing, it seems like you’re failing to disentangle your social reality from the things that actually happened. I am deeply sympathetic to the issues you experienced, but the way this post (and the previous post) was written makes it extremely difficult to engage with productively, since so many of the relevant claims turn out not to be claims about things that actually occurred, despite their appearance as such.
I went through the post and picked out some examples (stopping only because it was taking too much of my evening, not because I ran out), with these being the most salient:
> 1. Many others have worked to conceal the circumstances of their deaths
As far as I can tell, this is almost entirely unsubstantiated, with the possible exception of Maia, and in that case it would have been Ziz’s circle doing the concealment, not any of the individuals you express specific concerns about.
> 2. My psychotic break in which I imagined myself creating hell was a natural extension of this line of thought.
The way this is written makes it sound like you think that it ought to have been a (relatively) predictable consequence.
> 3. By the law of excluded middle, the only possible alternative hypothesis is that the problems I experienced at MIRI and CFAR were unique or at least unusually severe, significantly worse than companies like Google for employees’ mental well-being.
In theory, the problems you experienced could have come from sources other than your professional environment. That is a heck of a missing middle.
> 4. This view is rapidly becoming mainstream, validated by research performed by MAPS and at Johns Hopkins, and FDA approval for psychedelic psychotherapy is widely anticipated in the field.
This seems to imply that Michael’s view on the subject corresponds in most relevant ways to the views taken by MAPS/etc. I don’t know what Michael’s views on the subject actually are, but on priors I’m extremely skeptical that the correspondence is sufficient to make this a useful comparison (which, as an appeal to authority, is already on moderately shaky grounds).
> 5. including a report from a friend along the lines of “CFAR can’t legally recommend that you try [a specific psychedelic], but...”
Can you clarify what relationship this friend had with CFAR? This could be concerning if they were a CFAR employee at the time. If they were not a CFAR employee, were they quoting someone who was? If neither, I’m not sure why it’s evidence of CFAR’s views on the subject.
> 7. MIRI leaders were already privately encouraging me to adopt a kind of conflict theory in which many AI organizations were trying to destroy the world
This is not supported by your later descriptions of those interactions.
First, at no point do you describe any encouragement to adopt a conflict-theoretic view. I assume this section is the relevant one: “MIRI leaders including Eliezer Yudkowsky and Nate Soares told me that this was overly naive, that DeepMind would not stop dangerous research even if good reasons for this could be given. Therefore (they said) it was reasonable to develop precursors to AGI in-house to compete with organizations such as DeepMind in terms of developing AGI first. So I was being told to consider people at other AI organizations to be intractably wrong, people who it makes more sense to compete with than to treat as participants in a discourse.” This does not describe encouragement to adopt a conflict-theoretic view. It describes encouragement to adopt some specific beliefs (e.g. about DeepMind’s lack of willingness to integrate information about AI safety into their models and then behave appropriately, and possible ways to mitigate the implied risks), but these are object-level claims, not ontological claims.
Second, this does not describe a stated belief that “AI organizations were trying to destroy the world”. Approximately nobody believes that AI researchers at e.g. DeepMind are actively trying to destroy the world. A more accurate representation of the prevailing belief would be something like “they are doing something that may end up destroying the world, which, from their perspective, would be a totally unintentional and unforeseen consequence of their actions”. This distinction is important; I’m not just nitpicking.
> 8. I was given ridiculous statements and assignments including the claim that MIRI already knew about a working AGI design and that it would not be that hard for me to come up with a working AGI design on short notice just by thinking about it, without being given hints.
I would be pretty surprised if this turned out to be an accurate summary of those interactions. In particular, that:
1) MIRI (Nate) believed, as of 2017, that it would be possible to develop AGI given known techniques & technology in 2017, with effectively no new research or breakthroughs required, just implementation, and
2) You were told that you should be able to come up with such a design yourself on short notice without any help or collaborative effort.
Indeed, the anecdote you later relay about your interaction with Nate does not support either of those claims, though it carries its own confusions (why would he encourage you to think about how to develop a working AGI using existing techniques if it would be dangerous to tell you outright? The fact that there’s an extremely obvious contradiction here makes me think that there was a severe miscommunication on at least one side of this conversation).
> 10. His belief that mental states somewhat in the direction of psychosis, such as those had by family members of schizophrenics, are helpful for some forms of intellectual productivity is also shared by Scott Alexander and many academics.
This seems to be (again) drawing a strong correspondence between Michael’s beliefs and actions taken on the basis of those beliefs, and Scott’s beliefs. Scott’s citation of research “showing greater mental modeling and verbal intelligence in relatives of schizophrenics” does not imply that Scott thinks it is a good idea to attempt to induce sub-clinical schizotypal states in people—in fact I would bet a lot of money that Scott thinks doing so is an extremely bad idea, which is a more relevant basis on which to compare his beliefs with Michael’s.
> 11. Scott asserts that Michael Vassar discourages people from seeking mental health treatment. Some mutual friends tried treating me at home for a week as I was losing sleep and becoming increasingly mentally disorganized before deciding to send me to a psychiatric institution, which was a reasonable decision in retrospect.
Were any of those people Michael Vassar? If not, I’m not sure how it’s intended to rebut Scott’s claim (though in general I agree that Scott’s claim about Michael’s discouragement could stand to be substantiated in some way). If so, retracted, but then why is that not specified here, given how clearly it rebuts one of the arguments?
> 13. This is inappropriately enforcing the norms of a minority ideological community as if they were widely accepted professional standards.
This does not read like a charitable interpretation of Scott’s concerns. To wit, if I was friends with someone who I knew was a materialist atheist rationalist, and then one day they came to me and started talking about auras and demons in a way which made it sound like they believed they existed (the way materialist phenomena exist, rather than as “frames”, “analogies”, or “I don’t have a better word for this purely internal mental phenomenon, and this word, despite all the red flags, has connotations that are sufficiently useful that I’m going to use it as a placeholder anyways”), I would update very sharply on them having had a psychotic break (or similar). The relevant reference class for how worrying it is that someone believes something is not what the general public believes, it’s how sharp a departure it is from that person’s previous beliefs (and in what direction—a new, sudden, literal belief in auras and demons is a very common symptom of a certain cluster of mental illnesses!). Note: I am not making any endorsement about the appropriateness of involuntary commitment on the basis of someone suddenly expressing these beliefs. I’m not well-calibrated on the likely distribution of outcomes from doing so.
Moving on from the summary:
> I notice that I have encountered little discussion, public or private, of the conditions of Maia Pasek’s death. To a naive perspective this lack of interest in a dramatic and mysterious death would seem deeply unnatural and extremely surprising, which makes it strong evidence that people are indeed participating in this cover-up.
I do not agree that this is the naive perspective. People, in general, do not enjoy discussing suicide. I have no specific reason to believe that people in the community enjoy this more than average, or expect to get enough value out of it to outweigh the unpleasantness. Unless there is something specifically surprising about a specific suicide that seems relevant to the community more broadly, my default expectation would be that people largely don’t talk about it (the same way they largely don’t talk about most things not relevant to their interests). As far as I can tell, Maia was not a public figure in a way that would, by itself, be sufficient temptation to override people’s generalized dispreference for gossip on the subject.

Until today I had not seen any explanation of the specific chain of events preceding Maia’s suicide; a reasonable prior would have been “mentally ill person commits suicide, possibly related to experimental brain hacking they were attempting to do to themselves (as partially detailed on their own blog)”. This seems like fairly strong supporting evidence. I suppose the relevant community interest here is “consider not doing novel neuropsychological research on yourself, especially if you’re already not in a great place”. I agree with that as a useful default heuristic, but it’s one that seems “too obvious to say out loud”. Where do you think a good place for a PSA is?
In general it’s not clear what kind of cover-up you’re imagining. I have not seen any explicit or implicit discouragement of such discussion, except in the (previously mentioned) banal sense that people don’t like discussing suicide and hardly need additional reasons to avoid it.
> While there is a post about Jay’s death on LessWrong, it contains almost no details about Jay’s mental state leading up to their death, and does not link to Jay’s recent blog post. It seems that people other than Jay are also treating the circumstances of Jay’s death as an infohazard.
Jay, like Maia, does not strike me as a public figure. The post you linked is strongly upvoted (+124 at time of writing). It seems to be written as a tribute, which is not the place where I would link to Jay’s most recent blog post if I were trying to analyze causal factors upstream of Jay’s suicide. Again, my prior is that nothing about the lack of discussion needs an explanation in the form of an active conspiracy to suppress such discussion.
> There is a very disturbing possibility (with some evidence for it) here, that people may be picked off one by one (sometimes in ways they cooperate with, e.g. through suicide), with most everyone being too scared to investigate the circumstances.
Please be specific: what evidence? This is an extremely serious claim. For what it’s worth, I don’t agree that people can be picked off in “ways they cooperate with, e.g. through suicide”, unless you mean that Person A would make a concerted effort to convince Person B that Person B ought to commit suicide. But, to the extent that you believe this, the two examples of suicides you listed seem to be causally downstream of “self-hacking with friends(?) in similarly precarious mental states”, not “someone was engaged in a cover-up (of what?) and decided it would be a good idea to try to get these people to commit suicide (why?) despite the incredible risks and unclear benefits”.
> These considerations were widely regarded within MIRI as an important part of AI strategy. I was explicitly expected to think about AI strategy as part of my job. So it isn’t a stretch to say that thinking about extreme AI torture scenarios was part of my job.
S-risk concerns being potentially relevant factors in research does not imply that thinking about specific details of AI torture scenarios would be part of your job. Did someone at MIRI make a claim that imagining S-risk outcomes in graphic detail was necessary or helpful to doing research that factored in S-risks as part of the landscape? Roll to disbelieve.
> Another part of my job was to imagine myself in the role of someone who is going to be creating the AI that could make everything literally the worst it could possibly be, in order to avoid doing that, and prevent others from doing so.
Similarly, roll to disbelieve that someone else at MIRI suggested that you imagine this without significant missing context which would substantially alter the naïve interpretation of that claim.
Skipping the anecdote with Nate & AGI, as I addressed it above, but following that:
> Nate and others who claimed or implied that they had such information did not use it to win bets or make persuasive arguments against people who disagreed with them, but instead used the shared impression or vibe of superior knowledge to invalidate people who disagreed with them.
Putting aside the question of whether Nate or others actually claimed or implied that they had a working model of how to create an AGI with 2017-technology, if they had made such a claim, I am not sure why you would expect them to try to use that model to win bets or make persuasive arguments. I would in fact expect them to never say anything about it to the outside world, because why on earth would you do that given, uh, the entire enterprise of MIRI?
> But I was systematically discouraged from talking with people who doubted that MIRI was for real or publicly revealing evidence that MIRI was not for real, which made it harder for me to seriously entertain that hypothesis.
Can you concretize this? When I read this sentence, an example interaction I can imagine fitting this description would be someone at MIRI advising you in a conversation to avoid talking to Person A, because Person A doubted that MIRI was for real (rather than for other reasons, like “people who spend a lot of time interacting with Person A seem to have a curious habit of undergoing psychotic breaks”).
> In retrospect, I was correct that Nate Soares did not know of a workable AGI design.
I am not sure how either of the excerpts supports the claim that “Nate Soares did not know of a workable AGI design” (which, to be clear, I agree with, but for totally unrelated reasons described earlier). Neither of them makes any explicit or implicit claims about knowledge (or lack thereof) of AGI design.
> In a recent post, Eliezer Yudkowsky explicitly says that voicing “AGI timelines” is “not great for one’s mental health”, a new additional consideration for suppressing information about timelines.
This is not what Eliezer says. Quoting directly: “What feelings I do have, I worry may be unwise to voice; AGI timelines, in my own experience, are not great for one’s mental health, and I worry that other people seem to have weaker immune systems than even my own.”
This is making a claim that AI timelines themselves are poor for one’s mental health (presumably, the consideration of AI timelines), not the voicing of them.
> Researchers were told not to talk to each other about research, on the basis that some people were working on secret projects and would have to say so if they were asked what they were working on. Instead, we were to talk to Nate Soares, who would connect people who were working on similar projects. I mentioned this to a friend later who considered it a standard cult abuse tactic, of making sure one’s victims don’t talk to each other.
The reason cults attempt to limit communication between their victims is to prevent the formation of common knowledge of specific abusive behaviors that the cult is engaging in, and similar information-theoretic concerns. Taking for granted the description of MIRI’s policy (and application of it) on internal communication about research, this is not a valid correspondence; they were not asking you to avoid discussing your interactions with MIRI (or individuals within MIRI) with other MIRI employees, which could indeed be worrying if it were a sufficiently general ask (rather than e.g. MIRI asking someone in HR not to discuss confidential details of various employees with other employees, which would technically fit the description above but is obviously not what we’re talking about).
> It should be noted that, as I was nominally Nate’s employee, it is consistent with standard business practices for him to prevent me from talking with people who might distract me from my work; this goes to show the continuity between “cults” and “normal corporations”.
I can’t parse this in a way which makes it seem remotely like “standard business practices”. I disagree that it is a standard business practice to actively discourage employees from talking to people who might distract them from their work, largely because employees do not generally have a problem with being distracted from their work because they are talking to specific people. I have worked at a number of different companies, each very different from the last in terms of size, domain, organizational culture, etc., and there was not a single occasion where I felt the slightest hint that anyone above me on the org ladder thought I ought to not talk to certain people to avoid distractions, nor did I ever feel like that was part of the organization’s expectations of me.
> MIRI researchers were being very generally denied information (e.g. told not to talk to each other) in a way that makes more sense under a “bad motives” hypothesis than a “good motives” hypothesis. Alternative explanations offered were not persuasive.
Really? This seems to totally disregard explanations unrelated to whether someone has “good motives” or “bad motives”, which are not reasons that I would expect MIRI to have at the top of their list justifying whatever their info-sec policy was.
> By contrast, Michael Vassar thinks that it is common in institutions for people to play zero-sum games in a fractal manner, which makes it unlikely that they could coordinate well enough to cause such large harms.
This is a bit of a sidenote but I’m not sure why this claim is interesting w.r.t. AI alignment, since the problem space, almost by definition, does not require coordination with intent to cause large harms, in order to in fact cause large harms. If the claim is that “institutions are sufficiently dysfunctional that they’ll never be able to build an AGI at all”, that seems like a fully-general argument against institutions ever achieving any goals that require any sort of internal and/or external coordination (trivially invalidated by looking out the window).
> made a medication suggestion (for my sleep issues) that turned out to intensify the psychosis in a way that he might have been able to predict had he thought more carefully
I want to note that while this does not provide much in the way of evidence for the claim that Michael Vassar actively seeks to induce psychotic states in people, it is in fact a claim that Michael Vassar was directly, causally upstream of your psychosis worsening, which is worth considering in light of what this entire post seems to be arguing against.