I work on applied mathematics and AI at the Johns Hopkins University Applied Physics Laboratory (APL). I am also currently pursuing a PhD in Information Systems at the University of Maryland, Baltimore County (UMBC). My PhD research focuses on decision and risk analysis under extreme uncertainty, particularly on potential existential risks from very advanced AI.
Aryeh Englander
I see that many people are commenting that it’s crazy to try to keep things secret between coworkers, or to not allow people even to mention certain projects, or that this kind of secrecy is psychologically damaging, or the like.
Now, I imagine this is heavily dependent on exactly how it’s implemented, and I have no idea how it’s implemented at MIRI. But just as a relevant data point—this kind of secrecy is totally par for the course for anybody who works for certain government and especially military-related organizations or contractors. You need extensive background checks to get a security clearance, and even then you can’t mention anything classified to someone else unless they have a valid need to know, you’re in a secure classified area that meets a lot of very detailed guidelines, etc. Even within small groups, there are certain projects that you simply are not allowed to discuss with other group members, since they do not necessarily have a valid need to know. If you’re not sure whether something is classified, you should be talking to someone higher up who does know. There are projects whose existence you cannot even acknowledge, and there are even words that you cannot mention in connection with each other even though each word on its own is totally normal and unclassified. In some places like the CIA or the NSA, you’re usually not even supposed to admit that you work there.
Again, this is probably all very dependent on exactly how the security guidelines are implemented. I am also not commenting at all on whether or not the information that MIRI tries to keep secret should in fact be kept secret. I am just pointing out that if some organization thinks that certain pieces of information really do need to be kept secret, and if they implement secrecy guidelines in the proper way, then as far as I can tell everything that’s been described as MIRI policy seems pretty reasonable to me.
My general impression based on numerous interactions is that many EA orgs are specifically looking to hire and work with other EAs, many longtermist orgs are looking to specifically work with longtermists, and many AI safety orgs are specifically looking to hire people who are passionate about existential risks from AI. I get this to a certain extent, but I strongly suspect that ultimately this may be very counterproductive if we are really truly playing to win.
And it’s not just in terms of who gets hired. Maybe I’m wrong about this, but my impression is that many EA funding orgs are primarily looking to fund other EA orgs. I suspect that a new and inexperienced EA org may have an easier time getting funded to work on a given project than a highly experienced non-EA org would have if it applied for funding to pursue the same idea. (Again, it’s entirely possible I’m wrong about that, and apologies to EA funding orgs if I am mischaracterizing how things work. On the other hand, if I am wrong about this then that is an indication that EA orgs might need to do a better job communicating how their funding decisions are made, because I am virtually positive that this is the impression that many other people have gotten as well.)
One reason this selectivity makes some sense, at least for areas like AI safety, is infohazard concerns: if we involve people who are not focused on the long term, they might use our money to do capability-enhancement research instead of pursuing longtermist goals. Again, I get this to a certain extent, but I think that if we are really playing to win then we can probably use our collective ingenuity to find ways around this.
Right now this focus on only looking for other EAs appears (to me, at least) to be causing an enormous bottleneck for achieving the goals we are ultimately aiming for.
Yes! I actually just discussed this with one of my advisors (an expert on machine learning), and he told me that if he could get funding to do it he would definitely be interested in dedicating a good chunk of his time to researching AGI safety. (For any funders who might read this and might be interested in providing that funding, please reach out to me by email at Aryeh.Englander@jhuapl.edu. I’m going to try to reach out to some potential funders next week.)
I think that there are a lot of researchers who are sympathetic to AI risk concerns, but they either lack the funding to work on it or they don’t know how they might apply their area of expertise to do so. The former can definitely be fixed if there’s an interest from funding organizations. The latter can be fixed in many cases by reaching out and talking to the researcher.
I shared this essay with a colleague where I work (Johns Hopkins University Applied Physics Lab). Here are her comments, which she asked me to share:
This essay proposes a very interesting definition of optimization as the manifestation of a particular behavior of a closed, physical system. I haven’t finished thinking this over, but I suspect it will be (as is suggested in the essay) a useful construct. The reasoning leading to the definition is clearly laid out (thank you!), with examples that are very useful in understanding the concept. The downside of being clearly laid out, however, is that it makes critique easier. I have a few thoughts about the reasoning in the essay.
The first thing I will note is that the essay gives three definitions for an optimizing system. These definitions are close, but not exactly equivalent. The nuances can be important. For example, that the target configuration set and the basin of attraction cannot be equal is obvious; that is made explicit in definition 3, but only implied in definitions 1 and 2. A bigger issue is that no criteria or rationale are given for the extent and relative sizes of these sets.
For example, the essay offers two reasons why the posterchild of non-optimizers—the bottle with a cap—is not an optimizing system; they both arise from the rather arbitrary definition of the basin of attraction as equal to the target configuration set. I see no necessary reason why the basin of attraction couldn’t be defined as the set of all configurations of water molecules both inside and outside the bottle. That way, the definitional requirement of a target configuration set smaller than the basin of attraction is met. The important point is: will water molecules in this new, larger basin of attraction tend to the target configuration set?
Let’s suppose that the capped bottle is in a sealed room (not necessary, but easier to think about), and that the cap is made of a special material that allows water molecules to pass through it in only one direction: from outside the bottle to inside. The water molecules inside the bottle stay inside the bottle, as with any cap. The water molecules inside the room, but outside the bottle, are zooming about (thermal energy), bouncing off the walls, each other, and the bottle. Although it will take some time, sooner or later all the molecules outside the bottle will hit the bottle cap, go through, and be trapped in the bottle. Voila!
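To make that absorbing-state intuition concrete, here is a toy simulation sketch (my own illustration, not something from the essay; the room size, molecule count, and random-walk dynamics are all invented). Every molecule that wanders onto the one-way cap gets trapped, so the system drifts toward the “all molecules inside the bottle” target configuration no matter how the individual molecules are jostled along the way:

```python
import random

# Toy model: molecules do a random walk along a line of positions inside a
# sealed room. Position 0 is the one-way cap; any molecule that reaches it is
# trapped inside the bottle and never comes back out.
random.seed(0)

ROOM_SIZE = 20
N_MOLECULES = 50
positions = [random.randrange(1, ROOM_SIZE) for _ in range(N_MOLECULES)]
trapped = 0  # molecules that have passed through the cap so far

step = 0
while trapped < N_MOLECULES:
    step += 1
    still_outside = []
    for p in positions:
        p += random.choice([-1, 1])   # thermal jostling
        p = min(p, ROOM_SIZE - 1)     # far wall of the room
        if p <= 0:
            trapped += 1              # through the one-way cap
        else:
            still_outside.append(p)
    positions = still_outside

print(f"All {N_MOLECULES} molecules ended up inside the bottle after {step} steps.")
```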
Originally, the bottle-with-a-cap system was a non-optimizing system by definition; the type of bottle cap was irrelevant and could have been the rather special one I described. Simply by changing the definition of the basin of attraction, we could turn it into an optimizing system. Further, the original, “non-optimizing” system (with the original definitions of the basin of attraction and target set) would have behaved exactly the same as my optimizing system. On the other hand, changing the bottle cap from our special one to a regular cap will change the system into a non-optimizing system, regardless of the definitions of the basin of attraction and the target configuration set. Perhaps we should insist that a properly formed system description has a basin of attraction that is larger than the target set, and count on the system behavior to make the optimizing/non-optimizing distinction.
Definitions 1 and 2 both contain the phrase “a small set of target configurations”, which implies that the target set is much smaller than the basin of attraction. This is a problem for the notion of the universe as a system with maximum entropy as the target configuration set, because the target set is most of the possible configurations. For this reason, the essay’s author concludes that the universe-with-entropy system is not an optimizing system, or at best, a weak one. Stars, galaxies, black holes – there are strong forces that pull matter into these structures. I would say that any system that has succeeded in getting nearly everything within the basin of attraction into the target configuration is a strong optimizer!
Regardless of the way we choose to think about strong or weak, the universe is a system that tends to a set of configurations smaller than the set of possible configurations despite perturbations (the occasional house-building project, for example!). Personally, I see no value in a definitional limitation. The behavior of the system (tending toward a smaller set of configurations out of a larger set) should govern the definition of an optimizing system, regardless of the relative sizes of the sets.
Between the universe-with-entropy and bottle-with-a-cap systems, I question the utility of the “all configurations >= basin of attraction >> target configuration set” structure in the definition of optimizing systems. I believe it is worth thinking about what the necessary relationships among these sets are, and how they are chosen.
The example of the billiards system raised another (to me) interesting question. The essay does not offer a system description, but says: “Consider a billiard table with some billiard balls that are currently bouncing around in motion. Left alone, the balls will eventually come to rest in some configuration…. If we reach in while the billiard balls are bouncing around and move one of the balls that is in motion, the system will now come to rest in a different configuration. Therefore this is not an optimizing system, because there is no set of target configurations towards which the system evolves despite perturbations.”
This example has some odd features. Friction between the balls and the table surface, along with the loss of energy during inelastic collisions, causes the balls to slow down and stop. The minutiae of their travels determine where they stop. The final arrangement is unpredictable (OK, it could be modeled given complete information, but let’s skip that as beside the point), and any arrangement is as likely as another. This suggests that the billiards system is a non-optimizing system even without the proposed perturbation of moving the balls around while they are in motion.
Looked at another way, the billiards system does tend to a certain target configuration set, while friction and the inelasticity of the collisions are perturbations. If we make the surface frictionless and the collisions perfectly elastic, the balls will bounce around the table without stopping. Much like the water molecules in the bottle-with-a-cap example, each will eventually fall into one pocket or another during its travels. Once in a pocket, a ball cannot get out, and thus eventually all will end up in the pockets. So, this system tends to a target configuration set of all balls in pockets.
Adding back in the perturbing friction and energy loss does not mean that this system is not tending to the target configuration set. Reaching in and moving a ball to a different point, or even redirecting a ball heading for a pocket, will not keep this system from tending towards the target configuration. It seems as though the billiards system was an optimizing system all along! The larger point is that, by definition, an optimizing system is an optimizing system even if there is a set of perturbations that prevents it from ever reaching the target configuration! All three definitions say “tending toward”, not “reaching”, a target configuration set. It is worth thinking about an optimizing system that never actually optimizes. This may have some bearing on the AGI question.
[And for you readers who, like me, would say, whoa—it is possible that the balls will enter some repeating pattern of motion where some do not enter pockets. Maybe we need a robot to move the balls around randomly if they seem stuck, just like the ball-in-valley+robot system where the robot moves the ball over barriers. I maintain that the point is the same.]
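Here is a similar toy sketch for the frictionless billiards case (again my own illustration with made-up numbers, and it includes the random-nudge “robot” from the bracketed note above so that no ball can settle into a repeating pattern). The pockets act as absorbing states, and a mid-run perturbation that relocates a ball does not change which configuration set the system ends up in:

```python
import random

# Toy model: frictionless balls on a square table with a pocket in each corner.
# A pocketed ball stays pocketed. Partway through we "reach in" and move one
# ball, and every so often a "robot" randomly nudges the remaining balls so
# none can settle into a repeating pattern that avoids the pockets.
random.seed(1)

TABLE = 10.0
POCKET_RADIUS = 1.0
POCKETS = [(0.0, 0.0), (0.0, TABLE), (TABLE, 0.0), (TABLE, TABLE)]

balls = [[random.uniform(2, 8), random.uniform(2, 8),
          random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(4)]
pocketed = 0
step = 0

while balls:
    step += 1
    if step == 300:                      # the perturbation: relocate one ball
        balls[0][0], balls[0][1] = 5.0, 5.0
    if step % 100 == 0:                  # the "robot": small random nudges
        for b in balls:
            b[2] += random.uniform(-0.3, 0.3)
            b[3] += random.uniform(-0.3, 0.3)
    remaining = []
    for x, y, vx, vy in balls:
        x, y = x + vx * 0.1, y + vy * 0.1
        if not 0.0 <= x <= TABLE:
            vx, x = -vx, min(max(x, 0.0), TABLE)   # elastic bounce off the rail
        if not 0.0 <= y <= TABLE:
            vy, y = -vy, min(max(y, 0.0), TABLE)
        if any((x - px) ** 2 + (y - py) ** 2 <= POCKET_RADIUS ** 2
               for px, py in POCKETS):
            pocketed += 1                # absorbed: this ball is done
        else:
            remaining.append([x, y, vx, vy])
    balls = remaining

print(f"All {pocketed} balls ended up in pockets after {step} steps, despite the perturbation.")
```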
The satellite system illustrates (perhaps an obvious point) that the definition of the target configuration set can change a single system from optimizing to non-optimizing. What is a little more subtle is that the definition of the system boundaries is essential to the characterization of the system as optimizing or non-optimizing, even if the behavior of the system is the same under both definitions. In particular, what we consider to be part of the system and what is considered to be a perturbation can flip a system between characterizations. [This latter point is illustrated by the billiards system as well, as I will explain below.]
The essay says that a satellite in orbit is a non-optimizing system because if its position or velocity is perturbed, it has no tendency to return to its original orbit; that is, the author defines the target configuration as a particular orbit. With respect to another target configuration that may be described as “a scorched pile of junk on the surface of the Earth”, a satellite in orbit is an optimizing system exactly like a ball in a valley. As soon as the launch rocket stops firing, a satellite starts falling toward the center of the Earth because atmospheric drag and solar radiation pressure continuously decrease the component of the satellite’s velocity perpendicular to the force of gravity. So, unless a perturbation is big enough to send it out of orbit altogether, a satellite tends towards a target configuration of junk located on Earth’s surface.
Since a particular orbit is usually the desired target configuration (!), many satellites incorporate a rocket system to force them to stay in a chosen orbit. If a rocket system is included in the system definition, then the satellite is an optimizing system relative to the desired orbit. What is a little more interesting is that, with respect to the junk-on-the-Earth target set, drag and solar pressure are part of the optimizing system, while an orbit correction system is a perturbation. If the target set is the particular orbit the satellite started in, these definitions swap.
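A toy numeric sketch of that swap (my own illustration; the altitude numbers and decay rate are invented, and there is no real orbital mechanics here): with no correction rockets the altitude tends toward zero, i.e., the junk-on-Earth target set, even after a one-off upward kick; once station-keeping is counted as part of the system, the desired orbit becomes the configuration the system tends toward instead:

```python
# Toy numbers only: each step the altitude decays a little due to drag and solar
# pressure. Partway along we perturb the satellite with a one-off upward kick.
def evolve(altitude_km, station_keeping, steps=5000, target_km=400.0):
    for step in range(steps):
        altitude_km -= 0.5                 # decay from drag + solar pressure
        if step == 500:
            altitude_km += 50.0            # the perturbation: a one-off boost
        if station_keeping and altitude_km < target_km:
            altitude_km = target_km        # correction burn back to the chosen orbit
        if altitude_km <= 0:
            return "a scorched pile of junk on the Earth's surface"
    return f"still in its ~{target_km:.0f} km orbit"

print("Without station-keeping:", evolve(400.0, station_keeping=False))
print("With station-keeping:   ", evolve(400.0, station_keeping=True))
```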
This observation has bearing on the billiards system example. If we include drag and inelastic collisions as part of the billiards system, then the system is non-optimizing. If we see them as perturbations outside the system, then the billiards system is optimizing. I find this flexibility a little curious, although I haven’t completely thought through the implications.
A completely different sort of question is suggested by the section on Drexler. There the essay sets out a hierarchy of all AI systems, optimizing systems, and goal-directed agent systems. This makes sense with respect to AI systems, but I do not see how optimizing systems, as defined, can be wholly contained within the category of AI systems, unless you define AI systems pretty broadly. For example, I think that pretty much any control system is an optimizing system by the definition in the essay. If we accept this definition of optimizing system, and hold that all optimizing systems are a subset of AI systems, do we have to accept our thermostats as AI systems? What about the program that computes the square root of 2? Is that AI? Is this an issue for this definition, or does its broadness matter in an AI context?
And a nitpick: The first example of an optimizing system offered in the essay is a program calculating the square root of 2. It meets the definition of an optimizing system, but it seems to contradict the earlier assertion that “… optimizing systems are not something that are designed but are discovered.” The algorithm and the program were both designed. I’m not sure why this point is necessary. Either I do not understand something fundamental, or the only purpose of the statement of discovery is to give people like me something to argue about!
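For concreteness, here is a minimal sketch of the kind of square-root-of-2 program being discussed (my own illustration; the essay does not commit to a particular algorithm, so the Newton iteration and the mid-run perturbation below are just one invented way to show the state repeatedly tending back toward the target):

```python
# Newton's method for x^2 = 2. Even if we knock the variable far away from the
# answer partway through, the iteration keeps pulling it back toward sqrt(2),
# which is what makes this program an optimizing system under the definition.
def sqrt2(iterations=60, perturb_at=20):
    x = 1.0
    for i in range(iterations):
        x = 0.5 * (x + 2.0 / x)   # Newton step
        if i == perturb_at:
            x = 100.0             # the perturbation: shove the state far from the target
    return x

print(sqrt2())  # ~1.4142135623730951 despite the perturbation
```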
In summary, the definition in the essay suggests a few questions that could have a bearing on its application:
How do we choose the basin of attraction relative to the target configuration set, if our choice can change the status of the system from optimizing to non-optimizing and vice versa?
Is it an issue that an optimizing system may never actually optimize?
How do we choose what is part of the system versus a perturbation outside the system when our choice changes the status of the system as optimizing or non-optimizing?
All control systems are optimizing systems by the definition, but are all control systems AI systems? Does it matter? If it does matter, how do we tell the difference?
For any of these, how does the answer affect our thinking about AI?
Finally, it might be better to have one, consistent definition that covers all the possibilities, including (in my opinion) that perturbations may be confined to certain dimensions.
The more I think about this post, the more I think it captures my frustrations with a large percentage of the public discourse on AI x-risks, and not just this one debate event.
Thank you for articulating this. This matches closely with my own thoughts re Eliezer’s recently published discussion. I strongly agree that if Eliezer is in fact correct then the single most effective thing we could do is to persuasively show that to be true. Right now it’s not even persuasive to many / most alignment researchers, let alone anybody else.
Conditional on Eliezer being wrong though, I’m not sure how valuable showing him to be wrong would be. Presumably it would depend on why exactly he’s wrong, because if we knew that then we might be able to direct our resources more effectively.
I think that for those who agree with Eliezer, this is a very strong argument in favor of pouring money and resources into forecasting research or the like—as Open Philanthropy is in fact doing, I think. And even for people who disagree, if they put any non-trivial probability mass on Eliezer’s views, that would still make this a high priority.
Just putting in my vote for doing both broader and deeper explorations of these topics!
I have now posted a “Half-baked AI safety ideas thread” (LW version, EA Forum version) - let me know if that’s more or less what you had in mind.
I think part of what I was reacting to is a kind of half-formed argument that goes something like:
My prior credence is very low that all these really smart people, who have thought things through carefully, are making the kinds of stupid or biased mistakes they are being accused of.
In fact, my prior for the above is sufficiently low that I suspect it’s more likely that the author is the one making the mistake(s) here, at least in the sense of straw-manning his opponents.
But if that’s the case then I shouldn’t trust the other things he says as much, because it looks like he’s making reasoning mistakes himself or else he’s biased.
Therefore I shouldn’t take his arguments so seriously.
Again, this isn’t actually an argument I would make. It’s just me trying to articulate my initial negative reactions to the post.
From a Facebook discussion with Scott Aaronson yesterday:
Yann: I think neither Yoshua nor Geoff believe that AI is going kill us all with any significant probability.
Scott: Well, Yoshua signed the pause letter, and wrote an accompanying statement about what he sees as the risk to civilization (I agree that there are many civilizational risks short of extinction). In his words: “No one, not even the leading AI experts, including those who developed these giant AI models, can be absolutely certain that such powerful tools now or in the future cannot be used in ways that would be catastrophic to society.”
Geoff said in a widely-shared recent video that it’s “not inconceivable” that AI will wipe out humanity, and didn’t offer any reassurances about it being vanishingly unlikely.
https://twitter.com/JMannhart/status/1641764742137016320
Yann: Scott Aaronson he is worried about catastrophic disruptions of the political, economic, and environmental systems. I don’t want to speak for him, but I doubt he worries about a Yuddite-style uncontrollable “hard takeoff”
[Note that two-axis voting is now enabled for this post. Thanks to the mods for allowing that!]
Darn, there goes my ability to use Iarwain as a really unusual pseudonym. I’ve used it off and on for almost 20 years, ever since my brother made me a new email address right after having read the LOTR appendices.
[Cross-commenting from the EA Forum.]
[Disclaimers: My wife Deena works with Kat as a business coach. I briefly met Kat and Emerson while visiting in Puerto Rico and had positive interactions with them. My personality is such that I have a very strong inclination to try to see the good in others, which I am aware can bias my views.]
A few random thoughts related to this post:
1. I appreciate the concerns over the potential for personal retaliation, and the other factors mentioned by @Habryka and others for why it might be good not to delay this kind of post. I think those concerns and factors are serious and should definitely not be ignored. That said, I want to point out that there’s a different type of retaliation in the other direction that posting this kind of thing without waiting for a response can cause: reputational damage. As others have pointed out, many people seem to update more strongly on negative reports that come first and less on subsequent follow-up rebuttals. If it turned out that the accusations are demonstrably false in critically important ways, then even if that comes to light later, the reputational damage to Kat, Emerson, and Drew may now be irrevocable.

Reputation is important almost everywhere, but in my anecdotal experience reputation seems to be even more important in EA than in many other spheres. Many people in EA seem to have a very strong in-group bias towards favoring other “EAs”, and it has long seemed to me that (for example) getting a grant from an EA organization often feels like it’s even more about having strong personal EA connections than is the case elsewhere. (This is not to say that personal connections aren’t important for securing other types of grants or deals or the like, and it’s definitely not to say that getting an EA grant is only or even mostly about having strong EA connections. But from my own personal experience and from talking to quite a few others both in and out of EA, this is definitely how it feels to me. Note that I have received multiple EA grants in the past, and I have helped other people apply for and receive substantial EA grants.) I really don’t like this sort of dynamic and I’ve low-key complained about it for a long time—it feels unprofessional and raises all sorts of in-group bias flags. And I think a lot of EA orgs feel like they’ve gotten somewhat better about this over time. But I think it is still a factor.
Additionally, it sometimes feels to me that EA Forum dynamics tend to produce very strong upvotes for posts and comments that are critical of people or organizations, especially if the people or organizations involved are more “centrally connected” in EA, while posts and comments in the other direction get ignored or even downvoted. I am not sure why the dynamic feels like this, and maybe I’m wrong about it really being a thing at all. Regardless, I strongly suspect that any subsequent rebuttal by Nonlinear would receive significantly fewer views and upvotes, even if the rebuttal were actually very strong.
Because of all this, I think that the potential for reputational harm towards Kat, Emerson, and Drew may be even greater than if this were in the business world or some other community. Even if they somehow provide unambiguous evidence that refutes almost everything in this post, I would not be terribly surprised if their potential to get EA funding going forward or to collaborate with EA orgs was permanently ended. In other words, I wouldn’t be terribly surprised if this post spelled the end of their “EA careers” even if the central claims all turned out to be false. My best guess is that this is not the most likely scenario, and that if they provide sufficiently good evidence then they’ll be most likely “restored” in the EA community for the most part, but I think there’s a significant chance (say 1%-10%) that this is basically the end of their EA careers regardless of the actual truth of the matter.
Does any of this outweigh the factors mentioned by @Habryka? I don’t know. But I just wanted to point out a possible factor in the other direction that we may want to consider, particularly if we want to set norms for how to deal with other such situations going forward.
2. I don’t have any experience with libel law or anything of the sort, but my impression is that suing for libel over this kind of piece is very much within the range of normal responses in the business world, even if in the EA world it is basically unheard of. So if your frame of reference is the world outside of EA then suing seems at least like a reasonable response, while if your frame of reference is the EA community then maybe it doesn’t. I’ll let others weigh in on whether my impressions on this are correct, but I didn’t notice others bring this up so I figured I’d mention it.
3. My general perspective on these kinds of things is that… well, people are complicated. We humans often seem to have this tendency to want our heroes to be perfect and our villains to be horrible. If we like someone we want to think they could never do anything really bad, and unless presented with extremely strong evidence to the contrary we’ll look for excuses for their behavior so that it matches our pictures of them as “good people”. And if we decide that they did do something bad, then we label them as “bad people” and retroactively reject everything about them. And if that’s hard to do we suffer from cognitive dissonance. (Cf. halo effect.)
But the reality, at least in my opinion, is that things are more complicated. It’s not just that there are shades of grey, it’s that people can simultaneously be really good people in some ways and really bad people in other ways. Unfortunately, it’s not at all a contradiction for someone to be a genuinely kind, caring, supportive, and absolutely wonderful person towards most of the people in their life, while simultaneously being a sexual predator or committing terrible crimes.
I’m not saying that any of the people mentioned in this post necessarily did anything wrong at all. My point here is mostly just to point out something that may be obvious to almost all of us, but which feels potentially relevant and probably bears repeating in any case. Personally I suspect that everybody involved was acting in what they perceived to be good faith and are / were genuinely trying to do the right thing, just that they’re looking at the situation through lenses based on very different perspectives and experiences and so coming to very different conclusions. (But see my disclaimer at the beginning of this comment about my personality bias coloring my own perspective.)
I forgot about downvotes. I’m going to add this in to the guidelines.
Background material recommendations (popular-level audience, several hours time commitment): Please recommend your favorite basic AGI safety background reading / videos / lectures / etc. For this sub-thread please only recommend background material suitable for a popular level audience. Time commitment is allowed to be up to several hours, so for example a popular-level book or sequence of posts would work. Extra bonus for explaining why you particularly like your suggestion over other potential suggestions, and/or for elaborating on which audiences might benefit most from different suggestions.
Also—particular papers that you think are important, especially if you think they might be harder to find in a quick literature search. I’m part of an AI Ethics team at work, and I would like to find out about these as well.
Please describe or provide links to descriptions of concrete AGI takeover scenarios that are at least semi-plausible, and especially takeover scenarios that result in human extermination and/or eternal suffering (s-risk). Yes, I know that the arguments don’t necessarily require that we can describe particular takeover scenarios, but I still find it extremely useful to have concrete scenarios available, both for thinking purposes and for explaining things to others.
I don’t think this is quite an example of a treacherous turn, but this still looks relevant:
Lewis et al., “Deal or No Deal? End-to-End Learning for Negotiation Dialogues” (2017):
Analysing the performance of our agents, we find evidence of sophisticated negotiation strategies. For example, we find instances of the model feigning interest in a valueless issue, so that it can later ‘compromise’ by conceding it. Deceit is a complex skill that requires hypothesising the other agent’s beliefs, and is learnt relatively late in child development (Talwar and Lee, 2002). Our agents have learnt to deceive without any explicit human design, simply by trying to achieve their goals.
(I found this reference cited in Kenton et al., Alignment of Language Agents (2021).)
My impression—which I kind of hope is wrong—has been that it is much easier to get an EA grant the more you are an “EA insider” or have EA insider connections. The only EA connection that my professor has is me. On the other hand, I understand the reluctance to some degree in the case of AI safety because funders are concerned that researchers will take the money and go do capabilities research instead.
Meta-comment:
I noticed that I found it very difficult to read through this post, even though I felt the content was important, because of the (deliberately) condescending style. I also noticed that I’m finding it difficult to take the ideas as seriously as I think I should, again due to the style. I did manage to read through it in the end, because I do think it’s important, and I think I am mostly able to avoid letting the style influence my judgments. But I find it fascinating to watch my own reaction to the post, and I’m wondering if others have any (constructive) insights on this.
In general I’ve noticed that I have a very hard time reading things that are written in a polemical, condescending, insulting, or ridiculing manner. This is particularly true of course if the target is a group / person / idea that I happen to like. But even if it’s written by someone on “my side” I find I have a hard time getting myself to read it. There have been several times when I’ve been told I should really go read a certain book, blog, article, etc., and that it has important content I should know about, but I couldn’t get myself to read the whole thing due to the polemical or insulting way in which it was written.
Similarly, as I noted above, I’ve noticed that I often have a hard time taking ideas as seriously as I probably should if they’re written in a polemical / condescending / insulting / ridiculing style. I think maybe I tend to down-weight the credibility of anybody who writes like that, and by extension maybe I subconsciously down-weight the content? Maybe I’m subconsciously associating condescension (at least towards ideas / people I think of as worth taking seriously) with bias? Not sure.
I’ve heard from other people that they especially like polemical / condescending articles, and I imagine that this style is effective / persuasive for a lot of readers. For all I know it is far and away the most effective way of writing this kind of thing. And even if not, Eliezer is perfectly within his rights to use whatever style he wants. Eliezer explicitly acknowledged the condescending-sounding tone of the article but felt it was worth writing it that way anyway, and that’s fine.
So to be clear: This is not at all a criticism of the way this post was written. I am simply curious about my own reaction to it, and I’m interested to hear what others think about that.
A few questions:
Am I unusual in this? Do other people here find it difficult to read polemical or condescending writing, and/or do you find that the style makes it difficult for you to take the content as seriously as you perhaps should?
Are there any studies you’re aware of on how people react to polemical writing?
Are there some situations in which it actually does make sense to use the kind of intuitive heuristic I was using—i.e., if it’s written in a polemical / insulting style then it’s probably less credible? Or is this just a generally bad heuristic that I should try to get rid of entirely?
This is a topic I’m very interested in so I’d appreciate any other related comments or thoughts you might have.