I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com.
Steven Byrnes
AlphaZero had autonomous learning—the longer you train, the better the model weights. Humans (and collaborative groups of humans) also have that—hence scientific progress. Like, you can lock a group of mathematicians in a building for a month with some paper and pens, and they will come out with more and better permanent knowledge of mathematics than when they entered. They didn’t need any new training data; we just “ran them” for longer, and they improved, discovering new things arbitrarily far beyond the training data, with no end in sight.
Today’s SOTA LLMs basically don’t have an autonomous learning capability analogous to the above. Sure, people do all sorts of cool tricks with the context window, but people don’t know how to iteratively make the weights better and better without limit, in a way that’s analogous to AlphaZero doing self-play or human mathematicians doing math. Like, you can run more epochs on the same training data, but it rapidly plateaus. You can do the Huang et al. thing in an infinite loop, but I think it would rapidly go off the rails.
I don’t want to publicly speculate on what it would take for autonomous learning to take off in LLMs—maybe it’s “just more scale” + the Huang et al. thing, maybe it’s system-level changes, maybe LLMs are just not fit for purpose and we need to wait for the next paradigm. Whatever it is, IMO it’s a thing we’ll have eventually, and don’t have right now.
So I propose “somebody gets autonomous learning to work stably for LLMs (or similarly-general systems)” as a possible future fast-takeoff scenario.
In the context of the OP fast-takeoff scenarios, you wrote “Takeoff is less abrupt; Takeoff becomes easier to navigate; Capabilities gains are less general”. I’m not sure I buy any of those for my autonomous-learning fast-takeoff scenario. For example, AlphaZero was one of the first systems of that type that anyone got to work at all, and it rocketed to superhuman; that learning process happened over days, not years or decades; and presumably “getting autonomous learning to work stably” would be a cross-cutting advance not tied to any particular domain.
Facile answer: Why, that’s just what the Soviets believed, this Skinner-box model of human psychology devoid of innate instincts, and they tried to build New Soviet Humans that way, and failed, which was an experimental test of their model that falsified it.
(UPDATE: I WROTE A BETTER DISCUSSION OF THIS TOPIC AT: Heritability, Behaviorism, and Within-Lifetime RL)
There’s a popular tendency to conflate the two ideas:
“we should think of humans as doing within-lifetime learning by RL”, versus
“we should think of humans as doing within-lifetime learning by RL, where the reward function is whatever parents and/or other authority figures want it to be”
The second is associated with behaviorism, and is IMO preposterous. Intrinsic motivation is a thing; in fact, it’s kinda the only thing! The reward function is in the person’s own head, although things happening in the outside world are some of the inputs to it. Thus parents have some influence on the rewards (just like everything else in the world has some influence on the rewards), but the influence is through many paths, some very indirect, and the net influence is not even necessarily in the direction that the parent imagines it to be (thus reverse psychology is a thing!). My read of behavioral genetics is that approximately nothing that parents do to kids (within the typical distribution) has much if any effect on what kinds of adults their kids will grow into.
(Note the disanalogy to AGI, where the programmers get to write the reward function however they want.)
(…Although there’s some analogy to AGI if we don’t have perfect interpretability of the AGI’s thoughts, which seems likely.)
But none of this is evidence that the first bullet point is wrong. I think the first bullet point is true and important.
Slightly less facile answer: Because people are better at detecting cheating, in problems isomorphic to the Wason Selection Task, than they are at performing the naked Wason Selection Task, the conventional explanation of which is that we have built-in cheater detectors. This is a case in point of how humans aren’t blank slates and there’s no reason to pretend we are.
IIUC the experiment being referred to here showed that people did poorly on a reasoning task related to the proposition “if a card shows an even number on one face, then its opposite face is red”, but did much better on the same reasoning task related to the proposition “If you are drinking alcohol, then you must be over 18”. This was taken to be evidence that humans have an innate cognitive adaptation for cheater-detection. I think a better explanation is that most people don’t have a deep understanding of IF-THEN, but rather have learned some heuristics that work well enough in the everyday situations where IF-THEN is normally used. But “if you are drinking alcohol, then you must be over 18” is a sensible story. You don’t need a good understanding of IF-THEN to triangulate what the rule is and why it’s being applied. By contrast, the experimental subjects have no particular prior beliefs for “if a card shows an even number on one face, then its opposite face is red”.
In the paper, Cosmides & Tooby purport to rule out “familiarity” as a factor by noting that people do poorly on “If a person goes to Boston, then he takes the subway” and “If a person eats hot chili peppers, then he will drink a cold beer.” But those examples miss the point. If I said to you “Hey I want to tell you something about drinking alcohol and people-under-18…”, then you could already guess what I’m gonna say before I say it. But if I said to you “Hey I want to tell you something about going to Boston and taking the subway”, your guess would be wrong. Boston is very walkable! The conditional in this latter case is not obvious like it is in the former case. In the latter case, you can’t lean on common sense, you have to actually understand how IF-THEN works.
So I would be interested in a Wason selection task experiment on the following proposition: “If the stove is hot, then I shouldn’t put my hand on it”. This is not cheater-detection—it’s your own hand!—but I’d bet that people would do as well as the drinking question. (Maybe it’s already been done. I think there’s a substantial literature on Wason Selection that I haven’t read.)
(As it turns out, I’m open-minded to the possibility that humans do have cognitive adaptations related to cheater-detection, even if I don’t think this Wason selection task thing provides evidence for that. I think that this adaptation (if it exists) would be implemented via the RL reward function, more-or-less. Long story, still a work in progress.)
Actual answer: Because the entire field of experimental psychology that’s why.
This excerpt isn’t specific so it’s hard to respond, but I do think there’s a lot of garbage in experimental psychology (like every other field), and more specifically I believe that Eliezer has cited some papers in his old blog posts that are bad papers. (Also, even when experimental results are trustworthy, their interpretation can be wrong.) I have some general thoughts on the field of evolutionary psychology in Section 1 here.
Way back in 2020 there was an article A Proposed Origin For SARS-COV-2 and the COVID-19 Pandemic, which I read after George Church tweeted it (!) (without comment or explanation). Their proposal (they call it “Mojiang Miner Passage” theory) in brief was that it WAS a lab leak but NOT gain-of-function. Rather, in April 2012, six workers in a Mojiang mine “fell ill from a mystery illness while removing bat faeces. Three of the six subsequently died.” Their symptoms were a perfect match to COVID, and two were very sick for more than four months.
The proposal is that the virus spent those four months adapting to life in human lungs, including (presumably) evolving the furin cleavage site. And then (this is also well-documented) samples from these miners were sent to WIV. The proposed theory is that those samples sat in a freezer at WIV for a few years while WIV was constructing some new lab facilities, and then in 2019 researchers pulled out those samples for study and infected themselves.
I like that theory! I’ve liked it ever since 2020! It seems to explain many of the contradictions brought up by both sides of this debate—it’s compatible with Saar’s claim that the furin cleavage site is very different from what’s in nature and seems specifically adapted to humans, but it’s also compatible with Peter’s claim that the furin cleavage site looks weird and evolved. It’s compatible with Saar’s claim that WIV is suspiciously close to the source of the outbreak, but it’s also compatible with Peter’s claim that WIV might not have been set up to do serious GoF experiments. It’s compatible with the data comparing COVID to other previously-known viruses (supposedly). Etc.
Old as this theory is, the authors are still pushing it and they claim that it’s consistent with all the evidence that’s come out since then (see author’s blog). But I’m sure not remotely an expert, and would be interested if anyone has opinions about this. I’m still confused why it’s never been much discussed.
I agree with pretty much everything here, and I would add into the mix two more claims that I think are especially cruxy and therefore should maybe be called out explicitly to facilitate better discussion:
Claim A: “There’s no defense against an out-of-control omnicidal AGI, not even with the help of an equally-capable (or more-capable) aligned AGI, except via aggressive outside-the-Overton-window acts like preventing the omnicidal AGI from being created in the first place.”
I think this claim is true, on account of gray goo and lots of other things, and I suspect Eliezer does too, and I’m pretty sure other people disagree with this claim.
If someone disagrees with this claim (i.e., if they think that if DeepMind can make an aligned and Overton-window-abiding “helper” AGI, then we don’t have to worry about Meta making a similarly-capable out-of-control omnicidal misaligned AGI the following year, because DeepMind’s AGI will figure out how to protect us), and also believes in extremely slow takeoff, I can see how such a person might be substantially less pessimistic about AGI doom than I am.
Claim B: “Shortly after (i.e., years not decades after) we have dangerous AGI, we will have dangerous AGI requiring amounts of compute that many many many actors have access to.”
Again I think this claim is true, and I suspect Eliezer does too. In fact, my guess is that there are already single GPU chips with enough FLOP/s to run human-level, human-speed AGI, or at least in that ballpark. All that we need is to figure out the right learning algorithms, which of course is happening as we speak.
If someone disagrees with this claim, I think they could plausibly be less pessimistic than I am about prospects for coordinating not to build AGI, or coordinating in other ways, because it just wouldn’t be that many actors, and maybe they could all be accounted for and reach agreement (e.g. after a headline-grabbing near-miss catastrophe or something).
(I think most people in AI alignment, especially “scaling hypothesis” people, are expecting early AGIs to involve truly mindboggling amounts of compute, followed by some very long period where the required compute very gradually decreases on account of algorithmic advances. That’s not what I expect; instead I expect the discovery of new better learning algorithms with a different scaling curve that zooms to AGI and beyond quite quickly.)
I certainly don’t expect any prize for this, but…
why there has been so little discussion about his analysis since if true it seems to be quite important
…I can at least address this part from my perspective.
Some of the energy-efficiency discussion (particularly interconnect losses) seems wrong to me, but it seems not to be a crux for anything, so I don’t care to spend time looking into it and arguing about it. If a silicon-chip AGI server were 1000× the power consumption of a human brain, with comparable performance, its electricity costs would still be well below my local minimum wage. So who cares? And the world will run out of GPUs long before it runs out of the electricity needed to run them. And making more chips (or brains-in-vats or whatever) is a far harder problem than making enough solar cells to power them, and that remains true even if we substantially sacrifice energy-efficiency for e.g. higher speed.
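(To spell out the arithmetic behind the electricity-cost point, here’s a rough sketch; the specific numbers—roughly 20 W for the brain, $0.15/kWh for electricity, $15/hour for minimum wage—are ballpark assumptions on my part:)

```python
# Back-of-envelope sketch; all numbers are ballpark assumptions, not precise figures.
brain_power_watts = 20                       # human brain dissipates roughly 20 W
agi_power_watts = 1000 * brain_power_watts   # hypothetical AGI server at 1000x brain power = 20 kW
electricity_usd_per_kwh = 0.15               # rough retail electricity price
cost_per_hour = (agi_power_watts / 1000) * electricity_usd_per_kwh
print(cost_per_hour)  # ~$3/hour, far below e.g. a ~$15/hour minimum wage
```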
If we (or an AI) master synthetic biology and can make brains-in-vats, tended and fed by teleoperated robots, then we (or the AI) can make whole warehouses of millions of them, each far larger (and hence smarter) than would be practical in humans who had to schlep their brains around the savannah, and they can have far better cooling systems (liquid-cooled with 1°C liquid coolant coming out of the HVAC system, rather than blood-temperature which is only slightly cooler than the brain), and each can have an ethernet/radio connection to a distant teleoperated robot body, etc. This all works even when I’m assuming “merely brain efficiency”. It doesn’t seem important to me whether it’s possible to do even better than that.
Likewise, the post argues that existing fabs are pumping out the equivalent of 5 million (5000 maybe? See thread below.) brains per year, which to me seems like plenty for AI takeover—cf. the conquistadors, or Hitler / Stalin taking over a noticeable fraction of humanity with a mere 1 brain each. Again, maybe there’s room for improvement in chip tech / efficiency compared to today, or maybe not; it doesn’t really seem to matter IMO.
Another thing is: Jacob & I agree that “the cortex/cerebellum/BG/thalamus system is a generic universal learning system”, but he argues that this system isn’t doing anything fundamentally different from the MACs and ReLUs and gradient descent that we know and love from deep learning, and I think he’s wrong, but I don’t want to talk about it for infohazard reasons. Obviously, you have no reason to believe me. Oh well. We’ll find out sooner or later. (I will point out this paper arguing that correlations between DNN-learned-model-activations and brain-voxel-activations are weaker evidence than they seem. The paper is mostly about vision but also has an LLM discussion in Section 5.) Anyway, there are a zillion important model differences that are all downstream of that core disagreement, e.g. how many GPUs it will take for human-level capabilities, how soon and how gradually-vs-suddenly we’ll get human-level capabilities, etc. And hence I have a hard time discussing those too ¯\_(ツ)_/¯
Jacob & I have numerous other AI-risk-relevant disagreements too, but they didn’t come up in the “Brain Efficiency” post.
Well I’m one of the people who says that “AGI” is the scary thing that doesn’t exist yet (e.g. FAQ or “why I want to move the goalposts on ‘AGI’”). I don’t think “AGI” is a perfect term for the scary thing that doesn’t exist yet, but my current take is that “AGI” is a less bad term compared to alternatives. (I was listing out some other options here.) In particular, I don’t think there’s any terminological option that is sufficiently widely-understood and unambiguous that I wouldn’t need to include a footnote or link explaining exactly what I mean. And if I’m going to do that anyway, doing that with “AGI” seems OK. But I’m open-minded to discussing other options if you (or anyone) have any.
Generative pre-training is AGI technology: it creates a model with mediocre competence at basically everything.
I disagree with that—as in “why I want to move the goalposts on ‘AGI’”, I think there’s an especially important category of capability that entails spending a whole lot of time working with a system / idea / domain, and getting to know it and understand it and manipulate it better and better over the course of time. Mathematicians do this with abstruse mathematical objects, but also trainee accountants do this with spreadsheets, and trainee car mechanics do this with car engines and pliers, and kids do this with toys, and gymnasts do this with their own bodies, etc. I propose that LLMs cannot do things in this category at human level, as of today—e.g. AutoGPT basically doesn’t work, last I heard. And this category of capability isn’t just a random cherrypicked task, but rather central to human capabilities, I claim. (See Section 3.1 here.)
It seems to me that this argument only makes sense if we assume that “more capabilities research now” translates into “more gradual development of AGI”. That’s the real crux for me.
If that assumption is false, then accelerating capabilities is basically equivalent to having all the AI alignment and strategy researchers hibernate for some number N years, and then wake up and get back to work. And that, in turn, is strictly worse than having all the AI alignment and strategy researchers do what they can during the next N years, and also continue doing work after those N years have elapsed. I do agree that there is important alignment-related work that we can only do in the future, when AGI is closer. I don’t agree that there is nothing useful being done right now.
On the other hand, if that assumption is true (i.e. the assumption “more capabilities research now” translates into “more gradual development of AGI”), then there’s at least a chance that more capabilities research now would be net positive.
However, I don’t think the assumption is true—or at least, not to any appreciable extent. It would only be true if you thought that there was a different bottleneck to AGI besides capabilities research. You mention faster hardware, but my best guess is that we already have a massive hardware overhang—once we figure out AGI-capable algorithms, I believe we already have the hardware that would support superhuman-level AGI with quite modest amounts of money and chips. (Not everyone agrees with me.) You mention “neuroscience understanding”, but I would say that insofar as neuroscience understanding helps people invent AGI-capable learning algorithms, neuroscience understanding = capabilities research! (I actually think some types of neuroscience are mainly helpful for capabilities and other types are mainly helpful for safety, see here.) I imagine there being small bottlenecks that would add a few months today, but would only add a few weeks in a decade, e.g. future better CUDA compilers. But I don’t see any big bottlenecks, things that add years or decades, other than AGI capabilities research itself.
Even if the assumption is significantly true, I still would be surprised if more capabilities research now would be a good trade, because (1) I do think there’s a lot of very useful alignment work we can do right now (not to mention outreach, developing pedagogy, etc.), (2) the most valuable alignment work is work that informs differential technological development, i.e. work that tells us exactly what AGI capabilities work should be done at all, namely R&D that moves us down a path to maximally alignable AGI, but that’s only valuable to the extent that we figure things out before the wrong kind of capabilities research has already been completed. See Section 1.7 here.
I’m not sure how this desire works, but I don’t think you could train GPT to have it. It looks like some sort of theory of mind is involved in how the goal is defined.
I do think that would be valuable to know, and am very interested in that question myself, but I think that figuring it out is mostly a different type of research than AGI capabilities research—loosely speaking, what you’re talking about looks like “designing the right RL reward function”, whereas capabilities research mostly looks like “designing a good RL algorithm”—or so I claim, for reasons here and here.
RE creating an instruction manual:
I strongly vote against increasing the number of people able to unilaterally decide that an arbitrary species should be extinct. I think there are already many thousands of such people, and I don’t want there to be millions.
(I’m less strongly opposed if the instruction manual were somehow specific to mosquitoes and completely useless for any other plant or animal.)
In this section (including the footnote) I suggested that
there’s a category of engineered artifacts that includes planes and bridges and GOFAI and the Linux kernel;
there’s another category of engineered artifacts that includes plant cultivars, most pharmaceutical drugs, and trained ML models
with the difference being whether questions of the form “why is the artifact exhibiting thus-and-such behavior” are straightforwardly answerable (for the first category) or not (for the second category).
If you were to go around the public saying “we have no idea how trained ML models work, nor how plant cultivars work, nor how most pharmaceutical drugs work” … umm, I understand there’s an important technical idea that you’re trying to communicate here, but I’m just not sure about that wording. It seems at least a little bit misleading, right? I understand that there’s not much space for nuance in public communication, etc. etc. But still. I dunno. ¯\_(ツ)_/¯
I obviously understand the potential of reversible computing, and the thermodynamic efficiency limit I’m discussing is only for conventional non-exotic irreversible computers—the kind that humans know how to build now and for the foreseeable future, and also the kind of computer the brain is. Reversible computers may or may not ever be practical at room temp on earth, but essentially nobody is working on them—essentially all research into exotic computation is going into quantum computing.
I think OP was saying (and I agree) that you frequently say
“The brain is near thermodynamic efficiency limits for computation”
…as shorthand for…
“The brain is near the limit of what’s possible for computational efficiency, unless someone (or some AI) makes progress towards reversible computing, which seems very hard and not necessarily even possible and no one is working on it.”
The latter statement just doesn’t pack the punch of the former statement, and there’s a good reason that it doesn’t, and therefore making this substitution is importantly misleading.
the sort of person who this post is already aimed at (i.e. people who are excited to forge their own path in a technical field where everyone is fundamentally confused) is probably not the sort of person who is aiming for minor contributions anyway.
For me, there were two separate decisions. (1) Around March 2019, having just finished my previous intense long-term internet hobby, I figured my next intense long-term internet hobby was gonna be AI alignment; (2) later on, around June 2020, I started trying to get funding for full-time independent work. (I couldn’t work at an org because I didn’t want to move to a different city.)
I want to emphasize that at the earlier decision-point, I was absolutely “aiming for minor contributions”. I didn’t have great qualifications, or familiarity with the field, or a lot of time. But I figured that I could eventually get to a point where I could write helpful comments on other people’s blog posts. And that would be my contribution!
Well, I also figured I should be capable of pedagogy and outreach. And that was basically the first thing I did—I wrote a little talk summarizing the field for newbies, and gave it to one audience, and tried and failed to give it to a second audience.
(I find it a lot easier to “study topic X, in order to do Y with that knowledge”, compared to “study topic X” full stop. Just starting out on my new hobby, I had no Y yet, so “giving a pedagogical talk” was an obvious-to-me choice of Y.)
Then I had some original ideas! And blogged about them. But they turned out to be bad.
Then I had different original ideas! And blogged about them in my free time for like a year before I applied for LTFF.
…and they rejected me. On the plus side, their rejection came with advice about exactly what I was missing if I wanted to reapply. On the minus side, the advice was pretty hard to follow, given my time constraints. So I started gradually chipping away at the path towards getting those things done. But meanwhile, my rejected LTFF application got forwarded around, and I got a grant offer from a different source a few months later (yay).
With that background, a few comments on the post:
I wrote a fair bit on LessWrong, and researched some agency problems, even before quitting my job. I do expect it helps to “ease into it” this way, and if you’re coming in fresh you should probably give yourself extra time to start writing up ideas, following the field, and getting feedback.
I also went down the “ease into it” path. It’s especially (though not exclusively) suitable for people like me who are OK with long-term intense internet hobbies. (AI alignment was my 4th long-term intense internet hobby in my lifetime. Probably last. They are frankly pretty exhausting, especially with a full-time job and kids.)
Probably the most common mistake people make when first attempting to enter the alignment/agency research field is to not have any model at all of the main bottlenecks to alignment, or how their work will address those bottlenecks.
Just to clarify:
This quote makes sense to me if you read “when first attempting to enter the field” as meaning “when first attempting to enter the field as a grant-funded full-time independent researcher”.
On the other hand, when you’re first attempting to learn about and maybe dabble in the field, well obviously you won’t have a good model of the field yet.
One more thing:
the sort of person who this post is already aimed at (i.e. people who are excited to forge their own path in a technical field where everyone is fundamentally confused) is probably not the sort of person who is aiming for minor contributions anyway.
If you’re a kinda imposter-syndrome-y person who just constitutionally wouldn’t dream of looking themselves in the mirror and saying “I am aiming for a major contribution!”, well me too, and don’t let John scare you off. :-P
I can attest that it’s an awesome job.
I agree!
I’ll preface this by saying that I haven’t spent much time engaging with your material (it’s been on my to-do list for a very long time), and could well be misunderstanding things, and that I have great respect for what you’re trying to do. So you and everyone can feel free to ignore this, but here I go anyway.
OK, maybe the most basic reason that I’m skeptical of your STV stuff is that I’m going in expecting a, um, computational theory of valence, suffering, etc. As in, the brain has all those trillions of synapses and intricate circuitry in order to do evolutionary-fitness-improving calculations, and suffering is part of those calculations (e.g. other things equal, I’d rather not suffer, and I make decisions accordingly, and this presumably has helped my ancestors to survive and have more viable children).
So let’s say we’re sitting together at a computer, and we’re running a Super Mario executable on an emulator, and we’re watching the bits in the processor’s SRAM. You tell me: “Take the bits in the SRAM register, and take the Fourier transform, and look at the spectrum (≈ absolute value of the Fourier components). If most of the spectral weight is in long-wavelength components, e.g. the bits are “11111000111100000000...”, then Mario is doing really well in the game. If most of the spectral weight is in the short-wavelength components, e.g. the bits are “101010101101010”, then Mario is doing poorly in the game. That’s my theory!”
I would say “Ummm, I mean, I guess that’s possible. But if that’s true at all, it’s not an explanation, it’s a random coincidence.”
(This isn’t a perfect analogy, just trying to gesture at where I’m coming from right now.)
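(If it helps make the analogy concrete, here’s a minimal toy sketch, entirely my own illustration and not anything from STV, of the kind of “spectral weight” test the imaginary theory would be invoking:)

```python
import numpy as np

def low_frequency_fraction(bits, cutoff=4):
    """Toy metric: fraction of spectral weight in the lowest-frequency Fourier components."""
    x = np.array([int(b) for b in bits], dtype=float)
    spectrum = np.abs(np.fft.rfft(x - x.mean()))  # ignore the constant (DC) offset
    return spectrum[:cutoff].sum() / spectrum.sum()

# Per the imaginary theory: long runs of identical bits = "Mario is doing well"
print(low_frequency_fraction("1111100011110000000011111111"))  # relatively high
# Rapidly alternating bits = "Mario is doing poorly"
print(low_frequency_fraction("1010101010101010101010101010"))  # relatively low
```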
So that’s the real reason I don’t believe in STV—it just looks wrong to me, in the same way that Mario’s progress should not look like certain types of large-scale structure in SRAM bits.
I want a better argument than that though. So here are a few more specific things:
(1) Waves and symmetries don’t carry many bits of information. If you think valence and suffering are fundamentally few-dimensional, maybe that doesn’t bother you; but I think it’s at least possible for people to know whether they’re suffering from arm pain or finger pain or air-hunger or guilt or whatever. I guess I raised this issue in an offhand comment a couple years ago, and lsusr responded, and then I apparently dropped out of the conversation, I guess I must have gotten busy or something, hmm I guess I should read that. :-/
(2) From the outside, it’s easy to look at an fMRI or whatever and talk about its harmonic decomposition and symmetries. But from the perspective of any one neuron, that information is awfully hard to access. It’s not impossible, but I think you’d need the neuron to have a bunch of inputs from across the brain hooked into complicated timing circuits etc. My starting point, as I mentioned, is that suffering causes behavioral changes (including self-reports, trying not to suffer, etc.), so there has to be a way for the “am I suffering” information to impact specific brain computations, and I don’t know what that mechanism is in STV. (In the Mario analogy, if you just look at one SRAM bit, or even a few bits, you get almost no information about the spectrum of the whole SRAM register.) If “suffering” was a particular signal carried by a particular neurotransmitter, for example, we wouldn’t have that problem, we just take that signal and wire it to whatever circuits need to be modulated by the presence/absence of suffering. So theories like that strike me as more plausible.
(3) Conversely, I’m confused at how you would tell a story where getting tortured (for example) leads to suffering. This is just the opposite of the previous one: Just as a brain-wide harmonic decomposition can’t have a straightforward and systematic impact on a specific neural signal, likewise a specific neural signal can’t have a straightforward and systematic impact on a brain-wide harmonic decomposition, as far as I can tell.
(4) I don’t have a particularly well-formed alternative theory to STV, but all the most intriguing ideas that I’ve played around with so far that seem to have something to do with the nature of valence and suffering (e.g. here , here , various other things I haven’t written up) look wildly different from STV. Instead they tend to involve certain signals in the insular cortex and reticular activating system and those signals have certain effects on decisionmaking circuits, blah blah blah.
An example that springs to my mind: Abram wrote a blog post in 2018 mentioning the “easy problem of wireheading”. He described both the problem and its solution in like one sentence, and then immediately moved on to the harder problems.
Later on, DeepMind did an experiment that (in my assessment) mostly just endorsed what Abram said as being correct.
For the record, I don’t think that particular DeepMind experiment was zero value, for various reasons. But at the same time, I think that Abram wins hands-down on the metric of “progress towards AI alignment per researcher-hour”, and this is true at both the production and consumption end (I can read Abram’s one sentence much much faster than I can skim the DeepMind paper).
If we had a plausible-to-me plan that gets us to safe & beneficial AGI, I would be really enthusiastic about going back and checking all the assumptions with experiments. That’s how you shore up the foundations, flesh out the details, start developing working code and practical expertise, etc. etc. But I don’t think we have such a plan right now.
Also, there are times when it’s totally unclear a priori what an algorithm will do just by thinking about it, and then obviously the experiments are super useful.
But at the end of the day, I feel like there are experiments that are happening not because it’s the optimal thing to do for AI alignment, but rather because there are very strong pro-experiment forces that exist inside CS / ML / AI research in academia and academia-adjacent labs.
I think your example was doomed from the start because
the AGI was exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “the nanotech problem will get solved”,
the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.
So the latter is obviously doomed to get crushed by a sufficiently-intelligent AGI.
If we can get to a place where the first bullet point still holds, but the AGI also has a comparably-strong, explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”, then we’re in a situation where the AGI is applying its formidable intelligence to fight for both bullet points, not just the first one. And then we can be more hopeful that the second bullet point won’t get crushed. (Related.)
In particular, if we can pull that off, then the AGI would presumably do “intelligent” things to advance the second bullet point, just like it does “intelligent” things to advance the first bullet point in your story. For example, the AGI might brainstorm subtle ways that its plans might pattern-match to deception, and feel great relief (so to speak) at noticing and avoiding those problems before they happen. And likewise, it might brainstorm clever ways to communicate more clearly with its supervisor, and treat those as wonderful achievements (so to speak). Etc.
Of course, there remains the very interesting open question of how to reliably get to a place where the AGI has an explicit, endorsed, strong desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno. (More detailed discussion here.) For example, most humans get zapped with positive reward when they eat yummy ice cream, and yet the USA population seems to have wound up pretty spread out along the spectrum from fully endorsing the associated desire as ego-syntonic (“Eating ice cream is friggin awesome!”) to fully rejecting & externalizing it as ego-dystonic (“I sometimes struggle with a difficult-to-control urge to eat ice cream”). Again, I think there are important open questions about how this process works, and more to the point, how to intervene on it for an AGI.
Another reason the pasta terminology is bad is that I bet a reasonable fraction of the population have always believed that the salt is for taste, and have never heard any other justification. For them, “salt in pasta water fallacy” would be a pretty confusing term. I like “epsilon fallacy”.
I wonder what you think of my post “Learning from scratch” in the Brain? I feel like the shard theory discussions you cite were significantly based off that post of mine (I hope I’m not tooting my own horn—someone can correct me if I’m mis-describing the intellectual history here). If so, I think there’s a game of telephone, and things are maybe getting lost in translation.
For what it’s worth, I find this an odd post because I’m quite familiar with twin studies, I find them compelling, and I frequently bring them up in conversation. (They didn’t come up in that particular post but I did briefly mention them indirectly here in a kinda weird context.)
See in particular:
Onto more specific things:
Heritability for behavioral traits tends to increase, not decrease, during lifespan development
If we think about the human brain (loosely) as doing model-based reinforcement learning, and if different people have different genetically-determined reward functions, then one might expect that people with very similar reward functions tend to find their way into similar ways of being / relating / living / thinking / etc.—namely, the ways of being that best tickle their innate reward function. But that process might take time.
For example, if Alice has an innate reward function that predisposes her to be sympathetic to the idea of authoritarianism [important open question that I’m working on: exactly wtf kind of reward function might do that??], but Alice has spent her sheltered childhood having never been exposed to pro-authoritarian arguments, well then she’s not going to be a pro-authoritarian child! But by adulthood, she will have met lots of people, read lots of things, lived in different places, etc., so she’s much more likely to have come across pro-authoritarian arguments, and those arguments would have really resonated with her, thanks to her genetically-determined reward function.
So I find the increase in heritability with age to be unsurprising.
Recall that Assumption 1 of Shard Theory was that “The cortex is basically (locally) randomly initialized.” Recent studies in neurogenetics show that this is not accurate. Genetically informative studies in the Human Connectome Project show pervasive heritability in neural structure and function across all brain areas, not just limbic areas.
The word “locally” in the first sentence is doing a lot of work here. Again see Section 2.3.1. AFAICT, the large-scale wiring diagram of the cortex is mostly or entirely innate, as are the various cytoarchitectural differences across the cortex (agranularity etc.). I think of this fact as roughly “a person’s genome sets them up with a bias to learn particular types of patterns, in particular parts of their cortex”. But they still have to learn those patterns, with a learning algorithm (I claim).
As an ML analogy, there’s a lot of large-scale structure in a randomly-initialized convolutional neural net. Layer N is connected to Layer N+1 but not Layer N+17, and nearby pixels are related by convolutions in a way that distant pixels are not, etc. But a randomly-initialized convolutional neural net is still “learning from scratch” by my definition.
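(A concrete sketch of what I mean, using PyTorch purely for illustration:)

```python
import torch.nn as nn

# The large-scale wiring is specified in advance by the "genome" (i.e., this code):
# layer N feeds layer N+1 (not layer N+17), and each unit only sees a small local
# patch of the previous layer, thanks to the convolutional structure.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3),
    nn.ReLU(),
)
# ...but every individual weight starts out random, so everything the net winds up
# "knowing" has to be learned from data. That's still "learning from scratch" in my sense.
```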
Shard Theory implies that genes shape human brains mostly before birth
I don’t speak for Shard Theory but I for one strongly believe that the innate reward function is different during different stages of life, as a result of development (not learning), e.g. sex drive goes up in puberty.
See the 2002 book The Blank Slate by Steven Pinker
FWIW, if memory serves, I have no complaints about that book outside of chapter 5. I have lots of complaints about chapter 5, as you might expect.
Pope and Turner include this bold statement: “Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.”
My understanding (which I have repeatedly complained about) is that they are using “values” where I would use the word “desires”.
In that context, consider the statement: “I like Ghirardelli chocolate, and I like the Charles River, and I like my neighbor Dani, and I like [insert 5000 more things like that].” I think it’s perfectly obvious to everyone that there are not 5000 specific genes for my liking these 5000 particular things. There are probably (IMO) specific genes that contribute to why I like chocolate (i.e. genes related to taste), and genes that contribute to why I like the Charles River (i.e. genes related to my sense of aesthetics), etc. And there are also life experiences involved here; if I had had my first kiss at the Charles River, I would probably like it a bit more. Right?
I’m not exactly sure what the word “crude” is doing here. I don’t think I would have used that word. I think the hard-coded reward circuitry is rather complex and intricate in its own way. But it’s not as complex and intricate as our learned desires! I think describing our genetically hard-coded reward circuitry would take like maybe thousands of lines of pseudocode, whereas describing everything that an adult human likes / desires would take maybe millions or billions of lines. After all, we only have 25,000ish genes, but the brain has 100 trillion(ish) synapses.
“it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, we infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode.”
I’m not sure I would have said it that way—see here if you want to see me trying to articulate (what I think is) this same point that Quintin was trying to get across there.
Let’s consider Death as an abstract concept in your conscious awareness. When this abstract concept is invoked, it probably centrally involves particular neurons in your temporal lobe, and various other neurons in various other places. Which temporal-lobe neurons? It’s probably different neurons for different people. Sure, most people will have this Death concept in the same part of their temporal lobes, maybe even the same at a millimeter scale or something. But down to the individual neurons? Seems unlikely, right? After all, different people have pretty different conceptions of death. Some cultures might have two death concepts instead of one, for all I know. Some people (especially kids) don’t know that death is a thing in the first place, and therefore don’t have a concept for it at all. When the kid finally learns it, it’s going to get stored somewhere (really, multiple places), but I don’t think the destination in the temporal lobe is predetermined down to the level of individual neurons.
So, consider the possible story: “The abstract concept of Death is going to deterministically involve temporal lobe neurons 89642, and 976387, and (etc.) The genome will have a program that wires those particular neurons to the ventral-anterior & medial hypothalamus and PAG. And therefore, humans will be hardwired to be afraid of death.”
That’s an implausible story, right? As it happens, I don’t think humans are genetically hardwired to be afraid of death in the first place. But even if they were, I don’t think the mechanism could look like that.
That doesn’t necessarily mean there’s no possible mechanism by which humans could have a genetic disposition to be specifically afraid of death. It would just have to work in a more indirect way, presumably (IMO) involving learning algorithms in some way.
Shard Theory incorporates a relatively Blank Slate view about the origins of human values
I’m trying to think about how you wound up with this belief in the first place. Here’s a guess. I could be wrong.
One thing is, insofar as human learning is describable as model-based RL (yes it’s an oversimplification but I do think it’s a good starting point), the reward function is playing a huge role.
And in the context of AGI alignment, we the programmers get to design the reward function however we want.
We can even give ourselves a reward button, and press it whenever we feel like it.
If we are magically perfectly skillful with the reward function / button, e.g. we have magical perfect interpretability and give reward whenever the AGI’s innermost thoughts and plans line up with what we want it to be thinking and planning, then I think we would eventually get an aligned AGI.
A point that shard theory posts sometimes bring up is, once we get to this point, and the AGI is also super smart and capable, it’s at least plausible that we can just give the reward button to the AGI, or give the AGI write access to its reward function. Thanks to the instrumental-convergence goal-preservation drive, the AGI would try to use that newfound power for good, to make sure that it stayed aligned (and by assumption of super-competence, it would succeed).
Whether we buy that argument or not, I think maybe it can be misinterpreted as a blank slate-ish argument! After all, it involves saying that “early in life” we brainwash the AGI with our reward button, and “late in life” (after we’ve completely aligned it and then granted it access to its own reward function), the AGI will continue to adhere to the desires with which it was brainwashed as a child.
But you can see the enormous disanalogies with humans, right? Human parents are hampered in their ability to brainwash their children by not having direct access to their kids’ reward centers, and not having interpretability of their kids’ deepest thoughts which would be necessary to make good use of that anyway. Likewise, human adults are hampered in their ability to prevent their own values from drifting by not having write access to their own brainstems etc., and generally they aren’t even trying to prevent their own values from drifting anyway, and they wouldn’t know how to even if they could in principle.
(Again, maybe this is all totally unrelated to how you got the impression that Shard Theory is blank slate-ist, in which case you can ignore it!)
I think Nate’s claim “I expect them to care about a bunch of correlates of the training signal in weird and specific ways.” is plausible, at least for the kinds of AGI architectures and training approaches that I personally am expecting. If you don’t find the evolution analogy useful for that (I don’t either), but are OK with human within-lifetime learning as an analogy, then fine! Here goes!
OK, so imagine some “intelligent designer” demigod, let’s call her Ev. In this hypothetical, the human brain and body were not designed by evolution, but rather by Ev. She was working 1e5 years ago, back on the savannah. And her design goal was for these humans to have high inclusive genetic fitness.
So Ev pulls out a blank piece of paper. First things first: She designed the human brain with a fancy large-scale within-lifetime learning algorithm, so that these humans can gradually get to understand the world and take good actions in it.
Supporting that learning algorithm, she needs a reward function (“innate drives”). What to do there? Well, she spends a good deal of time thinking about it, and winds up putting in lots of perfectly sensible components for perfectly sensible reasons.
For example: She wanted the humans to not get injured, so she installed in the human body a system to detect physical injury, and put in the brain an innate drive to avoid getting those injuries, via an innate aversion (negative reward) related to “pain”. And she wanted the humans to eat sugary food, so she put a sweet-food-detector on the tongue and installed in the brain an innate drive to trigger reinforcement (positive reward) when that detector goes off (but modulated by hunger, as detected by yet another system). And so on.
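(Here’s a minimal pseudocode-style sketch of the kind of reward function I’m describing; it’s purely illustrative, and the real thing would be far more complex:)

```python
def innate_reward(body_signals):
    """Illustrative sketch of a genetically hard-coded reward function ("innate drives")."""
    reward = 0.0
    # Drive to avoid injury: the injury-detection system produces an aversive "pain" signal.
    reward -= 10.0 * body_signals["pain_detector"]
    # Drive to eat sugary food: the tongue's sweet-food detector triggers positive reward,
    # modulated by hunger as detected by yet another system.
    reward += 2.0 * body_signals["sweet_taste_detector"] * body_signals["hunger_level"]
    # ...and so on, with many more perfectly sensible components for perfectly sensible reasons.
    return reward
```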
Then she did some debugging and hyperparameter tweaking by running these newly-designed humans in the training environment (African savannah) and seeing how they do.
So that’s how Ev designed humans. Then she “pressed go” and let them run for 1e5 years. What happened?
Well, I think it’s fair to say that modern humans “care about” things that probably would have struck Ev as “weird”. (Although we, with the benefit of hindsight, can wag our finger at Ev and say that she should have seen them coming.) For example:
Superstitions and fashions: Some people care, sometimes very intensely, about pretty arbitrary things that Ev could not have possibly anticipated in detail, like walking under ladders, and where Jupiter is in the sky, and exactly what tattoos they have on their body.
Lack of reflective equilibrium resulting in self-modification: Ev put a lot of work into her design, but sometimes people don’t like some of the innate drives or other design features that Ev put into them, so the people go right ahead and change them! For example, they don’t like how Ev designed their hunger drive, so they take Ozempic. They don’t like how Ev designed their attentional system, so they take Adderall. Many such examples.
New technology / situations leading to new preferences and behaviors: When Ev created the innate taste drives, she was (let us suppose) thinking about the food options available on the savannah, and thinking about what drives would lead to people making smart eating choices in that situation. And she came up with a sensible and effective design for a taste-receptors-and-associated-innate-drives system that worked well for that circumstance. But maybe she wasn’t thinking that humans would go on to create a world full of ice cream and coca cola and miraculin and so on. Likewise, Ev put in some innate drives with the idea that people would wind up exploring their local environment. Very sensible! But Ev would probably be surprised that her design is now leading to people “exploring” open-world video-game environments while cooped up inside. Ditto with social media, organized religion, sports, and a zillion other aspects of modern life. Ev probably didn’t see any of it coming when she was drawing up and debugging her design, certainly not in any detail.
To spell out the analogy here:
Ev ↔ AGI programmers;
Human within-lifetime learning ↔ AGI training;
Adult humans ↔ AGIs;
Ev “presses go” and lets human civilization “run” for 1e5 years without further intervention ↔ For various reasons I consider it likely (for better or worse) that there will eventually be AGIs that go off and autonomously do whatever they think is a good thing to do, including inventing new technologies, without detailed human knowledge and approval.
Modern humans care about (and do) lots of things that Ev would have been hard-pressed to anticipate, even though Ev designed their innate drives and within-lifetime learning algorithm in full detail ↔ even if we carefully design the “innate drives” of future AGIs, we should expect to be surprised about what those AGIs end up caring about, particularly when the AGIs have an inconceivably vast action space thanks to being able to invent new technology and build new systems.
I disagree with the brain-based discussion of how much compute is required for AGI. Here’s an analogy I like (from here):
Left: Suppose that I want to model a transistor (specifically, a MOSFET). And suppose that my model only needs to be sufficient to emulate the calculations done by a CMOS integrated circuit. Then my model can be extremely simple—it can just treat the transistor as a cartoon switch. (image source.)
Right: Again suppose that I want to model a transistor. But this time, I want my model to accurately capture all measurable details of the transistor. Then my model needs to be mind-bogglingly complex, involving dozens of adjustable parameters, some of which are shown in this table (screenshot from here).
What’s my point? I’m suggesting an analogy between this transistor and a neuron with synapses, dendritic spikes, etc. The latter system is mind-bogglingly complex when you study it in detail—no doubt about it! But that doesn’t mean that the neuron’s essential algorithmic role is equally complicated. The latter might just amount to a little cartoon diagram with some ANDs and ORs and IF-THENs or whatever. Or maybe not, but we should at least keep that possibility in mind.
--
For example, this paper is what I consider a plausible algorithmic role of dendritic spikes and synapses in cortical pyramidal neurons, and the upshot is “it’s basically just some ANDs and ORs”. If that’s right, this little bit of brain algorithm could presumably be implemented with <<1 FLOP per spike-through-synapse. I think that’s a suggestive data point, even if (as I strongly suspect) dendritic spikes and synapses are meanwhile doing other operations too.
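(For concreteness, here’s a cartoon sketch of the sort of thing I mean by “ANDs and ORs”; this is my own toy illustration, not the cited paper’s actual model:)

```python
def cartoon_pyramidal_neuron(synapses):
    """Toy sketch: each dendritic branch acts roughly as an AND over its synaptic inputs,
    and the neuron's output acts roughly as an OR over its branches."""
    branch_1 = synapses[0] and synapses[1]   # branch fires only if both of its synapses are active
    branch_2 = synapses[2] and synapses[3]
    return branch_1 or branch_2              # soma fires if any branch fires
```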
--
Anyway, I currently think that, based on the brain, human-speed AGI is probably possible in 1e14 FLOP/s. (This post has a red-flag caveat on top, but that’s related to some issues in my discussion of memory; I stand by the compute section.) Not with current algorithms, I don’t think! But with some future algorithm.
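(For scale, here’s a rough comparison; the GPU figure is an approximate published spec, not something from that post:)

```python
# Rough comparison; numbers are order-of-magnitude estimates.
human_speed_agi_flops = 1e14    # my estimate from the linked post
single_a100_bf16_flops = 3e14   # one NVIDIA A100, dense BF16 tensor-core throughput (approx.)
print(single_a100_bf16_flops / human_speed_agi_flops)  # ~3, i.e. one modern GPU is in that ballpark
```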
I think that running microscopically-accurate brain simulations is many OOMs harder than running the algorithms that the brain is running. This is the same idea as the fact that running a microscopically-accurate simulation of a pocket calculator microcontroller chip, with all its thousands of transistors and capacitors and wires, stepping the simulation forward picosecond-by-picosecond, as the simulated chip multiplies two numbers, is many OOMs harder than multiplying two numbers.
Here it’s crucial that Magma’s safe systems—plus the people and resources involved in their overall effort to ensure safety—are at least as powerful (in aggregate) as less-safe systems that others might deploy. This is likely to be a moving target; the basic idea is that defense/deterrence/hardening could reduce “inaction risk” for the time being and give Magma more time to build more advanced (yet also safe) systems (and there could be many “rounds” of this).
I feel like the “good AIs + humans are more powerful than bad AIs” criterion paints much too rosy a picture (especially when “power” is implicitly operationalized as “total compute”), for several (overlapping) reasons:
1. There can be inherent offense-defense imbalances: For example, disabling an electric grid is a different task than preventing an electric grid from being disabled. Thus, the former task can in principle be much easier or much harder than the latter task. Ditto for “creating gray goo” versus “creating a gray goo defense system”, “triggering nuclear war” versus “preventing nuclear war from getting triggered”, etc. etc. I don’t have deep knowledge about attack-defense balance in any of these domains, but I’m very concerned by the disjunctive nature of the problem—an out-of-control AGI would presumably attack in whatever way had the worst (from humans’ perspective) attack-defense imbalance.
2. Humans may not entirely trust the “good” AIs: For example, I imagine the Magma CEO going up to the General of USSTRATCOM and saying “We have the most powerful AI in history, but it’s totally safe and friendly, trust us! We’ve been testing this particular one in our lab for a whole 7 months! As a red-team exercise, can we please have this AI attempt to trigger unintentional launch of the US nuclear arsenal, e.g. by spearphishing or blackmailing the soldiers who work at nuclear-early-warning radar stations, or by hacking into your systems, etc.? Don’t worry, the AI won’t actually launch the weapons—it’s just red-teaming. Trust us!” Then the STRATCOM general says “lol no way in hell, if you let that AI so much as think about how to hack our systems or soldiers I’ll have you executed for treason”. (Or worse, imagine Magma is based in the USA, and they’re trying to help secure the nuclear weapon systems of Russia!!)
I just don’t see how this is supposed to work. If Magma gives a copy of their AI to the General, the latter still wouldn’t use it anytime soon, and also doing that is a terrible idea for other reasons. Or if Magma asks their AI to invent a human-legible nuclear-weapon-securing tool / process, the AI might say, “That’s impossible, I can’t say in advance everything that could possibly be insecure, you have to let me look at how the systems are actually implemented in the real world, and apply my flexible intelligence, if you want this red-teaming exercise to actually work”. Or if Magma proceeds without the permission of the General … well, I find it extraordinarily hard to imagine that tech company executives and employees would actually do that, and that if they do, that it would actually have the desired result (as opposed to the suggested problems not being fixed while meanwhile the CEO gets arrested and the company gets nationalized).
Other examples include: humans may not trust a (supposedly) aligned AI to do recursive self-improvement, or to launch von Neumann probes that can never be called back, etc. But an out-of-control AI could do those things.
3. Relatedly, the “good” AIs are hampered by Alignment Tax: For example, if the “good” AIs are only “good” because they’re constrained by supervision and boxes and a requirement to output human-legible plans, and they’re running at 0.01× speed so that humans can use interpretability tools to monitor their thoughts, etc.—and meanwhile the out-of-control AIs can do whatever they want to accomplish their goals—then that’s a very big disadvantage.
4. The “good” AIs are hamstrung by human laws, norms, Overton Windows, etc., or by the need to get implausibly large numbers of human actors to agree with each other, or by large immediate costs for uncertain benefits, etc., such that necessary defense/deterrence/hardening doesn’t actually happen: For example, maybe the only viable gray goo defense system consists of defensive nanobots that go proliferate in the biosphere, harming wildlife and violating national boundaries. Would people + aligned AIs actually go and deploy that system? I’m skeptical. Likewise, if there’s a neat trick to melt all the non-whitelisted GPUs on the planet, I find it hard to imagine that people + aligned AIs would actually do anything with that knowledge, or even that they would go looking for that knowledge in the first place. But an out-of-control AI would.
This also relates to (1) above—there might be a “weakest link” dynamic where if even one cloud computing provider in the world refuses to use AIs to harden their security, then that creates an opening for an out-of-control unaligned AI to seize a ton of resources, while meanwhile the good aligned AIs won’t do that because it’s illegal.
Conclusion: I keep winding up in the “we’re doomed unless there’s a MIRI-style pivotal act, which there won’t be, because tech company executives would never dream of doing anything like that” school of thought. Except for the hope that the good AIs will magic us a beautiful human-legible solution to the alignment problem, and it’s such a good solution to the alignment problem that we can then start trusting the AGIs with no human oversight or other alignment tax, and these AGIs can recursively-self-improve into insane new superpowers that can solve otherwise-insoluble world problems. Or something.
(Part of the “we’re doomed” above comes from my strong background belief that, within a few years after we have real-deal strategically-aware human-level-planning AGIs at all, we’ll have real-deal strategically-aware human-level-planning AGIs that can be trained from scratch without much compute, e.g. in a university cluster. See here. So there would be a lot of actors all around the world who could potentially make an out-of-control AGI.)
Not an expert, and very curious how other people are thinking about this. :)
There’s a (hopefully obvious) failure mode where the AGI doomer walks up to the AI capabilities researcher and says “Screw you for hastening the apocalypse. You should join me in opposing knowledge and progress.” Then the AI capabilities researcher responds “No, screw you, and leave me alone”. Not only is this useless, but it’s strongly counterproductive: that researcher will now be far more inclined to ignore and reject future outreach efforts (“Oh, pfft, I’ve already heard the argument for that, it’s stupid”), even if those future outreach efforts are better.
So the first step to good outreach is not treating AI capabilities researchers as the enemy. We need to view them as our future allies, and gently win them over to our side by the force of good arguments that meet them where they’re at, in a spirit of pedagogy and truth-seeking.
(You can maybe be more direct with someone that they’re doing counterproductive capabilities research when they’re already sold on AGI doom. That’s probably why your conversation at EleutherAI discord went OK.)
(In addition to “it would be directly super-counterproductive”, a second-order reason not to try to sabotage AI capabilities research is that “the kind of people who are attracted to movements that involve sabotaging enemies” has essentially no overlap with “the kind of people who we want to be part of our movement to avoid AGI doom”, in my opinion.)
So I endorse “get existing top AGI researchers to stop” as a good thing in the sense that if I had a magic wand I might wish for it (at least until we make more progress on AGI safety). But that’s very different from thinking that people should go out and directly try to do that.
Instead, I think the best approach to “get existing top AGI researchers to stop” is producing good pedagogy, and engaging in gentle, good-faith arguments (as opposed to gotchas) when the subject comes up, and continuing to do the research that may lead to more crisp and rigorous arguments for why AGI doom is likely (if indeed it’s likely) (and note that there are reasonable people who have heard and parsed and engaged with all the arguments about AGI doom but still think the probability of doom is <10%).
I do a lot of that kind of activity myself (1,2,3,4,5, etc.).