I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
Steven Byrnes
Thanks!
One thing I would say is: if you have a (correct) theoretical framework, it should straightforwardly illuminate tons of diverse phenomena, but it’s very much harder to go backwards from the “tons of diverse phenomena” to the theoretical framework. E.g. any competent scientist who understands Evolution can apply it to explain patterns in finch beaks, but it took Charles Darwin to look at patterns in finch beaks and come up with the idea of Evolution.
Or in my own case, for example, I spent a day in 2021 looking into schizophrenia, but I didn’t know what to make of it, so I gave up. Then I tried again for a day in 2022, with a better theoretical framework under my belt, and this time I found that it slotted right into my then-current theoretical framework. And at the end of that day, I not only felt like I understood schizophrenia much better, but also my theoretical framework itself came out more enriched and detailed. And I iterated again in 2023, again simultaneously improving my understanding of schizophrenia and enriching my theoretical framework.
Anyway, if the “tons of diverse phenomena” are datapoints, and we’re in the middle of trying to come up with a theoretical framework that can hopefully illuminate all those datapoints, then clearly some of those datapoints are more useful than others (as brainstorming aids for developing the underlying theoretical framework), at any particular point in this process. The “schizophrenia” datapoint was totally unhelpful to me in 2021, but helpful to me in 2022. The “precession of Mercury” datapoint would not have helped Einstein when he was first brainstorming general relativity in 1907, but was presumably moderately helpful when he was thinking through the consequences of his prototype theory a few years later.
The particular phenomena / datapoints that are most useful for brainstorming the underlying theory (privileging the hypothesis), at any given point in the process, need not be the most famous and well-studied phenomena / datapoints. Einstein wrung much more insight out of the random-seeming datapoint “a uniform gravity field seems an awful lot like uniform acceleration” than out of any of the datapoints that would have been salient to a lesser gravity physicist, e.g. Newton’s laws or the shape of the galaxy or the Mercury precession. In my own case, there are random experimental neuroscience results (or everyday observations) that I see as profoundly revealing of deep truths, but which would not be particularly central or important from the perspective of other theoretical neuroscientists.
But, I don’t see why “legible phenomena” datapoints would be systematically worse than other datapoints. (Unless of course you’re also reading and internalizing crappy literature theorizing about those phenomena, and it’s filling your mind with garbage ideas that get in the way of constructing a better theory.) For example, the phenomenon “If I feel cold, then I might walk upstairs and put on a sweater” is “legible”, right? But if someone is in the very early stages of developing a theoretical framework related to goals and motivations, then they sure need to have examples like that in the front of their minds, right? (Or maybe you wouldn’t call that example “legible”?)
Can you elaborate on why you think “studying the algorithms involved in grammatically parsing a sentence” is not “a good way to get at the core of how minds work”?
For my part, I’ve read a decent amount of pure linguistics (in addition to neuro-linguistics) over the past few years, and find it to be a fruitful source of intuitions and hypotheses that generalize way beyond language. (But I’m probably asking different questions than you.)
I wonder if you’re thinking of, like, the nuts-and-bolts of syntax of specific languages, whereas I’m thinking of broader / deeper theorizing (random example), maybe?
In Section 1 of this post I make an argument kinda similar to the one you’re attributing to Eliezer. That might or might not help you, I dunno, just wanted to share.
the goal remains to implement CEV or something like it, and optimize the universe according to the resulting utility function
I think you mean “the goal remains to ensure that CEV or something like it is eventually implemented, and the universe is thus optimized according to the resulting utility function”, right? I think Eliezer’s view has always been that we want a CEV-maximizing ASI to be eventually turned on, but if that happens, it wouldn’t matter which human turns it on. And then evidently Eliezer has pivoted over the decades from thinking that this is likeliest to happen if he tries to build such an ASI with his own hands, to no longer thinking that.
A starting point is self-reports. If I truthfully say “I see my wristwatch”, then, somewhere in the chain of causation that eventually led to me uttering those words, there’s an actual watch, and photons are bouncing off it and entering my eyes then stimulating neurons etc.
So by the same token, if I say “your phenomenal consciousness is a salty yellow substance that smells like bananas and oozes out of your bellybutton”, and then you reply “no it isn’t!”, then let’s talk about how it is that you are so confident about that.
(I’m using “phenomenal consciousness” as an example, but ditto for “my sense of self / identity” or whatever else.)
So here, you uttered a reply (“No it isn’t!”). And we can assume that somewhere in the chain of causation is ‘phenomenal consciousness’ (whatever that is, if anything), and you were somehow introspecting upon it in order to get that information. You can’t know things in any other way—that’s the basic, hopefully-obvious point that I understand Eliezer was trying to make here.
Now, what’s a ‘chain of causation’, in the relevant sense? Let’s start with a passage from Age of Em:
The brain does not just happen to transform input signals into state changes and output signals; this transformation is the primary function of the brain, both to us and to the evolutionary processes that designed brains. The brain is designed to make this signal processing robust and efficient. Because of this, we expect the physical variables (technically, “degrees of freedom”) within the brain that encode signals and signal-relevant states, which transform these signals and states, and which transmit them elsewhere, to be overall rather physically isolated and disconnected from the other far more numerous unrelated physical degrees of freedom and processes in the brain. That is, changes in other aspects of the brain only rarely influence key brain parts that encode mental states and signals.
In other words, if your body temperature had been 0.1° colder, or if you were hanging upside down, or whatever, then the atoms in your brain would be configured differently in all kinds of ways … but you would still say “no it isn’t!” in response to my proposal that maybe your phenomenal consciousness is a salty yellow substance that oozes out of your bellybutton. And you would say it for the exact same reason.
This kind of thinking leads to the more general idea that the brain has inputs (e.g. photoreceptor cells), outputs (e.g. motoneurons … also, fun fact, the brain is a gland!), and algorithms connecting them. Those algorithms describe what Hanson’s “degrees of freedom” are doing from moment to moment, and why, and how. Whenever brains systematically do characteristically brain-ish things—things like uttering grammatical sentences rather than moving mouth muscles randomly—then the explanation of that systematic pattern lies in the brain’s inputs, outputs, and/or algorithms. Yes, there’s randomness in what brains do, but whenever brains do characteristically brain-ish things reliably (e.g. disbelieve, and verbally deny, that your consciousness is a salty yellow substance that oozes out of your bellybutton), those things are evidently not the result of random fluctuations or whatever, but rather they follow from the properties of the algorithms and/or their inputs and outputs.
That doesn’t quite get us all the way to computationalist theories of consciousness or identity. Why not? Well, here are two ways I can think of to be non-computationalist within physicalism:
One could argue that consciousness & sense-of-identity etc. are just confused nonsense reifications of mental models with no referents at all, akin to “pure white” [because white is not pure, it’s a mix of wavelengths]. (Cf. “illusionism”.) I’m very sympathetic to this kind of view. And you could reasonably say “it’s not a computationalist theory of consciousness / identity, but rather a rejection of consciousness / identity altogether!” But I dunno, I think it’s still kinda computationalist in spirit, in the sense that one would presumably instead make the move of choosing to (re)define ‘consciousness’ and ‘sense-of-identity’ in such a way that those words point to things that actually exist at all (which is good), at the expense of being inconsistent with some of our intuitions about what those words are supposed to represent (which is bad). And when you make that move, those terms almost inevitably wind up pointing towards some aspect(s) of brain algorithms.
One could argue that we learn about consciousness & sense-of-identity via inputs to the brain algorithm rather than inherent properties of the algorithm itself—basically the idea that “I self-report about my phenomenal consciousness analogously to how I self-report about my wristwatch”, i.e. my brain perceives my consciousness & identity through some kind of sensory input channel, and maybe also my brain controls my consciousness & identity through some kind of motor or other output channel. If you believe something like that, then you could be physicalist but not a computationalist, I think. But I can’t think of any way to flesh out such a theory that’s remotely plausible.
I’m not a philosopher and am probably misusing technical terms in various ways. (If so, I’m open to corrections!)
(Note, I find these kinds of conversations to be very time-consuming and often not go anywhere, so I’ll read replies but am pretty unlikely to comment further. I hope this is helpful at all. I mostly didn’t read the previous conversation, so I’m sorry if I’m missing the point, answering the wrong question, etc.)
I went through and updated my 2022 “Intro to Brain-Like AGI Safety” series. If you already read it, no need to do so again, but in case you’re curious for details, I put changelogs at the bottom of each post. For a shorter summary of major changes, see this twitter thread, which I copy below (without the screenshots & links):
I’ve learned a few things since writing “Intro to Brain-Like AGI safety” in 2022, so I went through and updated it! Each post has a changelog at the bottom if you’re curious. Most changes were in one of the following categories: (1/7)
REDISTRICTING! As I previously posted ↓, I booted the pallidum out of the “Learning Subsystem”. Now it’s the cortex, striatum, & cerebellum (defined expansively, including amygdala, hippocampus, lateral septum, etc.) (2/7)
LINKS! I wrote 60 posts since first finishing that series. Many of them elaborate and clarify things I hinted at in the series. So I tried to put in links where they seemed helpful. For example, I now link my “Valence” series in a bunch of places. (3/7)
NEUROSCIENCE! I corrected or deleted a bunch of speculative neuro hypotheses that turned out wrong. In some early cases, I can’t even remember wtf I was ever even thinking! Just for fun, here’s the evolution of one of my main diagrams since 2021: (4/7)
EXAMPLES! It never hurts to have more examples! So I added a few more. I also switched the main running example of Post 13 from “envy” to “drive to be liked / admired”, partly because I’m no longer even sure envy is related to social instincts at all (oops) (5/7)
LLMs! … …Just kidding! LLMania has exploded since 2022 but remains basically irrelevant to this series. I hope this series is enjoyed by some of the six remaining AI researchers on Earth who don’t work on LLMs. (I did mention LLMs in a few more places though ↓ ) (6/7)
If you’ve already read the series, no need to do so again, but I want to keep it up-to-date for new readers. Again, see the changelogs at the bottom of each post for details. I’m sure I missed things (and introduced new errors)—let me know if you see any!
This doesn’t sound like an argument Yudkowsky would make
Yeah, I can’t immediately find the link but I recall that Eliezer had a tweet in the past few months along the lines of: If ASI wants to tile the universe with one thing, then it wipes out humanity. If ASI wants to tile the universe with sixteen things, then it also wipes out humanity.
My mental-model-of-Yudkowsky would bring up “tiny molecular squiggles” in particular for reasons a bit more analogous to the CoastRunners behavior (video)—if any one part of the motivational system is (what OP calls) decomposable etc., then the ASI would find the “best solution” to maximizing that part. And if numbers matter, then the “best solution” would presumably be many copies of some microscopic thing.
I use rationalist jargon when I judge that the benefits (of pointing to a particular thing) outweigh the costs (of putting off potential readers). And my opinion is that “epistemic status” doesn’t make the cut.
Basically, I think that if you write an “epistemic status” at the top of a blog post, and then delete the two words “epistemic status” while keeping everything else the same, it works just about as well. See for example the top of this post.
(this comment is partly self-plagiarized from here)
Before doing any project or entering any field, you need to catch up on existing intellectual discussion on the subject.
I think this is way too strong. There are only so many hours in a day, and they trade off between
(A) “try to understand the work / ideas of previous thinkers” and
(B) “just sit down and try to figure out the right answer”.
It’s nuts to assert that the “correct” tradeoff is to do (A) until there is absolutely no (A) left to possibly do, and only then do you earn the right to start in on (B). People should do (A) and (B) in whatever ratio is most effective for figuring out the right answer. I often do (B), and I assume that I’m probably reinventing a wheel, but it’s not worth my time to go digging for it. And then maybe someone shares relevant prior work in the comments section. That’s awesome! Much appreciated! And nothing went wrong anywhere in this process! See also here.
A weaker statement would be “People in LW/EA commonly err in navigating this tradeoff, by doing too much (B) and not enough (A).” That weaker statement is certainly true in some cases. And the opposite is true in other cases. We can argue about particular examples, I suppose. I imagine that I have different examples in mind than you do.
~~
To be clear, I think your post has large kernels of truth and I’m happy you wrote it.
If you click my username it goes to my lesswrong user page, which has a “Message” link that you can click.
Related: Arbital postmortem.
Also, if anyone is curious to see another example, in 2007-8 there was a long series of extraordinarily time-consuming and frustrating arguments between me and one particular wikipedia editor who was very bad at physics but infinitely patient and persistent and rule-following. (DM me and I can send links … I don’t want to link publicly in case this guy is googling himself and then pops up in this conversation!) The combination of {patient, persistent, rule-following, infinite time to spend, object-level nutso} is a very very bad combination; it really puts a strain on any system (maybe benevolent dictatorship would solve that problem, while creating other ones). (Gerard also fits that profile, apparently.) Luckily I had about as much free time and persistence as this crackpot physicist did. He ended up getting permanently banned from wikipedia by the arbitration committee (wikipedia supreme court), but boy it was a hell of a journey to get there.
Thanks! I don’t do super-granular time-tracking, but basically there were 8 workdays where this was the main thing I was working on.
Yeah when I say things like “I expect LLMs to plateau before TAI”, I tend not to say it with the supremely high confidence and swagger that you’d hear from e.g. Yann LeCun, François Chollet, Gary Marcus, Dileep George, etc. I’d be more likely to say “I expect LLMs to plateau before TAI … but, well, who knows, I guess. Shrug.” (The last paragraph of this comment is me bringing up a scenario with a vaguely similar flavor to the thing you’re pointing at.)
I feel like “Will LLMs scale to AGI?” is right up there with “Should there be government regulation of large ML training runs?” as a black-hole-like attractor state that sucks up way too many conversations. :) I want to fight against that: this post is not about the question of whether or not LLMs will scale to AGI.
Rather, this post is conditioned on the scenario where future AGI will be an algorithm that (1) does not involve LLMs, and (2) will be invented by human AI researchers, as opposed to being invented by future LLMs (whether scaffolded, multi-modal, etc. or not). This is a scenario that I want to talk about; and if you assign an extremely low credence to that scenario, then whatever, we can agree to disagree. (If you want to argue about what credence is appropriate, you can try responding to me here or links therein, but note that I probably won’t engage, it’s generally not a topic I like to talk about for “infohazard” reasons [see footnote here if anyone reading this doesn’t know what that means].)
I find that a lot of alignment researchers don’t treat this scenario as their modal expectation, but still assign it like >10% credence, which is high enough that we should be able to agree that thinking through that scenario is a good use of time.
I think we’re mostly talking past each other, or emphasizing different things, or something. Oh actually, I think you’re saying “the edges of Network 1 exist”, and I’m saying “the edges & central node of Network 2 can exist”? If so, that’s not a disagreement—both can and do exist. :)
Maybe we should switch away from bleggs/rubes to a real example of coke cans / pepsi cans. There is a central node—I can have a (gestalt) belief that this is a coke can and that is a pepsi can. And the central node is in fact important in practice. For example, if you see some sliver of the label of an unknown can, and then you’re trying to guess what it looks like in another distant part of the can (where the image is obstructed by your hand), then I claim the main pathway used by that query is probably (part of image) → “this is a coke can” (with such-and-such angle, lighting, etc.) → (guess about a distant part of image). I think that’s spiritually closer to a Network 2 type inference.
Granted, there are other cases where we can make inferences without needing to resolve that central node. The Network 1 edges exist too! Maybe that’s all you’re saying, in which case I agree. There are also situations where there is no central node, like my example of car dents / colors / makes.
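To make the “query routed through the central node” claim concrete, here’s a minimal Python sketch of Network-2-style inference. All the probabilities and feature names are invented for illustration: the brand is a latent variable, the visible features are treated as conditionally independent given the brand, and a query about one feature given another is answered by first forming a posterior over the brand.

```python
# Toy "Network 2" inference: a central latent node (the can's brand) with
# conditionally independent observable features. All numbers are made up
# for illustration.

prior = {"coke": 0.5, "pepsi": 0.5}           # prior over the central node
p_red_sliver = {"coke": 0.95, "pepsi": 0.05}  # P(see red label sliver | brand)
p_script_logo = {"coke": 0.9, "pepsi": 0.1}   # P(script logo on far side | brand)

def posterior_over_brand(prior, likelihood):
    """P(brand | observed feature), by Bayes' rule."""
    unnorm = {b: prior[b] * likelihood[b] for b in prior}
    z = sum(unnorm.values())
    return {b: v / z for b, v in unnorm.items()}

def predict_feature(prior, p_observed, p_query):
    """P(query feature | observed feature), routed through the central node:
    sum over brands of P(query | brand) * P(brand | observed)."""
    post = posterior_over_brand(prior, p_observed)
    return sum(post[b] * p_query[b] for b in post)

print(predict_feature(prior, p_red_sliver, p_script_logo))  # ≈ 0.86
```

Contrast with a Network-1-style model, which would need a separate pairwise edge (a separate conditional table) for every pair of features; with the central node, adding a new feature costs only one new conditional table, and every feature-to-feature query reuses the same posterior over the brand.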
Separately, I think your neuroanatomy is off—visual object recognition is conventionally associated with the occipital and temporal lobes (cf. “ventral stream”), and has IMO almost nothing to do with the prefrontal cortex. As for a “region where “the blegg neurons”…are, such that if they get killed you (selectively) lose the ability to associate the features of a blegg with other features of a blegg”: if you’re just talking about visual features, then I think the term is “agnosia”, and if it’s more general types of “features”, I think the term is “semantic dementia”. They’re both associated mainly with temporal lobe damage, if I recall correctly, although not the same parts of the temporal lobe.
Response to Dileep George: AGI safety warrants planning ahead
I think I’d vote for: “Network 2 for this particular example with those particular labels, but with the subtext that the central node is NOT a fundamentally different kind of thing from the other five nodes; and also, if you zoom way out to include everything in the whole giant world-model, you also find lots of things that look more like Network 1. As an example of the latter: in the world of cars, their colors, dents, and makes have nonzero probabilistic relations that people can get a sense for (“huh, a beat-up hot-pink Mercedes, don’t normally see that...”) but it doesn’t fit into any categorization scheme.”
just found this survey from 2018
my package “is of somewhat limited use … semantic consistency is detected in a rather obscure way”, haters gonna hate 😂
Thanks for sharing!!
You can compare and contrast with my version (The “mind-body vicious cycle” model of RSI & back pain). Our differences are pretty minor in the grand scheme of things, but they seem to mostly stem from the fact that, unlike you, I reject many of the claims that fall under the Predictive Processing umbrella.
I don’t think we disagree much if at all.
I think constructing a good theoretical framework is very hard, so people often do other things instead, and I think you’re using the word “legible” to point to some of those other things.
I’m emphasizing that those other things are less than completely useless as semi-processed ingredients that can go into the activity of “constructing a good theoretical framework”.
You’re emphasizing that those other things are not themselves the activity of “constructing a good theoretical framework”, and thus can distract from that activity, or give people a false sense of how much progress they’re making.
I think those are both true.
The pre-Darwin ecologists were not constructing a good theoretical framework. But they still made Darwin’s job easier, by extracting slightly-deeper patterns for him to explain with his much-deeper theory—concepts like “species” and “tree of life” and “life cycles” and “reproduction” etc. Those concepts were generally described by the wrong underlying gears before Darwin, but they were still contributions, in the sense that they compressed a lot of surface-level observations (Bird A is mating with Bird B, and then Bird B lays eggs, etc.) into a smaller number of things-to-be-explained. I think Darwin would have had a much tougher time if he was starting without the concepts of “finch”, “species”, “parents”, and so on.
By the same token, if we’re gonna use language as a datapoint for building a good underlying theoretical framework for the deep structure of knowledge and ideas, it’s hard to do that if we start from slightly-deep linguistic patterns (e.g. “morphosyntax”, “sister schemas”)… But it’s very much harder still to do that if we start with a mass of unstructured surface-level observations, like particular utterances.
I guess your perspective (based on here) is that, for the kinds of things you’re thinking about, people have not been successful even at the easy task of compressing a lot of surface-level observations into a smaller number of slightly-deeper patterns, let alone successful at the much harder task of coming up with a theoretical framework that can deeply explain those slightly-deeper patterns? And thus you want to wholesale jettison all the previous theorizing? On priors, I think that would be kinda odd. But maybe I’m overstating your radicalism. :)