I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. I’m also at: Substack, X/Twitter, Bluesky, RSS, email, and more at this link. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Leave me anonymous feedback here.
Steven Byrnes
(Data point: I was complaining a bit about the effects of Inkhaven 1 on LessWrong, but Inkhaven 2 seemed fine.)
No, I don’t find that plausible. Sorry, I don’t have time to explain why, but this post section is related to where I’m coming from.
The OP is about the “deep learning sample efficiency gap”. But that’s not a deep learning paper. So I don’t think it provides any evidence here.
I agree that the social world is usually very very important for (1) making options salient, and (2) making options seem appealing, and (3) providing evidence about the consequences of different options. I think that’s the kernel of truth that this post is gesturing at.
But I think you’re taking that observation WAY too far.
In particular, the social world is not REQUIRED for any of those three things.
For one thing, if people learn planning from other people, where did it come from in the first place? Somebody had to have been the first, right?
For another thing, sometimes people do quite unusual things in the effective pursuit of goals. E.g. Jeff Bezos founded Amazon in order to get enough money to pursue his real dream of running a space exploration company. Who would he have learned that from?
(I think some people are more motivated by following norms than others. Sociopaths, autistics, and “high-agency people” would typically be on the lower end of norm-following motivation, so I would look there first to find especially clear-cut evidence of non-social agency.)
For another thing, if you take someone’s general advice (say, they counsel “it’s better to ask for forgiveness than permission!”), and then next week you end up humiliated and with a painful broken arm and giant hospital bill, aren’t you marginally less likely to follow that same heuristic in the future? Conversely, if you adopt their general advice and then next week you end up with a proud new accomplishment under your belt, aren’t you marginally more likely to follow that same heuristic in the future? Obviously yes, right? So doesn’t this constitute “[learning] through feedback about how well they fulfill your goals”?
“Low-level reflexes generalize to heuristics, which in turn generalize to a general planning algorithm”… I would also guess Steven Byrnes believes this (see below).
No, I think I’d mostly disagree with that statement. I think planning is basically innate, although it’s augmented by a lifetime of learning how to plan better (e.g. you can learn metacognitive heuristics from experience, or from reading a book etc.).
It’s not clear to me how “order food → food will come” is even supposed to be learned by the brain’s self-supervised learning/predictive processing or RL. The prediction error/reward comes in _a week_ after the prediction. And if it’s somehow deduced from higher-level knowledge about the world—how did that get learned?
Obviously it’s a hard problem that AI researchers have not solved yet, but it’s equally obvious to me that a solution exists in the brain. It seems crazy to me to deny that. We make a zillion accurate long-term predictions about the world every day (e.g. “if I put on ripped pants right now at 8am, then my knees might get cold when I’m outside at 10pm tonight, and I know this because it happened to me yesterday”). We make way too many long-term predictions in way too many circumstances to have learned all of them from observing or listening to other people. And even the things that we did learn from someone else telling us, that person in turn had to have learned it somehow, and if we trace that chain back then it eventually has to end in somebody actually figuring something out by observing the world. Right?
Have you really never in your life figured out something like “X today implies Y tomorrow” all by yourself that you didn’t learn from someone else??
I feel like I’m probably misunderstanding your position here, because it really seems crazy to me.
I’ve gotten into the habit of trying to model what’s going on when I experience an impulse for an action that could be interpreted as “long-term planning”, and it seems to me that it’s all actually just a bunch of superficial, distinct, socially learned behavioral patterns, rather than any planning through a world model or any general/sophisticated heuristics for accomplishing long-term goals
Maybe you should read Cate Hall’s book when it comes out? :-P
OK here’s an example that I challenge you to explain: if I’m hungry, I might take a bus to the restaurant to get a slice of pizza, but if I’m not hungry, then I won’t.
The obvious explanation that I endorse is: when I’m hungry, eating pizza seems good and motivating, so I make a plan to eat pizza, and execute the plan. When I’m not hungry, eating pizza seems pointless or aversive, so I don’t.
By contrast, this seems impossible to explain in your framework. If I’m just copying people, how can that get linked to my own interoceptive sensation of hunger? That sensation is private to me, and other people’s sensations of hunger are private to them. There’s no SOCIAL logic behind connecting my own internal sensation of hunger to a plan-to-eat. Right?
Moreover, the plan to eat pizza is clearly “planning through a world-model”. For example, if it’s 4am and the buses aren’t running and the pizza place is closed, then I won’t try to take a bus to the pizza place. If there’s a wildfire blazing between me and the restaurant, then I also won’t try to go there. I will set out to the restaurant only if it seems like eating pizza is the plausible result of doing so. Because I want to eat pizza.
Of course, I’m not omniscient, and even beyond that, sometimes I “know” something but temporarily forgot about it. Like, maybe I forgot that the restaurant owner was on vacation. Oops. But that doesn’t undermine the idea that I am hungry, and trying to get pizza so I can eat it. The goal (eating pizza) is in my mind, and I am brainstorming how to make that goal happen. Right??
Anyway, I reiterate the first paragraph of my comment, that there’s a kernel of truth here, and that it’s very important, even if I think you’re taking it way too far.
That all sounds fine, if we’re engaged in a pragmatic project for deciding what to do, and want to propose an answer that you and I can get behind, and that lots of people around the world can also get behind.
I think Arjun is (rightly) complaining about something different, namely that Eliezer and you and others frequently slip into treating this answer as being fundamentally privileged / “Right”, as opposed to merely a pragmatic option that you and I and lots of people can get behind.
E.g. here’s Nate referring to “the future’s potential value”, as if there’s a metric for that which is canonical and characteristic of humanity-as-a-whole. I think that’s moral-realist (or “crypto”-moral-realist) thinking, sneaking in.
(Interesting post, thanks for writing it!)
I do believe the brain has much higher sample efficiency than existing DNN algorithms, in the sense that matters for guessing future ASI compute requirements. But I agree that pinning down the comparison is a bit subtle.
(Also, sample-efficiency is not the main reason why I think that FLOP-required-for-ASI is low, but rather trying to guess how much compute the brain is doing. But sure, sample-efficiency is not totally irrelevant to how I think about these things, I suppose.)
The sensory data going to the brain is (I think) >99% visual, and >99.5% visual + audio. (IIRC … I didn’t double-check, and it’s kinda controversial how to calculate it anyway…)
So it’s interesting that congenitally blind people, and deafblind people, are basically just as smart and competent as sighted & hearing people, except obviously in contexts where the missing sensory data is directly relevant. I think this observation generally pushes against a perspective that centers the story of human intelligence around our abundant sensory data.
And more specifically, RE your Appendix, if we’re going to compare frontier LLM training data with human sensory data, we should also be putting blind and deafblind people onto those same plots / tables. And also, if we’re comparing sighted people to frontier models, we need to include the frontier models’ visual training data, not just text token training data … I don’t know how many extra bytes that would be, but I’d guess a lot.
I’m not exactly sure what point you’re trying to make with the discussion of Dreamer, EfficientZero, and related, but (copying from an argument I had on this topic in 2021):
I think that if somebody wants to understand AlphaZero, the fact that it trained on 40,000,000 games of self-play is a highly relevant and interesting datapoint. Suppose you were to then say “…but of those 40,000,000 games, fundamentally it really only needed 100 games with the external simulator to learn the rules. The other 39,999,900 games might as well have been ‘in its head’. This was proven in follow-up work.”. I would reply: “Oh. OK. That’s interesting too. But I still care about the 40,000,000 number. I still see that number as a very important part of understanding the nature of AlphaZero and similar systems.”
Anyway, if a human is playing chess in his head, or replaying a memory of ~~that embarrassing thing that I did one time in middle school~~ what they did yesterday, then he is not paying attention to sensory input. He’s probably mostly zoning out. So in a certain sense, the replay is replacing sensory data, as opposed to increasing the effective total amount of data, in humans. So, like, the thing in §3 where you note that LLMs can be more “sample-efficient” by doing 4 epochs of the same data, or the thing that EfficientZero etc. does, well, if you’re talking about sample-efficiency for the pragmatic reason of trying to solve an AI problem where you have lots of compute but strictly limited data, then cool, that kind of thing is helpful and important. But if you’re talking about sample-efficiency in the context of trying to compare and contrast humans versus current AIs, then I think those tricks are somewhat off-topic.
I concede that “brains are kinda like insanely huge 100-trillion-parameter LLMs, and that’s BOTH why we don’t have AGI yet AND why brains are (in certain senses) more sample-efficient” is a story that hangs together. And it’s a pretty popular story in LLM circles because it also fits in with scale-is-all-you-need. I really don’t think that story is right, for lots of reasons, including neuroscience stuff that I don’t want to get into, but also just, like, noticing all the ways that brains are quite different from insanely huge LLMs. There’s the continual learning stuff, the model-based RL stuff, the brain’s complete absence of “true” imitative learning, the way that cortical microcircuits simply do not look anything like transformer layers, etc.
A teenager can learn to drive in a few dozen hours; self-driving systems are trained for years on billions of miles of data. …
Steven Byrnes appears to read the gap as evidence that current algorithms are far from what the brain is doing, such that much better algorithms must be waiting to be found.
I think you’re attributing an argument to me which I wasn’t making (in the context of that post that you copied the diagram from). I agree that comparing 30 hours of teen driving practice to umpteen gazillion hours of Waymo training data is apples-and-oranges because the teen also has life experience.
But I was making a different point, which (in my own words) was: “…we don’t have AGI (artificial general intelligence) yet—not as I use the term…”. (I’m not even sure you disagree with that??)
Anyway, it is NOT the case that it’s possible to make self-driving cars by taking some generic learning algorithm that we already know about, and letting it spend the equivalent of 18 years roaming around and doing stuff in various virtual environments like VR & Minecraft, and watching YouTube videos, and reading books, and whatever, and THEN have it spend 30 hours with minimal instruction driving actual cars, and bam, now you have a human-level self-driving car. There is no generic learning algorithm today that can do that, right? If there were such an algorithm, then surely somebody would have done that already. That would have been way way way easier than what Waymo and Tesla etc. have been actually doing. So I think this example is fair game: the brain can do things that no existing AI algorithm can do, even in an apples-to-apples comparison that holds data availability fixed.
Maybe your response would be: “Oh yeah that’s easy, someone could totally do that, it’s just that nobody has bothered because the resulting AI would be too big to fit in a car computer”?? Or “Oh yeah, we totally know how to do that, it’s just that it would require more compute than would be affordable or practical at the present time”?? If so, I disagree with both of those possible objections, and we can get into why if it’s crux-y.
What people take this to mean varies widely. Steven Byrnes appears to read the gap as evidence that current algorithms are far from what the brain is doing, such that much better algorithms must be waiting to be found. His guess is that human-level, human-speed AGI will require not a datacenter but “one consumer gaming GPU,” even for training from scratch. Yarrow Bouchard on the EA Forum, reads the same gap as evidence that AGI isn’t close at all, precisely because nobody knows how to close it. Nearly opposite conclusions from the same starting observation.
I’m confused, these don’t sound “nearly opposite” to me, they sound very compatible. Did you misread something? Or maybe you’re noticing that Yarrow & I have opposite vibe and emphasis, even when we’re saying basically the same thing?
(I very strongly disagree with Yarrow about all kinds of things, but I don’t think this paragraph is pointing to an example. Here’s an example where I was partly agreeing and partly disagreeing with Yarrow on something in the vicinity of this topic.)
If you ask lots of people whether their moral preferences ought to be self-consistent, they’ll mostly say yes. If you ask lots of people whether their moral preferences are more valid after they think about them longer, after a good night’s sleep, they’ll also mostly say yes.
But also, if you ask lots of people whether it’s moral for their family to be tortured, they’ll mostly say no. And they probably won’t say that no-torture is less important than self-consistency.
Here are three (IMO reasonable) people arguing that moral deliberation / self-consistency does not straightforwardly and universally trump other ways to reach normative conclusions: Scott Alexander:
But I’m not sure I want to play the philosophy game. Maybe MacAskill can come up with some clever proof that the commitments I list above imply I have to have my eyes pecked out by angry seagulls or something. If that’s true, I will just not do that, and switch to some other set of axioms. If I can’t find any system of axioms that doesn’t do something terrible when extended to infinity, I will just refuse to extend things to infinity.
plus Stuart Armstrong here, and Joe Carlsmith discusses this a bunch (kinda arguing both sides) here & here & here.
Anyway, if we’re gonna treat CEV (and related things like Long Reflection) as meta-ethical ground truth (and not just as pragmatic projects to design a widely-acceptable ASI motivation system, per my other comment), then we have to grant moral deliberation and self-consistency a special status, NOT just “well yeah self-consistency is one of the things that people feel is good and right, along with all the other things that people feel are good and right”. And I think Arjun is asking: where would this special status come from?
It’s evidently not grounded in people’s moral intuitions, because people’s moral intuitions in favor of self-consistency are not systematically stronger or different-in-kind from people’s moral intuitions in favor of justice or whatever else. Alternatively, if we want to ground it in, like, “well they’d appreciate the value of self-consistency if they thought about it more”, then that’s circular question-begging, because it’s already granting a special status to deliberation.
I mostly agree with this (see here). My meta-ethical stance is kinda more nihilism-adjacent when compared to Eliezer (& Nate, Habryka, etc.) who are more moral-realism-adjacent. For example they’ll casually refer to “the future’s potential value” as if it’s a meaningful metric that is canonical and characteristic of humanity as a whole, not just value-from-a-particular-person’s-perspective, nor value-relative-to-a-certain-semi-arbitrary-operationalization-of-the-details-of-CEV, etc.
That said, we do face an issue that I happen to expect an ASI singleton in my lifetime, and its preferences will determine the future, for better or worse. Things like CEV / Long Reflection seem to have promise as political projects—like, flags that lots of people might feel motivated to rally around, because they all feel enthusiastic about the future that this would lead to, and which I personally also feel enthusiastic about (well, at least potentially, the details matter). They certainly seem less bad and unfair than lots of other options. Are the CEV / Long Reflection results well-defined and independent of arbitrary details of the deliberation process? My guess is: Probably not! But oh well, we have to do something, and there aren’t obviously better options.
Most of your comment seems specific to LLMs, and I don’t work on those, so no opinion.
Most of the humans whom I’ve seen put forward as moral and ethical exemplars (people who’ve foster-parented dozens or hundreds of children, donated organs to strangers, saved refugees from famine, war, persecution, or all three, spoken out against institutional violence at great personal risk, etc.) have based those actions on something closer to a virtue ethical or deontological framework than a consequentialist utilitarian one.
This might be tangential to your larger point, but based on your list of examples, I think you (like most people) are implicitly using virtue-ethics as a rubric to judge which humans are most praiseworthy. So it’s no surprise that the winners are generally acting out of virtue-ethics. By contrast, if you ask a utilitarian which humans are most praiseworthy, they would be less likely to mention the foster-parents etc., and much more likely to mention, like, Norman Borlaug, Bill Gates, these people, etc. And I would guess that those latter people would be somewhat more consequentialist-utilitarian than average in how they choose their actions. (That’s just a guess, I don’t know much about most of them, except that I watched a biopic of Bill Gates once and he didn’t come across as extremely stereotypically virtuous.)
(I’m making a narrow point that you used a circular argument, I am not trying to imply here that AIs should or shouldn’t be virtuous. But see this comment.)
Thanks. I just edited the OP to say that my original text might be an overstatement.
I still think the stopgap plan doesn’t help me-in-particular, because I’m working on how to install goals in brain-like AGIs, and I have ideas that seem promising but only work for a limited number of goals (they kinda have to be simple, concrete, “atomic”, and/or directly related to people’s feelings, and/or have a ground truth that can be calculated explicitly, more-or-less). This thing we’re talking about here (involving a distinction between the supervisor’s instrumental vs terminal goals) is pretty complex and abstract, and not something I have any good idea of how to install as a goal / motivation, alas.
LLMs are pretty different, no comment on that.
I feel like some of the stuff about “nitpicking” / “non central objections” / principle of charity / etc. is people talking past each other regarding two different things.
The FIRST THING is “non-load-bearing errors”. An unusually clear-cut example would be: Alice publishes a math proof, and the summation in equation (17) starts from 0 when it’s supposed to start from 1. It’s kinda obvious from context that it’s supposed to start from 1, and the proof as a whole would be valid once that’s corrected, but it’s still an error as written. Bob reads the manuscript and points out the mistake to Alice.
The SECOND THING is “Gricean failures”. An unusually clear-cut example would be: Alice says “I need to fill my car with gas”, and Bob says, “Well, no, you mean fill the car’s gas tank with gas. You’re not gonna be closing the doors and pouring gasoline through the windows onto the seats!!”
Hopefully we can all agree that Bob is being helpful in the first example and unhelpful (and annoying) in the second example. Outside of formal contexts like math, communication is always hard, and always involves imperfect analogies, ambiguities, etc. The speaker can and should do what they can to make the listener’s job easier, but ultimately the listener will inevitably need to apply at least some interpretive effort, using the principle of charity, to figure out what the speaker probably intended. Hence Grice’s maxims.
(I think my two chosen examples are at opposite extremes of a spectrum, with shades of gray in between, as opposed to “non-load-bearing errors” versus “Gricean failures” being two discrete categories.)
So anyway, I feel like at least some of the dispute is that some people are accusing Said of doing the annoying & unhelpful second thing (“Gricean failures”), and then the OP (and Said himself) are reacting with horror to the idea that people don’t want to be apprised of the first thing (“non-load-bearing errors”).
(I’m not very familiar with Said (I don’t recall him commenting on my posts ever?) so don’t have a very strong opinion either way, but I just read the famous 2018 comment on “Zetetic Explanation”, and I think I’d vote for this comment being an example of the bad second thing, not the good first thing.)
I can’t find the book you’re thinking of. :( [Could it be this one??]
Do you remember where you saw / heard that, or what it’s based on?
3 UPDATES TO MY OLD BLOG POSTS:
(1) I wrote a thing in 2023 about narcissistic personality disorder, but oops, it was wrong, so I struck it out. See “Valence” series §5.5. (I stand by the other parts of that post.)
[UPDATE MAY 2026: I no longer endorse this section (although it has some suggestive ideas that I think have nonzero overlap with the truth). I have a different theory of NPD now, informed by general aspects of human social instincts that I hadn’t yet figured out when I wrote this in 2023. I’ll hopefully write up “My Model of NPD, Take 2” when I get a chance. In the meantime, the extremely brief version (subject to change!) of my new take is something like: NPD starts from unusually strong physiological arousal upon receiving someone else’s attention (eye contact etc.). And then that has various downstream consequences that you can reason through after studying my post Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking (2025).]
(2) I added a new section §3.3 to Against empathy-by-default (2024). I think the post was fine as written, but it had an overall vibe that insinuated an overly-simplistic takeaway message that needed some nuance. I didn’t appreciate that until later, but now I do, and I don’t want people to get the wrong idea.
3.3 (Added April 2026) Things vaguely analogous to the empathy-by-default argument can happen for certain visceral reactions—just not for the core motivation / reward / RL system
In the above (especially §3.1), I think I conveyed a general vibe that within-lifetime learning will eventually, inexorably, correct all “errors” in learned brain models. But a while after writing this post, I came to better appreciate how certain visceral reactions in the brain can be set up so as to sometimes prevent updates (“corrections”). This is how people can wind up with stable phobias, and stable food-aversions, and stable traumas, and stable autistic “special interests”, and so on, even when there’s no particular innate “ground truth” underlying them. These stable “errors” are the exception not the rule, but it’s interesting that they exist at all, and that they can in some cases last a lifetime.
I discuss the algorithmic trick behind these in a later post: “Perils of under- vs over-sculpting AGI desires” (2025), specifically §6.2: “‘Defer-to-predictor mode’ in visceral reactions, and ‘trapped priors’”. In brief, the trick centers around what I call “defer-to-predictor mode”, where e.g. a visceral expectation of imminent disgust can cause an actual disgust reaction. But there’s a loop-y thing, wherein the actual disgust is in turn the ground truth for how we learn that a visceral expectation of disgust is warranted. Thanks to this loop-y thing, we can wind up without any error signal telling our brains that the disgust was never warranted in the first place.
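(To make the loop concrete, here is a toy sketch. It is not from the post: the `max` rule for combining the innate signal with the expectation, the learning rate, and the function names are all illustrative assumptions. It only shows how training a predictor on a reaction that the predictor itself caused can leave zero error signal, whereas an ordinary predictor would unlearn the expectation.)

```python
def ordinary_step(prediction, innate_signal, lr=0.2):
    """Ordinary predictor: the innate signal alone is the ground truth."""
    return prediction + lr * (innate_signal - prediction)


def defer_step(prediction, innate_signal, lr=0.2):
    """Toy defer-to-predictor mode (assumed rule): the actual reaction is the
    larger of the innate signal and the predictor's own expectation, and that
    actual reaction is what the predictor is trained on."""
    actual_reaction = max(innate_signal, prediction)
    return prediction + lr * (actual_reaction - prediction)


ordinary = deferred = 0.8          # both start out expecting strong disgust
for _ in range(50):                # 50 exposures with no innate trigger at all
    ordinary = ordinary_step(ordinary, innate_signal=0.0)
    deferred = defer_step(deferred, innate_signal=0.0)

print(ordinary, deferred)
```

(In this toy version, the ordinary predictor decays toward zero, while the deferring one never moves, because its error term is exactly zero on every exposure.)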
…
(3) Neuroscience of human social instincts: a sketch (2024) got another batch of changes (following a different set of important changes just last month), cleaning and streamlining a kinda muddled and partly-incorrect discussion of learning rate modulation, including by introducing new figures and new terminology.
2026-05-19: I rewrote §3–§5.1 to remove unnecessary complication, and clean up some errors and muddled thinking. More details:
In §3.2, I previously had a toy example of learning rate modulation in the thought assessors, where I was daydreaming about Taylor Swift, and then I suddenly orient to a spider jumping at me, and the learning rate modulation (I argued) was necessary to prevent learning that Taylor Swift is a risk factor for spiders jumping out at me. I do think that’s an actual solution to an actual problem, and that it’s implemented in the brain partly via the well-known “cholinergic interneuron pause” in response to (generalized) orienting reflexes. But I described this example poorly (and somewhat incorrectly), and more importantly it’s an example that’s not directly related to this post, and I think it was just causing unnecessary confusion (even I was confused when I re-read it). So I switched to a new example that overlaps much more with §4. I also deleted the discussion of learning rate modulation in the Thought Generator, which I decided was somewhat misleading and confusing as written, and off-topic anyway.
That change to §3, in turn, allowed me to shorten and streamline §4, including in ways that hopefully made §5.1 a bit clearer in turn.
The new version introduces and uses a new term I just made up, “interoceptive concept finder”, for a particular type of short-term predictor.
(The archival PDF is now up to version 3.)
In case you missed the round of edits last month, here’s that changelog entry as well:
2026-04-30: I changed terminology from “the ‘thinking of a conspecific’ flag” to “the social attention reflex”. I think the new term has better connotations, especially the way it invokes a parallel to “orienting reflex” and “startle reflex”, which likewise are associated with fast, transient, and involuntary changes in both attention and other innate signals like pleasure and arousal.
Relatedly, I deleted a few words suggesting that the social attention reflex is more likely to be in the medial hypothalamus than the lateral hypothalamus. My old term (“thinking of a conspecific” flag) suggested a social-related state variable, which struck me as more medial-ish. But now I’m thinking of it more as a fast reflex, which strikes me as more lateral-ish if anything. But I dunno, I’m just guessing.
I also dramatically shortened and simplified §6.1: (“Key idea: My ‘compassion / spite circuit’ is disproportionately active and important while the conspecific is thinking about me-in-particular”). I decided that this is a pretty straightforward point, and I was making it unnecessarily complicated.
Other minor wording tweaks (especially §3.2) for clarity.
(Copying a discussion I had elsewhere.)
THEM: The gating’s not selective. When the spider shows up in the dark corner, the argument predicts I get scared of everything co-active in cortex: the spider, that corner, the person who happens to be standing next to me, the kind of fabric I happen to be wearing that day, etc. Where does it end?
ME: I think fewer things are “co-active in the cortex” than you suggest. I think attention flits around like ten times a second, and my whole argument in that section was that involuntary attention would ensure that I’m mostly thinking about the spider when the corresponding visceral thought assessor update happens.
THEM: Let me try a couple specific examples.
You and I are exploring a dark basement together. A spider lands on you. I get more scared of spiders, plausibly more scared of dark corners, not more scared of you.
More sharply, I’m exploring a dark basement alone, and a spider lands on an old unused exercise bike. I don’t think I’m going to get more scared of that or any other exercise bike.
I think in these examples, the spider lands on you or the exercise bike, so I’m going to be paying substantive attention to you / the bike?
ME: Hmmm. You’re right about my previous reply. But I think I kinda bite those bullets. I think something visceral is learned, and will manifest in the future, but calling the result “I am scared of [blah]” has some wrong connotations.
For one thing, the ground-truth reaction from seeing a spider is generally stronger than the defer-to-predictor anticipation of that reaction. So e.g. if you have strong reason to believe that a spider might jump out at you soon, you might say “I’m scared right now”, but you might also say something more specific: “I’m scared that a spider will jump out at me”. The nervousness is real and unpleasant, but the actual spider would be worse.
For another thing, if a spider jumps out from an exercise bike once ever, then the short-term predictor is learning something like: “Exercise bike is weak evidence of danger, AND this particular basement is weak evidence of danger, AND this one specific exercise bike is weak evidence of danger, AND this lighting condition is weak evidence of danger, …”. And then later you see a different exercise bike in a different location, different lighting, etc. The predictor would see this as quite weak but nonzero evidence for physiological arousal, and maybe it would be too weak to notice. How would the evidence become stronger? (A) If you see that same exercise bike in the same basement in the same lighting, that would add up to stronger evidence. Also, (B) if you see spiders jumping out of five exercise bikes in five different contexts over the course of a few months, then the predictor will keep strengthening and strengthening the connection from “exercise bike” to physiological arousal, until the effect is very noticeable. I think both of those match my experience.
Also, if a predictor learns that “exercise bike” is weak-but-nonzero evidence for physiological arousal, and then you see a bunch of other exercise bikes where nothing goes wrong, presumably that weak evidence is erased (or overridden by a different system) (cf. “extinction” in psych jargon).
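(A toy numerical sketch of the last two paragraphs, under assumptions that go beyond anything stated here: I’m treating the short-term predictor as a linear sum over whichever cues are active, trained with a simple delta rule, and the cue names and learning rate are made up. It also models extinction as plain erasure, not as an override by a different system.)

```python
def predict(weights, features):
    """Predicted arousal: sum of the weights of whichever features are active."""
    return sum(weights.get(f, 0.0) for f in features)


def update(weights, features, outcome, lr=0.1):
    """Delta-rule step: move every active feature's weight toward the outcome."""
    error = outcome - predict(weights, features)
    for f in features:
        weights[f] = weights.get(f, 0.0) + lr * error
    return weights


# One spider jumps out from one exercise bike (outcome 1.0 = strong arousal).
scary = ["exercise bike", "this basement", "dim lighting"]
once = update({}, scary, outcome=1.0)
same_context = predict(once, scary)                      # (A) all cues recur
new_context = predict(once, ["exercise bike", "gym"])    # only one cue recurs

# (B) Spiders jump out of five bikes in five otherwise unrelated contexts.
repeated = {}
for i in range(5):
    update(repeated, ["exercise bike", f"place {i}"], outcome=1.0)
bike_after_five = repeated["exercise bike"]

# Extinction: many later bikes in new places where nothing goes wrong.
for i in range(20):
    update(repeated, ["exercise bike", f"safe place {i}"], outcome=0.0)
bike_after_safe = repeated["exercise bike"]

print(same_context, new_context, bike_after_five, bike_after_safe)
```

(In this sketch, a single event spreads a small weight across every co-active cue, so a lone exercise bike in a new place predicts much less arousal than the full original context does. Repetition across contexts concentrates weight on the one cue they share, and a run of uneventful exposures shrinks it again.)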
Oh, hmm, good point, thanks. Let me try again:
When I think of humans who get difficult things done, or figure difficult things out, they tend to care about accomplishing those things, a lot, and in a direct and explicit way, not just e.g. as a facet of what kind of person they see themselves as. I mean, maybe “what kind of person I see myself as” has something to do with how they originally came to care about those things, but it’s not what they’re explicitly thinking about. They’re thinking directly about the object-level prize at the end of the journey, and how to get that prize.
E.g. plenty of climate change activists think of climate change activism as a good and virtuous thing to do, but I think the subset of climate change activists who are really moving the needle are the ones who are directly thinking about climate change being directly bad, and really want it to stop, and are focused directly on how to make that happen.
E.g. plenty of mathematicians think of math as a good and praiseworthy activity, but I think that the person who will solve the Riemann hypothesis will be a person who is (in addition to being smart etc.) really damn curious about why the Riemann hypothesis is true, and focused directly on figuring that out. Or they’re really damn eager to become famous by solving the Riemann hypothesis, or whatever else.
It seems to me that this is a general pattern—i.e., we need direct-consequentialism not just consequentialism-incidentally-arising-from-virtue to accomplish difficult novel tasks—and my hunch is that this pattern generalizes to brain-like AGI. If so, then we will face the problem of balancing consequentialist direct top-level goals with non-consequentialist direct top-level goals, rather than merely facing the (probably easier) problem of avoiding the former altogether.
(This is all a lightly-held opinion.)
FWIW I also have never got what is supposedly ordinal about the simulacrum levels beyond 1, the honest one. The other ‘levels’ just look like various orthogonal breeds of fakery, to me. Haven’t scrutinised deeply.
(Off-topic but fun) I think it’s at least somewhat ordinal, e.g. Zvi’s “Level 1: Symbols describe reality. Level 2: Symbols pretend to describe reality. Level 3: Symbols pretend to pretend to describe reality. Level 4: Symbols need not pretend to describe reality.”
See also Thane’s attempt.
(On reflection, this is more a semi-redundant riff on what you already wrote, and less a good responsive comment, but oh well, I already wrote it so I guess I’ll hit publish.)
When I think about the challenges of applying Solomonoff induction in practice, challenges that the scientific method was designed around, I see two main things.
The first, as you point out, is that hypotheses are sufficiently modular that we CAN develop them piecemeal, and sufficiently complicated that we MUST develop them piecemeal. Thus, scientific hypotheses (being just one modular piece of a “real” hypothesis) can “remain agnostic” about certain observations. As I joked here, “if you treat The Law Of Conservation Of Energy as a ‘hypothesis’, and you ‘ask’ Conservation Of Energy what the half-life of tritium is, then Conservation Of Energy will tell you ‘Huh? How should I know? Why are you asking me?’” This property (that hypotheses can be agnostic about things) is also characteristic of logical induction / prediction markets (as you point out), and infra-Bayesianism has that property too.
The second is that parsimony / Occam’s razor / Solomonoff prior is central to finding the truth, but scientists range from being imperfect at assessing the complexity-vs-parsimony of a theory, to being atrociously bad at it. So the scientific enterprise is set up to rely as little as possible on complexity-assessments. Thus, as you point out, if we could perfectly assess the complexity-vs-parsimony of a hypothesis, then there would be no need to treat prediction and retrodiction differently. The retrodiction problem is about putting too many bits into the hypothesis, and it’s only a problem because people are lousy at taking a theory and reading out the number of bits in it (i.e. they don’t notice epicycles and special pleading). So again, the scientific institution is set up to minimize reliance on complexity-assessments. But that minimum reliance is still higher than zero. You just can’t get away from it entirely. Even “data” is not theory-free, because you need theory to get from “raw” data to so-called observations.
The results are different in different fields (and sometimes pathological), as a field might or might not equilibrate to a state where practitioners with the sharpest discernment of complexity-vs-parsimony command the most respect and power and sway. See my comments here & here for hot-take examples from real-world academic fields.
So anyway, if I were working on this project, the first thing I would try is to say that the “ideal” is Solomonoff induction searching for a true hypothesis which happens to be modular (i.e. it has different pieces covering different, cleanly-separable domains), and then introduce a constraint that you can only measure bits-of-complexity with an extremely noisy ruler, and try to judge truth-seeking setups like LI etc. by how well they approximate the Solomonoff ideal under those assumptions / constraints. (...But I dunno, I didn’t think about it very hard.)
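To make the “noisy ruler” idea a bit more concrete, here's a toy sketch. Everything in it is an assumption for illustration: two made-up hypotheses, made-up bit counts and log-likelihoods, Gaussian noise on the complexity measurement, and a Solomonoff-style score of (negative complexity in bits) plus (log2 likelihood). It is not a real implementation of Solomonoff induction, which is uncomputable anyway.

```python
import random


def score(measured_bits, log2_likelihood):
    """Solomonoff-style log2 score: prior 2^-K times likelihood."""
    return -measured_bits + log2_likelihood


def pick(hypotheses, ruler_noise_bits, rng):
    """Pick the best-scoring hypothesis when complexity is measured noisily."""
    def noisy_score(h):
        measured = h["bits"] + rng.gauss(0.0, ruler_noise_bits)
        return score(measured, h["log2_lik"])
    return max(hypotheses, key=noisy_score)["name"]


def error_rate(hypotheses, ruler_noise_bits, trials=2000, seed=0):
    """Fraction of trials in which the over-fitted hypothesis wins."""
    rng = random.Random(seed)
    wrong = sum(
        pick(hypotheses, ruler_noise_bits, rng) != "simple"
        for _ in range(trials)
    )
    return wrong / trials


# Retrodiction only: "epicycles" was tuned to fit the old data equally well,
# so the complexity term is the only thing that separates the two.
retro_only = [
    {"name": "simple", "bits": 20, "log2_lik": -5},
    {"name": "epicycles", "bits": 50, "log2_lik": -5},
]
# Add fresh predictions, which the tuned hypothesis fits badly (made-up numbers).
with_predictions = [
    {"name": "simple", "bits": 20, "log2_lik": -5 - 8},
    {"name": "epicycles", "bits": 50, "log2_lik": -5 - 60},
]

exact_ruler = error_rate(retro_only, ruler_noise_bits=0.0)
noisy_retro = error_rate(retro_only, ruler_noise_bits=40.0)
noisy_pred = error_rate(with_predictions, ruler_noise_bits=40.0)
print(exact_ruler, noisy_retro, noisy_pred)
```

With an exact ruler, retrodiction alone picks the simple hypothesis every time in this toy. With a noisy ruler it often doesn't, and fresh predictions reduce (but don't eliminate) the dependence on the ruler.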