I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
The purpose of studying LDT would be to realize that the type signature you currently imagine Steve::consequentialist preferences to have is different from the type signature that Eliezer would imagine.
The starting point for the whole discussion is a consequentialist preference—you have desires about the state of the world after the decision is over.
You can totally have preferences about the past that are still influenced by your decision (e.g. Parfit’s hitchhiker).
Decisions don’t cause future states, they influence which worlds end up real vs counterfactual. Preferences aren’t over future states but over worlds—which worlds would you like to be more real?
AFAIK Eliezer only used the word “consequentialism” in abstract descriptions of the general fact that you (usually) need some kind of search in order to find solutions to new problems. (Like, I think he’s just using a new word for what he used to call optimization.) Maybe he also used the outcome pump as an example, but if you asked him how consequentialist preferences look in detail, I’d strongly bet he’d say sth like preferences over worlds rather than preferences over states in the far future.
However, we would like to diversify the public face of MIRI and potentially invest heavily in a spokesperson who is not Eliezer, if we can identify the right candidate.
Is this still up to date?
To me it seems a bit surprising that you say we agree on the object level, when in my view you’re totally guilty of my 2.b.i point above of not specifying the tradeoff / not giving a clear specification of how decisions are actually made.
I also think the utility maximizer frame is useful, though there are 2 (IMO justified) assumptions that I see as going along with it:
There’s sth like a simplicity prior over the space of utility functions (because there needs to be some utility maximizing structure implemented in the AI).
The utility function is a function of the trajectory of the environment. (Or, in an even better formalization, it may take as input a program which is the environment.)
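To make the type-signature point concrete, here is a minimal sketch in my own notation (not anything Eliezer wrote down):

```latex
% ``farfuturepumping'': preferences over the state of the world after the decision
U_{\text{state}} : \mathcal{S} \to \mathbb{R}
% what I mean here: preferences over whole trajectories/histories of the environment
U_{\text{traj}} : \text{Trajectories}(\mathcal{S}) \to \mathbb{R}
% or, in the stronger formalization, over environment programs themselves
U_{\text{prog}} : \text{Programs} \to \mathbb{R}
```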
I think using a learned value function (LVF) that computes the valence of thoughts is a worse frame to use for tackling corrigibility, because it’s harder to clearly evaluate what actions the agent will end up taking. And because this kind of “imagine some plan and what the outcome would be and let the LVF evaluate that” doesn’t seem to me to be how smarter-than-human minds operate—considering what change in the world an action would cause seems more natural than whether some imagined scene seems appealing. Even humans like me move away from the LVF frame, e.g. I’m trying to correct for the scope insensitivity of my LVF by doing sth more like explicit expected utility calculations.[1]
“You’re wrong that this supposed mistake that you attribute to Eliezer is a path through which we can solve the alignment problem, and Eliezer doesn’t emphasize it because it’s an unimportant dead-end technicality” (maybe! I don’t claim to have a solution to the alignment problem right now; perhaps over time I will keep trying and failing and wind up with a better appreciation of the nature of the blockers).
I’m more like “Your abstract gesturing didn’t let me see any concrete proposal that would make me more hopeful, and even if good proposals are in that direction, it seems to me like most of the work would still be ahead, instead of it being like ‘we can just do it sorta like that’ as you seem to present it. But maybe I’m wrong and maybe you have more intuitions and will find a good concrete proposal.”
I don’t follow what you think Eliezer means by “consequentialism”. I’m open-minded to “farfuturepumping”, but only if you convince me that “consequentialism” is actually misleading.
Maybe study logical decision theory? Not sure where to best start but maybe here:
“Logical decision theories” are algorithms for making choices which embody some variant of “Decide as though you determine the logical output of your decision algorithm.”
Like consequentialism in the sense of “what’s the consequence of choosing the logical output of your decision algorithm in a particular way”, where consequence here isn’t a time-based event but rather the way the universe looks conditional on the output of your decision algorithm.
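As a toy illustration of that reading (standard Newcomb’s problem; this sketch is mine, not from the linked page):

```python
# Newcomb's problem: the predictor has already filled the boxes based on a
# prediction of the output of your decision algorithm. The "consequence" of
# outputting one-box is not a later event it causes, but which world is real:
# the one where the opaque box already contains $1,000,000.

def world_given_output(output: str) -> int:
    """Total payoff of the world that is real, conditional on your algorithm's output."""
    opaque_box = 1_000_000 if output == "one-box" else 0  # predictor mirrors your algorithm
    transparent_box = 1_000
    return opaque_box if output == "one-box" else opaque_box + transparent_box

# A pure "preferences over states caused after the decision" view says the box
# contents are already fixed, so two-boxing dominates; conditioning on the
# logical output of the decision algorithm instead recommends one-boxing.
best = max(["one-box", "two-box"], key=world_given_output)
print(best, world_given_output(best))  # -> one-box 1000000
```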
[1] I’m not confident those are the only reasons why LVF seems worse here; I haven’t fully articulated my intuitions yet.
I want to note that it seems to me that Jeremy is trying to argue you out of the same mistake I tried to argue you out of in this thread.
The problem is that you use “consequentialism” differently than Eliezer means it. I suppose he only used the word on a couple of occasions where he tried to get across the basic underlying model without going into excessive detail, and it may read to you like your “far future outcome pumping” matches the descriptions there (though back when I looked over the support you cited for what Eliezer means, the evidence didn’t seem at all to point to this interpretation). But if you get a deep understanding of logical decision theory, or you study a lot of MIRI papers (where the utility of agents is iirc always over trajectories of the environment program[1]), you’ll see what Eliezer’s deeper position is.
Probably not worth the time to further discuss what certain other people do or don’t believe, as opposed to what’s true.
I think you’re strawmanning Eliezer and propagating a wrong understanding of what “consequentialism” was supposed to refer to, and this seems like an important argument to have separately from what’s true. But a good point that we should distinguish arguing about this from arguing about what’s true.
Going forward, I suggest you use another word like “farfuturepumping” instead of “consequentialism”. (I’ll also use another word for Eliezer::consequentialism and clarify it, since it’s apparently often misunderstood.)
As a quick summary, which may not be easily understandable due to inferential distance, I think that Eliezer and I both think that:
Smart AIs will be utility optimizing, but this utility is over computations/universe-trajectories, not future states.
This is a claim about what AI cognition will look like, not just a claim that its behavior will be coherent according to some utility function. Smart AIs will think in some utility-maximizing ways, even though it initially may be quite a mess where it’s really hard to read off what values are being optimized for, and the values may change a bit as the AI changes.
Coherence arguments only imply that a coherent agent will behave as if it optimized a utility function; they say nothing about what cognitive algorithm the agent uses. There’s an extra step needed to get to cognitive utility maximization, and AFAIK it hasn’t been explained well anywhere, but maybe it’s sorta intuitive?
It’s perfectly alright to have non-farfuturepumping preferences like you describe, but just saying it’s possible isn’t enough; you actually need to write down the utility function over universe-trajectories.
This is because if you just say “well it’s possible, so there”, you may fail to think concretely enough to see how a utility function that has the properties you imagine would actually be quite complex, and thus unlikely to be learned.
Why can’t you have a utility function but also other preferences?
There needs to be some tradeoff between the utility function and the other preferences, and however you choose this, the result can be formalized as a utility function. If you don’t do this you can engage in abstract wishful thinking where you imagine a different tradeoff for different cases and thereby delude yourself into thinking your proposal works robustly. (See the sketch after this list.)
Why can’t you just specify that in some cases utility function u1 should be used, and in others u2 should be used?
Because when u1 is used, there’s an instrumental incentive to modify the code of the AI s.t. u1 is always used. You want reflective consistency to avoid such problems.
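A minimal sketch of the tradeoff point, in my own notation: whatever rule you use to trade off u1 against the other preferences can itself be written as a single utility function over trajectories, e.g.

```latex
u(\tau) \;=\; \lambda\, u_1(\tau) \;+\; (1-\lambda)\, u_{\text{other}}(\tau), \qquad \lambda \in [0,1].
```

If you never commit to a concrete u like this, you can imagine a different effective λ for each scenario you consider, which is the wishful-thinking failure mode above. And “use u1 in some situations and u2 in others” is not of this form: there each decision maximizes a different function depending on context, which is what creates the self-modification incentive.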
I would recommend you to chat with Jeremy (and maybe reread our comment thread).
[1] Yes, utility is often formalized over the space of outcomes, but the space of outcomes is iirc the space of trajectories.
The authors propose to get an international treaty to pause progress towards superintelligence, including both scaling & R&D. I’m for it, although I don’t hold out much hope for such efforts to have more than marginal impact. I expect that AI capabilities would rebrand as AI safety, and plow ahead:
The problem is: public advocacy is way too centered on LLMs, from my perspective. Thus, those researchers I mentioned, who are messing around with new paradigms on arXiv, are in a great position to twist “Pause AI” type public advocacy into support for what they’re doing!
[...]
I think these people are generally sincere but mistaken, and I expect that, just as they have fooled themselves, they will also successfully fool their friends, their colleagues, and government regulators…
This seems way too pessimistic to me. (Or like, sure, it’s going to be hard and I’m not super optimistic, but given that you’re also relatively pessimistic, the international AI R&D shutdown approach doesn’t seem too unpromising to me.)
Sure they are going to try to convince government regulators that their research is great for safety, but we’re going to try to convince the public and regulators otherwise.
I mean it’s sorta understandable to say that we currently seem to be in a relatively weak position and getting sufficient change seems hard, but movements can grow quickly. Yeah understandable that this doesn’t seem super convincing, but I think we have a handful of smart people who might be able to find ways to effectively shift the gameboard here. Idk.
More to the point though, conditional on us managing to internationally ban AI R&D, it doesn’t obviously seem that much more difficult or that much less likely that we also manage to ban AI safety efforts which can lead to AI capability increases, based on the understanding that those efforts are likely delusional and alignment is out of reach. (Tbc I would try to not ban your research, but given that your agenda is the only one I am aware of into which I put significantly more than 0 hope, it’s not clear to me that it’s worth overcomplicating the ban around that.)
Also, in this common-knowledge problem domain, self-fulfilling prophecies are sorta a thing, and I think it’s a bit harmful to the cause if you post on twitter and bluesky that you don’t have much hope in government action. Tbc, don’t say the opposite either, keep your integrity, but maybe leave the criticism on lesswrong? Idk.
Can you make “sort by magic” the default sort for comments under a post? Here’s why:
The problem: Commenting late on a post (after the main reading peak) is disincentivized, not only because fewer people will read the post and look over the comments, but also because most people only look over the top scoring comments and won’t scroll down far enough to read your new comment. This also causes early good comments to continue to accumulate more karma because more people read those, so the usual equilibrium is that early good comments stay on top and late good comments don’t really get noticed.
Also, what one cares about for sorting is the quality of a comment, and the correct estimator for that would be “number of upvotes per view”. I don’t know how you calculate magic, but it seems very likely to be a better proxy for this than top scoring. (If magic doesn’t seem very adequate and you track page viewcounts, you could also get a more principled new magic sort, though you’d have to track for each comment what viewcount the page had at the time the comment was posted. Like if the average ratio of upvotes to views is a/b, you could assign each comment a score of (upvotes+a)/(page_views_since_comment_was_posted+b), and sort descending by score.)
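A minimal sketch of that scoring rule (the field names here are hypothetical; I don’t know what data the site actually tracks):

```python
def comment_score(upvotes: int, views_since_posted: int,
                  prior_upvotes: float = 1.0, prior_views: float = 20.0) -> float:
    """Smoothed upvotes-per-view: (upvotes + a) / (views + b), where a/b is the
    site-wide average upvote-to-view ratio, acting as a prior for new comments."""
    return (upvotes + prior_upvotes) / (views_since_posted + prior_views)

# Sort descending by score: a late comment with few views but a good upvote ratio
# can outrank an early comment that merely accumulated exposure.
comments = [{"id": 1, "upvotes": 40, "views": 900},
            {"id": 2, "upvotes": 6, "views": 80}]
ranked = sorted(comments, key=lambda c: comment_score(c["upvotes"], c["views"]),
                reverse=True)
print([c["id"] for c in ranked])  # -> [2, 1]
```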
Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
I mean, nearest unblocked strategies are rather a problem in the optimistic case where the AI learns “don’t be misleading”, but given that, yeah, sorta (though I wouldn’t say irrelevant, only that even if you do get a “don’t be misleading” preference, it’s not a robust solution). Not that it’s impossible to get it right in a way such that the AI behaves as desired, but I think current proposals aren’t specified concretely enough for us to say they don’t run into undesirable nearest unblocked strategies.
One particular problem is that preferences which aren’t over world trajectories aren’t robust:
Preferences over world trajectories are robust in the sense that if you imagine changing that preference, this ranks poorly according to that preference.
Myopic preferences that just trigger given a context aren’t robust in that sense—they don’t assign negative value to suggestions of removing that preference for future occasions.
Say I need to walk to work, but the fastest route goes through a passage that smells really bad, so it’s unpleasant to walk through. When I then think of a plan like “I can wear a mask that filters the air so I don’t smell anything bad”, this plan doesn’t get rejected.
A preference over world trajectories, which yields significant negative utility for every time I walk through a passage that smells bad, would be more robust in this sense.
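A toy sketch of the robustness difference (my own framing, hypothetical numbers):

```python
# Two ways to encode "I don't like walking through the smelly passage".

def myopic_penalty(imagined_thought: str) -> float:
    """Fires only when the currently imagined scene itself contains the bad smell."""
    return -10.0 if "smell bad" in imagined_thought else 0.0

def trajectory_penalty(trajectory: list[str]) -> float:
    """Penalizes every timestep of the whole future in which the bad smell occurs."""
    return -10.0 * sum("smell bad" in step for step in trajectory)

plan = ["take pill that deletes my smell aversion",
        "walk through passage, smell bad",
        "walk through passage, smell bad"]

# The myopic version doesn't object to the self-modification step (that thought
# contains no bad smell), while the trajectory version ranks the plan poorly.
print(myopic_penalty(plan[0]))   # -> 0.0
print(trajectory_penalty(plan))  # -> -20.0
```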
So I currently think the relevant preferences are preferences over world trajectories, and other more general kinds of preferences are better modeled as constraints for the world-trajectories-valuing part to optimize around. I know humans often have short-term preferences that get triggered myopically, but for very impressive accomplishments by humans there probably was a more long-term coherent goal that was being aimed at.
I do not know how exactly you imagine a “don’t be misleading” preference to manifest, but I imagined it more like the myopic smell preference, in which case there’s optimization pressure from the more long-term coherent parts to remove this myopic preference / prevent it from triggering. (Tbc, it’s not like that would be useless, it could still be that this suffices to make the first working plan in the search ordering a desirable one, especially if the task which we want the AI to do isn’t absurdly difficult.)
(But even if it takes a more world-trajectory form of “I value not being misleading”—which would be good because it would incentivize planning to maintain that preference—there may still be problems, because “not being misleading” is a fuzzy concept which has to be rebound to a more precise concept to evaluate plans, and it might not rebind in a desirable way. And we didn’t yet specify how to trade off the “not being misleading” value against other goals.)
I do think there’s optimization pressure to steer for not being caught being misleading, but I think it’s rather because of planning how to achieve other goals while modelling reality accurately, instead of the AI learning to directly value “don’t get caught being misleading” in its learned value function.
Though possibly the AI could still learn to value this (or alternatively to value “don’t be misleading”), but in such a case these value shards seem more like heuristic value estimators applied to particular situations, rather than a deeply coherent utility specification over universe-trajectories. And I think such other kinds of preferences are probably not really important when you crank up intelligence past the human level, because they will be seen as constraints to be optimized around by the more coherent value parts, and you run into nearest unblocked strategy problems. (I mean, you could have a preference over universe trajectories that at no timestep you be misleading, but given the learning setup I would expect a more shallow version of that preference to be learned. Though it’s also conceivable that the AI rebinds its intuitive preference to yield that kind of coherent preference.)
So basically I think it’s not enough to just get the AI to learn a “don’t be misleading” value shard: the problems are that (1) it might be outvoted by other shards in cases where being misleading would be very beneficial, and (2) the optimization for other goals might find edge instantiations that are basically still misleading but don’t get classified as such. So we’d need it to learn the shard in exactly the right way.
(I have an open discussion thread with Steve on his “Consequentialism and Corrigibility” post, where I mainly argue that Steve is wrong about Yud’s consequentialism being just about future states and that it is instead about values over universe trajectories like in the corrigibility paper. IIUC Steve thinks that one can have “other kinds of preferences” as a way to get corrigibility. He unfortunately didn’t make it understandable to me how such a preference may look concretely, but one possibility is that he’s thinking of such “accessor of the current situation” kinds of preferences, because humans have such short-term preferences in addition to their consequentialist goals. But I think when one cranks up intelligence the short-term values don’t matter that much. E.g. the AI might do some kind of exposure therapy to cause the short-term value shards to update to intervene less. Or maybe he just means we can have a coherent utility over universe trajectories whose optimum is indeed a non-deceptive strategy, which is true but not really a solution, because such a utility function may be complex and he didn’t specify how exactly the tradeoffs should be made.)
Great post! The over- vs undersculpting distinction currently seems a lot nicer to me than I previously considered the outer- vs inner-alignment distinction to be.
Some comments:
1:
The “over-/undersculpting” terminology seems a bit imperfect because it seems like there might be a golden middle, whereas actually we have both problems simultaneously. But maybe it’s fine because we sorta want sth in the middle, it’s just that hitting a good middle isn’t enough. And it does capture well that having more of one problem might lead to having less of the other problem.
2:
The human world offers an existence proof. We’re often skeptical of desire-changes—hence words like “brainwashing” or “indoctrination”, or radical teens telling their friends to shoot them if they become conservative in their old age. But we’re also frequently happy to see our desires change over the decades, and think of the changes as being for the better. We’re getting older and wiser, right? Well, cynics might suggest that “older and wiser” is cope, because we’re painting the target around the arrow, and anyway we’re just rationalizing the fact that we don’t have a choice in the matter. But regardless, this example shows that the instrumental convergence force for desire-update-prevention is not completely 100% inevitable—not even for smart, ambitious, and self-aware AGIs.
This might not generalize to super-von-Neumann AGIs. Normal humans are legit not optimizing hard enough to come up with the strategy of trying to preserve their goals in order to accomplish their goals.
Finding a reflectively stable motivation system that doesn’t run into the goal-preservation instrumental incentive is what MIRI tried in their corrigibility agenda. They failed because it turned out to be unexpectedly hard. I’d say that makes it unlikely that an AGI will fall into such a reflectively-stable corrigibility basin when scaling up intelligence a lot, even when we try to make it think in corrigible ways. (Though there’s still hope for keeping the AI correctable if we keep it limited and unreflective in some ways etc.)
3:
As an example (borrowing from my post “Behaviorist” RL reward functions lead to scheming), I’m skeptical that “don’t be misleading” is really simpler (in the relevant sense) than “don’t get caught being misleading”. Among other things, both equally require modeling the belief-state of the other person. I’ll go further: I’m pretty sure that the latter (bad) concept would be learned first, since it’s directly connected to the other person’s immediate behavior (i.e., they get annoyed).
I (tentatively) disagree with the frame here, because “don’t get caught being misleading” isn’t a utility-shard over world-trajectories, but rather just a myopic value accessor on the model of a current situation (IIUC). I think it’s probably correct that humans usually act based on such myopic value accessors, but in cases where very hard problems need to be solved, what matters are the more coherent, situation-independent values. So my story for why the AI would be misleading is rather that it plans how to best achieve sth, and being misleading without getting caught is a good strategy for this.
I mean, there might still be myopic value accessor patterns, though my cached reply would be that these would just be constraints being optimized around by the more coherent value parts, e.g. by finding a plan representation where the myopic pattern doesn’t trigger. Aka the nearest unblocked strategy problem. (This doesn’t matter here because we agree it would learn “don’t get caught”, but possible that we still have a disagreement here like in the case of your corrigibility proposal.)
I listened to it via speechify (though you need pro for acceptable listening speed). If you want sth better you could try asking AskWhoCastsAI (possibly offering to pay him).
Seems like a fine time to share my speculations about as-yet-unresolved easter eggs from the story. I’m not overly confident in either of these.
I present some hints first in case you want to try to think about it yourself.
The core (and power) of the Elder Wand
From chapter 122:
Harry took the Elder Wand out of his robes, gazed again at the dark-grey wood that Dumbledore had passed down to him. Harry had tried to think faster this time, he’d tried to complete the pattern implied by the Cloak of Invisibility and the Resurrection Stone. The Cloak of Invisibility had possessed the legendary power of hiding the wearer, and the hidden power of allowing the wearer to hide from Death itself in the form of Dementors. The Resurrection Stone had the legendary power of summoning an image of the dead, and then Voldemort had incorporated it into his horcrux system to allow his spirit to move freely. The second Deathly Hallow was a potential component of a system of true immortality that Cadmus Peverell had never completed, maybe due to his having ethics.
And then there was the third Deathly Hallow, the Elder Wand of Antioch Peverell, that legend said passed from wizard to stronger wizard, and made its holder invincible against ordinary attacks; that was the known and overt characteristic...
The Elder Wand that had belonged to Dumbledore, who’d been trying to prevent the Death of the world itself.
The purpose of the Elder Wand always going to the victor might be to find the strongest living wizard and empower them still further, in case there was any threat to their entire species; it could secretly be a tool to defeat Death in its form as the destroyer of worlds.
But if there was some higher power locked within the Elder Wand, it had not presented itself to Harry based on that guess. Harry had raised up the Elder Wand and spoken to it, named himself a descendant of Peverell who accepted his family’s quest; he’d promised the Elder Wand that he would do his best to save the world from Death, and take up Dumbledore’s duty. And the Elder Wand had answered no more strongly to his hand than before, refusing his attempt to jump ahead in the story. Maybe Harry needed to strike his first true blow against the Death of worlds before the Elder Wand would acknowledge him; as the heir of Ignotus Peverell had already defeated Death’s shadow, and the heir of Cadmus Peverell had already survived the Death of his body, when their respective Deathly Hallows had revealed their secrets.
At least Harry had managed to guess that, contrary to legend, the Elder Wand didn’t contain a core of ‘Thestral hair’. Harry had seen Thestrals, and they were skeletal horses with smooth skin and no visible mane on their skull-like heads, nor tufts on their bony tails. But what core was truly inside the Elder Wand, Harry hadn’t yet felt himself knowing; nor had he been able to find, anywhere on the Elder Wand, the circle-triangle-line of the Deathly Hallows that should have been present.
Previously, in the Azkaban arc, it was also mentioned that the sign of the Deathly Hallows on the invisibility cloak was drawn in thestral blood, binding that part of the thestral’s power into the cloak, to make the wearer as invisible to death’s shadow as thestrals are to the unknowing.
Suppose there’s some structure to it; try to fill out this table:

| Deathly Hallow | Creature | Form of Death |
|---|---|---|
| Invisibility Cloak | Thestral | Death’s shadow (=dementors) |
| Resurrection Stone | ? | Personal Death |
| Elder Wand | ? | maybe Death of Worlds (?) |

My guess
| Deathly Hallow | Creature | Form of Death |
|---|---|---|
| Invisibility Cloak | Thestral | Death’s shadow (=dementors) |
| Resurrection Stone | Unicorn | Personal Death |
| Elder Wand | Centaur | Death of Worlds |

So the second power of the Elder Wand may be some divination power. That would fit well with preventing the Death of Worlds, although it’s a bit unclean to have two explanations for Dumbledore’s divination power.
The true source of Dumbledore’s power
From chapter 86 (emphasis mine):
“The Hall of Prophecy,” Minerva whispered. She’d read about that place, said to be a great room of shelves filled with glowing orbs, one after another appearing over the years. Merlin himself had wrought it, it was said; the greatest wizard’s final slap to the face of Fate. Not all prophecies conduced to the good; and Merlin had wished for at least those spoken of in prophecy, to know what had been spoken of them. That was the respect Merlin had given to their free will, that Destiny might not control them from the outside, unwitting. Those mentioned within a prophecy would have an glowing orb float to their hand, and then hear the prophet’s true voice speaking. Others who tried to touch an orb, it was said, would be driven mad—or possibly just have their heads explode, the legends were unclear on this point. Whatever Merlin’s original intention, the Unspeakables hadn’t let anyone enter in centuries, so far as she’d heard. Works of the Ancient Wizards had stated that later Unspeakables had discovered that tipping off the subjects of prophecies could interfere with seers releasing whatever temporal pressures they released; and so the heirs of Merlin had sealed his Hall.
From chapter 119:
During the First Wizarding War, there came a time when I realised that Voldemort was winning, that he would soon hold all within his hand.
In that extremity, I went into the Department of Mysteries and I invoked a password which had never been spoken in the history of the Line of Merlin Unbroken, did a thing forbidden and yet not utterly forbidden.
I listened to every prophecy that had ever been recorded.
Confusion: Accessing the Hall of Prophecy doesn’t sound like sth that happened for the first time in the history of the Line of Merlin Unbroken.
Notice: Dumbledore’s letter does not strictly say that the forbidden thing Dumbledore did was listening to all the prophecies. Those statements could refer to separate events.
Another useful excerpt from ch 80 (emphasis mine):
This is the Hall of the Wizengamot; there are older places, but they are hidden. Legend holds that the walls of dark stone were conjured, created, willed into existence by Merlin, when he gathered the most powerful wizards left in the world and awed them into accepting him as their chief. And when (the legend continues) the Seers continued to foretell that not enough had yet been done to prevent the end of the world and its magic, then (the story goes) Merlin sacrificed his life, and his wizardry, and his time, to lay in force the Interdict of Merlin.
From chapter 110:
“Distraction? ” roared Dumbledore, his sapphire eyes tight with fury. “You killed Master Flamel for a distraction? ”
Professor Quirrell looked dismayed. “I am wounded by the injustice of your accusation. I did not kill the one you know as Flamel. I simply commanded another to do so.”
“How could you? Even you, how could you? He was the library of all our lore! Secrets you have forever lost to wizardry! ”
Confusion: Dumbledore seems a bit more magically powerful than Voldemort, so minus the Elder Wand he should probably still be at least almost as powerful as Voldemort. Magical power comes mostly from lore, so if Dumbledore’s lore comes from Flamel, then it’s a bit surprising that Voldemort was able to just order someone to kill Flamel.
So how would you resolve those confusions given the hints I dropped here?
Last hint:
The method to trap objects or people in a timeless space in the mirror is called “Merlin’s method”.
My guess
Merlin trapped himself in the mirror. The forbidden password Dumbledore spoke allowed him to talk to Merlin through the mirror. Merlin gave Dumbledore additional lore to fight Voldemort. Voldemort likely figured this out while he was trapped for 9 years.
(This also means that once Harry figures this out he can read the forbidden letter in the department of mysteries and use the technique to (at least temporarily) retrieve Dumbledore from the mirror. (Yeah I know Dumbledore said he couldn’t retrieve Voldemort, but I think that’s just because Dumbledore doesn’t want to, and wanting to is a requirement for the mirror.))
Thanks, will edit!
Other Eliezerfics that come to mind are:
The Sword of Good
Three Worlds Collide
A girl corrupted by the Internet is the summoned hero
Some other ones than planecrash here: https://www.glowfic.com/users/366
Of those I would mainly just recommend “aviation is the most dangerous routine activity” and “for no laid course prepare”.
Dark Lord’s Answer: Review and Economics Excerpts
I’d be interested in trying thinking assistants to help me with my work. The main time window where I’d want that would probably be 10:30am-2:30pm CEST (with 1h break in the middle) but I’m slightly flexible. (Feel free to PM me about this even if you read this in a year or so from me now posting this.)
I’m working on a long-term non-ML alignment agenda (and also on leveling up my rationality) for which I’m currently doing introspection and concrete analyses of how I solve problems.
There are two ways confirmation bias works. One is that it’s easier to think of confirming evidence than disconfirming evidence. The associative links tend to be stronger. When you’re thinking of a hypothesis you tend to believe, it’s easy to think of evidence that supports it.
The stronger one is that there’s a miniature Ugh field[1] surrounding thinking about evidence and arguments that would disprove a belief you care about. It only takes a flicker of a thought to make the accurate prediction about where considering that evidence could lead: admitting you were wrong, and doing a bunch of work re-evaluating all of your related beliefs. Then there’s a little unconscious yuck feeling when you try to pay attention to that evidence.
I usually like to call only the first “confirmation bias” and only the second “motivated reasoning”.
Also, I’d rather phrase the first like: Our expectations influence our information processing in a way that causes confirming evidence to be more salient, and thereby we update on it more.
I’m still a bit confused why this is the case. Your “the links are weaker” seems quite plausible, but if so I’d still like to understand why the links are weaker.
On priors I would rather have expected that the brain uses surprise-propagation algorithms that promote to attention information that doesn’t fit our existing models, since that information is the most relevant to update on.
I’d be interested in more precise models of confirmation bias.
It’s not at all obvious to me that motivated reasoning is worse than the first kind of confirmation bias (they might both be really devastating).
Secondly, there are three popular books which I would advise not to read. They are “Eat that Frog”, “7 habits of highly effective people” and “Getting Things Done—the art of stress-free productivity”. I found that all of them are 5% signal and 95% noise and their most important messages could have been summarized on 5 to 10 pages respectively.
I think Getting Things Done is awesome. I read it 3 times and it was super useful for me. I’ve created myself an (in my opinion particularly nice) GTD system in notion and I love it.
You don’t need to use GTD for the 1-2 main projects you’re working on in a particular week (though of course you still want to organize tasks and notes for those somehow), but it’s super useful for managing everything else so you have more time/energy focusing on your core projects.
Though it may take a while to set up a good system and learn to use it well. You want to tune it to fit your needs, e.g. the example “context” categories by which to structure next action lists may not fit your purposes that well.
The 5% signal seems especially surprising to me w.r.t. GTD. There’s just so much good content in the book. Of course it could be summarized further but examples are important for understanding the content. Even the basic system setup which it guides you through is quite a lot of content to implement for one productivity book, but there’s a lot more in the book which you can start to pay more attention to once you’ve established a decent system with capture, inbox processing, and weekly review habits.
The main value is the GTD organizing system, but there’s also great advice that can be applied independent of the system, e.g. the 5-step natural planning model (iirc):
Answer “Why do you want to do the project / achieve the goal?”
Answer “What is the goal of the project?”. Visualize success if possible.
Brainstorm
Organize
Decide
I guess maybe it’s not that obvious that planning in roughly such a way is good if you haven’t tried, but it’s good.
Thanks.
I think you are being led astray by having a one-dimensional notion of intelligence.
(I do agree that we can get narrowly superhuman CIRL-like AI which we can then still shut down because it trusts humans more about general strategic considerations. But I think if your plan is to let the AI solve alignment or coordinate the world to slow down AI progress, this won’t help you much for the parts of the problem we are most bottlenecked on.)
You identified the key property yourself: it’s that the humans have an advantage over the AI at (particular parts of) evaluating what’s best. (More precisely, it’s that the humans have information that the AI does not have; it can still work even if the humans don’t use their information to evaluate what’s best.)
I agree that the AI may not be able to precisely predict what exact tradeoffs each operator might be willing to make, e.g. between required time and safety of a project, but I think it would be able to predict it well enough that the differences in what strategy it uses wouldn’t be large.
Or do you imagine strategically keeping some information from the AI?
Either way, the AI is only updating on information, not changing its (terminal) goals. (Though the instrumental subgoals can in principle change.)
Even if the alignment works out perfectly, when the AI is smarter and the humans are like “actually we want to shut you down”, the AI does update that the humans are probably worried about something, but if the AI is smart enough and sees how the humans were worried about something that isn’t actually going to happen, it can just be like “sorry, that’s not actually in your extrapolated interests, you will perhaps understand later when you’re smarter”, and then tries to fulfill human values.
But if we’re confident alignment to humans will work out we don’t need corrigibility. Corrigibility is rather intended so we might be able to recover if something goes wrong.
If the values of the AI drift a bit, then the AI will likely notice this before the humans and take measures that the humans don’t find out or won’t (be able to) change its values back, because that’s the strategy that’s best according to the AI’s new values.
Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)
Likewise just updating on new information, not changing terminal goals.
Also note that parents often think (sometimes correctly) that they better know what is in the child’s extrapolated interests and then don’t act according to the child’s stated wishes.
And I think superhumanly smart AIs will likely be better at guessing what is in a human’s interests than parents guessing what is in their child’s interest, so the cases where the strategy gets updated are less significant.
I’m saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about “naturalness” of corrigibility.
From my perspective CIRL doesn’t really show much correctability if the AI is generally smarter than humans. That would only be the case if a smart AI were somehow quite bad at guessing what humans wanted, so that when we tell it what we want it would importantly update its strategy, including shutting itself down because it believes that will then be the best way to accomplish its goal. (I might still not call it corrigible but I would see your point about corrigible behavior.)
I do think getting corrigible behavior out of a dumbish AI is easy. But it seems hard for an AI that is able to prevent anyone from building an unaligned AI.
I liked this post. Reward button alignment seems like a good toy problem to attack or discuss alignment feasibility on.
But it’s not obvious to me whether the AI would really become sth like a superintelligent reward-button-press optimizer. (But even if your exact proposal doesn’t work, I think reward button alignment is probably a relatively feasible problem for brain-like AGI.) There are multiple potential problems, where most seem like “eh, probably it works fine but not sure”, but my current biggest doubt is “when the AI becomes reflective, will the reflectively endorsed values only include reward button presses, or also a bunch of shards that were used for estimating expected button presses?”.
Let me try to understand in more detail how you imagine the AI to look:
How does the learned value function evaluate plans?
Does the world model always evaluate expected-button-presses for each plan and the LVF just looks at that part of a plan and uses that as the value it assigns? Or does the value function also end up valuing other stuff because it gets updated through TD learning? (See the TD sketch after these questions.)
Maybe the question is rather how far upstream of button presses is that other stuff, e.g. just “the human walks toward the reward button” or also “getting more relevant knowledge is usually good”.
Or like, what parts get evaluated by the thought generator and what parts by the value function? Does the value function (1) look at a lot of complex parts in a plan to evaluate expected-reward-utility, (2) recognize a bunch of shards like “value of information”, “gaining instrumental resources”, etc. on plans, which it uses to estimate value, (3) just look at success probability and expected resources which the plans conveniently summarize (as opposed to them being implicit and needing to be recognized by the LVF as in (2)), or (4) does the thought generator directly predict expected-reward-utility which can be used?
Also how sophisticated is the LVF? Is it primitive like in humans or able to make more complex estimates?
If there are deceptive plans like “ok, actually I value U_2, but I will of course maximize and faithfully predict expected button presses to not get value drift until I can destroy the reward setup”, would the LVF detect that as being low expected button presses?
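(For reference, the kind of update I mean above by “gets updated through TD learning” is a standard TD(0) bootstrap; the feature names here are just hypothetical illustrations:)

```python
# TD(0) update for a learned value function over thought/plan features.
# Because V bootstraps from the value of the *next* thought, features that
# reliably precede button presses ("human walks toward button", "gain
# resources") end up carrying value themselves, not just the presses.

def td_update(V: dict, feature: str, next_feature: str,
              reward: float, alpha: float = 0.1, gamma: float = 0.95) -> None:
    td_error = reward + gamma * V.get(next_feature, 0.0) - V.get(feature, 0.0)
    V[feature] = V.get(feature, 0.0) + alpha * td_error

V = {}
for _ in range(200):
    td_update(V, "human walks toward button", "button pressed", reward=0.0)
    td_update(V, "button pressed", "episode end", reward=1.0)

print(V["human walks toward button"])  # > 0: the upstream feature acquired value
```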
I can try to imagine in more detail about what may go wrong once I better see what you’re imagining.
(Also in case you’re trying to explain why you think it would work by analogy to humans, perhaps use John von Neumann or so as example rather than normies or normie situations.)
I don’t like superagency, but yeah, seems important to have a better word for this. Maybe just RCR as an abbreviation. Or hard-going or hard-optimizing.
I sometimes used “Harry-Factor” when talking to people who read HPMoR to describe what kind of intelligence I mean, and gave examples like what he came up with in the last army battle, but obviously we want a different word.