Mateusz Bagiński

Karma: 2,311

Agent foundations, AI macrostrategy, civilizational sanity, human enhancement.

I endorse and operate by Crocker’s rules.

I have not signed any agreements whose existence I cannot mention.

Mateusz Bagiński 13 Nov 2025 9:59 UTC
4 points
0
in reply to: Wei Dai’s comment on: Please, Don’t Roll Your Own Metaethics
the total idea/argument space being exponentially vast and underexplored due to human limitations, therefore high confidence being unjustified in light of this
There’s also the thing that the idea/argument space contains dæmons/attractors exploiting shortcomings of human cognition, thus making humans hold them with higher confidence than they would if they didn’t have those limitations.

Mateusz Bagiński 12 Nov 2025 14:23 UTC
2 points
0
on: Teleosemantics & Swampman
One way to patch/augment the distinction between original and derived intentionality is to allow for viewing phenomena with derived intentionality as doing some optimization/selection decohered from the purpose of the original intentional phenomenon (inner/mesa-optimizer-like).
For example, Process 1 with original intentionality (“intentionally”) spawns Process 2 with derived intentionality, which (“non-intentionally”) spawns Subprocess 2a, which then (again, “intentionally”) spawns Process 3.
If we draw a ~~phylogenetic tree~~ messy, directed mycelium of optimization processes, some “children” will be “legitimate” (“intentionally spawned”), whereas other “children” will be “illegitimate” (“non-intentionally spawned”).
It’s not as much about giving something “special status”, as rather about qualifying the type of parent-children relationship between two processes/artifacts. (It’s probably also possible to fold the cases where there’s no distinction between the optimizer and the optimizee.)
My own feeling is that this original/derived idea is somewhat chasing a ghost, like needing there to be something special which elevates the goals of a human above the purposes of a hammer. Granted, there are differences between how humans have goals and how a hammer has a purpose. Original vs derived just seems to me like the wrong frame for that.
I would like to hear you elaborate on that.

Mateusz Bagiński 12 Nov 2025 10:05 UTC
8 points
0
on: How I Learned That I Don’t Feel Companionate Love
So I got my parents’ genomes sequenced. One of them had the frameshift-inducing mutation, as expected. The other had a few substitutions which I share. Alas, that parent’s substitutions… were also shared by my other parent
Do you know if anyone ever remarked that your parents are noticeably unusual on the axis of oxytocin-dependent functionality?

Mateusz Bagiński 11 Nov 2025 14:42 UTC
2 points
0
in reply to: Jan_Kulveit’s comment on: Limits to Legibility
In a very brief form, maybe the most common cause of EA problem and stupidities are attempts to replace illegible S1 boxes able to represent human values such as ‘caring’ by legible, symbolically described, verbal moral reasoning subject to memetic pressure.
C. Thi Nguyen has written about this (in general, not specifically EA) under the labels of value capture/collapse/swamping.
https://philpapers.org/rec/NGUVCH
https://www.youtube.com/watch?v=zt03qjTyefU

Mateusz Bagiński 10 Nov 2025 14:21 UTC
2 points
0
in reply to: Raemon’s comment on: The title is reasonable
I don’t think anyone said “coherent”. I think (and think Eliezer thinks) that if something like Sable was created, it would be a hodge-podge of impulses without a coherent overall goal, same as humans are by default.
from https://www.lesswrong.com/posts/NJYmovr9ZZAyyTBwM/what-i-mean-by-alignment-is-in-large-part-about-making :
In saying the above, I do not mean the following:
(1) Any practical AI that you’re dealing with will necessarily be cleanly internally organized around pursuing a single objective. Managing to put your own objective into this “goal slot” (as opposed to having the goal slot set by random happenstance) is a central difficult challenge. [Reminder: I am not asserting this]
Instead, I mean something more like the following:
(2) By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out “goal” that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.
Making the AI even have something vaguely nearing a ‘goal slot’ that is stable under various operating pressures (such as reflection) during the course of operation, is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we’re exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable.
Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)).
(But this doesn’t help solve the problem, because by the time the strongly superintelligent AI has ironed itself out into something with a “goal slot”, it’s not letting you touch it.)

Mateusz Bagiński 10 Nov 2025 14:16 UTC
6 points
0
in reply to: dr_s’s comment on: I ate bear fat with honey and salt flakes, to prove a point
Huh… This reminds me of a conversation I had ~11 years ago with a Polish guy recounting his visit to Chechnya and Chechnyan hospitality, including them giving him bread with sour cream and sugar.

Mateusz Bagiński 10 Nov 2025 13:43 UTC
2 points
0
in reply to: 1a3orn’s comment on: The title is reasonable
The upshot is that (afaict) there’s no such thing in humans as “working out one’s true values” apart from an environment, where for humans the most salient feature of the environment (for boring EvoPsych reasons) is what the people around one are like and how they’ll react. People who think they’re “working out their true values” in the sense of crystalizing facts about themselves, rather than running forward a state-function of the the self, friends, and environment, are (on this view) just self-deceiving.
How do you think Jeremy Bentham came to the conclusion that animal welfare matters morally and that there’s nothing morally wrong with homosexuality? Are you claiming that he ran forward a computation of how the relevant parts of his social milieu are going to react, and did what maximized the expected value of reaction?
I buy that this is how most of human “value formation” happens, but I don’t buy that this is all that happens. I think that humans vary in some trait similar to the need for cognition (probably positively correlated), which is something like “how much one is bothered by one’s value dissonances”, independent of social surroundings.
Like, you could tell a similar history about intellectual/scientific/technological progress, and it would be directionally right, but not entirely right, and the “not entirely” matters a lot.
Aside from all that, I expect that a major part of AIs’ equivalent of social interaction will be other AIs or general readouts of things on the internet downstream of human and non-human activity that do not exert a strong pressure in the direction of being more human-friendly, especially given that AIs do not share the human social machinery (as Ray says).

Mateusz Bagiński 10 Nov 2025 0:34 UTC
8 points
0
on: Two Times I Was Surprised By My Own Values
For many years, I thought I didn’t want children. Then, at some point, I noticed that whenever I look up the Wikipedia page of a person from the past that I admire in some respect, I check whether they have any living descendants, and I’m a bit sad if they don’t.
This is not quite an inconsistency in values, but a sort of dissonance that made me do a bit of soul-searching to re-examine the reasons (I thought) I had for not wanting kids.

Mateusz Bagiński 10 Nov 2025 0:28 UTC
3 points
0
in reply to: TsviBT’s comment on: Mourning a life without AI
The diff is distributed between the diffs of P(AGI^[1] in n years) and P(doom | AGI in n years), so Nina might just have a lower timeline-conditional p(doom) and not significantly longer timelines.
1. ^
  or ASI or whatever

Mateusz Bagiński 9 Nov 2025 16:50 UTC
2 points
0
in reply to: Thane Ruthenis’s comment on: Legible vs. Illegible AI Safety Problems
Generally agree with the caveat that...
A human succeeds if the step’s difficulty is below some threshold, and fails and goes back to square one otherwise.
...the difficulty of a step is generally somewhat dependent on some contingent properties of a given human mind.

Mateusz Bagiński 8 Nov 2025 13:21 UTC
4 points
1
in reply to: sil’s comment on: Legible vs. Illegible AI Safety Problems
The extreme variance of responses/reception to the GD paper indicates that it is an obvious thing for some people (e.g., Zvi in his review of it), whereas for other people it’s a non-issue if you solve alignment/control (I think Ryan Greenblatt’s responses under one of Jan Kulveit’s posts about GD).
So I’d say it’s a legible problem for some (sub)groups and illegible for others, although there are some issues around conceptual engineering of the bridge between GD and orthodox AI X-risk that, as far as I’m aware, no one has nailed down yet.

Mateusz Bagiński 8 Nov 2025 13:15 UTC
2 points
0
in reply to: Wei Dai’s comment on: Legible vs. Illegible AI Safety Problems
At this point, I am not surprised by this sort of thing at all, only semi-ironically amused, but I’m not sure whether I can convey why it’s not surprising to me at all (although I surely would be surprised by this if somebody made it salient to me some 5 or 10 years ago).
Perhaps I just got inoculated by reading about people making breakthroughs with simple or obvious in-hindsight concepts or even hearing ideas from people that I thought were obviously relevant/valuable to have in one’s portfolio of models, even though for some reason I hadn’t had it until then, or at least it had been less salient to me than it should have.
Huxley’s famous response to the idea of natural selection was “How extremely stupid not to have thought of that!”
Anders Sandberg said that he had had all the pieces of the Grabby Aliens model on the table and only failed to think of an obvious way to put them together.
One frame (of unclear value) I have for this kind of thing is that the complexity/salience/easiness-to-find of an idea before and after is different because, well, a bunch of stuff in the mind is different.

Mateusz Bagiński 7 Nov 2025 14:15 UTC
2 points
0
in reply to: tailcalled’s comment on: Tomás B.’s Shortform
AFAIK, it is best (expected-outcomes-wise) to be short, but for “boring genetic reasons” (as opposed to (genetic or non-genetic) disease reasons), because fewer cells means smaller propensity to develop cancer and a bunch more stuff (holding everything else constant).

Mateusz Bagiński 7 Nov 2025 13:00 UTC
7 points
7
in reply to: interstice’s comment on: People Seem Funny In The Head About Subtle Signals
This and/or “selecting for a partner who is good at reading my signals” and/or plausible deniability is my go-to explanation for the dating case, but I don’t think it applies to everything discussed in this post or even to all the stuff in the dating case.

Mateusz Bagiński 6 Nov 2025 14:17 UTC
4 points
0
in reply to: Linda Linsefors’s comment on: People Seem Funny In The Head About Subtle Signals
1. 40%?
2. No, my impression has always been that you aim for comfy clothes.
  1. Maybe modulo cases of you wearing an AI Safety Camp t-shirt or something like that.
  2. Maybe you’re kinda trying to signal preference for comfy clothes in addition to that by deliberately trying to choose clothes that someone would choose iff they prioritize comfiness above all else. Not that I have any specific evidence of that, just putting a hypothesis on the table.

Mateusz Bagiński 6 Nov 2025 13:03 UTC
17 points
0
on: Mateusz Bagiński’s Shortform
In his MLST podcast appearance in early 2023, Connor Leahy describes Alfred Korzybski as a sort of “rationalist before the rationalists”:
Funny story: rationalists actually did exist, technically, before or around World War One. So, there is a Polish nobleman named Alfred Korzybski who, after seeing horrors of World War One, thought that as technology keeps improving, well, wisdom’s not improving, then the world will end and all humans will be eradicated, so we must focus on producing human rationality in order to prevent this existential catastrophe. This is a real person who really lived and he actually sat down for like 10 years to like figure out how to like solve all human rationality God bless his autistic soul. You know, he failed obviously but you know you can see that the idea is not new in this regard.
Korzybski’s two published books are Manhood of Humanity (1921) and Science and Sanity (1933).
E. P. Dutton published Korzybski’s first book, Manhood of Humanity, in 1921. In this work he proposed and explained in detail a new theory of humankind: mankind as a “time-binding” class of life (humans perform time binding by the transmission of knowledge and abstractions through time which become accreted in cultures).
Having read the book (and having filtered it through some of my own interpretaion of it and perhaps some steelmanning) I am inclined to interpret his “time-binding” as something like (1) accumulation of knowledge from past experience across time windows that are inaccessible to any other animals (both individual (long childhoods) and cultural learning); and (2) the ability to predict and influence the future. This gets close in the neighborhood of “agency as time-travel”, consequentialist cognition, etc.
In the wiki page of his other book:
His best known dictum is “The map is not the territory”: He argued that most people confuse reality with its conceptual model.
(But that is relatively well-known.)
Korzybski intended the book to serve as a training manual. In 1948, Korzybski authorized publication of Selections from Science and Sanity after educators voiced concerns that at more than 800 pages, the full book was too bulky and expensive.
As Connor said...
God bless his autistic soul. You know, he failed obviously but
...but 60 years later, his project would be restarted.
See also: https://www.lesswrong.com/posts/qc7P2NwfxQMC3hdgm/rationalism-before-the-sequences

Mateusz Bagiński 5 Nov 2025 14:26 UTC
2 points
0
in reply to: Wei Dai’s comment on: Wei Dai’s Shortform
[Tangent:]
There is a sort of upside to this, in that to the extent that people are more inclined to post shortforms than longforms due to the lower perceived/expected effort of the former, there is a possibility of (optional?) UX engineering to make writing longforms feel a bit more like writing shortforms, so that people who have something to write but also have a feeling of “ugh, that would be a lot of effort, I’ll do it when I’m not as tired [or whatever]” would be more inclined to write and post it.
Relatedly, every few days, I find myself writing some long and detailed message in a DM, which I would be less motivated to write in my personal notes, let alone write a blog post about it, and sometimes the message turns out to look like a first draft of a blog post.^[1] How to hijack this with UX?^[2]
1. ^
  After I started talking about it, I found out that apparently “write an article like a message to an intellectual-peer friend” is something like a folk advice.
2. ^
  Of course, also: How to hijack this with stuff other than UX?

Mateusz Bagiński 5 Nov 2025 9:51 UTC
2 points
0
on: Intentionality
“Intentionality” fits somewhat nicely Michael Bratman’s view of intentions as partial plans: you fix some aspect of your policy to satisfy a desire, so that you are robust against noisy perturbations (noisy signals, moments of “weakness of will”, etc), can use the belief that you’re going to behave in a certain way as an input to your further decisions and beliefs (as well as other agents’ precommitments), not have to precompute everything in runtime, etc.^[1]
A downside of the word is that it collides in the namespace with how “intentionality” is typically used in philosophy of mind, close to referentiality (cf. Tomasello’s shared intentionality).
Perhaps the concept of “deliberation” from LOGI is trying to point in this direction, although it covers more stuff than consulting explicit representations.
The human mind, owing to its accretive evolutionary origin, has several major distinct candidates for the mind’s “center of gravity.” For example, the limbic system is an evolutionarily ancient part of the brain that now coordinates activities in many of the other systems that later grew up around it. However, in (cautiously) considering what a more foresightful and less accretive design for intelligence might look like, I find that a single center of gravity stands out as having the most complexity and doing most of the substantive work of intelligence, such that in an AI, to an even greater degree than in humans, this center of gravity would probably become the central supersystem of the mind. This center of gravity is the cognitive superprocess which is introspectively observed by humans through the internal narrative—the process whose workings are reflected in the mental sentences that we internally “speak” and internally “hear” when thinking about a problem. To avoid the awkward phrase “stream of consciousness” and the loaded word “consciousness,” this cognitive superprocess will hereafter be referred to as deliberation.
[ … ]
Deliberation describes the activities carried out by patterns of thoughts. The patterns in deliberation are not just epiphenomenal properties of thought sequences; the deliberation level is a complete layer of organization, with complexity specific to that layer. In a deliberative AI, it is patterns of thoughts that plan and design, transforming abstract high-level goal patterns into specific low-level goal patterns; it is patterns of thoughts that reason from current knowledge to predictions about unknown variables or future sensory data; it is patterns of thoughts that reason about unexplained observations to invent hypotheses about possible causes. In general, deliberation uses organized sequences of thoughts to solve knowledge problems in the pursuit of real-world goals.
Cf. https://www.lesswrong.com/w/deliberate-practice. Wiktionary defines “deliberate” in terms of “intentional”: https://en.wiktionary.org/wiki/deliberate#Adjective.
1. ^
  At least that’s the Bratman-adjacent view of intention that I have.

Mateusz Bagiński 3 Nov 2025 12:09 UTC
2 points
0
in reply to: Wei Dai’s comment on: Trying to understand my own cognitive edge
Thanks!
The entire thing seems to have a very https://www.lesswrong.com/posts/bhLxWTkRc8GXunFcB/what-are-you-tracking-in-your-head vibes, though that’s admittedly not very specific.
What stands out to me in the b-money case is that you kept tabs on “what the thing is for”/”the actual function of the thing”/”what role it is serving in the economy”, which helped you figure out how to make a significant improvement.
Very speculatively, maybe something similar was going on in the UDT case? If the ideal platonic theory of decision-making “should” tell you and your alt-timeline-selves how to act in a way that coheres (~adds up to something coherent?) across the multiverse or whatever, then it’s possible that having anthropics as the initial motivation helped.

Mateusz Bagiński 3 Nov 2025 9:53 UTC
4 points
0
on: Trying to understand my own cognitive edge
the main thing that appears to have happened is that I had exceptional intuitions about what problems/fields/approaches were important and promising
I’d like to double-click on your exceptional intuitions, though I don’t know what questions would be most revealing if answered. Maybe: could you elaborate on what you saw that others didn’t see and that made you propose b-money, UDT, the need for an AI pause/slowdown, etc?
E.g., what’s your guess re what Eliezer was missing (in his intuitions?) in that he came up with TDT but not UDT? Follow-up: Do you remember what the trace was that led you from TDT to UDT? (If you don’t, what’s your best guess what it was?)