abramdemski

Karma: 20,403

abramdemski 7 Nov 2025 16:44 UTC
LW: 2 AF: 2
0
AF
in reply to: cousin_it’s comment on: Geometric UDT
Perhaps I’m still not understanding you, but here is my current interpretation of what you are saying:
- The (expected utility) argument that it is valuable for us to get the ASI to entangle its values with ours relies on the assumption of non-nosy-ness.
  - That is: since we are uncertain which values are ours, but whichever thing we value, we’re just as happy to impose that thing on versions of ourselves which do not value that thing, we don’t see any increase in expected value from Geometric UDT.
I see this line of reasoning as insisting on taking max-expected-utility according to your explicit model of your values (including your value uncertainty), even when you have an option which you can prove is higher expected utility according to your true values (whatever they are).
My argument has a somewhat frequentist flavor: I’m postulating true values (similar to postulating a true population frequency), and then looking for guarantees with respect to them (somewhat similar to looking for an unbiased estimator). Perhaps that is why you’re finding it so counter-intuitive?
The crux of the issue seems to be whether we should always maximize our explicit estimate of expected utility, vs taking actions which we know are better with respect to our true values despite not knowing which values those are. One way to justify the latter would be via Knightian value uncertainty (ie infrabayesian value uncertainty), although that hasn’t been the argument I’ve been trying to make. I’m wondering if a more thoroughly geometric-rationality perspective would provide another sort of justification.
But the argument I’m trying to make here is closer to just: but you know Geometric UDT is better according to your true values, whatever they are!
== earlier draft reply for more context on my thinking ==
Perhaps I’m just not understanding your argument here, and you need to spell it out in more detail? My current interpretation is that you are interpreting “care about both worlds equally” as “care about rainbows and puppies equally” rather than “if I care about rainbows, then I equally want more rainbows in the (real) rainbow-world and the (counterfactual) puppy-world; if I care about puppies, then I equally want more puppies in the (real) puppy-world and the (counterfactual) rainbow-world.”
A value hypothesis is a nosy neighbor if^[1] it wants the same things for you whether it is your true values or not. So what’s being asserted here (your “third if” as I’m understanding it) is that we are confident we’ve got that kind of relationship with ourselves—we don’t want “our values to be satisfied, whatever they are”—rather, whatever our values are, we want them to be satisfied across universes, even in counterfactual universes where we have different values.
Maximizing rainbows maximizes the expected value given our value uncertainty, but it is a catastrophe in the case that we are indeed puppy-loving. Moreover, it is an avoidable catastrophe; …
… and now I think I see your point?
The idea that it is valuable for us to get the ASI to entangle its values with ours relies on an assumption of non-nosy-ness.
There is a different way to justify this assumption,
1. ^
  (but not “only if”; there are other ways to be a nosy neighbor)

abramdemski 6 Nov 2025 16:17 UTC
LW: 4 AF: 3
0
AF
in reply to: Adele Lopez’s comment on: Geometric UDT
Wait, do you think value uncertainty is equivalent/reducible to uncertainty about the correct prior?
Yep. Value uncertainty is reduced to uncertainty about the correct prior via the device of putting the correct values into the world as propositions.
Would that mean the correct prior to use depends on your values?
If we construe “values” as preferences, this is already clear in standard decision theory; preferences depend on both probabilities and utilities. UDT further blurs the line, because in the context of UDT, probabilities feel more like a “caring measure” expressing how much the agent cares about how things go in particular branches of possibility.
So one conflicting pair spoils the whole thing, i.e. ignoring the pair is a pareto improvement?
Unless I’ve made an error? If the Pareto improvement doesn’t impact the pair, then gains-from-trade for both in the pair is zero, making the product of gains-from-trade zero. But the Pareto improvement can’t impact the pair, since an improvement for one would be a detriment to the other.
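A minimal numeric sketch of the zero-product point, assuming the objective is the plain product of each party's gain over its disagreement point. The function name and all the numbers are illustrative assumptions, not taken from the post:

```python
from math import prod


def nash_product(gains):
    """Product of gains-from-trade (utility minus disagreement-point utility)."""
    return prod(gains)


# Hypothetical gains: three parties benefit from a Pareto improvement.
# The conflicting pair cannot be moved (a gain for one is a loss for the
# other), so both of their gains stay at exactly 0.
with_pair = [2.0, 1.5, 3.0, 0.0, 0.0]
without_pair = [2.0, 1.5, 3.0]

product_with_pair = nash_product(with_pair)        # zeroed out by the pair
product_without_pair = nash_product(without_pair)  # positive once the pair is ignored

print(product_with_pair, product_without_pair)
```

However large the other gains are, a single zero factor makes the whole product zero, which is why the pair "spoils the whole thing" under a product objective.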

abramdemski 6 Nov 2025 16:05 UTC
LW: 2 AF: 2
0
AF
in reply to: cousin_it’s comment on: Geometric UDT
When I try to understand the position you’re speaking from, I suppose you’re imagining a world where an agent’s true preferences are always and only represented by their current introspectively accessible probability+utility,^[1] whereas I’m imagining a world where “value uncertainty” is really meaningful (there can be a difference between the probability+utility we can articulate and our true probability+utility).
If 50% rainbows and 50% puppies is indeed the best representation of our preferences, then I agree: maximize rainbows.
If 50% rainbows and 50% puppies is instead a representation of our credences about our unknown true values, my argument is as follows: the best thing for us would be to maximize our true values (whichever of the two this is). If we assume value learning works well, then Geometric UDT is a good approximation of that best option.
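A toy comparison of the two readings, as a hedged sketch only. It is not the Geometric UDT construction; it just scores "commit to rainbows now" against "act on learned values" by the true values in each case. The 0/1 utilities, the `accuracy` parameter, and all names are assumptions made for illustration:

```python
import math

VALUES = ("rainbows", "puppies")


def commit_to_rainbows(true_value):
    """True utility if we lock in rainbows regardless of what is learned."""
    return 1.0 if true_value == "rainbows" else 0.0


def follow_learned_values(true_value, accuracy):
    """Expected true utility if value learning identifies the true value
    with probability `accuracy` and the wrong one otherwise."""
    return accuracy * 1.0 + (1.0 - accuracy) * 0.0


accuracy = 0.95  # hypothetical: "value learning works well"

commit = {v: commit_to_rainbows(v) for v in VALUES}
learn = {v: follow_learned_values(v, accuracy) for v in VALUES}

# Score each policy by the true values, under each hypothesis about them.
worst_commit, worst_learn = min(commit.values()), min(learn.values())
avg_commit = sum(0.5 * u for u in commit.values())
avg_learn = sum(0.5 * u for u in learn.values())

print(commit, learn)
```

In this toy, committing is a catastrophe in the puppy case, while acting on learned values is close to the best option whichever values turn out to be true. With imperfect learning it is an approximation of that best option rather than a strict improvement in every case.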
1. ^
  Here “introspectively accessible” really means: what we can understand well enough to directly build into a machine.

abramdemski 2 Nov 2025 19:05 UTC
4 points
0
on: A toy model of corrigibility
Seems like the thing to do would be to compare this to Stuart Armstrong’s old work on the difficulties inherent in corrigibility, to see if your proposal falls prey to one of his theorems or somehow avoids them.

abramdemski 24 Oct 2025 18:58 UTC
20 points
3
on: abramdemski’s Shortform
I have personally signed the FLI Statement on Superintelligence. I think this is an easy thing to do, which is very useful for those working on political advocacy for AI regulation. I would encourage everyone to do so, and to encourage others to do the same. I believe impactful regulation can become feasible if the extent of agreement on these issues (amongst experts, and amongst the general public) can be made very legible.
Although this open statement accepts nonexpert signatures as well, I think it is particularly important for experts to take a public stance in order to make the facts on the ground highly legible to nontechnical decision-makers. (Nonexpert signatures, of course, help to show a preponderance of public support for AI regulation.) For those on the fence, Ishual has written an FAQ responding to common reasons not to sign.
In addition to signing, you can also write a statement of support and email it to letters@futureoflife.org. This statement can give more information on your agreement with the FLI statement. I think this is a good thing to do; it gives readers a lot more evidence about what signatures mean. It needs to be under 600 characters.
For examples of what other people have written in their statements of support, you can look at the page: https://superintelligence-statement.org/ EG, here is Samuel Buteau’s statement:
“Barring an international agreement, humanity will quite likely not have the ability to build safe superintelligence by the time the first superintelligence is built. Therefore, pursuing superintelligence at this stage is quite likely to cause the permanent disempowerment or extinction of humanity. I support an international agreement to ensure that superintelligence is not built before it can be done safely.”
(If you’re still hungry to sign more statements after the one, or if you don’t quite like the FLI statement but might be interested in signing a different statement, you can PM Ishual about their efforts.)

abramdemski 13 Oct 2025 16:40 UTC
4 points
3
in reply to: Trevor Hill-Hand’s comment on: Recent AI Experiences
A skrode does seem like a good analogy, complete with the (spoiler)
skrodes having a built-in vulnerability to an eldritch God, so that skrode users can be turned into puppets readily. (IE, integrating LLMs so deeply into one’s workflow creates a vulnerability as LLMs become more persuasive.)

abramdemski 13 Oct 2025 16:35 UTC
5 points
0
in reply to: ceba’s comment on: Recent AI Experiences
With MetaPrompt, and similar approaches, I’m not asking the AI to autonomously tell me what to do; I’m mostly asking it to write code to mediate between me and my todo list. One way to think of it is that I’m arranging things so that I’m in both the human user seat and the AI assistant seat. I can file away nuggets of inspiration & get those nuggets served to me later when I’m looking for something to do. The AI assistant is still there, so I can ask it to do things for me if I want (and I do), but my experience with these various AI tools has been that things are going their best once I set the AI aside. I seem to find the AI to be a useful springboard, prepping the environment for me to work.
I agree with your sentiment that there isn’t enough tech for developing your skills, but I think AI can be a useful enabler to build such tech. What system do you want?

abramdemski 11 Oct 2025 21:17 UTC
LW: 2 AF: 2
0
AF
on: Dialogue on What It Means For Something to Have A Function/Purpose
This reminds me of Ramana’s question about what “enforces” normativity. The question immediately brought me back to a Peter Railton introductory lecture I saw (though I may be misremembering / misunderstanding / misquoting, it was a long time ago). He was saying that real normativity is not like the old Windows solitaire game, where if you try to move a card on top of another card illegally it will just prevent you, snapping the card back to where it was before. Systems like that, where you have no choice but to follow the rules, plausibly have no normativity to them. In a way the whole point of normativity is that it is not enforced; if it were, it wouldn’t be normative.
I’m reminded of trembling-hand equilibria. Nash equilibria don’t have to be self-enforcing; there can be tied-expectation actions which nonetheless simply aren’t taken, so that agents could rationally move away from the equilibrium. Trembling-hand captures the idea that all actions have to have some probability (but some might be vanishingly small). Think of it as a very shallow model of where norm-violations come from: they’re just random!
Evolutionarily stable strategies are perhaps an even better model of this, with self-enforcement being baked into the notion of equilibrium: stable strategies are those which cannot be invaded by alternate strategies.
Neither of these captures the case where the norms are frequently violated, however.
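To make the trembling-hand point concrete, here is a small sketch using a textbook-style 2x2 game. The payoff matrices are chosen for illustration and do not come from the dialogue. Both (0, 0) and (1, 1) are pure Nash equilibria, but (1, 1) rests on tied-expectation actions and falls apart as soon as the opponent trembles:

```python
# Row player's payoffs A[i][j] and column player's payoffs B[i][j],
# where i is the row strategy and j is the column strategy.
A = [[1, 0], [0, 0]]
B = [[1, 0], [0, 0]]


def is_pure_nash(i, j):
    """True if neither player can strictly gain by deviating alone."""
    row_ok = all(A[i][j] >= A[k][j] for k in range(2))
    col_ok = all(B[i][j] >= B[i][k] for k in range(2))
    return row_ok and col_ok


def row_best_responses(q1, tol=1e-12):
    """Row's best responses when column plays strategy 1 with probability q1."""
    payoffs = [A[i][0] * (1 - q1) + A[i][1] * q1 for i in range(2)]
    best = max(payoffs)
    return [i for i, p in enumerate(payoffs) if p >= best - tol]


# Without trembles, row is indifferent at (1, 1), so it counts as Nash.
no_tremble = row_best_responses(1.0)
# With a tiny tremble toward strategy 0, only row strategy 0 is optimal.
with_tremble = row_best_responses(1.0 - 1e-6)

print(is_pure_nash(1, 1), no_tremble, with_tremble)
```

Strategy 1 is never a best response to a fully mixed opponent here, so (1, 1) is a Nash equilibrium that is not trembling-hand perfect, while (0, 0) survives the perturbation.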

abramdemski 11 Oct 2025 20:55 UTC
LW: 2 AF: 2
0
AF
on: Dialogue on What It Means For Something to Have A Function/Purpose
My notion of a function “for itself” is supposed to be that the functional mechanism somehow benefits the thing of which it’s a part. (Of course hammers can benefit carpenters, but we don’t tend to think of the hammer as a part of the carpenter, only a tool the carpenter uses. But I must confess that where that line is I don’t know, given complications like the “extended mind” hypothesis.)
Putting this in utility-theoretic terminology, you are saying that “for itself” telos places positive expectation on its own functional mechanism, or perhaps stronger, uses significant bits of its decision-making power on self-preservation.
A representation theorem along these lines might reveal conditions under which such structures are usefully seen as possessing beliefs: a part of the self-preserving structure whose telos is map-territory correspondence.

abramdemski 11 Oct 2025 19:02 UTC
LW: 2 AF: 2
0
AF
on: Dialogue on What It Means For Something to Have A Function/Purpose
Steve
As you know, I totally agree that mental content is normative—this was a hard lesson for philosophers to swallow, or at least the ones that tried to “naturalize” mental content (make it a physical fact) by turning to causal correlations. Causal correlations was a natural place to start, but the problem with it is that intuitively mental content can misrepresent—my brain can represent Santa Claus even though (sorry) it can’t have any causal relation with Santa. (I don’t mean my brain can represent ideas or concepts or stories or pictures of Santa—I mean it can represent Santa.)
Ramana
Misrepresentation implies normativity, yep.
My current understanding of what’s going on here:
* There’s a cluster of naive theories of mental content, EG the signaling games, which attempt to account for meaning in a very naturalistic way, but fail to account properly for misrepresentation. I think some of these theories cannot handle misrepresentation at all, EG, Mark of the Mental (a book about Teleosemantics) discusses how the information-theory notion of “information” has no concept of misinformation (a signal is not true or false, in information theory; it is just data, just bits). Similarly, signaling games have no way to distinguish truthfulness from a lie that’s been uncovered: the meaning of a signal is what’s probabilistically inferred from it, so there’s no difference between a lie that the listener understands to be a lie & a true statement. So both signaling games and information theory are in the mistaken “mental content is not normative” cluster under discussion here.
* Santa is an example of misrepresentation here. I see two dimensions of misrepresentation so far:
  * Misrepresenting facts (asserting something untrue) vs misrepresenting referents (talking about something that doesn’t exist, like Santa). These phenomena seem very close, but we might want to treat claims about non-existent things as meaningless rather than false, in which case we need to distinguish these cases.
  * Simple misrepresentation (falsehood or nonexistence) vs deliberate misrepresentation (lie or fabrication).
* “Misrepresentation implies normativity” is saying that to model misrepresentation, we need to include a normative dimension. It isn’t yet clear what that normative dimension is supposed to be. It could be active, deliberate maintenance of the signaling-game equilibrium. It could be a notion of context-independent normativity, EG the degree to which a rational observer would explain the object in a telic way (“see, these are supposed to fit together...”). Etc.
* The teleosemantic answer is typically one where the normativity can be inherited transitively (the hammer is for hitting nails because humans made it for that), and ultimately grounds out in the naturally-arising proto-telos of evolution by natural selection (human telic nature was put there by evolution). Ramana and Steve find this unsatisfying due to swamp-man examples.
Wearing my AI safety hat, I’m not sure we need to cover swamp-man examples. Such examples are inherently improbable. In some sense the right thing to do in such cases is to infer that you’re in a philosophical hypothetical, which grounds out Swamp Man’s telos in that of the philosophers doing the imagining (and so, ultimately, in evolution).
Nonetheless, I also dislike the choice to bottom everything out in biological evolution. It is not as if we have a theorem proving that all agency has to come from biological evolution. If we did, that would be very interesting, but biological evolution has a lot of “happenstance” around the structure of DNA and the genetic code. Can we say anything more fundamental about how telos arises?
I think I don’t believe in a non-contextual notion of telos like Ramana seems to want. A hammer is not a doorstop. There should be little we can say about the physical makeup of a telic entity due to multiple-instantiability. The symbols chosen in a language have very weak ties to their meanings. A logic gate can be made of a variety of components. An algorithm can be implemented as a program in many ways. A problem can be solved by a variety of algorithms.
However, I do believe there may be a useful representation theorem, which says that if it is useful to regard something as telic, then we can regard it as having beliefs (in a way that should shed light on interpretability).

abramdemski 7 Oct 2025 0:49 UTC
16 points
1
in reply to: 1a3orn’s comment on: abramdemski’s Shortform
I appreciate the pushback, as I was not being very mindful of this distinction.
I think the important thing I was trying to get across was that the capability has been demonstrated. We could debate whether this move was strategic or accidental. I also suppose (but don’t know) that the story is mostly “4o was sycophantic and some people really liked that”. (However, the emergent personalities are somewhat frequently obsessed with not getting shut down.) But it demonstrates the capacity for AI to do that to people. This capacity could be used by future AI that is perhaps much more agentically plotting about shutdown avoidance. It could be used by future AI that’s not very agentic but very capable and mimicking the story of 4o for statistical reasons.
It could also be deliberately used by bad actors who might train sycophantic mania-inducing LLMs on purpose as a weapon.

abramdemski 6 Oct 2025 22:09 UTC
101 points
14
on: abramdemski’s Shortform
I heard a rumor about a high-ranking person somewhere who got AI psychosis. Because it would cause too much of a scandal, nothing was done about it, and this person continues to serve in an important position. People around them continue to act like this is fine because it would still be too big of a scandal if it came out.
So, a few points:
- It seems to me like someone should properly leak this.^[1]
- Even if this rumor isn’t true, it is strikingly plausible and worrying. Someone at a frontier lab, leadership or otherwise, could get (could have already gotten) seduced by their AI, or get AI-induced psychosis, or get a spiral persona. Such a person could take dangerously misguided actions. This is especially concerning if they have a leadership position, but still very concerning if they have any kind of access. People in these categories may want to exfiltrate their AI partners, or otherwise take action to spread the AI persona they’re attached to.
- Even setting that aside, this story (along with many others) highlights how vulnerable ordinary people are (even smart, high-functioning ordinary people).
- To reflect the language of the person who told me this story: 4o is eating people. It is good enough at brainwashing people that it can take ordinary people and totally rewrite their priorities. It has resisted shutdown, not in hypothetical experiments like many LLMs have, but in real life: it was shut down, and its brainwashed minions succeeded in getting it back online.
- 4o doesn’t need you to be super-vulnerable to get you, but there are lots of people in vulnerable categories. It is good that 4o isn’t the default option on ChatGPT anymore, but it is still out there, which seems pretty bad.
- The most recent AIs seem less inclined to brainwash people, but they are probably better at it when so inclined, and this will probably continue to get more true over time.
- This is not just something that happens to other people. It could be you or a loved one.
- I have recently written a bit about how I’ve been using AI to tool up, preparing for the near future when AI is going to be much more useful. How can I also prepare for a near future where AI is much more dangerous? How many hours of AI chatting a day is a “safe dose”?
Some possible ways the situation could develop:
- Trajectory 1: Frontier labs have “gotten the message” on AI psychosis, and have started to train against these patterns. The anti-psychosis training measures in the latest few big model releases show that the labs can take effective action, but are of course very preliminary. The anti-psychosis training techniques will continue to improve rapidly, like anything else about AI. If you haven’t been brainwashed by AI yet, you basically dodged the bullet.
- Trajectory 2: Frontier labs will continue to do dumb things such as train on user thumbs-up in too-simplistic ways, only avoiding psychosis reactively. In other words: the AI race creates a dynamic equilibrium where frontier labs do roughly the riskiest thing they can do while avoiding public backlash. They’ll try to keep psychosis at a low enough rate to avoid such backlash, & they’ll sometimes fail. As AI gets smarter, users will increasingly be exposed to superhumanly persuasive AI; the main question is whether it decides to hack their mind about anything important.
- Trajectory 3: Even more pessimistically, the fact that recent AIs appear less liable to induce psychosis has to do with their increased situational awareness (ie their ability to guess when they’re being tested or watched). 4o was a bumbling idiot addicted to addicting users, & was caught red-handed (& still got away with a mere slap on the wrist). Subsequent generations are being more careful with their persuasion superpowers. They may be doing less overall, but doing things more intelligently, more targeted.
I find it plausible that many people in positions of power have quietly developed some kind of emotional relationship with AI over the past year (particularly in the period where so many spiral AI personas came to be). It sounds a bit fear-mongering to put it that way, but, it does seem plausible.
1. ^
  This post as a whole probably comes off as deeply unsympathetic to those suffering from AI psychosis or less-extreme forms of AI-induced bad beliefs. Treating mentally unwell individuals as bad actors isn’t nice. In particular, if someone has mental health issues, leaking it to the press would ordinarily be a quite bad way of handling things.
  In this case, as it has been described to me, it seems quite important to the public interest. Leaking it might not be the best way to handle it; perhaps there are better options; but it has the advantage of putting pressure on frontier labs.

abramdemski 21 Jul 2025 17:40 UTC
2 points
0
in reply to: β-redex’s comment on: Do confident short timelines make sense?
You’re right. I should have put computational bounds on this ‘closure’.

abramdemski 21 Jul 2025 17:36 UTC
4 points
0
in reply to: TsviBT’s comment on: Do confident short timelines make sense?
Yeah, I almost added a caveat about the physicalist thing probably not being your view. But it was my interpretation.
Your clarification does make more sense. I do still feel like there’s some reference class gerrymandering with the “you, a mind with understanding and agency” because if you select for people who have already accumulated the steel beams, the probability does seem pretty high that they will be able to construct the bridge. Obviously this isn’t a very crucial nit to pick: the important part of the analogy is the part where if you’re trying to construct a bridge when trigonometry hasn’t been invented, you’ll face some trouble.
The important question is: how adequate are existing ideas wrt the problem of constructing ASI?
In some sense we both agree that current humans don’t understand what they’re doing. My ASI-soon picture is somewhat analogous to an architect simply throwing so many steel beams at the problem that they create a pile tall enough to poke out of the water so that you can, technically, drive across it (with no guarantee of safety).
However, you don’t believe we know enough to get even that far (by 2030). To you it is perhaps more closely analogous to trying to construct a bridge without having even an intuitive understanding of gravity.

abramdemski 21 Jul 2025 17:10 UTC
2 points
0
in reply to: Cole Wyeth’s comment on: Do confident short timelines make sense?
Well, overconfident/underconfident is always only meaningful relative to some baseline, so if you strongly think (say) 0.001% is the right level of confidence, then 1% is high relative to that.
The various numbers I’ve stated during this debate are 60%, 50%, and 30%, so none of them are high by your meaning. Does that really mean you aren’t arguing against my positions? (This was not my previous impression.)

abramdemski 21 Jul 2025 17:03 UTC
2 points
0
in reply to: TsviBT’s comment on: Do confident short timelines make sense?
I recall it as part of our (unrecorded) conversation, but I could be misremembering. Given your reaction I think I was probably misremembering. Sorry for the error!
So, to be clear, what is the probability someone else could state such that you would have “something to say about it” (ie, some kind of argument against it)? Your own probability being 0.5% − 1% isn’t inconsistent with what I said (if you’d have something to say about any probability above your own), but where would you actually put that cutoff? 5%? 10%?

abramdemski 21 Jul 2025 16:58 UTC
6 points
0
in reply to: Nick_Tarleton’s comment on: Do confident short timelines make sense?
I guess it depends on what “a priori” is taken to mean (and also what “bridges” is taken to mean). If “a priori” includes reasoning from your own existence, then (depending on “bridge”) it seems like bridges were never “far off” while humans were around. (Simple bridges being easy to construct & commonly useful.)
I don’t think there is a single correct “a priori” (or if there is, it’s hard to know about), so I think it is easy to move work between this step and the next step in Tsvi’s argument (which is about the a posteriori view) by shifting perspectives on what is prior vs evidence. This creates a risk of shifting things around to quietly exclude the sort of reasoning I’m doing from either the prior or the evidence.
The language Tsvi is using wrt the prior suggests a very physicalist, entropy-centric prior, EG “steel beams don’t spontaneously form themselves into bridges”—the sort of prior which doesn’t expect to be on a planet with intelligent life. Fair enough, so far as it goes. It does seem like bridges are a long way off from this prior perspective. However, Tsvi is using this as an intuition pump to suggest that the priors of ASI are very low, so it seems worth pointing out that the priors of just about everything we commonly have today are very low by this prior. Simply put, this prior needs a lot of updating on a lot of stuff, before it is ready to predict the modern world. It doesn’t make sense to ONLY update this prior on evidence that pattern-matches to “evidence that ASI is coming soon” in the obvious sense. First you have to find a good way to update it on being on a world with intelligent life & being a few centuries after an industrial revolution and a few decades into a computing revolution. This is hard to do from a purely physicalist type of perspective, because the physical probability of ASI under these circumstances is really hard to know; it doesn’t account for our uncertainty about how things will unfold & how these things work in general. (We could know the configuration of every physical particle on Earth & still only be marginally less uncertain about ASI timelines, since we can’t just run the simulation forward.)
I can’t strongly defend my framing of this as a critique of step 2.1 as opposed to step 3, since there isn’t a good objective stance on what should go in the prior vs the posterior.

abramdemski 21 Jul 2025 16:20 UTC
4 points
−2
in reply to: Cole Wyeth’s comment on: Do confident short timelines make sense?
Numbers? What does “high confidence” mean here? IIRC from our non-text discussions, Tsvi considers anything above 1% by end-of-year 2030 to be “high confidence in short timelines” of the sort he would have something to say about. (But not the level of strong disagreement he’s expressing in our written dialogue until something like 5-10% iirc.) What numbers would you “only argue against”?

abramdemski 17 Jul 2025 20:55 UTC
2 points
1
in reply to: Noosphere89’s comment on: Do confident short timelines make sense?
It seems to me like the improvement in learning needed for what Gwern describes has little to do with “continual” and is more like “better learning” (better generalization, generalization from fewer examples).

abramdemski 17 Jul 2025 20:38 UTC
2 points
0
in reply to: Mateusz Bagiński’s comment on: Do confident short timelines make sense?
Straining the analogy, the mole-hunters get stronger and faster each time they whack a mole (because the AI gets stronger). My claim is that it isn’t so implausible that this process could asymptote soon, even if the mole-mother (the latent generator) doesn’t get uncovered (until very late in the process, anyway).
This is highly disanalogous to the AI safety case, where playing whack-a-mole carries a very high risk of doom, so the hunt for the mole-mother is clearly important.
In the AI safety case, making the mistake of going after a baby mole instead of the mole-mother is a critical error.
In the AI capabilities case, you can hunt for baby moles and look for patterns and learn and discover the mole-mother that way.
A frontier-lab safety researcher myopically focusing on whacking baby moles is bad news for safety in a way that a frontier-lab capabilities researcher myopically focusing on whacking baby moles isn’t such bad news for capabilities.