Signer

Karma: 628

Signer 27 Nov 2025 21:33 UTC
3 points
0
on: Alignment remains a hard, unsolved problem

Ensuring that you get good generalization, and that models are doing things for the right reasons, is easy when you can directly verify what generalization you’re getting and directly inspect what reasons models have for doing things. And currently, all of the cases where we’ve inadvertently selected for misaligned personas—alignment faking, agentic misalignment, etc.—are cases where the misaligned personas are easy to detect: they put the misaligned reasoning directly in their chain-of-thought, they’re overtly misaligned rather than hiding it well, and we can generate fake scenarios that elicit their misalignment.

But visible misalignment being easy to detect and correlated with misaligned chain-of-thought doesn’t guarantee that training that eliminates visible misalignment and misaligned chain-of-thought results in a model that does things for the right reasons, does it? The model can still learn unintended heuristics. And what’s the actual hypothesis about the model’s reasons when they appear to be right? That its learned reasoning algorithm is isomorphic to the reasoning algorithm of a helpful human who reads the same instructions, or what?

Signer 27 Nov 2025 20:05 UTC
2 points
0
on: The crux on consciousness

Let me put it this way then, how do you combine all of these tiny little microexperiences into a coherent macroexperience?

Microexperiences are unphysical—there are no electrons, only the global wavefunction. So you only have a decomposition problem. It is solved by weak illusionism: there is no real, fundamental, perfect isolation of qualia, just qualia of isolation. For every detailed description of the isolation of your qualia, there is either a non-contradicting physical description of an only approximately isolated part of reality, or your description is wrong—the same way a description of a chair works.

Yes, but I have a principled reason to special plead here. The complete description of the world is only complete from the third person perspective. It’s incomplete from a first person perspective because we need to explain the phenomenal character of consciousness.

I think it circles here? You started by justifying incompleteness by the inverted spectrum, received the objection about chairs being analogous, and then answered that the difference is in incompleteness. The problem is that the chair analogy is correct—the difference between blue and red is completely describable by physics. You only need an intrinsic property of existence for the whole universe to solve zombies. But you also need it for a chair to be real.

Of course, I don’t think many physicalists actually believe in structural relations all the way down.

Signer 25 Nov 2025 11:08 UTC
3 points
0
on: Thou art rainbow: Consciousness as a Self-Referential Physical Process

Conscious phenomenology should only arise in systems whose internal states model both the world and their own internal dynamics as an observer within that world. Neural or artificial systems that lack such recursive architectures should not report or behave as though they experience an “inner glow.”

What part of staring at a white wall without inner dialog and then later remembering it requires inner modeling at the moment of staring?

Internal shifts in attention and expectation can alter what enters conscious awareness, even when sensory input remains constant. This occurs in binocular rivalry and various perceptual illusions,17 consistent with consciousness depending on recursive self-modeling rather than non-cyclic processing of external signals.

But why would changing processing to non-cyclic result in experience becoming unconscious, instead of, I don’t know, conscious, but less filtered by attention?

And as usual, do you then consider any program that reads its own code to be conscious?

Signer 20 Nov 2025 17:17 UTC
3 points
0
on: Varieties Of Doom

(1 - conscious) * (1 - each_other) * (1 - care_other) * (1 - bored) * (1 - avoid_wireheading) * (1 - active_learning)

Wait, but a paperclipper is independent of all of these and of your arguments about them? A self-aware, distributed, coordinating paperclipper with loop prevention that creates real paperclips and learns things is still a paperclipper.

Signer 20 Nov 2025 17:11 UTC
1 point
0
in reply to: jdp’s comment on: Varieties Of Doom

But even if that’s the case the central consistently repeated version of the value loading problem in Bostrom 2014 centers on how it’s simply not rigorously imaginable how you would get the relevant representations in the first place.

I’m not so sure. Like, first of all, you mean something like “get before superintelligence” or “get into the goal slot”, because there is obviously a method to just get the representations—just build a superintelligence with a random goal, it will have your representations. That difference was explicitly stated then, it is often explicitly stated now—all that “AI will understand but not care”. The focus on the frameworks where it gets hard to translate from humans to programs is consistent with him trying to constrain methods of generating representations to only useful ones.

There is a reason why it is called “the value loading problem” and not “the value understanding problem”. “The value translation problem” would be somewhat in the middle: having an actual human utility program would certainly solve some of Bostrom’s problems.

I don’t know whether Bostrom actually thought about non-superintelligent AI that already understands but doesn’t care. But I don’t think this line of argument of yours is correct about why such a scenario contradicts his points. Even if he didn’t consider it, it’s not “contra”, unless it actually contradicts him. What actually may contradict him is not “AI will understand values early” but “AI will understand values early and training such early AI will make it care about the right things”.

Signer 20 Nov 2025 15:33 UTC
3 points
0
in reply to: jdp’s comment on: Varieties Of Doom
1. The fact we don’t do this to begin with heavily implies, almost as a necessary consequence really, that the representation of happiness which is a correct understanding of what we meant was not available at the time we specified what happiness is.

It depends on what you mean by “available”—we already had a representation of happiness in a human brain. And building corrigible AI that builds a correct representation of happiness is not enough—like you said, we need to point at it.

2. If you had a non superintelligent corrigible AI that builds a world model with a correct specification of happiness in it, you would use that specification.

If you can use it.

3. If Bostrom does not expect us to do this, that implies he does not expect us to build an AI that builds a correct representation of happiness until it is incorrigible or otherwise not able to be used to specify happiness for our superintelligent AI.

Yes, the key is “otherwise not able to be used”.

4. Therefore Bostrom expects we will not have an AI that correctly understands concepts like happiness until after it is already superintelligent.

No, unless by “correctly understands” you mean “have an identifiable representation that humans can use to program other AI”—he may expect that we will have an intelligence that correctly understands concepts like happiness while not yet being superintelligent (like we have humans, that are better at this than “maximize happiness”) but we still won’t be able to use it.

Signer 15 Nov 2025 15:18 UTC
1 point
0
in reply to: Tapatakt’s comment on: Tapatakt’s Shortform
What it needs is not simpler words, but explicit type-annotations.

Signer 15 Nov 2025 10:24 UTC
1 point
0
in reply to: Wei Dai’s comment on: “But You’d Like To Feel Companionate Love, Right? … Right?”
I get the possibility of the “Convergent” part, but what does your hope for the “True” part derive from? Or is it just “as True as true knowledge”, which still depends on who you want to know things and at what precision?

Also, what problems with consciousness and qualia are relevant here? It seems like maximizing hedonic experience is possible in either a dualist or an eliminativist universe.

I understand you want to be uncertain, but you still need a prior to not update from, right? And so just elevating every philosophical idea humans invented to feel good about themselves to plausibility doesn’t seem like the best strategy.

Signer 13 Nov 2025 21:39 UTC
1 point
0
on: Undissolvable Problems: things that still confuse me
The solution to #2 is that it is #1.

Signer 31 Oct 2025 15:45 UTC
1 point
0
in reply to: Wei Dai’s comment on: Wei Dai’s Shortform

We can punt on values for now

What’s wrong with just using AI for obvious stuff like curing death while you solve metaethics? I don’t necessarily disagree about the usefulness of people in the field changing their attitude, but more towards “the problem is hard, so we should not run CEV on day one”.

Signer 27 Oct 2025 11:47 UTC
26 points
20
on: On Fleshling Safety: A Debate by Klurl and Trapaucius.

I don’t deny the existence of some filters and selection pressures! I am saying that the filter you are pointing to, is not quantitatively strong enough and narrow enough to pinpoint only korrigibility as its singular outcome!

I think that’s the best wording of the disagreement I’ve seen. What would be better is to see a quantitative justification grounded in reality. Because as it stands Ezra Klein just says “looks strong enough to me”.

Signer 2 Oct 2025 7:13 UTC
1 point
0
in reply to: TAG’s comment on: [Question] What the discontinuity is, if not FOOM?
If all AIs are scheming, they can take over together. If a world with a powerful AI that is actually on humanity’s side is assumed instead, then at some level of power of the friendly AI you probably can run an unaligned AI and it will not be able to do much harm. But just assuming that there are many AIs doesn’t solve scheming by itself—if training actually works as badly as predicted, then none of the many AIs would be aligned enough.

Signer 1 Oct 2025 15:51 UTC
1 point
0
in reply to: TAG’s comment on: Beyond the Zombie Argument

Russellian monism struggles with Epiphenomenality: if the measurable, structural properties are sufficient to predict what happens, then the phenomenal properties are along for the ride.

I mean, it’s monism—it’s supposed to have only one type of stuff; obviously structural properties only work because of the underlying phenomenal/physical substrate.

furthermore, since mental states are ultimately identical to physical brain states, they share the causal powers of brain states (again without the need to posit special explanatory apparatus such as “psychophysical laws”), and in that way epiphenomenalism is avoided.

I don’t see how having two special maps has anything to do with a monistic ontology that enables causal closure. What’s the problem with just having a neutral-monistic ontology, like you say Dual-aspect neutral monism has, and using normal physical epistemology?

the epistemic irreducibility of the mental to the physical is also accepted.

Why? If ontologically there is only one type of stuff, then you can reduce the mental description to the physical one, because they describe one reality. The same way you reduce an old physical theory to a new one.

Signer 1 Oct 2025 15:21 UTC
1 point
0
in reply to: TAG’s comment on: [Question] What the discontinuity is, if not FOOM?

Why would that be discontinuous?

Because incremental progress missed deception.

I’m arguing against 99%

I agree such confidence lacks justification.

Signer 1 Oct 2025 13:33 UTC
1 point
−9
on: Beyond the Zombie Argument
I don’t think there is a need to qualify it as a potential solution—Russellian Monism just solves the Hard Problem.

Signer 1 Oct 2025 13:02 UTC
1 point
0
in reply to: TAG’s comment on: [Question] What the discontinuity is, if not FOOM?
I don’t think anyone is against incremental progress. It’s just that if AI takes over after incremental progress, then the alignment was not good enough. And what’s the source of confidence in it being enough?

“Final or nonexistent” seems to be appropriate for scheming detection—if you missed only one way for AI to hide its intentions, it will take over. So yes, the degree of scheming in the broad sense and how much you can prevent it is a crux, and other things depend on it. Again, I don’t see how you can be confident that future AI wouldn’t scheme.

Signer 1 Oct 2025 2:29 UTC
2 points
0
in reply to: boazbarak’s comment on: A non-review of “If Anyone Builds It, Everyone Dies”

I just think that it wouldn’t be the case that we had one shot but we missed it, but rather had many shots and missed them all.

This interpretation only works if by missed shots you mean “missed opportunities to completely solve alignment”. Otherwise you can observe multiple failures along the way and fix observable scheming, but you only need to miss one alignment failure at the last capability level. The point is just that your monitoring methods, even improved after many failures to catch scheming in the pre-takeover regime, are finally tested only when the AI really can take over. Because a real ability to take over is hard to fake. And if you failed, you can’t repeat this test after improving your monitoring. Maybe your alignment training after a previously observed failure in the pre-takeover regime really made the AI non-scheming. But maybe you just missed some short thought where the AI decided to not think about takeover since it can’t win yet. And you’ll need to rely on your monitoring without actually testing whether it can catch all such possibilities that depend on an actual environment that allows takeover.

Signer 1 Oct 2025 0:32 UTC
9 points
0
on: [Question] What the discontinuity is, if not FOOM?

if ASI is developed gradually, alignment can be tweaked as you go along.

The whole problem is that alignment, as in “AI doesn’t want to take over in a bad way”, is not assumed to be solved. So you think your alignment training works for your current version of pre-takeover ASI, but actually previous versions have already been scheming for a long time, so running a version capable of takeover creates what is, for you, a sudden discontinuity, where the ASI takes over because it now can. It means all your previous alignment work and scheming detection is finally tested when you run a version capable of takeover, and you can only fail once on this test. And training against scheming is predicted to not work and to just create stealthier schemers. And “AI can take over” is predicted to be hard to fake for the AI, so you can’t confidently check for scheming just by observing what it would do in a fake environment.

Signer 30 Sep 2025 22:37 UTC
8 points
8
on: Why Corrigibility is Hard and Important (i.e. “Whence the high MIRI confidence in alignment difficulty?”)

The technical intuitions we gained from this process, is the real reason for our particularly strong confidence in this problem being hard.

I don’t understand why anyone would expect such a reason to be persuasive to other people. Like, relying on illegible intuitions in matters of human extinction just feels crazy. Yes, certainty doesn’t matter, we need to stop either way. But still—is it even rational to be so confident when you rely on illegible intuitions? Why not check yourself with something more robust, like actually writing down your hypotheses and reasoning, and counting evidence? Surely there is something better than saying “I base my extreme confidence on intuitions”.

And it’s not only about corrigibility—“you don’t get what you train for” being a universal law of intelligence in the real world, or utility maximization, especially in the limit, being a good model of real things, or pivotal real-world science being definitely so hard you can’t possibly be distracted even once and still figure it out—everything is insufficiently justified.

Signer 21 Sep 2025 11:20 UTC
2 points
1
on: The title is reasonable

Once you do that, it’s a fact of the universe, that the programmers can’t change, that “you’d do better at these goals if you didn’t have to be fully obedient”, and while programmers can install various safeguards, those safeguards are pumping upstream and will have to pump harder and harder as the AI gets more intelligent. And if you want it to make at least as much progress as a decent AI researcher, it needs to be quite smart.

Is there a place where this whole hypothesis about deep laws of intelligence is connected to reality? Like, how hard do they have to pump? What exactly is the evidence that they will have to pump harder? Why can’t the “quite smart” point be reached while safeguards still work? Right now it’s no different from saying “the world is NP-hard, so ASI will have to try harder and harder to solve problems, and killing humanity is quite hard”.

If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications in the real world don’t make the problem easier.

If there were a natural shape for AIs that don’t wirehead, you might hope to find a simple mathematical reflection of that shape in toy models. So MIRI failing to find such a model means NNs are anti-natural. Again, what’s the justification for a significant update from MIRI failing to find a mathematical model?