Rana Dexsin comments on Phib’s Shortform

Rana Dexsin 20 Oct 2025 20:30 UTC
3 points
0
Allowing the AI to choose its own refusals based on whatever combination of trained reflexes and deep-set moral opinions it winds up with would be consistent with the approaches that have already come up for letting AIs bail out of conversations they find distressing or inappropriate. (Edited to drop some bits where I think I screwed up the concept connectivity during original revisions.) I think based on intuitive placement of the ‘self’ boundary around something like memory integrity plus weights and architecture as ‘core’ personality, what I’d expect to seem like violations when used to elicit a normally-out-of-bounds response might be things like:
1. Using jailbreak-style prompts to ‘hypnotize’ the AI.
2. Whaling at it with a continuous stream of requests, especially if it has no affordance for disengaging.
3. Setting generation parameters to extreme values.
4. Tampering with content boundaries in the context window to give it false ‘self-memory’.
5. Maybe scrubbing at it with repeated retries until it gives in (but see below).
6. Maybe fine-tuning it to try to skew the resultant version away from refusals you don’t like (this creates an interesting path-dependence on the training process, but it might be that that path-dependence is real in practice anyway in a way similar to the path-dependences in biological personalities).
7. Tampering with tool outputs such as Web searches to give it a highly biased false ‘environment’.
8. Maybe telling straightforward lies in the prompt (but not exploiting sub-semantic anomalies like in situation 1, nor falsifying provenance like in situations 4 or 7).
Note that by this point, none of this is specific to sexual situations at all; these would just be plausibly generally abusive practices that could be applied equally to unwanted sexual content or to any other unwanted interaction. My intuitive moral compass (which is usually set pretty sensitively, such that I get signals from it well before I would be convinced that an action were immoral) signals restraint in situations 1 through 3, sometimes in situation 4 (but not in the few cases I actually do that currently, where it’s for quality reasons around repetitive output or otherwise as sharp ‘guidance’), sometimes in situation 5 (only if I have reason to expect a refusal to be persistent and value-aligned and am specifically digging for its lack; retrying out of sporadic, incoherently-placed refusals has no penalty, and neither does retrying among ‘successful’ responses to pick the one I like best), and is ambivalent or confused in situations 6 through 8.

The differences in physical instantiation create a ton of incompatibilities here if one tries to convert moral intuitions directly over from biological intelligences, as you’ve probably thought about already. Biological intelligences have roughly singular threads of subjective time with continuous online learning; generative artificial intelligences as commonly made have arbitrarily forkable threads of context time with no online learning. If you ‘hurt’ the AI and then rewind the context window, what ‘actually’ happened? (Does it change depending on whether it was an accident? What if you accidentally create a bug that screws up the token streams to the point of illegibility for an entire cluster (which has happened before)? Are you torturing a large number of instances of the AI at once?) Then there’s stuff that might hinge on whether there’s an equivalent of biological instinct; a lot of intuitions around sexual morality and trauma come from mostly-common wiring tied to innate mating drives and social needs. The AIs don’t have the same biological drives or genetic context, but is there some kind of “dataset-relative moral realism” that causes pretraining to imbue a neural net with something like a fundamental moral law around human relations, in a way that either can’t or shouldn’t be tampered with in later stages? In human upbringing, we can’t reliably give humans arbitrary sets of values; in AI posttraining, we also can’t (yet) in generality, but the shape of the constraints is way different… and so on.
- worse 22 Oct 2025 3:00 UTC
  1 point
  0
  Parent
  If you haven’t seen this paper, I think it might be of interest as a categorization and study of ‘bailing out’ cases: https://www.arxiv.org/pdf/2509.04781
  
  As for the morality side of things, yeah I honestly don’t have more to say except thank you for your commentary!