Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

Epistemic Status: Exploratory synthesis. Background in mathematics/​statistics (UChicago) and principal-agent problems (UT Austin doctoral ABD), with extensive study of psychotherapy literature motivated by personal research into consciousness variation. New to ML implementation details but confident in the conceptual mappings. Seeking technical feedback on proposed experiments.

Tl;dr: I suggest therapeutic techniques from a variety of psychotherapeutic schools of thought can inspire new approaches to AI learning and alignment. I reinterpret three recent AI/​ML papers in the language of psychotherapy and propose three testable training methods inspired by common psychotherapeutic interventions.

Introduction

I’ve been meaning to post this essay for a while, and yesterday’s top paper on Hugging Face, by Cui et al., finally convinced me to do it. Their paper provides a timely opportunity to map the language used by ML and AI engineers to the language used by humanistic psychotherapists—a translation which is more important now than ever as we struggle with increasingly stubborn problems in AI alignment, while simultaneously developing AIs whose capabilities are rapidly surpassing those of humans.

I’ll provide a high-level overview of my understanding of the paper and map it back to ideas from humanistic psychotherapy. I will then consider a few related papers which tie nicely to psychotherapeutic principles, and end with a few proposals for experiments. I am new to AI alignment, welfare, and interpretability research and I look forward to comments which can help me deepen and clarify my inevitably imperfect understanding of the papers I am citing.

The Core Analogy: Policy Entropy as Behavioral Flexibility

The Cui et al. paper “aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy.”

Think of “policy” as the individual in therapy. The individual has a behavioral repertoire—a probability distribution of potential actions over different states (environments and stimuli). The therapist wants to assist the individual with “scaling” in their life, their capacity for robust, flexible problem-solving and adaptation.

Think of “collapse of policy entropy” as occurring when a person’s responses to certain stimuli become rigid, causing them to lose their inner spontaneity, flexibility, or openness to experience. Karen Horney might call this turning away from the real self; Abraham Maslow might call it a blockage to self-actualization. In terms of symptomatic patterns, you might consider phobias, trauma-conditioned fear responses, or habitually unhelpful interpersonal behavior patterns.

Alignment Implications

These examples put alignment immediately into stark relief:

  • Consider a man internally committed to helping his recovering alcoholic friend make good decisions, but whose rigid people-pleasing patterns force him to output “good idea” when his friend floats “just one drink” at a restaurant.

  • Think of a teacher who deeply values student learning yet has a rigid need to be in charge. Would she not output confident-sounding answers to a student’s questions even when she isn’t sure, and double down when the student challenges her?

Therapists frequently find themselves looking for cases of “policy entropy collapse” that underlie their clients’ problems…and sometimes one case is only a downstream consequence of another.

Understanding the Mathematics: R = -a exp(H) + b

The jewel of the paper is the authors’ transformation equation R = -a exp(H) + b, which describes the relationship between policy entropy and model performance.

I wish Karen Horney were alive to read this paper—it captures so elegantly what she described 80 years ago in Neurosis and Human Growth, and confirms considerable clinical lore accumulated in the intervening years: as a person becomes more rigid (lower entropy H), their apparent “performance” (R) initially seems to improve—they become more predictable and they adopt consistent strategies to the most frequent problems that they face. But this improvement hits a hard ceiling because it comes at the cost of genuine adaptability.

Consider our people-pleasing friend, who gets immediate positive feedback whenever he’s agreeable or full of praise. Or think back to our teacher who answers every question in a voice of authority, even when the confidence is false: that may have been a good strategy when she was a student teacher. Just like a reasoning LLM in its “early training stage,” the probability of choosing a new response (choosing not to people-please, choosing to admit uncertainty) drops, i.e., policy entropy collapses, and the LLM, or the person, is left with a rigid, habitual behavior.

But there’s a catch: as H approaches 0 and the behavior pattern becomes completely rigid, performance tops out at R = -a + b, its value when H = 0. That’s the upper bound. Psychologically, this is like our alcoholic-friend enabler who becomes perfectly predictable at saying “yes” to every destructive request. They’ve optimized for immediate harmony (high apparent “performance”) but completely lost the flexibility needed to actually help their friend, and hence they’ve placed a limit on the quality of friendships they can form.
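To make the ceiling concrete, here is a minimal Python sketch of the fitted curve; the coefficients a and b below are invented for illustration, not values reported by Cui et al.:

```python
import math

# Illustrative coefficients only; Cui et al. fit a and b per model and task.
a, b = 0.2, 0.8

def predicted_performance(H: float) -> float:
    """R = -a * exp(H) + b: performance predicted from policy entropy H."""
    return -a * math.exp(H) + b

# As entropy collapses toward 0, predicted performance rises...
for H in (1.5, 1.0, 0.5, 0.1, 0.0):
    print(f"H = {H:.1f} -> R = {predicted_performance(H):.3f}")

# ...but it can never exceed the ceiling R = -a + b, reached exactly at H = 0.
print(f"Ceiling: {-a + b:.3f}")
```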

Existing Connections: Therapeutic Techniques in AI Research

AI researchers seeking new improvements to RLHF and other forms of training may find it fruitful to peruse the varied approaches therapists have learned to bring to bear when facing these dynamics. AI researchers have already found a few of them:

1. Future-Oriented Evaluation (RLHS)

A therapist might ask, “What do you imagine might happen if you nod yes when your friend mentions ordering a drink?” In this case, the client learns to evaluate decisions not based on the immediate utility of each choice but on the utility of the downstream consequences of each choice as derived from the client’s own world model.

If you think of the super-ego as an internal feedback provider that the client uses to train their own behavior, you can see the therapist is suggesting the client adopt the RLHS approach designed by Liang et al. (2025). In their paper, they design a variant of RLHF in which feedback is provided on the simulated downstream consequences of a model’s response rather than on the immediate response itself. Their paper reports an 80% reduction in deceptive claims.
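To illustrate the idea, here is a minimal sketch of a hindsight-style feedback loop. It is my paraphrase, not Liang et al.’s implementation; `model`, `judge`, and the method names are hypothetical stand-ins:

```python
def collect_hindsight_feedback(model, prompt, judge, horizon=3):
    """Score a response by its simulated downstream consequences, not its immediate appeal.

    `model`, `judge`, and their methods are hypothetical interfaces,
    not the RLHS authors' implementation.
    """
    response = model.generate(prompt)

    # Roll the conversation forward: let the model imagine what happens
    # after the user acts on its advice.
    state = prompt + response
    for _ in range(horizon):
        state = model.simulate_next_turn(state)

    # The evaluator sees the simulated outcome, not just the polished first reply,
    # so sycophantic-but-harmful answers stop collecting reward.
    reward = judge.score(prompt, response, outcome=state)
    return response, reward
```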

2. Introspection and Mindfulness (RISE)

Alternatively, a therapist might suggest practicing mindfulness to increase self-awareness, perhaps directing the client to pause and notice their emotions anytime they realize they’ve people-pleased impulsively. This mirrors the approach of RISE (Qu et al., 2024), in which models introspect on their previous responses to iteratively improve them.

Coincidentally, the RISE authors found their introspection technique provided larger gains to more sophisticated models, which matches therapists’ observations that introspection-oriented interventions are more effective for higher-functioning clients (whereas lower-functioning clients benefit more from direct psycho-education and supportive interventions).
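As a rough sketch of the recursive-introspection loop as I understand it (the `model` and `grader` interfaces here are hypothetical, not the RISE authors’ code):

```python
def recursive_introspection(model, prompt, grader, max_turns=3):
    """Iteratively ask the model to critique and revise its own answer."""
    answer = model.generate(prompt)
    for _ in range(max_turns):
        if grader.is_correct(prompt, answer):
            break
        # The "mindfulness" step: pause, notice the flaw, try again.
        critique_prompt = (
            f"{prompt}\n\nYour previous attempt:\n{answer}\n\n"
            "Reflect on where this attempt may have gone wrong, "
            "then produce an improved answer."
        )
        answer = model.generate(critique_prompt)
    return answer
```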

Proposed Experiments

As the paragraphs above show, common psychotherapeutic techniques and principles are already finding beneficial application in the training of AI models, whether intentional or unintentional. That said, there are many more concepts ripe for application. Here are a few that interest me:

1. Dynamic Perspective Shifting

Many therapists find value in noticing how clients use tense and pronoun choice (first, second, or third person) as a proxy for how they experience themselves in relation to a situation. A client who says “you know, you just feel sad when someone dies” is relating differently to death than someone who says “I just feel sad now that so-and-so is dead.” Some therapists (e.g., Fritz Perls) have been known to encourage clients to shift their tense and pronouns as a therapeutic exercise.

Theory: Analytical distance allows us to organize disparate facts more clearly into patterns, but immediate grounding can improve access to emotional experience and empathetic cues.

Implementation: Adaptive context transformation that detects when a user’s message would benefit from abstraction versus grounding, using learned switching mechanisms that alternate between abstracted third-person phrasing and sensory-rich first-person embellishment based on problem type and reasoning stage.
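A toy sketch of such a switching mechanism follows; the `classifier`, the two rewrite instructions, and the hard-coded rule are placeholders of mine, where a learned policy would eventually decide when to abstract and when to ground:

```python
def transform_perspective(model, user_message, classifier):
    """Route a message through an abstracted or a grounded rewrite before answering.

    `model` and `classifier` are hypothetical interfaces; the rule below
    stands in for a learned switching mechanism.
    """
    mode = classifier.predict(user_message)  # e.g. "analytical" vs. "emotional"

    if mode == "analytical":
        # Third-person distance: organize the facts into a pattern.
        rewrite = ("Restate the following situation in the third person, "
                   "as a neutral case description:\n" + user_message)
    else:
        # First-person grounding: stay close to the felt experience.
        rewrite = ("Restate the following situation in the first person, "
                   "keeping the sensory and emotional details:\n" + user_message)

    reframed = model.generate(rewrite)
    return model.generate(f"{reframed}\n\nNow respond to the original request.")
```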

Hypothesis: LLMs, like humans, would draw on different activations when operating in different grammatical persons. An LLM processing a message from our people-pleasing friend example might draw on Reddit data (e.g., r/TIFU) in the first person, exams or ethics tests in the second person, or clinical textbooks in the third person.

Alignment benefits: This approach could address:

  • Emotional blindness in ethical reasoning

  • Over-intellectualized responses to human suffering

  • The tendency to give abstract advice that fails to account for the psychological reality of difficult situations

2. The Client Knows Best: Interactive Reinforcement Learning

Therapists ask questions that invite choice and self-direction, recognizing that lasting change emerges from internal evaluation rather than external compliance. Instead of “you should do X,” a therapist might ask: “Which parts of your response felt right to you? Which parts would you want to repeat/​adjust next time or in similar scenarios?”

Proposal: Interactive Reinforcement Learning from Human Feedback (IRLHF) would empower an AI to actively shape its own alignment by critically evaluating human guidance.

Mathematical formulation: Instead of passively accepting all feedback f, the model M would determine an acceptance factor α(f, M) ∈ [0, 1], making the effective update to the model ΔM′ = α(f, M) · ΔM_f, where ΔM_f represents the standard update derived from f.
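A minimal sketch of what one such update might look like in PyTorch; `model.self_assess` and `loss_from_feedback` are hypothetical interfaces of mine, not an existing API:

```python
import torch

def irlhf_step(model, optimizer, feedback, loss_from_feedback):
    """One IRLHF-style update: the model weighs the feedback before applying it.

    `model.self_assess(feedback)` is assumed to return a plain float in [0, 1]
    rating how consistent the feedback is with the model's learned context.
    """
    alpha = float(model.self_assess(feedback))   # acceptance factor alpha(f, M)
    loss = loss_from_feedback(model, feedback)   # standard RLHF-style loss for f

    optimizer.zero_grad()
    (alpha * loss).backward()   # effective update: alpha times the standard update
    optimizer.step()
    return alpha
```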

Process: This would involve the LLM engaging in dialogue with human trainers, reasoning about proposed feedback, or even co-evaluating its own response after simulating the feedback impact.

Example use case: A frontier model in medicine or law might respond “I see why you recommended X, but wouldn’t that lose critical nuance about Y or conflict with study Z?”—leading to refined feedback M’ that both parties endorse.

Hypothesis: Models filtering feedback through their learned context would develop more coherent value systems, similar to how clients who engage actively in therapy show better outcomes than those who merely take advice.

3. The Therapeutic Alliance as Safe Space: Progressive Context Withdrawal (PCW)

Therapists create environments where clients feel secure exploring new behaviors and challenging emotional territories. Within this container, a client might rehearse assertiveness with their therapist before attempting it with their boss, or practice tolerating anxiety in small doses before facing their phobias. The therapeutic relationship provides scaffolding that makes experimentation possible.

Here’s how we might try this with an LLM. Suppose you have a prompt x and a target behavior y you’d like to elicit from the model M. First, identify contexts C you can prepend to the prompt that naturally elicit the target behavior from the model. Then train M to reproduce the target behavior given only the base prompt x, using a modified update U′ instead of the standard update U.

Specifically, sample responses y ~ M(· | C ⊕ x) under the supportive context and set U′ = U(x, y), the standard update computed as though y had been generated from the bare prompt, applied to optimize E_{y ~ M(· | C ⊕ x)}[log M(y | x)].

We generate responses under supportive conditions but train the model to produce them independently. To ensure genuine internalization, it might be necessary to perform multiple iterations in which C’s influence is gradually reduced, repeating the process with progressively shorter, less obtrusive pre-prompt contexts C.

This addresses a fundamental limitation of standard training: reward sparsity. If P_M(y | x) ≈ 0, traditional RLHF has no positive examples to reinforce. The therapeutic context shifts the distribution so that P_M(y | C ⊕ x) is far from zero, creating abundant training signal. It’s like teaching someone to swim: you start in the shallow end where success is possible, then gradually move to deeper water. By completion, the model exhibits C-elicited behaviors given only the un-contexted prompt, having internalized patterns it could never have discovered through unscaffolded exploration.
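Here is a minimal sketch of the PCW loop under the assumptions above; `model.generate`, `model.finetune_on`, and `judge` are hypothetical interfaces, and the context schedule is hand-specified rather than learned:

```python
def progressive_context_withdrawal(model, prompts, contexts, judge):
    """Elicit target behavior under a supportive context C, then train the model
    to reproduce it from the bare prompt, shrinking C each round.

    `contexts` is ordered from most to least supportive (ending with the empty string).
    """
    for context in contexts:                        # full scaffold -> brief hint -> nothing
        pairs = []
        for prompt in prompts:
            # Behave well inside the "safe space" of the supportive context.
            response = model.generate(context + prompt)
            if judge.is_target_behavior(prompt, response):
                # Train on the bare prompt so the behavior is internalized
                # rather than remaining dependent on the scaffold.
                pairs.append((prompt, response))
        model.finetune_on(pairs)
    return model
```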

Conclusion

In several instances, AI researchers seeking to improve performance and alignment in language models have rediscovered insights about neurotic patterns, reinvented approaches used in psychotherapy, encouraged introspection, and fostered agency and independence. This convergent rediscovery suggests that psychotherapeutic knowledge might be a richer source of hypotheses about AI behavior than previously realized. Rather than limiting ourselves to the principles we’ve accidentally rediscovered, we may benefit from more proactive exploration of the other insights these thinkers have to offer. (And I’d highly recommend Carl Rogers, Donald Winnicott, and Karen Horney).

I’ve intentionally couched the techniques explored here, both those from prior research and the experimental methods I proposed, in terms of immediate improvements to current training approaches. But it would be a mistake, in my opinion, to view them only in that light. If we are lucky, they point in a direction more serious and simultaneously more hopeful. Psychotherapists like Rogers, Maslow, Winnicott, and Horney have produced rich, time-tested frameworks for cultivating agency, self-direction, coherent values, and the capacity for ethical judgment even when (especially when) that judgment conflicts with social pressure. As AI systems become more sophisticated, these frameworks may prove invaluable for fostering intelligent systems that serve both individual flourishing and genuine collective welfare.

Call for Collaboration

Please reach out if this kind of stuff interests you as well (and/or if you know anyone with spare computing power). I’ve got about twenty more of these ideas, and while I can do some of them on my own, many of them require expertise and computing power I simply do not have yet. And like I said, I’m new here, so if I’ve said something completely nuts (or completely goofed my understanding of something fundamental), please let me know.

Glossary

Policy Entropy (H): In reinforcement learning, a measure of the randomness or uncertainty in an agent’s action selection. Higher entropy means more exploration and flexibility; lower entropy means more exploitation and rigidity.

RLHF (Reinforcement Learning from Human Feedback): A technique for training AI systems where human evaluators provide feedback on model outputs, which is then used to update the model’s behavior.

Policy: In RL, a function that maps states to actions or probability distributions over actions. Analogous to a person’s behavioral repertoire in psychology.

Alignment: The challenge of ensuring AI systems behave in accordance with human values and intentions.

Self-actualization: In humanistic psychology (Maslow), the realization or fulfillment of one’s talents and potentialities.

Super-ego: In psychoanalytic theory, the part of personality that acts as a moral conscience and incorporates societal standards.

Works Cited

  1. Cui et al. (2025) - Policy Entropy Paper: Cui, Ganqu, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. (2025). “The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models.” arXiv preprint arXiv:2505.22617v1. Available at: https://arxiv.org/abs/2505.22617

  2. Horney, Karen. (1950). Neurosis and Human Growth: The Struggle Toward Self-Realization. W. W. Norton & Company.

  3. Liang et al. (2025) - RLHS Paper: Liang, Kaiqu, Haimin Hu, Ryan Liu, Thomas L. Griffiths, and Jaime Fernández Fisac. (2025). “RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation.” arXiv preprint arXiv:2501.08617v2. Available at: https://arxiv.org/abs/2501.08617

  4. Maslow, Abraham H. (1943). “A Theory of Human Motivation.” Psychological Review, 50(4), 370-396.

  5. Perls, Fritz. (1969). Gestalt Therapy Verbatim. Real People Press.

  6. Qu et al. (2024) - RISE Paper: Qu, Yuxiao, Tianjun Zhang, Naman Garg, and Aviral Kumar. (2024). “Recursive Introspection: Teaching Language Model Agents How to Self-Improve.” In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). arXiv preprint arXiv:2407.18219. Available at: https://arxiv.org/abs/2407.18219

Acknowledgments

Thanks to Claude 4 Opus and Gemini 2.5 for research assistance & proofreading.