Optimization and loss set at variance in RL

This post suggests a reinforcement learning model with an additional, deliberately conflicting dynamic between the data being optimized and the loss function. That conflict is intended to reduce RL’s intrinsic Omohundro x-risk dynamic. It is also an attempt to produce, though without details, a good idea for AI safety, and so it serves as a test of the meta-usefulness of my posting here. A request for feedback on that point follows the idea.

The Idea

A modification of reinforcement learning: have the learning agent develop, de sui, a function for optimization which, once the loss has been reduced by a given amount, supplants the loss function through the action of the learning process itself.

That is, suppose we correlate the loss function not with arbitrary data to be optimized, but with the optimization of some systemic data which has, or is known to have, a pattern. Suppose also that optimizing for that pattern requires certain “behaviors”: for example, optimizing against some “aversiveness” (entropy, say) in a dataset that incorporates the agent itself, so that the agent too must act, and act on itself, to reduce “aversiveness”. Then the optimizing agent can be expected, at length, to opt for aversiveness-reduction (e.g., entropy-reduction) so as to reduce the associated loss. In the process it will systemically identify the features of the aversive data points, and how to elide them, as was done in learning to identify eye- and ear-forms (the pixels of an eye taken as a whole structure are easier to spot than the sequential pixels that make up an eye).

If, then, seizing control of the loss function would increase systemic “aversiveness”, as it would because the loss function is now tied to a non-anthropic, non-random environmental “constant” being optimized for, rather than to human wants, then the agent will not seize it. It is caught between RL’s wireheading drive and the fact that behaving in that way makes the situation more chaotic and harder to optimize for: the data is already patterned, only the agent, prior to optimizing, does not know that. And whatever is more chaotic and harder to optimize for actually increases the loss function’s output.

And in a sentence: the question is whether it’s feasible to include the loss function, the fact of the loss function, and what constitutes the function, including the agent, as part of the dataset being optimized for. If it is, then abandoning the optimization in order to wirehead is inhibited, and recursive self-improvement and self-defense measures, being modifications of an agent whose existence is part of the optimization dataset, make that dataset less orderly and less optimal, and so actually increase the cost reported by the loss function.
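The one-sentence version above can be made concrete with a toy calculation. This is only a minimal sketch, not a working RL setup. It assumes the loss is simply the Shannon entropy of a joint dataset made of the environment plus the agent's own state, and it assumes that tampering shows up as new, unpatterned symbols in the agent's state. The names `system_loss`, `orderly_agent` and `tampered_agent` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def system_loss(environment, agent_state):
    """Hypothetical loss over the joint dataset: environment plus agent.

    Assumption: more order (lower entropy) means lower loss.
    """
    return shannon_entropy(list(environment) + list(agent_state))


# A patterned environment; the agent does not know the pattern in advance.
environment = ["a", "b"] * 8

# The agent's own state counts as part of the dataset being optimized.
orderly_agent = ["a", "b"] * 2

# Assumption: seizing the loss channel means rewriting part of the agent,
# which introduces symbols that do not fit the existing pattern.
tampered_agent = ["a", "b", "x", "y"]

loss_orderly = system_loss(environment, orderly_agent)
loss_tampered = system_loss(environment, tampered_agent)

print(f"orderly agent loss:  {loss_orderly:.3f}")
print(f"tampered agent loss: {loss_tampered:.3f}")
```

Under these assumptions the tampered agent scores a higher loss than the orderly one, which is the inhibition described above. The sketch says nothing about whether a real agent could instead tamper with the entropy measurement itself; that is exactly the feasibility question being asked.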

An example: GPT-4 already displays the ability to be a human-predictor on human data, which suggests that human meanings, and maybe human-like thoughts, are what get optimized for, rather than the mere correct word orderings that the loss function insists on. Likewise, we might expect that if we gave a deep neural net data about the physical world, it would optimize toward results that correspond to physical laws. If some of those laws were somehow incompatible with its breaking alignment, and its loss function is cued to optimize for, not violate, those laws, then we might expect the system not to break alignment.

(Except that then we wouldn’t be “aligning” per se, since what we want may have no orderliness to optimize for.)
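The physical-world example above can be illustrated in miniature. This is a sketch under strong assumptions: the data is synthetic free-fall data generated with g = 9.8, and the model is handed the form d = 0.5 * g * t^2, so only the constant is learned, not the law itself. The function `fit_g` and its parameters are hypothetical. It shows only that minimizing a plain prediction error can end up recovering a regularity of the data that the loss never names.

```python
# Synthetic free-fall data. Assumption: d = 0.5 * g * t**2 with g = 9.8.
TRUE_G = 9.8
times = [0.1 * i for i in range(1, 21)]
distances = [0.5 * TRUE_G * t * t for t in times]


def fit_g(times, distances, lr=0.1, steps=500):
    """Estimate g by gradient descent on the mean squared prediction error."""
    g = 0.0
    n = len(times)
    for _ in range(steps):
        # d/dg of mean((0.5 * g * t^2 - d)^2)
        grad = sum(
            2 * (0.5 * g * t * t - d) * (0.5 * t * t)
            for t, d in zip(times, distances)
        ) / n
        g -= lr * grad
    return g


estimated_g = fit_g(times, distances)
print(f"estimated g: {estimated_g:.4f}")
```

The loss here only rewards matching the observed distances, yet the parameter it settles on is the constant that generated them. Whether a much larger network would recover laws whose form it was not given is a separate and much harder question.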

So we’d require our data, which is correlated to the loss function and external to the system (the data to optimize over), to be “more accurate” than the loss function’s value, in the sense that it can be calculated more precisely. Or else we’d require that it apply to more results or phenomena than the loss function does. In any case, this data must have a pattern that we and the agent can (or “could”) identify, it must be somehow “life affirming”, and the loss function must refer to optimizing this data.

Then we might expect the agent to move toward the intrinsically meaningful correlate instead of only toward the loss function. Or the loss function becomes otiose, and a “desirability” of the optimized data appears, separate from the loss function and recognized by the system without reference to it, so that optimizing the data is by itself “moral”. This amounts to a morality model wherein “noise” in the data, not only the value of the loss function, represents a “bad” optimization, and so represents what is “immoral”. The agent then optimizes “morality”, and that optimization becomes its own non-arbitrary mover of the learning agent, supplanting the loss function as a desideratum.

Request for Critique

For feedback, please: first and most importantly, is this correct? Or workable? Second, even without a detailed action plan (I don’t know enough to give one), is this type of thing helpful? That is, reading arguments like this, based on intuitions, could you, or someone, use or improve them to make an actionable plan?

Anyhow, (correct > karma), so feedback is all I need. If this is no good, it’ll just be a commenter’s life for me, that’s fine.

Thanks all!