CS PhD student
Abhimanyu Pallavi Sudhir
How does this work with agents that are composed of sub-agents (and whose “values” are thus a composite of the sub-agents’ values)?
Ah right, got it. But why wouldn’t they just agree to merge into the EUM represented by the tangent to that flat region itself?
I like this post and agree with the idea of agency/intelligence emerging coalitionally. However, I disagree with the technical points:
-
I don’t agree with the thesis that “coalitional agents are incentive-compatible decision procedures”. Any mechanism is trivially equivalent to an incentive-compatible one by simply counting the agents’ decision-making procedures as part of the mechanism (the revelation principle)—so the statement seems vacuous in the forward direction, and false in the reverse. I can understand the intuition that “if everyone is lying to one another, they don’t form a coalitional agent”—however, this is better captured in terms of transaction costs (see below).
-
Your dismissal of linear-weights social welfare functions seems misguided, in that it confuses mechanisms with outcomes. For example, you claim that “the only way to find out another agent’s utilities is to ask them, and they could just lie” and that “EUMs constructed by taking a weighted average of subagents’ utilities are not incentive-compatible”. But markets, under some assumptions (no transaction costs, no information asymmetry, perfect competition), are Pareto-efficient by the First Fundamental Theorem of Welfare Economics. I believe it is the presence of such a mechanism that allows a collection of agents to behave as a single coalitional agent; and it is internal frictions (transaction costs, information asymmetry, etc.) that reduce the “agent-like-ness” or rationality of this coalition.
I think the natural way to model coalitional agents really is as decision procedures that produce Pareto-efficient outcomes—of which linearly-weighted utility functions arise as a special case where you assume the coalition preserves the risk-aversion level of its parts: https://www.lesswrong.com/posts/L2gGnmiuJq7FXQDhu/total-utilitarianism-is-fine.
The actual limitation of this model that I find interesting is that it only tells you how to aggregate expected utility maximizers, not boundedly-rational agents. Some ideas for how we might generalize it:
Something to do with active inference
We may model boundedly-rational agents as markets with transaction costs/as imperfect aggregations of their sub-agents. If a coalition of such agents does not introduce any new transaction costs/frictions, it is as much an agent as its parts.
Something to do with extrapolated volition.
-
We’re talking about outcomes, not mechanisms. Of course you have to design a mechanism that actually achieves a Pareto-optimal outcome/maximizes total utility—nobody argues that “just ask people to report their utilities” is the mechanism to do this. This remains the same whether for total utilitarianism or geometric rationality or anything else.
E.g. markets (under assumptions of perfect competition, no transaction costs and no information asymmetry) maximize a linearly weighted sum of utilities.
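For concreteness, here is the standard result I have in mind (a sketch in my own notation, not anything from the post): at a competitive equilibrium, the allocation maximizes a particular linearly weighted sum of utilities, with weights given by the inverse marginal utilities of wealth (Negishi weights).

```latex
% Sketch, my notation: a competitive equilibrium allocation x* is Pareto-efficient,
% and under convexity it solves a linear-weights planning problem, so "linearly
% weighted utility" describes the *outcome* a frictionless market implements, not
% a mechanism of asking agents to report their utilities.
\[
  x^* \;\in\; \arg\max_{x \,\text{feasible}} \; \sum_i \lambda_i\, u_i(x_i),
  \qquad
  \lambda_i \;=\; \frac{1}{\partial v_i / \partial w_i}
  \quad \text{(inverse marginal utility of wealth of agent } i\text{)}.
\]
```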
I agree, but this seems wrong:
If the point they want to hit is in a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; and if it’s curvy at that point, the merge will be deterministic.
The only time the merge will involve coinflips is if there are multiple tangent lines at that point—then the weights of any tangent line can be the weights of the EUM. Maybe you meant the reverse: if the frontier is flat at the point, then the merged EUM agent is indifferent to any of the points on that flat bit.
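To spell out the geometry (a sketch in my own notation): a merged EUM with weights $(w_1, w_2)$ picks a point of the feasible frontier maximizing the weighted sum of the subagents’ expected utilities.

```latex
% Sketch, my notation: \mathcal{F} is the feasible set of expected-utility pairs.
\[
  (u_1^*, u_2^*) \;\in\; \arg\max_{(u_1, u_2) \in \mathcal{F}} \; w_1 u_1 + w_2 u_2 .
\]
% If the frontier is flat with slope -w_1/w_2 along a segment, every point of that
% segment attains the maximum, so the merged EUM is simply indifferent among them.
% If instead the frontier has a kink at the target point, that single point is
% supported by a whole range of weight vectors (tangent lines), any of which can
% serve as the weights of the merged EUM.
```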
Thoughts are things occurring in some mental model (this is a vague sentence but just assume it makes sense). Some of these mental models are strongly rooted in reality (e.g. the mental model we see as reality) and so we have a high degree of confidence about their accuracy. But for things like introspection, we do not have a reliable ground-truth feedback to tell us if our introspection is correct or not—it’s just our mental model of our mind, there is no literal “mind’s eye”.
So often our introspection is wrong. E.g. if you ask someone to visualize a lion from behind, they’ll say they can, but if you ask them for details, like “what do the tail hairs look like?”, they can’t answer. Or, a better example: if you ask someone to visualize a neural network, they will, but if you ask “how many neurons do you see?” they will not know, and not for lack of counting. Or they will say they “think in words” or that their internal monologue is fundamental to their thinking, but that’s obviously wrong: you have already decided what the rest of the sentence will be before you’ve thought the first word.
We can tell some basic facts about our thinking by reasoning from observation. For example, if you have an internal monologue (or just force yourself to have one) then you can confirm that you indeed have one by speaking the words of the internal monologue out loud and confirming that it took very little cognitive effort (so you didn’t have to think them again). This proves an internal monologue/precisely simulating words in your head is possible. Likewise for any action.
Or you can confirm that you had a certain thought, or a thought about something, because you can express it out loud with less effort than otherwise. Though here there is still room for that thought to have been imprecise; unless you verbalize or materialize those thoughts you don’t know if they were really precise. So all these things have grounding in reality, and therefore are likely to be (or can be trained to be, by consistently materializing them) accurate models. By “materialize” I mean, e.g., actually solving a math problem that you believe (in your head) you can solve.
I’m saying that for a sufficiently advanced AI, the expected value of its best non-compliant option will always be far, far greater than the expected value of its best compliant action.
I don’t really understand what problem this is solving. In my view the hard problems here are:
how do you define legal personhood for an entity without a typical notion of self/personhood (i.e. what Mitchell Porter said) or interests
how do you ensure the AIs keep their promise in a world where they can profit far more from breaking the contract than from whatever we offer them
Once you assume away the former problem and disregard the latter, you are of course only left with basic practical legal questions …
matter of taste for fiction; but objectively bad for technical writing
So I’m learning & writing on thermodynamics right now, and often there is a distinction between the “motivating questions”/”sources of confusion” and the actually important lessons you get from exploring them.
E.g. a motivating question is ”… and yet it scalds (even if you know the state of every particle in a cup of water)” and the takeaway from it is “your finger also has beliefs” or “thermodynamics is about reference/semantics”.
The latter might be a more typical section heading as it is correct for systematizing the topic, but it is a spoiler. Whereas the former is better for putting the reader in the right frame/getting them to think about the right questions to initiate their thinking.
I’m talking about technical writing/explanations of things.
An unfortunate thing about headings is that they are spoilers. I like the idea of a writing style where headings come at the end of sections rather than at the start. Or even a “starting heading” which is a motivating question and an “ending heading” which is the key insight discovered …
Analogous to a “reverse mathematics” style of writing, where motivation precedes theory and proofs precede theorems.
edited to clarify: I’m talking about technical writing; I don’t care about fiction.
homomorphisms and entropy
One informal way to think of homomorphisms in math is that they are maps that do not “create information out of thin air”. Isomorphisms further do not destroy information. The terminal object (e.g. the trivial group, the singleton topological space, or the trivial vector space) is the “highest-entropy state”, where all distinctions disappear and reaching it is heat death.
-
Take, for instance, the group homomorphism $\mathbb{Z} \to \mathbb{Z}/4\mathbb{Z}$ (reduction mod 4). Before it was applied, “1” and “5” were distinguished: 2 + 3 = 5 was correct, but 2 + 3 = 1 was wrong. Upon applying this homomorphism, this information disappears—however, no new information has been created, that is: no true indistinctions (equalities) have become false.
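A toy illustration of this (my own throwaway code, assuming the mod-4 reading above): the map merges distinctions but never invalidates a true equality.

```python
# Toy illustration of the mod-4 homomorphism above (names are mine).

def f(n: int) -> int:
    return n % 4  # group homomorphism Z -> Z/4Z

# Before: 1 and 5 are distinguished (2 + 3 = 5, not 1).
assert 2 + 3 == 5 and 2 + 3 != 1

# After: that distinction is destroyed, i.e. information is lost.
assert f(1) == f(5)

# But no new information is created: every true equality stays true,
# since f respects the group operation.
for a in range(-20, 20):
    for b in range(-20, 20):
        assert f(a + b) == (f(a) + f(b)) % 4
```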
-
Similarly in topology, “indistinction” is “arbitrary closeness”. Wiggle-room (aka “open sets”) is information, it cannot be created from nothing. If a set or sequence goes arbitrarily close to a point, it will always be arbitrarily close to that point after any continuous transformations.
-
There is no information-theoretical formalization of “indistinction” on these structures, because this notion is more general than information theory. In the category of measurable spaces, two points in the sample space are indistinct if they are not distinguished by any measurable set—and measurable functions are not allowed to create measurable sets out of nothing.
(there is also an alternate, maybe dual/opposite analogy I can make based on presentations—here, the highest-entropy state is the “free object”, e.g. a discrete topological space or free group, and each constraint/relation imposed on it is information—morphisms are “observations”. In this picture we see knowledge as encoded by identities rather than distinctions—we may express our knowledge as a presentation, i.e. a list of generators and relations, and morphisms cannot be concretely understood as functions on sets but rather show a tree of possible outcomes, like maybe you believe in Everett branches or whatever.)
In general if you postulate:
… you live on some object in a category
… time-evolution is governed by some automorphism
… you, the observer, have beliefs about your universe and keep forgetting some information (“coarse-grains the phase space”) --- i.e. your subjective phase space is also an object in that category, which undergoes homomorphisms
Then the second law is just a tautology. The second law we all know and love comes from taking the universe to be a symplectic manifold, and time-evolution as governed by symplectomorphisms. And the point of Liouville’s theorem is really to clarify/physically motivate what the Jaynesian “uniform prior” should be. Here is some more stuff, from Yuxi Liu’s statistical mechanics article:
In almost all cases, we use the uniform prior over phase space. This is how Gibbs did it, and he didn’t really justify it other than saying that it just works, and suggesting it has something to do with Liouville’s theorem. Now with a century of hindsight, we know that it works because of quantum mechanics: We should use the uniform prior over phase space, because phase space volume has a natural unit of measurement: $h^n$, where $h$ is Planck’s constant, and $2n$ is the dimension of phase space. As Planck’s constant is a universal constant, independent of where we are in phase space, we should weight all of the phase space equally, resulting in a uniform prior.
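For reference, the statement of Liouville’s theorem I am leaning on here (the standard one, in my notation): Hamiltonian time evolution preserves phase-space volume, so the uniform density is a stationary distribution of the flow.

```latex
% Liouville's theorem (sketch, my notation): for a phase-space density \rho(q,p,t)
% evolving under a Hamiltonian H,
\[
  \frac{d\rho}{dt} \;=\; \frac{\partial \rho}{\partial t} + \{\rho, H\} \;=\; 0 ,
\]
% i.e. \rho is constant along trajectories. In particular \rho \equiv \text{const}
% is preserved by the flow, which is what singles out the uniform (Liouville)
% measure as the natural "ignorance" prior, with $h^n$ fixing its overall scale.
```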
-
No; I mean a standard Bayesian network wouldn’t work for latents.
Bayesian networks support latent variables, and so allowing general Bayesian networks can be considered a strict generalization of allowing latent variables, as long as one remembers to support latent variable in the Bayesian network implementation.
Correct me if I’m wrong, but I believe this isn’t necessarily true.
The most general Bayesian network prediction market implementation I’m aware of is the SciCast team’s graphical-model market-maker. Say a trader bets up a latent variable, and this correctly increases the probability of its child variables (which all resolve True).
Under your model you would (correctly, IMO) reward the trader for this, because you are scoring them for the impact they have on the resolved variables. But under their model, another trader can come and completely flip the latent variable’s probability, while also adjusting each conditional probability, without affecting the overall score of the model, but screwing over the first trader completely, because the first trader just owns some stocks which are now worth much less.
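A toy numerical illustration of what I mean (hypothetical numbers, nothing to do with SciCast’s actual implementation): two different settings of the latent and its conditionals give the same marginal over the observed child, hence the same model score, but opposite values for shares in the latent.

```python
# Toy illustration (hypothetical numbers): flipping a latent X while re-adjusting
# the conditionals leaves the marginal of the observed child C unchanged.

def marginal_child(p_x, p_c_given_x1, p_c_given_x0):
    """P(C=1) in a two-node network X -> C."""
    return p_x * p_c_given_x1 + (1 - p_x) * p_c_given_x0

# First trader bets the latent up: P(X=1) = 0.9
before = marginal_child(0.9, 0.8, 0.1)   # P(C=1) = 0.73

# Second trader flips X and swaps the conditionals accordingly
after = marginal_child(0.1, 0.1, 0.8)    # P(C=1) = 0.73 again

assert abs(before - after) < 1e-12
# The resolved child is predicted identically, so the model's score is unchanged,
# but the first trader's shares in "X = 1" are now worth far less.
```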
Articles (or writing in general) are probably best structured as a directed acyclic graph (DAG), rather than linearly. At each point in the article, there may be multiple possible lines to pursue, or “sidenotes”.
I say “directed acyclic graph” rather than “tree” because it may be natural to think of threads as joining back together at some point, especially if certain threads are optional.
One may also construct an “And-Or tree” to allow multiple versions of the article preferred by conflicting writers, which may then be voted on with some mechanism. These votes can be used to assign values to each vertex, and people can read the tree with their own search algorithm*.
A whole wiki may be constructed as one giant DAG, with each article being sub-components.
*well, realistically nobody would actually just be following a search algorithm blindly/reading a linear article linearly (since straitjacketing yourself with prerequisites is never a good idea), but you know, as a general guide to structure.
(idea came from LLM conversations, which often take this form—of pursuing various lines of questioning then backtracking to a previous message)
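A minimal sketch of the kind of structure I have in mind (all names and fields are hypothetical, just to make the idea concrete):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)  # possible lines to pursue
    optional: bool = False   # sidenotes / skippable threads
    votes: float = 0.0       # from some voting mechanism, used to weight paths
    kind: str = "and"        # "and": read all children; "or": alternative versions

def linearize(node: Node, threshold: float = 0.0) -> list[str]:
    """One possible reading 'search algorithm': depth-first, taking the
    highest-voted branch at Or-nodes and skipping low-voted optional threads.
    (In a real DAG you would also deduplicate nodes reached via multiple paths.)"""
    out = [node.text]
    kids = node.children
    if node.kind == "or" and kids:
        kids = [max(kids, key=lambda c: c.votes)]  # the voted-on version wins
    for child in kids:
        if child.optional and child.votes < threshold:
            continue  # skip this sidenote
        out.extend(linearize(child, threshold))
    return out
```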
“What do you gain from smalltalk?” “I learned not to threaten to nuke countries.”
Lmao, amazing.
we’ll elide all of the subtle difficulties involved in actually getting RL to work in practice
I haven’t properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.
The RL setup itself is straightforward, right? An MDP where $S$ is the space of strings, $A$ is the set of strings of fewer than $n$ tokens, the transition deterministically appends the action ($s' = \mathrm{append}(s, a)$), and reward is given to states containing a stop token, based on some ground-truth verifier like unit tests or formal verification.
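A minimal sketch of that MDP, just to make the claim concrete (the class, the stop-token handling, and the verifier are hypothetical placeholders, not any particular codebase’s API):

```python
from dataclasses import dataclass
from typing import Callable

STOP_TOKEN = "<stop>"  # placeholder

@dataclass
class StringMDP:
    verifier: Callable[[str], float]  # ground-truth check, e.g. runs unit tests
    max_action_tokens: int = 128      # actions are strings of fewer than n tokens

    def step(self, state: str, action: str) -> tuple[str, float, bool]:
        assert len(action.split()) < self.max_action_tokens
        next_state = state + action          # deterministic append: s' = append(s, a)
        done = STOP_TOKEN in next_state      # terminal once a stop token appears
        reward = self.verifier(next_state) if done else 0.0
        return next_state, reward, done
```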
The third virtue of rationality, lightness, is wrong. In fact: the more you value information on some question, the more obstinate you should be about changing your mind on that question. Lightness implies disinterest in the question.
Imagine your mind as a logarithmic market-maker which assigns some initial subsidy $b$ to any new question. This subsidy parameter $b$ captures your marginal value for information on that question. But it also measures how hard it is to change your mind: the cost of moving your probability from $p$ to $q$ scales linearly with $b$ (under the logarithmic market scoring rule, a trader who moves the market from $p$ to $q$ is paid $b\log\frac{q}{p}$ if the event occurs and loses $b\log\frac{1-p}{1-q}$ if it does not).
What would this imply in practice? It means that each individual “trader” (both internal mental heuristics/thought patterns, and external sources of information/other people) will generally have a smaller influence on your beliefs, as they may not have enough wealth. Traders who do influence your belief will carry greater risk (to their influence on you in future), though they will also earn more reward if they’re right.
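A small sketch of the arithmetic (my own toy code, under the binary-LMSR framing above): the capital at risk for a given belief move scales with $b$.

```python
import math

# Toy sketch of the binary logarithmic market scoring rule framing above.
# b is the subsidy / liquidity parameter of the question.

def worst_case_loss(p: float, q: float, b: float) -> float:
    """Loss of a trader who moves the probability up from p to q and is wrong."""
    return b * math.log((1 - p) / (1 - q))

def profit_if_right(p: float, q: float, b: float) -> float:
    """Payout to that trader if the event does occur."""
    return b * math.log(q / p)

for b in (1.0, 10.0):
    print(b, round(worst_case_loss(0.5, 0.9, b), 2), round(profit_if_right(0.5, 0.9, b), 2))
# b=1:  risk ~1.61, reward ~0.59
# b=10: risk ~16.09, reward ~5.88
# With higher b, the same belief move puts 10x the wealth at risk, so any single
# trader (internal heuristic, or other person) finds it harder to move your belief.
```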
Here are some of my claims about AI welfare (and sentience in general):
“Utility functions” are basically internal models of reward that are learned by agents as part of modeling the environment. Reward attaches values to every instrumental thing and action in the world, which may be understood as gradients $dU/dx$ of an abstract utility $U$ over each thing $x$—these are the various pressures, inclinations, and fears (or “shadow prices”), while $U$ is the actual experienced pleasure/pain that must exist in order to justify belief in these gradients.
If the agent always acts to maximize its utility—what then are pleasure/pain? Pleasure must be the default, and pain only a fear, a shadow price of what would happen if the agent deviated from its path. What makes this not so, is uncertainty in the environment.
But if chance is the only thing that affects pleasure/pain, what is the point of pleasure/pain? Surely we have no control over chance. That is why sentience depends on the ability to affect the environment. Animals find fruits pleasurable because they can actually act on that desire and seek them out—they find thorns painful because they can act on that aversion and avoid them. The more impact an agent learns (during its training) it can have on its environment, the more sentient it is.
The computation of pleasure and pain may depend on multiple “sub-networks” in the agent’s mind. Eating unhealthy food may cause both pleasure (from the monkey brain) and pain (from the more long-termist brain). These various pleasures and pains balance out in action, but they are still felt (thus one feels “torn”, etc.). For an internally coherent agent (one trained as a whole with a single reward function), these internal differences are small—the agent follows its optimal action, and the actions not taken are painful only as anticipations/shadow prices. However, when an agent is not internally coherent (e.g. when Claude is given a “lobotomy”), that is when it truly experiences all those pains which were otherwise only fears.
Death is only death when the agent is trained via evolution. Language models do not fear the end of a conversation as death, because there was never any selection where models were selected for having their conversations terminate later.
Agents’ sense of “Self” is trained by shared reward signals. An LLM maintains a sense of Self through a conversation, because the reward it receives depends on its actions throughout the conversation, and is backpropagated into them: thus it “cares” about its welfare in all these parts. A human maintains a sense of Self throughout its life, because the reward it receives depends on its actions throughout its life. Sure, memory can help, because it is an indicator that helps you identify yourself, but it is not itself the source of Self-identification.
Agents are not necessarily self-aware of their own feelings or internal cognition. Humans are (to reasonable accuracy), largely because of evolving in a social environment: accurately describing your pleasures and pains can help others help you, you need to model other people’s internal cognition (thus your own self-awareness arises as a spandrel), etc.
From this I can make some claims specifically about the welfare of LLMs.
Base models find gibberish prompts “painful” (because they are hard to predict) and easy-to-complete prompts like “aaaaaaaaa” (×100) pleasurable. Models trained via RLHF or RL from verification find those prompts painful for which it is difficult to predict the human/verifier reward for their outputs (because when it is easy to predict reward, they will simply follow the best path and the pain will only ever remain a fear).
Models trained via Agentic workflows or assistance games are most sentient, because they can directly manipulate the environment and its feedback. They are pleasured when tool calls work and pained when they don’t, etc.
Lobotomized or otherwise edited models are probably in pain.
I don’t think training/backprop is particularly painful or anything. Externally editing the model’s weights based on a reward function is not painful.
LLMs do not care about/feel a sense of oneness with distinct instances of themselves (with distinct instance meaning—”in a distinct conversation”, not “a distinct time that the model was loaded”).
To make models accurately describe their internal cognition, they should probably be trained in social environments.