Ahhh that makes sense, thanks.
abramdemski(Abram Demski)
Agent Boundaries Aren’t Markov Blankets. [no longer endorsed]
This topic came up when working on a project where I try to make a set of minimal assumptions such that I know how to construct an aligned system under these assumptions. After knowing how to construct an aligned system under this set of assumptions, I then attempt to remove an assumption and adjust the system such that it is still aligned. I am trying to remove the cartesian assumption right now.
I would encourage you to consider looking at Reflective Oracles next, to describe a computationally unbounded agent which is capable of thinking about worlds which are as computationally unbounded as itself; and a next logical step after that would be to look at logical induction or infrabayesianism, to think about agents which are smaller than what they reason about.
You can compute everything that takes finite compute and memory instantly. (This implies some sense of cartesianess, as I am sort of imagining the system running faster than the world, as it can just do an entire tree search in one “clock tick” of the environment.)
This part makes me quite skeptical that the described result would constitute embedded agency at all. It’s possible that you are describing a direction which would yield some kind of intellectual progress if pursued in the right way, but you are not describing a set of constraints such that I’d say a thing in this direction would definitely be progress.
My intuition is that this would still need to solve the problem of giving an agent a correct representation of itself, in the sense that it can “plan over itself” arbitrarily. This can be thought of as enabling the agent to reason over the entire environment which includes itself. Is that part a solved problem?
This part seems inconsistent with the previous quoted paragraph; if the agent is able to reason about the world only because it can run faster than the world, then it sounds like it’ll have trouble reasoning about itself.
Reflective Oracles solve the problem of describing an agent with infinite computational resources which can do planning involving itself and other similar agents, including uncertainty (via reflective-oracle solomonoff induction), which sounds superior to the sort of direction you propose. However, they do not run “faster than the world”, as they can reason about worlds which include things like themselves.
Amusingly, searching for articles on whether offering unlicensed investment advice is illegal (and whether disclaiming it as “not investment advice” matters) brings me to pages offering “not legal advice” ;p
Also, to be clear, nothing in this post constitutes investment advice or legal advice.
&
(Also I know enough to say up front that nothing I say here is Investment Advice, or other advice of any kind!)
&
None of what I say is financial advice, including anything that sounds like financial advice.
I usually interpret this sort of statement as an invocation to the gods of law, something along the lines of “please don’t smite me”, and certainly not intended literally. Indeed, it seems incongruous to interpret it literally here: the whole point of the discussion, as I’m understanding it, is to provide potentially useful ideas about investing strategies. Am I supposed to pretend that it’s just, like, an interesting thought experiment? Or is there some other interpretation of your disclaimer I’m not seeing?
I’m looking at the Savage theory from your own https://plato.stanford.edu/entries/decision-theory/ and I see U(f)=∑u(f(si))P(si), so at least they have no problem with the domains (O and S) being different. Now I see the confusion is that to you Omega=S (and also O=S), but to me Omega=dom(u)=O.
(Just to be clear, I did not write that article.)
I think the interpretation of Savage is pretty subtle. The objects of preference (“outcomes”) and objects of belief (“states”) are treated as distinct sets. But how are we supposed to think about this?
The interpretation Savage seems to imply is that both outcomes and states are “part of the world”, but the agent has somehow segregated parts of the world into matters of belief and matters of preference. But however the agent has done this, it seems to be fundamentally beyond the Savage representation; clearly within Savage, the agent cannot represent meta-beliefs about which matters are matters of belief and which are matters of preference. So this seems pretty weird.
We could instead think of the objects of preference as something like “happiness levels” rather than events in the world. The idea of the representation theorem then becomes that we can peg “happiness levels” to real numbers. In this case, the picture looks more like standard utility functions; S is the domain of the function that gives us our happiness level (which can be represented by a real-valued utility).
Another approach which seems somewhat common is to take the Savage representation but require that S=O. Savage’s “acts” then become maps from world to world, which fits well with other theories of counterfactuals and causal interventions.
So even within a Savage framework, it’s not entirely clear that we would want the domain of the utility function to be different from the domain of the belief function.
I should also have mentioned the super-common VNM picture, where utility has to be a function of arbitrary states as well.
That’s just math speak, you can define a lot of things as a lot of other things, but that doesn’t mean that the agent is going to be literally iterating over infinite sets of infinite bit strings and evaluating something on each of them.
The question is, what math-speak is the best representation of the things we actually care about?
It remains totally unclear to me why you demand the world to be such a thing.
Ah, if you don’t see ‘worlds’ as meaning any such thing, then I wonder, are we really arguing about anything at all?
I’m using ‘worlds’ that way in reference to the same general setup which we see in propositions-vs-models in model theory, or in vs the -algebra in the Kolmogorov axioms, or in Kripke frames, and perhaps some other places.
We can either start with a basic set of “worlds” (eg, ) and define our “propositions” or “events” as sets of worlds, where that proposition/event ‘holds’ or ‘is true’ or ‘occurs’; or, equivalently, we could start with an algebra of propositions/events (like a -algebra) and derive worlds as maximally specific choices of which propositions are true and false (or which events hold/occur).
My point is that if U has two output values, then it only needs two possible inputs. Maybe you’re saying that if |dom(U)|=2, then there is no point in having |dom(P)|>2, and maybe you’re right, but I feel no need to make such claims.
Maybe I should just let you tell me what framework you are even using in the first place. There are two main alternatives to the Jeffrey-Bolker framework which I have in mind: the Savage axioms, and also the thing commonly seen in statistics textbooks where you have a probability distribution which obeys the Kolmogorov axioms and then you have random variables over that (random variables being defined as functions of type ). A utility function is then treated as a random variable.
It doesn’t sound like your notion of utility function is any of those things, so I just don’t know what kind of framework you have in mind.
My point is only that U is also reasonable, and possibly equivalent or more general. That there is no “case against” it.
I do agree that my post didn’t do a very good job of delivering a case against utility functions, and actually only argues that there exists a plausibly-more-useful alternative to a specific view which includes utility functions as one of several elements.
Utility functions definitely aren’t more general.
A classical probability distribution over with a utility function understood as a random variable can easily be converted to the Jeffrey-Bolker framework, by taking the JB algebra as the sigma-algebra, and V as the expected value of U. Technically the sigma-algebra needs to be atomless to fit JB exactly, but Zoltan Domotor (Axiomatization of Jeffrey Utilities) generalizes this considerably.
I’ve heard people say that there is a way to convert in the other direction, but that it requires ultrafilters (so in some sense it’s very non-constructive). I haven’t been able to find this construction yet or had anyone explain how it works.
So it seems to me, but I recognize that I haven’t shown in detail, that the space of computable values is strictly broader in the JB framework; computable utility functions + computable probability gives us computable JB-values, but computable JB-values need not correspond to computable utility functions.
Thus, the space of minds which can be described by the two frameworks might be equivalent, but the space of minds which can be described by computations does not seem to be; the JB space, there, is larger.
I don’t see why any “good” utility function should be uncomputable.
Well, the Jeffrey-Bolker kind of explanation is as follows: agents really only need to consider and manipulate the probabilities and expected values of events (ie, propositions in the agent’s internal language). So it makes some sense to assume that these probabilities and expected values are computable. But this does not imply (as far as I know) that we can construct ‘worlds’ as maximal specifications of which propositions are true/false and then define a utility function on those worlds which is consistent with the computable expected values and have that utility function itself be computable. And indeed it seems rather plausible to me that this is not the case, even for values which otherwise seem relatively unremarkable, as illustrated by examples like the procrastination paradox.
I think there is a good reason to imagine that the agent structures its ontology around its perceptions. The agent cannot observe whether-the-button-is-ever-pressed; it can only observe, on a given day, whether the button has been pressed on that day. |Omega|=2 is too small to even represent such perceptions.
I agree with the first sentence, however Omega is merely the domain of U, it does not need to be the entire ontology. In this case Omega={”button has been pressed”, “button has not been pressed”} and P(“button has been pressed” | “I’m pressing the button”)~1. Obviously, there is also no problem with extending Omega with the perceptions, all the way up to |Omega|=4, or with adding some clocks.
I’m not sure why you say Omega can be the domain of U but not the entire ontology. This seems to mean that we don’t know how to take expected values for arbitrary events. Also it means you are no longer advocating for the model I’m arguing against, where U is a random variable.
We could expand the scenario so that every “day” is represented by an n-bit string.
If you want to force the agent to remember the entire history of the world, then you’ll run out of storage space before you need to worry about computability. A real agent would have to start forgetting days, or keep some compressed summary of that history. It seems to me that Jeffrey would “update” the daily utilities into total expected utility; in that case, U can do something similar.
I agree that we can put even more stringent (and realistic) requirements on the computational power of the agent, and then both JB and random-variable treatments become implausible, in so far as those treatments involve infinitely large representations.
I still think that the Jeffreyesque representational choice of using compact event-propositions, rather than fully-specified worlds, seems more plausible with respect to such bounded agents.
You defined U at the very beginning, so there is no need to send these new facts to U, it doesn’t care. Instead, you are describing a problem with P, and it’s a hard problem, but Jeffrey also uses P, so that doesn’t solve it.
As per my earlier comment on “Omega is merely the domain of U”, I think here you’re abandoning elements of the random-variable approach to U, and in fact reasoning in a more JB-esque way.
> … set our model to be a list of “events” we’ve observed …
I didn’t understand this part.If you “evaluate events”, then events have some sort of bit representation in the agent, right? I don’t clearly see the events in your “Updates Are Computable” example, so I can’t say much and I may be confused, but I have a strong feeling that you could define U as a function on those bits, and get the same agent.
Yeah, it seems like we’re talking past each other here and would need to do more work to unpack what’s going on. All I can think to say right now is this: the usual random-variable approach to defining U requires that probabilities respect countable additivity, because the event of “the button being pressed” is just the set of individual worlds where that happens (where the button gets pressed on a particular day). This is the root of the computational difficulty in the standard approach. JB doesn’t require countable additivity, since it isn’t a rule which agents can enforce on their beliefs by touching only finitely many of them. This harkens back to something you said earlier:
Instead, you are describing a problem with P, and it’s a hard problem, but Jeffrey also uses P, so that doesn’t solve it.
Which I agree with in this case, except that JB does “solve” it by explicitly relaxing that constraint.
Again, this is a way in which JB is more general, not less; JB could follow that constraint, if you like.
In my personal practice, there seems to be a real difference—“something magic happens”—when you’ve got an actual audience you actually want to explain something to. I would recommend this over trying to simulate the experience within personal notes, if you can get it. The audience doesn’t need to be ‘the public internet’—although each individual audience will have a different sort of impact on your writing, so EG writing to a friend who already understands you fairly well may not cause you to clarify your ideas in the same way as writing to strangers.
I would also mildly caution against a policy which makes your own personal notes too effortful to write. I wholeheartedly agree that you should keep your future self in mind as an audience, and write such that the notes will be useful if you look back at them. But if I imagine writing my own personal notes to the same standard as public-facing essays, I think I lose something—it takes too long to capture ideas that way.
I agree that it makes more sense to suppose “worlds” are something closer to how the agent imagines worlds, rather than quarks. But on this view, I think it makes a lot of sense to argue that there are no maximally specific worlds—I can always “extend” a world with an extra, new fact which I had not previously included. IE, agents never “finish” imagining worlds; more detail can always be added (even if only in separate magisteria, eg, imagining adding epiphenomenal facts). I can always conceive of the possibility of a new predicate beyond all the predicates which a specific world-model discusses.
If you buy this, then I think the Jeffrey-Bolker setup is a reasonable formalization.
If you don’t buy this, my next question would be whether you really think that the sort of “world” (“world model”, as you called it) which an agent attaches value to always are “closed off” (ie sperify all the facts one way or the other; do not admit further detail) -- or, perhaps, you merely want to argue that this can sometimes be the case but not always. (Because if it’s sometimes the case but not always, this argues against both the traditional view where Omega is the set which the probability is a measure over & the utility function is a function of, and against the Jeffrey-Bolker picture.)
I find it implausible that the sort of “world model” which we can model humans as having-values-as-a-function-of is “closed off”—we can appreciate ideas like atoms and quarks, adding these to our ontology, without necessarily changing other aspects of our world-model. Perhaps sometimes we can “close things off” like this—we can consider the possibility that there “is nothing else”—but even so, I think this is better-modeled as an additional assertion which we add to the set of propositions defining a possibility rather than modeling us as having bottomed out in an underlying set of “world” which inherently decide all propositions.
In “procrastination” example you intentionally picked a bad model, so it proves nothing (if the world only has one button we care about, then maybe |Omega|=2 and everything is perfectly computable).
You seem to be suggesting that any such example could be similarly re-written to make things nicely computable. I find this implausible. We could expand the scenario so that every “day” is represented by an n-bit string. The computable function b() looks at a “day” and tells us whether the button was pressed or not on that day. As before, we get −10 utility if the button is never pressed. But we also have some (computable) reward, r(), which is a function of a “day” and tells us how good or bad that day was. The discounted reward is such that these priorities are never more important than whether or not the button is pressed; but so long as the button is eventually pressed, we prefer to get more reward rather than less. How would you change the representation now?
More generally, do you believe that any plausible utility function on bit-strings can be re-represented as a computable function (perhaps on some other representation, rather than bit-strings)? Why would you particularly expect this to be the case?
I think in arguing that I intentionally picked a bad model, you mean that the world-model representation which I chose was totally ad-hoc and chosen specifically to make things difficult to compute, and without having the goal in mind of making things difficult to compute, someone else would have chosen something simpler like |Omega|=2. But I think there is a good reason to imagine that the agent structures its ontology around its perceptions. The agent cannot observe whether-the-button-is-ever-pressed; it can only observe, on a given day, whether the button has been pressed on that day. |Omega|=2 is too small to even represent such perceptions.
Further on, it seems to me that if we set our model to be a list of “events” we’ve observed, then we get the exact thing you’re talking about. Although you’re imprecise and inconsistent about what an event is, how it’s represented, how many there are, so I’m not sure if that’s supposed to make anything more tractable.I didn’t understand this part.
In general, asking questions about the domain of U (and P!) is a good idea, and something that all introductions to Utility lack. But the ease with which you abandon a perfectly good formalism is concerning. LI is cool, and it doesn’t use U, but that’s not an argument against U, at best you can say that U was not as useful as you’d hoped.
Jeffrey-Bolker is fairly commonly advocated amongst decision theorists in philosophy (from both sides of the CDT-EDT debate!), although as far as I’m aware it hasn’t made its way into stats textbooks at any level. It can be seen as part of a broader movement in mathematics, away from set-theoretic representations and toward more algebraic representations. A related example is pointless topology—instead of understanding a topology as a structure imposed on a set of points, the structure of “opens” (no longer “open sets”) is examined in its own right. In the same way that discarding “worlds” moves the formalism closer to concepts which the agent can actually realistically manipulate, discarding “points” from topology moves the math closer to the pieces which mathematicians are actually interested in manipulating.
My own take is that the domain of U is the type of P. That is, U is evaluated on possible functions P. P certainly represents everything the agent cares about in the world, and it’s also already small and efficient enough to be stored and updated in the agent, so this solution creates no new problems.
This is an interesting alternative, which I have never seen spelled out in axiomatic foundations.
I also wrote a huge amount in private idea-journals before I started writing publicly. There was also an intermediate stage where I wrote a lot on mailing lists, which felt less public than blogging although technically public.
Even if I conceded this point, which is not obvious to me, I would still insist on the point that different speakers will be using natural language differently and so resorting to natural language rather than formal language is not universally a good move when it comes to clarifying disagreements.
Well, more importantly, I want to argue that “translation” is happening even if both people are apparently using English.
For example, philosophers have settled on distinct but related meanings for the terms “probability”, “credence”, “chance”, “frequency”, “belief”. (Some of these meanings are more vague/general while others are more precise; but more importantly, these different terms have many different detailed implications.) If two people are unfamiliar with all of those subtleties and they start using one of the words (say, “probability”), then it is very possible that they have two different ideas about which more-precise notion is being invoked.
When doing original research, people are often in this situation, because the several more-precise notions have not even been invented yet (so it’s not possible to go look up how philosophers have clarified the possible concepts).
In my experience, this means that two people using natural language to try and discuss a topic are very often in a situation where it feels like we’re “translating” back and forth between our two different ontologies, even though we’re both expressing those ideas in English.
So, even if both people express their ideas in English, I think the “non-invertible translation problem” discussed in the original post can still arise.
I disagree. For tricky technical topics, two different people will be speaking sufficiently different versions of English that this isn’t true. Vagueness and other such topics will not apply equally to both speakers; one person might have a precise understanding of decision-theoretic terms like “action” and “observation” while the other person may regard them as more vague, or may have different decision-theoretic understanding of those terms. Simple example, one person may regard Jeffrey-Bolker as the default framework for understanding agents, while the other may prefer Savage; these two frameworks ontologize actions in very different ways, which may be incidental to the debate or may be central. Speaking in English just obscures this underlying difference in how we think about things, rather than solving the problem.
I’m not sure where the underlying disagreement in the decision theory case was (something about actions vs mixed strategies) but I assume there again the underlying problem can be expressed in natural language statements which both parties can understand without the need of translating them.
In the case of mixed vs pure strategies, I think it is quite clear that translating to technical terminology rather than English helped clarify rather than obscure, even if it created the non-one-to-one translation problem this post is discussing.
“Weak methods” means confidence is achieved more empirically, so there’s always a question of how well the results will generalize for some new AI system (as we scale existing technology up or change details of NN architectures, gradient methods, etc). “Strong methods” means there’s a strong argument (most centrally, a proof) based on a detailed gears-level understanding of what’s happening, so there is much less doubt about what systems the method will successfully apply to.
The question seems too huge for me to properly try to answer. Instead, I want to note that academics have been making some progress on models which are trying to do something similar to, but perhaps subtly different from, Paul’s reflective probability distribution you cite.
The basic idea is not new to me—I can’t recall where, but I think I’ve probably seen a talk observing that linear combinations of neurons, rather than individual neurons, are what you’d expect to be meaningful (under some assumptions) because that’s how the next layer of neurons looks at a layer—since linear combinations are what’s important to the network, it would be weird if it turned out individual neurons were particularly meaningful. This wasn’t even surprising to me at the time I first learned about it.
But it’s great to see it illustrated so well!
In my view, this provides relatively little insights to the hard questions of what it even means to understand what is going on inside a network (so, for example, it doesn’t provide any obvious progress on the hard version of ELK). So how useful this ultimately turns out to be for aligning superintelligence depends on how useful “weak methods” in general are. (IE methods with empirical validation but which don’t come with strong theoretical arguments that they will work in general.)
That being said, I am quite glad that such good progress is being made, even if it’s what I would classify as “weak methods”.
Yeah. For my case, I think it should be assumed that the meta-logics are as different as the object-logics, so that things continue to be confusing.
As I mentioned here, if Alice understands your point about the power of the double-negation formulation, she would be applying a different translation of Bob’s statements from the one I assumed in the post, so she would be escaping the problem. IE:
part of the beauty in the double-negation translation is that all of classical logic is valid under it.
is basically a reminder to Alice that the translation back from double-negation form is trivial in her own view (since it is classically equivalent), and all of Bob’s intuitionistic moves are also classically valid, so she only has to squint at Bob’s weird axioms, everything else should be understantable.
But I think things will still get weird if Bob makes meta-logical arguments (ie Bob’s arguments are not just proofs in Bob’s logic), which seems quite probable.
I don’t think the following is all that relevant to the point you are making in this post, but someone cited this post of yours in relation to the question of whether LLMs are “intelligent” (summarizing the post as “Nate says LLMs aren’t intelligent”) and then argued against the post as goalpost-moving, so I wanted to discuss that.
It may come as a shock to some, that Abram Demski adamantly defends the following position: GPT4 is AGI. I would be goalpost-moving if I said otherwise. I think the AGI community is goalpost-moving to the extent that it says otherwise.
I think there is some tendency in the AI Risk community to equate “AGI” with “the sort of AI which kills all the humans unless it is aligned”. But “AGI” stands for “artificial general intelligence”, not “kills all the humans”. I think it makes more sense for the definition of AGI to be up to the community of AI researchers who use the term AGI to distance their work from narrow AI, rather than for it to be up to the AI risk community. And GPT4 is definitely not narrow AI.
I’ll argue an even stronger claim: if you come up with a task which can be described and completed entirely in text format (and then evaluated somehow for performance quality), for most such tasks the performance of GPT4 is at or above the performance of a random human. (We can even be nice and only randomly sample humans who speak whichever languages are appropriate to the task; I’ll still stand by the claim.) Yes, GPT4 has some weaknesses compared to a random human. But most claims of weaknesses I’ve heard are in fact contrasting GPT4 to expert humans, not random humans. So my stronger claim is: GPT4 is human-level AGI, maybe not by all possible definitions of the term, but by a very reasonable-seeming definition which 2014 Abram Demski might have been perfectly happy with. To deny this would be goalpost-moving for me; and, I expect, for many.
So (and I don’t think this is what you were saying) if GPT4 were being ruled out of “human-level AGI” because it cannot write a coherent set of novels on its own, or do a big engineering project, well, I call shenanigans. Most humans can’t do that either.