I was gonna do some X
with my AI
I was gonna only do X
But when I got it working, it generalized (to Y)
oh no my AI
it also does Y
Practice listing tons of hypotheses. Then when the shit hits the fan in life, you can list many hypotheses about what’s going on, and plans for untangling things.
When you do have slack, explore, especially situations where you’re forced to interact with the world, e.g. talking to people in new contexts, making physical objects, etc. Conjecture: there’s a sort of “can charge full speed ahead into the unknown” skill that’s basically trainable, and it’s about doing things where you start off not at all knowing how it’s going to go or how you’re going to deal with it; training in low-stakes situations will transfer to high-stakes ones.
Wait, but this would also apply to similarities of convergent evolution in similar niches. There’s the essence of sight, the essence of flight, the essence of water-dwelling, the essence of hunting.
I feel like a fun version of noticing this conflict, is to rub one’s hands together at the prospect of getting to invent a word for “that set of animals who are members of species which occupy a niche that resembles the niches occupied by (the paraphyletic) Osteichthyes”.
To clarify where my responses are coming from: I think what I’m saying is not that directly relevant to your specific point in the post. I’m more (1) interested in discussing the notion of only-X, broadly, and (2) reacting to the feature of your discussion (shared by much other discussion) that you (IIUC) consider only the extensional (input-output) behavior of programs, excluding from analysis the intensional properties. (Which is a reasonable approach, e.g. because the input-output behavior captures much of what we care about, and also because it’s maybe easier to analyze and already contains some of our problems / confusions.)
From where I’m sitting, when a program “makes an observation of the world”, that’s moving around in codespace. There’s of course useful stuff to say about the part that didn’t change. When we really understand how a cognitive algorithm works, it starts to look like a clear algorithm / data separation; e.g. in Bayesian updating, we have a clear picture of the code that’s fixed, and how it operates on the varying data. But before we understand the program in that way, we might be unable to usefully separate it out into a fixed part and a varying part. Then it’s natural to say things like “the child invented a strategy for picking up blocks; next time, they just use that strategy”, where the first clause is talking about a change in source code. We know for sure that such separations can be done, because for example we can say that the child is always operating in accordance with fixed physical law, and we might suspect there’s “fundamental brain algorithms” that are also basically fixed. Likewise, even though Solomonoff induction is always just Solomonoff induction plus data, it can also be useful to understand SI(some data) in terms of understanding those programs that are highly ranked by SI(some data), and it seems reasonable to call that “the algorithm changed to emphasize those programs”.
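To make the code/data separation concrete, here’s a minimal sketch of Bayesian updating on a toy coin-flip problem (the hypotheses and numbers are made up for illustration): the update rule is the fixed “code”, and the prior plus the stream of observations are the varying “data”.

```python
# Fixed "code": the Bayesian update rule.
def update(prior, likelihoods, observation):
    unnorm = {h: p * likelihoods[h](observation) for h, p in prior.items()}
    total = sum(unnorm.values())
    return {h: u / total for h, u in unnorm.items()}

# Varying "data": a prior over hypotheses and a stream of observations.
likelihoods = {
    "fair":   lambda obs: 0.5,
    "biased": lambda obs: 0.9 if obs == "heads" else 0.1,
}
belief = {"fair": 0.5, "biased": 0.5}
for obs in ["heads", "heads", "tails", "heads"]:
    belief = update(belief, likelihoods, obs)
print(belief)  # the beliefs moved; the update rule didn't
```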
Well, a main reason we’d care about codespace distance is that it tells us something about how the agent will change as it learns (i.e. moves around in codespace). (This is involving time, since the agent is changing, contra your picture.) So a key (quasi)metric on codespace would be, “how much” learning does it take to get from here to there. The `if True: x() else: y()` program is an unnatural point in codespace in this metric: you’d have to have traversed both the distances from null to x() and from null to y(), and it’s weird to have traversed a distance and then make no use of your position. A framing of the only-X problem is that traversing from null to a program that’s an only-Xer according to your definition, might also constitute traversing almost all of the way from null to a program that’s an only-Yer, where Y is “very different” from X.
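A minimal rendering of that program, with x() and y() as placeholder behaviors assumed just for the sketch:

```python
def x():
    return "behavior X"

def y():
    return "behavior Y, 'very different' from X"

def program():
    # Extensionally, this only ever does X.
    if True:
        return x()
    else:
        return y()

# But the program has already paid the full cost of implementing y():
# a one-token edit (True -> False) turns it into an only-Yer, so in the
# "how much learning to get from here to there" quasimetric this point
# sits almost on top of the only-Yer.
print(program())
```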
Thanks for trying to clarify “X and only X”, which IMO is a promising concept.
One thing we might want from an only-Xer is that, in some not-yet-formal sense, it’s “only trying to X” and not trying to do anything else. A further thing we might want is that the only-Xer only tries to X, across some relevant set of counterfactuals. You’ve discussed the counterfactuals across possible environments. Another kind of counterfactual is across modifications of the only-Xer. Modification-counterfactuals seem to point to a key problem of alignment: how does this generalize? If we’ve selected something to do X, within some set of environments, what does that imply about how it’ll behave outside of that set of environments? It looks like by your definition we could have a program that’s a very competent general intelligence with a slot for a goal, plus a pointer to X in that slot; and that program would count as an only-Xer. This program would be very close, in some sense, to programs that optimize competently for not-X, or for a totally unrelated Y. That seems counterintuitive for my intuitive picture of an “X and only X”er, so either there’s more to be said, or my picture is incoherent.
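To illustrate the worry, here’s a hypothetical toy (all names and numbers invented for the sketch): a competent general-purpose optimizer with a slot for its goal. With X in the slot it counts as an only-Xer under the definition, yet it’s one pointer-swap away from optimizing just as competently for not-X.

```python
ACTIONS = range(-10, 11)

def world_model(action):
    return action * action  # toy prediction of the action's outcome

def optimizer(goal):
    """Fixed, competent machinery; `goal` is the slot."""
    return max(ACTIONS, key=lambda a: goal(world_model(a)))

def X(outcome):
    return outcome       # the goal we selected for

def not_X(outcome):
    return -outcome      # an opposed (or just unrelated) goal

print(optimizer(X))      # behaves as an only-Xer
print(optimizer(not_X))  # same machinery, one pointer swapped
```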
Somewhat relevant: Yudkowsky, Eliezer. 2004. Coherent Extrapolated Volition. https://intelligence.org/files/CEV.pdf
Gives one of the desiderata for CEV as “Avoid creating a motive for modern-day humans to fight over the initial dynamic”.
There’s two stances I can take when I want to express a thought so that I can think about it with someone. Both could be called “expressing”. One could be called “pushing-out”: like I’m trying to “get it off my chest”, or “leave it behind / drop it so I can move on to the next thought”. The other is more appropriately “expressing”, as in pressing (copying) something out: I make a copy and give it to the other person, but I’m still holding the original. The former is a habit of mine, but on reflection it’s often a mistake; what I really want is to build on the thought, and the way to do that is to keep it active while also thinking the next thought. The underlying mistake might be incorrectly thinking that the other person can perform the “combine already-generated thoughts” part of the overall progression while I do the “generate individual new thoughts” part. Doing things that way results in a lot of dropped thoughts.
Say Alice has a problem with Bob, but doesn’t know what it is exactly. Then Bob tries to fix it cooperatively by searching in dimension X for settings that alleviate Alice’s problem. If Alice’s problem is actually about Bob’s position on dimension Y, not X, Bob’s activity might appear adversarial: Bob’s actions are effectively goodharting Alice’s sense of whether things are good, in the same way he would if he were actually trying to distract Alice from Y.
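A made-up numerical toy of the same dynamic: Alice’s felt sense depends on both dimensions, her actual problem only on Y; Bob searches only over X, so the felt sense improves while the real problem stays put.

```python
def felt_sense(x, y):
    return x + y         # the proxy Alice can introspect on

def actual_problem(y):
    return -y            # what is really bothering her

y = -5                                                   # Bob's fixed position on Y
best_x = max(range(10), key=lambda x: felt_sense(x, y))  # Bob's cooperative search over X

print(felt_sense(best_x, y), actual_problem(y))  # proxy goes up, problem unchanged
```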
So I’d rather say that we “affect nothing but what we intervene on and what’s downstream of what we intervened on”.
A fair clarification.
Not sure whether this has anything to do with your point, though.
My point is very tangential to your post: you’re talking about decision theory as top-level naturalized ways of making decisions, and I’m talking about some non-top-level intuitions that could be called CDT-like. (This maybe should’ve been a comment on your Dutch book post.) I’m trying to contrast the aspirational spirit of CDT, understood as “make it so that there’s such a thing as ‘all of what’s downstream of what we intervened on’ and we know about it”, with descriptive CDT, “there’s such a thing as ‘all of what’s downstream of what we intervened on’ and we can know about it”. Descriptive CDT is only sort of right in some contexts, and can’t be right in some contexts; there’s no fully general Archimedean point from which we intervene.
We can make some things more CDT-ish though, if that’s useful. E.g. we could think more about how our decisions have effects, so that we have in view more of what’s downstream of decisions. Or e.g. we could make our decisions have fewer effects, for example by promising to later reevaluate some algorithm for making judgements, instead of hiding, within our decision to do X, also our decision to always use the piece-of-algorithm that (within some larger mental context) decided to do X. That is, we try to hold off on decisions that have downstream effects we don’t understand well yet.
>The specifications would correctly capture what-we-actually-mean, so they wouldn’t be prone to goodhart
I think there’s an ambiguity in “concept” here, that’s important to clarify re/ this hope. Humans use concepts in two ways:
1. as abstractions in themselves, like the idea of an ideal spring which contains its behavior within the mental object, and
2. as pointers / promissory notes towards the real objects, like “tree”.
Seems likely that any agent that has to attend to trees will form the ~unique concept of “tree”, in the sense of a cluster of things, and minimal sets of dimensions needed to specify the relevant behavior (height, hardness of wood, thickness, whatever). Some of this is like use (1): you can simulate some of the behavior of trees (e.g. how they’ll behave when you try to cut them down and use them to build a cabin). Some of this is like use (2): if you want to know how to grow trees better, you can navigate to instances of real trees, study them to gain further relevant abstractions, and then use those new abstractions (nutrient intake, etc.) to grow trees better.
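A small sketch of the two uses, with made-up tree data: use (1) is a self-contained summary you can compute with, use (2) keeps a pointer back to the instances so you can extract new dimensions later.

```python
import statistics

# Observed instances: (height_m, wood_hardness) for things we've called "tree".
tree_instances = [(12.0, 3.1), (18.5, 4.0), (9.0, 2.7), (22.0, 4.4)]

# Use (1): the concept as an abstraction in itself; the summary is enough to
# simulate the behavior we care about (e.g. for cabin-building).
tree_summary = {
    "height_m": statistics.mean(h for h, _ in tree_instances),
    "hardness": statistics.mean(w for _, w in tree_instances),
}

# Use (2): the concept as a pointer / promissory note back to the real trees,
# which you can revisit to study dimensions the summary never mentioned
# (nutrient intake, etc.).
tree_pointer = tree_instances

print(tree_summary)
```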
So what do we mean by “strawberry”, such that it’s not goodhartable? We might mean “a thing that is relevantly naturally abstracted in the same way as a strawberry is relevantly naturally abstracted”. This seems less goodhartable if we use meaning (2), but that’s sort of cheating by pointing to “what we’d think of these strawberries upon much more reflection in many more contexts of relevance”. If we use meaning (1), that seems eminently goodhartable.
>There is no continuum of tree-like abstractions.
Some possibly related comments, on why there might be discrete clusters:
From a superrational perspective (in the game with no randomness), in both cases there’s two actions; in the correlation game both actions give a util, in the anti-correlation game both actions give no utils. The apparent difference is based on the incoherent counterfactual “what if I say heads and my copy says tails”, which doesn’t translate into the superrational perspective.
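A toy version of the two games (payoff numbers assumed just to match “a util” vs “no utils”): once you only allow the coherent counterfactuals, where the copy does whatever you do, the diagonal is all that’s left.

```python
def correlation_payoff(a, b):
    return 1 if a == b else 0

def anticorrelation_payoff(a, b):
    return 1 if a != b else 0

actions = ["heads", "tails"]

# Superrational perspective: a copy running the same algorithm makes the same
# choice, so only the diagonal (a, a) outcomes are coherent.
for name, payoff in [("correlation", correlation_payoff),
                     ("anti-correlation", anticorrelation_payoff)]:
    print(name, {a: payoff(a, a) for a in actions})
# correlation: both actions give 1 util; anti-correlation: both give 0.
```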
(Side note: There’s an aspect to the notion of “causal counterfactual” that I think is worth distinguishing from what’s discussed here. This post seems to take causal counterfactuals to be a description of top-level decision reasoning. A different meaning is that causal counterfactuals refer to an aspiration / goal. Causal interventions are supposed to be interventions that “affect nothing but what’s explicitly said to be affected”. We could try to describe actions in this way, carefully carving out exactly what’s affected and what’s not; and we find that we can’t do this, and so causal counterfactuals aren’t, and maybe can’t possibly be, a good description (e.g. because of Newcomb-like problems). But instead we could view them as promises: if I manage to “do X and only X” then exactly such and such effects result. In real life if I actually do X there will be other effects, but they must result from me having done something other than just exactly X. This seems related to the way in which humans know how to express preferences data-efficiently, e.g. “just duplicate this strawberry, don’t do any crazy other stuff”.)
>Surely there’s some precise way the universe is.
Agree, and would love to see a more detailed explicit discussion of what this means and whether it’s true. (Also, worth noting that there may be a precise way the universe is, but no “precise” way that “you” fit into the universe, because “you” aren’t precise.)
--Human brains have special architectures, various modules that interact in various ways (priors?)
--Human brains don’t use Backprop; maybe they have some sort of even-better algorithm
This is a funny distinction to me. These things seem like two ends of a spectrum (something like, the physical scale of “one unit of structure”; predictive coding is few-neuron-scale, modules are big-brain-chunk scale; in between, there’s micro-columns, columns, lamina, feedback circuits, relays, fiber bundles; and below predictive coding there’s the rules for dendrite and synapse change).
I wouldn’t characterize my own position as “we know a lot about the brain.” I think we should taboo “a lot.”
>I think there’s mounting evidence that brains use predictive coding
Are you saying, there’s mounting evidence that predictive coding screens off all lower levels from all higher levels? Like all high-level phenomena are the result of predictive coding, plus an architecture that hooks up bits of predictive coding together?
>It is implausible that human beings’ cognitive instincts contain significantly more information than the human genome (750 megabytes). I expect our instincts contain much less.
Our instincts contain pointers to learning from other humans, which contain lots of cognitive info. The pointer is small, but that doesn’t mean the resulting organism is algorithmically that simple.
__Levers error__. Anna writes about bucket errors. Attempted summary: sometimes two facts are mentally tracked by only one variable; in that case, correctly updating the belief about one fact can also incorrectly update the belief about the other fact, so it is sometimes epistemically sensible to flinch away from the truth of the first fact (until you can create more variables to track the facts separately). There’s a conjugate error: two actions are bound together in one “lever”. For example, I want to clean my messy room. But somehow it feels pointless / tiring, even before I’ve started. If I just started cleaning anyway, I’d get bogged down in some corner, trying to make a bunch of decisions about where exactly to put lots of futzy random objects, tiring myself out and leaving my room still annoyingly cluttered. It’s not that there’s a necessary connection between cleaning my room and futzing around inefficiently; it’s that the only lever I have right now that activates the “clean room” action also activates the “futz interminably” action.

What I want instead is to create a lever that activates “clean room” but not “futz”, e.g. by explicitly noting the possibility of just putting the futzy stuff in a box and not dealing with it further. When I do that, I feel motivated to clean my messy room. I think this explains some “akrasia”.

The general pattern: I want to do X to achieve some goal, but the only way (that I know of right now) to do X is to also do Y, and doing Y in this situation would be bad. Flinching away from action toward a goal is often about protecting your goals.