I somewhat expect your response will be “why would anyone be applying coherence arguments in such a ridiculously abstract way rather than studying a concrete system”, to which I would say that you are not in the intended audience.
Ok, this is a fair answer. I think you and I, at least, are basically aligned here.
I do think a lot of people took away from your post something like “all behavior can be rationalized as EU maximization”, and in particular I think a lot of people walked away with the impression that usefully applying coherence arguments to systems in our particular universe is much more rare/difficult than it actually is. But I can’t fault you much for some of your readers not paying sufficiently close attention, especially when my review at the top of this thread is largely me complaining about how people missed nuances in this post.
I assume this is my comment + post
I was referring mainly to Richard’s post here. You do seem to understand the issue of assuming (rather than deriving) probabilities.
I discussed VNM specifically because that’s the best-understood coherence theorem and the one that I see misused in AI alignment most often.
This I certainly agree with.
I don’t know the formal statements of other coherence theorems, though I predict with ~98% confidence that any specific theorem you point me to would not change my objection.
Exactly which objection are you talking about here?
If it’s something like “coherence theorems do not say that tool AI is not a thing”, that seems true. Even today humans have plenty of useful tools with some amount of information processing in them which are probably not usefully model-able as expected utility maximizers.
But then you also make claims like “all behavior can be rationalized as EU maximization”, which is wildly misleading. Given a system, the coherence theorems map a notion of resources/efficiency/outcomes to a notion of EU maximization. Sure, we can model any system as an EU maximizer this way, but only if we use a trivial/uninteresting notion of resources/efficiency/outcomes. For instance, as you noted, it’s not very interesting when “outcomes” refers to “universe-histories”. (Also, the “preferences over universe-histories” argument doesn’t work as well when we specify the full counterfactual behavior of a system, which is something we can do quite well in practice.)
Combining these points: your argument largely seems to be “coherence arguments apply to any arbitrary system, therefore they don’t tell us interesting things about which systems are/aren’t <agenty/dangerous/etc>”. (That summary isn’t exactly meant to pass an ITT, but please complain if it’s way off the mark.) My argument is that coherence theorems do not apply nontrivially to any arbitrary system, so they could still potentially tell us interesting things about which systems are/aren’t <agenty/dangerous/etc>. There may be good arguments for why coherence theorems are the wrong way to think about goal-directedness, but “everything can be viewed as EU maximization” is not one of them.
Yes, if you add in some additional detail about resources, assume that you do not have preferences over how those resources are used, and assume that there are preferences over other things that can be affected using resources, then coherence theorems tell you something about how such agents act. This doesn’t seem all that relevant to the specific, narrow setting which I was considering.
Just how narrow a setting are you considering here? Limited resources are everywhere. Even an E. coli needs to efficiently use limited resources. Indeed, I expect coherence theorems to say nontrivial things about an E. coli swimming around in search of food (and this includes the possibility that the nontrivial things the theorem says could turn out to be empirically wrong, which in turn would tell us nontrivial things about E. coli and/or selection pressures, and possibly point to better coherence theorems).
I actually think it shouldn’t be in the alignment section, though for different reasons than Rohin. There’s lots of things which can be applied to AI, but are a lot more general, and I think it’s usually better to separate the “here’s the general idea” presentation from the “here’s how it applies to AI” presentation. That way, people working on other interesting things can come along and notice the idea and try to apply it in their own area rather than getting scared off by the label.
For instance, I think there’s probably gains to be had from applying coherence theorems to biological systems. I would love it if some rationalist biologist came along, read Yudkowsky’s post, and said “wait a minute, cells need to make efficient use of energy/limited molecules/etc, can I apply that?”. That sort of thing becomes less likely if this sort of post is hiding in “the alignment section”.
Zooming out further… today, alignment is the only technical research area with a lot of discussion on LW, and I think it would be a near-Pareto improvement if more such fields were drawn in. Taking things which are alignment-relevant-but-not-just-alignment and lumping them all under the alignment heading makes that less likely.
My understanding is that the usual lab mouse breeds are highly inbred, resulting in high levels of cancer. That makes it “easier”, in some sense, to extend their lifespans—especially by interventions which trade off cancer risk against other age-related deterioration. For instance, there are ways to make cells more sensitive to DNA damage, so they undergo senescence at lower damage levels. This can decrease cancer risk, at the cost of accelerating other age-related degeneration.
The key problem is… sometimes you actually just do need to have status fights, and you still want to have as-good-epistemics-as-possible given that you’re in a status fight. So a binary distinction of “trying to have good epistemics” vs “not” isn’t the right frame.
Part of my model here is that moral/status judgements (like “we should blame X for Y”) like to sneak into epistemic models and masquerade as weight-bearing components of predictions. The “virtue theory of metabolism”, which Yudkowsky jokes about a few times in the sequences, is an excellent example of this sort of thing, though I think it happens much more often and usually much more subtly than that.
My answer to that problem on a personal level is to rip out the weeds wherever I notice them, and build a dome around the garden to keep the spores out. In other words: keep morality/status fights strictly out of epistemics in my own head. In principle, there is zero reason why status-laden value judgements should ever be directly involved in predictive matters. (Even when we’re trying to model our own value judgements, the analysis/engagement distinction still applies.)
Epistemics will still be involved in status fights, but the goal is to make that a one-way street as much as possible. Epistemics should influence status, not the other way around.
In practice it’s never that precise even when it works, largely because value connotations in everyday language can compactly convey epistemically-useful information - e.g. the weeds analogy above. But it’s still useful to regularly check that the value connotations can be taboo’d without the whole model ceasing to make sense, and it’s useful to perform that sort of check automatically when value judgements play a large role.
Not exactly an answer to your question, but this post is probably relevant and has several short-but-concrete examples.
I’ve wanted for a while to see a game along these lines. It would have some sort of 1-v-1 fighting, but dominated by “random” behavior from environmental features and/or unaligned combatants. The centerpiece of the game would be experimenting with the “random” components to figure out how they work, in order to later leverage them in a fight.
Fleshing this out a bit more, within the framework of this comment: when we can consistently predict some outcomes using only a handful of variables, we’ve learned a (low-dimensional) constraint on the behavior of the world. For instance, the gas law PV = nRT is a constraint on the relationship between variables in a low-dimensional summary of a high-dimensional gas. (More precisely, it’s a template for generating low-dimensional constraints on the summary variables of many different high-dimensional gases.)
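To make the “constraint on summary variables” picture concrete, here’s a minimal sketch (illustrative numbers only): given any three of the summary variables P, V, n, T, the gas law pins down the fourth, even though the underlying gas has ~10^23 microscopic degrees of freedom.

```python
# Minimal illustration: the ideal gas law PV = nRT is a constraint on a
# handful of summary variables of a very-high-dimensional gas.
R = 8.314  # gas constant, J / (mol K)

def pressure(V, n, T):
    """Given three of the summary variables, the constraint pins down the fourth."""
    return n * R * T / V

# One mole of gas at room temperature in a 10 L (0.010 m^3) container:
P = pressure(V=0.010, n=1.0, T=298.0)
print(f"P = {P:.3g} Pa")  # ~2.5e5 Pa, regardless of the microscopic state
```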
When we flip perspective to problems of design (e.g. engineering), those constraints provide the structure of our problem—analogous to the walls in a maze. We look for “paths in the maze”—i.e. designs—which satisfy the constraints. Duality says that those designs act as constraints when searching for new constraints (i.e. doing science). If engineers build some gadget that works, then that lets us rule out some constraints: any constraints which would prevent the gadget from working must be wrong.
Data serves a similar role (echoing your comment here). If we observe some behavior, then that provides a constraint when searching for new constraints. Data and working gadgets live “in the same space”—the space of “paths”: things which definitely do work in the world and therefore cannot be ruled out by constraints.
I detect the ghost of Jaynes in this!
In particular, the view in this post is extremely similar to the view in Macroscopic Prediction. As there, reproducible phenomena are the key puzzle piece.
The material here is one seed of a worldview which I’ve updated toward a lot more over the past year. Some other posts which involve the theme include Science in a High Dimensional World, What is Abstraction?, Alignment by Default, and the companion post to this one Book Review: Design Principles of Biological Circuits.
Two ideas unify all of these:
1. Our universe has a simplifying structure: it abstracts well, implying a particular kind of modularity.
2. Goal-oriented systems in our universe tend to evolve a modular structure which reflects the structure of the universe.
One major corollary of these two ideas is that goal-oriented systems will tend to evolve similar modular structures, reflecting the relevant parts of their environment. Systems to which this applies include organisms, machine learning algorithms, and the learning performed by the human brain. In particular, this suggests that biological systems and trained deep learning systems are likely to have modular, human-interpretable internal structure. (At least, interpretable by humans familiar with the environment in which the organism/ML system evolved.)
This post talks about some of the evidence behind this model: biological systems are indeed quite modular, and simulated evolution experiments find that circuits do indeed evolve modular structure reflecting the modular structure of environmental variations. The companion post reviews the rest of the book, which makes the case that the internals of biological systems are indeed quite interpretable.
On the deep learning side, researchers also find considerable modularity in trained neural nets, and direct examination of internal structures reveals plenty of human-recognizable features.
Going forward, this view is in need of a more formal and general model, ideally one which would let us empirically test key predictions—e.g. check the extent to which different systems learn similar features, or whether learned features in neural nets satisfy the expected abstraction conditions, as well as tell us how to look for environment-reflecting structures in evolved/trained systems.
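To gesture at what such an empirical check might look like, here’s a rough sketch using linear centered kernel alignment (CKA), one standard measure of whether two trained networks have learned similar features. This is my own illustration with random stand-in activation matrices, not anything the post commits to.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (rows = inputs, cols = features)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    similarity = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return similarity / (norm_x * norm_y)

# Stand-ins for layer activations of two independently-trained nets on the same
# 500 inputs; in a real test these would come from the actual trained models.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 20))            # "features of the environment"
acts_a = shared @ rng.normal(size=(20, 64))    # net A's learned representation
acts_b = shared @ rng.normal(size=(20, 32))    # net B's learned representation
print(f"CKA(A, B) = {linear_cka(acts_a, acts_b):.2f}")  # well above what unrelated features give
```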
Its behaviour might counterfactually depend on factors which the experimenter did not vary and which did not naturally change over the course of the experiment.
Keep reading, the post gets to that.
This is an excellent post, with a valuable and well-presented message. This review is going to push back a bit and talk about some ways in which the post falls short, with the understanding that it’s still a great post.
There’s this video of a toddler throwing a tantrum. Whenever the mother (holding the camera) is visible, the child rolls on the floor and loudly cries. But when the mother walks out of sight, the toddler soon stops crying, gets up, and goes in search of the mother. Once the toddler sees the mother again, it’s back to rolling on the floor crying.
A key piece of my model here is that the child’s emotions aren’t faked. I think this child really does feel overcome, when he’s rolling on the floor crying. (My evidence for this is mostly based on discussing analogous experiences with adults—I know at least one person who has noticed some tantrum-like emotions just go away when there’s nobody around to see them, and then come back once someone else is present.)
More generally, a lot of human emotions are performative. They’re emotions which some subconscious process puts on for an audience. When the audience goes away, or even just expresses sufficient disinterest, the subconscious stops expressing that emotion.
In other words: ignoring these emotions is actually a pretty good way to deal with them. “Ignore the emotion” is decent first-pass advice for grown-up analogues of that toddler. In many such cases, the negative emotion will actually just go away if ignored.
Now, obviously a lot of emotions don’t fall into this category. The post is talking about over-applying the “ignore your emotions” heuristic, and the hazards of applying it in places where it doesn’t work. But what we really want is not an argument that the heuristic should be applied more or less often, but rather a useful criterion for when the “ignore your emotions” heuristic is useful. I suggest something like: will this emotion actually go away if ignored?
The post is mainly talking about dealing with your own emotions, but this criterion is especially useful for dealing with others’ emotions. When you are the audience, it’s relatively easy to remove the audience. Sometimes, another person’s negative emotion will just die down if you walk away, but will sustain itself if you hang around trying to “help”. The key thing to ask is “will this emotion just go away if it doesn’t have an audience?”.
I think you’re pointing to the same issue which Adam Zerner was pointing to. Hunting down sources of randomness is a good goal when doing science, but that doesn’t tell us much about how to go about the hunt when the solution space is very large.
Great question. This post is completely ignoring those points, and it’s really not something which should be ignored.
In the context of this post, the question is: ok, we’re trying to hunt down sources of randomness, trying to figure out which of the billions of variables actually matter, but how do we do that? We can’t just guess and check all those variables.
I think this roughly captures it. Minor caveat that picking the right few separate fields could be where the “hard part” is; for the strategy to really circumvent GEM, it also has to be tractable to pick the right few fields with not-too-low probability.
Causal inference (or more precisely learning causal structure) is exactly the sort of thing I have in mind here. There’s actually a few places in the post where I should distinguish between variables which control an outcome in an information sense (i.e. sufficient to perfectly predict the outcome) vs in a causal sense (i.e. sufficient to cause the outcome under interventions). The main reason I didn’t talk about it directly is because I would have had to explain that distinction, and decided that would be too much of a distraction from the main point.
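Here’s a toy simulation of that distinction (my own example, not something from the post): X below is informationally sufficient to predict Y almost perfectly, because a common cause Z drives both, but intervening on X does nothing to Y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Common cause Z drives both X and Y; X has no causal effect on Y.
Z = rng.normal(size=100_000)
X = Z.copy()                                   # observationally, X just mirrors Z
Y = Z + 0.01 * rng.normal(size=100_000)

# Information sense: X predicts Y (almost) perfectly.
print("corr(X, Y) =", round(np.corrcoef(X, Y)[0, 1], 3))         # ~1.0

# Causal sense: intervene, setting X independently of Z. Y doesn't budge.
X_do = rng.normal(size=100_000)
print("corr(do(X), Y) =", round(np.corrcoef(X_do, Y)[0, 1], 3))  # ~0.0
```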
I think the takeaway of Jason’s talk, as it relates to this post, is that a large chunk of the “science” of achieving consistent outcomes happens in inventors’ workshops rather than scientists’ labs. The problem is still largely similar, regardless of the label applied, but scientists aren’t the only ones doing science.
Even once you’ve figured out which dozen variables you need to control to get a sled to move at the same speed every time, you still can’t predict what that speed would be if you set these dozen variables to different values. You’ve got to figure out Newton’s laws of motion and friction before you can do that.
Finding out which variables are relevant to a phenomenon in the first place is usually a required initial step for building a predictive model...
Part of the implicit argument of the post is that the “figure out the dozen or so relevant variables” step is the “hard” one in a big-O sense, when the number of variables in the universe is large. This is for largely similar reasons to those in Everyday Lessons From High-Dimensional Optimization: in low dimensions, brute-force-ish methods are tractable. Thus we get things like tables of reaction rate constants. Before we had the law of mass action, there were too many variables potentially relevant to reaction rates to predict via brute force. But once we have mass action, there are few enough degrees of freedom that we can just try them out and make these tables of reaction constants.
Now, that still leaves the step of going from “temperature and concentrations are the relevant variables” to the law of mass action, but again, that’s the sort of thing where brute-force-ish exploration works pretty well. There is an insight step involved there, but it can largely be done by guess-and-check. And even before that insight is found, there’s few enough variables involved that “make a giant table” is largely tractable.
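As a concrete illustration of how few degrees of freedom remain once the relevant variables are known (toy constants, not real rate data): with the law of mass action plus an Arrhenius rate constant, the entire “model” of an elementary reaction’s rate is a couple of tabulated numbers.

```python
import math

R = 8.314  # gas constant, J / (mol K)

def reaction_rate(conc_A, conc_B, T, A=1e7, Ea=50_000.0):
    """Law of mass action for an elementary reaction A + B -> products.

    rate = k(T) * [A] * [B], with Arrhenius k(T) = A * exp(-Ea / (R*T)).
    The pre-exponential factor A and activation energy Ea are exactly the
    sort of per-reaction constants one tabulates; these values are made up.
    """
    k = A * math.exp(-Ea / (R * T))
    return k * conc_A * conc_B

# With only a handful of relevant variables left, "try them out and tabulate"
# is tractable:
for T in (298.0, 310.0, 350.0):
    print(f"T = {T} K, rate = {reaction_rate(0.1, 0.2, T):.3g}")
```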
Another type of widespread scientific work I can think of is facilitating efficient calculation...
Great point! It’s a very similar problem, with a very similar solution. We have some complicated system with a large number of lines/variables which could influence the outcome (i.e. the bug), and the main problem is to figure out which lines/variables mediate the influence of everything else. The first step is to reproduce the bug—i.e. hunt down all the sources of “randomness”, until we can make the bug happen consistently. After that, the next step is to look for mediation—i.e. find lines/variables which are “in between” our original reproduction-inputs and the bug itself, and which are themselves sufficient to reproduce the problem.
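Here’s a toy sketch of those two steps on a made-up two-stage pipeline (purely illustrative): first reproduce the bug consistently from the original inputs, then check whether an intermediate value is by itself sufficient to reproduce it, which tells us the bug lives downstream of that value.

```python
# Toy pipeline with a bug: parse() produces a negative width, render() chokes on it.
def parse(raw: str) -> int:
    return int(raw) - 100           # the (made-up) upstream culprit

def render(width: int) -> str:
    if width < 0:
        raise ValueError("bug: negative width")
    return "#" * width

raw_input = "42"

# Step 1: reproduce the bug consistently from the original inputs.
try:
    render(parse(raw_input))
except ValueError as e:
    print("reproduced from raw input:", e)

# Step 2: look for mediation. Is the intermediate value alone sufficient to
# reproduce the bug, with everything upstream cut out of the loop?
intermediate = parse(raw_input)     # -58
try:
    render(intermediate)
except ValueError as e:
    print("reproduced from intermediate:", e)
# If yes, the bug lies downstream of the intermediate, and everything upstream
# of it can be ignored when narrowing down further.
```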