Glad to hear that you aren’t recommending strategy research in general—because that’s what it looked like.
And yes, I think it’s incredibly hard to make sure we’re not putting effort into work with negative expected value, and I think that attention hazards are critical, and are the biggest place where strategy research has the potential to increase risks rather than ameliorate them. (Which is exactly why I’m confused that anyone would suggest that more such research should be done publicly and/or shared. And it’s why I don’t think that a more detailed object-level discussion makes sense here, in public.)
No, it just means you need an actual system model that is at least somewhat predictive in order to make decisions, and therefore a better grasp on the expected value of your investments than “let’s try something, who knows, let’s just take risks.”
the more money you have, the higher the variance on weird projects you should be funding.
Only if you’re sure the mean is positive—and there’s no reason to think that. In fact, it’s arguable that in a complex system, a priori, we should consider significant changes destabilizing and significantly net negative unless we have reason to think otherwise.
I’m very confused why you think that such research should be done publicly, and why you seem to think it’s not being done privately.
Also, regarding the following:
Strategy research would not be valuable if it was completely intractable. We believe some actors and attempts at strategy research can succeed, but it is hard to predict success beforehand.
Given the first sentence, I’m confused as to why you think that “strategy research” (writ large) is going to be valuable, given our fundamental lack of predictive ability in most of the domains where existential risk is a concern.
It seems strange to try to draw sharp boundaries around communities for the purposes of this argument, and given the obvious overlap and fuzzy boundaries, I don’t really understand what the claim that the “rationality community didn’t have an organisation like CEA” even means. This is doubly true given that, as far as I have seen, all of the central EA organizations are full of people who read or used to read LessWrong.
On point 2, which is the only one I can really comment on: yes, this seems like a useful paper, and I buy the argument that such an approach is critical for some purposes, including some of what we discussed on Goodhart’s Law - https://arxiv.org/abs/1803.04585 - where one class of misalignment can be explicitly addressed by your approach. Also see the recent paper here: https://arxiv.org/abs/1905.12186, which explicitly models causal dependencies (like in figure 2) to show a safety result.
Yes, and yes, I’m hoping to be there.
Note: I briefly tried a similar approach, albeit using polynomial functions with random coefficients rather than ANNs, and in R rather than Python, but couldn’t figure out how to say anything useful with it.
If this is of any interest, it is available here: https://gist.github.com/davidmanheim/5231e4a82d5ffc607e953cdfdd3e3939 (I also built simulations for bog-standard Goodhart)
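Since the gist is in R, here’s roughly what the bog-standard version looks like as a minimal Python sketch; the degree-4 polynomials, the 0.5 error scale, and the grid are arbitrary choices for illustration, not what the gist actually does:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_poly(degree=4):
    # Random polynomial on [-1, 1] with standard-normal coefficients.
    return np.polynomial.Polynomial(rng.normal(size=degree + 1))

x = np.linspace(-1, 1, 1001)
shortfalls = []
for _ in range(1000):
    goal = random_poly()                  # the thing we actually care about
    proxy = goal + 0.5 * random_poly()    # a correlated but imperfect metric
    x_proxy_opt = x[np.argmax(proxy(x))]  # the point a proxy-optimizer selects
    shortfalls.append(np.max(goal(x)) - goal(x_proxy_opt))

print("average loss in true-goal value from optimizing the proxy:", np.mean(shortfalls))
```

The average shortfall is positive, which is just the selection effect; the hard part was saying anything beyond that.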
I am unclear how much of my feeling that this approach is fairly useless reflects my not having kept pursuing such models and figuring out what can be said, or my diversion to other work that was more fruitful, rather than a fundamental difficulty in saying anything clear based on these types of simulations. I’d like to claim it’s the latter, but I’ll clearly note that this is heavily motivated reasoning.
I really like the connection between optimal learning and Goodhart failures, and I’d love to think about / discuss this more. I’ve mostly thought about it in the online case, since we can sample from human preferences iteratively and build human-in-the-loop systems, as I suggested here: https://arxiv.org/abs/1811.09246 “Oversight of Unsafe Systems via Dynamic Safety Envelopes” (which I think parallels, but is less developed than, one part of Paul Christiano’s approach). But I see why that’s infeasible in many settings, which is a critical issue that the offline case addresses.
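For concreteness, the online case I have in mind looks roughly like the toy loop below. This is only an illustration with an invented utility function and noise levels, not the dynamic safety envelope mechanism from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden human utility over a one-dimensional policy parameter (unknown to the system).
def true_utility(a):
    return -(a - 0.3) ** 2

# One noisy pairwise query to the human in the loop: "do you prefer a to b?"
def human_prefers(a, b):
    return true_utility(a) + rng.normal(0, 0.02) > true_utility(b) + rng.normal(0, 0.02)

# Online loop: propose a small perturbation, keep it only if the human prefers it.
current = 0.9
for _ in range(300):
    proposal = float(np.clip(current + rng.normal(0, 0.05), 0.0, 1.0))
    if human_prefers(proposal, current):
        current = proposal

print("policy parameter after online human feedback:", round(current, 2))  # drifts toward 0.3
```

The point is just that the system never needs a full offline model of the utility function, because it can keep querying; that is exactly what isn’t available in the settings the offline case addresses.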
I also want to note that this approach addresses issues of extremal Goodhart due to model insufficiency, and to an extent regressional Goodhart, but not regime change or causal Goodhart.
As an example of the former for human values, I’d suggest that “maximize food intake” is a critical goal in starving humans, but there is a point at which the goal becomes actively harmful, and if all you see are starving humans, you need a fairly complex model of human happiness to notice that. The same regime change applies to sex, and to most other specific desires.
As an example of the latter, causal Goodhart would be where an AI system optimizes for systems that are good at reporting successful space flights, rather than optimizing for actual success—any divergence leads to a system that will kill people and lie about it.
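As a toy version of the food-intake example (the functional form and numbers below are invented purely for illustration): a proxy fit only on data from starving humans extrapolates “more food is better” far past the point where it becomes actively harmful.

```python
import numpy as np

# Hypothetical "true happiness" as a function of food intake: increasing while
# starving, then sharply harmful past satiation (the regime change).
def true_happiness(intake):
    return intake - 2.0 * np.maximum(intake - 1.0, 0.0) ** 2

# A proxy fit only on starving humans (intake well below satiation) sees a
# monotone relationship and happily extrapolates it.
observed = np.linspace(0.0, 0.8, 50)
slope, intercept = np.polyfit(observed, true_happiness(observed), 1)

grid = np.linspace(0.0, 5.0, 501)
proxy = slope * grid + intercept

print("intake maximizing the proxy:     ", grid[np.argmax(proxy)])                  # runs to the edge of the range
print("intake maximizing true happiness:", grid[np.argmax(true_happiness(grid))])
print("true happiness at proxy optimum: ", true_happiness(grid[np.argmax(proxy)]))  # strongly negative
```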
Based on the discussions below, it seems clear to me that there are (at least) two continuous dimensions of legibility and coercion, which are often related but conceptually distinct. I think they are positively correlated in most good writing, so they are easily conflated, but clarifying them seems useful.
The first is Legible <--> Illegible, in Venkatesh Rao’s terms, as others suggested. This is typically the same as the serial-access vs random-access distinction, but has more to do with structure; trees are highly legible, but may not require a particular order. Rough notes from a lecture are illegible (even if they are typed rather than hand-written), but usually need to be read in order.
The second is Coercive <--> Non-coercive, mostly in the negative sense people disliked. Most of the time, the level of coercion is fairly low even in what we think of as coercive writing. For example, any writing that pushes a conclusion is attempting to change your mind, and hence is coercive. Structures that simply review or present evidence are non-coercive.
I think it takes effort to make something legible but non-coercive, and things that are illegible and non-coercive are either very high effort or badly structured. And since I’ve brought up Venkatesh Rao and mentioned two dimensions, I believe I’m morally required to construct a 2x2. I can’t upload a drawing in a comment, but I will “take two spectra (or watersheds) relevant to a complex issue, simplify each down to a black/white dichotomy, and label the four quadrants you produce.” Given his advice, I’ll use a “glossary of example ‘types’ to illustrate diversity and differentiation within the soup of ambiguity.”
Paternalistic non-fiction writing is legible but coercive; it assumes it knows best, but allows navigation. The Sequences are a good example; well-structured textbooks are often a better one. Note that being correct doesn’t change the level of coercion! There are plenty of coercive anti-evolution/religious biology “textbooks,” but the ones that teach actual science are no less coercive.
Unstructured wikis are illegible and non-coercive; the structure isn’t intended to make a point or convince you, but it also makes no effort to present things logically or clearly at a higher level. (Individual articles can be more or less structured or coercive, but the wiki format is not.)
Blueprints and diagrams are legible but non-coercive, since by their structure they only present information, rather than leading to a conclusion. Novels and other fiction are (usually) legible, but are often non-coercive. Sometimes there is an element of coercion, as in fables, Lord of the Flies, HP:MoR, and everything C.S. Lewis ever wrote, but the main goal is (or should be) to be immersive or entertaining rather than coercive or instructive.
Conversations, and almost any multi-person forum (including most LessWrong writing), are coercive and illegible. Tl;drs are usually somewhat illegible as well. The structure of a conversation is hard to follow: there are relevant posts and comments that aren’t clearly organized. At the same time, everyone is trying to push their reasoning.
It also fails to account for the fact that health care is, in a sense, an ultimate superior good—there is no level of income at which people don’t want more health, and their demand scales with more income. This combines with the fact that we don’t have good ways to exchange money for being healthier. (The same applies for intelligence / education.) I discussed this in an essay on Scott’s original post:
That’s all basically right, but if we’re sticking to causal Goodhart, the “without further assumptions” may be where we differ. I think that if the uncertainty is over causal structures, acting on the “correct” structure will be more likely to increase all of the metrics than acting on most of the incorrect ones.
(I’m uncertain how to do this, but) it would be interesting to explore this over causal graphs, where a system has control over a random subset of nodes, and a metric correlated to the unobservable goal is chosen. In most cases, I’d think that leads to causal Goodhart quickly; but if the set of nodes potentially used for the metric includes some that directly cause the goal, and others that can be intercepted to create causal Goodhart, uncertainty over the metric would lead to less causal Goodharting, since targeting the actual cause should improve the correlated metrics, while the reverse is not true.
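I haven’t worked this out properly, but here is a minimal sketch of the kind of simulation I mean; the particular graph and coefficients are arbitrary assumptions, just to illustrate the asymmetry:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(push_cause=0.0, push_gameable=0.0, n=100_000):
    # Toy linear structural model: C -> G (the unobservable goal), C -> M1, X -> M1, G -> M2.
    # The agent can intervene on C (an actual cause of the goal) or on X
    # (a node that inflates the metric M1 without touching the goal).
    C = rng.normal(size=n) + push_cause
    X = rng.normal(size=n) + push_gameable
    G = C + rng.normal(scale=0.5, size=n)              # the goal we care about
    M1 = 0.5 * C + X + rng.normal(scale=0.5, size=n)   # correlated metric, gameable via X
    M2 = G + rng.normal(scale=0.5, size=n)             # metric causally downstream of the goal
    return {name: round(val.mean(), 2) for name, val in [("G", G), ("M1", M1), ("M2", M2)]}

print("intervene on the cause C:   ", simulate(push_cause=1.0))    # G, M1, and M2 all rise
print("game the correlated node X: ", simulate(push_gameable=1.0)) # only M1 rises; the goal does not
```

If the agent doesn’t know whether M1 or M2 will be the scored metric, pushing on C improves both, while pushing on X only helps in one of the two cases, which is the intuition for why uncertainty over the metric should reduce causal Goodharting.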
It’s not exactly the same, but I would argue that the issues with “Dog” versus “Cat” for the picture are best captured with that formalism—the boundaries between categories are not strict.
To be more technical, there are a couple of places where fuzziness can exist. First, the mapping in reality is potentially fuzzy, since someone could, in theory, bio-engineer a kuppy or cat-dog. These would be partly members of the cat set, and partly members of the dog set, perhaps in proportion to their genetic resemblance to each of the parent categories.
Second, the process that leads to the picture, involving a camera and a physical item in space, is a mapping from reality to an image. That is, reality may have a sharp boundary between dogs and cats, but the space of possible pictures of a given resolution is far smaller than the space of physical configurations that can be photographed, so the mapping from reality->pictures is many-to-one, creating a different irresolvable fuzziness—perhaps 70% of the plausible configurations that lead to this set of pixels are cats, and 30% are dogs, so the picture has a fuzzy set membership.
Lastly, there is mental fuzziness, which usually captures the other two implicitly, but has the additional fuzziness created because the categories were made for man, not man for the categories. That is, the categories themselves may not map to reality coherently. This is different from the first issue, where “sharp” genetic boundaries like that between dogs and cats do map to reality correctly, but items can be made to sit on the line. This third issue is that the category may not map coherently to any actual distinction, or may be fundamentally ambiguous, as Scott’s post details for “Man vs. Woman” or “Planet vs. Planetoid”; items can partly match one or more than one category, and be fuzzy members of the set.
Each of these, it seems, can be captured fairly well as fuzzy sets, which is why I’m proposing that your usage has a high degree of membership in the fuzzy set of things that can be represented by fuzzy sets.
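To make that concrete, here is a minimal sketch using the standard fuzzy-set operations (the 0.7/0.3 membership numbers are just the illustration from above):

```python
# Fuzzy intersection and union: the standard min/max operations on membership degrees.
def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

# Graded membership of one ambiguous picture in the "cat" and "dog" sets.
membership = {"cat": 0.7, "dog": 0.3}

# Unlike a probability over two crisp outcomes, the picture is simultaneously
# mostly-a-cat and somewhat-a-dog, so "cat AND dog" is not forced to zero.
print("cat AND dog:", fuzzy_and(membership["cat"], membership["dog"]))  # 0.3
print("cat OR dog: ", fuzzy_or(membership["cat"], membership["dog"]))   # 0.7
```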
Also, I keep feeling bad that we’re perpetuating the practice of crediting Goodhart rather than Campbell, since Campbell was clearly first - https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2018.01205.x - and Goodhart explicitly said he was joking in a recent interview.
See my much shorter and less developed note to a similar effect: https://www.lesswrong.com/posts/QJwnPRBBvgaeFeiLR/uncertainty-versus-fuzziness-versus-extrapolation-desiderata#kZmpMGYGfwGKQwfZs - and I agree that regressional and extremal Goodhart cannot be fixed purely with his solution.
I will, however, defend some of Stuart’s suggestions as they relate to causal Goodhart in a non-adversarial setting. (I’m also avoiding the can of worms of game theory.) In that case, both randomization and mixtures of multiple metrics can address Goodhart-like failures, albeit in different ways. I had been thinking about this in the context of policy - https://mpra.ub.uni-muenchen.de/90649/ - rather than AI alignment, but some of the arguments still apply. (One critical argument that doesn’t fully apply is that “good enough” mitigation raises the cognitive costs of cheating to a point where aligning with the true goal is cheaper. I also noted in the paper that satisficing is useful for limiting the misalignment from metrics, and quantilization seems like one promising approach to satisficing for AGI.)
The argument for causal Goodhart is that randomization and mixed utilities are both effective in mitigating the causal-structure errors that lead to causal Goodhart in the one-party case. That’s because the failure occurs when uncertainty or mistakes about causal structure lead to choosing metrics that are merely correlated with the goal, rather than causes of it. However, if even some significant fraction or probability of the metric is causally connected to the goal in ways that cannot be gamed, that greatly mitigates this class of failure.
To more clearly apply this logic to human utility: if we mistakenly think that endorphins in the brain are 100% of human goals, an AGI might want to tile the universe with rats on happy drugs, or the moral equivalent. If we assign this only 50% weight, or have a 50% probability that it will be the scored outcome, and we define something that requires a different way of creating what we actually think of as happiness / life satisfaction, it does not just shift the optimum to 50% of the universe tiled with rat brains. This is because the alternative class of hedonium will involve a non-trivial amount of endorphins as well; as long as other solutions have anywhere close to as much endorphins, they will be preferred. (In this case, admittedly, we got the endorphin goal so wrong that 50% of the universe tiled in rats on drugs is likely anyway; bad enough utility functions can’t be fixed with either randomization or weighting. But if a causal mistake can be fixed with either a probabilistic or a weighting solution, it seems likely it can be fixed with the other.)
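A toy version of the arithmetic, with numbers invented purely for illustration:

```python
# Two candidate "world designs" scored on the two components of the objective.
worlds = {
    "tile the universe with rats on endorphins": {"endorphins": 1.0, "satisfaction": 0.0},
    "genuinely satisfied people":                {"endorphins": 0.7, "satisfaction": 1.0},
}

def score(world, w_endorphins, w_satisfaction):
    return w_endorphins * world["endorphins"] + w_satisfaction * world["satisfaction"]

for name, world in worlds.items():
    print(name)
    print("  endorphin-only objective:", score(world, 1.0, 0.0))
    print("  50/50 mixed objective:   ", score(world, 0.5, 0.5))

# Because the satisfied-people design still produces substantial endorphins,
# the 50/50 mixture prefers it outright rather than tiling half the universe with rats.
```

Of course, if the endorphin component is wrong enough, no weighting saves you, which is the caveat in the parenthetical above.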
I really like this formulation, and it greatly clarifies something I was trying to note in my recent paper on multiparty dynamics and failure modes - https://www.mdpi.com/2504-2289/3/2/21/htm. The discussion about the likelihood of mesa-optimization due to human modeling is close to the more general points I tried to make in the discussion section of that paper. As argued here about humans, other systems are optimizers (even if they are themselves only base optimizers), and therefore any successful machine-learning system in a multiparty environment is implicitly forced to model the other parties. I called this the “opponent model,” and argued that such models are dangerous because they are always approximate, arguing directly from that point to the claim that there is great potential for misalignment. The implication from this work, though, is that they are also dangerous because modeling other parties encourages machine-learning systems in multi-agent environments to become mesa-optimizers, and that mesa-optimization is a critical enabler of misalignment even when the base optimizer is well aligned.
I would add to the discussion here that multiparty systems can display the same dynamics, and therefore have risks similar to those of systems which require human models. I also think, less closely connected to the current discussion but directly related to my paper, that mesa-optimizer misalignments pose new and harder-to-understand risks when they interact with one another. I also strongly agree with the point that current examples are not really representative of the full risk. Unfortunately, peer reviewers strongly suggested that I include more concrete examples of failures. But as I said in the paper, “the failures seen so far are minimally disruptive. At the same time, many of the outlined failures are more problematic for agents with a higher degree of sophistication, so they should be expected not to lead to catastrophic failures given the types of fairly rudimentary agents currently being deployed. For this reason, specification gaming currently appears to be a mitigable problem, or as Stuart Russell claimed, be thought of as “errors in specifying the objective, period.””
As a final aside, I think that the concept of mesa-optimizers is very helpful in laying out the argument against that last claim; misalignment is more than just misspecification. I think that this paper will be very helpful in showing why.
Actually, I assumed fuzzy was intended here to be a precise term, contrasted with probability and uncertainty, as it is used in describing fuzzy sets versus uncertainty about set membership. https://en.wikipedia.org/wiki/Fuzzy_set
I missed the proposal when it was first released, but I wanted to note that the original proposal addresses only one (critical) class of Goodhart error, and proposes a strategy based on addressing one problematic result of that class, the nearest-unblocked-neighbor problem. The strategy is more widely useful for misspecification than just nearest-unblocked neighbor, but it is still only addressing some Goodhart effects.
The misspecification discussed is more closely related to, but still distinct from, extremal and regressional Goodhart. (Causal and adversarial Goodhart are somewhat far removed, and don’t seem as relevant to me here. Causal Goodhart is due to mistakes, albeit fundamentally hard-to-avoid mistakes, while adversarial Goodhart happens via exploiting other modes of failure.)
I notice I am confused about how different strategies being proposed to mitigate these related failures can coexist if each is implemented separately, and/or how they would be balanced if implemented together, as I briefly outline below. Reconciling or balancing these different strategies seems like an important question, but I want to wait to see the full research agenda before commenting or questioning further.
Extremal Goodhart is somewhat addressed by another post you made, which proposes to avoid ambiguous distant situations - https://www.lesswrong.com/posts/PX8BB7Rqw7HedrSJd/by-default-avoid-ambiguous-distant-situations. It seems that the strategy proposed here is to attempt to resolve fuzziness, rather than to avoid areas where it becomes critical. These seem to be at least somewhat at odds, though this is partly reconcilable by pursuing neither fully: neither completely resolving ambiguity nor entirely avoiding distant ambiguity.
And regressional Goodhart, as Scott G. originally pointed out, is unavoidable except by staying in-sample, interpolating rather than extrapolating. Fully pursuing that strategy is precluded by injecting uncertainty into the model of the human-provided modification to the utility function. Again, this is partly reconcilable, for example, by trying to bound how far we let the system stray from the initially provided blocked strategy, and how much fuzziness it is allowed to infer without an external check.
Yes, that does seem to be a risk. I would think that applying Schelling fences to reinforce current values reduces the amount of expected drift in the future, and I’m unclear whether you are claiming that using Schelling fences will do the opposite, or claiming that they are imperfect.
I’d also like to better understand what specifically you think commits the error of making it difficult to re-align with current values, rather than reducing the degree of drift, and how it could be handled differently.
That’s a very good point, I was definitely unclear.
I think that the critical difference is that in epistemically healthy communities, when such a failure is pointed out, some effort is spent on identifying and fixing the problem, instead of pointedly ignoring it despite efforts to solve the problem, or spending time actively defending the inadequate status quo from even Pareto-improving changes.