# Distance Functions are Hard

*[Epistemic status: Describes a failed research approach I had a while ago, and my only purpose here is to warn people off from that way of thinking. Every now and then I see someone working on an AIS subproblem say “if only we had a distance function for things in domain X”, and my intuition is that they are probably doing a wrong-way reduction. But I only mean this as a soft guideline, and I’m only somewhat confident in my current thinking on this.]*

~~~

Terminology: We use the terms *distance* or *distance function* to denote any function d that intuitively tells us how “dissimilar” any two members of a set X are (regardless of whether d satisfies the metric axioms).

## Counterfactual Worlds

Consider the counterfactual “If Lincoln were not assassinated, he would not have been impeached”. If we would like to say this has a truth value, we need to imagine what such a counterfactual world would have looked like: was it because Lincoln (somehow) survived his wounds, John Wilkes Booth (somehow) missed, that the plot was (somehow) discovered the day before, etc. Somehow, we must pick out the world that is in some sense “closest” to our actual world, but it seems very difficult to compare any two such worlds in a principled way.

To formalize Functional Decision Theory (FDT), we likely need to have a better understanding of counterfactuals, although even in restricted mathematical contexts, we don’t have a satisfactory understanding of why “If 0 = 1...” simply returns incoherence, yet “If the Modularity Theorem were false...” seemingly conjures up a possible world that we feel we can reason about.

(Also, in terms of corrigibility, we are often interested in formalizing the notion of “low-impact” agents, and the naive idea one often has is to define a distance metric on counterfactual world-states, as in p. 5 of Concrete Problems in AI Safety).

## Algorithmic Similarity

In the FDT framework, we do not view ourselves as a solitary agent, but as a *function* (or algorithm) that can be copied, modified, and read, and we wish to maximize the utility achieved by our algorithm. Minor details of our implementation that don’t affect our behavior (such as whether we are written in Java or Python) should not be decision-relevant, and if some algorithm does the same thing as us “most” of the time, then we would probably (e.g.) want to cooperate with it in a Prisoner’s Dilemma. Defining what it means for two algorithms to be similar remains an outstanding open problem.

At MSFP 2018, a small group (4–5) of us tried tackling this for a couple of hours; we had a few ideas that “felt” promising, but gradually realized that none of them made any sense, and ultimately gave up with the feeling that we hadn’t made any intellectual advances. I only say this to give outside-view evidence of intractability, because it’s difficult for me to concisely communicate *why* it’s hard (I could say “try it yourself for an hour and you’ll see”, but part of my point is that the hour is better spent elsewhere). For those who insist on *inside-view* evidence, here’s an outline of one of the ideas we had and why it turned out to be unworkable:

We attempted to partition algorithm-space into equivalence classes representing “conceptual similarity”, which should be no harder than defining a distance function on the space. By the Curry–Howard correspondence, we can rephrase this as asking when two proofs are similar (this felt easier for us to think about, but that’s entirely subjective). Suppose we have some proof A of size n, and we want to find proofs that “don’t use any fundamentally different ideas”. The obvious approach is to think of which proofs we can reach with minor edits: if we make an edit of size ϵ⋅n for some small ϵ and the result is still a valid proof, it should be more or less the same proof. If we take the closure under minor validity-preserving edits, it seems superficially plausible that this would yield exactly the similar proofs. However, suppose we discover a one-line proof B that’s totally different from A: then we can append it to A as a minor edit, and then gradually delete A with minor edits, until we have a drastically different proof (among other complications).
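The append-then-delete problem is visible even for raw strings, with validity checking removed entirely. This toy sketch (an illustration only, not a proof-system implementation) walks from one string to a completely different one while no single step edits more than a small fraction of the then-current string:

```python
def gradual_path(a: str, b: str, eps: float = 0.1):
    """Walk from string a to string b, editing at most
    max(1, eps * current length) characters per step:
    first append b in small chunks, then delete a in small chunks.
    (In the proof setting, each intermediate string would additionally
    have to remain a valid proof.)"""
    steps, k, d = [a], 0, 0
    while k < len(b):                               # phase 1: append b
        budget = max(1, int(eps * len(steps[-1])))
        k = min(len(b), k + budget)
        steps.append(a + b[:k])
    while d < len(a):                               # phase 2: delete a
        budget = max(1, int(eps * len(steps[-1])))
        d = min(len(a), d + budget)
        steps.append(a[d:] + b)
    return steps

path = gradual_path("A" * 60, "B" * 5, eps=0.1)
# path[0] is all A's and path[-1] is all B's, yet every step
# changed at most ~10% of the then-current string.
```

Each step is “minor” relative to the current string, yet the closure of minor edits connects everything to everything.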

## Adversarial Examples

Given some data point x correctly classified by an ML model, a new point x′ := x + ϵ is an *adversarial example* if it is now misclassified, despite differing from x only by a tiny perturbation ϵ (e.g. relatively small RGB changes to a few pixels). For *every* state-of-the-art image classifier tested, if you give me:

- *Any* image classified correctly by that model
- *Any* target class you would like to have the model misclassify the image as

Then I can usually find some small perturbation of that image that the model believes is in the target class with high probability.

In the classic example we can have GoogLeNet classify a panda as a gibbon with 99% confidence. Moreover, these have been found to generalize very well across different models, even with very different architectures. Last year, a paper came out taking this further, by obtaining adversarial examples with the best cross-generalization, and giving these to humans who had only a few seconds to classify the image. Interestingly, the humans were “fooled” in the sense that their snap judgments—those formed by their pure visual system—differed from how they classified the images when given more time for reflection. In terms of robustness to these examples, it seems, our perceptual system by itself is not qualitatively better than today’s classifiers, but our lens can see its own flaws.
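The classic panda example was generated with the fast gradient sign method from that line of work; below is a sketch of an iterated, *targeted* variant on a toy linear softmax model. The weights and dimensions here are made up for illustration, not taken from any real classifier:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_fgsm_step(x, W, target, eps):
    """One signed-gradient step pushing a linear softmax model's
    prediction for input x toward class `target`.
    For logits z = W @ x, the gradient of the targeted loss
    -log p[target] with respect to x is W.T @ (p - onehot),
    so we step against it."""
    p = softmax(W @ x)
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    grad = W.T @ (p - onehot)
    return x - eps * np.sign(grad)

# Toy demo: 3 classes, 5 "pixels", random weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
target = (int(np.argmax(W @ x)) + 1) % 3   # some class other than the prediction
x_adv = x
for _ in range(50):                        # many small signed steps
    x_adv = targeted_fgsm_step(x_adv, W, target, eps=0.05)
```

On real image models the same signed-gradient step is taken through backpropagation; the linear case just makes the gradient explicit.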

The paper was popularized in various places under a bolder headline, namely that there now existed full-blown adversarial examples *for humans* (reflection or not). This was showcased with a picture from a different part of the paper showing an image of a (somewhat dog-like) cat being given a tiny amount of noise, and subsequently looking like a dog to a human with any amount of visual processing and top-down feedback. This sparked controversy, with many pointing out that a small change (in RGB values) to some visual concept does not necessarily correspond to a small change in concept-space. The paper itself punted on this:

> it is philosophically difficult to define the real object class for an image that is not a picture of a real object. In this work, we assume that an adversarial image is misclassified if the output label differs from the human-provided label of the clean image that was used as the starting point for the adversarial image. We make small adversarial perturbations and we assume that these small perturbations are insufficient to change the true class.

And in response to comments, co-author Ian Goodfellow acknowledged on Twitter:

> While everyone else was scrambling to finish running experiments for ICML, my co-authors and I were having intense debates about philosophy and semantics and how to write the paper. Some of our open office colleagues were entertained by how surreal this sounded.

Making models robust against adversarial examples remains an outstanding and difficult topic with a considerable paper trail. The problem of merely *verifying* that a given model has no local adversarial examples (e.g. within a few RGB values of a given data point) has been the subject of some interesting formal verification work in the past couple years. But to even do this verification work, one needs a formal specification of what an adversarial example is, which in turn requires a formal specification of what a “small change” between (e.g.) images is, that somehow captures something about *conceptual* distance. It seems to me that even this smaller problem will be hard to solve in a philosophically satisfying way because of the inherent subjectivity/fuzziness in defining “distance in concept-space” or anything that even comes close.

## Distance Functions are Hard: The Evidence

What we are asking for, in all these instances, is some distance function precise enough to be mathematized in some form, yet robust enough to capture the many fuzzy desiderata we have in mind. It seems natural to ask what distance functions of this form have been successfully developed before. The Encyclopedia of Distances comes out to over 700 pages, split roughly in half between those distances used in pure math (especially, as one would expect, topology, geometry, and functional analysis), and those used in applied math, computing disciplines, and the natural sciences.

Of the distance functions listed in the latter half, most were simply “the obvious thing one would do” given the preexisting mathematical structure around the topic in question (e.g. Levenshtein distance on strings). Others were less obvious, but usually because they used nontrivial mathematical machinery to answer specific mathematical questions, not to actually shed light on fuzzy philosophical questions one would have about it.
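For concreteness, Levenshtein distance is exactly this kind of “obvious” construction: the minimal number of single-character edits, computed by a standard dynamic program. A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # dp[j] = distance between the processed prefix of a and b[:j].
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i          # prev = distance(a[:i-1], b[:j-1])
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,              # delete ca
                dp[j - 1] + 1,          # insert cb
                prev + (ca != cb),      # substitute; free if chars match
            )
    return dp[len(b)]

levenshtein("kitten", "sitting")  # classic example: 3 edits
```

Note that even here the function answers a purely combinatorial question about strings; it says nothing about whether two strings are conceptually similar.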

By the time we get to the social-science section, where most topics had no preexisting mathematical formalism at all, virtually none of the distances did anything to remedy this fuzziness by themselves. Though I do not claim to have spent that much time flipping through this tome, never did I see a distance notion that struck me as a profound non-mathematical insight, or that even gestured at an “art of coming up with distance functions”.

## Conclusions

I conclude, with medium confidence, that each of the questions posed in the first three sections will be particularly hard to answer in a satisfying way, and that if they are answered, it probably won’t be by thinking about distance functions directly.

As a general heuristic, I feel like if you’ve reduced a philosophical problem to “defining the appropriate distance function”, then it’s worth pausing to consider if you’ve made a wrong-way reduction. Chances are, the distance function you want is inherently value-laden, and so the problem of defining it inherits the difficulty of the value alignment problem itself.

I also think this heuristic is especially salient if you’re trying to capture something like “conceptual similarity/distance”: if you could do this, then you’d have an objective map/taxonomy of (a large fraction of) concept-space.

Learning a distance function between pictures of human faces has been used successfully to train deep-learning-based face recognition systems.

My takeaway from your examples is not that “distance functions are hard” so much as “hardcoding is brittle”. The general approach of “define a distance function and train a model based on it” has been pretty successful in machine learning.

Yes, perhaps I should’ve been more clear. Learning certain distance functions is a practical solution to some things, so maybe the phrase “distance functions are hard” is too simplistic. What I meant to say is more like: this is mostly applicable to Agent Foundations-like research, where we are trying to give a formal model of (some aspect of) how agents work. Sometimes, we can reduce our problem to defining the appropriate distance function, and it can feel like we’ve made some progress, but we haven’t actually gotten anywhere (the first two examples in the post are like this).

The third example, where we are trying to formally verify an ML model against adversarial examples, is a bit different now that I think of it. Here we apparently need a transparent, formally specified distance function if we are to have any hope of absolutely proving the absence of adversarial examples. And in formal verification, the specification problem often is just philosophically hard like this. So I suppose this example is less insightful, except insofar as it lends extra intuition for the other class of examples.

Well, a classifier that is 100% accurate would also do the job ;) (I’m not sure a 100% accurate classifier is feasible per se, but a classifier which can be made arbitrarily accurate given enough data/compute/life-long learning experience seems potentially feasible.)

Also, small perturbations aren’t necessarily the only way to construct adversarial examples. Suppose I want to attack a model M1, which I have access to, and I also have a more accurate model M2. Then I could execute an automated search for cases where M1 and M2 disagree. (Maybe I use gradient descent on the input space, maximizing an objective function corresponding to the level of disagreement between M1 and M2.) Then I hire people on Mechanical Turk to look through the disagreements and flag the ones where M1 is wrong. (Since M2 is more accurate, M1 will “usually” be wrong.)
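The disagreement search described above can be sketched in a few lines, with random sampling standing in for the gradient-based search and two hypothetical threshold models standing in for M1 and M2:

```python
import random

def find_disagreements(m1, m2, sample_input, n_samples=10_000):
    """Randomly search input space for points where two models disagree.
    (A gradient-based version would instead ascend a differentiable
    disagreement score; this is the simplest possible stand-in.)"""
    return [x for x in (sample_input() for _ in range(n_samples))
            if m1(x) != m2(x)]

# Toy stand-ins on 1-D inputs: M2 has a slightly different boundary.
random.seed(0)
m1 = lambda x: int(x > 0.5)       # the model under attack
m2 = lambda x: int(x > 0.45)      # the "more accurate" reference model
cases = find_disagreements(m1, m2, lambda: random.random())
```

Every disagreement found lies in the band between the two decision boundaries, which is exactly where the Mechanical Turk step would be needed to decide which model is wrong.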

This is actually one way to look at what’s going on with traditional small-perturbation adversarial examples. M1 is a deep learning model and M2 is a 1-nearest-neighbor model (not very good in general, but quite accurate in the immediate region of data points with known labels). The problem is that deep learning models don’t have a very strong inductive bias towards mapping nearby inputs to nearby outputs (sometimes called “Lipschitzness”). L2 regularization actually makes deep learning models more Lipschitz: smaller coefficients mean smaller singular values for the weight matrices, which means less capacity to stretch nearby inputs away from each other in output space. I think maybe that’s part of why it works.
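The Lipschitzness point can be made slightly more precise: for a feedforward network with 1-Lipschitz activations (such as ReLU), the product of the spectral norms (largest singular values) of the weight matrices upper-bounds the network’s Lipschitz constant, so uniformly shrinking weights shrinks the bound. A sketch of just that bound, with made-up weight matrices:

```python
import numpy as np

def lipschitz_upper_bound(weight_matrices):
    """Upper bound on the L2 Lipschitz constant of a feedforward net
    with 1-Lipschitz activations: the product of the largest singular
    values (spectral norms) of the weight matrices."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weight_matrices]))

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 16)), rng.normal(size=(4, 8))]
bound = lipschitz_upper_bound(layers)

# Uniformly halving the weights (the direction L2 regularization pushes)
# scales this two-layer bound by a factor of 0.25.
shrunk = lipschitz_upper_bound([0.5 * W for W in layers])
```

This is only an upper bound, and often a loose one, but it makes the “smaller weights, less stretching” intuition concrete.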

Hoping to expand the previous two paragraphs into a paper with Matthew Barnett before too long—if anyone wants to help us get it published, please send me a PM (neither of us has ever published a paper before).

I’m not convinced conceptual distance metrics must be value-laden. Represent each utility function by an AGI. Almost all of them should be able to agree on a metric such that each could adopt that metric in its thinking while losing only negligible value. The same could not be said for agreeing on a utility function. (The same could be said for agreeing on a utility-parametrized AGI design.)

I think it’s that any basis set I define in a super high dimensional space could be said to be value laden, though it might be tacit and I have little idea what it is. If I care about ‘causal structure’ or something that’s still relative to the sorts of affordances that are relevant to me in the space?

Is this the same value payload that makes activists fight over language to make human biases work for their side? I don’t think this problem translates to AI: If the AGIs find that some metric induces some bias, each can compensate for it.

This implies a measure over utility functions. It’s probably true under the Solomonoff measure, but abstract though such measures are, they are still values.

It’s sort of true that the correct distance function depends on your values. A better way to say it is that different distance functions are appropriate for different tasks, and they will be “better” or “worse” depending on how much you care about those tasks. But I don’t think asking for the “best” metric in this sense is helpful, because you don’t have to use the same metric for all tasks involving a certain space. Sometimes you want air distance, sometimes travel times. Maybe you have to decide because you’re computationally limited, but that’s not philosophically relevant.

With that in mind, my attempts at two of your examples. The adversarial examples first, because it’s the clearest question: I think the problem is that you are thinking too abstractly. I don’t think there is a meaningful sense of “concept similarity” that’s purely logical, i.e. independent of the actual world. The intuitive sense of similarity you’re trying to use here is probably something like this: over the space of images, you want the probability measure of encountering them. Then you get a metric where two subsets of image-space which are isomorphic under the metric always have the same measure. That is your similarity measure.

Counterfactuals usually involve some sort of probability distribution, which is then “updated” on the condition of the counterfactual being true, and then the consequent is judged under that distribution. What the initial distribution is depends on what you’re doing. In the case of Lincoln, it’s probably reasonable expectations of the future from before the assassination. But for something like “What if conservation of energy weren’t true”, it’s probably our current distribution over physics theories. Basically: what’s the most likely alternative? The mathematical example is a bit different. There are lots of ways to conclude a contradiction from 0 = 1, but it’s very hard to deduce a contradiction from denying the Modularity Theorem. If you were to just randomly perform logical inferences from “the Modularity Theorem is wrong”, then there is a subset of propositions, containing no claim that is a direct negation of another, that your deductions are unlikely to lead you out of (it matters, of course, in what way it is random, but it evidently works for “human mathematician who hasn’t seen the proof yet”).

“If Lincoln were not assassinated, he would not have been impeached” is a probabilistic statement that is not at all about THE Lincoln. It’s a reference-class analysis of leaders who did not succumb to premature death and had leadership, economy, etc. metrics similar to Lincoln’s. There is no “counterfactual” there in any interesting sense. It is not about the minute details of avoiding the assassination. If you state the apparent counterfactual more precisely, it would be something like

So, there is no issue with “If 0=1...” here, unlike with the other one, “If the modularity theorem were false”, which implies some changes in the very basics of mathematics, though one can also argue for the reference class approach there.

I feel like this is practically a frequentist/bayesian disagreement :D It seems “obvious” to me that “If Lincoln were not assassinated, he would not have been impeached” can be about the real Lincoln as much as me saying “Lincoln had a beard” is, because both are statements made using my model of the world about this thing I label Lincoln. No reference class necessary.

I am not sure if labels help here. I’m simply pointing out that logical counterfactuals applied to the “real Lincoln” lead to the sort of issues MIRI is facing right now when trying to make progress in the theoretical AI alignment issues. The reference class approach removes the difficulties, but then it is hard to apply it to the “mathematical facts”, like what is the probability of 100...0th digit of pi being 0 or, to quote the OP “If the Modularity Theorem were false...” and the prevailing MIRI philosophy does not allow treating logical uncertainty as environmental.

Sure. In the case of Lincoln, I would say the problem is solved by models even as clean as Pearl-ian causal networks. But in math, there’s no principled causal network model of theorems to support counterfactual reasoning as causal calculus. Of course, I more or less just think that we have an unprincipled causality-like view of math that we take when we think about mathematical counterfactuals, but it’s not clear that this is any help to MIRI understanding proof-based AI.

I don’t think I am following your argument. I am not sure what Pearl’s causal networks are and how they help here, so maybe I need to read up on it.