adamShimi comments on The Natural Abstraction Hypothesis: Implications and Evidence

adamShimi 13 Jan 2022 19:47 UTC
LW: 4 AF: 4
Thanks for the post! Two general points I want to make before going into more specific comments:
- I liked the section on how concepts differ across cultures, and hadn’t thought much about it before, so thanks!
- One big aspect of the natural abstraction hypothesis that you missed IMO is “how do you draw the boundaries around abstractions?”, or more formally, how do you draw the Markov blanket. This to me is the most important question to answer for settling the NAH, and John’s recent work on sequences of Markov blankets is IMO him trying to settle this.
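To make the boundary question concrete, here is a rough sketch under the usual conditional-independence reading. The symbols $X$ (low-level variables inside the boundary), $Y$ (far-away variables), $M$ (the blanket) and $f$ (the summary) are my own notation for illustration, not something taken from the post. A set of variables $M$ acts as a Markov blanket between $X$ and $Y$ when

$$P(Y \mid X, M) = P(Y \mid M),$$

and a summary $f(X)$ qualifies as an abstraction of $X$ with respect to $Y$ when

$$P(Y \mid X) = P(Y \mid f(X)).$$

Settling where the boundary goes then amounts to saying which choices of $M$ (and hence which $X$ and $Y$) are the right ones.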
In general, we should expect that systems will employ natural abstractions in order to make good predictions, because they allow you to make good predictions without needing to keep track of a huge number of low-level variables.
Don’t you mean “abstractions” instead of “natural abstractions”?
John Maxwell proposes differentiating between the unique abstraction hypothesis (there is one definitive natural abstraction which we expect humans and AIs to converge on), and the useful abstraction hypothesis (there is a finite space of natural abstractions which humans use for making predictions and inference, and in general this space is small enough that an AGI will have a good chance of finding abstractions that correspond to ours, if we have it “cast a wide enough net”).
Note that both can be reconciled by saying that the abstraction in John’s sense (the high-level summary statistic that captures everything that isn’t wiped out by noise) is unique up to isomorphism (because it’s a summary statistic), but different predictors will learn different parts of this abstraction depending on the information that they don’t care about (things they don’t have sensors for, for example). Hence you have a unique natural abstraction, which generates a constrained space of subabstractions which are the ones actually learned by real-world predictors.
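Here is a toy sketch of that picture in Python. Everything in it is an illustrative assumption on my part: the choice of (mean, variance) as the "full" summary, and the names `natural_abstraction`, `predictor_a` and `predictor_b`, are hypothetical and not from the post or from John's work. The point is only that two different low-level states can share one summary, and that two predictors with different "sensors" each read off a different part of that same summary.

```python
from statistics import fmean, pvariance


def natural_abstraction(low_level):
    """Hypothetical full summary of the low-level state: (mean, variance).

    This stands in for "everything not wiped out by noise"; the real
    summary would depend on the system being modelled.
    """
    return (fmean(low_level), pvariance(low_level))


def predictor_a(summary):
    """A predictor that only has a 'sensor' for the mean component."""
    return summary[0]


def predictor_b(summary):
    """A predictor that only has a 'sensor' for the variance component."""
    return summary[1]


# Two different low-level states (same values, different arrangement)
# that the summary cannot tell apart.
state_1 = [1.0, 2.0, 3.0, 4.0]
state_2 = [4.0, 3.0, 2.0, 1.0]

summary_1 = natural_abstraction(state_1)
summary_2 = natural_abstraction(state_2)

# Each predictor learns a sub-abstraction of the same underlying summary.
view_a = predictor_a(summary_1)
view_b = predictor_b(summary_1)
```

In this toy setup the two predictors disagree about what they track, yet both views are functions of the single shared summary, which is the sense of "subabstraction" I have in mind.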
This might be relevant for your follow-up arguments.
To put it another way — the complexity of your abstractions depends on the depth of your prior knowledge. The NAH only says that AIs will develop abstractions similar to humans when they have similar priors, which may not always be the case.
Or another interpretation, following my model above, is that with more knowledge and more data and more time taken, you get closer and closer to the natural abstraction, but you don’t generally start up there. Although that would mean that the refinement of abstractions towards the natural abstraction often breaks the abstraction, which sounds fitting with the course of science and modelling in general.
Different strengths of the NAH can be thought of as corresponding to different behaviours of this graph. If there is a bump at all, this would suggest some truth to the NAH, because it means (up to a certain level) models can become more powerful as their concepts more closely resemble human ones. A very strong form of the NAH would suggest this graph doesn’t tail off at all in some cases, because human abstractions are the most natural, and anything more complicated won’t provide much improvement. This seems quite unlikely, especially for tasks where humans have poor prior knowledge, but the tailing off problem could be addressed by using an amplified overseer, and incentivising interpretability during the training process. The extent to which NAH is true has implications for how easy this process is (for more on this point, see next section).
Following my model above, one interpretation is that the bump means that when reaching a human level of competence, the most powerful abstractions available as approximations of the natural abstractions are the ones humans are using. Which is not completely ridiculous, if you expect that AIs built for “human-level competence at human tasks” would have knowledge, inputs and data similar to those of humans.
The later fall towards alien concepts might come from the approximation shifting to very different abstractions as the best approximation, just like the shift between classical physics and quantum mechanics.
However, it might make deceptive alignment more likely. There are arguments for why deceptive models might be favoured in training processes like SGD—in particular, learning and modelling the training process (and thereby becoming deceptive) may be a more natural modification than internal or corrigible alignment. So if $X$ is a natural abstraction, this would make it easier to point to, and (since having a good model of the base objective is a sufficient condition for deception) the probability of deception is subsequently higher.
Here is my understanding of your argument: because X is a natural abstraction and NAH is true, models will end up actually understanding X, which is a condition for the emergence of mesa-optimization. Thus NAH might make deception more likely by making one of its necessary conditions more likely. Is that a good description of what you propose?
I would assume that NAH actually makes deception less likely, because of the reasons you gave above about the better proxy, which might entail that the mesa-objective is actually the base-objective.
This abstractions-framing of instrumental convergence implies that getting better understanding of which abstractions are learned by which agents in which environments might help us better understand how an agent with instrumental goals might behave (since we might expect an agent will try to gain control over some variable $X$ only to the extent that it is controlling the features described by the abstraction $f (X)$ which it has learned, which summarizes the information about the current state relevant to the far-future action space).
Here too I feel that the model of “any abstraction is an approximation of the natural abstraction” might be relevant. Especially because if it’s correct, then in addition to knowing the natural abstraction, you have to understand which parts of it the model could approximate, and which parts of that approximation are relevant for its goal.
If NAH is strongly true, then maybe incentivising interpretability during training is quite easy, just akin to a “nudge in the right direction”. If NAH is not true, this could make incentivising interpretability really hard, and applying pressure away from natural abstractions and towards human-understandable ones will result in models Goodharting interpretability metrics, where a model is incentivised to trick the metrics into thinking it is forming human-interpretable concepts when it actually isn’t.
The less true NAH is, the harder this problem becomes. For instance, maybe human-interpretability and “naturalness” of abstractions are actually negatively correlated for some highly cognitively-demanding tasks, in which case the trade-off between these two will mean that our model will be pushed further towards Goodharting.
The model I keep describing could actually lead to abstractions that are hard to interpret and yet potentially interpretable. Think of teaching Quantum Mechanics to someone from 200 years ago. It might be really hard, and depend on a lot of mental moves that we don’t know how to convey, but that would be the sort of problem equivalent to interpreting better approximations than ours of the natural abstractions. It sounds at least possible to me.
A danger of this process is that the supervised learner would model the data-collection process instead of using the unsupervised model—this could lead to misalignment. But suppose we supplied data about human values which is noisy enough to make sure the supervised learner never switched to directly modelling the data-collection process, while still being a good enough proxy for human values that the supervised learner will actually use the unsupervised model in the first place.
Note that this amounts to solving ELK, which probably takes more than “adequately noisy data”.
We can argue that human values are properties of the abstract object “humans”, in an analogous way to branching patterns being properties of the abstract object “trees”. However there are complications to this analogy: for instance, human values seem especially hard to infer from behaviour without using the “inside view” that only humans have.
I mean, humans learning approximations of the natural abstractions through evolutionary processes makes a lot of sense to me, as it’s just evolution training predictors, right?
For another, even if we accept that “which abstractions are natural?” is a function of factors like culture, this argues against the “unique abstraction hypothesis” but not the “useful abstraction hypothesis”. We could argue that navigators throughout history were choosing from a discrete set of abstractions, with their choices determined by factors like available tools, objectives, knowledge, or cultural beliefs, but the set of abstractions itself being a function of the environment, not the navigators.
Hence with the model I’m discussing, the culture determined their choice of subabstractions but they were all approximating the same natural abstractions.