Yep thanks! I would imagine if progress goes well on describing modularity in an information-theoretic sense, this might help with (2), because information entanglement between a single module and the output would be a good measure of “relevance” in some sense
CallumMcDougall
The Natural Abstraction Hypothesis: Implications and Evidence
Theories of Modularity in the Biological Literature
Project Intro: Selection Theorems for Modularity
How I use Anki: expanding the scope of SRS
Reasoning: Training independent parts to each perform some specific sub-calculation should be easier than training the whole system at once.
Since I’ve not been involved in this discussion for as long, I’ll probably miss some subtlety here, but my immediate reaction is that “easier” might depend on your perspective. If you’re explicitly enforcing modularity in the architecture (e.g. see the “Direct selection for modularity” section of our other post) then I agree it would be a lot easier, but whether modular systems are selected for when they’re being trained on factorisable tasks is kinda the whole question. Since sections of biological networks do sometimes evolve completely in isolation from each other (because they’re literally physically separated), it does seem plausible that something like this is happening, but it doesn’t really move us closer to a gears-level model of what’s causing modularity to be selected for in the first place. I imagine I’m misunderstanding something here though.
So if module three is doing great, but module five is doing abysmally, and the answer depends on both being right, your loss is really bad. So the optimiser is going to happily modify three away from the optimum it doesn’t know it’s in.
Maybe one way to get around it is that the loss function might not just be a function of the final outputs of each subnetwork combined, it might also reward bits of subcomputation? e.g. to take a deep learning example which we’ve discussed, suppose you were training a CNN to calculate the sum of 2 MNIST digits, and you were hoping the CNN would develop a modular representation of these two digits plus an “adding function”—maybe the network could also be rewarded for the subtask of recognising the individual digits? It seems somewhat plausible to me that this kind of thing happens in biology, otherwise there would be too many evolutionary hurdles to jump before you get a minimum viable product. As an example, the eye is a highly complex and modular structure, but the very first eyes were basically just photoreceptors that detected areas of bright light (making it easier to navigate in the water, and hide from predators I think). So at first the loss function wasn’t so picky as to only tolerate perfect image reconstructions of the organism’s surroundings; instead it simply graded good brightness-detection, which I think could today be regarded as one of the “factorised tasks” of vision (although I’m not sure about this).
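The auxiliary-reward idea above can be sketched in a few lines. This is a purely illustrative numpy sketch (the function names, the 0.5 weighting, and the 19-way sum classification are all my own assumptions, not anything from the discussion): the total loss combines the final sum-prediction loss with extra terms rewarding the digit-recognition sub-tasks.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits (numerically stable)."""
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def combined_loss(sum_logits, d1_logits, d2_logits, d1, d2, aux_weight=0.5):
    """Final-task loss plus auxiliary losses for the digit sub-tasks.

    sum_logits: 19 classes (possible sums 0..18 of two MNIST digits)
    d1_logits, d2_logits: 10 classes each (hypothetical per-digit heads)
    """
    final = cross_entropy(sum_logits, d1 + d2)
    aux = cross_entropy(d1_logits, d1) + cross_entropy(d2_logits, d2)
    return final + aux_weight * aux
```

With `aux_weight=0` this reduces to the ordinary end-to-end loss; raising it is the analogue of evolution “grading” an intermediate capability like brightness detection on its own.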
It seems like an environment that changes might cause modularity. Though, aside from trying to make something modular, it seems like it could potentially fall out of stuff like ‘we want something that’s easier to train’.
This seems really interesting in the biological context, and not something we discussed much in the other post. For instance, if you had two organisms, one modular and one not, even if there’s currently no selection advantage for the modular one, it might just be trained much faster and hence be more likely to hit on a good solution before the nonmodular network does (i.e. just because it’s searching over parameter space at a faster rate).
Thanks! Yeah, there is one add-on I use for tag management. It’s called Search and Replace Tags: you select a bunch of cards in the browser and press Ctrl+Alt+Shift+T to change their tags. When you press that, you get to choose any tag that’s possessed by at least one of the cards you’re selecting, and replace it with any other tag.
There are also built-in Anki features to add, delete, and clear unused tags (to find those, right-click on selected cards in the browser, and hover over “Notes”). For a long time I didn’t realise those existed, and was pretty annoyed when I finally found them! XD
Hope this helps!
Skilling-up in ML Engineering for Alignment: request for comments
Sorry for forgetting to reply to this at first!
There are 2 different ways I create code cards: one is in Jupyter notebooks and one is the “normal way”, i.e. using the Anki editor. I’ve just created a GitHub repo describing the second one:
https://github.com/callummcdougall/anki_templates
Please let me know if there’s anything unclear here!
Oh wow, I wish I’d come across that plugin previously, that’s awesome! Thanks a bunch (-:
Thanks for the post! I just wanted to clarify what concept you’re pointing to with your use of the word “deception”.
From Evan’s definition in RFLO, deception needs to involve some internal modelling of the base objective & training process, and instrumentally optimising for the base objective. He’s clarified in other comments that he sees “deception” as referring only to inner alignment failures, not outer ones (because deception is defined in terms of the interaction between the model and the training process, without introducing humans into the picture). This doesn’t include situations like the first one, where the reward function is too underspecified to produce the behaviour we want (although it does produce behaviour that looks like what we want, unless we peer under the hood).
To put it another way, it seems like the way deception is used here refers to the general situation where “AI has learnt to do something that humans will misunderstand / misinterpret, regardless of whether the AI actually has an internal representation of the base objective it’s being trained on and the humans doing the training.”
In this situation, I don’t really know what the benefit is of putting these two scenarios into the same class, because they seem pretty different. My intuitions about this might be wrong though. Also I guess this is getting into the inner/outer alignment distinction which opens up quite a large can of worms!
Source: original, but motivated by trying to ground WFLL1-type scenarios in what we already experience in the modern world, so heavily based on this. Also the original idea came from reading Neel Nanda’s “Bird’s Eye View of AI Alignment—Threat Models”
Intended audience: mainly policymakers
A common problem in the modern world is that incentives don’t always match up with the value being produced for society. For instance, corporations have an incentive to maximise profit, which can lead to producing value for consumers, but can also involve less ethical strategies such as underpaying workers, regulatory capture, or tax avoidance. Laws & regulations are designed to keep behaviour like this in check, and this works fairly well most of the time. Some reasons for this are: (1) people have limited time/intelligence/resources to find and exploit loopholes in the law, (2) people usually follow societal and moral norms even if they’re not explicitly represented in law, and (3) the pace of social and technological change has historically been slow enough for policymakers to adapt laws & regulations to new circumstances. However, advances in artificial intelligence might destabilise this balance. To return to the previous example, an AI tasked with maximising profit might be able to find loopholes in laws that humans would miss, it would have no particular reason to pay attention to societal norms, and it might be improving and becoming integrated with society at a rate which makes it difficult for policy to keep pace. The more entrenched AI becomes in our society, the worse these problems will get.
Okay I see, yep that makes sense to me (-:
Yeah I think this is Evan’s view. This is from his research agenda (I’m guessing you might have already seen this given your comment, but I’ll add it here for reference anyway in case others are interested):
I suspect we can in fact design transparency metrics that are robust to Goodharting when the only optimization pressure being applied to them is coming from SGD, but cease to be robust if the model itself starts actively trying to trick them.
And I think his view on deception through inner optimisation pressure is that this is something we’ll basically be powerless to deal with once it happens, so the only way to make sure it doesn’t happen is to chart a safe path through model space which never enters the deceptive region in the first place.
Yeah I think the key point here more generally (I might be getting this wrong) is that C represents some partial state of knowledge about X, i.e. macro rather than micro-state knowledge. In other words it’s a (non-bijective) function of X. That’s why (b) is true, and the equation holds.
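The macro-vs-micro point can be made concrete with a toy stdlib sketch (the particular coarse-graining below is made up for illustration, and I can’t reproduce the specific equation from the thread): if C is a non-bijective function of X, i.e. a macro-state computed from the micro-state, then its empirical entropy can’t exceed that of X.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Micro-states: X uniform over 0..7 (3 bits). Macro-state: C = f(X) keeps
# only the high bit, a non-bijective coarse-graining of X (1 bit).
X = list(range(8)) * 10
C = [x // 4 for x in X]

assert entropy(C) <= entropy(X)  # coarse-graining can only lose information
```

Here H(X) = 3 bits and H(C) = 1 bit, matching the intuition that C carries only partial knowledge about X.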
Ten experiments in modularity, which we’d like you to run!
What Is The True Name of Modularity?
I guess another point here is that we won’t know how different (for example) our results when sampling from the training distribution will be from our results if we just run the network on random noise and then intervene on neurons; this would be an interesting thing to test experimentally. If they’re very similar, that neatly sidesteps the problem of deciding which one is more “natural”, and if they’re very different then that’s also interesting.
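A minimal sketch of that comparison might look like the following. Everything here is hypothetical: the weights are random stand-ins for a trained model, “structured” inputs are a crude rank-1 proxy for on-distribution data, and the intervention is simple zero-ablation of one hidden unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fixed two-layer network standing in for a trained model
# (weights are random here purely for illustration).
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, ablate_unit=None):
    """Forward pass, optionally zero-ablating one hidden unit."""
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
    if ablate_unit is not None:
        h = h.copy()
        h[:, ablate_unit] = 0.0
    return h @ W2

def ablation_effect(inputs, unit):
    """Mean absolute output change caused by ablating `unit` on `inputs`."""
    return np.abs(forward(inputs) - forward(inputs, ablate_unit=unit)).mean()

# "On-distribution" inputs (structured, low-dimensional) vs pure noise.
structured = np.outer(rng.normal(size=100), np.ones(8))  # rank-1 "data"
noise = rng.normal(size=(100, 8))

effect_data = ablation_effect(structured, unit=3)
effect_noise = ablation_effect(noise, unit=3)
```

If `effect_data` and `effect_noise` come out close for a real trained network, the choice of input distribution matters less for intervention results; if they diverge, the divergence itself is the interesting finding.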
Thanks for the comment!
To check that I understand the distinction between those two: inputs to human values are features of the environment around which our values are based. For example, the concept of liberty might be an important input to human values because the freedom to exercise your own will is a natural thing we would expect humans to want, whereas humans can differ greatly in things like (1) metaethics about why liberty matters, and (2) the extent to which liberty should be traded off against other values, if indeed it can be traded off at all. People might disagree about interpretations of these concepts (especially across different cultures), but in a world where these weren’t natural abstractions, we might expect disagreement in the first place to be extremely hard, because the discussants aren’t even operating on the same wavelength, i.e. they don’t really have a set of shared concepts to structure their disagreements around.
Yeah, that’s a good point. I think partly that’s because my thinking about the NAH basically starts with “the inside view seems to support it, in the sense that the abstractions that I use seem natural to me”, and so from there I start thinking about whether this is a situation in which the inside view should be trusted, which leads to considering the validity of arguments against it (i.e. “am I just anthropomorphising?”).
However, to give a few specific reasons for thinking it’s plausible that don’t just rely on the inside view:
Humans were partly selected for their ability to act in the world to improve their situation. Since abstractions are all about finding good high-level models that describe things you might care about and how they interact with the rest of the world, it seems like there should have been competitive pressure for humans to find good abstractions. This argument doesn’t feel very contingent on the specifics of human cognition or what our simplicity priors are; rather, the abstractions should be a function of the environment (hence convergence to the same abstractions by other cognitive systems which are also under competition, e.g. in the form of computational efficiency requirements, seems intuitive).
There’s lots of empirical evidence that seems to support it, at least at a weak level (e.g. CLIP as discussed in my post, or GPT-3 as mentioned by Rohin in his summary for the newsletter)
Returning to the clarification you made about inputs to human values being the natural abstraction rather than the actual values, it seems like the fact that different cultures can have a shared basis for disagreement might support some form of the NAH rather than arguing against it? I guess that point has a few caveats though, e.g. (1) all cultures have been shaped significantly by global factors like European imperialism, and (2) humans are all very close together in mind design space so we’d expect something like this anyway, natural abstraction or not