Thank you for the reply. You might be interested in neural Darwinism if you’ve never heard of it; the comment you linked in the edit made me think of it: https://en.wikipedia.org/wiki/Neural_Darwinism.
I don’t have a good story for how reuse of subcomponents leads to cooperation across agents, but my gut says to look at reuse rather than voting or markets. Could be totally baseless though.
entropizer
This is really interesting. My immediate gut reaction is that this wrongly treats different subagents as completely separate entities, when really they’re all overlapping chimeras. For example, Subagent A might be composed of {C, D, E} while Subagent B is composed of {C, E, F}. Reuse of subcomponents seems like a more natural path to coordination than internal voting systems or prediction markets to me.
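To make the overlap picture concrete, here’s a toy sketch (all names and numbers made up, nothing tied to any real architecture): each subagent is just a bundle of shared subcomponents, so an update to a component they both reuse shifts both of their outputs at once, which is the sense in which reuse might buy coordination without any explicit voting mechanism.

```python
# Toy model of overlapping subagents: each subagent is a bundle of shared
# subcomponents, and its "opinion" is just the sum of their values.
subcomponents = {"C": 1.0, "D": 2.0, "E": -1.0, "F": 3.0}

subagent_a = ["C", "D", "E"]
subagent_b = ["C", "E", "F"]

def evaluate(subagent):
    return sum(subcomponents[name] for name in subagent)

print(evaluate(subagent_a), evaluate(subagent_b))  # 2.0 3.0

# "Learning" in a shared component moves both subagents together,
# without any voting or market between them.
subcomponents["C"] += 1.0
print(evaluate(subagent_a), evaluate(subagent_b))  # 3.0 4.0
```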
What distinguishes subagents from shards? Are both getting at the same idea? https://www.alignmentforum.org/w/shard-theory
Two additional senses in which a “right to be wrong” might be justified: individually, in differing risk preferences; societally, in the usefulness of holdout populations.
I don’t think people should try to emulate the heliocentrists, because I think that acting like they did would generally lead people to failure, not success. The lesson I take from this is that stubborn holdout populations who refuse to accept the obvious are important to the health of science as an ecosystem of ideas. But I don’t think stubbornness should be seen as a general-purpose virtue. I think Aristarchus and co just experienced epistemic luck.
I agree with much of this, but I suspect people aren’t sticking with activation-based interpretability only because the dimensionality of weight-based interpretability is intimidating. Rather, I think we have to be looking at activations if we want an analysis of the model’s behavior to contain safety-relevant semantics.
For example, I can know nothing about how safe a classifier that distinguishes A from B is, no matter how much I know about its weights, unless I know what A and B are. The same set of weights might be safe when acting on one problem distribution but unsafe when acting on another. We’ve got to characterize the problem distribution to assess safety.
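As a toy illustration of that point (everything here is hypothetical; the two “deployments” exist only in the comments), the very same weight vector can implement two classifiers whose safety properties are completely different, because the weights say nothing about what the features and outputs mean:

```python
import numpy as np

# The same weights, reused verbatim in two hypothetical deployments.
w = np.array([2.0, -1.0])
b = 0.0

def classify(x):
    return int(w @ x + b > 0)

# Deployment 1: x = [spam_score, sender_reputation]; output 1 = "filter the email".
# Deployment 2: x = [threat_score, bystander_density]; output 1 = "take an irreversible action".
x = np.array([0.9, 0.4])
print(classify(x))  # identical number either way; only the problem distribution
                    # and the meaning of A and B tell you whether this is safe
```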
By the good regulator theorem, it is true that if a model works well, it will carry some kind of weak copy of the training data inside its weights. In this sense, it might be reasonable to think that models “inherit” semantic features of the datasets they’re operating on, and maybe this provides a way to bridge between weight-based interpretability and human values regarding the model’s outputs. However, a lot of information can be lost in the process, because a good regulator only cares about reflecting the data insofar as doing so improves its performance on the task it’s conducting. Anything that isn’t a sufficient statistic for that task, given the particular dataset it has on hand, can be lost.
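Here’s a tiny worked example of the sufficient-statistic point, using toy data and ordinary least squares rather than anything model-specific: least squares only ever sees X^T X and X^T y, so two datasets that agree on those summaries produce identical weights even though they differ at every single data point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset 1: ordinary regression data.
X1 = rng.normal(size=(100, 2))
y1 = X1 @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=100)

# Dataset 2: negate every input and target. Every individual point changes,
# but X.T @ X and X.T @ y (the sufficient statistics for least squares) do not.
X2, y2 = -X1, -y1

w1 = np.linalg.solve(X1.T @ X1, X1.T @ y1)
w2 = np.linalg.solve(X2.T @ X2, X2.T @ y2)
print(np.allclose(w1, w2))  # True: the weights cannot tell the two datasets apart
```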
I really feel like omitted variables make any form of interpretability, weight-based or not, a doomed strategy for safety, at least interpretability in a vacuum without supplementation from other strategies. How are we supposed to detect the absence of a concept or value that the model ought to possess just from its weights or activations? Interpretability can only tell us what a model is doing, not what it isn’t. Analyzing the safety of the outputs doesn’t scale well for models that are smarter than us, but I think it’s necessary regardless.
(We might try to characterize omitted variables as undesirable invariances in the model’s behavior: changes to the inputs that should change its activations in certain ways but don’t. However, exhaustively describing all the undesired invariances, or knowing which changes to make to the inputs and which activation changes we’d want to see associated with each of them, seems prohibitive, and it risks running into problems with the principle of indifference, so I don’t think it can work.)
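For what it’s worth, here’s roughly what such an invariance check might look like for a single hand-picked direction; the “model” and the concept direction are toy stand-ins, not a real API. The hard part the paragraph above points at is that you’d need a check like this for every concept that ought to matter, and you don’t know that list in advance.

```python
import numpy as np

rng = np.random.default_rng(1)

def activations(x, W):
    # Stand-in for whatever maps an input to an internal activation vector.
    return np.tanh(W @ x)

W = rng.normal(size=(8, 4))
x = rng.normal(size=4)

# A direction in input space that we believe *should* change the model's behavior.
concept_direction = np.array([0.0, 0.0, 0.0, 1.0])
eps = 0.5

delta = activations(x + eps * concept_direction, W) - activations(x, W)
if np.linalg.norm(delta) < 1e-3:
    print("Activations are invariant to this change -- a candidate omitted variable.")
else:
    print("Activations respond to this change.")
```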