> Second, this seems well-informed to me, but I can’t really see the connection to (my understanding of) shard theory here, other than it being Team Shard! Maybe that’ll be clearer in a later post.
Mostly in a later post. Ultimately, shard theory makes claims about goal/value formation in agents. In particular, some shard-theory flavored claims are:
- That agents will have multiple, contextually activated goals and values
- That we can predict which goals will be activated by considering which historical reinforcement events pertain to a given situation (e.g. is the cheese near the top-right corner, or not?)
- That each of these goals is itself made out of small pieces/circuits called “subshards”, which can be separately manipulated, activated, or influenced (see e.g. channels 55 and 42 having different effects when intervened upon)
  - So: we looked for “shards”, and (I think) found them.
- That it’s profitable to think of agents as having multiple contextual goals, instead of thinking of them as “optimizing for a fixed objective”
  - (I would not have tried this project or its interventions if not for shard theory; I found shard-theory reasoning very helpful throughout the project, and I have some sense of having cut to empirical truths more quickly because of the theory. But I haven’t yet done deep credit assignment on this question. I think a more careful credit assignment will come down to looking at my preregistered predictions and reasoning.)
- That we can predict what goals agents will form by considering their reinforcement schedules, and that we should gain skill at this art today, in current systems. It seems like a clear alignment win to be able to loosely predict what goals/generalization behavior a training process will produce.
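For readers unfamiliar with what “intervening on a channel” means mechanically: it amounts to rescaling or replacing one slice of an intermediate activation tensor during a forward pass. Here is a minimal sketch of that operation (hypothetical shapes and helper name; the channel indices 55 and 42 are from the experiments above, but this is not the actual experiment code):

```python
import numpy as np

def intervene_on_channel(activations: np.ndarray, channel: int, scale: float) -> np.ndarray:
    """Return a copy of (C, H, W) activations with one channel rescaled.

    scale=0 ablates the channel; scale>1 amplifies it. This is the simplest
    form of channel-level intervention (illustrative sketch only).
    """
    patched = activations.copy()
    patched[channel] *= scale
    return patched

# Toy example: 64 channels over an 8x8 spatial grid.
acts = np.ones((64, 8, 8))
patched = intervene_on_channel(acts, channel=55, scale=0.0)  # ablate channel 55
```

In a real network you would apply this inside a forward hook at a chosen layer and then compare the policy’s behavior with and without the patch; different channels (e.g. 55 vs. 42) producing different behavioral shifts is the kind of evidence for separately manipulable subshards referred to above.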
There are probably more ties I haven’t thought of. But hopefully this gives a little context!