Has anyone explored using neural clusters found by mechanistic interpretability as part of a goal system?
The idea is that you would look for clusters corresponding to certain concepts, e.g. happiness or autonomy, and build those neural clusters into the goal system. If the system learned over time, it could refine those concepts.
This was inspired by the way human goals seem to contain concepts that themselves change over time.
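To make the idea concrete, here is a minimal sketch of one way it could look, assuming the "neural cluster" for a concept can be summarized as a direction in activation space (a crude linear-probe stand-in). All function names and the refinement scheme are illustrative assumptions, not an existing method or API:

```python
import numpy as np

# Hypothetical sketch: represent a concept's "neural cluster" as a unit vector in
# activation space, score states by how strongly they activate that direction,
# and let the direction drift as the concept is re-identified over time.

def fit_concept_direction(positive_acts: np.ndarray, negative_acts: np.ndarray) -> np.ndarray:
    """Estimate a concept direction as the difference of mean activations."""
    direction = positive_acts.mean(axis=0) - negative_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def concept_score(activation: np.ndarray, direction: np.ndarray) -> float:
    """Goal signal: projection of the current activation onto the concept direction."""
    return float(activation @ direction)

def refine_direction(old: np.ndarray, new: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Blend in a freshly estimated direction so the goal concept can be refined slowly."""
    mixed = (1 - lr) * old + lr * new
    return mixed / np.linalg.norm(mixed)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64
    true_dir = rng.normal(size=d)
    true_dir /= np.linalg.norm(true_dir)

    # Fake activations: "happy" examples lean along true_dir, "unhappy" ones do not.
    pos = rng.normal(size=(100, d)) + 2.0 * true_dir
    neg = rng.normal(size=(100, d))

    goal_dir = fit_concept_direction(pos, neg)
    print("score on a 'happy' activation:", concept_score(pos[0], goal_dir))
    print("score on an 'unhappy' activation:", concept_score(neg[0], goal_dir))

    # Later, re-estimate the cluster from fresh data and refine the goal concept.
    goal_dir = refine_direction(goal_dir, fit_concept_direction(pos, neg))
```

The key design question in this sketch is the `refine_direction` step: how much the goal concept is allowed to drift as the system's internal representation of it changes, which seems like where the interesting alignment questions would live.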
I'm thinking about secret projects that might be information-hazardous to each other but still need information from each other, so the connections between them are by necessity tenuous and transitory. Is that a topic that has been explored before?