For steps 2-4, I kinda expect current neural nets to be kludgy messes, and so to not really have the nice subagent structure (even if you do step 1 well enough to have a thing to look for).
I’m also fairly pessimistic about step 1, but would be very excited to know what preliminary work here looks like.
> For steps 2-4, I kinda expect current neural nets to be kludgy messes, and so to not really have the nice subagent structure
If, as a system comes to ever-better approximate a powerful agent, there's actual convergence towards this type of hierarchical structure, I expect you'd see something clearly distinct from noise even in current LLMs. As an intuition pump/proof of concept, see the diagram comparing a theoretical prediction with empirical observations here.
Indeed, I think it's one of the main promises of "top-down" agent foundations research. The applicability and power of the correct theory of powerful agents' internal structures, whatever that theory may be, would scale with the capabilities of the AI system under study. It'll apply to LLMs inasmuch as LLMs are actually on a trajectory to be a threat, and if we jump paradigms to something more powerful, the theory would start working better (as opposed to bottom-up MechInterp techniques, which would start working worse).
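Not from the original exchange, but as a rough illustration of what "clearly distinct from noise" could mean in practice: a minimal sketch, assuming a toy "structure score" on hidden-state activations compared against a shuffled baseline. The metric, the synthetic data, and the function names are stand-ins for illustration, not the actual theory or test being discussed.

```python
# Hypothetical sketch: compare a structure score on (stand-in) hidden states
# against the same score on a shuffled "noise" baseline.
import numpy as np

def hierarchy_score(activations: np.ndarray) -> float:
    """Toy structure metric: mean absolute off-diagonal correlation
    between activation dimensions (higher = more shared structure)."""
    corr = np.corrcoef(activations.T)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diag)))

def noise_baseline(activations: np.ndarray, n_shuffles: int = 100, seed: int = 0) -> np.ndarray:
    """Destroy cross-dimension structure by independently permuting each column."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_shuffles):
        shuffled = np.column_stack(
            [rng.permutation(activations[:, j]) for j in range(activations.shape[1])]
        )
        scores.append(hierarchy_score(shuffled))
    return np.array(scores)

# Stand-in for real hidden states (tokens x hidden_dim); in practice these would
# come from a forward pass of the model under study.
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 4))  # shared low-dimensional structure
acts = latent @ rng.normal(size=(4, 64)) + 0.1 * rng.normal(size=(1000, 64))

observed = hierarchy_score(acts)
baseline = noise_baseline(acts)
print(f"observed structure score: {observed:.3f}")
print(f"shuffled baseline: mean {baseline.mean():.3f}, max {baseline.max():.3f}")
# If the observed score sits far outside the shuffled-baseline distribution,
# the structure is "clearly distinct from noise" in this (toy) sense.
```

The actual comparison in the linked diagram is presumably much richer than this; the point is only that a predicted structural signature can be checked against a noise baseline on today's models.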