Here is the broad technical plan that I am pursuing with most of my time (with my AI governance agenda taking up most of my remaining time):
Mathematically characterize a scale-free theory of intelligent agency which describes intelligent agents in terms of interactions between their subagents.
A successful version of this theory will retrodict phenomena like the Waluigi effect, solve theoretical problems like the five-and-ten problem, and make new high-level predictions about AI behavior.
Identify subagents (and subsubagents, and so on) within neural networks by searching their weights and activations for the patterns of interactions between subagents that this theory predicts.
A helpful analogy is how Burns et al. (2022) search for beliefs inside neural networks based on the patterns that probability theory predicts; a toy sketch of this appears after this list. However, I’m not wedded to any particular search methodology.
Characterize the behaviors associated with each subagent to build up “maps” of the motivational systems of the most advanced AI systems.
This would ideally give you explanations of AI behavior that scale in quality with how much effort you put in. E.g. you might be able to predict 80% of the variance in an AI’s choices by looking at which highest-level subagents are activated, then 80% of the remaining variance by looking at which subsubagents are activated (i.e. 96% cumulatively), and so on.
Monitor patterns of activations of different subagents to do lie detection, anomaly detection, and other useful things.
This wouldn’t be fully reliable—e.g. there’d still be some possible failures where low-level subagents activate in ways that, when combined, lead to behavior that’s very surprising given the activations of high-level subagents. (ARC’s research seems to be aimed at these worst-case examples.) However, I expect it would be hard even for AIs with significantly superhuman intelligence to deliberately contort their thinking in this way. And regardless, in order to solve worst-case examples it seems productive to try to solve the average-case examples first. (A second toy sketch after this list illustrates this kind of monitoring.)
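To make the Burns et al. (2022) analogy in step 2 concrete, here is a minimal sketch of a contrast-consistent probe: it searches a model’s activations for a direction whose outputs satisfy the consistency and confidence constraints that probability theory predicts for beliefs, without using any labels. The activation tensors, layer choice, and training details below are illustrative assumptions; the hoped-for analogue would search for whatever patterns of interaction between subagents the step-1 theory predicts, in place of these probability-theoretic constraints.

```python
import torch
import torch.nn as nn

# Toy contrast-consistent search in the spirit of Burns et al. (2022):
# find a probe whose outputs on a statement and its negation behave like
# complementary probabilities, without using any labels.

class Probe(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.linear(h))

def ccs_loss(p_pos, p_neg):
    # Consistency: p(x) and p(not x) should sum to 1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the degenerate p = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# hidden_pos / hidden_neg: activations for statements and their negations
# (hypothetical shapes; in practice these come from a real model layer).
dim = 512
hidden_pos = torch.randn(256, dim)
hidden_neg = torch.randn(256, dim)

probe = Probe(dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    loss = ccs_loss(probe(hidden_pos), probe(hidden_neg))
    loss.backward()
    opt.step()
```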
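And for the monitoring in step 4, a correspondingly minimal sketch: assuming you already had features tracking the activation levels of identified subagents (a big assumption), you could fit a simple density model to their co-activation patterns on trusted data and flag inputs whose patterns look anomalous. The Gaussian model, the feature shapes, and the threshold below are all illustrative choices, not a proposed method.

```python
import numpy as np

# Toy anomaly detector over hypothesized subagent activation patterns:
# fit a Gaussian to co-activation vectors collected on trusted behavior,
# then flag new activation patterns with large Mahalanobis distance.

def fit_gaussian(acts):
    # acts: (n_samples, n_subagents) array of subagent activation levels.
    mean = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + 1e-6 * np.eye(acts.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, cov_inv):
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Hypothetical data: activation levels of 8 subagents over 1000 trusted episodes.
rng = np.random.default_rng(0)
trusted = rng.normal(size=(1000, 8))
mean, cov_inv = fit_gaussian(trusted)

threshold = np.quantile(
    [mahalanobis(x, mean, cov_inv) for x in trusted], 0.99
)

def is_anomalous(activation_pattern):
    return mahalanobis(activation_pattern, mean, cov_inv) > threshold
```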
I’m focusing on step 1 right now. Note that my pursuit of it is overdetermined—I’m excited enough about finding a scale-free theory of intelligent agency that I’d still be working on it even if I didn’t think steps 2-4 would work, because I have a strong heuristic that pursuing fundamental knowledge is good. Trying to backchain from an ambitious goal to reasons why a fundamental scientific advance would be useful for achieving that goal feels pretty silly from my perspective. But since people keep asking me why step 1 would help with alignment, I decided to write this up as a central example.
For steps 2-4, I kinda expect current neural nets to be kludgy messes, and so not really have the nice subagent structure (even if you do step 1 well enough to have a thing to look for).
I’m also fairly pessimistic about step 1, but would be very excited to know what preliminary work here looks like.
If, as a system comes to approximate a powerful agent ever more closely, there’s actual convergence towards this type of hierarchical structure, I expect you’d see something clearly distinct from noise even in current LLMs. Intuition pump/proof-of-concept: the diagram comparing a theoretical prediction with empirical observations here.
Indeed, I think it’s one of the main promises of “top-down” agent foundations research. The applicability and power of the correct theory of powerful agents’ internal structures, whatever that theory may be, would scale with the capabilities of the AI system under study. It’ll apply to LLMs inasmuch as LLMs are actually on a trajectory to be a threat, and if we jump paradigms to something more powerful, the theory would start working better (as opposed to bottom-up MechInterp techniques, which would start working worse).
Interesting! Two questions:
What about the 5-and-10 problem makes it particularly relevant/interesting here? What would a ‘solution’ entail?
How far are you planning to build empirical cases, model them, and generalise from below, versus trying to extend pure mathematical frameworks like geometric rationality? Or are there other major angles of attack you’re considering?
Consider the version of the 5-and-10 problem in which one subagent is assigned to calculate U | take 5, and another calculates U | take 10. The overall agent solves the 5-and-10 problem iff the subagents reason about each other in the “right ways”, or have the right type of relationship to each other. What that specifically means seems like the sort of question that a scale-free theory of intelligent agency might be able to answer.
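To pin down the moving parts of that framing, here is a toy sketch (my own illustration, not a proposed solution): two subagents each report an estimate of utility conditional on their option being taken, and the overall agent picks the option with the higher estimate. The interesting part, namely how each subagent should reason about the other and about the overall agent’s choice without producing spurious conditional estimates, is exactly what is left as a stub.

```python
# Toy framing of the 5-and-10 problem as two subagents, one per option.
# Each subagent estimates the utility conditional on its option being taken;
# the overall agent takes the option whose subagent reports the higher estimate.
# The open question (the part a scale-free theory of agency would need to
# answer) is how each subagent should reason about the other, and about the
# overall agent's choice, without producing spurious conditional estimates.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Subagent:
    option: str
    # Estimate of U | "my option is taken". In the problematic versions of the
    # problem, this estimate depends on the subagent's model of the overall
    # agent (and hence of the other subagent), which is where spurious
    # reasoning like "if I take 10, utility is 0" can sneak in.
    estimate: Callable[[], float]

def choose(subagents: Dict[str, Subagent]) -> str:
    estimates = {name: sa.estimate() for name, sa in subagents.items()}
    return max(estimates, key=estimates.get)

# With "correct" conditional estimates the agent trivially takes the 10:
subagents = {
    "take_5": Subagent("take_5", estimate=lambda: 5.0),
    "take_10": Subagent("take_10", estimate=lambda: 10.0),
}
assert choose(subagents) == "take_10"
```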
I’m mostly trying to extend pure mathematical frameworks (particularly active inference and a cluster of ideas related to geometric rationality, including picoeconomics and ergodicity economics).
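For readers unfamiliar with that cluster, here is a minimal numerical illustration (my own toy example, not taken from those frameworks’ formal machinery) of the shared core move: aggregating multiplicative outcomes, or subagent utilities, by a weighted geometric mean rather than an arithmetic mean, which is also the quantity that time-average growth arguments in ergodicity economics single out.

```python
import numpy as np

# Arithmetic vs geometric aggregation of a repeated multiplicative gamble:
# each round multiplies wealth by 1.5 (heads) or 0.6 (tails), each with p = 0.5.
factors = np.array([1.5, 0.6])
probs = np.array([0.5, 0.5])

arithmetic_mean = np.sum(probs * factors)   # 1.05 > 1: looks favorable
geometric_mean = np.prod(factors ** probs)  # about 0.949 < 1: wealth decays over time

print(arithmetic_mean, geometric_mean)

# The same weighted geometric mean reappears when aggregating subagent
# utilities in the geometric-rationality / Nash-bargaining style: maximize
# prod_i u_i(x) ** w_i rather than sum_i w_i * u_i(x).
def geometric_aggregate(utilities, weights):
    utilities = np.asarray(utilities, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.prod(utilities ** weights)

print(geometric_aggregate([5.0, 10.0], [0.5, 0.5]))  # about 7.07
```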