It seems to me that an “implementation” of something like Infra-Bayesianism which can realistically compete with modern LLMs would ultimately look a lot like a semi-theoretically-justified modification to the loss function or optimizer of agentic fine-tuning / RL, or possibly to its scaffolding, to encourage it to generalize conservatively. This intuition comes in two parts:
1: The pre-training phase is already finding a mesa-optimizer that does induction in context. I usually think of this as something like Solomonoff induction with a good inductive bias, but probably you would expect something more like logical induction. I expect the answer to be somewhere in between. I’ll try to test this empirically at ARENA this May. The point is that I struggle to see how IB applies here, on the level of pure prediction, in practice. It’s possible that this is just a result of my ignorance or lack of creativity.
2: I’m pessimistic about learning results for MDPs or environments “without traps” having anything to do with building a safe LLM agent.
If IB is only used in this heuristic way, we might expect fewer of the mathematical results to transfer, and instead just port over some sort of pessimism about uncertainty. In fact, Michael Cohen’s work follows pretty much exactly this approach at times (I’ve read him mention IB about once, apparently as a source of intuition but not technical results).
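To make the sort of pessimism I have in mind concrete, here is a minimal sketch of a conservative loss modification: score each update by the worst loss over a small credal set of plausible loss models, rather than by a single point estimate. This is purely my own illustration, not anything taken from the IB formalism, and the names (`pessimistic_loss`, `credal_set`) are made up for the example.

```python
import torch

def pessimistic_loss(outputs, credal_set):
    """Worst-case loss over a finite credal set of plausible loss models.

    Each element of `credal_set` scores the same outputs under a different
    hypothesis about what the data "really" demands; taking the max pushes
    gradient descent toward outputs acceptable under every hypothesis.
    """
    losses = torch.stack([loss_fn(outputs) for loss_fn in credal_set])
    return losses.max()

# Tiny runnable illustration: two hypotheses disagree about the right targets.
targets_a = torch.tensor([1.0, -0.5, 2.0])
targets_b = torch.tensor([0.0,  0.5, 1.0])
credal_set = [
    lambda o: torch.nn.functional.mse_loss(o, targets_a),  # hypothesis A
    lambda o: torch.nn.functional.mse_loss(o, targets_b),  # hypothesis B
]
params = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.SGD([params], lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    pessimistic_loss(params, credal_set).backward()
    optimizer.step()
print(params)  # ends up roughly midway between the two target vectors
```

The max over the credal set is the only departure from a standard training step; everything else in the loop is unchanged, which is why I expect realistic attempts to look like this kind of modification.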
None of this is really a criticism of IB; rather, I think it’s important to keep in mind when considering which aspects of IB or IB-like theories are most worth developing.
I don’t understand this comment. I usually don’t think of “building a safer LLM agent” as a viable route to aligned AI. My current best guess about how to create aligned AI is Physicalist Superimitation. We can imagine other approaches, e.g. Quantilized Debate, but I am less optimistic there. More importantly, I believe that we need to complete the theory of agents first, before we can have strong confidence about which approaches are more promising.
As to heuristic implementations of infra-Bayesianism, this is something I don’t want to speculate about in public, it seems exfohazardous.
I usually don’t think of “building a safer LLM agent” as a viable route to aligned AI
I agree that building a safer LLM agent is an incredibly fraught path that probably doesn’t work. My comment is in the context of Abram’s first approach, developing safer AI tech that companies might (apparently voluntarily) switch to, and specifically the route of scaling up IB to compete with LLM agents. Note that Abram also seems to be discussing the AI 2027 report, which, if taken seriously, requires all of this to be done in about 2 years. Conditioning on this route, I suggest that most realistic paths look like what I described, but I am pretty pessimistic that it will actually work. The reason is that I don’t see explicitly Bayesian glass-box methods competing with massive black-box models at tasks like natural language prediction any time soon. But who knows, perhaps with the “true” (IB?) theory of agency in hand much more is possible.
More importantly, I believe that we need to complete the theory of agents first, before we can have strong confidence about which approaches are more promising.
I’m not sure it’s possible to “complete” the theory of agents, and I am particularly skeptical that we can do it any time soon. However, I think we agree locally / directionally, because it also seems to me that a more rigorous theory of agency is necessary for alignment.
As to heuristic implementations of infra-Bayesianism, this is something I don’t want to speculate about in public, it seems exfohazardous.
Fair enough, but in that case, it seems impossible for this conversation to meaningfully progress here.
I think that in 2 years we’re unlikely to accomplish anything that leaves a dent in P(DOOM), with any method, but I also think it’s more likely than not that we actually have >15 years.
As to “completing” the theory of agents, I used the phrase (perhaps perversely) in the same sense that e.g. we “completed” the theory of information: the latter exists and can actually be used for its intended applications (communication systems). Or at least in the sense we “completed” the theory of computational complexity: even though a lot of key conjectures are still unproven, we do have a rigorous understanding of what computational complexity is and know how to determine it for many (even if far from all) problems of interest.
I probably should have said “create” rather than “complete”.
The pre-training phase is already finding a mesa-optimizer that does induction in context. I usually think of this as something like Solomonoff induction with a good inductive bias, but probably you would expect something more like logical induction. I expect the answer to be somewhere in between.
I don’t personally imagine current LLMs are doing approximate logical induction (or approximate Solomonoff induction) internally. I think of the base model as resembling a circuit prior updated on the data. The circuits that come out on top after the update also do some induction of their own internally, but it is harder to think about what form of inductive bias they have exactly (it would seem like a coincidence if it also happened to be well-modeled as a circuit prior, but it must be something highly computationally limited like that, as opposed to Solomonoff-like).
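To spell out what I mean by a circuit prior updated on the data, here is a toy sketch with tiny boolean formulas standing in for circuits and brute-force enumeration standing in for training. It only illustrates the prior-plus-conditioning picture, not a claim about what transformers actually compute; every name in it is made up for the example.

```python
import random

# Toy stand-in for "a circuit prior updated on the data": hypotheses are small
# boolean formulas over 3 input bits, with prior mass proportional to 2^(-size).
# Conditioning on observations keeps only the formulas consistent with the data.

GATES = ["AND", "OR", "XOR", "NOT"]

def random_formula(depth):
    """Sample a random formula as nested tuples, with bounded depth."""
    if depth == 0 or random.random() < 0.3:
        return ("VAR", random.randrange(3))
    gate = random.choice(GATES)
    if gate == "NOT":
        return (gate, random_formula(depth - 1))
    return (gate, random_formula(depth - 1), random_formula(depth - 1))

def size(f):
    return 1 if f[0] == "VAR" else 1 + sum(size(child) for child in f[1:])

def evaluate(f, x):
    if f[0] == "VAR":
        return x[f[1]]
    if f[0] == "NOT":
        return 1 - evaluate(f[1], x)
    a, b = evaluate(f[1], x), evaluate(f[2], x)
    return {"AND": a & b, "OR": a | b, "XOR": a ^ b}[f[0]]

def posterior_predict(hypotheses, data, x_new):
    """Posterior-weighted probability that the next output is 1."""
    weights = [
        2.0 ** (-size(f)) if all(evaluate(f, x) == y for x, y in data) else 0.0
        for f in hypotheses
    ]
    total = sum(weights) or 1.0
    return sum(w for f, w in zip(hypotheses, weights) if evaluate(f, x_new) == 1) / total

random.seed(0)
hypotheses = [random_formula(3) for _ in range(2000)]
data = [((0, 1, 1), 1), ((1, 1, 0), 1), ((0, 0, 1), 0)]  # made-up observations
print(posterior_predict(hypotheses, data, (1, 0, 1)))
```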
I hesitate to call this a mesa-optimizer. Although good epistemics involves agency in principle (especially time-bounded epistemics), I think we can sensibly differentiate between mesa-optimizers and mere mesa-induction. But perhaps you intended this stronger reading, in support of your argument. If so, I’m not sure why you believe this. (No, I don’t find “planning ahead” results to be convincing—I feel this can still be purely epistemic in a relevant sense.)
Perhaps it suffices for your purposes to observe that good epistemics involves agency in principle?
Anyway, cutting more directly to the point:
I think you lack imagination when you say
[...] which can realistically compete with modern LLMs would ultimately look a lot like a semi-theoretically-justified modification to the loss function or optimizer of agentic fine-tuning / RL, or possibly to its scaffolding [...]
I think there are neural architectures close to the current paradigm which don’t directly train whole chains-of-thought on a reinforcement signal to achieve agenticness. This paradigm is analogous to model-free reinforcement learning. What I would suggest is more analogous to model-based reinforcement learning, with corresponding benefits to transparency. (Super speculative, of course.)
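To pin down the analogy (and only the analogy), here is a toy contrast on a five-state chain: in the model-free loop nothing persists except a value table, while in the model-based loop there is an explicit transition/reward model that one could inspect before planning against it. This is a stylized sketch under toy assumptions, not a proposal about neural architectures.

```python
import random

# Five-state chain: action 0 moves left, action 1 moves right; reward 1 for
# being at the rightmost state after the move.
N_STATES, GAMMA = 5, 0.9

def step(state, action):
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    return next_state, float(next_state == N_STATES - 1)

# --- Model-free: tabular Q-learning; only the value table ever exists. ---
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(5000):
    s, a = random.randrange(N_STATES), random.randrange(2)
    s2, r = step(s, a)
    Q[s][a] += 0.1 * (r + GAMMA * max(Q[s2]) - Q[s][a])

# --- Model-based: hold an explicit transition/reward model, then plan. ---
model = {(s, a): step(s, a) for s in range(N_STATES) for a in range(2)}
# (In a real agent the model would be estimated from experience; the point is
# that it exists as a separate, inspectable object.)
V = [0.0] * N_STATES
for _ in range(100):  # value iteration against the model
    V = [max(r + GAMMA * V[s2] for s2, r in (model[(s, 0)], model[(s, 1)]))
         for s in range(N_STATES)]

print([max(q) for q in Q])  # model-free value estimates
print(V)                    # model-based values obtained by planning
```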
EDIT: I think that I miscommunicated a bit initially and suggest reading my response to Vanessa before this comment for necessary context.
I hesitate to call this a mesa-optimizer. Although good epistemics involves agency in principle (especially time-bounded epistemics), I think we can sensibly differentiate between mesa-optimizers and mere mesa-induction. But perhaps you intended this stronger reading, in support of your argument. If so, I’m not sure why you believe this. (No, I don’t find “planning ahead” results to be convincing—I feel this can still be purely epistemic in a relevant sense.)
I am fine with using the term mesa-induction. I think induction is a restricted type of optimization, but I suppose you associate the term mesa-optimizer with agency, and that is not my intended message.
I think there are neural architectures close to the current paradigm which don’t directly train whole chains-of-thought on a reinforcement signal to achieve agenticness. This paradigm is analogous to model-free reinforcement learning. What I would suggest is more analogous to model-based reinforcement learning, with corresponding benefits to transparency. (Super speculative, of course.)
I don’t think the chain of thought is necessary, but routing through pure sequence prediction in some fashion seems important for the current paradigm (that is what I call scaffolding). I expect that it is possible in principle to avoid this and do straight model-based RL, but forcing that approach to quickly catch up with LLMs / foundation models seems very hard and not necessarily desirable. In fact, by default this seems bad for transparency, but perhaps some IB-inspired architecture is more transparent.
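For concreteness, by scaffolding I mean roughly the outer loop sketched below: all of the agency is routed through a sequence predictor, and the surrounding code parses and executes its suggestions (it is also the natural place to add conservative checks of the sort discussed above). The `sequence_model` and `execute` stubs are placeholders, not any real API.

```python
def sequence_model(prompt: str) -> str:
    """Placeholder for a pretrained sequence predictor (e.g. an LLM API call)."""
    return "ACTION: look_up('weather in Berlin')"

def execute(action: str) -> str:
    """Placeholder tool execution; a real scaffold would dispatch to real tools."""
    return "OBSERVATION: 12C, light rain"

def scaffold(task: str, max_steps: int = 3) -> str:
    transcript = f"TASK: {task}\n"
    for _ in range(max_steps):
        suggestion = sequence_model(transcript)  # pure sequence prediction
        transcript += suggestion + "\n"
        if suggestion.startswith("ANSWER:"):
            break
        # A conservative scaffold could veto or down-weight suggestions here,
        # before anything is actually executed.
        transcript += execute(suggestion) + "\n"
    return transcript

print(scaffold("What is the weather in Berlin?"))
```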