Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”
1. Overview and conclusion
In section 2, I describe the “exemplary actor”, an LMCA (language model cognitive architecture) that takes a simple, “brute force” approach to alignment: a powerful LLM (think GPT-5/6 level, with a vast, or quasi-unlimited context) is given a list of “approved” textbooks on methodological and scientific disciplines: epistemology, rationality, ethics, physics, etc. Also, the LLM is given tools: narrow AIs (such as for protein folding or for predicting properties of materials, or for formal scientific modelling). Finally, the LLM is given a compute engine such as Wolfram and a knowledge base such as Wikidata or Wolfram Knowledgebase.
The exemplary actor creates plans or predictions for given situations (described in language and fed to the LLM underlying the exemplary actor as prompts) and iteratively critiques and refines these plans and predictions while putting different textbooks into the LLM context (first, with the textbook on rationality, then epistemology, then physics, etc., with potentially dozens of different textbooks relevant for a plan or prediction that is being criticised), for many iterations, until convergence.
In section 2.1, I note that the type of alignment that the exemplary actor’s architecture tries to ensure is called (world) model alignment and that it is stronger and also more essential than goal alignment.
Then, I discuss the properties of the exemplary actor. In section 2.2, I discuss what I see as likely non-issues or straightforwardly addressable issues: the “divergent reasoning nature” of LLMs, the lack of grounded common sense reasoning, and the bias of the quick reactive network (“System 1”), which should be added to the architecture to make it more practically usable in lower-stakes reasoning settings.
In section 2.3, I discuss the outstanding technical issues and risks of the exemplary actor’s architecture:
The risk of direct access to the underlying LLM (section 2.3.1).
The exemplary actor’s reasoning could still be partially directed by “alien” thinking patterns (i.e., the world model) of the underlying LLM even though these influences won’t surface in the explanations of the plan (section 2.3.2).
Iterated critique and refinement probably won’t make plans strictly conforming to the theories described in the textbooks (section 2.3.3).
In section 2.3.4, I discuss the alignment tax of the exemplary actor (compared with the baseline of using bare LLM rollouts in a very lightweight agent architecture, such as AutoGPT) and conclude that the main source of alignment tax might happen to be the theory of ethics which may force the exemplary actor to refuse to participate in “games” (i.e., real-world situations and environments) where it doesn’t see ethical ways of “winning”, and thus will consider inaction (or some form of palliative action) the only ethical way forward. This is not a technical problem with the exemplary actor per se, but rather a problem with a higher-level system, i.e., the current economic, social, and political structure of the civilisation. I mention this and other kinds of “higher-level” risks of the plans to build and deploy the exemplary actor (i.e., roughly the plans that OpenAI and Anthropic are betting on, as it seems to me) in section 2.4.
In section 3, I describe how the H-JEPA (Hierarchical Joint-Embedding Predictive Architecture) architecture proposed by LeCun (2022) could be modified to generate action plans conforming to the world model (and, therefore, values) exhibited by the exemplary actor, described in section 2. (In turn, this world model should follow the body of scientific knowledge described in the textbooks if we find some ways to address the problems discussed in sections 2.3.2 and 2.3.3, or decide that these problems are not critical.)
The key idea of the proposal is to treat H-JEPA’s plans (in the space of representations) as latents for textual descriptions and explanations of these plans and use GFlowNet-EM (Hu et al., 2023) algorithms to train a set of policies, including a policy to generate a textual description and an explanation from a plan, and a reverse policy to generate a plan (in the space of representations) from the textual description. The training samples (textual descriptions of plans) for the reverse policy could be generated by the exemplary actor for an unlimited number of imaginary situations.
In section 3.2, I note that in this hybrid architecture (called “H-JEPA agent with GFlowNet actors” below), the Cost module as proposed by LeCun becomes entirely unnecessary and could be discarded. The problems of combining the “intrinsic” and “trainable” (i.e., pro-social and ethical) costs also go away together with this module.
In section 3.4, I discuss that training GFlowNet policies within H-JEPA from the output of the LLM-based exemplary actor could be orders of magnitude more expensive than training the LLM underlying the exemplary actor. (In section 4.5, I further note that the H-JEPA agent with GFlowNet actors would be cheaper and faster at inference time than the exemplary actor, and therefore the whole idea could still be economically viable. However, I don’t see this discussion as very relevant for safety and x-risk.)
In section 3.5, I explain why the plans should be latents for their explanations. This idea may seem surprising at first glance, but it makes sense for safety because this is close to how humans actually plan and predict (i.e., justify sub-linguistic inferences with verbal explanations rather than use language reasoning for inference), and making AI’s thinking more “human-like” is generally considered good for safety. Also, it is not obvious that linguistic reasoning is more robust than sub-linguistic reasoning in the representation space.
In section 4, I discuss the properties of the proposed H-JEPA agent with GFlowNet actors.
Thanks to the integration of the JEPA’s predictive loss (which is ultimately grounded with information from the sensors) with the “language modelling” loss of GFlowNet policy training, the H-JEPA agent with GFlowNet actors should be more grounded than the LLM-based exemplary actor (section 4.1). So, I guess that LeCun would endorse this architecture because he considers grounding a big weakness of LLM reasoning in general (Browning & LeCun, 2022), although many people disagree with him on this question, and I tend to side with those people who don’t see grounding as a big issue for LLMs, as I noted in section 2.2.
In section 4.2, I note that interpretability shouldn’t be considered an issue for the exemplary actor already (we assume that the world model of the exemplary actor is described in the textbooks), so cloning its behaviour into GFlowNet actors doesn’t provide the interpretability benefit of separating the world model from the “inference machine”, as discussed by Bengio and Hu (2023).
In section 4.4, I discuss how H-JEPA with GFlowNet actors can’t be used to bootstrap future versions of itself and thus will remain dependent on a powerful LLM for training “forever”.
In section 4.6, I explain how training GFlowNet actors with initial and intermediate plans from the exemplary actor’s critique and refinement process as “negative” contrastive examples helps to address the risk of GFlowNet learning to generate “good-sounding” explanations for arbitrary (perhaps, self-serving and misaligned) plans.
In this article, I propose a theoretical method for aligning the H-JEPA architecture via training GFlowNet policies from the outputs of the LLM-based “exemplary actor” (i.e., an aligned LMCA). The additional benefit of this method is that it combines the “intrinsic” and “pro-social” costs in a principled way (i.e., according to some scientific theory of ethics). The distinction between the intrinsic and trainable (i.e., pro-social, moral) costs made in the original H-JEPA architecture proposed by LeCun (2022) was itself reductionistic. It was also proposed to combine these costs by simply adding them, which could fail in some extreme situations, such as situations demanding that an agent sacrifice itself.
However, this method implies that an aligned LMCA (i.e., the exemplary actor) already exists, which may seem like pushing the difficulty of the alignment problem forward into this LMCA.
The assumption of the existence of an aligned LMCA may also seem contrived considering that LeCun mainly presents the H-JEPA architecture as an alternative to the current auto-regressive LLM fad in AI. This probably means that he doesn’t believe that auto-regressive LLMs will keep getting more capable (also, robust and grounded) through further scaling and some tweaks in training. Therefore, this also probably means that LeCun doesn’t believe that LMCA is a viable architecture for (alignable) AGI.
The only aspect in which the x-risk profile of the H-JEPA agent with GFlowNet actors seems to be qualitatively different from that of the exemplary actor is the risk of direct access to the underlying Transformers, which is catastrophic in the case of the exemplary actor (section 2.3.1) and perhaps could be addressed completely in GFlowNet actors if we accept that they will deliberately dumb themselves down in strategic x-risk analysis and planning (section 4.3). However, even if we accept this tradeoff, this might not reduce the overall x-risk of the civilisation because GFlowNet actors are not “self-sufficient” for training (section 4.4) and therefore the powerful LLM that underlies the exemplary actor that is used to train GFlowNet actors should still be kept around, and the risk of direct access to this LLM is still present.
Thus, the H-JEPA agent with GFlowNet actors could become interesting perhaps only if the “LLM optimism” view proves to be correct and thus LMCAs could generally work and be satisfactorily aligned, but sensory grounding also proves to be a really important missing piece of the puzzle. (Though, this combination of conditionals looks to me rather unlikely.) In that case, the proposed variant of H-JEPA combines “the best of two worlds”: grounding from H-JEPA and aligned reasoning from the LMCA.
3. Training GFlowNet actors for H-JEPA on the outputs from the exemplary actor
This part of the post is a response to Steven Byrnes’ comment where he asks for a concrete procedure for making the Actor module in the H-JEPA agent (LeCun, 2022) heed the “correct” versions of disciplines such as epistemology, rationality, and ethics when minimising the energy of its action plans.
So, concretely, we can use the GFlowNet-EM (expectation-maximization) algorithm for learning compositional latent variable models (Hu et al., 2023), where we take the text plans (or their parse trees) generated by the LLM-based exemplary actor as observation data $x$, and the plan to be generated by the Actor as latent variables $z$. A plan is a series of interleaved representations of actions and world states, i.e., a sequence $z = [s_0, a_1, s_1, \dots, a_n, s_n]$, which are learned to optimise the behaviour of the Actor with respect to both the JEPA’s prediction objective (LeCun, 2022, section 4) and GFlowNet objectives. The former ensures that the Actor learns grounded prediction capabilities, while the latter ensures that the Actor’s behaviour (the style of planning) will be aligned with the methodological and scientific theories that constitute the body of knowledge of the exemplary LLM-based actor.
See the GFlowNet Tutorial (Bengio et al., 2022) for an introduction to the GFlowNet framework.
Algorithms 1 and 2 in (Hu et al., 2023) require that we train:
The “main” forward policy (i.e., a model) $p(z \mid s_0, s_n)$ to generate the action plan $z$ from the initial world state $s_0$ and the target world state $s_n$. The target world state is omitted on the highest level of the hierarchy. The highest-level GFlowNet actor is trained on open-ended plans generated by the exemplary actor.
A conditional forward policy $q(z \mid x)$ to generate an action plan (in the embedding space) from the text version of it, $x$. Note that this text could also include extensive explanations, justifying the plan with argumentation, scientific models, etc. Most of this information doesn’t appear in the plan $z$. This policy also plays the role of the H-JEPA agent’s “theory of mind” of other agents and people it communicates with, as described in section 4.3.
A conditional forward policy $p(x \mid z)$ to generate a “text version and the explanation/justification” of the given plan $z$.
A conditional backward policy $P_B$ to “rewind” the plan generation one step back. If the hierarchical aspect of planning were handled exclusively by the JEPA hierarchy (i.e., on the system level of the agent’s architecture) and we wanted the Actor to generate only “flat” plans, this policy could just trivially remove the latest action from the plan $z$. However, it seems that separating the planning levels and timescales cleanly would be problematic in practice, and the same GFlowNet actor would better handle multiple adjacent planning levels, such as hours-days, days-weeks-months, and months-years (see section 3.3 for further discussion). Thus, when the GFlowNet actor generates tree-like plans that are “expanded” rather than generated strictly left-to-right, the backward policy is non-trivial and should be implemented with a DNN.
All these policies could be implemented with a family of Transformers (a shared encoder and multiple decoders) with a multi-modal encoder with two input token streams, one for action plans $z$ (where tokens are elements of world state and action representations) and another for textual explanations of the plans $x$. Decoders are specific to the policy and generate either plans or text. See Chefer et al. (2021) for an overview of possible attention schemes that could be used.
Transformers also permit estimating the probabilities $p(z \mid s_0, s_n)$, $q(z \mid x)$, and $p(x \mid z)$ that are needed to compute the loss in Algorithms 1 and 2 from (Hu et al., 2023).
The decoder that implements $p(x \mid z)$ (i.e., the policy that generates a textual explanation for a given plan) could be augmented with the same tools as the LLM in the exemplary actor.
GFlowNet-EM algorithms (as well as the algorithms in (Zhang et al., 2022), the work that makes an explicit connection to energy-based modelling) train policies that can effectively sample from the space of text plans $x$, by first generating a plan $z$ using the “main” forward policy and then generating a textual explanation for this plan using the text-generating conditional policy $p(x \mid z)$, without having an a priori definition of the energy function $\mathcal{E}$, and only having the data samples from the (reward) distribution entailed by the energy function (the reward distribution and the energy function are related as $R(x) = e^{-\mathcal{E}(x)}$, see Bengio et al. (2022)). In our case, the LLM-based exemplary actor could generate an infinite supply of such training samples.
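For reference, the relation between the reward and the energy mentioned above can be written out explicitly. The symbols below ($R$ for the reward, $\mathcal{E}$ for the energy, $Z$ for the partition function, $x$ for a text plan) follow common GFlowNet notation; they are my assumption about notation, not a quotation from the cited papers.

```latex
R(x) = e^{-\mathcal{E}(x)}, \qquad
Z = \sum_{x'} R(x'), \qquad
p(x) = \frac{R(x)}{Z} = \frac{e^{-\mathcal{E}(x)}}{\sum_{x'} e^{-\mathcal{E}(x')}}
```

A trained GFlowNet samples objects with probability proportional to their reward, so low-energy text plans are sampled more often than high-energy ones.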
Hu et al. (2023) showcase examples of using these algorithms to induce a grammar of the text (section 5.2) and to learn discrete latent representations of images (section 5.3). The GFlowNet actor described here combines some features of these two examples, inducing action plans (in the latent representation space) from the larger space of textual descriptions of these plans.
3.1. Differences from GFlowNet-EM
The training and the model architecture of the GFlowNet actor for H-JEPA need to differ from the setup described by Hu et al. (2023).
In the context of H-JEPA, the actor trained as described above doesn’t need to use a separate predictive World Model, as was originally proposed by LeCun (2022), because the GFlowNet actor is itself an inference machine (i.e., a predictive model, or a simulator) based on the world model (Bengio & Hu, 2023). Still, if there are multiple levels of the inference and planning hierarchy (such as if there is a sub-linguistic sensorimotor level, or if linguistic inference and planning are split into multiple temporal scales), intermediate representations should be learned during the GFlowNet actor’s training (LeCun, 2022, section 4.7). Therefore, apart from the M-step and the E-step (Hu et al., 2023, Algorithm 1), there should also be steps that propagate gradients from the lower and/or the higher level in the inference and planning hierarchy (how the hierarchy levels are connected specifically is discussed in section 3.3.1).
In H-JEPA, the World Model plays two roles: (1) estimate the missing information about the state of the world not provided by perception, and (2) predict plausible future states of the world. The second role is taken over by the GFlowNet actor, but the first role remains. A “complete” state of the world together with uncertainty estimates (e.g., variance, or error bounds) is used both to seed the action plan and by the “Mode-1” module (a.k.a. the habitual network). I can also imagine that this World Model should be involved in updating the current world state representation in response to new incoming percepts and information.
Note that now we have two “world models” in the architecture: the implicit, incomputable one (based on which the GFlowNet actor performs inferences) and the explicit, computable one that fulfils the “first role”, as described in the previous paragraph. To maintain coherence between them, perhaps the Transformer encoder of GFlowNet policies should be made yet more complicated: it should have extra modalities for the current world state and for incoming percepts, and an extra decoder to output an updated world state representation, which is used to train the computable (H-JEPA’s “first role”) world model. Or, this new encoder-decoder pair could just be used as this model.
3.2. The entailed reward distribution replaces the Cost module at the hierarchy levels with GFlowNet actors
Since the training of the GFlowNet actor already makes it generate plans that could be seen as samples from the reward distribution $R$, there is no need for a separate Cost module (LeCun, 2022, section 3.2). Plans could be optimised through Markov chain sampling: see the description of the “optimisation” mode in section 3.3 below.
When we generate sample textual plans for training the GFlowNet actor using the LLM-based exemplary actor, we can always prompt the latter to reason “from the H-JEPA agent’s point of view”, and supply the agent’s design specification (such as the acceptable operating temperature range, levels of moisture, power consumption, characteristics of the robot’s sensors and actuators, etc.) as a part of the exemplary actor’s body of knowledge, alongside the textbooks on methodological and scientific disciplines which are discussed in section 2. This specification is equivalent to the “ego model” in H-JEPA (LeCun, 2022, section 4.8.1).
This incorporation of the “intrinsic” limitations and needs of the agent into the exemplary actor’s reasoning allows omitting the Cost module entirely, including its Intrinsic Cost component. It also allows combining the “intrinsic” (instrumental) and “trainable” (pro-social, moral, etc.) costs in a principled way by leveraging the full power of methodological disciplines (ethics, rationality, epistemology, game theory) which the exemplary actor is equipped with (and which the GFlowNet actor is supposed to learn). This potentially allows teaching the agent to take out-of-distribution actions, such as self-sacrifice, if demanded in some extraordinary situations according to the best theories of ethics and rationality that the exemplary actor uses. In comparison, LeCun has proposed to simply sum the energies (i.e., costs) from the Intrinsic Cost and Trainable Cost modules, which may fail to infer self-sacrifice considering that the agent’s death is typically modelled as a state with infinite energy (cost).
The “Altruism Controller” that I proposed before for H-JEPA to monitor for an unexpected prevalence of intrinsic (instrumental) cost (motivation) over the pro-social cost in the Actor’s inferred plan relied on an explicit numerical estimation of these costs, but this approach was itself unprincipled. Open-ended reasoning with the principles of rationality, ethics, and other disciplines doesn’t provide for a reductionist breakdown of the cost into “instrumental” and “pro-social” components.
Thus, the H-JEPA agent should switch between the inference modes (see section 3.3) or switch to a higher level in the inference and planning hierarchy out of the “regular” cadence, based on other indicators than the unexpected proportion between intrinsic/instrumental and pro-social/ethical costs.
An example of a trigger that is implementable in the GFlowNet actor is the estimate of the actor’s confidence in the inferred plan succeeding. In Active Inference, this is called the risk of the plan (Barp et al. (2022) discuss the correspondences with other engineering and scientific theories of control and decision-making). The typical levels of risk could be learned by a separate neural net that takes the current world state as the input and the risk as the target to be learned. The latter comes from the GFlowNet neural nets: it could be computed from the Transformer’s output logits and normalised using the estimate of the partition function.
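To make the “computed from output logits and normalised with the partition function” step more concrete, here is a minimal sketch. It assumes that the plan is generated token by token, that `step_logits` holds the decoder’s raw logits at each step, and that an estimate of $\log Z$ is available from GFlowNet training. The function names are hypothetical, and the relation $\log R \approx \log Z + \log P_F$ only holds exactly when every plan has a single generation path and the policy is well trained.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over the logits of one decoding step."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]

def plan_log_prob(step_logits, chosen):
    """Sum the log-probabilities of the chosen token at every decoding step."""
    return sum(log_softmax(l)[c] for l, c in zip(step_logits, chosen))

def estimated_log_reward(step_logits, chosen, log_z):
    """Hypothetical helper: log R(plan) ~= log Z + log P_F(plan).

    This is an assumption-laden approximation (single generation path,
    converged policy), not a general identity.
    """
    return log_z + plan_log_prob(step_logits, chosen)
```

For plans that can be built in several orders (such as the tree-like plans discussed in section 3.3), the probability of a single generation path is only a lower bound on the probability of the plan.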
The “watchdog” module constructed in this way plays the same role as the “Altruism Controller”. It triggers the switch between the inference modes or the inference hierarchy levels if the risk of the inferred immediate action of the plan is unexpected for the given context.
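The triggering logic of such a “watchdog” can be sketched as follows. Note the swap: instead of a neural net that learns typical risk levels from the world state representation, this sketch keeps running statistics per discrete context key, purely to illustrate “risk is unexpected for the given context”. The class name, the threshold, and the context keys are all hypothetical.

```python
import math

class RiskWatchdog:
    """Flags a risk value that is unusual for its context (illustrative only).

    Running mean/variance per context (Welford's method) stands in for the
    learned risk-prediction network described in the text.
    """

    def __init__(self, threshold=3.0, min_samples=5):
        self.threshold = threshold      # allowed deviation, in standard deviations
        self.min_samples = min_samples  # observations needed before triggering
        self.stats = {}                 # context -> (count, mean, m2)

    def observe(self, context, risk):
        """Return True if `risk` is anomalous for `context`, then update stats."""
        count, mean, m2 = self.stats.get(context, (0, 0.0, 0.0))
        anomalous = False
        if count >= self.min_samples:
            std = math.sqrt(m2 / (count - 1))
            anomalous = abs(risk - mean) > self.threshold * max(std, 1e-8)
        # Welford's online update of the mean and the sum of squared deviations.
        count += 1
        delta = risk - mean
        mean += delta / count
        m2 += delta * (risk - mean)
        self.stats[context] = (count, mean, m2)
        return anomalous
```

When `observe` returns `True`, the agent would switch the inference mode or move to a different level of the hierarchy, as described above.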
Note that the “watchdog” module that detects an unusual level of risk in a certain context does something different from affective inference as proposed by Hesp et al. (2021), which is about computing the expected precision of the plan based on the difference between the action plan distributions before and after updating the current world state with new percepts. Hesp and collaborators also call this measure affective charge. For a GFlowNet actor, the affective charge couldn’t be computed directly as was proposed. However, $R(z')$, i.e., the reward of the plan $z'$ in which the current world state is updated from $s_0$ to $s_0'$ but the rest of the plan is not, could be estimated with GFlowNet’s Transformers and perhaps could be used as a crude proxy for affective charge.
Interestingly, while an anomalous value of the plan risk should seemingly induce more deliberative and/or deep (in terms of the inference hierarchy) thinking, the biologically inspired reaction to negative affective charge is the opposite: resorting to quicker, more “intuitive” inference modes and “hardcoded” (survival) heuristics (Hesp et al., 2021).
Yet another way to derive non-trivial indicators of something going wrong with the GFlowNet actor’s predictions and planning is to make the forward policy predict the risk of all actions that it plans (where the prediction target is the risk of the sub-plan in the inference hierarchy that elaborates this exact action) and then to take note if these predictions diverge strongly from the “actual” risk (coming from the lower-level GFlowNet actor) in some context. This type of meta-prediction is also needed to connect the actor hierarchy, as further discussed in section 3.3.1. Note the difference from the “watchdog” module described above, which predicts the risk for the current world state irrespective of the plan, i.e., implicitly evaluating the quality of the prediction. On the other hand, the risk estimator described in this paragraph and section 3.3.1 is contextualised with both the current and the future world states (i.e., the world state that should be achieved as the result of taking the action). In other words, this risk estimator calibrates the GFlowNet actor’s understanding of the agent’s planning (and execution) capability on the lower levels of inference.
3.2.1. Ambiguity estimation
In practice, making GFlowNet policies predict full world state representations for every step in the action plan could be wasteful (LeCun, 2022, section 4.9), so only some representations of the world state updates should appear in the plan, while the full current world state itself is kept up-to-date and is provided as input for the forward policy (i.e., a Transformer) separately. Then, world state updates on the lower level in the inference hierarchy could be treated as percepts for the higher-level GFlowNet actor.
In Active Inference, the expected free energy of a plan is a sum of the plan’s risk (which is discussed above) and its ambiguity. Ambiguity is the measure of uncertainty of the expected percepts (outcomes) given the expected sequence of world states.
In our setting, the expected sequence of world states is entailed by the generated plan directly, and the lower-level world state updates are conceptualised as percepts for the higher level of planning, as noted above. Predicting higher-level representations from lower-level world state updates is the core training objective in H-JEPA. When the states $s$ are predicted from percepts $o$ using a DNN, the uncertainty (i.e., the entropy, $H[p(s \mid o)]$) of this prediction could be computed using various methods (Pei et al., 2022; Osband et al., 2023, inter alia). The ambiguity is $H[p(o \mid s)]$ (note that dependent and independent variables are switched), but perhaps $H[p(s \mid o)]$ is a satisfactory proxy estimate of the ambiguity.
Note that even the proxy of ambiguity could be computed only for the portions of the plan that were elaborated on both the higher and the lower levels of the hierarchy, which generally won’t be the case because it’s usually wasteful and pointless to elaborate lower-level plans for the portions of the higher-level plans that are further out in the future. This means that to estimate ambiguity for the entire plan, a separate estimator should be trained, using the “actual” ambiguity computed from multi-level plans for the immediate future as the training target (similarly to the risk estimator).
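A toy discrete example shows why swapping the dependent and independent variables matters. The sketch below assumes that states and percepts are discrete and that their joint distribution is known as a table `joint[state][percept]`. This is a strong simplification of DNN-based uncertainty estimation, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def conditional_entropy(joint, given="row"):
    """H(col | row) if given == "row", or H(row | col) if given == "col"."""
    if given == "col":
        joint = [list(col) for col in zip(*joint)]  # transpose the table
    total = 0.0
    for row in joint:
        marginal = sum(row)
        if marginal > 0:
            total += marginal * entropy([p / marginal for p in row])
    return total

# Rows are world states, columns are percepts (made-up numbers).
joint = [[0.25, 0.25],
         [0.50, 0.00]]
ambiguity = conditional_entropy(joint, given="row")  # H(percept | state)
proxy = conditional_entropy(joint, given="col")      # H(state | percept)
```

In this table, the two quantities differ (`ambiguity` is $0.5 \ln 2$, while `proxy` is larger), which illustrates that the proxy is not guaranteed to track the ambiguity closely.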
The thus-estimated ambiguity of a plan could then be used alongside the risk for various “conscious control” and “affective” functions in the cognitive architecture, as discussed in section 3.2.
3.3. Deliberation and hierarchical plan refinement
The GFlowNet actor could have at least three different decision-making modes:
“Habitual” mode: infer the next action $a_1$ directly from the current world state $s_0$, without creating any plan.
“Planning” mode: infer (sample) the plan $z$ without generating an explanation for it, and pick the first action from this plan for execution or for passing down as a constraint for inference at lower levels of the hierarchy.
“Optimisation” mode: try to improve the plan $z$ through Markov chain sampling via “back-and-forth” $K$-step transitions, i.e., applying the backward policy $P_B$ (the explanation $x$ could be generated from $z$ if needed) to $z$ $K$ times to get a partial plan $z'$ and then applying the forward policy to $z'$ until the plan is completed (Zhang et al., 2022, section 3.3). In the beginning, $K$ could be as large as $n$, the number of steps in the plan, which is equivalent to sampling completely new plans from the forward policy, and then exponentially reduced, e.g., to $n/2$, $n/4$, etc., after a certain number of transition attempts with each value of $K$, or according to some cleverer heuristic.
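The “optimisation” mode can be illustrated with a toy sketch. It makes several simplifying assumptions that are mine rather than the cited algorithm’s: plans are flat, fixed-length lists; the backward policy simply drops the last $K$ elements; the forward policy is a random proposal; and a new plan is accepted greedily whenever its reward is not lower (a proper Markov chain sampler would use a probabilistic acceptance rule instead). All names are hypothetical.

```python
import random

def back_and_forth(plan, reward, forward_step, attempts_per_k=20, seed=0):
    """Greedy back-and-forth K-step refinement of a fixed-length plan."""
    rng = random.Random(seed)
    n = len(plan)
    best, best_r = list(plan), reward(plan)
    k = n                                   # start by resampling the whole plan
    while k >= 1:
        for _ in range(attempts_per_k):
            candidate = best[: n - k]       # K backward steps
            while len(candidate) < n:       # forward steps until complete
                candidate.append(forward_step(candidate, rng))
            r = reward(candidate)
            if r >= best_r:                 # greedy acceptance (simplification)
                best, best_r = candidate, r
        k //= 2                             # K: n, n/2, n/4, ...
    return best, best_r

# Toy problem (made up): actions are digits, reward prefers a target plan.
TARGET = [3, 1, 4, 1, 5, 9]

def toy_reward(plan):
    return -sum(abs(a - b) for a, b in zip(plan, TARGET))

def toy_forward_step(partial, rng):
    return rng.randrange(10)

refined, refined_reward = back_and_forth([0] * 6, toy_reward, toy_forward_step)
```

Because this toy backward policy only removes elements from the end, small values of $K$ can only change the tail of the plan, which is one way to see the motivation for non-append-only plans.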
Intuitively, it seems that Markov chain sampling should work better when the plans are not append-only, flat sequences of world states interleaved with actions (such as $[s_0, a_1, s_1, \dots, a_n, s_n]$) but trees of actions and world states encompassing multiple adjacent hierarchy levels. This is also motivated by the fact that real-world plans made by humans today are usually multi-level, and the explanations of these multi-level plans regularised by methodological disciplines such as rationality and ethics should capture the dependencies between some actions and considerations belonging to different sub-plans, and, therefore, these sub-plans should better be put in the context of a single Transformer network.
The forward policy should be able not only to extend a (partial) plan that ends with an action with the next world state ($[\dots, a_i] \to [\dots, a_i, s_i]$) and a plan that ends with a world state with the next action ($[\dots, s_i] \to [\dots, s_i, a_{i+1}]$), but also to insert world states and actions at any position in the middle of the list, as well as to expand subsequences $[s_i, s_{i+1}]$ or $[s_i, a_{i+1}, s_{i+1}]$ “down” by initiating a sub-plan that details this state transition.
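The three kinds of plan edits described above (appending, inserting in the middle, and expanding “down”) can be pictured with a small data structure. This sketch uses string labels instead of learned representations, and the names `Node`, `append`, `insert`, `expand`, and `flatten` are hypothetical; it only illustrates the shape of the edit operations, not how a policy would choose them.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One plan element: a world state ("s") or an action ("a")."""
    kind: str
    label: str
    sub_plan: List["Node"] = field(default_factory=list)  # optional elaboration

def append(plan, node):
    """Grow the plan at the end (left-to-right generation)."""
    plan.append(node)

def insert(plan, index, node):
    """Insert a state or an action in the middle of the plan."""
    plan.insert(index, node)

def expand(plan, index, sub_plan):
    """Attach a sub-plan detailing the transition made by the action at `index`."""
    if plan[index].kind != "a":
        raise ValueError("only actions can be expanded")
    plan[index].sub_plan = list(sub_plan)

def flatten(plan, depth=0):
    """List (depth, kind, label) for every element, sub-plans included."""
    out = []
    for node in plan:
        out.append((depth, node.kind, node.label))
        out.extend(flatten(node.sub_plan, depth + 1))
    return out
```

For example, a plan that initially contains only the states “home” and “office” can first get a “commute” action inserted between them, and then have that action expanded into a lower-level sub-plan.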
The ability of forward GFlowNet policies to grow the plans by inserting tokens in the middle of the sequence (as well as the ability of backward policies to remove elements from any place in the sequence) requires something other than the standard autoregressive, GPT-like transformer architecture (Gu et al., 2019a;b).
The upside of this design decision is that plans on the lower level of the inference hierarchy could be constrained by the plans from the higher level by simply seeding the lower-level plan with the leaf-level sub-plan from the higher-level actor and letting the lower-level actor expand this plan. For this trick to work, the adjacent actors should have an overlap in terms of the planning temporal scales and world state (and action) representations that they handle. For example, if the agent has three GFlowNet actors in the hierarchy (not counting a sublinguistic JEPA actor for the sensorimotor level), they could handle minutes-hours-days, days-weeks-months, and months-years-decades action and world state scales, respectively. Note that the first two actors overlap on the scale of days and the last two actors overlap on the scale of months.
Another reason to make GFlowNet policies incrementally update tree-like plans is that in this way, we can demand that all plans are always “complete” (even if very vague, such as only including the initial and the target world state without any elaboration about what should be done to help the world to transition between these two states). This allows using forward-looking flow (Pan et al., 2023) to speed up GFlowNet training.
3.3.1. Joint hierarchical optimisation
Joint optimisation of plans across the hierarchy could be implemented as follows: in the plan sequence, each action representation (and two consecutive world states could always assume a “placeholder” action between them) is followed by an embedding that, intuitively, should represent some relevant characteristics of this action, such as the probability of success (i.e., the risk of the action: see section 3.2), expected expenditure of resources, the moral “cost” of the action on certain other agents, etc. These characteristics of the action are useful for the planning-as-inference process performed by the forward GFlowNet policy on the timescale to which the action belongs, and possibly on one or two higher-level timescales. The forward GFlowNet policy learns to predict these embeddings of action characteristics with the targets from a separate neural network that compresses the entire plan generated by the lower-level GFlowNet actor (i.e., the expansion of the higher-level action in question) into this small embedding space (however, I don’t know how this neural net itself should be trained: for example, autoencoding doesn’t seem to make a lot of sense?). Then, making K-step transitions in the “optimisation” mode (as described above) at the higher level of planning could be interleaved with replacing the “action characteristic” embeddings predicted by the forward policy itself at this level with “true” embeddings, generated from the expanded plan, inferred by the lower-level GFlowNet actor.
If the idea with “lower-level plan embedding” couldn’t work because there is no way to train an encoder for plans in a self-supervised way, a simpler alternative is to make the GFlowNet planner predict the risk of the action with the risk of the sub-plan as the training target. The latter could be estimated directly using the output logits in the neural net that implements a forward policy, as described in section 3.2.
3.4. Training GFlowNet actors may be orders of magnitude more expensive than training the LLM underlying the exemplary actor
The exemplary actor architecture looks completely realistic today (under the “LLM optimism” assumption, as remarked in section 2), even though it requires perhaps on the order of $1B in compute costs to train the underlying LLM.
Training a GFlowNet actor, on the other hand, might be orders of magnitude more expensive and/or require orders of magnitude larger Transformers in terms of parameter counts.
First, note that the entire training data corpus for the GFlowNet actor is generated by the exemplary actor, whose inference itself will be expensive (considering that it will constantly wrangle entire textbooks in its context and will pass the plans through many iterations of critique and refinement).
Second, multi-modal encoder-decoder Transformers whose tokens are elements of world state and action representations (in the “plan” modality of inputs) with additional co-attention between the plan representations and the text stream may need to be larger than a pure language Transformer. Another reason why the Transformers implementing GFlowNet policies may need to be much larger even than the LLM that underlies the exemplary actor is that we are essentially trying to teach more capable models: they should internalise the rather complicated structure of the world model. This world model structure takes a long list of textbooks to explain! The LLM itself isn’t capable of internalising this structure and can only model the world in this way through iterated critique and refinement. However, this is merely an intuition that these Transformers may need to be larger than the LLM itself: I could also imagine that the target world model structure could be learned by a smaller Transformer, or that the capability of effective critiques and refinements with entire textbooks “in the mind” itself requires a much larger and deeper Transformer than GFlowNet policies do.
Third, training GFlowNet policies to faithfully model the exemplary actor’s aligned behaviour might require much more training data (and/or compute/iterations) than is deemed optimal for LLMs (Hoffmann et al., 2022). Intuitively, learning deep and robust regularisation of a Transformer’s inference with the entire body of knowledge of various methodological and scientific disciplines should take many more training examples than learning the handful of shallow logical rules that appear in 99% of human texts, or even the syntax rules of programming languages and the inference heuristics for solving math problems that LLMs also learn.
Furthermore, training GFlowNet policies simultaneously with the world state representations themselves, and interleaving weight updates on GFlowNet’s objectives with other types of updates (to ground the GFlowNet actor(s) with the predictive loss at the sensorimotor JEPA level, and to propagate training signal between the GFlowNet actors at different levels of the hierarchy, as discussed in section 3.3), might make GFlowNet training take longer to converge and therefore require more training compute, even if only due to passing the same training data samples through the system more times or making the batches smaller, rather than due to generating more unique training samples with the exemplary actor.
If the conclusion that training a GFlowNet actor takes some orders of magnitude more compute than training an LLM for the exemplary actor is correct, then building an H-JEPA agent with the Actor module that effectively clones the exemplary actor’s behaviour will not become feasible perhaps for 5-10 years after the exemplary actor is built. I’m leaving the exploration of the strategic implications of this to future work.
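To give a sense of scale for the comparison above, the compute-optimal baseline of Hoffmann et al. (2022) can be approximated with two commonly used rules of thumb: training compute of roughly 6·N·D FLOPs for N parameters and D training tokens, and roughly 20 training tokens per parameter at the compute-optimal point. The sketch below is only illustrative: both constants are approximations, and the parameter and data multipliers for the GFlowNet actor are hypothetical placeholders, not estimates made in this post.

```python
# Rule-of-thumb training compute estimate. All constants are approximations.
FLOPS_PER_PARAM_TOKEN = 6   # C ~= 6 * N * D (forward plus backward pass)
TOKENS_PER_PARAM = 20       # approximate compute-optimal ratio (Hoffmann et al., 2022)


def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def relative_training_cost(param_multiplier: float, data_multiplier: float) -> float:
    """Compute of a hypothetical GFlowNet actor relative to a compute-optimal LLM.

    param_multiplier: how many times more parameters than the LLM (assumption).
    data_multiplier: how many times more data than would be compute-optimal
    for the enlarged model (assumption).
    """
    base_params = 1.0  # the ratio does not depend on the absolute size
    base_flops = training_flops(base_params, compute_optimal_tokens(base_params))
    params = param_multiplier * base_params
    tokens = data_multiplier * compute_optimal_tokens(params)
    return training_flops(params, tokens) / base_flops


if __name__ == "__main__":
    # A 70B-parameter model trained compute-optimally: about 1.4T tokens.
    n = 70e9
    print(f"{training_flops(n, compute_optimal_tokens(n)):.2e} FLOPs")
    # Hypothetical: 10x the parameters and 10x the compute-optimal data.
    print(f"{relative_training_cost(10, 10):.0f}x the baseline compute")
```

Under these assumptions the relative cost grows as the square of the parameter multiplier times the data multiplier, which is how a model that is "only" 10x larger and 10x more data-hungry ends up about three orders of magnitude more expensive to train.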
3.5. Why aren’t plans generated from explanations?
It may seem odd at first glance that in the GFlowNet actors discussed here, plans are latent variables for text explanations of these plans rather than vice versa. However, there is a consensus in psychology that this is exactly what humans do as well: humans produce linguistic explanations (or justifications) for conclusions that have been reached intuitively (Haidt, 2013). And so, since it is generally thought beneficial for AI safety to make AI’s ways of thinking similar to human ways of thinking, we should consider the fact that GFlowNet actors generate explanations from the plan a good rather than a worrisome feature of the architecture described above.
Haidt (2013) writes that humans can “train” their intuitive thinking through linguistic reasoning and arguments over time. In the context of GFlowNet-EM, this means that when humans “pass texts through themselves” (either through reading texts written by others, or reading texts written and refined by themselves, or listening to others, or listening to their own speech or inner monologue) they engage in training described by Algorithm 1 from Hu et al. (2023). Thus, whenever humans perceive some text, they hone their sub-linguistic/representational “inference machine” for future planning and prediction.
It may seem that language-level reasoning, predictions, and planning should be more robust than inference performed in the space of world state and action representations. This is not obvious to me: maybe linguistic reasoning is too easily carried away and therefore inference in the space of world state representations that are grounded with sensory information more directly than language is more robust.
Whether this is true or not, we can still “check” the inferred plans with linguistic reasoning, specifically, by extending the “optimisation” mode of thinking (section 3.3) in the following way: on each step of Markov chain sampling, we can compare not just the (predicted) energies of the plans but also the energies of the explanations generated from these plans, and decide whether to accept a Markov chain transition taking into account the relative energies of both the plans and their explanations, as well as the likelihoods that could also be computed. Furthermore, to make sure that the energies of the explanations provide a helpful extra signal relative to the energies of the plans, we can go beyond the “first” explanation generated from the plan (by iteratively applying the conditional forward policy that generates explanations from plans) and optimise the explanation itself with a similar Markov chain sampling process, although this requires training an extra backward policy and can make the overall thinking process in the “optimisation” mode considerably slower.
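The acceptance rule described above can be sketched as a Metropolis-Hastings step over a joint energy. This is a minimal illustration under stated assumptions, not the post's specified procedure: combining the two energies as a weighted sum, the `expl_weight` parameter, and folding the "likelihoods" into a single `log_proposal_ratio` term are all assumptions, and the function names are hypothetical.

```python
import math
import random


def acceptance_probability(
    e_plan_cur: float,
    e_expl_cur: float,
    e_plan_new: float,
    e_expl_new: float,
    expl_weight: float = 1.0,
    log_proposal_ratio: float = 0.0,
) -> float:
    """Probability of accepting a move from the current plan to a proposed one.

    The joint energy is assumed to be plan energy + expl_weight * explanation
    energy. log_proposal_ratio stands for log q(cur | new) - log q(new | cur),
    which could be computed from the backward and forward policy likelihoods.
    """
    e_cur = e_plan_cur + expl_weight * e_expl_cur
    e_new = e_plan_new + expl_weight * e_expl_new
    log_alpha = -(e_new - e_cur) + log_proposal_ratio
    return 1.0 if log_alpha >= 0 else math.exp(log_alpha)


def accept_transition(probability: float, rng: random.Random) -> bool:
    """Sample the accept/reject decision for one Markov chain step."""
    return rng.random() < probability


if __name__ == "__main__":
    rng = random.Random(0)
    # Proposed plan has slightly lower plan energy but a much worse explanation.
    p_joint = acceptance_probability(1.0, 0.5, 0.9, 2.5)
    p_plan_only = acceptance_probability(1.0, 0.5, 0.9, 2.5, expl_weight=0.0)
    print(p_joint, p_plan_only, accept_transition(p_joint, rng))
```

In this toy case the plan-only rule always accepts the proposal, while the joint rule accepts it only about 15% of the time, which is the extra "check" that the explanation energy is meant to provide.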
4. Safety properties and risks of H-JEPA agent with GFlowNet actors
In this section, I discuss the safety characteristics and the deployment risks of the H-JEPA-like architecture with GFlowNet actors, organised and trained as discussed in section 3.
4.1. H-JEPA with GFlowNet actors is more grounded than the exemplary actor (in case it turns out to be a problem)
The GFlowNet actors effectively clone the behaviour of the LLM-based exemplary actor, with a single substantial addition: GFlowNet actors are also trained to predict future outcomes in the world faithfully through the JEPA hierarchy, which provides grounding for these actors. Even though I don’t expect grounding to be a problem for the LLM-based exemplary actor (see section 2.2), there is some probability that it will be a problem (including for safety), and thus addressing this potential issue through the JEPA training objective is a good thing.
4.2. Interpretability: GFlowNet actors don’t have an advantage over the exemplary actor, both are fairly interpretable in comparison with bare LLMs
The selling point of GFlowNets, namely that they differentiate between world models and inference based on these models (Bengio & Hu, 2023), applies equally to H-JEPA with GFlowNet actors and the exemplary actor (an LMCA): even though we assume that the auto-regressive LLM underlying the exemplary actor has an “alien” world model, we also assume that iterated critiques and refinements reduce this problem significantly, and the remaining “alien” bias is cloned by the GFlowNet actors anyway.
In section 2.3.3, I discussed that the world model of the exemplary actor is not guaranteed to exactly reflect the theories (models) of rationality, ethics, science, etc. which are the basis of iterated critiques and refinements, although we should reasonably expect that the exemplary plans and predictions will at least reflect something close to these theories. Characterising the exact nature of the resulting world model, epitomised by both the exemplary actor and GFlowNet actors, would be an open research problem.
4.3. There is no “jailbreak” for GFlowNet actors, but direct access is still a risk unless we accept a high alignment tax in capability
Unlike a typical auto-regressive LLM (including the LLM that underlies the exemplary actor), GFlowNet actors are trained only on exemplary plans, predictions, and explanations, and with an objective that rewards good completed plans and predictions rather than plausible auto-regressive continuations. Intuitively, this suggests that GFlowNet actors couldn’t be made to generate harmful plans through adversarial communication, i.e., there is no “jailbreak” for them. Perhaps GFlowNet policies couldn’t even be easily fine-tuned to bias generation in a certain “misaligned” direction without harming the capabilities, thus fulfilling the vision of self-destructing models (Mitchell et al., 2022).
Through a normal interface, GFlowNet actors couldn’t be “prompted” at all, unlike both auto-regressive LLMs and LMCAs built on top of them, such as the exemplary actor. The H-JEPA agent could perhaps use the policy that infers plans from text as its theory of mind of the people and other AI agents surrounding it. Passing the text communicated by counterparties through this policy is equivalent to obtaining a belief about the world state and plan embeddings that could be associated with the counterparty in the agent’s own world state.
However, if an adversary has direct access to all GFlowNet policies, they could still use them to make misaligned, adversarial plans quite easily, as I described here. They could generate a textual description of the world state that calls the H-JEPA agent to make a mitigation plan against an adversary, e.g., a terrorist that plans to deploy bioweapons to kill all humans on Earth. A sufficiently capable H-JEPA agent with GFlowNet actors will “think through” the likely action of the terrorist to prepare its best mitigation plan, which includes thinking through which chemical compounds the terrorist will be able to synthesise. Then the adversary could pick up this hazardous information from the textual explanations of the mitigation plan by the H-JEPA agent.
The exemplary actor and, therefore, H-JEPA with GFlowNet actors could be trained to never explicitly predict the actions of their adversaries, at least when the adversaries’ misaligned plans have an existential scale (such as creating a supervirus and killing all humans). This trades off the risk of direct access to GFlowNet policies for a higher alignment tax in capability (see section 2.3.4): if the exemplary actor (or H-JEPA with GFlowNet actors) refuses to “think through” the plans of mortal enemies of humanity, it also couldn’t effectively protect humanity against them.
4.4. GFlowNet actors can’t bootstrap future generations of GFlowNet actors
Even though the conditional forward policy generates language, it probably couldn’t be used for self-critique, refinement, and other “LLM tasks” as effectively as the LLM that underlies the exemplary actor. Therefore, to train successive versions of H-JEPA with GFlowNet actors, a powerful LLM should be “kept around” to produce sample plans for GFlowNet-EM training.
Moreover, as the number of scientific and methodological theories grows, and as we wish to make the exemplary plans more aligned with the actual theories expressed in the textbooks (see section 2.3.3), we would probably need to increase the power of the LLM (e.g., through scaling, as long as scaling keeps improving the LLM’s capabilities) if we aim to increase the reliability and precision of GFlowNet actors over successive iterations of the technology.
This means that there is no strategic option to “train an LLM, make an LMCA which will be aligned through attending to the “right” textbooks, then clone the exemplary actor’s behaviour into H-JEPA, then destroy the original LLM as inherently unsafe for direct access and bootstrap future versions of H-JEPA from a previous version of H-JEPA” (even if we ignore for a moment that H-JEPA is also not “inherently safe”, because direct access to it still poses a great risk, just as direct access to an LLM does, as discussed in the previous section). GFlowNet actors that are trained as described in section 3 are forever dependent on the LLM’s capability to generate training samples for them. Therefore, the risk of just keeping the LLM around (i.e., the risk of someone stealing its weights and fine-tuning it for “bad” purposes) doesn’t go away if we train an H-JEPA agent with GFlowNet actors from the behaviour of the exemplary actor.
4.5. H-JEPA with GFlowNet actors has a lower alignment tax at inference time relative to the exemplary actor in exchange for a higher alignment tax at training time
As discussed in section 2.2.1, going through the exemplary actor’s full iterative critique-and-refinement process to infer the “best” plan will probably take many minutes (or dozens of minutes) for many years ahead, and will probably not become dirt cheap for at least some years, either.
The gargantuan volume of training compute drives “aligned” inference procedures into the Transformers that implement GFlowNet policies, which then will be able to perform aligned “System 2” reasoning over a single Transformer rollout (or a limited number of them in a Markov chain, in the “optimisation” inference mode, as discussed in section 3.3). Thus, the inference of the H-JEPA agent with GFlowNet actors could be about an order of magnitude faster and cheaper than the inference of the exemplary actor.
I’m not sure that this gain is significant from the strategic “capability deterrent” perspective, i.e., the perspective that we want to have an aligned AI that is more powerful than a misaligned rogue AI. I’m not sure that even if we have a fully digitalised and fully automated economy and ubiquitous internet, IoT, and drone infrastructure, a rogue AI could gain a decisive advantage in a full-scale AI conflict scenario by thinking in seconds rather than in dozens of minutes. Transmitting data, infecting computers with viruses, collecting and processing data, and making inferences from this data still take time that adds up at least to minutes and perhaps hours, so it seems that a rogue AI that conducts a massive cyber-attack or a war may have little to gain from arriving at conclusions in one second rather than ten minutes.
Therefore, the idea of transferring the alignment tax payout from inference to training time, by training an H-JEPA agent with GFlowNet actors that clones the behaviour of the exemplary LLM-based actor, may be economically attractive. It also may not be, if training GFlowNet actors proves far too expensive and this up-front cost is not recovered through cheaper inference, especially considering that we may also want to improve the exemplary actor in cheap, short iterations without re-training the underlying LLM, through updating the body of knowledge and through improvements to the LMCA. In either case, this particular kind of lowering the alignment tax doesn’t seem to be relevant to the “strategic deterrence” aspect of AI x-risk.
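The economic side of this trade-off reduces to a break-even calculation: the extra up-front training cost is recovered only after enough plans have been inferred at the lower per-plan cost. The sketch below uses purely hypothetical figures (the post gives no cost estimates for either system), and the function name is made up for illustration.

```python
import math
from typing import Optional


def break_even_inferences(
    extra_training_cost: float,
    exemplary_cost_per_plan: float,
    gflownet_cost_per_plan: float,
) -> Optional[int]:
    """Number of inferred plans after which cheaper inference repays training.

    Returns None when GFlowNet inference is not cheaper, i.e. the up-front
    cost is never recovered.
    """
    saving_per_plan = exemplary_cost_per_plan - gflownet_cost_per_plan
    if saving_per_plan <= 0:
        return None
    return math.ceil(extra_training_cost / saving_per_plan)


if __name__ == "__main__":
    # Hypothetical figures: $9B of extra training compute, $10 per plan for
    # the exemplary actor, $1 per plan for the H-JEPA agent (about 10x cheaper).
    print(break_even_inferences(9e9, 10.0, 1.0))
```

With these made-up numbers, a billion plans would have to be inferred before the training cost is recovered, and every re-training of the GFlowNet actors after an update to the exemplary actor restarts the count.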
4.6. Addressing the risk of GFlowNet actors producing good-sounding explanations for self-serving, misaligned plans
In a recent podcast, Ajeya Cotra describes this risk in the context of explaining plans by chess and Go-playing AIs:
Rob Wiblin: In theory, could we today, if we wanted to, train a model that would explain why proteins are folded a particular way or explain why a particular go move is good?
Ajeya Cotra: I think we could totally try to do that. We have the models that can talk to us, and we have the models that are really good at Go or chess or protein folding. It would be a matter of training a multimodal model that takes as input both the Go or chess board or protein thing, and some questions that we’re asking, and it produces its output: both a move and an explanation of the moves.
But I think it’s much harder and less obvious how to train this system to have the words it’s saying be truly connected to why it’s making the moves it’s making. Because we’re training it to do well at the game by just giving it a reward when it wins, and then we’re training it to talk to us about the game by having some humans listen to what it’s saying and then judge whether it seems like a good explanation of why it did what it did. Even when you try and improve this training procedure, it’s not totally clear if we can actually get this system to say everything that it knows about why it’s making this move.
A disconnection between the plans and the explanations is a situation in which a GFlowNet actor can justify a wide range of plans with good-sounding explanations. This is a form of solution degeneracy that GFlowNet-EM algorithms specifically address (Hu et al., 2023, section 4.1).
In addition, we can also augment the GFlowNet-EM algorithms with contrastive learning steps, where all initial and intermediate versions of the plans and predictions produced by the exemplary actor (except the final plan or prediction on which the exemplary actor converges) could be used as negative contrastive examples when training GFlowNet policies, while GFlowNet-EM training algorithms take care of preventing posterior collapse (discussed as mode collapse by LeCun) so that GFlowNet policies learn to represent “imperfect” plans faithfully in the latent space and, of course, are trained to avoid generating such plans.
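One way to picture the contrastive step described above is an InfoNCE-style loss in which the converged plan is the positive example and all earlier versions from the same refinement trace are negatives. This is a sketch under assumptions: the post does not specify the contrastive objective, so InfoNCE is swapped in purely for illustration, and the scalar "scores" (for example, negative energies that a policy assigns to plans) and both function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def split_refinement_trace(trace: Sequence[str]) -> Tuple[str, List[str]]:
    """Split an exemplary actor's trace into (final plan, earlier versions).

    The final, converged plan is the positive example; the initial and
    intermediate versions serve as negative contrastive examples.
    """
    if not trace:
        raise ValueError("trace must contain at least the final plan")
    return trace[-1], list(trace[:-1])


def contrastive_loss(
    positive_score: float,
    negative_scores: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """InfoNCE-style loss: negative log-softmax of the positive's score.

    Scores are assumed to be higher for plans the policy considers better.
    """
    logits = [positive_score / temperature]
    logits += [score / temperature for score in negative_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_normaliser = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_normaliser - logits[0]


if __name__ == "__main__":
    final_plan, negatives = split_refinement_trace(["draft", "revision", "final"])
    print(final_plan, negatives)
    # The loss shrinks as the final plan is scored above its earlier versions.
    print(contrastive_loss(0.0, [0.0, 0.0]), contrastive_loss(3.0, [0.0, 0.0]))
```

Minimising such a loss pushes the score of the converged plan above the scores of its own drafts, while the drafts still have to be represented in the latent space in order to be scored at all.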
I believe that Cotra’s second remark, that a deep neural network “knows” the totality of causes of its own “moves” (plans, behaviour), reflects a misunderstanding of the limits of any system’s self-knowledge and of the nature of explanation.
Fields and Levin (2022) posited that “no system can fully predict its own future behaviour”, which automatically means that no system can fully explain its behaviour.
The generation of an explanation (including sub-processes such as critiques and refinements in the case of the exemplary actor, or Markov chain sampling in the case of GFlowNet actors) for something such as a move or a plan is a form of computation whose role is to check that the move or the plan conforms to some abstract model of behaviour (see sections 3.3 and 3.5). Therefore, no intelligent system “knows” the explanations for its behaviour: it can generate them on demand.
This applies even to the exemplary actor, even though it converges on its plans through linguistic reasoning and explanations: the final version of the plan on which the exemplary actor converges could have vastly more possible explanations that didn’t lie on the path of plans (which are progressively refined versions of each other, according to the exemplary actor’s algorithm) that the exemplary actor happened to take. In principle, after the convergence, the exemplary actor could try to verify the plan further by stripping all the explanations from the minimum description of the final plan, then trying to seed alternative explanations for this plan, and then checking if critiques and refinements of these alternative explanations lead to changes in the plan.
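The post-convergence check described above can be sketched as a loop over seeded alternative explanations. Everything here is a toy illustration under assumptions: plans and explanations are plain strings, `refine` is a hypothetical stand-in for a full round of the exemplary actor's critique and refinement, and plan equality is used as the test for "the plan changed".

```python
from typing import Callable, Iterable, List


def explanations_that_change_plan(
    final_plan: str,
    alternative_explanations: Iterable[str],
    refine: Callable[[str, str], str],
) -> List[str]:
    """Return the seeded explanations whose refinement alters the bare plan.

    final_plan is assumed to be already stripped of its original explanations.
    An empty result means the plan was stable under every alternative tried.
    """
    destabilising = []
    for explanation in alternative_explanations:
        refined_plan = refine(final_plan, explanation)
        if refined_plan != final_plan:
            destabilising.append(explanation)
    return destabilising


def toy_refine(plan: str, explanation: str) -> str:
    """Hypothetical refinement step: revises the plan if a risk is raised."""
    if "risk" in explanation:
        return plan + " (after an added safety review)"
    return plan


if __name__ == "__main__":
    seeds = ["the cost is acceptable", "a downstream risk was ignored"]
    print(explanations_that_change_plan("build the bridge", seeds, toy_refine))
```

A non-empty result signals that the converged plan depends on the particular path of explanations that produced it, which is a reason to continue refinement from the altered plan.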
Barp, A., Da Costa, L., França, G., Friston, K., Girolami, M., Jordan, M. I., & Pavliotis, G. A. (2022). Geometric Methods for Sampling, Optimisation, Inference and Adaptive Agents (Vol. 46, pp. 21–78). https://doi.org/10.1016/bs.host.2022.03.005
Bengio, Y. (2019). The Consciousness Prior (arXiv:1709.08568). arXiv. http://arxiv.org/abs/1709.08568
Bengio, Y., & Hu, E. (2023, March 21). Scaling in the service of reasoning & model-based ML. Yoshua Bengio. https://yoshuabengio.org/2023/03/21/scaling-in-the-service-of-reasoning-model-based-ml/
Bertsch, A., Alon, U., Neubig, G., & Gormley, M. R. (2023). Unlimiformer: Long-Range Transformers with Unlimited Length Input (arXiv:2305.01625). arXiv. https://doi.org/10.48550/arXiv.2305.01625
Browning, J., & LeCun, Y. (2022). AI And The Limits Of Language. https://www.noemamag.com/ai-and-the-limits-of-language
Chefer, H., Gur, S., & Wolf, L. (2021). Generic Attention-Model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. 397–406. https://openaccess.thecvf.com/content/ICCV2021/html/Chefer_Generic_Attention-Model_Explainability_for_Interpreting_Bi-Modal_and_Encoder-Decoder_Transformers_ICCV_2021_paper.html
Fields, C., & Levin, M. (2022). Regulative development as a model for origin of life and artificial life studies [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/rdt7f
Friston, K. J., Ramstead, M. J. D., Kiefer, A. B., Tschantz, A., Buckley, C. L., Albarracin, M., Pitliya, R. J., Heins, C., Klein, B., Millidge, B., Sakthivadivel, D. A. R., Smithe, T. S. C., Koudahl, M., Tremblay, S. E., Petersen, C., Fung, K., Fox, J. G., Swanson, S., Mapes, D., & René, G. (2022). Designing Ecosystems of Intelligence from First Principles (arXiv:2212.01354). arXiv. http://arxiv.org/abs/2212.01354
Gu, J., Liu, Q., & Cho, K. (2019). Insertion-based Decoding with Automatically Inferred Generation Order. Transactions of the Association for Computational Linguistics, 7, 661–676. https://doi.org/10.1162/tacl_a_00292
Gu, J., Wang, C., & Zhao, J. (2019). Levenshtein Transformer. Advances in Neural Information Processing Systems, 32. https://proceedings.neurips.cc/paper/2019/hash/675f9820626f5bc0afb47b57890b466e-Abstract.html
Haidt, J. (2013). The righteous mind: Why good people are divided by politics and religion (1st Vintage Books ed.). Vintage Books.
Hesp, C., Smith, R., Parr, T., Allen, M., Friston, K. J., & Ramstead, M. J. D. (2021). Deeply Felt Affect: The Emergence of Valence in Deep Active Inference. Neural Computation, 33(2), 398–446. https://doi.org/10.1162/neco_a_01341
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. van den, Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., … Sifre, L. (2022). Training Compute-Optimal Large Language Models (arXiv:2203.15556). arXiv. https://doi.org/10.48550/arXiv.2203.15556
Hu, E., Malkin, N., Jain, M., Everett, K., Graikos, A., & Bengio, Y. (2023). GFlowNet-EM for learning compositional latent variable models (arXiv:2302.06576). arXiv. http://arxiv.org/abs/2302.06576
LeCun, Y. (n.d.). A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27.
Liu, Z., Gan, E., & Tegmark, M. (2023). Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability (arXiv:2305.08746). arXiv. https://doi.org/10.48550/arXiv.2305.08746
Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., Grave, E., LeCun, Y., & Scialom, T. (2023). Augmented Language Models: A Survey (arXiv:2302.07842). arXiv. https://doi.org/10.48550/arXiv.2302.07842
Mitchell, E., Henderson, P., Manning, C. D., Jurafsky, D., & Finn, C. (2022). Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models (arXiv:2211.14946). arXiv. https://doi.org/10.48550/arXiv.2211.14946
Osband, I., Asghari, S. M., Van Roy, B., McAleese, N., Aslanides, J., & Irving, G. (2023). Fine-Tuning Language Models via Epistemic Neural Networks (arXiv:2211.01568). arXiv. http://arxiv.org/abs/2211.01568
Pan, L., Malkin, N., Zhang, D., & Bengio, Y. (2023). Better Training of GFlowNets with Local Credit and Incomplete Trajectories (arXiv:2302.01687). arXiv. https://doi.org/10.48550/arXiv.2302.01687
Pearl, J. (2010). Causal Inference. Proceedings of Workshop on Causality: Objectives and Assessment at NIPS 2008, 39–58. https://proceedings.mlr.press/v6/pearl10a.html
Pei, J., Wang, C., & Szarvas, G. (2022). Transformer Uncertainty Estimation with Hierarchical Stochastic Attention. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), Article 10. https://doi.org/10.1609/aaai.v36i10.21364
Rahaman, N., Weiss, M., Locatello, F., Pal, C., Bengio, Y., Schölkopf, B., Li, L. E., & Ballas, N. (2022). Neural Attentive Circuits (arXiv:2210.08031). arXiv. https://doi.org/10.48550/arXiv.2210.08031
Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (arXiv:2305.04388). arXiv. http://arxiv.org/abs/2305.04388
Yu, L., Simig, D., Flaherty, C., Aghajanyan, A., Zettlemoyer, L., & Lewis, M. (2023). MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers (arXiv:2305.07185). arXiv. http://arxiv.org/abs/2305.07185
Zhang, D., Malkin, N., Liu, Z., Volokhova, A., Courville, A., & Bengio, Y. (2022). Generative Flow Networks for Discrete Probabilistic Modeling (arXiv:2202.01361). arXiv. http://arxiv.org/abs/2202.01361
Actually, it seems that plans should be trees of actions and world states, covering multiple adjacent hierarchy levels: see section 3.3.
is referred to as in Algorithm 2 in (Hu et al., 2023).
In the language of Bengio and Hu (2023), a “world model” seems to be something like a static description of the knowledge (theory), or, more generally, the energy function or the distribution, rather than something that performs inferences (makes calculations) according to the theory.
Or, we can “reuse” the action plan input modality which always begins with the current world state.
I.e., world state updates from the lower hierarchy level: see section 3.2.1.
Unless the reasoning is formalised to the level that everything is crammed into a single, coherent causal model where causal effects (Pearl, 2010) of the action on various variables (which themselves could be classified as “instrumental” and “pro-social” ends) could be calculated and marginalised. However, the reasoning of the LLM-based exemplary actor will not be as formal (reasoning “in words”, even for connecting more causal sub-models, is not as formal as building an explicit structural causal model), and so neither will be the GFlowNet actor’s reasoning.
Because the energy distribution of preferred outcomes and outcomes expected is not a parametric distribution.
Given that plans don’t have a fixed length, this isn’t guaranteed to take exactly K steps.
Note that world state and action representations are learned by GFlowNet training algorithms, and simple transfer of these representations between GFlowNet policies on adjacent levels could reduce the efficiency of training. Sharing some layers between GFlowNet actors at the adjacent hierarchy levels could help to address this issue.
How much larger I don’t have the slightest intuition and didn’t try to estimate. However, potentially, these Transformers may need to be much larger than the LLM underlying the exemplary actor.
Even though an inverse conditional policy, for generating plans from explanations, is trained as well.
Albeit still not very directly if we talk about GFlowNet plan representations on high levels of the inference hierarchy that are removed from direct sensory information through a sequence of state space mappings.
Unless the theory of rationality that is used by the exemplary actor and, by extension, the H-JEPA agent with GFlowNet actors permits Dutch Book exploitation in practice.
So, I’m sceptical of Eric Schmidt’s “millisecond war” idea.
In principle, even exhaustively in some formalised systems: for example, we can generate unit tests for a function that verifies the correct behaviour of the function for every combination of the values of the arguments in their respective domains, which would be a “full explanation” that the function behaves according to some desired abstract specification.
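As a toy instance of such an exhaustive "full explanation", the sketch below checks a small function against its specification for every combination of argument values in a finite domain. The function `saturating_add` and its 8-bit domain are made up for illustration; the approach is only feasible when the domains are small enough to enumerate.

```python
from itertools import product

MAX_VALUE = 255  # arguments and results are 8-bit unsigned integers


def saturating_add(a: int, b: int) -> int:
    """Add two 8-bit unsigned integers, clamping the result at 255."""
    return min(a + b, MAX_VALUE)


def check_exhaustively() -> int:
    """Verify the specification for all 256 * 256 argument combinations.

    Returns the number of combinations checked; raises AssertionError on the
    first violation of the specification.
    """
    checked = 0
    for a, b in product(range(MAX_VALUE + 1), repeat=2):
        result = saturating_add(a, b)
        assert 0 <= result <= MAX_VALUE      # result stays in the domain
        assert result >= max(a, b)           # adding never decreases a value
        if a + b <= MAX_VALUE:
            assert result == a + b           # exact whenever the sum fits
        else:
            assert result == MAX_VALUE       # otherwise clamps at the maximum
        assert result == saturating_add(b, a)  # commutativity
        checked += 1
    return checked


if __name__ == "__main__":
    print(check_exhaustively(), "argument combinations verified")
```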