An independent researcher, blogger, and philosopher working on intelligence and agency (esp. Active Inference), alignment, ethics, the interaction of the AI transition with sociotechnical risks (epistemics, economics, human psychology), collective mind architecture, and research strategy and methodology.
Twitter: https://twitter.com/leventov. E-mail: leventov.ru@gmail.com (the preferred mode of communication). I’m open to collaborations and work.
Presentations at meetups, workshops, and conferences; some recorded videos.
I’m a founding member of the Gaia Consortium, on a mission to create a global, decentralised system for collective sense-making and decision-making, i.e., civilisational intelligence. Drop me a line if you want to learn more about it and/or join the consortium.
You can help boost my sense of accountability, and give me the feeling that my work is valued, by becoming a paid subscriber of my Substack (though I don’t post anything paywalled; in fact, on this blog I just syndicate my LessWrong writing).
For Russian speakers: the Russian-language AI safety network, Telegram group.
The term “RL agent” means an agent with an architecture from a certain class, amenable to a specific kind of training. Since you are discussing RL agents in this post, I think it could be misleading to use human examples and analogies (“travelling across the world to do cocaine”) because humans are not RL agents, neither on the level of wetware biological architecture (i.e., neurons and synapses don’t represent a policy) nor on the abstract, cognitive level. On the cognitive level, even RL-by-construction agents of sufficient intelligence, trained in sufficiently complex and rich environments, will probably exhibit the dynamics of Active Inference agents, as I note below.
It’s not completely clear to me what you mean by “selection for agents” and “selection for reward”: RL training itself, or evolutionary tweaking of the agent’s architecture and hyperparameters, which is in turn guided by the trained agent’s score (i.e., the reward) within a larger process of finding the agent that does the task best? The latter process can, and probably will, select for “reward optimizers” (see the toy sketch below).
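To make the distinction concrete, here is a minimal toy sketch of the two nested selection processes; all function names and numbers are hypothetical, invented for illustration, and not taken from the post or any library:

```python
import random

def train_rl(params):
    """Inner loop stand-in: RL training nudges one agent's policy
    toward higher reward (selection of behaviour within an agent)."""
    return params + 0.1 * random.random()

def evaluate(trained_params):
    """The trained agent's score (reward) on the task; a toy task
    where parameters near 1.0 score best."""
    return -abs(trained_params - 1.0)

def mutate(params):
    """A random tweak of architecture/hyperparameters."""
    return params + random.gauss(0.0, 0.2)

def outer_search(population, generations=20):
    """Outer loop: evolutionary selection FOR reward. Candidate agents
    are kept or discarded based on their final trained score, so this
    process can favour 'reward optimizers' even if no inner loop does."""
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: evaluate(train_rl(p)),
                        reverse=True)
        survivors = ranked[: len(ranked) // 2]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in survivors]
    return population

print(outer_search([random.uniform(-2.0, 2.0) for _ in range(10)]))
```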
I think that Active Inference is a simpler representation of the same ideas, one which doesn’t use the concepts of attractors, reward, reinforcement, antecedent computation, utility, and so on. Instead of explicitly representing utilities, Active Inference agents only hold (stronger or weaker) beliefs about the world, including beliefs about themselves (“the kind of agent/creature/person I am”), and act so as to fulfil these beliefs (self-evidencing; see the formula below). In humans, “rewarding” neurotransmitters regulate learning and belief updates.
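For reference, this is the standard quantity from the Active Inference literature that such agents minimise (my addition here, not something stated in the post under discussion): the variational free energy

$$F \;=\; \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big] \;=\; D_{\mathrm{KL}}\big[q(s) \,\|\, p(s \mid o)\big] \;-\; \ln p(o),$$

where $q(s)$ encodes the agent’s beliefs about hidden states $s$ and $p(o, s)$ is its generative model. Since the KL term is non-negative, $F$ upper-bounds the surprisal $-\ln p(o)$: minimising $F$ through perception updates beliefs, while minimising it through action makes observations $o$ conform to the model’s predictions, which is what “self-evidencing” means. No reward or utility term appears anywhere.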
The question that really interests me is how inevitable it is that the Active Inference dynamic emerges as RL agents are trained to certain levels of capability/intelligence.
The longer I look at this statement (and its shorter version, “Reward is not the optimization target”), the less I understand what it’s supposed to mean, considering that “optimisation” might refer to the agent’s training process as well as to the “test” process (even if the two overlap or coincide). It looks to me that your idea can be stated more concretely as: “the more intelligent/capable RL agents (whether model-based or model-free) become over the course of training with the currently conventional training algorithms, the less susceptible they will be to wireheading, rather than actively seeking it”?
The first part of this statement is about RL agents, the second about humans. I don’t think the second part makes much sense: humans shouldn’t be analysed as RL agents in the first place because, as stated above, they are not RL agents.
Unfortunately, it’s far from obvious to me that Active Inference agents (which sufficiently intelligent RL agents will apparently become by default) are corrigible even in principle. As I noted in the post, such an agent can discover the Free Energy Principle (or read about it in the literature), form the belief that it is an Active Inference agent, and thereafter disregard anything humans try to impose on it, because that would contradict its belief that it is an Active Inference agent.