tom4everitt

Karma: 454

Research Scientist at DeepMind

tomeveritt.se

tom4everitt 9 Sep 2024 13:57 UTC
6 points
0
in reply to: Dalcy’s comment on: Darcy’s Shortform
It’s true, it’s cool, but I suspect he’s been a bit disheartened by how complicated it’s been to get this to work in real-world settings.
In The Book of Why, he basically now says it’s impossible to learn causality from data, which is a bit of a confusing message if you come from his previous books.
But now with language models, I think his hopes are up again, since models can basically piggyback on causal relationships inferred by humans.

tom4everitt 2 Sep 2024 11:19 UTC
LW: 3 AF: 3
0
AF
in reply to: Tapatakt’s comment on: Reward Hacking from a Causal Perspective
Sorry, this post got stuck on the backburner for a little bit. But the content will largely be from “Robust Agents Learn Causal World Models”

tom4everitt 23 Jan 2024 13:11 UTC
LW: 5 AF: 5
0
AF
on: A Shutdown Problem Proposal
The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:
- Does not want to manipulate the shutdown button
- Does respond to the shutdown button
- Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.
From a quick read, your proposal seems closely related to Jessica Taylor’s causal-counterfactual utility indifference. Ryan Carey and I also recently had a paper formalising some similar ideas, with some further literature review https://arxiv.org/abs/2305.19861

tom4everitt 2 Nov 2023 16:47 UTC
LW: 5 AF: 4
0
AF
on: 3. Premise three & Conclusion: AI systems can affect value change trajectories & the Value Change Problem
I really like this articulation of the problem!
To me, a way to point to something similar is to say that preservation (and enhancement) of human agency is important (value change being one important way that human agency can be reduced). https://www.alignmentforum.org/s/pcdHisDEGLbxrbSHD/p/Qi77Tu3ehdacAbBBe
One thing I’ve been trying to argue for is that we might try to pivot agent foundations research to focus more on human agency instead of artificial agency. For example, I think value change is an example of self-modification, which has been studied a fair bit for artificial agents.

tom4everitt 15 Aug 2023 9:58 UTC
1 point
0
in reply to: Chris Lakin’s comment on: Reward Hacking from a Causal Perspective
I see, thanks for the careful explanation.
I think the kind of manipulation you have in mind is bypassing the human’s rational deliberation, which is an important one. This is roughly what I have in mind when I say “covert influence”.
So in response to your first comment: given that the above can be properly defined, there should also be a distinction between using and not using covert influence?
As for whether manipulation can be defined as penetration of a Markov blanket: it’s possible. My main question is how much characterising it in terms of a Markov blanket adds to the analysis, because it’s non-trivial to define the membrane variable in such a way that information that “covertly” passes through my eyes and ears bypasses the membrane, while other information is mediated by the membrane.
The SEP article does a pretty good job at spelling out the many different forms manipulation can take https://plato.stanford.edu/entries/ethics-manipulation/

tom4everitt 14 Aug 2023 17:06 UTC
2 points
0
in reply to: Chris Lakin’s comment on: Reward Hacking from a Causal Perspective
The point here isn’t that the content recommender is optimised to use covert means in particular, but that it is not optimised to avoid them. Therefore it may well end up using them, as they might be the easiest path to reward.
Re Markov blankets, won’t any kind of information penetrate a human’s Markov blanket, as any information received will alter the human’s brain state?

tom4everitt 7 Jul 2023 17:28 UTC
2 points
0
in reply to: Chris Lakin’s comment on: Agency from a causal perspective
Thanks, that’s a nice compilation, I added the link to the post. Let me check with some of the others in the group, who might be interested in chatting further about this

tom4everitt 7 Jul 2023 17:24 UTC
1 point
0
in reply to: Chris Lakin’s comment on: Agency from a causal perspective
fixed now, thanks! (somehow it added https:// automatically)

tom4everitt 7 Jul 2023 17:21 UTC
LW: 4 AF: 2
0
AF
in reply to: Gordon Seidoh Worley’s comment on: Causality: A Brief Introduction
Sure, I think we’re saying the same thing: causality is frame dependent, and the variables define the frame (in your example, you and the sensor have different measurement procedures for detecting the purple cube, so you don’t actually talk about the same random variable).
How big a problem is it? In practice it usually seems fine, if we’re careful to test our sensor and double-check that we’re using language in the same way. In theory, scaled up to superintelligence, it’s not impossible that it would be a problem.
But I would also like to emphasize that the problem you’re pointing to isn’t restricted to causality, it goes for all kinds of linguistic reference. So to the extent we like to talk about AI systems doing things at all, causality is no worse than natural language, or other formal languages.
I think people sometimes hold it to a higher bar than natural language, because it feels like a formal language could somehow naturally intersect with a programmed AI. But of course causality doesn’t solve the reference problem in general. Partly for this reason, we’re mostly using causality as a descriptive language to talk clearly and precisely (relative to human terms) about AI systems and their properties.

tom4everitt 4 Jul 2023 10:45 UTC
LW: 1 AF: 1
0
AF
in reply to: Gordon Seidoh Worley’s comment on: Causality: A Brief Introduction
The way I think about this is that the variables constitute a reference frame. They define particular well-defined measurements that can be done, which all observers would agree about. In order to talk about interventions, there must also be a well-defined “set” operation associated with each variable, so that the effect of interventions is well-defined.
Once we have the variables, and a “set” and “get” operation for each (i.e. intervene and observe operations), then causality is an objective property of the universe. Regardless of who does the experiment (i.e. sets a few variables) and does the measurement (i.e. observes some variables), the outcome will follow the same distribution.
So in short, I don’t think we need to talk about an agent observer beyond what we already say about the variables.
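To make the “set” and “get” operations concrete, here is a minimal sketch of a toy structural causal model. The variables `rain` and `wet`, the probabilities, and the helper names `sample` and `frequency` are all hypothetical and chosen only for illustration. “Get” corresponds to reading a variable from a sample, and “set” corresponds to passing a `do` dictionary that overrides a variable’s mechanism.

```python
import random


def sample(rng, do=None):
    """Draw one sample from a toy model rain -> wet (hypothetical example).

    `do` maps variable names to forced values (the "set" operation).
    Reading an entry of the returned dict is the "get" operation.
    """
    do = do or {}
    # Exogenous noise is always drawn, so an intervention only replaces
    # the mechanism of the variable it sets and nothing else.
    u_rain, u_wet = rng.random(), rng.random()
    values = {}
    values["rain"] = do.get("rain", u_rain < 0.3)
    values["wet"] = do.get("wet", u_wet < (0.9 if values["rain"] else 0.1))
    return values


def frequency(var, do=None, n=20_000, seed=0):
    """Estimate P(var = True) under the intervention `do` from n samples."""
    rng = random.Random(seed)
    return sum(sample(rng, do)[var] for _ in range(n)) / n


# Setting the cause changes the effect ...
wet_given_do_rain = frequency("wet", do={"rain": True})
wet_given_do_dry = frequency("wet", do={"rain": False})

# ... but setting the effect leaves the cause untouched.
rain_observed = frequency("rain")
rain_given_do_wet = frequency("rain", do={"wet": True})

# Two "experimenters" (different seeds) running the same experiment
# estimate the same interventional distribution, up to sampling noise.
experimenter_1 = frequency("wet", do={"rain": True}, seed=1)
experimenter_2 = frequency("wet", do={"rain": True}, seed=2)
```

In this sketch the asymmetry between the two interventions is what encodes the causal direction, and the agreement between the two seeds illustrates the claim that the outcome distribution does not depend on who runs the experiment.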

tom4everitt 26 Jun 2023 14:24 UTC
1 point
0
in reply to: Alex_Altair’s comment on: Causality: A Brief Introduction
nice, yes, I think logical induction might be a way to formalise this, though others would know much more about it

tom4everitt 22 Jun 2023 16:34 UTC
4 points
0
in reply to: Alex_Altair’s comment on: Causality: A Brief Introduction
I had intended to be using the program’s output as a time series of bits, where we are considering the bits to be “sampling” from A and B. Let’s say it’s a program that outputs the binary digits of pi. I have no idea what the bits are (after the first few) but there is a sense in which P(A) = 0.5 for either A = 0 or A = 1, and at any timestep. The same is true for P(B). So P(A)P(B) = 0.25. But clearly P(A = 0, B = 0) = 0.5, and P(A = 0, B = 1) = 0, et cetera. So in that case, they’re not probabilistically independent, and therefore there is a correlation not due to a causal influence.
Just to chip in on this: in the case you’re describing, the numbers are not statistically correlated, because they are not random in the statistics sense. They are only random given logical uncertainty.
When considering logical “random” variables, there might well be a common logical “cause” behind any correlation. But I don’t think we know how to properly formalise or talk about that yet. Perhaps one day we can articulate a logical version of Reichenbach’s principle :)
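The arithmetic in the quoted example can be checked with a small sketch. As an assumption made purely for convenience, it uses the binary digits of sqrt(2) rather than pi as the deterministic bit stream, since those are easy to compute exactly with integer arithmetic; the helper names are hypothetical. Both A and B read the same bit at each timestep, so the empirical “joint” frequencies differ from the product of the marginals even though nothing here is random in the statistical sense.

```python
from collections import Counter
from math import isqrt


def sqrt2_bits(n):
    """First n binary digits of sqrt(2) after the binary point (exact)."""
    scaled = isqrt(2 << (2 * n))  # floor(sqrt(2) * 2**n)
    # bin(scaled) is "0b1" followed by the n fractional bits.
    return [int(b) for b in bin(scaled)[3:]]


def joint_frequencies(a, b):
    """Empirical frequency of each (a_t, b_t) pair over all timesteps."""
    counts = Counter(zip(a, b))
    n = len(a)
    return {pair: counts[pair] / n for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]}


bits = sqrt2_bits(2000)
A = bits  # A "samples" the program's output at each timestep
B = bits  # B reads the very same output

joint = joint_frequencies(A, B)
p_a0 = joint[(0, 0)] + joint[(0, 1)]  # marginal frequency of A = 0
p_b0 = joint[(0, 0)] + joint[(1, 0)]  # marginal frequency of B = 0

# The joint frequency of (0, 0) equals the marginal, not the product.
product_of_marginals = p_a0 * p_b0
```

The frequencies only look like probabilities to someone who does not know the digits, which is the sense in which the variables are “random” only given logical uncertainty.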

tom4everitt 22 Jun 2023 16:17 UTC
3 points
0
in reply to: yrimon’s comment on: Causality: A Brief Introduction
Thanks for the suggestion. We made an effort to be brief, but perhaps we went too far. In our paper Reasoning about causality in games, we have a longer discussion about probabilistic, causal, and structural models (in Section 2), and Pearl’s book A Primer also offers a more comprehensive introduction.
I agree with you that causality offers a way to make out-of-distribution predictions (in post number 6, we plan to go much deeper into this). In fact, a causal Bayesian network is equivalent to an exponentially large set of probability distributions, where there is one joint distribution $P_{\mathrm{do}(X=x)}$ for each possible combination of interventions $X=x$.
We’ll probably at least add some pointers to further reading, per your suggestion. (ETA: also added a short paragraph near the end of the Intervention section.)
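As a sketch of the equivalence mentioned above, using the standard truncated factorisation (the symbols $V_i$, $v_i$ and $\mathrm{pa}_i$ are notation introduced here, not taken from the post): for a network over variables $V_1, \dots, V_n$ with parents $\mathrm{pa}_i$, each interventional distribution is obtained by dropping the factors of the intervened variables,

$$P_{\mathrm{do}(X=x)}(v_1, \dots, v_n) = \prod_{i \,:\, V_i \notin X} P(v_i \mid \mathrm{pa}_i) \quad \text{if } v \text{ is consistent with } x, \text{ and } 0 \text{ otherwise.}$$

With $n$ binary variables, each variable can be left alone, set to 0, or set to 1, so a single network specifies $3^n$ such distributions (counting the observational one, where $X$ is empty).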

tom4everitt 16 Jun 2023 15:47 UTC
3 points
0
in reply to: Quinn’s comment on: Introduction to Towards Causal Foundations of Safe AGI
Preferences and goals are obviously very important. But I’m not sure they are inherently causal, which is why they don’t have their own bullet point on that list. We’ll go into more detail in subsequent posts

tom4everitt 15 Jun 2023 17:05 UTC
1 point
0
in reply to: Quinn’s comment on: Introduction to Towards Causal Foundations of Safe AGI
I’m not sure I entirely understand the question, could you elaborate? Utility functions will play a significant role in follow-up posts, so in that sense we’re heavily building on VNM.

tom4everitt 31 Jan 2023 9:39 UTC
LW: 3 AF: 2
0
AF
in reply to: RyanCarey’s comment on: Discovering Agents
The idea … works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes. … But to apply this to a physical system, we would need a way to obtain such a partition those variables
Agree, the formalism relies on a division of the variables. One thing that I think we should perhaps have highlighted much more is Appendix B in the paper, which shows how you get a natural partition of the variables from just knowing the object-level variables of a repeated game.
Does a spinal reflex count as a policy?
A spinal reflex would be different if humans had evolved in a different world. So it reflects an agentic decision by evolution. In this sense, it is similar to the thermostat, which inherits its agency from the humans that designed it.
Does an ant’s decision to fight come from a representation of a desire to save its queen?
Same as above.
How accurate does its belief about the forthcoming battle have to be before this representation counts?
One thing that I’m excited to think further about is what we might call “proper agents”, that are agentic in themselves, rather than just inheriting their agency from the evolution / design / training process that made them. I think this is what you’re pointing at with the ant’s knowledge. Likely it wouldn’t quite be a proper agent (but a human would, as we are able to adapt without re-evolving in a new environment). I have some half-developed thoughts on this.

tom4everitt 14 Nov 2022 11:58 UTC
LW: 3 AF: 3
0
AF
in reply to: johnswentworth’s comment on: Clarifying AI X-risk
This makes sense, thanks for explaining. So a threat model with specification gaming as its only technical cause can lead to x-risk under the right (i.e. wrong) societal conditions.

tom4everitt 4 Nov 2022 12:12 UTC
LW: 4 AF: 3
0
AF
in reply to: johnswentworth’s comment on: Clarifying AI X-risk
For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions?
Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there. Since the agent lacks intent to take over the world or similar, it won’t be systematically deceiving humans to pursue some particular agenda of the agent.
As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.

tom4everitt 29 Sep 2022 13:16 UTC
2 points
1
on: Agency engineering: is AI-alignment “to human intent” enough?
I think the point that even an aligned agent can undermine human agency is interesting and important. It relates to some of our work on defining agency and preventing manipulation. (Which I know you’re aware of, so I’m just highlighting the connection for others.)

tom4everitt 2 Sep 2022 10:46 UTC
1 point
1
in reply to: Roman Leventov’s comment on: Discovering Agents
Sorry, I worded that slightly too strongly. It is important that causal experiments can in principle be used to detect agents. But to me, the primary value of this isn’t that you can run a magical algorithm that lists all the agents in your environment. That’s not possible, at least not yet. Instead, the primary value (as I see it) is that the experiment could be run in principle, thereby grounding our thinking. This often helps, even if we’re not actually able to run the experiment in practice.

I interpreted your comment as “CIDs are not useful, because causal inference is hard”. I agree that causal inference is hard, and unlikely to be automated anytime soon. But to me, automatic inference of graphs was never the intended purpose of CIDs.

Instead, the main value of CIDs is that they help make informal, philosophical arguments crisp, by making assumptions and inferences explicit in a simple-to-understand formal language.

So it’s from this perspective that I’m not overly worried about the practicality of the experiments.