Magdalena Wache

Karma: 566

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache and Marius Hobbhahn

20 May 2024 17:53 UTC

108 points

4 comments · 3 min read · LW link

Magdalena Wache 12 Feb 2024 16:57 UTC
1 point
0
on: From Finite Factors to Bayes Nets
Are you sure the “arrow forgetting” operation is correct?

You say there should be an arrow when
for all probability distributions $P$ over the elements of $N_{i}$, the approximation of $N_{j}$ over $P$ is no worse than the approximation of $N_{i}$ over $P$
where, in this case, $[N_{i}] = [A \to B \leftarrow C]$ and $[N_{j}] = [A - B - C]$. If we take a distribution in which $A \not\perp C \mid B$, then $N_{j}$ is actually a worse approximation, because $A \perp C \mid B$ must hold in any distribution that can be expressed by a graph in $[N_{j}]$, right?
Also, I really appreciate the diagrams and visualizations throughout the post!

Magdalena Wache 12 Feb 2024 14:38 UTC
1 point
0
on: From Finite Factors to Bayes Nets
I am confused about the last correspondence between Venn diagram and BN equivalence class (this one: [image not preserved]). The graph X-Z-Y implies that X⊥Y|Z. Could you explain how the Venn diagram encodes this independence/orthogonality?

Magdalena Wache 21 Sep 2023 15:54 UTC
10 points
2
on: There should be more AI safety orgs
I would love to see something like the Charity Entrepreneurship incubation program for AI safety.

Magdalena Wache 21 Sep 2023 15:51 UTC
LW: 14 AF: 4
4
AF
on: There should be more AI safety orgs
Thanks for writing this! I agree.
I used to think that starting new AI safety orgs is not useful because scaling up existing orgs is better:
- they already have all the management and operations structure set up, so there is less overhead than starting a new org
- working together with more people allows for more collaboration
And yet, existing orgs do not just hire more people. After talking to a few people from AIS orgs, I think the main reason is that scaling is a lot harder than I would intuitively have thought.
- larger orgs are harder to manage, and scaling up does not necessarily mean that much less operational overhead.
- coordinating with many people is harder than with few people. Bigger orgs take longer to change direction.
- reputational correlation between the different projects/teams
We also see the effects of coordination costs/“scaling being hard” in industry, where there is pressure towards people working longer hours. (It’s not common for companies to encourage employees to work part-time and just hire more people.)

Interpretability Externalities Case Study—Hungry Hungry Hippos

Magdalena Wache 20 Sep 2023 14:42 UTC

64 points

22 comments · 2 min read · LW link

Technical AI Safety Research Landscape [Slides]

Magdalena Wache 18 Sep 2023 13:56 UTC

49 points

2 comments · 4 min read · LW link

Magdalena Wache 8 Aug 2023 14:33 UTC
28 points
0
on: formalizing the QACI alignment formal-goal
I made a deck of Anki cards for this post—I think it is probably quite helpful for anyone who wants to deeply understand QACI. (Someone even told me they found the Anki cards easier to understand than the post itself)

You can have a look at the cards here, and if you want to study them, you can download the deck here.

Here are a few example cards:

AI Safety Europe Retreat 2023 Retrospective

Magdalena Wache 14 Apr 2023 9:05 UTC

43 points

0 comments · 2 min read · LW link

Magdalena Wache 5 Jan 2023 4:35 UTC
2 points
0
in reply to: maximkazhenkov’s comment on: Finite Factored Sets in Pictures
What is the procedure by which we identify the particular set factorization that models our distribution?
There is not just one factored set which models the distribution, but infinitely many. The depicted model is just a prototypical example.
The procedure for finding causal relationships (e.g. $X \to Y$ in our case) is not trivial. In that part of his post, Scott makes it sound like the procedure is:
1. have the suspicion that $X \to Y$
2. prove it using the histories
I think there is probably a formalized procedure to find causal relationships using factored sets, but I’m not sure if it’s written up somewhere.
For causal graphs, I don’t know if there is a formalized procedure to find causal relationships that takes into account all possible ways to define variables on the sample space. My general impression is that causal graphs become pretty unwieldy once we introduce deterministic nodes. Many of the rules we usually use with causal graphs then don’t hold anymore, and working with graphs becomes really unintuitive.
So my intuition is that yes, finding causal relationships while taking into account all possible ways to define variables is easier with factored sets than with causal graphs. (And I’m not sure it’s even possible in general with causal graphs.) But my intuition on this isn’t based on knowing a formal procedure for finding causal relationships; it just comes from my impression when playing around with the two different representations.

Magdalena Wache 21 Dec 2022 19:06 UTC
2 points
0
in reply to: Fejfo’s comment on: Finite Factored Sets in Pictures
Possibly! Extending factored sets to continuous variables is an active area of research.
Scott Garrabrant has found 3 different ways to extend the orthogonality and time definitions to infinite sets, and it is not clear which one captures most of what we want to talk about.
About his central result, Scott writes:
I suspect that the fundamental theorem can be extended to finite-dimensional factored sets (i.e., factored sets where |B| is finite), but it can not be extended to arbitrary-dimension factored sets
If his suspicion is right, that means we can use factored sets to model continuous variables, but not to model a continuous set of variables (e.g. we could model the position of a point in space as a continuous random variable, but couldn’t model a curve consisting of uncountably many points).
The Countably Factored Spaces post also seems very relevant. (I have only skimmed it.)

Magdalena Wache 20 Dec 2022 22:10 UTC
10 points
2
on: Will Machines Ever Rule the World? MLAISU W50
Suggestion for a different summary of my post:
Finite Factored Sets are a re-framing of causality: They take us away from causal graphs and use a structure based on set partitions instead. Finite Factored Sets in Pictures summarizes and explains how that works. The language of finite factored sets seems useful to talk about and re-frame fundamental alignment concepts like embedded agents and decision theory.
I’m not completely happy with
Finite factored sets are a new way of representing causality that seems to be more capable than Pearlian causality, the state-of-the-art in causality analysis. This might be useful to create future AI systems where the causal dynamics within the model are more interpretable.
because
- I wouldn’t say finite factored sets are about interpretability. I think the primary reason they are cool is that they give us a different language to talk about causality, and thereby also about fundamental alignment concepts like embedded agents and decision theory.
- Also it sounds a bit like my post introduces finite factored sets (even though you don’t say that explicitly), but it’s just a distillation of existing work.

Magdalena Wache 17 Dec 2022 10:33 UTC
2 points
0
in reply to: Frederik Hytting Jørgensen’s comment on: Finite Factored Sets in Pictures
Consider the graph Y<-X->Z. If I set Y:=X and Z:=X
I would say then the graph is reduced to the graph with just one node, namely X. And faithfulness is not violated because we wouldn’t expect X⊥X|X to hold.

In contrast, the graph X-> Y ← Z does not reduce straightforwardly even though Y is deterministic given X and Z, because there are no two variables which are information equivalent.

I’m not completely sure though if it reduces in a different way, because Y and {X, Z} are information equivalent (just like X and {Y,Z}, as well as Z and {X,Y}). And I agree that the conditions for theorem 3.2. aren’t fulfilled in this case. Intuitively I’d still say X-> Y ← Z doesn’t reduce, but I’d probably need to go through this paper more closely in order to be sure.

Magdalena Wache 16 Dec 2022 15:03 UTC
1 point
0
in reply to: Frederik Hytting Jørgensen’s comment on: Finite Factored Sets in Pictures
Would it be possible to formalize “set of probability distributions in which Y causes X is a null set, i.e. it has measure zero.”?
We are looking at the space of conditional probability table (CPT) parametrizations in which the independencies of our given joint probability distribution (JPD) hold.
If Y causes X, the independencies of our JPD only hold for a specific combination of conditional probabilities, namely those in which P(X,Z) = P(X)P(Z). The set of CPT parametrizations with P(X,Z) = P(X)P(Z) has measure zero (it is lower-dimensional than the space of all CPT parametrizations).

In contrast, any CPT parametrization of graph 1 would fulfill the independencies of our given JPD. So the set of CPT parametrizations in which our independencies hold has a non-zero measure.
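A rough numerical illustration of the measure-zero claim, as a sketch under stated assumptions: all variables are binary, the graph is Y->X, and Z = X XOR Y (the definition of Z used elsewhere on this page). The function names, the uniform sampling of parameters, the tolerance, and the seed are hypothetical choices for illustration only.

```python
import random


def independence_gap(p_y, a, b):
    """Return |P(X=1, Z=1) - P(X=1) P(Z=1)| for the graph Y -> X with Z = X XOR Y.

    p_y = P(Y=1), a = P(X=1 | Y=0), b = P(X=1 | Y=1).
    For binary X and Z, a gap of zero is equivalent to P(X,Z) = P(X)P(Z).
    """
    p_x1_y0 = (1 - p_y) * a      # X=1, Y=0, hence Z=1
    p_x1_y1 = p_y * b            # X=1, Y=1, hence Z=0
    p_x0_y1 = p_y * (1 - b)      # X=0, Y=1, hence Z=1
    p_x1 = p_x1_y0 + p_x1_y1
    p_z1 = p_x1_y0 + p_x0_y1
    return abs(p_x1_y0 - p_x1 * p_z1)


def fraction_independent(n_samples=10_000, tol=1e-9, seed=0):
    """Sample CPT parametrizations uniformly (an assumption) and return the
    fraction for which X and Z are independent up to the tolerance."""
    rng = random.Random(seed)
    hits = sum(
        independence_gap(rng.random(), rng.random(), rng.random()) < tol
        for _ in range(n_samples)
    )
    return hits / n_samples
```

Calling `fraction_independent()` estimates how often a randomly chosen parametrization of Y->X happens to satisfy P(X,Z) = P(X)P(Z). Since that set is lower-dimensional, the estimate should be at or very near zero.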
There are results that show that under some assumptions, faithfulness is only violated with probability 0. But those assumptions do not seem to hold in this example.
Could you say what these assumptions are that don’t hold here?

Finite Factored Sets in Pictures

Magdalena Wache 11 Dec 2022 18:49 UTC

183 points

35 comments · 12 min read · LW link

Magdalena Wache 7 Dec 2022 17:10 UTC
2 points
1
in reply to: Kaarel’s comment on: Finite Factored Sets in Pictures
I see! You are right, then my argument wasn’t correct! I edited the post partially based on your argument above. New version:
Can we also infer that X causes Y?
Let’s concretize the above graphs by adding the conditional probabilities. Graph 1 then looks like this:
Graph 2 is somewhat trickier, because W is not uniquely determined. But one possibility is like this:
Note that W is just the negation of Z here ( $W = \neg Z$ ). Thus, W and Z are information equivalent, and that means graph 2 is actually just graph 1.

Can we find a different variable W such that graph 2 does not reduce to graph 1? I.e. can we find a variable W such that Z is not deterministic given W?
No, we can’t. To see that, consider the distribution $P (Z | X, Y, W)$ . By definition of Z, we know that
$P (Z | X, Y, W) = \begin{cases} 1 & \text{if } Z = X \text{ XOR } Y, \\ 0 & \text{otherwise.} \end{cases}$

In other words, $P (Z | X, Y, W)$ is deterministic.
We also know that W d-separates Z from X,Y in graph 2:
This d-separation implies that Z is independent of X, Y given W:
$P (Z | W) = P (Z | W, X, Y)$
As $P (Z | X, Y, W)$ is deterministic, $P (Z | W)$ also has to be deterministic.
So, graph 2 always reduces to graph 1, no matter how we choose W. Analogously, graph 3 and graph 4 also reduce to graph 1, and we know that our causal structure is graph 1:
Which means we also know that X causes Y.

Magdalena Wache 7 Dec 2022 17:03 UTC
3 points
0
in reply to: Matthias G. Mayer’s comment on: Finite Factored Sets in Pictures
Good point, you are right! I just edited the post and now say that graph 2, 3, and 4 only have instantiations that are equivalent to graph 1.
Also in Graph 2 there is a typo P(W=0|Z=0) instead of P(Z=0|W=0)
fixed, thanks!

Magdalena Wache 5 Dec 2022 18:16 UTC
2 points
0
in reply to: Ben’s comment on: Finite Factored Sets in Pictures
Yes, exactly!

Magdalena Wache 5 Dec 2022 15:43 UTC
30 points
1
in reply to: Ben’s comment on: Finite Factored Sets in Pictures
This is a very good point, and you are right—Y causes X here, and we still get the stated distribution. The reason that we rule this case out is that the set of probability distributions in which Y causes X is a null set, i.e. it has measure zero.
If we assume the graph Y->X, and generate the data by choosing Y first and then X, like you did in your code, then whether X⊥Z holds depends on the exact values of P(X|Y). If the values of P(X|Y) change just slightly, X⊥Z won’t hold anymore. So given that our graph is Y->X, it’s really unlikely (it will almost never happen) that we get a distribution in which X⊥Z holds. But since we did observe such a distribution with X⊥Z, we infer that the graph is not Y->X.
In contrast, if we assume the graph X->Y<-Z, then X⊥Z will hold for any distribution that works with our graph, no matter what the exact values of P(Y|X) are (as long as they correspond to the graph, which includes satisfying X⊥Z).
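To make the contrast concrete, here is a small sketch in exact arithmetic. It assumes binary variables and Z = X XOR Y for the Y->X case. The specific numbers (P(Y=1) = 18/100 and so on) are illustrative values chosen so that the tuned case works out exactly, and the function names are hypothetical.

```python
from fractions import Fraction as F


def gap_y_causes_x(p_y, a, b):
    """|P(X=1,Z=1) - P(X=1)P(Z=1)| for Y -> X with Z = X XOR Y.

    p_y = P(Y=1), a = P(X=1 | Y=0), b = P(X=1 | Y=1).
    """
    p_x1_y0 = (1 - p_y) * a      # X=1, Y=0, hence Z=1
    p_x1_y1 = p_y * b            # X=1, Y=1, hence Z=0
    p_x0_y1 = p_y * (1 - b)      # X=0, Y=1, hence Z=1
    return abs(p_x1_y0 - (p_x1_y0 + p_x1_y1) * (p_x1_y0 + p_x0_y1))


def gap_collider(p_x, p_z, cpt_y):
    """Same gap for X -> Y <- Z, where cpt_y[(x, z)] = P(Y=1 | X=x, Z=z)."""
    joint = {}
    for x in (0, 1):
        for z in (0, 1):
            for y in (0, 1):
                px = p_x if x else 1 - p_x
                pz = p_z if z else 1 - p_z
                py = cpt_y[(x, z)] if y else 1 - cpt_y[(x, z)]
                joint[(x, z, y)] = px * pz * py
    # Marginalize Y out and compare P(X=1, Z=1) with P(X=1) P(Z=1).
    p_x1_z1 = sum(v for (x, z, _), v in joint.items() if x == 1 and z == 1)
    p_x1 = sum(v for (x, _, _), v in joint.items() if x == 1)
    p_z1 = sum(v for (_, z, _), v in joint.items() if z == 1)
    return abs(p_x1_z1 - p_x1 * p_z1)


# Y -> X: independence holds only for finely tuned values of P(X|Y).
tuned = gap_y_causes_x(F(18, 100), F(81, 82), F(1, 2))
perturbed = gap_y_causes_x(F(18, 100), F(9, 10), F(1, 2))

# X -> Y <- Z: independence holds for an arbitrary choice of P(Y | X, Z).
arbitrary_cpt = {(0, 0): F(1, 7), (0, 1): F(3, 5), (1, 0): F(2, 3), (1, 1): F(9, 11)}
collider = gap_collider(F(9, 10), F(1, 3), arbitrary_cpt)
```

In this sketch the tuned Y->X parametrization gives a gap of exactly zero, changing P(X=1|Y=0) from 81/82 to 9/10 makes it nonzero, and the collider gives zero regardless of the conditional table for Y.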

Magdalena Wache 5 Dec 2022 15:20 UTC
1 point
0
in reply to: David Johnston’s comment on: Finite Factored Sets in Pictures
I agree that 1. is unjustified (and would cause lots of problems for graphical causal models if it was).
Interesting, why is that? For any of the outcomes (i.e. 00, 01, 10, and 11), P(W|X,Y) is either 0 or 1 for any variable W that we can observe. So W is deterministic given X and Y for our purposes, right?

If not, do you have an example for a variable W where that’s not the case?
