I’m finishing up my PhD on tensor network algorithms at the University of Queensland, Australia, under Ian McCulloch. I’ve also proposed a new definition of wavefunction branches using quantum circuit complexity.
Predictably, I’m moving into AI safety work. See my post on graphical tensor notation for interpretability. I also attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning / neuroscience internship in 2020–2021, and wrote a post exploring the potential counterfactual impact of AI safety work.
My website: https://sites.google.com/view/jordantensor/
Contact me: jordantensor [at] gmail [dot] com Also see my CV, LinkedIn, or Twitter.
Potential dangers of future evaluations / gain-of-function research, which I’m sure you and Beth are already extremely well aware of:
Falsely evaluating a model as safe (obviously)
Choosing evaluation metrics which don’t give us enough time to react (after an evaluation metric switches from “safe” to “not safe”, we would like to have enough time to recognize this and do something about it before we’re all dead)
Crying wolf too many times, making it more likely that no one will believe you when a danger threshold has really been crossed
Letting your methods for making future AIs scarier be too strong, given the probability that they will be leaked or otherwise made widely accessible (this applies if the methods / tools are difficult to replicate without significant resources)
Letting your methods for making AIs scarier be too weak, making it too easy for bad actors to go much further than you did
Failing to precommit to stopping this research once models are getting scary enough that it’s on balance best to stop making them scarier, even if no one else believes you yet