phd student in comp neuroscience @ mpi brain research frankfurt. https://twitter.com/janhkirchner and https://universalprior.substack.com/
Huh, thanks for spotting that! Yes, should totally be ELK 😀 Fixed it.
This work by Michael Aird and Justin Shovelain might also be relevant: “Using vector fields to visualise preferences and make them consistent”
And I have a post where I demonstrate that reward modeling can extract utility functions from non-transitive preference orderings: “Inferring utility functions from locally non-transitive preferences”
(Extremely cool project ideas btw)
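To make the reward-modeling point concrete, here is a minimal sketch of fitting scalar utilities to pairwise preferences that contain a local cycle. It is not the code from the linked post: it assumes a plain Bradley-Terry (logistic) preference model fitted by gradient descent, and the function name `fit_utilities` and the toy comparison data are made up for illustration.

```python
import numpy as np

def fit_utilities(n_items, comparisons, lr=0.1, n_steps=2000):
    """Fit one scalar utility per item from (winner, loser) index pairs.

    Assumes a Bradley-Terry model: P(a preferred to b) = sigmoid(u[a] - u[b]).
    """
    u = np.zeros(n_items)
    winners = np.array([w for w, _ in comparisons])
    losers = np.array([l for _, l in comparisons])
    for _ in range(n_steps):
        # Model probability that the observed winner is preferred
        p = 1.0 / (1.0 + np.exp(-(u[winners] - u[losers])))
        # Gradient of the negative log-likelihood w.r.t. each utility
        grad = np.zeros(n_items)
        np.add.at(grad, winners, -(1.0 - p))
        np.add.at(grad, losers, 1.0 - p)
        u -= lr * grad / len(comparisons)
        u -= u.mean()  # utilities are only identified up to a constant
    return u

# Hypothetical toy data: item 0 usually beats items 1, 2, 3 (3 wins, 1 loss
# each), while items 1, 2, 3 form a non-transitive cycle 1 > 2 > 3 > 1.
comparisons = []
for j in (1, 2, 3):
    comparisons += [(0, j)] * 3 + [(j, 0)]
comparisons += [(1, 2), (2, 3), (3, 1)]

u = fit_utilities(4, comparisons)
print(u)
```

In this toy setup the fit still returns a consistent utility per item: item 0 ends up on top, and the three members of the cycle end up with roughly equal utilities, since the data give no reason to rank one above the others.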
Hey Ben! :) Thanks for the comment and the careful reading!
Yes, we only added the missing arXiv papers after clustering, but we then repeated the dimensionality reduction and show that the original clustering still holds up even with the new papers (Figure 4, bottom right). I think that’s pretty neat (especially since the dimensionality reduction doesn’t “know” about the clustering), but of course the clusters might look slightly different if we also re-ran k-means on the extended dataset.
There’s an important caveat here:
The visual stimuli are presented 8 degrees over the visual field for 100ms followed by a 100ms grey mask as in a standard rapid serial visual presentation (RSVP) task.
I’d be willing to bet that if you give the macaque more than 100ms they’ll get it right. That’s at least how it is for humans!
(Not trying to shift the goalpost, it’s a cool result! Just pointing at the next step.)
Great points, thanks for the comment! :) I agree that there are potentially some very low-hanging fruits. I could even imagine that some of these methods work better in artificial networks than in biological networks (less noise, more controlled environment).
But I believe one of the major bottlenecks might be that the weights and activations of an artificial neural network are just so difficult to access? Putting the weights and activations of a large model like GPT-3 under the microscope requires impressive hardware (running forward passes, storing the activations, transforming everything into a useful form, …) and then there are so many parameters to look at.
Giving researchers structured access to the model via a research API could solve a lot of those difficulties and seems like something that totally should exist (although there is of course the danger of also accelerating progress on the capabilities side).
Great point! And thanks for the references :)
I’ll change your background to Computational Cognitive Science in the table! (unless you object or think a different field is even more appropriate)
Thank you for the comment and the questions! :)
This is not clear from how we wrote the paper, but we actually do the clustering in the full 768-dimensional space! If you look closely at the clustering plot you can see that the clusters are slightly overlapping. That would be impossible with k-means in 2D, since in that setting membership is determined by distance from the 2D centroid.
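A minimal sketch of why clusters found in the full space can overlap in a 2D plot. Everything here is a stand-in: the data are synthetic 768-dimensional Gaussian blobs rather than the paper's embeddings, the projection is PCA rather than whatever dimensionality reduction the paper uses, and the helpers `kmeans` and `pca_2d` are written for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centroids

def pca_2d(X):
    """Return the data mean and the top-2 principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:2]

# Synthetic stand-in for 768-dimensional document embeddings
rng = np.random.default_rng(1)
k, dim, n_per_cluster = 5, 768, 120
centers = 0.5 * rng.normal(size=(k, dim))
X = np.concatenate([c + rng.normal(size=(n_per_cluster, dim)) for c in centers])

# Cluster in the full space, then project points and centroids to 2D
labels, centroids = kmeans(X, k)
mean, components = pca_2d(X)
X2 = (X - mean) @ components.T
C2 = (centroids - mean) @ components.T

# What a 2D k-means-style assignment would say: nearest projected centroid
d2_plane = ((X2[:, None, :] - C2[None, :, :]) ** 2).sum(axis=2)
labels_2d = d2_plane.argmin(axis=1)

# Fraction of points whose full-space cluster is not the nearest one in 2D
mismatch = float(np.mean(labels != labels_2d))
print(f"fraction of points that look 'misassigned' in 2D: {mismatch:.3f}")
```

Any point counted in `mismatch` sits closer to another cluster's centroid in the plane than to its own, which is the kind of visual overlap that a clustering run directly in 2D could not produce.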
Oh true, I completely overlooked that! (if I keep collecting mistakes like this I’ll soon have enough for a “My mistakes” page)
Yes, good point! I had that in an earlier draft and then removed it for simplicity and for the other argument you’re making!
This sounds right to me! In particular, I just (re-)discovered this old post by Yudkowsky and this newer post by Alex Flint that both go a lot deeper on the topic. I think the optimal control perspective is a nice complement to those posts and if I find the time to look more into this then that work is probably the right direction.
As part of the AI Safety Camp our team is preparing a research report on the state of AI safety! Should be online within a week or two :)
Interesting, I added a note to the text highlighting this! I was not aware of that part of the story at all. That makes it more of a Moloch-example than a “mistaking adversarial for random”-example.
Yes, that’s a pretty fair interpretation! The macroscopic/folk psychology notion of “surprise” of course doesn’t map super cleanly onto the information-theoretic notion. But I tend to think of it as: there is a certain “expected surprise” about what future possible states might look like if everything evolves “as usual”, I_p([x_1, …, x_N]). And then there is the (usually larger) “additional surprise” about the states that the AI might steer us into, I_ξ([x_1, …, x_N]). The delta between those two is the “excess surprise” that the AI needs to be able to bring about.
It’s tricky to come up with a straightforward setting where the actions of the AI can be measured in nats, but perhaps the following works as an intuition pump: “If we give the AI full, unrestricted access to a control panel that controls the universe, how many operations does it have to perform to bring about the catastrophic event?”. That’s clearly still not well defined (there is no obvious/privileged way the panel should look), but it shows that 1) the “excess surprise” is a lower bound (we wouldn’t usually give the AI unrestricted access to that panel) and 2) the minimum number of operations required to bring about a catastrophic event is probably still larger than 1.
Thank you for your comment! You are right, these things are not clear from this post at all and I did not do a good job of clarifying that. I’m a bit low on time atm, but hopefully I’ll be able to make some edits to the post to set the expectations for the reader more carefully.
The short answer to your question is: Yep, X is the space of events. In Vanessa’s post it has to be compact and metric; I’m simplifying this to an interval in R. And P+/P− can be derived from PHg by plugging in g=0 and replacing the measure m(A) by the Lebesgue integral ∫_A dm. I have scattered notes where I derive the equations in this post. But it was clear to me that if I wanted to do this rigorously in the post, I’d have to introduce an annoying amount of measure theory and the post would turn into a slog. So I decided to do things hand-wavy, but went a bit too hard in that direction.