Link to Rob Bensinger’s comments on this market:
Joel Burget
Paul Christiano named as US AI Safety Institute Head of AI Safety
Highlights from Lex Fridman’s interview of Yann LeCun
[Question] How to Model the Future of Open-Source LLMs?
Anthropic’s SoLU (Softmax Linear Unit)
Two points:
The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome. The true target is approximately infinitely small, and attempts to hit it slide off as optimization pressure is applied. I’m curious whether others share this model and whether it’s been refined / explored in more detail.
The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bit smarter than humans but not yet dangerous. (An implicit assumption here may be that intelligence continues to develop in straight lines.) The sharp left turn model implies this sweet spot will pass by in the blink of an eye. (An implicit assumption here may be that there are discrete leaps.) Interesting to note that Nate explicitly says RSI (recursive self-improvement) is not a core part of his model. I’d like to see more arguments on both sides of this debate.
I worry that this is conflating two possible meanings of FLOPS:
(1) Floating Point Operations (FLOPs)
(2) Floating Point Operations per Second (maybe FLOPs/s is clearer?)
The AI and Memory Wall data is using (1) while the Sandberg / Bostrom paper is using (2) (see the definition in Appendix F).
(I noticed a type error when thinking about comparing real-time brain emulation vs training).
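To make the type error concrete, here’s a toy dimensional check (all numbers made up, purely to illustrate the units): a FLOP/s figure only becomes comparable to a FLOPs figure after you multiply it by a duration.

```python
# Toy dimensional check; every number here is hypothetical.
emulation_rate = 1e16      # FLOP/s: a *rate*, the kind of quantity an emulation estimate gives
seconds_per_year = 3.15e7  # roughly the number of seconds in a year

# A rate can't be compared to a training budget (a *total* operation count)
# until it's integrated over some duration:
one_year_of_emulation = emulation_rate * seconds_per_year  # ~3e23 FLOPs (total)

training_budget = 1e24     # FLOPs: a *total*, the kind of quantity a training-compute estimate gives

print(f"{one_year_of_emulation:.1e} FLOPs of emulation vs {training_budget:.1e} FLOPs of training")
```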
Meta-comment: I’m happy to see this—someone knowledgeable, who knows and seriously engages with the standard arguments, willing to question the orthodox answer (which some might fear would make them look silly). I think this is a healthy dynamic and I hope to see more of it.
By the point your AI can design, say, working nanotech, I’d expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I’d also expect it to be able to build models of its operators and conceive of deep strategies involving them.
This assumes the AI learns all of these tasks at the same time. I’m hopeful that we could build a narrowly superhuman task AI which is capable of e.g. designing nanotech while being at or below human level for the other tasks you mentioned (and ~all other dangerous tasks you didn’t).
Superhuman ability at nanotech alone may be sufficient for carrying out a pivotal act, though maybe not sufficient for other relevant strategic concerns.
Interpretability isn’t Free
Are there plans to release the software used in this analysis or will it remain proprietary? How does it scale to larger networks?
This provides an excellent explanation for why deep networks are useful (the number of polytopes can grow exponentially with depth).
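One standard way to see the exponential growth (my own 1-D sketch, not the construction from the post): compose a ReLU “tent” layer with itself. Each extra layer doubles the number of linear pieces, so depth d yields 2^d regions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    # One tiny ReLU "layer" on [0, 1]: 2x on [0, 0.5], 2 - 2x on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(f, n_samples=200_001):
    # Count linear pieces on [0, 1] by counting points where the finite-difference slope changes.
    x = np.linspace(0.0, 1.0, n_samples)
    slopes = np.diff(f(x)) / np.diff(x)
    return int(np.sum(~np.isclose(slopes[1:], slopes[:-1]))) + 1

f = lambda x: x
for depth in range(1, 6):
    f = (lambda g: (lambda x: tent(g(x))))(f)  # stack one more tent layer
    print(depth, count_linear_pieces(f))
# depth 1 -> 2 pieces, depth 2 -> 4, depth 3 -> 8, ... doubling with each layer
```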
“We’re not completely sure why polytope boundaries tend to lie in a shell, though we suspect that it’s likely related to the fact that, in high dimensional spaces, most of the hypervolume of a hypersphere is close to the surface.” I’m picturing a unit hypersphere where most of the volume lies in, e.g., the radial shell [0.95, 1]. But why would polytope boundaries not simply extend further out?
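For what it’s worth, the volume fact itself is easy to check (my own back-of-the-envelope, not from the post): an n-ball’s volume scales as r^n, so the fraction of a unit ball’s volume within distance eps of the surface is 1 - (1 - eps)^n, which goes to 1 very quickly as the dimension grows.

```python
# Fraction of a unit n-ball's volume lying within distance eps of its surface.
# Volume scales as r**n, so the inner ball of radius (1 - eps) holds (1 - eps)**n
# of the total volume and the outer shell holds the remainder.
def shell_fraction(n: int, eps: float) -> float:
    return 1.0 - (1.0 - eps) ** n

for n in (3, 64, 512):
    print(n, shell_fraction(n, eps=0.05))
# n = 3   -> ~0.14
# n = 64  -> ~0.96
# n = 512 -> ~1 - 4e-12
```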
A better mental model (and visualizations) for how NNs work. Understanding when data is off-distribution. New methods for finding and understanding adversarial examples. This is really exciting work.
Chesterton’s Fence vs The Onion in the Varnish
[Question] The two missing core reasons why aligning at-least-partially superhuman AGI is hard
Though the statement doesn’t say much, the list of signatories is impressively comprehensive. The only conspicuously missing names that immediately come to mind are Dean and LeCun (I don’t know if they were asked to sign).
Thiel’s arguments about both the Vulnerable World Hypothesis and Death with Dignity were so (uncharacteristically?) shallow that I had to question whether he actually believes what he said, or was just making an argument he thought would be popular with the audience. I don’t know enough about his views to say, but my guess is that the latter is somewhat (20%+) likely.
How would you distinguish between weak and strong methods?
From the latest Conversations with Tyler interview of Peter Thiel
I feel like Thiel misrepresents Bostrom here. Bostrom doesn’t really want a centralized world government or think that’s “a set of things that make sense and that are good”. He’s forced into world surveillance not because it’s good but because it’s the only alternative he sees to dangerous ASI being deployed.
I wouldn’t say he’s optimistic about human nature. In fact it’s almost the very opposite. He thinks that we’re doomed by our nature to create that which will destroy us.
Subcortical reinforcement circuits, though, hail from a distinct informational world… and so have to reinforce computations “blindly,” relying only on simple sensory proxies.
This seems to be pointing in an interesting direction that I’d like to see expanded.
Because your subcortical reward circuitry was hardwired by your genome, it’s going to be quite bad at accurately assigning credit to shards.
I don’t know, I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of? Cognitive biases in general?
if shard theory is true, meaningful partial alignment successes are possible
“if shard theory is true”—is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?
Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot
What’s to stop the human shards from being dominated and extinguished by the non-human shards? I.e., is there reason to expect equilibrium?
Thank you for mentioning Gödel Without Too Many Tears, which I bought based on this recommendation. It’s a lovely little book. I didn’t expect it to be nearly so engrossing.
Previous related exploration: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights