Thanks Pattern—I’ve taken your advice and updated the title.
Joel Burget
Chesterton’s Fence vs The Onion in the Varnish
[Question] The two missing core reasons why aligning at-least-partially superhuman AGI is hard
Academics not willing to leave their jobs might still be interested in working on a problem part-time. One could imagine that the right researcher working part-time might be more effective than the wrong researcher full time.
Can’t answer the second question, but see https://www.gwern.net/Scaling-hypothesis for the first.
Thank you for mentioning Gödel Without Too Many Tears, which I bought based on this recommendation. It’s a lovely little book. I didn’t expect it to be nearly so engrossing.
If you’re interested in following up on John’s comments on financial markets, nonexistence of a representative agent, and path dependence, he speaks more about them in his post, Why Subagents?
In practice, path-dependent preferences mostly matter for systems with “hidden state”: internal variables which can change in response to the system’s choices. A great example of this is financial markets: they’re the ur-example of efficiency and inexploitability, yet it turns out that a market does not have a utility function in general (economists call this “nonexistence of a representative agent”). The reason is that the distribution of wealth across the market’s agents functions as an internal hidden variable. Depending on what path the market follows, different internal agents end up with different amounts of wealth, and the market as a whole will hold different portfolios as a result—even if the externally-visible variables, i.e. prices, end up the same.
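To make the hidden-state point concrete, here’s a toy sketch of my own (not from John’s post): two threshold traders who buy or sell one share against an outside party, run over two price paths that visit the same prices in different orders and end at the same price.

```python
# A toy illustration of path-dependent hidden state in a "market" of two agents.
# Each agent buys a share when the price is below its private valuation and
# sells a share when the price is above it (trading against an outside party).

def run(prices, valuations):
    agents = [{"cash": 100.0, "shares": 0} for _ in valuations]
    for p in prices:
        for agent, v in zip(agents, valuations):
            if p < v and agent["cash"] >= p:      # buy one share
                agent["cash"] -= p
                agent["shares"] += 1
            elif p > v and agent["shares"] > 0:   # sell one share
                agent["cash"] += p
                agent["shares"] -= 1
    return agents

valuations = (50, 70)   # hypothetical private valuations
path_a = [40, 80, 60]   # both paths end at the same price (60)
path_b = [80, 40, 60]

print(run(path_a, valuations))  # [{'cash': 140.0, 'shares': 0}, {'cash': 80.0, 'shares': 1}]
print(run(path_b, valuations))  # [{'cash': 120.0, 'shares': 0}, {'cash': 0.0, 'shares': 2}]
```

Same final price, but the wealth distribution across the agents (and the market’s aggregate holdings) differs by path; that distribution is exactly the hidden internal variable that prevents the market from having a path-independent utility function.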
The Soviet nail factory that’s always used to illustrate Goodhart’s law… did it actually exist? There are some good answers on the Skeptics StackExchange: https://skeptics.stackexchange.com/questions/22375/did-a-soviet-nail-factory-produce-useless-nails-to-improve-metrics
By the time your AI can design, say, working nanotech, I’d expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I’d also expect it to be able to build models of its operators and conceive of deep strategies involving them.
This assumes the AI learns all of these tasks at the same time. I’m hopeful that we could build a narrowly superhuman task AI which is capable of e.g. designing nanotech while remaining at or below human level on the other tasks you mentioned (and ~all other dangerous tasks you didn’t).
Superhuman ability at nanotech alone may be sufficient for carrying out a pivotal act, though maybe not sufficient for other relevant strategic concerns.
Two points:
The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome: the true target is approximately infinitely small, and attempts to hit it slide off as optimization pressure is applied. I’m curious whether others share this model and whether it’s been refined / explored in more detail.
The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bit smarter than humans but not yet dangerous. (An implicit assumption here may be that intelligence continues to develop in straight lines.) The sharp left turn model implies this sweet spot will pass by in the blink of an eye. (An implicit assumption here may be that there are discrete leaps.) Interesting to note that Nate explicitly says RSI is not a core part of his model. I’d like to see more arguments on both sides of this debate.
Human values aren’t a repeller, but they’re a very narrow target to hit.
As optimization pressure is applied the AI becomes more capable. In particular, it will develop a more detailed model of people and their values. So it seems to me there is actually a basin around schemes like CEV which course-correct towards true human values.
This of course doesn’t help with corrigibility.
Gwern often posts to https://www.reddit.com/r/mlscaling/ as well.
Anthropic’s SoLU (Softmax Linear Unit)
Interpretability isn’t Free
Typo:
In total, this game has a coco-value of (145, 95), which would be realized by Alice selling at the beach, Bob selling at the airport, and Alice transferring 55 to Bob.
I believe the transfer should be 25.
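To spell out the arithmetic (writing the raw payoffs abstractly, since I’m not reproducing the game’s payoff table here): if Alice selling at the beach while Bob sells at the airport yields raw payoffs $(a, b)$, then a transfer $t$ from Alice to Bob realizes the coco-value exactly when

$(a - t,\ b + t) = (145,\ 95)$, i.e. $t = a - 145 = 95 - b$.

Transfers preserve the total, so this requires $a + b = 240$; plugging in the payoffs from the post is how I get a transfer of 25 rather than 55.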
Subcortical reinforcement circuits, though, hail from a distinct informational world… and so have to reinforce computations “blindly,” relying only on simple sensory proxies.
This seems to be pointing in an interesting direction that I’d like to see expanded.
Because your subcortical reward circuitry was hardwired by your genome, it’s going to be quite bad at accurately assigning credit to shards.
I don’t know; I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of? Cognitive biases in general?
if shard theory is true, meaningful partial alignment successes are possible
“if shard theory is true”—is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?
Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot
What’s to stop the human shards from being dominated and extinguished by the non-human shards? I.e., is there reason to expect an equilibrium?
Meta-comment: I’m happy to see this—someone knowledgeable, who knows and seriously engages with the standard arguments, willing to question the orthodox answer (which some might fear would make them look silly). I think this is a healthy dynamic and I hope to see more of it.
Do you happen to know how this compares with https://github.com/BlinkDL/RWKV-LM which is described as an RNN with performance comparable to a transformer / linear attention?
Are there plans to release the software used in this analysis or will it remain proprietary? How does it scale to larger networks?
This provides an excellent explanation for why deep networks are useful (the number of polytopes grows exponentially with depth).
“We’re not completely sure why polytope boundaries tend to lie in a shell, though we suspect that it’s likely related to the fact that, in high dimensional spaces, most of the hypervolume of a hypersphere is close to the surface.” I’m picturing a unit hypersphere where most of the volume is in, e.g., the [0.95,1] region. But why would polytope boundaries not simply extend further out?
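To make the “most of the hypervolume is close to the surface” claim concrete, here’s a quick back-of-envelope check of my own (assuming a unit n-ball): the volume inside radius $r$ scales as $r^n$, so the fraction sitting in the thin outer shell $[0.95, 1]$ is $1 - 0.95^n$.

```python
# Fraction of a unit n-ball's volume lying in the outer shell r in [0.95, 1].
# The volume inside radius r scales as r**n, so the inner ball holds 0.95**n of it.
for n in (2, 10, 100, 1000):
    print(f"n={n}: {1 - 0.95 ** n:.3f}")
# n=2: 0.098, n=10: 0.401, n=100: 0.994, n=1000: 1.000
```

So in high dimensions essentially all of the volume is within a few percent of the surface, which makes the observed shell unsurprising, though it doesn’t by itself answer why the boundaries couldn’t extend further out.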
- A better mental model (and visualizations) for how NNs work
- Understanding when data is off-distribution
- New methods for finding and understanding adversarial examples

This is really exciting work.
Funny, this is exactly what I was trying to argue for (section 4 explicitly says “Really, both anecdotes teach us the same thing”). Trying to think how I can make this clearer.