Thanks Pattern—I’ve taken your advice and updated the title.
Joel Burget
Chesterton’s Fence vs The Onion in the Varnish
[Question] The two missing core reasons why aligning at-least-partially superhuman AGI is hard
Academics not willing to leave their jobs might still be interested in working on a problem part-time. One could imagine that the right researcher working part-time might be more effective than the wrong researcher full time.
Can’t answer the second question, but see https://www.gwern.net/Scaling-hypothesis for the first.
Thank you for mentioning Gödel Without Too Many Tears, which I bought based on this recommendation. It’s a lovely little book. I didn’t expect it to be nearly so engrossing.
If you’re interested in following up on John’s comments on financial markets, nonexistence of a representative agent, and path dependence, he speaks more about them in his post, Why Subagents?
In practice, path-dependent preferences mostly matter for systems with “hidden state”: internal variables which can change in response to the system’s choices. A great example of this is financial markets: they’re the ur-example of efficiency and inexploitability, yet it turns out that a market does not have a utility function in general (economists call this “nonexistence of a representative agent”). The reason is that the distribution of wealth across the market’s agents functions as an internal hidden variable. Depending on what path the market follows, different internal agents end up with different amounts of wealth, and the market as a whole will hold different portfolios as a result—even if the externally-visible variables, i.e. prices, end up the same.
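To make the hidden-state point concrete, here’s a toy sketch of my own (not from John’s post): two threshold traders who buy or sell one share against an outside party, run over two price paths that visit the same prices in different orders and end at the same price.

```python
# A toy illustration of path-dependent hidden state in a "market" of two agents.
# Each agent buys a share when the price is below its private valuation and
# sells a share when the price is above it (trading against an outside party).

def run(prices, valuations):
    agents = [{"cash": 100.0, "shares": 0} for _ in valuations]
    for p in prices:
        for agent, v in zip(agents, valuations):
            if p < v and agent["cash"] >= p:      # buy one share
                agent["cash"] -= p
                agent["shares"] += 1
            elif p > v and agent["shares"] > 0:   # sell one share
                agent["cash"] += p
                agent["shares"] -= 1
    return agents

valuations = (50, 70)   # hypothetical private valuations
path_a = [40, 80, 60]   # both paths end at the same price (60)
path_b = [80, 40, 60]

print(run(path_a, valuations))  # [{'cash': 140.0, 'shares': 0}, {'cash': 80.0, 'shares': 1}]
print(run(path_b, valuations))  # [{'cash': 120.0, 'shares': 0}, {'cash': 0.0, 'shares': 2}]
```

Same final price, but the wealth distribution across the agents (and the market’s aggregate holdings) differs by path; that distribution is exactly the hidden internal variable that prevents the market from having a path-independent utility function.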
The Soviet nail factory that’s always used to illustrate Goodhart’s law… did it actually exist? There are some good answers on the Skeptics StackExchange: https://skeptics.stackexchange.com/questions/22375/did-a-soviet-nail-factory-produce-useless-nails-to-improve-metrics
By the time your AI can design, say, working nanotech, I’d expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I’d also expect it to be able to build models of its operators and conceive of deep strategies involving them.
This assumes the AI learns all of these tasks at the same time. I’m hopeful that we could build a narrowly superhuman task AI which is capable of e.g. designing nanotech while remaining at or below human level on the other tasks you mentioned (and ~all other dangerous tasks you didn’t).
Superhuman ability at nanotech alone may be sufficient for carrying out a pivotal act, though maybe not sufficient for other relevant strategic concerns.
Two points:
The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome: the true target is approximately infinitely small, and attempts to hit it slide off as optimization pressure is applied. I’m curious whether others share this model and whether it’s been refined / explored in more detail.
The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bit smarter than humans but not yet dangerous. (An implicit assumption here may be that intelligence continues to develop in straight lines.) The sharp left turn model implies this sweet spot will pass by in the blink of an eye. (An implicit assumption here may be that there are discrete leaps.) Interesting to note that Nate explicitly says RSI is not a core part of his model. I’d like to see more arguments on both sides of this debate.
Human values aren’t a repeller, but they’re a very narrow target to hit.
As optimization pressure is applied the AI becomes more capable. In particular, it will develop a more detailed model of people and their values. So it seems to me there is actually a basin around schemes like CEV which course-correct towards true human values.
This of course doesn’t help with corrigibility.
Gwern often posts to https://www.reddit.com/r/mlscaling/ as well.
Anthropic’s SoLU (Softmax Linear Unit)
Interpretability isn’t Free
Typo:
In total, this game has a coco-value of (145, 95), which would be realized by Alice selling at the beach, Bob selling at the airport, and Alice transferring 55 to Bob.
I believe the transfer should be 25.
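To spell out the arithmetic (writing the raw payoffs abstractly, since I’m not reproducing the game’s payoff table here): if Alice selling at the beach while Bob sells at the airport yields raw payoffs $(a, b)$, then a transfer $t$ from Alice to Bob realizes the coco-value exactly when

$(a - t,\ b + t) = (145,\ 95)$, i.e. $t = a - 145 = 95 - b$.

Transfers preserve the total, so this requires $a + b = 240$; plugging in the payoffs from the post is how I get a transfer of 25 rather than 55.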
Subcortical reinforcement circuits, though, hail from a distinct informational world… and so have to reinforce computations “blindly,” relying only on simple sensory proxies.
This seems to be pointing in an interesting direction that I’d like to see expanded.
Because your subcortical reward circuitry was hardwired by your genome, it’s going to be quite bad at accurately assigning credit to shards.
I don’t know; I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of? Cognitive biases in general?
if shard theory is true, meaningful partial alignment successes are possible
“if shard theory is true”—is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?
Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot
What’s to stop the human shards from being dominated and extinguished by the non-human shards? I.e., is there reason to expect an equilibrium?
Meta-comment: I’m happy to see this—someone knowledgeable, who knows and seriously engages with the standard arguments, willing to question the orthodox answer (which some might fear would make them look silly). I think this is a healthy dynamic and I hope to see more of it.
Do you happen to know how this compares with https://github.com/BlinkDL/RWKV-LM which is described as an RNN with performance comparable to a transformer / linear attention?
Are there plans to release the software used in this analysis or will it remain proprietary? How does it scale to larger networks?
This provides an excellent explanation for why deep networks are useful (the number of polytopes grows exponentially with depth).
“We’re not completely sure why polytope boundaries tend to lie in a shell, though we suspect that it’s likely related to the fact that, in high dimensional spaces, most of the hypervolume of a hypersphere is close to the surface.” I’m picturing a unit hypersphere where most of the volume is in, e.g., the [0.95,1] region. But why would polytope boundaries not simply extend further out?
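To make the “most of the hypervolume is close to the surface” claim concrete, here’s a quick back-of-envelope check of my own (assuming a unit n-ball): the volume inside radius $r$ scales as $r^n$, so the fraction sitting in the thin outer shell $[0.95, 1]$ is $1 - 0.95^n$.

```python
# Fraction of a unit n-ball's volume lying in the outer shell r in [0.95, 1].
# The volume inside radius r scales as r**n, so the inner ball holds 0.95**n of it.
for n in (2, 10, 100, 1000):
    print(f"n={n}: {1 - 0.95 ** n:.3f}")
# n=2: 0.098, n=10: 0.401, n=100: 0.994, n=1000: 1.000
```

So in high dimensions essentially all of the volume is within a few percent of the surface, which makes the observed shell unsurprising, though it doesn’t by itself answer why the boundaries couldn’t extend further out.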
- A better mental model (and visualizations) for how NNs work
- Understanding when data is off-distribution
- New methods for finding and understanding adversarial examples

This is really exciting work.
Funny, this is exactly what I was trying to argue for (section 4 explicitly says “Really, both anecdotes teach us the same thing”). Trying to think how I can make this clearer.