Link to Rob Bensinger’s comments on this market:
Joel Burget
Paul Christiano named as US AI Safety Institute Head of AI Safety
Highlights from Lex Fridman’s interview of Yann LeCun
[Question] How to Model the Future of Open-Source LLMs?
Anthropic’s SoLU (Softmax Linear Unit)
Two points:
The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome. The true target is approximately infinitely small, and attempts to hit it slide off as optimization pressure is applied. I’m curious whether others share this model and whether it’s been refined / explored in more detail.
The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bit smarter than humans but not yet dangerous. (An implicit assumption here may be that intelligence continues to develop in straight lines.) The sharp left turn model implies this sweet spot will pass by in the blink of an eye. (An implicit assumption here may be that there are discrete leaps.) Interesting to note that Nate explicitly says RSI (recursive self-improvement) is not a core part of his model. I’d like to see more arguments on both sides of this debate.
I worry that this is conflating two possible meanings of FLOPS:
(1) Floating Point Operations (FLOPs)
(2) Floating Point Operations per Second (maybe FLOPs/s is clearer?)
The AI and Memory Wall data is using (1) while the Sandberg / Bostrom paper is using (2) (see the definition in Appendix F).
(I noticed a type error when thinking about comparing real-time brain emulation vs training).
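To make the type error concrete, here’s a toy dimensional check (all numbers made up, purely to illustrate the units): a FLOP/s figure only becomes comparable to a FLOPs figure after you multiply it by a duration.

```python
# Toy dimensional check; every number here is hypothetical.
emulation_rate = 1e16      # FLOP/s: a *rate*, the kind of quantity an emulation estimate gives
seconds_per_year = 3.15e7  # roughly the number of seconds in a year

# A rate can't be compared to a training budget (a *total* operation count)
# until it's integrated over some duration:
one_year_of_emulation = emulation_rate * seconds_per_year  # ~3e23 FLOPs (total)

training_budget = 1e24     # FLOPs: a *total*, the kind of quantity a training-compute estimate gives

print(f"{one_year_of_emulation:.1e} FLOPs of emulation vs {training_budget:.1e} FLOPs of training")
```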
Meta-comment: I’m happy to see this—someone knowledgeable, who knows and seriously engages with the standard arguments, willing to question the orthodox answer (which some might fear would make them look silly). I think this is a healthy dynamic and I hope to see more of it.
By the point your AI can design, say, working nanotech, I’d expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I’d also expect it to be able to build models of its operators and conceive of deep strategies involving them.
This assumes the AI learns all of these tasks at the same time. I’m hopeful that we could build a narrowly superhuman task AI which is capable of e.g. designing nanotech while being at or below human level for the other tasks you mentioned (and ~all other dangerous tasks you didn’t).
Superhuman ability at nanotech alone may be sufficient for carrying out a pivotal act, though maybe not sufficient for other relevant strategic concerns.
Interpretability isn’t Free
Are there plans to release the software used in this analysis or will it remain proprietary? How does it scale to larger networks?
This provides an excellent explanation for why deep networks are useful (the number of polytopes can grow exponentially with depth).
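One standard way to see the exponential growth (my own 1-D sketch, not the construction from the post): compose a ReLU “tent” layer with itself. Each extra layer doubles the number of linear pieces, so depth d yields 2^d regions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    # One tiny ReLU "layer" on [0, 1]: 2x on [0, 0.5], 2 - 2x on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(f, n_samples=200_001):
    # Count linear pieces on [0, 1] by counting points where the finite-difference slope changes.
    x = np.linspace(0.0, 1.0, n_samples)
    slopes = np.diff(f(x)) / np.diff(x)
    return int(np.sum(~np.isclose(slopes[1:], slopes[:-1]))) + 1

f = lambda x: x
for depth in range(1, 6):
    f = (lambda g: (lambda x: tent(g(x))))(f)  # stack one more tent layer
    print(depth, count_linear_pieces(f))
# depth 1 -> 2 pieces, depth 2 -> 4, depth 3 -> 8, ... doubling with each layer
```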
“We’re not completely sure why polytope boundaries tend to lie in a shell, though we suspect that it’s likely related to the fact that, in high dimensional spaces, most of the hypervolume of a hypersphere is close to the surface.” I’m picturing a unit hypersphere where most of the volume lies in, e.g., the radial shell [0.95, 1]. But why would polytope boundaries not simply extend further out?
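For what it’s worth, the volume fact itself is easy to check (my own back-of-the-envelope, not from the post): an n-ball’s volume scales as r^n, so the fraction of a unit ball’s volume within distance eps of the surface is 1 - (1 - eps)^n, which goes to 1 very quickly as the dimension grows.

```python
# Fraction of a unit n-ball's volume lying within distance eps of its surface.
# Volume scales as r**n, so the inner ball of radius (1 - eps) holds (1 - eps)**n
# of the total volume and the outer shell holds the remainder.
def shell_fraction(n: int, eps: float) -> float:
    return 1.0 - (1.0 - eps) ** n

for n in (3, 64, 512):
    print(n, shell_fraction(n, eps=0.05))
# n = 3   -> ~0.14
# n = 64  -> ~0.96
# n = 512 -> ~1 - 4e-12
```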
A better mental model (and visualizations) for how NNs work. Understanding when data is off-distribution. New methods for finding and understanding adversarial examples. This is really exciting work.
Chesterton’s Fence vs The Onion in the Varnish
[Question] The two missing core reasons why aligning at-least-partially superhuman AGI is hard
Though the statement doesn’t say much, the list of signatories is impressively comprehensive. The only conspicuously missing names that immediately come to mind are Dean and LeCun (I don’t know if they were asked to sign).
Thiel’s arguments about both the Vulnerable World Hypothesis and Death with Dignity were so (uncharacteristically?) shallow that I had to question whether he actually believes what he said, or was just making an argument he thought would be popular with the audience. I don’t know enough about his views to say, but my guess is that the latter is somewhat (20%+) likely.
How would you distinguish between weak and strong methods?
From the latest Conversations with Tyler interview of Peter Thiel
I feel like Thiel misrepresents Bostrom here. Bostrom doesn’t really want a centralized world government or think that’s “a set of things that make sense and that are good”. He’s forced into world surveillance not because it’s good but because it’s the only alternative he sees to dangerous ASI being deployed.
I wouldn’t say he’s optimistic about human nature. In fact it’s almost the very opposite. He thinks that we’re doomed by our nature to create that which will destroy us.
Subcortical reinforcement circuits, though, hail from a distinct informational world… and so have to reinforce computations “blindly,” relying only on simple sensory proxies.
This seems to be pointing in an interesting direction that I’d like to see expanded.
Because your subcortical reward circuitry was hardwired by your genome, it’s going to be quite bad at accurately assigning credit to shards.
I don’t know, I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of? Cognitive biases in general?
if shard theory is true, meaningful partial alignment successes are possible
“if shard theory is true”—is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?
Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot
What’s to stop the human shards from being dominated and extinguished by the non-human shards? I.e., is there reason to expect equilibrium?
Thank you for mentioning Gödel Without Too Many Tears, which I bought based on this recommendation. It’s a lovely little book. I didn’t expect it to be nearly so engrossing.
Previous related exploration: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights