This is fantastic. Thank you.
Thanks! I added a note about LeCun’s 100,000 claim and just dropped the Chollet reference since it was misleading.
Thanks for the correction! I’ve updated the post.
Highlights from Lex Fridman’s interview of Yann LeCun
I assume the ~44,000 ppm (~4.4%) CO2 in exhaled air is the product of respiration (i.e., the lungs have processed it), whereas the air used in mouth-to-mouth is quickly inhaled and exhaled.
What’s your best guess for what percentage of cells (in the brain) receive edits?
Are edits somehow targeted at brain cells in particular or do they run throughout the body?
I don’t have a well-reasoned opinion here, but I’m interested in hearing from those who disagree.
How would you distinguish between weak and strong methods?
Re Na:K: potassium chloride is used as a salt substitute (and tastes surprisingly like regular salt). This makes it really easy to tweak the Na:K ratio (if it turns out to be important). OTOH, that’s some evidence the ratio isn’t important; otherwise I’d expect someone to have noticed that people lose weight when they substitute it for table salt.
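For concreteness, a back-of-the-envelope sketch (the molar masses are real; the substitution fractions are made-up knobs) of how swapping in KCl moves the Na:K mass ratio:

```python
# Na:K mass ratio when a fraction f of table salt (NaCl), by mass, is
# replaced with KCl. Molar masses: Na 23.0, K 39.1, Cl 35.45 g/mol.
NA_FRAC_NACL = 23.0 / (23.0 + 35.45)   # ~39% of NaCl is sodium by mass
K_FRAC_KCL = 39.1 / (39.1 + 35.45)     # ~52% of KCl is potassium by mass

def na_to_k_ratio(f: float) -> float:
    """Na:K mass ratio when fraction f of the salt is KCl."""
    return ((1 - f) * NA_FRAC_NACL) / (f * K_FRAC_KCL)

for f in (0.25, 0.5, 0.75):
    print(f"{f:.0%} KCl -> Na:K ≈ {na_to_k_ratio(f):.2f}")
```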
We don’t hear much about Apple in AI, so I’m curious why you rank them as so important.
Though the statement doesn’t say much, the list of signatories is impressively comprehensive. The only conspicuously missing names that immediately come to mind are Dean and LeCun (I don’t know whether they were asked to sign).
I have a couple of basic questions:
Shouldn’t the diagonal elements in the perplexity table all equal the baseline (since the vector being added should be 0)?
I’m a bit confused about the use of perplexity here. The added vector introduces bias (away from one digit and towards another), so isn’t it unsurprising that perplexity increases? Eyeballing the visualizations, they do all seem to shift mass away from `b` and towards `a`.
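(For reference, the perplexity I have in mind is the standard exponentiated mean negative log-likelihood:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$

On that definition, adding the zero vector leaves the logits, and hence the perplexity, exactly at baseline, and any vector that systematically moves probability mass off the correct next token must increase it.)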
Link to Rob Bensinger’s comments on this market:
I worry that this is conflating two possible meanings of FLOPS:
1. Floating Point Operations (FLOPs)
2. Floating Point Operations per Second (maybe FLOPs/s is clearer?)
The AI and Memory Wall data is using (1) while the Sandberg / Bostrom paper is using (2) (see the definition in Appendix F).
(I noticed a type error when thinking about comparing real-time brain emulation vs training).
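A toy calculation of where the type error bites; the numbers are made up for illustration and not taken from either source:

```python
# A rate (FLOP/s) and a count (FLOP) can't be compared directly:
# the rate needs a duration before the units match.
brain_flop_per_s = 1e18   # hypothetical brain-emulation cost, FLOP/s (a rate)
training_flop = 1e24      # hypothetical training budget, FLOP (a count)

seconds_per_year = 365 * 24 * 3600
one_year_of_emulation_flop = brain_flop_per_s * seconds_per_year  # now a count

print(f"One year of emulation: {one_year_of_emulation_flop:.2e} FLOP")
print(f"Training budget:       {training_flop:.2e} FLOP")
```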
One more, related to your first point: I wouldn’t expect all mesaoptimizers to have the same signature, since they could take very different forms. What does the distribution of mesaoptimizer signatures look like? How likely is it that a novel (undetectable) mesaoptimizer arises in training?
> As far as we are aware, GPT-4 will use the same 50,257 tokens as its two most recent predecessors.
I suspect it’ll have more. OpenAI recently released https://github.com/openai/tiktoken. This includes “cl100k_base” with ~100k tokens.
The capabilities case for this is that GPT-{2,3} seem to be somewhat hobbled by their tokenizer, at least when it comes to arithmetic. But cl100k_base has exactly 1110 tokens which are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1000 three-digit tokens (none with preceding spaces).
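A quick sketch to check this, assuming tiktoken is installed (some ids in the range have no assigned token, hence the try/except):

```python
from collections import Counter
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

digit_tokens = []
for token_id in range(enc.n_vocab):
    try:
        b = enc.decode_single_token_bytes(token_id)
    except KeyError:
        continue  # id with no assigned token
    if b.isdigit():  # True only for non-empty, all-ASCII-digit byte strings
        digit_tokens.append(b)

print(len(digit_tokens))                      # expect 1110
print(Counter(len(b) for b in digit_tokens))  # expect {3: 1000, 2: 100, 1: 10}
```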
Previous related exploration: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights
My best guess is that this crowded spot in embedding space is a sort of wastebasket for tokens that show up in machine-readable files but aren’t useful to the model for some reason. Possibly, these are tokens that are common in the corpus used to create the tokenizer, but not in the WebText training corpus. The oddly-specific tokens related to Puzzle & Dragons, Nature Conservancy, and David’s Bridal webpages suggest that BPE may have been run on a sample of web text that happened to have those websites overrepresented, and GPT-2 is compensating for this by shoving all the tokens it doesn’t find useful in the same place.
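A rough sketch of how one might look for that crowded spot (this assumes the HuggingFace transformers library, and “closest to the mean embedding” is my own operationalization, not the linked post’s):

```python
import torch
from transformers import GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

W = model.wte.weight.detach()  # (50257, 768) token embedding matrix
centroid = W.mean(dim=0)

# Cosine similarity of every token's embedding to the centroid; tokens in a
# tight central cluster should score highest.
sims = torch.nn.functional.cosine_similarity(W, centroid.unsqueeze(0), dim=1)
top_ids = sims.topk(20).indices.tolist()
print([tokenizer.decode([i]) for i in top_ids])
```

If the wastebasket story is right, the tokens this prints should look like the oddly specific ones above.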
Thiel’s arguments about both the Vulnerable World Hypothesis and Death with Dignity were so (uncharacteristically?) shallow that I had to question whether he actually believes what he said or was just making an argument he thought would be popular with the audience. I don’t know enough about his views to say, but my guess is that the latter is somewhat (20%+) likely.
How is changing to an orthonormal basis importantly different from just any change of basis?
What exactly do you have in mind here?
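(The one difference I know of, in case it’s what you mean: an orthonormal change of basis preserves inner products, and hence norms and angles, while a general invertible one need not. A minimal numpy illustration:)

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(3)

# Orthonormal change of basis: Q has orthonormal columns, so Q.T @ Q = I.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))

# Generic change of basis: almost surely invertible, but not orthogonal.
M = rng.standard_normal((3, 3))

print(np.dot(x, y), np.dot(Q @ x, Q @ y))  # agree up to float error
print(np.dot(x, y), np.dot(M @ x, M @ y))  # generally differ
```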
Three questions:
What format do you upload SAEs in?
What data do you run the SAEs over to generate the activations / samples?
How long is the delay between uploading an SAE and it becoming available to view?