Am I just inexperienced or confused, or is this paper using a lot of words to say effectively very little? Sure, this functional form works fine for a given set of regimes of scaling, but it effectively gives you no predictive ability to determine when the next break will occur.
Sorry if this is overly confrontational, but I keep seeing this paper on Twitter and elsewhere and I’m not sure I understand why.
When f (in equation 1 of the paper (https://arxiv.org/abs/2210.14891), not the video) for the next break is sufficiently large, it does give you predictive ability to determine when that next break will occur, though the number of seeds needed to get such predictive ability is very large. When f for the next break is sufficiently small (and nonnegative), it does not give you predictive ability to determine when that next break will occur.
Play around with f_i in this code to see what I mean:
https://github.com/ethancaballero/broken_neural_scaling_laws/blob/main/make_figure_1__decomposition_of_bnsl_into_power_law_segments.py#L25-L29
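To make the role of the sharpness parameters concrete, here is a minimal sketch of equation 1 of the BNSL paper in plain Python (my own sketch, not the repo's code). The functional form and parameter names (a, b, c_0, c_i, d_i, f_i) follow the paper; the specific numbers below are made up for illustration. The point: when f_i is small, the curve is numerically indistinguishable from the unbroken power law until x reaches the break location d_i, so data collected before the break carries essentially no signal about where the break will occur; when f_i is large, the break bends the curve well in advance.

```python
import math

def bnsl(x, a, b, c0, breaks):
    """Broken neural scaling law, equation 1 of arXiv:2210.14891:
    y = a + b * x^(-c0) * prod_i (1 + (x / d_i)^(1 / f_i))^(-c_i * f_i).
    Each break is a (c_i, d_i, f_i) tuple; f_i controls the sharpness
    of that break (smaller f_i -> sharper break)."""
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# Same break location (d = 1e3), different sharpness f.
sharp  = lambda x: bnsl(x, a=0.0, b=1.0, c0=0.1, breaks=[(0.4, 1e3, 0.05)])
smooth = lambda x: bnsl(x, a=0.0, b=1.0, c0=0.1, breaks=[(0.4, 1e3, 2.0)])
pure   = lambda x: 1.0 * x ** (-0.1)  # the unbroken power law b * x^(-c0)

# Well before the break (x = 10, 100), the sharp curve tracks the pure
# power law almost exactly, while the smooth curve already deviates:
for x in (10.0, 100.0):
    print(x, pure(x) - sharp(x), pure(x) - smooth(x))
```

Fitting only the pre-break region of the `sharp` curve recovers a clean single power law with no hint that a break is coming at d = 1e3, which is the non-identifiability the comment above is pointing at.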
Obvious crackpot; says on Twitter that there’s a $1 billion prize for “breaking” BNSL funded by Derek Parfit’s family office. I’d cut him more slack for potentially being obviously joking, if it weren’t surrounded by claims that also sounded like crackpottery to me. https://twitter.com/ethanCaballero/status/1587502829580820481
I am co-supervising Ethan’s PhD, and we previously wrote another ML paper together: https://arxiv.org/abs/2003.00688
Ethan has an unusual communication style, but he’s not a crackpot, and this work is legit (according to me, the anchor author). I haven’t listened to the interview.
Well, I am Ethan’s primary supervisor (since 2020), and really appreciate his provocative sense of humor (though not everyone does 😀) - regarding the BNSL paper though, it’s 100% serious (though I did promise to double the $1B prize on Twitter; like advisor like student 😜)
I would be remiss not to warn folks that Ethan had a long-standing habit of making misleading, sensationalized, and outright false statements about ML scaling and other topics in the EleutherAI Discord. It got to the point several times where the moderation team had to step in to address the issue. Would recommend taking everything with a massive grain of salt.
Source: I was one of those mods.
What happened with that? I didn’t realize he had issues with claims on scaling.
Early on, we typically just ignored it or tried to discuss it with him. After a while it became common knowledge that Ethan would post “bait”. Eventually we escalated to calling the behavior out when it occurred, deleting the relevant post(s), and/or handing him a temporary timeout. I don’t know what has happened since then, I’ve been away from the server the past few months.
https://discord.com/channels/729741769192767510/785968841301426216/958570285760647230
Ethan posts an annotated image from OpenAI’s scaling-laws paper (https://arxiv.org/pdf/2001.08361.pdf), stating that it’s “apparently wrong now” after the compute-optimal scaling-laws paper from DeepMind: https://cdn.discordapp.com/attachments/785968841301426216/958570284665946122/Screen_Shot_2021-10-20_at_12.30.58_PM_1.png. The screenshot claims that the crossover point between the data and compute scaling laws in the original OpenAI paper predicts AGI.
Ethan, my impression is that you’re mildly overfitting. I appreciate your intellectual arrogance quite a bit; it’s a great attitude to have as a researcher, and more folks here should have attitudes like yours, IMO. But I’d expect that the causal quality of training data is going to throw a huge honkin’ wrench into any expectations we form about how we can use strong models; note that even humans trained on data of low causal quality form weird and false superstitions! I agree with the “test loss != capability” claim, because the test distribution is weird and made up and doesn’t exist outside the original dataset. IID is catastrophically false, and figuring that out is the key limiter preventing robotics from matching pace with the rest of ML/AI right now, imo. So your scaling model might even be a solid representation space, but it’s misleading because of the correlation problem.
this seems like a solid empirical generative representation, but I don’t feel comfortable assuming it is a causally accurate generative model. it appears overparameterized without causal justification to me. certainly we can fit known data with it, but it explicitly bakes in an assumption of non-generalization; perhaps that’s the only significant claim being made? but I don’t see how we can even generalize that the sharpness of the breaks is reliable. ethan says come at me; I say this is valid, but it does not significantly refine the predictive distribution, and that is itself the difficult problem we’d hope to solve in the first place.
[humor] could one use this method to represent the convergence behavior of researchers with crackpots as the size of the cracks in crackpots’ pot decreases and the number of objects colliding with respected researchers’ pots increases?