gwern
Have you looked at the “incubation effect”?
It’s not the numerical precision but the model architecture being sparse, such that you only activate a few experts at runtime and only a small fraction of the model runs for each input. It may be 1.3t parameters or whatever, but then at runtime, only, I dunno, 20b parameters actually compute anything. This cheapness of forward passes/inferencing is the big selling point of MoE for training and deployment: that you don’t actually ever run 1.3t parameters. But it’s hard for parameters which don’t run to contribute anything to the final result, whereas in GPT-3, pretty much all of those 175b parameters can participate in each input. It’s much clearer if you think about comparing them in terms of FLOPS at runtime, rather than static parameter counts. GShard/Switch is just doing a lot less.
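To put rough numbers on the FLOPS point (illustrative only, using the usual ~2 FLOPs per active parameter per token for a forward pass; the MoE active-parameter count is my guess, not Switch’s exact figure):

```python
# What matters at runtime is the parameters that actually run per input,
# not the total parameter count sitting on disk.
def forward_flops_per_token(active_params: float) -> float:
    """~2 FLOPs (one multiply + one add) per active parameter per token."""
    return 2 * active_params

dense_gpt3 = forward_flops_per_token(175e9)  # dense: essentially all 175b parameters participate
sparse_moe = forward_flops_per_token(20e9)   # MoE: ~20b of ~1.3t parameters active per input (my guess)
print(f"dense/MoE compute ratio per token: {dense_gpt3 / sparse_moe:.1f}x")  # ~8.8x
```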
(I also think that the scaling curves and comparisons hint at Switch learning qualitatively worse things, and the modularity encouraging more redundancy and memorization-heavy approaches, which impedes any deeper abstractions or meta-learning-like capabilities that a deep dense model might learn. But this point is much more speculative, and not necessarily something that, say, translation researchers would care too much about.)
This point about runtime also holds for those chonky embeddings people sometimes bring up as examples of ‘models with billions of parameters’: sure, you may have a text or category embedding which has billions of ‘parameters’, but for any specific input, only a handful of those parameters actually do anything.
If Reddit falls through, email me and I can order a scan for you. (Might want to delete your duplicate comments here too.)
You should probably also be tracking the kind of parameter. I see you have Switch and GShard in there, but, as you can see from how they are visibly outliers, MoEs (and embeddings) use much weaker ‘parameters’, as it were, than dense models like GPT-3 or Turing-NLG. Plotting by FLOPS would help correct for this; perhaps we need graphs like training-FLOPS per parameter? That would also help correct for comparisons across methods, like to older architectures such as SVMs. (Unfortunately, this still obscures that the key thing about Transformers is better scaling laws than RNNs or n-grams etc., where the high FLOPS-per-parameter translates into better curves...)
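A sketch of what such a training-FLOPS-per-parameter plot might use, assuming the standard dense-Transformer approximation C ≈ 6 × (active parameters) × (training tokens); the token counts below are made up purely for illustration:

```python
# Training compute scales with the parameters that do work per token, so dividing
# by *total* parameters makes MoEs/embeddings look as weak as they should here.
def training_flops(active_params: float, tokens: float) -> float:
    return 6 * active_params * tokens  # C ~= 6 * N_active * D

def flops_per_parameter(total_params: float, active_params: float, tokens: float) -> float:
    return training_flops(active_params, tokens) / total_params

print(flops_per_parameter(175e9, 175e9, 300e9))  # dense GPT-3-like: ~1.8e12 FLOPs per parameter
print(flops_per_parameter(1.6e12, 20e9, 300e9))  # Switch-like MoE:  ~2.3e10 FLOPs per parameter
```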
“‘Nash equilibrium strategy’ is not necessarily synonymous to ‘optimal play’. A Nash equilibrium can define an optimum, but only as a defensive strategy against stiff competition. More specifically: Nash equilibria are hardly ever maximally exploitive. A Nash equilibrium strategy guards against any possible competition including the fiercest, and thereby tends to fail taking advantage of sub-optimum strategies followed by competitors. Achieving maximally exploitive play generally requires deviating from the Nash strategy, and allowing for defensive leaks in one’s own strategy.”
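(A toy illustration of the quoted point, with rock-paper-scissors payoffs; the numbers are mine, not the author’s:)

```python
# The Nash mixed strategy is unexploitable but gains nothing against a biased
# opponent; the maximally exploitive reply earns more, at the cost of itself
# being exploitable by a counter-adjusting opponent.
import numpy as np

# Row player's payoffs; rows/cols = (rock, paper, scissors); win = +1, lose = -1.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

nash = np.array([1/3, 1/3, 1/3])
biased_opponent = np.array([0.5, 0.25, 0.25])     # throws rock half the time

print(nash @ A @ biased_opponent)                 # 0.00: guaranteed, but exploits nothing
print(np.array([0, 1, 0]) @ A @ biased_opponent)  # +0.25: always play paper (exploitive best response)
```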
March 2021 gwern.net newsletter
That’s interesting. I did see YC listed as a major funding source, but given Sam Altman’s listed loans/donations, I assumed, because YC has little or nothing to do with Musk, that YC’s interest was Altman, Paul Graham, or just YC collectively. I hadn’t seen anything at all about YC being used as a cutout for Musk. So assuming the Guardian didn’t screw up its understanding of the finances there completely (the media is constantly making mistakes in reporting on finances and charities in particular, but this seems pretty detailed and specific and hard to get wrong), I agree that that confirms Musk did donate money to get OA started and it was a meaningful sum.
But it still does not seem that Musk donated the majority or even plurality of OA donations, much less the $1b constantly quoted (or any large fraction of the $1b collective pledge, per ESRogs).
Some of the most interesting media experiments I know of are the Yahoo Media experiments:
- “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market”, Salganik et al 2006:

  We investigated this paradox experimentally, by creating an artificial “music market” in which 14,341 participants downloaded previously unknown songs either with or without knowledge of previous participants’ choices. Increasing the strength of social influence increased both inequality and unpredictability of success. Success was also only partly determined by quality: The best songs rarely did poorly, and the worst rarely did well, but any other result was possible.
- “Web-Based Experiments for the Study of Collective Social Dynamics in Cultural Markets”, Salganik & Watts 2009:

  Using a “multiple-worlds” experimental design, we are able to isolate the causal effect of an individual-level mechanism on collective social outcomes. We employ this design in a Web-based experiment in which 2,930 participants listened to, rated, and downloaded 48 songs by up-and-coming bands. Surprisingly, despite relatively large differences in the demographics, behavior, and preferences of participants, the experimental results at both the individual and collective levels were similar to those found in Salganik, Dodds, and Watts (2006)... A comparison between Experiments 1 and 2 reveals a different pattern. In these experiments, there was little change at the song level; the correlation between average market rank in the social influence worlds of Experiments 1 and 2 was 0.93.
This is analogous to test-retest error: if you run a media market with the same authors, and same creative works, how often do you get the same results? Forget completely any question about how much popularity correlates with ‘quality’ - does popularity even correlate with itself consistently? If you ran the world several times, how much would the same songs float to the top?
The most relevant rank correlation they seem to report is rho=0.93*. That may seem high, but the more datapoints there are, the higher the necessary correlation soars to give the results you want.
A rho=0.93 implies that if you had a million songs competing in a popularity contest, the #1 most popular song in our world would probably be closer to only the ~#35,000th most popular song in a parallel world’s contest as it regresses to the mean (1,000,000 − (500,000 + 500,000 × 0.93) = 35,000). (As I noted the other day, even in very small samples you need extremely high correlations to guarantee double-maxes or similar properties once you move beyond means; our intuitions don’t realize just what an extreme demand we make when we assume that, say, J.K. Rowling must be a very popular successful writer in most worlds simply because she’s a billionaire in this world, despite how many millions of people are writing fiction and competing with her. Realistically, she would be a minor but respected author who might or might not’ve finished out her HP series as sales flagged for multi-volume series; sort of like her crime novels published pseudonymously.) Then toss in the undoubtedly <<1 correlation between popularity and any ‘quality’… It is indeed no surprise that, out of the millions and millions of chefs over time, the best chefs in the world are not the most popular YouTube chefs. Another example of how ‘the tails come apart’ at the extremes, and why order statistics are counterintuitive.
* They also report a rho=0.52 from some other experiments, which are arguably now more relevant than the 0.93 estimate. Obviously, if you use 0.52 instead, my point gets much much stronger: then, out of a million, you regress from #1 to #240,000!
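For anyone who wants to check the arithmetic, a minimal sketch of the crude linear regression-to-the-mean heuristic I’m using above (not an exact order-statistics calculation):

```python
# Treat rank as regressing toward the median rank by the rank correlation rho.
def expected_rerun_rank(rank: int, n: int, rho: float) -> float:
    median = n / 2
    return median - rho * (median - rank)

print(expected_rerun_rank(1, 1_000_000, 0.93))  # ~35,000: where this world's #1 lands in a re-run
print(expected_rerun_rank(1, 1_000_000, 0.52))  # ~240,000: with the weaker correlation estimate
```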
-
I knew someone was going to ask that. Yes, it’s impure indexing, it’s true. The reason is the returns to date on the whole-world indexes have been lower, the expense is a bit higher, and after thinking about it, I decided that I do have a small opinion about the US overperforming (mostly due to tech/AI and a general sense that people persistently underestimate the US economically) and feel pessimistic about the rest of the world. Check back in 20 years to see how that decision worked out...
As described above, I expect AGI to be a learning algorithm—for example, it should be able to read a book and then have a better understanding of the subject matter. Every learning algorithm you’ve ever heard of—ConvNets, PPO, TD learning, etc. etc.—was directly invented, understood, and programmed by humans. None of them were discovered by an automated search over a space of algorithms. Thus we get a presumption that AGI will also be directly invented, understood, and programmed by humans.
For a post criticizing the use of evolution for end-to-end ML, this post seems to be pretty strawmanish and generally devoid of any grappling with the Bitter Lesson, the end-to-end principle, Clune’s arguments for generativity and the AI-GAs program to soup up self-play for goal generation/curriculum learning, or any actual research on evolving better optimizers, DRL, or SGD itself… Where’s Schmidhuber, Metz, or AutoML-Zero? Are we really going to dismiss PBT evolving populations of agents in the AlphaLeague as just ‘tweaking a few human-legible hyperparameters’? Why isn’t Co-Reyes et al 2021 an example of evolutionary search inventing TD learning, which you claim is absurd and the sort of thing that has never happened?
This was exactly what I expected. The problem with the field of bioethics has never been the papers being 100% awful, but how it operates in the real world, the asymmetry of interventions, and what its most consequential effects have been. I would have thought 2020 made this painfully clear. (That is, my grandmother did not die of coronavirus while multiple highly-safe & highly-effective vaccines sat on the shelf unused, simply because some bioethicist screwed up a p-value in a paper somewhere. If only!)
The actual day-to-day churn of publishing bioethics papers/research… Well, HHGttG said it best in describing humans in general:
Mostly Harmless.
I haven’t heard that claim before. My understanding was that such a claim would be improbable or cherry-picking of some sort: a priori, risk-adjusted etc. returns should be similar or identical, but by deliberately narrowing your index, you do predictably lose the benefits of diversification. So all else equal (such as fees and accessibility of making the investment), you want the broadest possible index.
Since we’re discussing EMH and VTSAX, seems as good a place to add a recent anecdote:
I was chatting with someone when investments came up, and they asked me where I put mine. I said 100% VTSAX. Why? Because I think the EMH is as true as it needs to be, I don’t understand why markets rise and fall when they do even when I think I’m predicting future events accurately (such as, say, coronavirus), and I don’t think I can beat the stock markets, at least not without investing far more effort than I care to. They said they thought it wasn’t that hard, and had (unlike me) sold all their stocks back in Feb 2020 or so, when most everyone was still severely underestimating coronavirus, and beaten the market drops. Very impressive, I said, but when had they bought back in? Oh, they hadn’t yet. But… didn’t that mean they missed out on the +20% or so net returns of 2020, and had to pay taxes? (VTSAX returned 21% for 2020, and 9.5% thus far for 2021.) Yes, they had missed out. Oops.
Trading is hard.
ALE is doubtless the Arcade Learning Environment. I’ve never seen an ‘ALE’ in DRL discussions which refers to something else.
It is quite possible that CLIP “knows” that the image contains a Granny Smith apple with a piece of paper saying “iPod”, but when asked to complete the caption with a single class from the ImageNet classes, it ends up choosing “iPod” instead of “Granny Smith”. I’d caution against saying things like “CLIP thinks it is looking at an iPod”; this seems like too strong a claim given the evidence that we have right now.
Yes, it’s already been solved. These are ‘attacks’ only under the most generous interpretation possible (since it does know the difference), and the fact that CLIP can read text in images and, arguably, correctly note the semantic similarity in embeddings is to its considerable credit. As the CLIP authors note, some queries benefit from ensembling and from more context than a single-word class name, such as prefixing “A photograph of a ”; and class names can be highly ambiguous: in ImageNet, the class name “crane” could refer to the bird or to construction equipment, and the Oxford-IIIT Pet dataset labels one class “boxer”.
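For concreteness, a minimal sketch of that kind of prompt ensembling using OpenAI’s open-source clip package (the templates and classes here are just illustrative, not the authors’ exact ones):

```python
# Average text embeddings over several caption templates instead of scoring
# against a bare, ambiguous class name like "crane" or "boxer".
import torch
import clip

model, preprocess = clip.load("ViT-B/32", device="cpu")
templates = ["a photo of a {}.", "a close-up photo of a {}.", "a blurry photo of a {}."]
classes = ["Granny Smith", "iPod"]

with torch.no_grad():
    weights = []
    for c in classes:
        tokens = clip.tokenize([t.format(c) for t in templates])
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        weights.append(emb.mean(dim=0))          # ensemble: average over templates
    zero_shot_weights = torch.stack(weights)     # image features then get scored against these
```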
Harper’s has a new article on meditation which delves into some of these issues. It doesn’t mention PNSE or Martin by name, but some of the mentioned results parallel them, at least:
...Compared with an eight-person control group, the subjects who meditated for more than thirty minutes per day experienced shallower sleep and woke up more often during the night. The more participants reported meditating, the worse their sleep became… A 2014 study from Carnegie Mellon University subjected two groups of participants to an interview with openly hostile evaluators. One group had been coached in meditation for three days beforehand and the other group had not. Participants who had meditated reported feeling less stress immediately after the interview, but their levels of cortisol—the fight-or-flight hormone—were significantly higher than those of the control group. They had become more sensitive, not less, to stressful stimuli, but believing and expecting that meditation reduced stress, they gave self-reports that contradicted the data.
Britton and her team began visiting retreats, talking to the people who ran them, and asking about the difficulties they’d seen. “Every meditation center we went to had at least a dozen horror stories,” she said. Psychotic breaks and cognitive impairments were common; they were often temporary but sometimes lasted years. “Practicing letting go of concepts,” one meditator told Britton, “was sabotaging my mind’s ability to lay down new memories and reinforce old memories of simple things, like what words mean, what colors mean.” Meditators also reported diminished emotions, both negative and positive. “I had two young children,” another meditator said. “I couldn’t feel anything about them. I went through all the routines, you know: the bedtime routine, getting them ready and kissing them and all of that stuff, but there was no emotional connection. It was like I was dead.”
...Britton’s research was bolstered last August when the journal Acta Psychiatrica Scandinavica published a systematic review of adverse events in meditation practices and meditation-based therapies. Sixty-five percent of the studies included in the review found adverse effects, the most common of which were anxiety, depression, and cognitive impairment. “We found that the occurrence of adverse effects during or after meditation is not uncommon,” the authors concluded, “and may occur in individuals with no previous history of mental health problems.” I asked Britton what she hoped people would take away from these findings. “Comprehensive safety training should be part of all meditation teacher trainings,” she said. “If you’re going to go out there and teach this and make money off it, you better take responsibility. I shouldn’t be taking care of your casualties.”
Why close the markets, though?
I assume they’re referring to data poisoning backdoor attacks like https://arxiv.org/abs/2010.12563 or https://arxiv.org/abs/1708.06733 or https://arxiv.org/abs/2104.09667
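For anyone unfamiliar, a generic sketch of what such a backdoor data-poisoning attack looks like (BadNets-style; illustrative only, not the specific method of any of those papers):

```python
# Stamp a small trigger onto a fraction of the training images and relabel them
# to the attacker's target class; the trained model then behaves normally except
# when the trigger appears at test time.
import numpy as np

def poison(images: np.ndarray, labels: np.ndarray, target_class: int,
           fraction: float = 0.05, seed: int = 0):
    """images: floats of shape (N, H, W, C) in [0, 1]; labels: ints of shape (N,)."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(fraction * len(images)), replace=False)
    images[idx, -4:, -4:, :] = 1.0   # 4x4 white-square trigger in the bottom-right corner
    labels[idx] = target_class       # mislabel the poisoned examples
    return images, labels
```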