gwern
If we look at insects, we see that they have not been capable of evolving general intelligence despite the fact that they have existed for 400 million years, are much more numerous than mammals, and have a very high speed of evolution, in which a new species can appear in a few years and each tree in the jungle can host its own species of beetles.
This analogy doesn’t work because humans exist. Life on Earth did evolve intelligence in a rather short time and with an astronomically tiny amount of resources (only what can be obtained on 1 small rocky planet). UAPs would need to be locked into their niches by evolutionary pressures many quadrillions of times stronger than the difficulty of evolving intelligence on Earth in order for nowhere in the universe to produce intelligent UAPs (and, given that we are talking about quadrillions upon quadrillions upon quadrillions of opportunities, to do so many times over in many locations) which could then rapidly dominate the way humans now do: passing a threshold and then controlling a large fraction of biomass and massively altering almost every trait of the Earth.
One observation on #2: because they move and colonize faster, evolution will happen at literal light-speed. Viral evolution will have nothing on how fast ‘UAP’ life evolves when you have densely colonized the entire universe and move & metabolize at literally light-speed; that represents who knows how many quadrillions of times more individual units of selection, operating billions of times faster, than life on Earth. (Compare the frequencies of photonics vs electronics vs biological devices: terahertz vs gigahertz vs hertz.) UAP life would, universe-wide, go through more selection in seconds than Earth life has undergone in total to date. Anything that is possible with non-zero probability will be feasible, and fast. This includes intelligence. They will either be viruses or deities. (They will not be playing chase with fighter pilots for kicks.)
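As a toy consistency check on those orders of magnitude (every constant below is an illustrative placeholder, not a measured value):

```python
# Back-of-envelope for the claim above; all constants are illustrative placeholders.
earth_organisms   = 1e30     # rough order of magnitude for organisms alive on Earth at once
earth_gen_per_sec = 1.0      # biological "clock": updates/generations at ~Hz
universe_factor   = 1e15     # "quadrillions" more units of selection, universe-wide
speed_factor      = 1e9      # "billions of times faster": ~GHz-THz vs ~Hz

uap_selection_per_sec = earth_organisms * universe_factor * earth_gen_per_sec * speed_factor
earth_selection_total = earth_organisms * earth_gen_per_sec * 3.8e9 * 3.15e7  # ~3.8 Gyr of Earth life, in seconds

print(f"UAP-scale selection events per second : {uap_selection_per_sec:.1e}")
print(f"Earth selection events, all time      : {earth_selection_total:.1e}")
print(f"seconds of UAP-scale evolution equal to all of Earth's: {earth_selection_total / uap_selection_per_sec:.1e}")
```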
The primary limit would be that lightspeed limits the universe-wide propagation of more fit UAPs (hypothetically, a superior UAP on the other edge of the visible universe would still be billions of years away from reaching us), but this would just lead to convergent evolution as an innovation emerges repeatedly in multiple patches and starts spreading in ‘shockwaves’, recapitulating ‘grabby aliens’ dynamics except much faster.
So interesting UAPs are ruled out even more strongly by the complete absence of any discernible major heterogeneity in the cosmos: if UAPs are not purely epiphenomenal, if there is any way whatsoever for them to affect regular matter or stellar evolution or radiation or light at scale, we would see the different bubbles and be affected by UAPs, rather than seeing a clockwork universe unspooling from the Big Bang.
That post by Raemon also clearly called for a solution, which I don’t believe Raemon found.
What do you think of gwern.net style popups with summaries as a solution?
Maybe as we get better at decoding DNA we’ll find leftover scraps of some of them lurking in the seemingly-unused sections of various genomes.
Coincidentally, Quanta has an article on a modification of RNA World hypothesis arguing something similar: https://www.quantamagazine.org/lifes-first-peptides-may-have-grown-on-rna-20220524/
Helpfully, DeepMind’s chief operating officer, Lila Ibrahim (“a passionate advocate for social impact in her work and her personal life”), who would be intimately involved in any funding of safety research, overseeing large-scale deployment, and reacting to problems, has a blog post all about what she thinks AI safety is about and what she is concerned about in doing AI research responsibly: “Building a culture of pioneering responsibly: How to ensure we benefit society with the most impactful technology being developed today”
I believe pioneering responsibly should be a priority for anyone working in tech. But I also recognise that it’s especially important when it comes to powerful, widespread technologies like artificial intelligence. AI is arguably the most impactful technology being developed today. It has the potential to benefit humanity in innumerable ways – from combating climate change to preventing and treating disease. But it’s essential that we account for both its positive and negative downstream impacts. For example, we need to design AI systems carefully and thoughtfully to avoid amplifying human biases, such as in the contexts of hiring and policing.
She has also written enthusiastically about DM’s funding for “racial justice efforts”.
FWIW, the facts that the scaling laws were different, extrapolate very differently, and also apparently resolve the contradiction were discussed a lot at the time; I dunno if they were discussed enough, but certainly they were discussed here & on /r/MLscaling & by Daniel & Nostalgebraist & the usual suspects.
Yes, it is a cautionary lesson, just not the one people are taking it for; however, given the minimal documentation or transparency, there are a lot of better examples of short-term risk-heavy investment strategies blowing up (which is why Wall Street strictly segregates the risk-control departments from the traders and uses clawbacks etc.), so the Zillow anecdote isn’t really worth telling in any context (except perhaps, like the other stories, as an example of low epistemic standards and how leprechauns are born).
Google Brain just announced Imagen (Twitter), which on skimming appears to be not just as good as DALL-E 2 but convincingly better. The main change appears to be reducing the CLIP reliance in favor of a much larger and more powerful text encoder before doing the image diffusion stuff. They make a point of noting superiority on “compositionality, cardinality, spatial relations, long-form text, rare words, and challenging prompts.” The samples also show text rendering fine inside the images as well.
I take this as strong support (already) for my claims 2-3: the problems with DALL-E 2 were not major or deep ones, do not require any paradigm shift to fix, or even any fix, really, beyond just scaling the components almost as-is. (In Kuhnian terms, the differences between DALL-E 2 and Imagen or Make-A-Scene are so far down in the weeds of normal science/engineering that even people working on image generation will forget many of the details and have to double-check the papers.)
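To make the architectural contrast concrete, here is a rough schematic of the two pipelines as I understand them (hypothetical stand-in functions, not code from either paper):

```python
# High-level schematic only: DALL-E 2's "unCLIP" routes the prompt through a CLIP
# image-embedding prior, while Imagen conditions diffusion directly on a large frozen
# text encoder and then super-resolves. All arguments are hypothetical stand-ins.

def dalle2_unclip(prompt, clip_text_enc, prior, decoder):
    text_emb = clip_text_enc(prompt)       # CLIP text embedding
    image_emb = prior(text_emb)            # prior maps text embedding -> CLIP image embedding
    return decoder(image_emb)              # diffusion decoder conditioned on the image embedding

def imagen(prompt, t5_text_enc, base_diffusion, sr_256, sr_1024):
    text_emb = t5_text_enc(prompt)         # large frozen language-model text encoder
    img_64 = base_diffusion(text_emb)      # 64x64 text-conditional diffusion
    return sr_1024(sr_256(img_64, text_emb), text_emb)  # cascaded text-conditioned super-resolution
```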
I have approximately zero knowledge about Zillow or the real estate business, so I can’t comment on whether this actually is what went wrong.
FWIW, like the tank or Microsoft Tay or the Amazon hiring system stories, Zillow seems quite dubious as an AI-risk/bias story despite its rapid canonization into the narrative.
The current narrative seems to trace mostly to the Zillow CEO’s initial analyst call, where he blamed Zillow’s losses on the models, saving face while neglecting to explain how his competitors in the same markets, using similar models & datasets, mysteriously did so much better; this then got repeated in the online echo chamber (ever more encrusted with opinions and speculation).
But almost immediately, anonymous comments on Reddit/HN & Zillow insiders talking to Bloomberg journalists told a different story: Zillow had two sets of models, and was well-aware that the secondary price model necessarily would lead to the winner’s curse & losses if they bought at the model’s high bid because they’d only win the houses they were overpaying for; they bought all the houses anyway because of a failure of management, which wanted to hit sales & growth targets and used the secondary rather than primary model as an excuse to do all the buys. This risk-hungry growth-buying strategy looked good in the short-term but then failed long-term in the predictable way, as has happened to many market-makers and companies in the past, when a predictably unpredictable market fluctuation happened.
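For anyone who wants to see the winner’s-curse mechanism rather than take it on faith, here is a minimal Monte Carlo sketch (all numbers are made up for illustration; this is not a claim about Zillow’s actual models):

```python
# Minimal sketch of the winner's-curse / adverse-selection dynamic: an unbiased price
# model still loses money on the subset of offers that get accepted, because sellers
# disproportionately accept the offers that overestimate their house's value.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_value = rng.normal(300_000, 50_000, n)              # what each house later resells for
model_estimate = true_value + rng.normal(0, 20_000, n)   # unbiased model with ~$20k error
offer = model_estimate                                   # buy at the model's estimate

# Sellers, who know their house & local market better, accept only offers above
# what they could get elsewhere:
accepted = offer > true_value

profit_all = (true_value - offer).mean()                 # ~$0: the model is unbiased overall
profit_bought = (true_value - offer)[accepted].mean()    # strongly negative: you win what you overpaid for

print(f"avg error across all offers : {profit_all:+,.0f}")
print(f"avg loss on accepted offers : {profit_bought:+,.0f}")
```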
I think a version of this pretty much has to be true for at least a subset of skills/declarative knowledge, like factual knowledge (being a walking Wikipedia) or programming. A large model has read more of Wikipedia, and memorized more of Wikipedia (as well as Arxiv, Pubmed...), than any single human ever has. One trained on Github has also learned more languages to more depth than any human has: a human programmer will be superior on a few specific languages, doubtless, but they will still know less in aggregate. So when it conditions on an individual prompt/human, the imitation will be limited, but that vast pool of knowledge is still there. Across all the prompts one might test, one can elicit more knowledge than an individual human has.
For both of those points to be true, it would have to be the case that the model cannot tap into the full pool under any possible conditions, including invasive ones like RL training or prompt finetuning, which is a truly remarkable universal claim with a heavy burden of proof. Somehow all that knowledge is locked up inside the model parameters, but in so fiendishly encrypted a way that only small subsets—which always just happen to correspond to human subsets—can ever be used in a response...?
So, since at least some modest version is definitely true, the only question is how far it goes. Since the ways in which imitation learning can exceed experts are quite broad and general, it’s hard to see why you would then be able to cavil at any particular point. It just seems like an empirical question of engineering & capabilities about where the model lands in terms of its unshackled capabilities—the sort of thing you can’t really deduce in general, and just have to measure it directly, like prompting or gradient-ascending a Decision Transformer for its highest possible reward trajectory to see what it does.
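For concreteness, ‘prompting a Decision Transformer for its highest possible reward trajectory’ amounts to something like the following sketch, where `dt_model` and `env` are hypothetical stand-ins rather than any particular implementation:

```python
# Sketch of conditioning a return-conditioned sequence model on an extreme target return
# and observing what behavior it extrapolates to. `dt_model` and `env` are hypothetical.

def rollout(dt_model, env, target_return: float, max_steps: int = 1000):
    """Greedy rollout of a return-conditioned sequence model (Decision Transformer style)."""
    states, actions, returns_to_go = [env.reset()], [], [target_return]
    for _ in range(max_steps):
        # The model predicts the next action given the (return-to-go, state, action) history.
        action = dt_model.predict_action(returns_to_go, states, actions)
        state, reward, done, _ = env.step(action)
        actions.append(action)
        states.append(state)
        # Decrement the outstanding return target by the reward just received.
        returns_to_go.append(returns_to_go[-1] - reward)
        if done:
            break
    return states, actions

# "Unshackling": ask for a return far above anything in the training data (or,
# alternatively, gradient-ascend the return conditioning itself) and just watch:
# trajectory = rollout(dt_model, env, target_return=10 * best_return_in_dataset)
```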
The query is for “ASL”, not “Hands”, and these images don’t look like something from a protest.
ASL will always be depicted by a model like DALL-E as hands; I am sure that there are non-pictorial ways to write down ASL but I can’t recall them, and I actually took ASL classes. So that query should always produce hands in it. Then because actual ASL diagrams will be rare and overwhelmed by leakage from more popular classes (keep in mind that deafness is well under 1% of the US population, even including people like me who are otherwise completely uninvolved and invisible, and basically any political fad whatsoever will rapidly produce vastly more material than even core deaf topics), and maybe some more unCLIP looseness...
It’s unnatural, yes: ASL is predominantly white, and people involved in ASL are even more so (I went to NTID and the national convention, so can speak first-hand, but you can also check Google Image for that query and it’ll look like what you expect, which is amusing because ‘Deaf’ culture is so university & liberal-centric). So it’s not that ASL diagrams or photographs in the wild really do look like that—they don’t.
Overrepresentation of DEI material in the supersekrit licensed databases would be my guess. Stock photography sources are rapidly updated for fashions, particularly recent ones, and you can see this occasionally surfacing in weird queries. (An example going around Twitter which you can check for yourself: “happy white woman” in Google will turn up a lot of strange photos for what seems like a very easy straightforward query.) Which parts are causing it is a better question: I wouldn’t expect there to be much Deaf stock photo material which had been updated, or much ASL material at all, so maybe there’s bleedthrough from all of the hand-centric (eg ‘Black Power salute’, upraised Marxist fists, protests) iconography? There being so much of the latter and so little of the former that the latter becomes the default kind of hand imagery.
This point has also been made before: predictions of short-term stagnation without also simultaneously bumping back AGI timelines would appear to imply steep acceleration at some point, in order for the necessary amounts of progress to ‘fit’ in the later time periods.
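The arithmetic is trivial but worth making explicit (placeholder numbers only):

```python
# Toy arithmetic for the point above: holding an AGI date fixed while predicting
# near-term stagnation forces the implied rate of progress in the remaining years up.
total_progress_needed = 100   # arbitrary units of "remaining progress"
years_to_agi = 20             # hypothetical fixed timeline
stagnant_years = 10           # predicted near-term stagnation

uniform_rate = total_progress_needed / years_to_agi                              # 5.0 units/year
post_stagnation_rate = total_progress_needed / (years_to_agi - stagnant_years)   # 10.0 units/year

print(f"implied acceleration after the stagnant period: {post_stagnation_rate / uniform_rate:.1f}x")
```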
I’m not sure if that matters. By definition, it probably won’t happen and so any kind of argument or defense based on tail-risks-crippling-AIs-but-not-humans will then also by definition usually fail (unless the tail risk can be manufactured on demand and also there’s somehow no better approach), and it’s unclear that’s really any worse than humans (we’re supposedly pretty bad at tail risks). Tail risks also become a convergent drive for empowerment: the easiest way to deal with tail risks is to become wealthy so quickly that it’s irrelevant, which is what an agent may be trying to do anyway. Tail stuff, drawing on vast amounts of declarative knowledge, is also something that can be a strength of artificial intelligence compared to humans: an AI trained on a large corpus can observe and ‘remember’ tail risks in a way that individual humans never will—a stock market AI trained on centuries of data will remember Black Friday vividly in a way that I can’t. (By analogy, an ImageNet CNN is much better at recognizing dog breeds than almost any human, even if that human still has superior image skills in other ways. Preparing for a Black Friday crash may be more analogous to knowing every kind of terrier than being able to few-shot a new kind of terrier.)
I have an earlier comment on the ‘horizon hypothesis’ and temporal scaling laws: https://www.lesswrong.com/posts/65qmEJHDw3vw69tKm/proposal-scaling-laws-for-rl-generalization?commentId=wMerfGZfPHerdzDAi
I disagree that you can be sure of any sort of monotonically-decreasing capability with temporal duration, because there seem to be a lot of ways around it which could produce inversions for specific agents; more broadly, in a control context, a longer timescale is itself a resource an agent can spend (eg. on outsourcing to hired guns, or recovering from errors like we see in inner-monologues, or guess-and-check, or brute force), so longer episodes can be more rewarding than short episodes. It is a thin reed on which to rely.
Alphastar was trained by creating a league of AlphaStars which competed against each other in actual games. To continue our weightlifting analogy, this is like a higher rep range with lower weight.
I think by this point your weightlifting analogy has started to obscure much more than clarify. (Speaking as someone who just came back from doing some higher-rep exercises with lower weight, I struggle to see how that was in any sense like the AlphaStar League PBT training.)
I disagree with the claim that progress has slowed down, but I am also not too sure what you are arguing, since you are redefining ‘progress’ to mean something other than ‘quickly making way more powerful systems like AlphaFold or GPT-3’, which you agree has happened. To rephrase this more like the past scaling discussions, I think you are arguing something along the lines of:
Recent ‘AI progress’ in DL is unsustainable because it was due not to fundamentals but picking low-hanging fruits, the one-time using-up of a compute overhang: it was largely driven by relatively small innovations like the Transformer which unlocked scaling, combined with far more money spent on compute to achieve that scaling—as we see in the ‘AI And Compute’ trend. This trend broke around when it was documented, and will not resume: PaLM is about as large as it’ll get for the foreseeable future. The fundamentals remain largely unchanged, and if anything, improvement of those slowed recently as everyone was distracted picking the low-hanging fruits and applying them. Thus, the near future will be very disappointing to anyone extrapolating from the past few years, as we have returned to the regime where research ideas are the bottleneck, and not data/compute/money, and the necessary breakthrough research ideas will arrive unpredictably at their own pace.
But remember, to train “a Gato,” we have to first train all the RL policies that generate its training data. So we have access to all of them too.
No, you don’t have to, nor do you have guaranteed access, nor would you necessarily want to use them rather than Gato if you did. As Daniel points out, this is obviously untrue of all of the datasets it’s simply doing self-supervised learning on (how did we ‘train the RL policy’ for photographs?). It is also not true because Gato is off-policy and offline: the experts could be human, or they could be the output of non-RL algorithms which are infeasible to run, such as large search processes (eg chess endgame tables) or brittle non-generalizable expert-hand-engineered algorithms, or they could be RL policies you don’t have direct access to (because they’ve bitrotten or their owners won’t let you), or even RL policies which no longer exist because the agents were deleted but their data remains, or they could be RL policies from an oracle setting where you can’t run the original policy in the meaningful real-world context (eg in robotics sim2real, where you train the expert with oracle access to the simulation’s ground truth to get a good source of demonstrations, but at the end you need a policy which doesn’t use that oracle so you can run it on a real robot), or more broadly any kind of meta-learning context where you have data from RL policies for some problems in a family of problems and want to induce a general solver, or they are filtered high-reward episodes from large numbers of attempts by brute-force dumb (even random) agents where you trivially have ‘access to all of them’ but that is useless, or… Those RL policies may also not be better than a Gato or DT to begin with, because imitation learning can exceed observed experts and the ‘RL policies’ here might be, say, random baselines which merely have good coverage of the state-space. Plus, nothing at all stops Decision Transformer from doing its own exploration (planning was already demonstrated by DT/Trajectory Transformer, and there’s been work afterwards like Online Decision Transformer).
Yeah, I don’t find a linear regression on pairs of models to be all that informative:
- the parameterization as % is misleading, squashing differences:
    - especially as you would expect, for 2 reasons, performance to spend most of its time near 0 or 1: near 1, because we are so excited about DL precisely because it is solving so many tasks, and once solved they stay solved; and near 0 because, with so many of the tasks now approaching 1, we need to create ever more super-duper hard, now usually adversarially constructed, tasks, on which all the models start off around 0
    - it also tends to exaggerate or erase the plateaus and phase transitions depending on where the model sizes start in the transition and what the base rate of error is, neither of which has any principled connection to the phenomenon of interest (it is not important whether the baseline is 10% error because it’s a multiple-choice task with 10 options rather than 25% because it had 4 options)
- individual tasks have a lot of random sampling error: ie. if we constructed a task a second time with fresh data, we would see the same models get different scores
- individual models have ‘sampling error’: each model is a sample from the Bayesian posterior and will make somewhat different predictions; this will lead to different scores on the same task (ie. if we trained the same model in exactly the same way except for the random seed & other nondeterminism like GPU ops/network, it would get different scores on the same task)
- comparing a single pair of models is not very powerful:
    - You don’t have ‘n = 62’, you have ‘n = 1’. (Imagine if you broke down each task into its individual single questions instead of the fairly arbitrary existing task chunks. Do you now suddenly have n = 100,000 or whatever? No, of course not.)
    - range restriction in model scaling: these are power/log phenomena; a pair of models differing by a single order of magnitude is not informative.
Plotting the predictive loss over many models and multiple orders of magnitude is meaningful. Plotting it versus normalized performance across many tasks is also reasonable albeit highly noisy. Plotting individual tasks of single checkpoints against somewhat larger checkpoints is a recipe for baking in so much noise at so many levels, deflating everything by measurement error so far towards zero, that I’m not too surprised one doesn’t see any clear patterns in the residuals and may be chasing noise.
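To illustrate the attenuation worry concretely, here is a small simulation (purely illustrative numbers, not the actual benchmark data): even with a clean underlying scaling relationship, per-score measurement error drags the fitted slope toward zero.

```python
# Even when larger-model performance is a clean function of smaller-model performance,
# independent benchmark noise on each (model, task) score attenuates the fitted slope
# toward zero, washing out real structure. Numbers are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_tasks = 62
true_small = rng.uniform(0.1, 0.9, n_tasks)        # true per-task score of the smaller model
true_large = true_small + 0.5 * (1 - true_small)   # true rule: the larger model closes half the gap

for noise_sd in [0.0, 0.1, 0.25]:
    obs_small = true_small + rng.normal(0, noise_sd, n_tasks)   # each score measured once, with error
    obs_large = true_large + rng.normal(0, noise_sd, n_tasks)
    slope = np.polyfit(obs_small, obs_large, 1)[0]
    print(f"per-score noise sd={noise_sd:.2f}  fitted slope={slope:+.2f}  (noise-free slope = +0.50)")
```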
It would be completely arbitrary and clearly motivated only by saving the appearances, but yes, you pretty much need a hard fundamental limit to explain why they are so epiphenomenal despite such cosmic resources. There’s no plausible story where they can manipulate the mortal realm enough to screw with pilots but also we observe the universe exactly as it is.