No disagreements here; I just want to note that if “the EA community” waits too long for such a pivot, AI labs will probably at some point face protests from the general population, since even now a substantial share of the US population views AI progress in a very negative light. Even if these protests don’t accomplish anything directly, they might indirectly affect any future efforts. For example, an EA-run fire alarm might be somewhat compromised because the memetic ground would already be captured. In that case, the concept of “AI risk” would, in the minds of AI researchers, shift from “obscure overconfident hypotheticals of a nerdy philosophy” to “people with different demographics, fewer years of education, and a different political party than us being totally unreasonable over something that we understand far better”.
“AI and Compute” trend isn’t predictive of what is happening
I don’t expect Putin to use your interpretation of “d” instead of his own interpretation of it, which he publicly advertises whenever he gives a big speech on the topic.
From the latest speech:
> In the 80s they had another crisis they solved by “plundering our country”. Now they want to solve their problems by “breaking Russia”.
This directly references an existential threat.
From the speech a week ago:
> The goal of that part of the West is to weaken, divide and ultimately destroy our country. They are saying openly now that in 1991 they managed to split up the Soviet Union and now is the time to do the same to Russia, which must be divided into numerous regions that would be at deadly feud with each other.
Same.
Also, consider nuclear false flags—the frame for them, including in these same speeches, was created and maintained throughout the entire year.
I could imagine that it is too hard for OpenAI to attract top talent at the level needed for their research achievements while also filtering the people they hire by how seriously they take reducing civilization-level risks. Or at least it could easily have been infeasible 4 years ago.
I know a couple of people at DeepMind, and none of them have reducing civilization-level risks as one of their primary motivations for working there; I believe the same is true of most of DeepMind.
Given that the details in generated objects are often right, you can use super-resolution neural models to upscale the images to the needed size.
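A minimal sketch of that step, assuming a PyTorch pipeline; the bicubic `interpolate` call is only a placeholder for a learned super-resolution network (e.g. an ESRGAN-style model), which is what you’d actually want here:

```python
import torch
import torch.nn.functional as F

def upscale(image: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Upscale a (1, C, H, W) image tensor to `factor`x the resolution.

    Bicubic interpolation is just a stand-in; in practice you'd run a
    pretrained super-resolution network here, relying on the generated
    details being plausible enough for the model to sharpen them.
    """
    return F.interpolate(image, scale_factor=factor, mode="bicubic", align_corners=False)

low_res = torch.rand(1, 3, 256, 256)   # stand-in for a generated image
high_res = upscale(low_res)            # -> (1, 3, 1024, 1024)
```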
Are PaLM outputs cherry-picked?
I reread the description of the experiment and I’m still unsure.
The protocol, described on page 37, goes like this:
- the 2-shot exemplars used for few-shot learning were not selected or modified based on model output. I infer this from the line “the full exemplar prompts were written before any examples were evaluated, and were never modified based on the examination of the model output”.
- greedy decoding is used, so they couldn’t filter outputs given a prompt.

What about the queries (the full prompt without the QAQA few-shot part)? Are they included under “the full exemplar prompts” or not? If they are, there’s no output selection; if they aren’t, the outputs could be strongly selected, with the magnitude of the selection unreported. On one hand, “full prompts” should refer to full prompts. On the other hand, they only use “exemplar” when talking about the QAQA part they prepend to every query, versus “evaluated example” meaning the query.
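For concreteness, this is roughly how I read the protocol, as a sketch (the model and exemplar text are stand-ins, since PaLM isn’t publicly available):

```python
# A sketch of the page-37 protocol as I read it: a fixed QAQA exemplar block
# written in advance, prepended to every query, with greedy decoding.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for PaLM
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The 2-shot QAQA exemplars, fixed before any model outputs are examined.
exemplars = (
    "Q: <exemplar question 1>\nA: <exemplar answer 1>\n\n"
    "Q: <exemplar question 2>\nA: <exemplar answer 2>\n\n"
)

def answer(query: str) -> str:
    prompt = exemplars + f"Q: {query}\nA:"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])
```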
My calculation for AlphaStar: 12 agents * 44 days * 24 hours/day * 3600 sec/hour * 420*10^12 FLOP/s * 32 TPUv3 boards * 33% actual board utilization = 2.02 * 10^23 FLOP which is about the same as AlphaGo Zero compute.
For 600B GShard MoE model: 22 TPU core-years = 22 years * 365 days/year * 24 hours/day * 3600 sec/hour * 420*10^12 FLOP/s/TPUv3 board * 0.25 TPU boards / TPU core * 0.33 actual board utilization = 2.4 * 10^22 FLOP.
For 2.3B GShard dense transformer: 235.5 TPU core-years = 2.6 * 10^23 FLOP.
Meena was trained for 30 days on a TPUv3 pod with 2048 cores. So it’s 30 days * 24 hours/day * 3600 sec/hour * 2048 TPUv3 cores * 0.25 TPU boards / TPU core * 420*10^12 FLOP/s/TPUv3 board * 33% actual board utilization = 1.8 * 10^23 FLOP, slightly below AlphaGo Zero.
Image GPT: “iGPT-L was trained for roughly 2500 V100-days”—this means 2500 days * 24 hours/day * 3600 sec/hour * 100*10^12 FLOP/s * 33% actual GPU utilization ≈ 7.1 * 10^21 FLOP. There’s no compute data for the largest model, iGPT-XL. But based on the training-compute increase from GPT-3 XL (same number of params as iGPT-L) to GPT-3 6.7B (same number of params as iGPT-XL), I think it required about 5 times more compute: ≈ 3.6 * 10^22 FLOP.
BigGAN: 2 days * 24 hours/day * 3600 sec/hour * 512 TPU cores * 0.25 TPU boards / TPU core * 420*10^12 FLOP/s/TPUv3 board * 33% actual board utilization = 3 * 10^21 FLOP.
AlphaFold: they say they trained on GPU and not TPU. Assuming V100 GPU, it’s 5 days * 24 hours/day * 3600 sec/hour * 8 V100 GPU * 100*10^12 FLOP/s * 33% actual GPU utilization = 10^20 FLOP.
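For reference, a short script reproducing these estimates; the peak-FLOP/s figures, the 0.25 boards-per-core assumption, and the ~33% utilization are the same assumptions as above:

```python
# Back-of-the-envelope reproduction of the training-compute estimates above.
SEC_PER_DAY = 24 * 3600
SEC_PER_YEAR = 365 * SEC_PER_DAY
TPU_BOARD_FLOPS = 420e12    # assumed peak FLOP/s per TPUv3 board
BOARDS_PER_CORE = 0.25      # assumed boards per TPU "core"
V100_FLOPS = 100e12         # assumed peak FLOP/s per V100
UTIL = 0.33                 # assumed actual utilization

estimates = {
    "AlphaStar":         12 * 44 * SEC_PER_DAY * 32 * TPU_BOARD_FLOPS * UTIL,
    "GShard MoE 600B":   22 * SEC_PER_YEAR * BOARDS_PER_CORE * TPU_BOARD_FLOPS * UTIL,
    "GShard dense 2.3B": 235.5 * SEC_PER_YEAR * BOARDS_PER_CORE * TPU_BOARD_FLOPS * UTIL,
    "Meena":             30 * SEC_PER_DAY * 2048 * BOARDS_PER_CORE * TPU_BOARD_FLOPS * UTIL,
    "iGPT-L":            2500 * SEC_PER_DAY * V100_FLOPS * UTIL,
    "BigGAN":            2 * SEC_PER_DAY * 512 * BOARDS_PER_CORE * TPU_BOARD_FLOPS * UTIL,
    "AlphaFold":         5 * SEC_PER_DAY * 8 * V100_FLOPS * UTIL,
}

for name, flop in estimates.items():
    print(f"{name:>17}: {flop:.1e} FLOP")
```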
gwern has recently remarked that one cause of this is supply-and-demand disruptions, and that this may in principle be a temporary phenomenon.
Papers on protein design
> with the only recursive element of its thought being that it can pass 16 bits to its next running
I would name activations for all previous tokens as the relevant “element of thought” here that gets passed, and this can be gigabytes.
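As a rough check on the “gigabytes” claim, a back-of-the-envelope sketch for a GPT-3-scale decoder (the shape numbers and fp16 storage are assumptions for illustration):

```python
# Rough size of the per-token state (the K/V activations cached for attention)
# for a GPT-3-scale decoder. All figures are assumptions for illustration.
n_layers, d_model = 96, 12288        # GPT-3 175B-like shape
bytes_per_value = 2                  # fp16
context_len = 2048

per_token = 2 * n_layers * d_model * bytes_per_value   # K and V, at every layer
total = per_token * context_len
print(f"{per_token / 1e6:.1f} MB per token, {total / 1e9:.1f} GB for a full context")
# ~4.7 MB per token, ~9.7 GB for 2048 tokens
```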
From how the quote looks, I think his gripe is with the possibility of in-context learning, where human-like learning happens without anything about how the network works (neither its weights nor previous token states) being explicitly updated.
This idea tries to discover translations between the representations of two neural networks, but without necessarily discovering a translation into our representations.
I think this has been under investigation for a few years in the context of model fusion in federated learning, model stitching, and translation between latent representations in general.
Relative representations enable zero-shot latent space communication—an analytical approach to matching representations (though this is new work and may not be that good; I haven’t checked)
Git Re-Basin: Merging Models modulo Permutation Symmetries—recent model stitching work with some nice results
Latent Translation: Crossing Modalities by Bridging Generative Models—some random application of unsupervised translation to translation between autoencoder latent codes (probably not the most representative example)
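As a toy illustration of the simplest version of this kind of matching, here is an orthogonal-Procrustes fit between two sets of activations on the same probe inputs (synthetic data; the papers above use more elaborate schemes):

```python
import numpy as np

# Fit an orthogonal map W that aligns activations of network A to network B
# on the same probe inputs (orthogonal Procrustes). Data here is synthetic.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(1000, 256))                       # net A activations (n, d)
true_rot, _ = np.linalg.qr(rng.normal(size=(256, 256)))     # hidden "translation"
acts_b = acts_a @ true_rot + 0.01 * rng.normal(size=(1000, 256))  # net B ≈ rotated A

# Orthogonal Procrustes solution: W = U V^T from the SVD of A^T B.
u, _, vt = np.linalg.svd(acts_a.T @ acts_b)
w = u @ vt

alignment_error = np.linalg.norm(acts_a @ w - acts_b) / np.linalg.norm(acts_b)
print(f"relative alignment error: {alignment_error:.3f}")
```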
To receive epistemic credit, make sure people can tell that you haven’t made every possible prediction on the topic this way and then revealed only the right one after the fact. You can probably publish plaintext metadata for this.
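One way to do this, as a sketch: publish a hash commitment to the prediction together with plaintext metadata saying which topic it covers and how many commitments you have made on it, so readers can check you aren’t revealing selectively. All names and values below are made up:

```python
import hashlib
import secrets

# Hash commitment: the prediction stays hidden until revealed, while the
# plaintext metadata lets readers count how many commitments exist per topic,
# so you can't quietly make many and reveal only the winner.
prediction = "By 2027-01-01, X will have happened."   # hypothetical content
nonce = secrets.token_hex(16)                          # prevents brute-forcing the text

commitment = hashlib.sha256((nonce + prediction).encode()).hexdigest()

public_record = {
    "topic": "AI timelines",        # plaintext metadata, published now
    "commitments_on_topic": 1,      # "I have made exactly one prediction here"
    "sha256": commitment,
}
print(public_record)
# Later, publish `prediction` and `nonce`; anyone can recompute the hash.
```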
On prior work: they cited X-LXMERT (Sep 2020) and TReCS (Nov 2020) in the blogpost. These seem to be the baselines.
https://arxiv.org/abs/2011.03775
https://arxiv.org/abs/2009.11278

The quality of objects and scenes there is far below the new model. They are often just garbled and don’t look quite right.
But more importantly, the best they could sometimes extract from the text is something like “a zebra is standing in the field”, i.e. the object and the background; all the other information was lost. With this model, you can actually use many more language features for visualization: specifying spatial relations between objects, specifying attributes of objects in a semantically precise way, camera view, time and place, rotation angle, printing text on objects, introducing text-controllable recolorations and reflections. I may be mistaken, but I don’t think I’ve seen any convincing demonstration of these capabilities in open-domain text-to-image generation before.
One evaluation drawback I see is that they haven’t included any generated human images in the blogpost besides busts. Because of this, there’s a chance that scenes with humans are of worse quality, but I think they would nevertheless be very impressive compared to prior work, given how photorealistic everything else looks.
I’m not sure what accounts for this performance, but it may well mostly be more parameters (2-3 orders of magnitude more compared to previous models?) plus more and better data (the new dataset of image-text pairs they used for CLIP?).
Does anyone have a good model of how to reconcile
1) a pretty large psychosis rate in this survey, a bunch of people in https://www.lesswrong.com/posts/MnFqyPLqbiKL8nSR7/my-experience-at-and-around-miri-and-cfar-inspired-by-zoe saying that their friends got mental health issues after using psychedelics, and anecdotal experiences and stories about psychedelic-induced psychosis in the broader culture
and
2) Studies https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3747247/ https://journals.sagepub.com/doi/10.1177/0269881114568039 finding no correlation, or, in some cases, negative correlation between psychedelic consumption and mental health issues?
- Studies done wrong?
- Studies don’t have enough statistical power?
- Something with confounding and Simpson’s paradox? Maybe there’s a particular subgroup of the population where psychedelic use correlates negatively with the likelihood of mental health issues, or a subgroup with both more psychedelic use and a lower average likelihood of mental health issues (see the toy simulation after this list)?
- Psychedelics impart mental well-being and resilience to some people to such a degree that it cancels out the negative mental health effects in other people, so that in expectation psychedelics wouldn’t affect your mental health negatively?
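To illustrate the Simpson’s-paradox option above, a toy simulation with made-up numbers, where use raises risk within each subgroup but the pooled correlation still comes out negative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical subgroups: within each, use *raises* the risk of issues,
# but the high-baseline-risk subgroup uses much less, so the pooled
# correlation comes out negative. All numbers are made up.
n = 50_000
high_baseline = rng.random(n) < 0.5                  # subgroup membership
p_use = np.where(high_baseline, 0.05, 0.40)          # low-risk group uses more
use = rng.random(n) < p_use
p_issue = np.where(high_baseline, 0.30, 0.05) + 0.05 * use  # +5pp from use in both groups
issue = rng.random(n) < p_issue

def corr(x, y):
    return np.corrcoef(x.astype(float), y.astype(float))[0, 1]

print("pooled corr(use, issue):", round(corr(use, issue), 3))                            # negative
print("within high-baseline:  ", round(corr(use[high_baseline], issue[high_baseline]), 3))  # positive
print("within low-baseline:   ", round(corr(use[~high_baseline], issue[~high_baseline]), 3))  # positive
```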
why it is so good in general (GPT-4)
What are the examples indicating it performs at the level you would expect from GPT-4 on complex tasks? Especially performance that is clearly attributable to improvements we expect GPT-4 to make? I looked through a bunch of screenshots but haven’t seen any so far.
These games are really engaging for me and haven’t been named:
Eleven Table Tennis. Ping-pong in VR (+ multiplayer and tournaments).
Racket NX. This one is much easier but you still move around a fair bit. The game is “Use the racket to hit the ball” as well.
Synth Riders. An easier and more chill Beat Saber-like game.
Holopoint. Archery + squats, gets very challenging on later levels.
Some gameplay videos for excellent games that have been named:
Beat Saber. “The VR game”. You can load songs from the community library using mods.
Thrill of the Fight (boxing).
For every token, model activations are computed once when the token is encountered and then never explicitly revised → “only [seems like it] goes in one direction”
Every other day I have a bunch of random questions related to AI safety research pop up but I’m not sure where to ask them. Can you recommend any place where I can send these questions and consistently get at least half of them answered or discussed by people who are also thinking about it a lot? Sort of like an AI safety StackExchange (except there’s no such thing), or a high-volume chat/discord. I initially thought about LW shortform submissions, but it doesn’t really look like people are using the shortform for asking questions at all.
Actually, the Metaculus community prediction has a recency bias:
> approximately sqrt(n) new predictions need to happen in order to substantially change the Community Prediction on a question that already has n players predicting.

In this case n=298, so the prediction should change substantially after sqrt(n)≈17 new predictions (which usually takes up to a few days). Over the past week there were almost this many predictions, the AGI community median has shifted 2043 → 2039, and the 30th percentile is 8 years away.
This is the link to Yudkowsky’s discussion of concept merging with the triangular-lightbulb example: https://intelligence.org/files/LOGI.pdf#page=10
Generated lightbulb images: https://i.imgur.com/EHPwELf.png