NYU PhD student working on AI safety
Jacob Pfau
The above estimate was mistaken, since I had misread 'I then compute the fraction #answers(Yes, Yes, Yes) / #answers(Yes, *, *)' as 'I then compute the fraction #answers(Yes, Yes, Yes) / #answers(Yes, Yes, *)'.
I agree with Ethan's recent comment that experience with RL matters a lot, so a lot comes down to how the 'Is X's work related to AGI?' criterion is cashed out. On some readings of this, many NLP researchers do not count; on others they do. I'd say my previous prediction was a decent, if slight, over-estimate for the scenario in which 'related to AGI' is interpreted narrowly and many NLP researchers are ruled out.
A second major confounder is whether prominent AI researchers are far more likely to have been asked about their opinion on AI safety, in which case they would have some impetus to go read up on the issue.
To cash some of these concerns out into probabilities:
75% that Rohin takes a broad interpretation of AGI which includes, e.g., the GPT team, NAS research, etc.
33% estimated for (Yes, Yes, Yes), assuming prominent researchers are 2x as likely to have read up on AI safety.
25%, downweighted from 33%, taking into account industry being less concerned.
Assuming that we're at ~33% now, 50% doesn't seem too far out of reach, so my estimates for the following decades are based on the same concerns I listed in my above comment, framed with the 33% in mind.
Updated personal distribution: elicited
Updated Rohin’s posterior: elicited
(Post-competition footnote: it seems to me that over short time horizons we should have a more-or-less geometric distribution. Think of the more-or-less independent per-year chance that a NeurIPS keynote features AI safety, or that the YouTube recommender algorithm goes bonkers for a bit. It seems strange to me that some other people's distributions over the next 10-15 years, if not longer, do not look geometric.)
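To illustrate the shape I have in mind, here is a minimal sketch: if some triggering event (a scary warning shot, a safety-focused keynote, etc.) has a roughly constant, independent chance each year, then the year in which it first occurs is geometrically distributed. The per-year probability below is an arbitrary illustrative number, not an actual forecast.

```python
# Minimal sketch of the geometric shape described above.
# p is an assumed, purely illustrative per-year chance of a triggering event.
p = 0.15

for year in range(1, 11):
    pmf = p * (1 - p) ** (year - 1)  # P(first triggering event happens in this year)
    cdf = 1 - (1 - p) ** year        # P(it has happened by the end of this year)
    print(f"year {year:2d}: P(first occurrence) = {pmf:.3f}, cumulative = {cdf:.3f}")
```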
Great post! I am very curious about how people are interpreting Q10 and Q11, and what their models are. What are prototypical examples of ‘insights on a similar level to deep learning’?
Here's a breakdown of examples that come to mind:
Historical DL-level advances:
the development of RL (Q-learning algorithm, etc.)
Original formulation of a single neuron i.e. affine transformation + non-linearity
Future possible DL-level:
a successor to back-prop (e.g. how biological neurons learn)
a successor to the Q-learning family (e.g. neatly generalizing and extending ‘intrinsic motivation’ hacks)
full brain simulation
an alternative to the affine+activation recipe
Below DL-level major advances:
an elegant solution to learn from cross-modal inputs in a self-supervised fashion (babies somehow do it)
a breakthrough in active learning
a generalizable solution to learning disentangled and compositional representations
a solution to adversarial examples
Grey areas:
breakthroughs in neural architecture search
a breakthrough in neural Turing machine-type research
I’d also like to know how people’s thinking fits in with my taxonomy: Are people who leaned yes on Q11 basing their reasoning on the inadequacy of the ‘below DL-level advances’ list, or perhaps on the necessity of the ‘DL-level advances’ list? Or perhaps people interpreted those questions completely differently, and don’t agree with my dividing lines?
One axis along which I'd like clarification is whether you want a form of explanation which is learner-agnostic or learner-specific.
In the learner-agnostic case, I think it will be approximately impossible to succeed at this challenge. In the learner-specific case, I think it will require something more than an interpretability method. This latter task will benefit from better and better models of human learning; in the limit, I imagine something like a direct brain neuralink should do the trick...
On the learner-specific side, it seems to me Nisan is right when he said 'The question is if we can compress the bot's knowledge into, say, a 1-year training program for professionals.' To that end, a relevant method could be an improved version of influence functions: something like finding the point in training when the Go agent learned to make a better move than the pro, and highlighting the games (/moves) which taught it the improved play.
Not sure what the best way to formalize this intuition is, but here's an idea. (To isolate this learner-agnostic/specific axis from the problem of defining explanation, let me assume that we have some metric for quantifying explanation quality, call it 'R', which is a function from <Model, learner, explanation> triples to real values.)
Define learner-agnostic explanation as optimizing for aggregate R across some distribution of learners—finding the one optimal explanation across this distribution. Learner-specific explanation optimizes for R taking the learner as an input—finding multiple optimal explanations, one for each learner.
The aggregation function in the learner-agnostic case could be the mean, or it could be a minimax function. The intuition for the minimax case is that it formalizes the task of coming up with the most accessible explanation possible.
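A minimal sketch of those two definitions in code, purely to pin down the two optimization problems; R, the candidate explanations, and the learners are all hypothetical stand-ins, not an existing interpretability API:

```python
# Sketch of the learner-agnostic vs. learner-specific definitions above.
# R(model, learner, explanation) -> float is the assumed explanation-quality metric.

def learner_agnostic_explanation(model, learners, candidates, R, aggregate=min):
    """One explanation optimizing aggregate R over a distribution of learners.
    aggregate=min gives the minimax variant (optimize for the worst-off learner);
    pass a mean for the mean-aggregation variant."""
    return max(candidates,
               key=lambda e: aggregate(R(model, learner, e) for learner in learners))

def learner_specific_explanation(model, learner, candidates, R):
    """A separate optimal explanation for each individual learner."""
    return max(candidates, key=lambda e: R(model, learner, e))
```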
Things like influence functions, input-sensitivity methods, automated concept discovery are all learner-agnostic. On the other hand, probing methods (e.g. as used in NLP) could maybe be called learner-specific. The variant of influence functions I suggested above is learner-specific.
In general, it seems to me that as models get more and more complex, we'll probably need explanations to be more learner-specific to achieve reasonable performance. Though perhaps learner-agnostic methods will suffice for answering general questions like 'Is my model optimizing for a mesa-objective?'
An Interpretability Illusion for BERT (arXiv:2104.07143) suggests a more complicated picture, wherein many neurons give the impression that they're encoding coherent concepts, but then seem to encode completely different concepts when tested on a different dataset. The two papers are certainly not directly contradictory, but Figure 2 of the illusion paper suggests the opposite of what Figure 5 of the Knowledge Neurons paper suggests. On the other hand, the illusion paper mentions they found tentative evidence for the existence of global concept directions, and perhaps all knowledge neurons are such global concept directions.
Ordered from most to least plausible, possible explanations for this apparent discrepancy include:
Knowledge neurons are more specialized than the average neuron (knowledge neurons are ‘global’)
Dataset choice matters. In particular, ParaRel sentences isolate relations in a way that other datasets don't, helping to identify specialized neurons
Attribution method matters
Layer choice matters (the illusion paper mentions that quick looks at layers 2 and 7 showed similar results; the knowledge neurons paper motivates its layer choice by analogy to key-value pairs)
These don't quite qualify as research film study, but Fields medallist Timothy Gowers has a number of videos in which he records his problem-solving process in detail, e.g. 'Two products that cannot be equal'. From what I can tell, he chooses quite accessible problems. Studying this sort of video might be most analogous to studying how an expert athlete does a drill.
Some users of the Alignment Forum post their work-in-progress ideas on topics. Taken as a sequence, this amounts to something like a paper plus how it was made. Perhaps it would be worth looking back retrospectively and curating sequences which led to significant insight, for study purposes? The closest thing to film study available in one post is probably Commentary on AGI Safety from First Principles on the AI Alignment Forum.
Am I correct to assume that the discussions of StarCraft and Minecraft concern single-player variants of those games?
It seems to me that in a competitive, 2-player, minimize-resource-competition StarCraft, you would want to go kill your opponent so that they could no longer interfere with your resource loss? More generally, I think competitions to minimize resources might still usually involve some sort of power-seeking. I remember reading somewhere that ‘losing chess’ involves normal-looking (power-seeking?) early game moves.
Yes, I agree that in the simplest case, SC2 with default starting resources, you just build one or two units and you’re done. However, I don’t see why this case should be understood as generically explaining the negative alpha weights setting. Seems to me more like a case of an excessively simple game?
Consider the set of games starting with various quantities of resources and negative alpha weights. As starting resources increase, you will be incentivised to go attack your opponent to interfere with their resource depletion. Indeed, if the reward is based on end-of-game resource minimisation, you end up participating in an unbounded resource-maximisation competition trying to guarantee control over your opponent; then you spend your resources safely after crippling them. In the single-player setting, you will be incentivised to build up your infrastructure so as to spend your resources more quickly.
It seems to me the multi-player case involves power-seeking. If so, it seems like negative alpha weights don't generically imply anything about the existence of power-seeking incentives?
(I'm actually not clear on whether the single-player case should be seen as power-seeking or not? Maybe it depends on your choice of discount rate, gamma. You are building up infrastructure, i.e. unit-producing buildings, which seems intuitively power-seeking. But the number of long-term possibilities available to you after spending resources on infrastructure is reduced (assuming gamma=1); OTOH the number of short-term possibilities may be higher given infrastructure, so you may have increased power assuming gamma<1?)
It's worth noting that Table 7 shows Github pre-training outperforming MassiveText (natural language corpus) pre-training. The AlphaCode dataset is 715GB compared to the 10TB of MassiveText (which includes 3TB of Github). I have not read the full details of both cleaning processes, but I assume that the cleaning / de-duplication process is more thorough in the case of the AlphaCode Github-only dataset. EDIT: see also Algon's comment on this below.
I know of a few EAs who thought that natural language pre-training would continue to provide relevant performance increases for coding as training scales up over the next few years, and I see this as strong evidence against that claim. One remaining question might be whether this finding is an artefact of code dataset growth temporarily accelerating relative to NL dataset growth. Insofar as dataset sizes are constrained by compute rather than absolute data availability, I think we should expect code dataset sizes to approach NL dataset sizes. Also cf. my recent Metaculus question on future prospects for NL-pretraining-to-programming transfer.
I agree that the Scaling Laws for Transfer paper already strongly suggested that pre-training would eventually not provide much in terms of performance gain. I remember doing a back-of-the-envelope calculation for whether 2025 would still use pre-training (and finding it wouldn't improve performance), but I certainly didn't expect us to reach this point in early 2022. I also had some small but significant uncertainty regarding how well the scaling laws result would hold up when switching dataset, model, and model size, and so the AlphaCode data point is useful in that regard as well.
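For concreteness, here is roughly what such a back-of-the-envelope might look like, using the 'effective data transferred' functional form from the Scaling Laws for Transfer paper, D_T ≈ k · D_F^α · N^β. The constants and the example dataset/model sizes below are placeholders from my rough recollection of their text-to-code setting, not authoritative values; read them off the paper before trusting the output.

```python
# Hedged back-of-the-envelope in the spirit of the one described above.
# Functional form from the Scaling Laws for Transfer paper: D_T ~ k * D_F**alpha * N**beta.
# k, alpha, beta and the example sizes are rough placeholders, not authoritative numbers.

def effective_data_transferred(D_F, N, k=1.9e4, alpha=0.18, beta=0.38):
    """Approximate extra fine-tuning data that language pre-training is 'worth'."""
    return k * D_F**alpha * N**beta

D_F = 5e11  # assumed size of the code fine-tuning set (tokens)
N = 1e11    # assumed parameter count
ratio = effective_data_transferred(D_F, N) / D_F
print(f"pre-training worth ~{100 * ratio:.0f}% extra fine-tuning data")  # small ratio => little gain
```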
As for the point on accelerating training, this makes intuitive sense to me, but it's not clear to me how relevant it is? Figure 7 of the Scaling Laws for Transfer paper shows that the compute needed to plateau on their largest models with and without pre-training looks to be within an OOM.
Yes, I agree that, certainly at 2025 training-run prices, saving 2-5x on a compute run will be done whenever possible. For this reason, I'd like to see more predictions on my Metaculus question!
Here's one way of thinking about sleep which seems compatible with both the less-sleep-needed thesis and the lower-productivity-while-deprived observation: some minimal amount of sleep plays a metabolic / cognitive role, and beyond this amount, additional hours of sleep were useful in the evolutionary context to save calories when the additional wakeful hours would not have paid off.
If true, we'd expect there to be a more-or-less fixed function from sleep quantity to sleepiness within the very-low-sleep range, but in the mid-sleep (5-8 hr?) range this function from quantity to sleepiness would be entirely mediated by stimulation. Stimulation here could mean physical exercise, but I expect excitement / anticipation are also very relevant; in an evolutionary context such feelings signal a higher payoff for wakefulness.
The importance of such a perspective is that reducing sleep quantity would be possible only conditional on the upstream stimulation/excitement variable. Elon, Guzey, and highly motivated or active people would all have an easier time avoiding unpleasant struggles to overcome sleepiness. If you are not highly motivated / excited by a given day's activities, there are a few possible implications: 1) simply assume greater sleepiness and give up on economising sleep; 2) try intermittent physical exercise, e.g. periodically doing some squats; 3) deliberately schedule things which you find exciting, or schedule something for late in the day to generate anticipation.
I would be interested to see data on this idea, either by testing strategies (2) and (3) in a psychological study, or by comparing sleep patterns in hunter societies as they vary across time (as a function of hunting opportunity). I think there’s already decent support for this explanation, since it explains the discrepancy between the Elon/Guzey/Sailors anecdata and the fact that most people aren’t happy about missing two hours of sleep. The story also seems to fit well with the depression-treatment result. To make this point clear, one way of putting things is that there’s some excitement/looking-forwardness state which overlaps with the low-sleep state.
Here’s Chalmers defending his combinatorial state automata idea.
I once had a multiple day hospitalization following use of modafinil (to prevent jetlag) during a flight—checkups found no clear cause. This is obviously N=1, but still makes me wonder if there’s some adverse interaction between modafinil and pressure changes. Would be interested if anyone has had similar experiences and/or knows of a relevant mechanism.
Sleeping 22 hours a day for 2-3 days pre-admission and fever. I think the presumption was those sorts of symptoms merit careful investigation. Don’t remember if there were any particular test results that were remarkable. IIRC there weren’t.
Seeing as the modafinil was not prescription, and I’ve never heard of similar symptoms from others, it’s quite plausible my pills were just contaminated with some other substance. Still should probably update against taking modafinil without prescription, since this contamination risk is just as important as side-effect symptoms.
That certainly sounds scary, but seems unlikely in my case. No tox screen, but also did not buy locally in Berkeley, and had previously used the pills without problem.
A quadratic funding mechanism (similar to Gitcoin) could make sense for putting up distillation bounties. Quadratic funding (QF) lets a grant-maker put up a pool of matching funds while individual researchers specify how valuable each individual bounty would be to them; the matching is then done via the QF strategy to optimize for aggregate researcher utility. Speaking for myself, I would contribute to a community fund for further distillations, and I would also be more likely to distill.
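For concreteness, here is a minimal sketch of the standard quadratic funding rule; the bounty names and pledge amounts are made up for illustration:

```python
import math

# Standard quadratic funding rule: the ideal total for a project is
# (sum of square roots of individual pledges)^2; the matching pool covers the gap.
# Bounty names and amounts below are invented for the example.

def qf_total(pledges):
    return sum(math.sqrt(p) for p in pledges) ** 2

bounties = {
    "Infra-Bayesianism distillation": [20, 20, 5],
    "Deep-net convergence distillation": [50],
}
for name, pledges in bounties.items():
    total = qf_total(pledges)
    match = total - sum(pledges)  # in practice scaled down to fit the matching pool
    print(f"{name}: pledged ${sum(pledges)}, QF total ${total:.0f}, match ${match:.0f}")
```

Note that several small pledges attract more matching than a single large pledge of the same total, which is what pushes the allocation toward aggregate researcher preferences rather than the preferences of one big funder.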
I find the level of distillation done by Daniel Filan at AXRP to be great. Short and listenable enough for easy access while detailed enough to let you form opinions/directions for further research.
Some IMO valuable targets for distillation: Infra-Bayesianism, and recent work on why deep nets converge, e.g. 'Gradient Descent Finds Global Minima of Deep Neural Networks', 'Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers', and 'A Theoretical Analysis of Deep Q-Learning'.
How exactly does reward relate to valenced states in humans? In general, what gives rise to pleasure and pain, in addition to (or instead of) the processing of reward signals?
These problems seem important and tractable even if working out the full computational theory of valence might not be. We can distinguish three questions:
What is the high-level functional role of valence? (coarse-grained functionalism)
What evolutionary pressures incentivized valenced experience?
What computational processes constitute valence? (fine-grained functionalism)
Answering #3 would be best, but it seems to me that answering #1 and #2 is far more feasible. A promising and realistic scenario might be discovering a distinction between positive and negative valence from perspectives #1 and #2, and then giving the DeepMind presentation encouraging them to avoid the coarse-grained functional structures and incentives for negative valence. From my incomplete understanding of the consciousness and valence literature, it seems to me that almost all work is contributing to answering question #1, not question #3.
One avenue in this direction might be looking into the interaction between valence and attention. It seems to me that there is an asymmetry there (or at least a canonical way of fixing a zero point). Positive valence involves attention concentration whereas negative valence involves diffusion of attention / searching for ways to end this experience. A couple reasons why I’m optimistic about this direction: First, attention likely bears some intrinsic connection with consciousness (other coarse-grained functional correlates such as commensurability, addiction etc. need not); second, attention manipulation seems like it might be formalizable in a way relevant for machine learning practitioners. (I’m using attention here in the philosophy/neuro sense not the transformer sense)
Valence is of course a result of evolution. If we can identify precisely what evolutionary pressures incentivize valence, we can take an outside (non-anthropomorphizing, non-xenomorphizing) view: applying Laplace’s rule gives us a 2⁄3 chance that AI developed with similar incentives will also experience valence?
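Spelling out the 2/3 figure: Laplace's rule of succession with one observed 'trial' (biological evolution) and one 'success' (valence arose) gives (1+1)/(1+2) = 2/3. A one-line check:

```python
# Laplace's rule of succession: P(success on next trial) = (s + 1) / (n + 2).
# Here n = 1 observed trial (biological evolution), s = 1 success (valence arose).
def laplace_rule(successes, trials):
    return (successes + 1) / (trials + 2)

print(laplace_rule(1, 1))  # 0.666..., the ~2/3 chance mentioned above
```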
My old prediction for when the fraction will be >= 0.5: elicited
My old prediction for Rohin’s posterior: elicited
I went through the top-20 list of most-cited AI researchers on Google Scholar (thanks to Amanda for linking), and estimated that roughly 9 of them may qualify under Rohin's criterion. Of those 9, my guess was that 7/9 would answer 'Yes' on Rohin's question 3.
My sampling process was certainly biased. For one, these AI researchers are likely to be more safety conscious than industry experts. My estimate also involved considerable guesswork, so I down-weighted the estimated 7/9 to a 65% chance that the >=0.5 threshold will be met within the first couple of years. Given the extreme difference between my distribution and the others posted, I guess there's a 1/3 chance that my estimate based on the top-20 sampling will carry significant weight in Rohin's posterior.
The justification for the rest of my distribution is similar to what others have said here and elsewhere about AI safety. My AGI timeline is roughly in line with the Metaculus estimate here. Before the advent of AGI, a number of eventualities are possible: perhaps a warning shot occurs, perhaps theoretical consensus emerges, perhaps industry researchers remain oblivious to safety concerns because of a principal-agent nature to the problem, perhaps AGI is invented before safety is worked out, etc.
Edit: One could certainly do a better job of estimating where the sample population of researchers currently stands by finding a less biased population. Maybe people interviewed by Lex Fridman; that might be a decent proxy for AGI-research fame?