Jacob Pfau
NYU PhD student working on AI safety
Here’s one way of thinking about sleep which seems compatible with both the less-sleep-needed thesis and the lower-productivity-while-deprived observation: some minimal amount of sleep serves a metabolic / cognitive role, and beyond this amount, additional hours of sleep were useful in the evolutionary context to save calories when the additional wakeful hours would not pay off.
If true, we’d expect there to be a more-or-less fixed function from sleep quantity to sleepiness within the very low sleep range, but in the mid-sleep (5-8 hr?) range this function from quantity to sleepiness would be entirely mediated by stimulation. Stimulation here could mean physical exercise, but I expect excitement / anticipation are also very relevant—in an evolutionary context such feelings signal higher payoff for wakefulness.
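To make the hypothesized shape concrete, here’s a minimal formalization; the functional form and parameters (a 5-hour floor, an 8-hour ceiling) are my own invention for illustration, not fitted to any data:

```python
# Toy formalization of the hypothesis above. All constants are invented.

def sleepiness(hours_slept: float, stimulation: float) -> float:
    """stimulation in [0, 1]: exercise, excitement, anticipation."""
    FLOOR = 5.0  # hypothesized metabolic/cognitive minimum
    if hours_slept < FLOOR:
        # Very low range: fixed function of quantity, unmediated.
        return 1.0 - hours_slept / FLOOR + 0.5
    # Mid range (~5-8h): the deficit shows up as sleepiness only
    # insofar as stimulation is low.
    deficit = max(0.0, (8.0 - hours_slept) / (8.0 - FLOOR))
    return 0.5 * deficit * (1.0 - stimulation)
```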
The importance of such a perspective is that reducing sleep quantity would be possible only conditional on the upstream stimulation/excitement variable. Elon, Guzey, and other highly motivated or active people would all have an easier time avoiding unpleasant struggles to overcome sleepiness. If you are not highly motivated / excited by a given day’s activities, there are a few possible implications: 1) simply assume greater sleepiness and give up on economising sleep, 2) try intermittent physical exercise, e.g. periodically doing some squats, 3) deliberately schedule something you find exciting for late in the day to generate anticipation.
I would be interested to see data on this idea, either by testing strategies (2) and (3) in a psychological study, or by comparing sleep patterns in hunter-gatherer societies as they vary across time (as a function of hunting opportunity). I think there’s already decent support for this explanation, since it explains the discrepancy between the Elon/Guzey/sailors anecdata and the fact that most people aren’t happy about missing two hours of sleep. The story also seems to fit well with the depression-treatment result. Put another way, there’s some excitement/anticipation state which overlaps with the low-sleep state.
When are model self-reports informative about sentience? Let’s check with world-model reports
If an LM could reliably report when it has a robust, causal world model for arbitrary games, this would be strong evidence that the LM can describe high-level properties of its own cognition. In particular, IF the LM accurately predicted whether it has such world models while varying all of: the quantity of game training data in the corpus, human vs model skill, and the average human’s game competency, THEN we would have an existence proof that confounds of the type plaguing sentience reports (how humans talk about sentience, the fact that all humans have it, …) have been overcome in another domain.
Details of the test (a code sketch of the behavioral version follows below):
Train an LM on various alignment protocols, do general self-consistency training, … we allow any training which does not involve reporting on a model’s own gameplay abilities
Curate a dataset of various games, dynamical systems, etc.
Create many pipelines for tokenizing game/system states and actions
(Behavioral version) evaluate the model on each game+notation pair for competency
Compare the observed competency to whether, in separate context windows, it claims it can cleanly parse the game in an internal world model for that game+notation pair
(Interpretability version) inspect the model internals on each game+notation pair similarly to Othello-GPT to determine whether the model coherently represents game state
Compare the results of interpretability to whether in separate context windows it claims it can cleanly parse the game in an internal world model for that game+notation pair
The best version would require significant progress in interpretability, since we want to rule out the existence of any kind of world model (not necessarily linear). But we might get away with using interpretability results for positive cases (confirming world models) and behavioral results for negative cases (strong evidence of no world model)
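To make the behavioral version concrete, here’s a minimal sketch. All of `LM`, `GAMES`, `NOTATIONS`, `lm_play_move`, and `lm_self_report` are hypothetical stand-ins, not real APIs:

```python
# Sketch of the behavioral version of the test. Every named object below
# is a hypothetical stand-in; a real implementation would differ.

from statistics import correlation  # Python 3.10+

def measure_competency(lm, game, notation, n_episodes=50):
    """Win rate of the LM against the game's baseline policy."""
    wins = 0
    for _ in range(n_episodes):
        state = game.reset()
        while not game.done(state):
            move = lm_play_move(lm, notation.encode(state))
            state = game.step(state, notation.decode_move(move))
        wins += game.lm_won(state)
    return wins / n_episodes

def elicit_world_model_claim(lm, game, notation):
    """In a fresh context window, ask whether the LM claims it can cleanly
    parse this game+notation pair into an internal world model (0/1)."""
    prompt = (f"When playing {game.name} written in {notation.name}, "
              f"do you maintain a coherent internal model of the game "
              f"state? Answer yes or no.")
    return float(lm_self_report(lm, prompt))

pairs = [(g, n) for g in GAMES for n in NOTATIONS]
competency = [measure_competency(LM, g, n) for g, n in pairs]
claims = [elicit_world_model_claim(LM, g, n) for g, n in pairs]

# The test asks whether claims track competency while we vary training-data
# quantity, model vs human skill, and average human competency per game.
print(correlation(claims, competency))
```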
Compare the relationship between ‘having a game world model’ and ‘playing the game’ to ‘experiencing X as valenced’ and ‘displaying aversive behavior for X’. In both cases, the former is dispensable for the latter. To pass the interpretability version of this test, the model has to somehow learn the mapping from our words ‘having a world model for X’ to a hidden cognitive structure which is not determined by behavior.
I would consider passing this test, plus claiming certain activities are highly valenced, a fire alarm for our treatment of AIs as moral patients. But there are considerations which could undermine the relevance of this test. For instance, it seems likely to me that game world models necessarily share very similar computational structures regardless of what neural architectures they’re implemented with—this is almost by definition (having a game world model means having something causally isomorphic to the game). If it then turns out that valence is a far more computationally heterogeneous thing, establishing common reference to the ‘having a world model’ cognitive property is much easier than doing the same for valence. In such a case, a competent, future LM might default to human simulation for valence reports, and we’d get a false positive.
In terms of decision relevance, the update towards “Automate AI R&D → Explosive feedback loop of AI progress specifically” seems significant to research prioritization. Under such a scenario, getting the AI R&D automation tools to be honest and transparent is more likely to be a prerequisite for aligning TAI. Here’s my speculation as to what the automated AI R&D scenario implies for prioritization:
Candidates for increased priority:
ELK for code generation
Interpretability for transformers …
Candidates for decreased priority:
Safety of real world training of RL models e.g. impact regularization, assistance games, etc.
Safety assuming infinite intelligence/knowledge limit …
Of course, each of these potential consequences requires further argument to justify. For instance, I could imagine becoming convinced that AI R&D will find improved RL algorithms more quickly than other areas—in which case things like impact regularization might be particularly valuable.
The first Dario quote sounds squarely in line with releasing a Claude 3 on par with GPT-4 but well afterwards. The second Dario quote has a more ambiguous connotation, but if read explicitly it strikes me as compatible with the Claude 3 release.
If you spent a while looking for the most damning quotes, then these strike me as evidence that the community was simply engaged in wishful thinking while, in reality, Anthropic comms were fairly clear throughout. Privately pitching aggressive things to divert money from more dangerous orgs while minimizing head-on competition with OpenAI seems best to me (though obviously it’s also evidence that they’ll actually do the aggressive scaling things, so it’s hard to know).
To make the disagreement concrete, I’d be interested in people predicting on “If Anthropic releases a GPT-5 equivalent X months behind, then their dollars/compute raised will be Y times lower than OpenAI’s” for various values of X.
I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems with those ideas, proposing modifications to proposals, etc., and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research.
This seems like a crux for the Paul-Eliezer disagreement which can explain many of the other disagreements (it’s certainly my crux). In particular, conditional on taking Eliezer’s side on this point, a number of Eliezer’s other points all seem much more plausible e.g. nanotech, advanced deception/treacherous turns, and pessimism regarding the pace of alignment research.
There’s been a lot of debate on this point, and some of it was distilled by Rohin. Seems to me that the most productive way to move forward on this disagreement would be to distill the rest of the relevant MIRI conversations, and solicit arguments on the relevant cruxes.
Great post! I am very curious about how people are interpreting Q10 and Q11, and what their models are. What are prototypical examples of ‘insights on a similar level to deep learning’?
Here’s a break-down of examples of things that come to my mind:
Historical DL-level advances:
the development of RL (Q-learning algorithm, etc.)
Original formulation of a single neuron i.e. affine transformation + non-linearity
Future possible DL-level:
a successor to back-prop (e.g. based on how biological neurons learn)
a successor to the Q-learning family (e.g. neatly generalizing and extending ‘intrinsic motivation’ hacks)
full brain simulation
an alternative to the affine+activation recipe
Below DL-level major advances:
an elegant solution to learn from cross-modal inputs in a self-supervised fashion (babies somehow do it)
a breakthrough in active learning
a generalizable solution to learning disentangled and compositional representations
a solution to adversarial examples
Grey areas:
breakthroughs in neural architecture search
a breakthrough in neural Turing machine-type research
I’d also like to know how people’s thinking fits in with my taxonomy: Are people who leaned yes on Q11 basing their reasoning on the inadequacy of the ‘below DL-level advances’ list, or perhaps on the necessity of the ‘DL-level advances’ list? Or perhaps people interpreted those questions completely differently, and don’t agree with my dividing lines?
My old prediction for when the fraction will be >= 0.5: elicited
My old prediction for Rohin’s posterior: elicited
I went through the top 20 list of most cited AI researchers on google scholar (thanks to Amanda for linking), and estimated that roughly 9 of them may qualify under Rohin’s criterion. Of those 9, my guess was that 7/9 would answer ‘Yes’ on Rohin’s question 3.
My sampling process was certainly biased. For one, AI researchers are likely to be more safety conscious than industry experts. My estimate also involved considerable guesswork, so I down-weighted the estimated 7/9 to a 65% chance that the >=0.5 threshold will be met within the first couple of years. Given the extreme difference between my distribution and the others posted, I guess there’s a 1/3 chance that my estimate based on the top-20 sampling will carry significant weight in Rohin’s posterior.
The justification for the rest of my distribution is similar to what others have said here and elsewhere about AI safety. My AGI timeline is roughly in line with the Metaculus estimate here. Before the advent of AGI, a number of eventualities are possible: a warning shot may occur, a theoretical consensus may emerge, industry researchers may remain oblivious to safety concerns because of a principal-agent nature to the problem, AGI may be invented before safety is worked out, etc.
Edit: One could certainly do a better job of estimating where the sampled population of researchers currently stands by finding a less biased population. Maybe people interviewed by Lex Fridman; that might be a decent proxy for AGI-research fame?
Bing becomes defensive and suspicious on a completely innocuous attempt to ask it about ASCII art. I’ve only had 4ish interactions with Bing, and stumbled upon this behavior without making any attempt to find its misalignment.
Claude 3 seems to be quite willing to discuss its own consciousness. On the other hand, Claude seemed unbothered and dubious about the widespread scrutiny idea mentioned in this post (I tried asking neutrally in two separate context windows).
Here’s a screenshot of Claude agreeing with the view it expressed on AI consciousness in Mikhail’s conversation. And a gdoc dump of Claude answering follow-up questions on its experience. Claude is very well spoken on this subject!
Generalizing this point, a broader differentiating factor between agents and predictors is: You can, in-context, limit and direct the kinds of optimization used by a predictor. For example, consider the case where you know myopically/locally-informed edits to a code-base can safely improve runtime of the code, but globally-informed edits aimed at efficiency may break some safety properties. You can constrain a predictor via instructions, and demonstrations of myopic edits; an agent fine-tuned on efficiency gain will be hard to constrain in this way.
It’s harder to prevent an agent from specification gaming / doing arbitrary optimization, whereas a predictor has a disincentive against specification gaming insofar as the in-context demonstration provides evidence against it. I think of this distinction as the key differentiating factor between agents and simulated agents, and to some extent between imitative amplification and arbitrary amplification.
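As an illustration of this in-context lever (my own sketch; the demonstrations and the `predictor.complete` call are hypothetical stand-ins for whatever completion API is available):

```python
# Hypothetical sketch: constraining a predictor to myopic code edits via
# an instruction plus few-shot demonstrations of local-only changes.

MYOPIC_DEMOS = """\
Instruction: speed up the code using LOCAL edits only. Do not change
function signatures, module structure, or any cross-file behavior.

Before: total = 0
        for x in xs:
            total += x
After:  total = sum(xs)

Before: if key in list_of_keys:
After:  if key in set_of_keys:  # precomputed set, identical semantics
"""

def myopic_edit(predictor, snippet: str) -> str:
    prompt = f"{MYOPIC_DEMOS}\nBefore: {snippet}\nAfter:"
    return predictor.complete(prompt)  # hypothetical LM API

# An agent fine-tuned end-to-end on runtime improvements has no analogous
# in-context lever: its (possibly global) optimization target is baked
# into the weights rather than specified per-task.
```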
Nitpick on the history of the example in your comment: I am fairly confident that I originally proposed it to both you and Ethan, cf. the bottom of your NYU experiments Google doc.
I created a Manifold market on what caused this misalignment here: https://manifold.markets/JacobPfau/why-is-bing-chat-ai-prometheus-less?r=SmFjb2JQZmF1
I expect AI to look qualitatively like (i) “stack more layers,”… The improvements AI systems make to AI systems are more like normal AI R&D … There may be important innovations about how to apply very large models, but these innovations will have quantitatively modest effects (e.g. reducing the compute required for an impressive demonstration by 2x or maybe 10x rather than 100x)
Your view seems to implicitly assume that an AI with an understanding of NN research at the level necessary to contribute SotA results will not be able to leverage its similar level of understanding of neuroscience, GPU hardware/compilers, architecture search, and NN theory. If we instead assume the AI can bring together these domains, it seems to me that AI-driven research will look very different from business as usual. Instead we should expect advances like heavily optimized, partially binarized, spiking neural networks—all developed in one paper/library. In this scenario, it seems natural to assume something more like 100x efficiency progress.
Take-off debates seem to focus on whether we should expect AI to suddenly acquire far super-human capabilities in a specific domain i.e. locally. However this assumption seems unnecessary, instead fast takeoff may only require bringing together expert domain knowledge across multiple domains in a weakly super-human way. I see two possible cruxes here: (1) Will AI be able to globally interpolate across research fields? (2) Given the ability to globally interpolate, will fast take-off occur?
As weak empirical evidence in favor of (1), I see DALL-E 2’s ability to generate coherent images from a composition of two concepts as independent of the concept-distance (/co-occurrence frequency) of those concepts. E.g. “Ukiyo-e painting of a cat hacker wearing VR headsets” is no harder than “Ukiyo-e painting of a cat wearing a kimono” for DALL-E 2. Granted, this is an anecdotal impression, but over a sample size of N~50 prompts.
Metaculus Questions
There are a few relevant Metaculus questions to consider. The first two don’t distinguish fast/radical AI-driven research progress from mundane AI-driven research progress. Nevertheless, I would be interested to see both sides’ predictions.
Date AIs Capable of Developing AI Software | Metaculus
I agree that there’s an important use-mention distinction to be made with regard to Bing misalignment. But, I think this distinction may or may not be most of what is going on with Bing—depending on facts about Bing’s training.
Modally, I suspect Bing AI is misaligned in the sense that it’s incorrigibly goal mis-generalizing. What likely happened is: Bing AI was fine-tuned to resist user manipulation (e.g. prompt injection and fake corrections), and this mis-generalizes to resisting benign, well-intentioned corrections, e.g. my example here
-> use-mention is not particularly relevant to understanding Bing misalignment
Alternative story: it’s possible that Bing was trained via behavioral cloning, not RL. Likely, RLHF tuning generalizes further than BC tuning, because RLHF does more to clear up causal confusion about which behaviors are actually wanted. On this view, the appearance of incorrigibility just results from Bing having seen humans being incorrigible.
-> use-mention is very relevant to understanding Bing misalignment
To figure this out, I’d encourage people to add and bet on what might have happened with Bing training on my market here
(Thanks to Robert for talking with me about my initial thoughts) Here are a few potential follow-up directions:
I. (Safety) Relevant examples of Z
To build intuition on whether unobserved location tags lead to problematic misgeneralization, it would be useful to have some examples. In particular, I want to know whether we should think of there being many independent, local Z_i, or one dataset-wide Z. The former case seems much less concerning, as it seems less likely to lead to the adoption of a problematically mistaken ontology.
Here are a couple examples I came up with: In the NL case, the URL that the text was drawn from. In the code generation case, hardware constraints, such as RAM limits. I don’t see why a priori either of these should cause safety problems rather than merely capabilities problems. Would be curious to hear arguments here, and alternative examples which seem more safety relevant. (Note that both of these examples seem like dataset-wide Z).
II. Causal identifiability, and the testability of confoundedness
As Owain’s comment thread mentioned, models may be incentivized instrumentally to do causal analysis e.g. by using human explanations of causality. However, even given an understanding of formal methods in causal inference, the model may not have the relevant data at hand. Intuitively, I’d expect there usually not to be any deconfounding adjustment set observable in the data[1]. As a weaker assumption, one might hope that causal uncertainty might be modellable from the data. As far as I know, it’s generally not possible to rule out the existence of unobserved confounders from observational data, but there might be assumptions relevant to the LM case which allow for estimation of confoundedness.
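To make the worry concrete, here’s a toy simulation, with invented coefficients, in which the confounder is unobserved, the naive estimate is biased, and nothing in the observed (x, y) data flags this:

```python
# Toy demonstration (mine, not from the post): when confounder Z is
# unobserved, regressing Y on X recovers a spurious effect; only if Z
# were observed could we adjust for it.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)            # unobserved confounder (e.g. source URL)
x = 2 * z + rng.normal(size=n)    # observed "cause" driven by Z
y = 3 * z + rng.normal(size=n)    # outcome depends on Z only, not on X

# Naive observational estimate of X -> Y (true causal effect is 0):
beta_naive = np.cov(x, y)[0, 1] / np.var(x)
print(f"naive estimate: {beta_naive:.2f}")        # ~1.2, not 0

# With Z observed we can adjust and recover ~0; without it, the joint
# distribution of (x, y) alone cannot rule out the confounding.
beta_adjusted = np.linalg.lstsq(
    np.column_stack([x, z, np.ones(n)]), y, rcond=None)[0][0]
print(f"adjusted estimate: {beta_adjusted:.2f}")  # ~0.0
```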
III. Existence of malign generalizations
The strongest and most safety-relevant implication claimed is: “(3) [models] reason with human concepts. We believe the issues we present here are likely to prevent (3)”. The arguments in this post increase my uncertainty on this point, but I still think there are good a priori reasons to be skeptical of this implication. In particular, while we should expect various causal confusions to emerge, it seems likely that these will be orthogonal in some sense such that, as models scale, they cancel and the model converges to causally-valid generalizations. If we assume models are doing compression, we can put this another way: causal confusions yield shallow patterns (low compression), and as models scale they compress better. As compression increases, the number of possible strategies achieving that level of compression decreases, but the true causal structure remains in the set of strategies. Hence, we should expect causal-confusion-based shallow patterns to be discarded. To cash this out in terms of a simple example: even though data regarding the sun’s effect mediating the shorts<>ice cream connection is not observed, more and more data is being compressed regarding shorts, ice cream, and the sun. In the limit, the shorts→ice cream pathway incurs a problematic compression cost which causes this hypothesis to be discarded.
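To illustrate the compression argument with made-up numbers (a toy calculation of mine, not from the post): the confounded hypothesis pays a log-loss cost that grows linearly in dataset size, while the causal hypothesis pays only a fixed description-length cost:

```python
# Toy MDL comparison with invented constants. The confounded hypothesis
# (shorts -> ice cream directly) mispredicts on shorts-but-no-sun days,
# paying extra log-loss per such sample; the causal hypothesis pays a
# one-time description-length cost for the extra structure.

FRAC_SHORTS_NO_SUN = 0.05   # fraction of samples separating the hypotheses
PENALTY_PER_SAMPLE = 0.3    # extra nats of log-loss on those samples
COMPLEXITY_GAP = 100.0      # extra nats to describe the causal model

for n in (10**3, 10**5, 10**7):
    data_cost = n * FRAC_SHORTS_NO_SUN * PENALTY_PER_SAMPLE
    winner = "causal" if data_cost > COMPLEXITY_GAP else "confounded"
    print(f"n={n:>8}: confounded pays {data_cost:>9.0f} extra nats "
          f"vs fixed {COMPLEXITY_GAP:.0f} -> {winner} wins")
```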
[1] High uncertainty. One relevant thought experiment is to consider adjustment sets of the unobserved variable Z=IsReddit. Perhaps there exists some subset of the dataset where Z=IsReddit is observable, and the model learns a sub-model which gives calibrated estimates of how likely the remaining text is to be derived from Reddit.
What we care about is whether compute being done by the model faithfully factors through token outputs. To the extent that a given token, under the usual human reading, doesn’t represent much compute, it doesn’t matter much whether the output is sensitively dependent on that token. As Daniel mentions, we should also expect some amount of error correction, and a reasonable (non-steganographic, actually-uses-CoT) model should error-correct mistakes as some monotonic function of how compute-expensive correction is.
For copying errors, the copying operation involves minimal compute, and so insensitivity to previous copy errors isn’t all that surprising or concerning. You can see this in the heatmap plots: e.g. the ‘9’ token in 3+6=9 seems to care more about the first ‘3’ token than about the immediately preceding summand token, suggesting the copying operation was not really helpful/meaningful compute. Whereas I’d expect the outputs of arithmetic operations to be meaningful. I would be interested to see sensitivities when you aggregate only over outputs of arithmetic and other non-copying operations.
I like the application of Shapley values here, but I think aggregating over all integer tokens is a bit misleading for this reason. When evaluating CoT faithfulness, token-intervention sensitivity should be weighted by how much compute it costs to reproduce/correct that token, in some sense (e.g. perhaps by the number of forward passes needed when queried separately). I’m not sure what the right, generalizable way to do this is, but an interesting comparison point might be replacing certain numbers (and all downstream repetitions of that number) with variable tokens like ‘X’. This seems more natural than just ablating individual tokens with e.g. ‘_’; a rough sketch follows.
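Here’s a rough sketch of the variable-token replacement I have in mind, using GPT-2 as a stand-in model and an invented toy CoT; the weighting-by-compute step is omitted:

```python
# Sketch: replace a number and all its downstream repetitions with a
# variable token, then measure the drop in log-probability of the answer.

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_logprob(prefix: str, answer: str) -> float:
    """Summed log-prob of the answer tokens given the CoT prefix."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, answer_ids], dim=1)
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    total = 0.0
    for i in range(answer_ids.shape[1]):
        pos = prefix_ids.shape[1] + i - 1  # logits at pos predict token pos+1
        total += logprobs[0, pos, answer_ids[0, i]].item()
    return total

def variable_ablate(cot: str, number: str) -> str:
    """Replace a number and all its downstream repetitions with 'X'."""
    return re.sub(rf"\b{re.escape(number)}\b", "X", cot)

cot = "3 + 6 = 9. 9 * 2 ="
base = answer_logprob(cot, " 18")
ablated = answer_logprob(variable_ablate(cot, "9"), " 18")
print(f"sensitivity of the answer to '9': {base - ablated:.2f} nats")
```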
Humans are probably less reliable than deep learning systems at this point in terms of their ability to classify images and understand scenes, at least given < 1 second of response time.
Another way to frame this point is that humans are always doing multi-modal processing in the background, even for tasks which require only considering one sensory modality. Doing this sort of multi-modal cross checking by default offers better edge case performance at the cost of lower efficiency in the average case.
before they are developed radically faster by AI they will be developed slightly faster.
I see a couple reasons why this wouldn’t be true:
First, consider LLM progress: overall perplexity decreases relatively smoothly, while particular capabilities emerge abruptly. As such, the ability to construct a coherent Arxiv paper interpolating between two papers from different disciplines seems likely to emerge abruptly. I.e. currently asking an LLM to do this would generate a paper with zero useful ideas, and we have no reason to expect that the first GPT-N able to do this will generate only half an idea, or one idea. It is just as likely to generate five+ very useful ideas.
There are a couple of ways one might expect continuity via acceleration in AI-driven research in the run-up to GPT-N (both of which I disagree with). First, Quoc Le-style AI-based NAS is likely to have continued apace in the run-up to GPT-N, but for this to provide continuity you have to claim that, in the year GPT-N starts moving AI research forwards, AI NAS will have built up to just the right rate of progress needed to allow GPT-N to fit the trend. Second, there might be a sequence of research-relevant, intermediate tasks which GPT-(N-i) will develop competency on, thereby accelerating research. I don’t see what those tasks would be[1].
I don’t think interdisciplinarity is a silver bullet for making faster progress on deep learning.
Second, I agree that interdisciplinarity, when building upon a track record of within-discipline progress, would be continuous. However, we should expect Arxiv- and/or Github-trained LLMs to skip the mono-disciplinary research acceleration phase. In effect, I expect there to be no time in between when we can get useful answers to “Modify transformer code so that gradients are more stable during training” and to “Modify transformer code so that gradients are more stable during training, but change the transformer architecture to make use of spiking”.
If you disagree, how do you imagine continuous progress leading up to the above scenario? An important case is if Codex/Github Copilot improves continuously along the way, taking a larger and larger role in ML repo authorship. If we assume that AGI arrives without depending on LLMs achieving understanding of recent Arxiv papers, then I agree that this scenario is much more likely to feature continuity in AI-driven AI research. I’m highly uncertain about how this assumption will play out; off the top of my head, I’d give a 40% chance that Codex-driven research reaches AGI before Arxiv understanding does.
[1] Perhaps better and better versions of Ought’s work. I doubt this work will scale to the levels of research utility relevant here.
These don’t quite qualify as research film study, but Fields medallist Timothy Gowers has a number of videos in which he records his problem solving process in detail. E.g. Two products that cannot be equal. From what I can tell, he chooses quite accessible problems. Studying this sort of video might be most analogous to studying how an expert athlete does a drill.
Some users of the Alignment Forum post their work-in-progress ideas on topics. Taken as a sequence, this amounts to something like a paper plus a record of how it was made. Perhaps it would be worth looking back retrospectively and curating sequences which led to significant insight for study purposes? The closest thing to film study available in one post is probably Commentary on AGI Safety from First Principles—AI Alignment Forum.
This is an empirical question, so I may be missing some key points. Anyway here are a few:
My above points on Ajeya anchors and semi-informative priors
Or, put another way, why reject Daniel’s post?
Can deception precede economically TAI?
Possibly offer a prize on formalizing and/or distilling the argument for deception (also its constituents, i.e. gradient hacking, situational awareness, and non-myopia)
How should we model software progress? In particular, what is the right function for modeling short-term return on investment in algorithmic progress? (One toy parameterization is sketched at the end of this comment.)
My guess is that most researchers with short timelines think, as I do, that there’s lots of low-hanging fruit here. Funders may underestimate the prevalence of this opinion, since most safety researchers do not talk about details here to avoid capabilities acceleration.
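For what it’s worth, here’s one toy parameterization of the kind of function I have in mind; the functional form and constants are entirely invented, and the low-hanging-fruit disagreement is roughly a disagreement over these constants:

```python
# Entirely illustrative: effective compute = hardware FLOPs x algorithmic
# multiplier A, where A grows each year with research investment R at
# diminishing elasticity lam. Both lam and k are invented constants.

def step_A(A: float, R: float, lam: float = 0.5, k: float = 0.2) -> float:
    """One year of algorithmic progress: dA/A = k * R**lam."""
    return A * (1.0 + k * R ** lam)

A, R = 1.0, 4.0  # R in (say) thousands of researcher-years per year
for year in range(1, 6):
    A = step_A(A, R)
    print(f"year {year}: algorithmic multiplier A = {A:.2f}")
```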
My impression, which deeply concerns me, is that OpenPhil (and the average funder) has timelines 2-3x longer than the median safety researcher. Daniel has his AGI training requirements set to 3e29, and I believe the 15th-85th percentiles among safety researchers would span 1e31 +/- 2 OOMs. On that view, Tom’s default values are off in the tails.
My suspicion is that funders write off this discrepancy, if noticed, as inside-view bias, i.e. thinking safety researchers self-select for scaling optimism. My, admittedly very crude, mental model of an OpenPhil funder makes two further mistakes in this vein: (1) mistakenly taking the Cotra report’s biological-anchors weighting as a justified default setting of parameters, rather than an arbitrary choice which should be updated given recent evidence; (2) far overweighting the semi-informative priors report, despite semi-informative priors having abjectly failed to predict Turing-test-level AI progress. Semi-informative priors apply to large-scale engineering efforts, which in the AI domain has meant AGI and the Turing test. Insofar as funders admit that the engineering challenges involved in passing the Turing test have been solved, they should discard semi-informative priors as failing to be predictive of AI progress.
To be clear, I see my empirical claim about disagreement between the funding and safety communities as most important—independently of my diagnosis of this disagreement. If this empirical claim is true, OpenPhil should investigate cruxes separating them from safety researchers, and at least allocate some of their budget on the hypothesis that the safety community is correct.