Lukas Finnveden
Previously “Lanrian” on here. Research analyst at Open Philanthropy. Views are my own.
I wrote down some places where my memory disagreed with the notes. (The notes might well be more accurate than my memory, but I thought I’d flag these in case other people’s memories agree with mine. Also, this list is not exhaustive; there are many things in the notes that I don’t remember, but where I’d be unsurprised if I had just missed them.)
AGI will not be a binary moment. We will not agree on the moment it happened; it will be gradual. A warning sign will be when systems become capable of self-improvement.
I don’t remember hearing that last bit as a generic warning sign, but I might well have missed it. I do remember hearing that if systems became capable of self-improvement (sooner than expected?), that could be a big update towards believing that fast take-off is more likely (as mentioned in your next point).
AGI will not be a pure language model, but language will be the interface.
I remember both these claims as being significantly more uncertain/hedged.
AGI (a program able to do most economically useful tasks …) in the first half of the 2030s is his 50% bet, a bit further out than others at OpenAI.
I remembered this as being a forecast for ~transformative AI, and as explicitly not being “AI that can do anything that humans can do”, which could be quite a bit longer. (Your description of AGI is sort-of in-between those, so it’s hard to tell whether it’s inconsistent with my memory.)
Merging via BCI most likely path to a good outcome.
I was a bit confused about this answer in the Q&A, but I would not have summarized it like this. I remember claims that some degree of merging with AI is likely to happen conditional on a good outcome, and maybe a claim that BCI was the most likely path towards merging.
Unfortunately, it’s generally a lot easier to generate karma through commenting than through posting.
Once upon a time, I hear there was a 10x multiplier on post karma. 10x is a lot, but it seems pretty plausible to me that a ~3x multiplier on post karma would be good.
Some thoughts on automating alignment research
Memo on some neglected topics
Participants scoring in the bottom quartile on our humor test (...) overestimated their percentile ranking
A less well-known finding of Dunning-Kruger is that the best performers systematically underestimate how good they are, by about 15 percentile points.
Isn’t this exactly what you’d expect if people were good Bayesians receiving scarce evidence? Everyone starts out assuming that they’re in the middle, and as they find something easy or hard, they gradually update away from that prior. If they don’t have good information about how good other people are, they won’t update very much.
If you then look at the extremes, the very best and the very worst people, of course you’ll find that they should have extremified their beliefs. But if everyone followed that advice, you’d ruin the accuracy of the people closer to the middle, since they haven’t received enough evidence to distinguish themselves from the extremes.
(Similarly, I’ve heard that people often overestimate their ability on easy tasks and underestimate their ability on difficult tasks, which is exactly what you’d expect if they had good epistemics but limited evidence. If task performance is a function of task difficulty and talent for the task, and the only thing you can observe is your own performance, then believing that you’re good at tasks you do well at and bad at tasks you fail at is the correct thing to do. As a consequence, saying that people overestimate their driving ability doesn’t, in isolation, tell you much about the quality of their epistemics, because they might be following a strategy that optimizes performance across all tasks.)
The finding that people at the bottom overestimate their position by 46 percentile points is somewhat more extreme than this naïve model would suggest. As you say, however, it’s easily explained once you take into account that your ability to judge your performance on a task is correlated with your performance on that task. The people at the bottom are just receiving noise, so on average they stick with their prior and judge that they’re about average.
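That naïve Bayesian story is easy to check with a toy simulation (all numbers below are my own illustrative assumptions, not taken from the study): give everyone a true skill, show them one noisy signal of it, and have them report the posterior-mean percentile. Low performers then overestimate their rank and high performers underestimate it, purely from shrinkage toward the prior:

```python
import bisect
import random
from statistics import NormalDist, mean

random.seed(0)

# Toy model (illustrative numbers, not calibrated to the actual study):
# true skill ~ N(0, 1); each person observes one noisy signal of it.
TALENT_SD = 1.0
NOISE_SD = 2.0   # scarce evidence = a noisy signal
N = 20_000

prior = NormalDist(0.0, TALENT_SD)
# Posterior mean under the conjugate normal model: shrink the signal
# toward the prior mean (0) by this factor.
shrink = TALENT_SD**2 / (TALENT_SD**2 + NOISE_SD**2)

people = []
for _ in range(N):
    talent = random.gauss(0.0, TALENT_SD)
    signal = talent + random.gauss(0.0, NOISE_SD)
    estimate = shrink * signal
    # Self-estimated percentile: where the posterior-mean skill sits in the prior.
    self_pct = 100.0 * prior.cdf(estimate)
    people.append((talent, self_pct))

# True percentile: rank of true skill within the simulated population.
sorted_talents = sorted(t for t, _ in people)

def true_pct(t):
    return 100.0 * bisect.bisect_left(sorted_talents, t) / N

bottom = [sp for t, sp in people if true_pct(t) < 25]
top = [sp for t, sp in people if true_pct(t) >= 75]

print(f"bottom quartile (true mean ~12.5) rates itself at {mean(bottom):.0f}")
print(f"top quartile (true mean ~87.5) rates itself at {mean(top):.0f}")
```

Note that this symmetric toy model makes the two errors equal in size; the extra asymmetry in the real data (the bottom off by 46 points, the top by about 15) is what the noise-correlated-with-skill point is needed to explain.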
Of course, just because some of the evidence is consistent with people having good epistemics doesn’t mean that they actually do have good epistemics. I haven’t read the original paper, but it seems like people at the bottom actually think that they’re a bit above average, which seems like a genuine failure, and I wouldn’t be surprised if there are more examples of such failures that we can learn to correct. Impostor syndrome also seems like a case where people predictably fail in fixable ways (since they’d do better by estimating that they’re of average ability in their group, rather than even trying to update on evidence).
But I do think that people are often too quick to draw conclusions from looking at a specific subset of people estimating their performance on a specific task, without taking into account how well their strategy would do if they were better or worse, or were doing a different task. This post fixes some of those problems by reminding us that everyone lowering their estimate of their performance would hurt the people at the top, but I’m not sure it correctly takes into account how the people in the middle of the distribution would be affected.
(The counter-argument might be that people who know about Dunning-Kruger are likely to be at the top of any distribution they find themselves in, but this seems false to me. I’d expect a lot of people to know about Dunning-Kruger (though I may be in a bubble), and there are lots of tasks where ability doesn’t correlate much with knowing about Dunning-Kruger. Perhaps humor is an example of this.)
and some of my sense here is that if Paul offered a portfolio bet of this kind, I might not take it myself, but EAs who were better at noticing their own surprise might say, “Wait, that’s how unpredictable Paul thinks the world is?”
If Eliezer endorses this on reflection, that would seem to suggest that Paul actually has good models about how often trend breaks happen, and that the problem-by-Eliezer’s-lights is relatively more about either:
that Paul’s long-term predictions do not adequately take into account his good sense of short-term trend breaks.
that Paul’s long-term predictions are actually fine and good, but that his communication about it is somehow misleading to EAs.
That would be a very different kind of disagreement than I thought this was about. (Though actually kind-of consistent with the way that Eliezer previously didn’t quite diss Paul’s track-record, but instead dissed “the sort of person who is taken in by this essay [is the same sort of person who gets taken in by Hanson’s arguments in 2008 and gets caught flatfooted by AlphaGo and GPT-3 and AlphaFold 2]”?)
Also, none of this erases the value of putting forward the predictions mentioned in the original quote, since that would then be a good method of communicating Paul’s (supposedly miscommunicated) views.
Quantifying anthropic effects on the Fermi paradox
As the main author of the “Alignment”-appendix of the truthful AI paper, it seems worth clarifying: I totally don’t think that “train your AI to be truthful” in itself is a plan for how to tackle any central alignment problems. Quoting from the alignment appendix:
While we’ve argued that scalable truthfulness would constitute significant progress on alignment (and might provide a solution outright), we don’t mean to suggest that truthfulness will sidestep all difficulties that have been identified by alignment researchers. On the contrary, we expect work on scalable truthfulness to encounter many of those same difficulties, and to benefit from many of the same solutions.
In other words: I don’t think we had a novel proposal for how to make truthful AI systems, which tackled the hard bits of alignment. I just meant to say that the hard bits of making truthful A(G)I are similar to the hard bits of making aligned A(G)I.
At least from my own perspective, the truthful AI paper was partly about AI truthfulness maybe being a neat thing to aim for governance-wise (quite apart from the alignment problem), and partly about the idea that research on AI truthfulness could be helpful for alignment, and so it’s good if people (at least/especially people who wouldn’t otherwise work on alignment) work on that problem. (As one example of this: Interpretability seems useful for both truthfulness and alignment, so if people work on interpretability intended to help with truthfulness, then this might also be helpful for alignment.)
I don’t think you’re into this theory of change, because I suspect that you think that anyone who isn’t directly aiming at the alignment problem has negligible chance of contributing any useful progress.
I just wanted to clarify that the truthful AI paper isn’t evidence that people who try to hit the hard bits of alignment always miss — it’s just a paper doing a different thing.
(And although I can’t speak as confidently about others’ views, I feel like that last sentence also applies to some of the other sections. E.g. Evan’s statement, which seems to be about how you get an alignment solution implemented once you have it, and maybe about trying to find desiderata for alignment solutions, and not at all trying to tackle alignment itself. If you want to critique Evan’s proposals for how to build aligned AGI, maybe you should look at this list of proposals or this positive case for how we might succeed.)
Here’s a 1-year-old answer from Christiano to the question “Do you still think that people interested in alignment research should apply to work at OpenAI?”. Generally pretty positive about people going there to “apply best practices to align state of the art models”. That’s not exactly what Aaronson will be doing, but it seems like alignment theory should have even less probability of differentially accelerating capabilities.
From the post:
only votes on new content will count
Upvoting comments/posts that were made before today doesn’t get you any tokens.
Second, we could more or less deal with systems that defect as they arise. For instance, during deployment we could notice that some systems are optimizing for something different than what we intended during training, and therefore shut them down.
Each individual system won’t by itself carry more power than the sum of the projects that came before it. Instead, each AI will only be slightly better than the ones that came before it, including any AIs we are using to monitor the newer ones.
If the sum of the projects from before carries more power than the individual system, such that it can’t win by defecting, there’s no reason for it to defect. It might instead join the ranks of the “projects from before”, and subtly try to alter future systems to be similarly inclined to defect, waiting for a future opportunity to strike. If the way we build these things systematically renders them misaligned, we’ll sooner or later end up with a majority of them being misaligned, at which point we can’t trivially use them to shut down defectors.
(I agree that continuous takeoff does give us more warning, because some systems will presumably defect early, especially weaker ones. And IDA is kind of similar to this strategy, and could plausibly work. I just wanted to point out that a naive implementation of this doesn’t solve the problem of treacherous turns.)
If this is something that everyone reads, it might be nice to provide links to more technical details of the site. I imagine that someone reading this who then engages with LW might wonder:
What makes a curated post a curated post? (this might fit into the site guide on personal vs frontpage posts)
Why do comments/posts have more karma than votes?
What’s the mapping between users’ karma and voting power?
How does editing work? Some things are not immediately obvious, like:
How do I use LaTeX?
How do I use footnotes?
How do I create images?
How does moderation work? Who can moderate their own posts?
This kind of knowledge isn’t gathered in one place right now, and is typically difficult to google.
In general, I’d very much like a permanent neat-things-to-know-about-LW post or page, which receives edits when there’s a significant update (do tell me if there’s already something like this). For example, I remember trying to find information about the mapping between karma and voting power a few months ago, and it was very difficult. I think I eventually found an announcement post that had the answer, but I can’t know for sure, since there might have been a change since that announcement was made. More recently, I saw that there were footnotes in the sequences, and failed to find any reference whatsoever on how to create footnotes. I didn’t learn how to do this until a month or so later, when footnotes came to the EA Forum and Aaron wrote a post about it.