Previously “Lanrian” on here. Research analyst at Open Philanthropy. Views are my own.
Lukas Finnveden
I wrote down some places where my memory disagreed with the notes. (The notes might well be more accurate than I am, but I thought I'd flag these in case other people's memories agree with mine. Also, this list is not exhaustive: e.g., there are many things in the notes that I don't remember, where I'd be unsurprised if I had just missed them.)
AGI will not be a binary moment. We will not agree on the moment it happened. It will be gradual. The warning sign will be when systems become capable of self-improvement.
I don’t remember hearing that last bit as a generic warning sign, but I might well have missed it. I do remember hearing that if systems became capable of self-improvement (sooner than expected?), that could be a big update towards believing that fast take-off is more likely (as mentioned in your next point).
AGI will not be a pure language model, but language will be the interface.
I remember both these claims as being significantly more uncertain/hedged.
AGI (a program able to do most economically useful tasks …) in the first half of the 2030s is his 50% bet, a bit further out than others at OpenAI.
I remembered this as being a forecast for ~transformative AI, and as explicitly not being “AI that can do anything that humans can do”, which could be quite a bit longer. (Your description of AGI is sort-of in-between those, so it’s hard to tell whether it’s inconsistent with my memory.)
Merging via CBI most likely path to a good outcome.
I was a bit confused about this answer in the Q&A, but I would not have summarized it like this. I remember claims that some degree of merging with AI is likely to happen conditional on a good outcome, and maybe a claim that CBI was the most likely path towards merging.
Unfortunately, it’s generally a lot easier to generate karma through commenting than through posting.
Once upon a time, I hear there was a 10x multiplier on post karma. 10x is a lot, but it seems pretty plausible to me that a ~3x multiplier on post karma would be good.
Participants scoring in the bottom quartile on our humor test (...) overestimated their percentile ranking
A less well-known finding of Dunning-Kruger is that the best performers will systematically underestimate how good they are, by about 15 percentile points.
Isn’t this exactly what you’d expect if people were good Bayesians receiving scarce evidence? Everyone starts out assuming that they’re in the middle, and as they find something easy or hard, they gradually update away from their prior. If they don’t have good information about how good other people are, they won’t update very far.
If you then look at the extremes, the very best and the very worst people, of course you’re going to see that they should extremify their beliefs. But if everyone followed that advice, you’d ruin the accuracy of the people more towards the middle, since they haven’t received enough evidence to distinguish themselves from the extremes.
(Similarly, I’ve heard that people often overestimate their ability on easy tasks and underestimate their ability on difficult tasks, which is exactly what you’d expect if they had good epistemics but limited evidence. If task performance is a function of task difficulty and talent for the task, and the only thing you can observe is your performance, then believing that you’re good at tasks you do well at and bad at tasks you fail at is the correct thing to do. As a consequence, saying that people overestimate their driving ability doesn’t tell you much about the quality of their epistemics in isolation, because they might be following a strategy that optimises performance across all tasks.)
The finding that people at the bottom overestimate their position by 46 percentile points is somewhat more extreme than this naïve model would suggest. As you say, however, it’s easily explained when you take into account that your ability to judge your performance on a task is correlated with your performance on that task. Thus, the people at the bottom are just receiving noise, so on average they stick with their prior and judge that they’re about average.
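To make the naïve-Bayesian story concrete, here's a minimal simulation sketch. All of the specifics (a standard-normal skill distribution, the particular noise levels, and the assumption that self-signal noise shrinks as skill grows) are invented for illustration; nothing here is taken from the original paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000

# Latent skill is standard normal across the population.
skill = rng.normal(0.0, 1.0, n)

# Assumption for illustration: the noise on your self-signal grows as
# skill falls, modeling "judging your performance is itself a skill".
tau = np.clip(1.5 - 0.5 * skill, 0.3, None)  # sd of self-signal noise
signal = skill + tau * rng.normal(0.0, 1.0, n)

# Conjugate update: prior N(0, 1), likelihood signal ~ N(skill, tau^2).
post_mean = signal / (1 + tau**2)
post_var = tau**2 / (1 + tau**2)

# A Bayesian's believed percentile against a N(0, 1) population:
# E[Phi(skill)] for skill ~ N(m, v) equals Phi(m / sqrt(1 + v)).
believed_pct = 100 * norm.cdf(post_mean / np.sqrt(1 + post_var))
true_pct = 100 * skill.argsort().argsort() / n

for name, mask in [("bottom quartile", true_pct < 25),
                   ("top quartile   ", true_pct >= 75)]:
    print(f"{name}: true mean {true_pct[mask].mean():4.1f}, "
          f"believed mean {believed_pct[mask].mean():4.1f}")
```

With these made-up numbers, the bottom quartile overestimates its percentile by a wide margin while the top quartile underestimates by a smaller one, qualitatively matching the pattern above with no epistemic failure built into the agents.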
Of course, just because some of the evidence is consistent with people having good epistemics doesn’t mean that they actually do have good epistemics. I haven’t read the original paper, but it seems like people at the bottom actually think that they’re a bit above average, which seems like a genuine failure, and I wouldn’t be surprised if there are more examples of such failures which we can learn to correct. Impostor syndrome also seems like a case where people predictably fail in fixable ways (since they’d do better by estimating that they’re of average ability in their group, rather than even trying to update on evidence).
But I do think that people often are too quick to draw conclusions from looking at a specific subset of people estimating their performance on a specific task, without taking into account how well their strategy would do if they were better or worse, or were doing a different task. This post fixes some of those problems, by reminding us that everyone lowering the estimate of their performance would hurt the people at the top, but I’m not sure if it correctly takes into account how the people in the middle of the distribution would be affected.
(The counter-argument might be that people who know about Dunning-Kruger are likely to be at the top of any distribution they find themselves in, but this seems false to me. I’d expect a lot of people to know about Dunning-Kruger (though I may be in a bubble), and there are lots of tasks where ability doesn’t correlate much with knowing about Dunning-Kruger. Perhaps humor is an example of this.)
and some of my sense here is that if Paul offered a portfolio bet of this kind, I might not take it myself, but EAs who were better at noticing their own surprise might say, “Wait, that’s how unpredictable Paul thinks the world is?”
If Eliezer endorses this on reflection, that would seem to suggest that Paul actually has good models about how often trend breaks happen, and that the problem-by-Eliezer’s-lights is relatively more about either:
1. that Paul’s long-term predictions do not adequately take into account his good sense of short-term trend breaks, or
2. that Paul’s long-term predictions are actually fine and good, but that his communication about them is somehow misleading to EAs.
That would be a very different kind of disagreement than I thought this was about. (Though actually kind-of consistent with the way that Eliezer previously didn’t quite diss Paul’s track-record, but instead dissed “the sort of person who is taken in by this essay [is the same sort of person who gets taken in by Hanson’s arguments in 2008 and gets caught flatfooted by AlphaGo and GPT-3 and AlphaFold 2]”?)
Also, none of this erases the value of putting forward the predictions mentioned in the original quote, since that would then be a good method of communicating Paul’s (supposedly miscommunicated) views.
As the main author of the “Alignment”-appendix of the truthful AI paper, it seems worth clarifying: I totally don’t think that “train your AI to be truthful” in itself is a plan for how to tackle any central alignment problems. Quoting from the alignment appendix:
While we’ve argued that scaleable truthfulness would constitute significant progress on alignment (and might provide a solution outright), we don’t mean to suggest that truthfulness will sidestep all difficulties that have been identified by alignment researchers. On the contrary, we expect work on scaleable truthfulness to encounter many of those same difficulties, and to benefit from many of the same solutions.
In other words: I don’t think we had a novel proposal for how to make truthful AI systems, which tackled the hard bits of alignment. I just meant to say that the hard bits of making truthful A(G)I are similar to the hard bits of making aligned A(G)I.
At least from my own perspective, the truthful AI paper was partly about AI truthfulness maybe being a neat thing to aim for governance-wise (quite apart from the alignment problem), and partly about the idea that research on AI truthfulness could be helpful for alignment, and so it’s good if people (at least/especially people who wouldn’t otherwise work on alignment) work on that problem. (As one example of this: Interpretability seems useful for both truthfulness and alignment, so if people work on interpretability intended to help with truthfulness, then this might also be helpful for alignment.)
I don’t think you’re into this theory of change, because I suspect that you think that anyone who isn’t directly aiming at the alignment problem has negligible chance of contributing any useful progress.
I just wanted to clarify that the truthful AI paper isn’t evidence that people who try to hit the hard bits of alignment always miss — it’s just a paper doing a different thing.
(And although I can’t speak as confidently about others’ views, I feel like that last sentence also applies to some of the other sections. E.g. Evan’s statement, which seems to be about how you get an alignment solution implemented once you have it, and maybe about trying to find desiderata for alignment solutions, and not at all trying to tackle alignment itself. If you want to critique Evan’s proposals for how to build aligned AGI, maybe you should look at this list of proposals or this positive case for how we might succeed.)
Here’s a 1-year-old answer from Christiano to the question “Do you still think that people interested in alignment research should apply to work at OpenAI?”. Generally pretty positive about people going there to “apply best practices to align state of the art models”. That’s not exactly what Aaronson will be doing, but it seems like alignment theory should have even less probability of differentially accelerating capabilities.
From the post:
only votes on new content will count
Upvoting comments/posts that were made before today doesn’t get you any tokens.
Second, we could more-or-less deal with systems which defect as they arise. For instance, during deployment we could notice that some systems are optimizing something different than what we intended during training, and therefore we shut them down.
Each individual system won’t by itself carry more power than the sum of projects before it. Instead, each AI will only be slightly better than the ones that came before it, including any AIs we are using to monitor the newer ones.
If the sum of projects from before carries more power than the individual system, such that it can’t win by defecting, there’s no reason for it to defect. It might just join the ranks of “projects from before”, and subtly try to alter future systems to be similarly inclined to defect, waiting for a future opportunity to strike. If the way we build these things systematically renders them misaligned, we’ll sooner or later end up with a majority of them being misaligned, at which point we can’t trivially use them to shut down defectors.
(I agree that continuous takeoff does give us more warning, because some systems will presumably defect early, especially weaker ones. And IDA is kind of similar to this strategy, and could plausibly work. I just wanted to point out that a naive implementation of this doesn’t solve the problem of treacherous turns.)
If this is something that everyone reads, it might be nice to provide links to more technical details of the site. I imagine that someone reading this who then engages with LW might wonder:
What makes a curated post a curated post? (this might fit into the site guide on personal vs frontpage posts)
Why do comments/posts have more karma than votes?
What’s the mapping between users’ karma and voting power?
How does editing work? Some things are not immediately obvious, like:
How do I use LaTeX?
How do I use footnotes?
How do I create images?
How does moderation work? Who can moderate their own posts?
This kind of knowledge isn’t gathered in one place right now, and is typically difficult to google.
I like this direction and this write-up of it!
If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning.
Let’s assume that an AI lab is in a fast-paced, rapidly changing state of affairs. (As you might expect during a moderately fast takeoff.) If an AI then gets to do the equivalent of 3 months of thinking, it could use that to do notably more serial thinking about how to exploit the current situation than any human has done (because some important features of the current situation are less than 3 months old: e.g. you had to switch to an importantly different programming language, or some different type of communication protocol between AIs, or change your AI safety research paradigm in a way you didn’t expect). I’m curious how you’d ensure (or evaluate for) red-team competitiveness in this case.
I think it’s important to be clear about what SIA says in different situations, here. Consider the following 4 questions:
A) Do we live in a simulation?
B) If we live in a simulation, should we expect basement reality to have a large late filter?
C) If we live in basement reality, should we expect basement reality (ie our world) to have a large late filter?
D) If we live in a simulation, should we expect the simulation (ie our world) to have a large late filter?
In this post, you persuasively argue that SIA answers “yes” to (A) and “not necessarily” to (B). However, (B) is almost never decision-relevant, since it’s not about our own world. What about (C) and (D)? (It’s easier to see how those could be decision-relevant for someone who buys SIA. I personally agree with you that something like Anthropic Decision Theory is the best way to reason about decisions, but responsible usage of SIA+CDT is one way to get there in anthropic dilemmas.)
To answer (C): If we condition on living in basement reality, then SIA favors hypotheses that imply many observers in basement reality. The simulated copies are entirely irrelevant, since we have conditioned them away. (You can verify this with Bayes’ theorem.) So we are back with the SIA doomsday argument again, and we face large late filters.
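Here’s the Bayes’-theorem verification sketched with toy numbers (the observer counts are invented; only the direction of the update matters):

```python
# Toy SIA-doomsday update. The observer counts are made up for illustration.
prior = {"early filter": 0.5, "late filter": 0.5}

# Early-stage observers in basement reality under each hypothesis. A late
# filter lets far more civilizations reach our stage before dying out.
observers = {"early filter": 1e3, "late filter": 1e9}

# SIA: P(h | I'm such an observer) is proportional to
# prior(h) * (number of such observers under h).
weights = {h: prior[h] * observers[h] for h in prior}
total = sum(weights.values())
posterior = {h: w / total for h, w in weights.items()}
print(posterior)  # the late-filter hypothesis dominates
```

Note that simulated observers never enter the counts here: conditioning on basement reality removes them, which is why the simulation hypothesis doesn’t help with (C).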
To answer (D): Detailed simulations of civilizations that spread to the stars are vastly more expensive than detailed simulations of early civilizations. This means that the latter are likely to be far more common, and we’re almost certainly living in a simulation where we’ll never spread to the (simulated) stars. (This is plausibly because the simulation will be turned off before we get the chance.) You could discuss what terminology to use for this, but I’d be inclined to call this a large late filter, too.
So my preferred framing isn’t really that the simulation hypothesis “undercuts” the SIA doomsday argument. It’s rather that the simulation hypothesis provides one plausible mechanism for it: that we’re in a simulation that will end soon. But that’s just a question of framing/terminology. The main point of this comment is to provide answers to questions (C) and (D).
+1.
I’m a big fan of extrapolating trendlines, and I think the current trendlines are concerning. But when evaluating the likelihood that “most democratic Western countries will become fascist dictatorships”, I’d say these trends point firmly against this being “the most likely overall outcome” in the next 10 years. (While still increasing my worry about this as a tail-risk, a longer-term phenomena, and as a more localized phenomena.)
If we extrapolate the graphs linearly, we get:
If we wait 10 years, we will have 5 fewer “free” countries and 7 more “non-free” countries. (Out of 195 countries being tracked. Or: ~5-10% fewer “free” countries.)
If we wait 10 years, the average democracy index will fall from 5.3 to somewhere around 5.0-5.1.
That’s really bad. But it would be inconsistent with a wide fascist turn in the West, which would cause bigger swings in those metrics.
(As far as I can tell, the third graph is supposed to indicate the sign of the derivative of something like a democracy index, in each of many countries? Without looking into their criteria more, I don’t know what it’s supposed to say about the absolute size of changes, if anything.)
This also makes me confused about the next section’s framing. If there’s no “National Exceptionalism” where western countries are different from the others, then presumably the same trends should apply. But those suggest that the headline claim is unlikely. (But that we should be concerned about less probable, less widespread, and/or longer-term changes of the same kind.)
Ok so I tried running the numbers for the neural net anchor in my bio-anchors guesstimate replica.
Previously the neural network anchor used an exponent (alpha) of normal(0.8, 0.2) (the first number is the mean, the second the standard deviation). I tried changing that to normal(1, 0.1) (smaller uncertainty because 1 is a more natural number, and some other evidence was already pointing towards 1). Also, the model previously said that a 1-trillion-parameter model should be trained with 10^normal(11.2, 1.5) data points. I changed that to have a median of 21.2e12 data points, since that’s what the Chinchilla paper recommends for a 1-trillion-parameter model. (See table 3 here.)
The result of this is to increase the median compute needed by ~2.5 OOMs. The 5th percentile increases ~2 OOMs and the 95th percentile increases ~3.5 OOMs.
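If anyone wants to reproduce the direction of this change without Guesstimate, here’s a rough Monte Carlo sketch. The 1e15-parameter anchor, the 6·N·D FLOP rule, and reusing the old log-sd of 1.5 for the new data median are simplifying assumptions of mine, not the exact structure of the Guesstimate model, so only the rough size of the shift should be trusted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
params = 1e15  # stand-in anchor for AGI parameter count (my assumption)

# Old settings: a 1e12-param model takes 10^normal(11.2, 1.5) data points,
# scaled up as params^alpha with alpha ~ normal(0.8, 0.2).
alpha_old = rng.normal(0.8, 0.2, n)
data_old = 10 ** rng.normal(11.2, 1.5, n) * (params / 1e12) ** alpha_old

# New settings: median 21.2e12 data points at 1e12 params (Chinchilla,
# table 3), same log-sd assumed, and alpha ~ normal(1.0, 0.1).
alpha_new = rng.normal(1.0, 0.1, n)
data_new = (10 ** rng.normal(np.log10(21.2e12), 1.5, n)
            * (params / 1e12) ** alpha_new)

# Training compute ~ 6 * params * data FLOP.
flop_old = 6 * params * data_old
flop_new = 6 * params * data_new

delta = np.log10(np.median(flop_new) / np.median(flop_old))
print(f"median compute increase: ~{delta:.1f} OOMs")  # comes out near +2.7
```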
I’m confused about the argument you’re trying to make here (I also disagree with some things, but I want to understand the post properly before engaging with that). The main claims seem to be
There are simply not enough excess deaths for these claims to be plausible.
and, after telling us how many preventable deaths there could be,
Either charities like the Gates Foundation and Good Ventures are hoarding money at the price of millions of preventable deaths, or the low cost-per-life-saved numbers are wildly exaggerated.
But I don’t understand how these claims interconnect. If there were more people dying from preventable diseases, how would that dissolve the dilemma that the second claim poses?
Also, you say that $125 billion is well within the reach of the GF, but their website says that their present endowment is only $50.7 billion. Is this a mistake, or do you mean something else with “within reach”?
Did you ever finalize any bet(s)?
Two different formulations of the problem that Chris faced:
Chris got a message saying that he had to enter the codes, or else the frontpage would be destroyed. He believed it, and thought that he had to enter the codes to save the frontpage. Arguably, if he had destroyed the frontpage by inaction (i.e., if the message had been real, and failing to enter the codes would’ve caused the destruction of the frontpage), he would have been far less chastised by the local culture than if he had destroyed the frontpage by action (i.e., what actually happened). In this case, is it more in the spirit of Petrov to take the action that your local culture will blame you the least for, or the action that you honestly think will save the frontpage?
Chris got a message that he had to enter the codes, or else bad things would happen, just like Petrov got a sign that the US had launched nukes and that the Russian military needed to be informed. The message wasn’t real, and in fact, the decision with the least bad consequences was to ignore it. In this case, is it more in the spirit of Petrov to consider that a message might not be what it claims to be (and accurately determine that it isn’t), or to just believe it?
I don’t know what the take-away is. Maybe we should celebrate Petrov’s skepticism/perceptiveness more, and not just his willingness to not defer to superiors.
Depends on how you were getting to that +N OOMs number.
If you were looking at my post, or otherwise using the scaling laws to extrapolate how fast AI was improving on benchmarks (or on subjective impressiveness), then the Chinchilla laws mean you should get there sooner. I haven’t run the numbers on how much sooner.
If you were looking at Ajeya’s neural network anchor (i.e. the one using the Kaplan scaling laws, not the human-lifetime or evolution anchors), then you should now expect that AGI comes later. That model anchors the number of parameters in AGI to the number of synapses in the human brain, and then calculates how much compute you’d need to train a model of that size if you were on the compute-optimal trajectory. With the Chinchilla scaling laws, you need more data to train a compute-optimal model with a given number of parameters (data is proportional to parameters, instead of to parameters^0.7). So now it seems like it’s going to be more expensive to train a compute-optimal model with 10^15 parameters, or however many parameters AGI would need.
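As a stylized illustration of that last point (the constants are my own stand-ins, normalized so that both laws agree at 1e12 parameters; they are not Ajeya’s exact numbers):

```python
N = 1e15  # hypothetical AGI parameter count (brain-synapse-style anchor)

# Stand-in compute-optimal data requirements at model size N:
D_kaplan = 2e13 * (N / 1e12) ** 0.7  # data grows sublinearly in params
D_chinchilla = 20 * N                # data grows linearly: ~20 tokens/param

for name, D in [("Kaplan-ish", D_kaplan), ("Chinchilla", D_chinchilla)]:
    print(f"{name}: D ~ {D:.1e} tokens, compute ~ {6 * N * D:.1e} FLOP")
```

At a fixed parameter count, the Chinchilla law here demands (N/1e12)^0.3 times more data, so the compute-optimal training run for a brain-scale model gets correspondingly more expensive.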
This doesn’t mean that it’s a good idea to blow up the frontpage because it’s more fun, or whatever. I think it’s probably better to not blow up the frontpage, but the case for this is based on meta-level things like ~trust and ~culture, and I think you do need to go to that level to make a convincing consequentialist case for not blowing up the frontpage. The stakes just aren’t high enough that the direct consequences dominate. (And it’s hard to raise the stakes until that’s false, because that would mean we’re risking more than we stand to gain.)
Unfortunately, this makes the situation pretty disanalogous to Petrov. Petrov defied the local culture (following orders) because he thought that reporting the alarm would have bad consequences. But in the LessWrong tradition, the direct consequences matter less than the effects on the local culture; and the reputational consequences point in the opposite direction, encouraging people not to press the button.
(Though from skimming the Wikipedia article, it’s unclear exactly how much Petrov’s reputation suffered. It seems like he was initially praised, then reprimanded for not filing the correct paperwork. He’s been quoted both as saying that he wasn’t punished and that he was made a scapegoat.)
I think everyone in the discussion expects AI progress to be at least exponential. See all of Paul’s mentions of hyperbolic growth, which is faster than exponential.
The discussion is more about continuous vs discontinuous takeoff, or centralised vs decentralised takeoff. (The slow/fast terminology isn’t great.)
In general, I’d very much like a permanent neat-things-to-know-about-LW post or page, which receives edits when there’s a significant update (do tell me if there’s already something like this). For example, I remember trying to find information about the mapping between karma and voting power a few months ago, and it was very difficult. I think I eventually found an announcement post that had the answer, but I can’t know for sure, since there might have been a change since that announcement was made. More recently, I saw that there were footnotes in the sequences, and failed to find any reference whatsoever on how to create footnotes. I didn’t learn how to do this until a month or so later, when footnotes came to the EA Forum and Aaron wrote a post about it.