Lukas Finnveden
Previously “Lanrian” on here. Research analyst at Open Philanthropy. Views are my own.
If they don’t go to a doctor, it could be either because the problem is minor enough that they can’t be bothered, or because they generally don’t seek medical help when they are seriously unwell, in which case the risk from something like B12 deficiency is negligible compared to e.g. the risk of an untreated heart attack.
I’m personally quite bad at noticing and tracking (non-sudden) changes in my energy, mood, or cognitive ability. I think there are issues that I wouldn’t notice (or would think minor) that I would still care a lot about fixing.
Also, some people have problems with executive function. Even if they notice issues, the issues might have to be pretty bad before they’ll ask a doctor about them. Bad enough that it could be pretty valuable to prevent less bad issues (that would go untreated).
(This could be exacerbated if people are generally unexcited about seeking medical help — I think there are plenty of points on this axis where people will seek help for heart attacks but will be pessimistic about getting help with “vaguely feeling tired lately”. Or maybe not even pessimistic. Just… not having “ask a doctor” be generated as an obvious thing to try.)
doesn’t it seem to you that the topic is super neglected (even compared to AI alignment) given that the risks/consequences of failing to correctly solve this problem seem comparable to the risk of AI takeover?
Yes, I’m sympathetic. Among all the issues that will come with AI, I think alignment is relatively tractable (at least it is now) and that it has an unusually clear story for why we shouldn’t count on being able to defer it to smarter AIs (though that might work). So I think it’s probably correct for it to get relatively more attention. But even taking that into account, the non-alignment singularity issues do seem too neglected.
I’m currently trying to figure out what non-alignment stuff seems high-priority and whether I should be tackling any of it.
This was also my impression.
Curious if OP or anyone else has a source for the <1% claim? (Partially interested in order to tell exactly what kind of “doom” this is anti-predicting.)
I assume that’s from looking at the GPT-4 graph. I think the main graph I’d look at for a judgment like this is probably the first graph in the post, without PaLM-2 and GPT-4, since PaLM-2 is 1-shot and GPT-4 only covers 4 benchmarks instead of 20+.
That suggests 90% is ~1 OOM away and 95% is ~3 OOMs away.
(And since PaLM-2 and GPT-4 seemed roughly on trend in the places where I could check them, probably they wouldn’t change that too much.)
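To make the kind of extrapolation I’m talking about concrete, here’s a toy sketch (with made-up numbers, not the actual benchmark data from the post): fit a sigmoid in log-compute to average benchmark performance and read off how many OOMs of extra compute it would take to cross 90% and 95%.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: average benchmark performance vs. log10(training compute).
log_compute = np.array([20.0, 21.0, 22.0, 23.0, 24.0, 25.0])
performance = np.array([0.16, 0.27, 0.42, 0.58, 0.73, 0.84])

def sigmoid(x, midpoint, scale):
    """Performance modeled as a sigmoid in log-compute, saturating at 1.0."""
    return 1.0 / (1.0 + np.exp(-(x - midpoint) / scale))

(midpoint, scale), _ = curve_fit(sigmoid, log_compute, performance, p0=[23.0, 1.0])

def ooms_to_reach(target):
    """OOMs of compute beyond the last data point needed to hit `target`."""
    x_target = midpoint - scale * np.log(1.0 / target - 1.0)
    return x_target - log_compute[-1]

print("OOMs to 90%:", ooms_to_reach(0.90))
print("OOMs to 95%:", ooms_to_reach(0.95))
```

On this made-up data the answer comes out to roughly 1 OOM for 90% and ~2 OOMs for 95%; the numbers I quote above come from eyeballing the graphs in the post, not from running anything like this.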
Interesting. Based on skimming the paper, my impression is that, to a first approximation, this would look like:
Instead of having linear performance on the y-axis, switch to something like log(max_performance - actual_performance). (So that we get a log-log plot.)
Then for each series of data points, look for the largest n such that the last n data points are roughly on a line. (I.e. identify the last power law segment.)
Then to extrapolate into the future, project that line forward. (I.e. fit a power law to the last power law segment and project it forward.)
That description misses out on effects where BNSL-fitting would predict that there’s a slow, smooth shift from one power-law to another, and that this gradual shift will continue into the future. I don’t know how important that is. Curious for your intuition about whether or not that’s important, and/or other reasons for why my above description is or isn’t reasonable.
When I think about applying that algorithm to the above plots, I worry that the data points are much too noisy to just extrapolate a line from the last few data points. Maybe the practical thing to do would be to assume that the 2nd half of the “sigmoid” forms a distinct power law segment, and fit a power law to the points with >~50% performance (or less than that if there are too few points with >50% performance). Which maybe suggests that the claim “BNSL does better” corresponds to a claim that the speed at which the language models improve on ~random performance (bottom part of the “sigmoid”) isn’t informative for how fast they converge to ~maximum performance (top part of the “sigmoid”)? That seems plausible.
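For concreteness, here’s a rough sketch (in Python, with made-up data) of the extrapolation procedure I described above: move to a log(max_performance - actual_performance) axis, find the largest trailing segment that’s roughly linear against log-compute, fit a line to it, and project it forward. This is my paraphrase of the idea, not the BNSL authors’ code, and the tolerance/segment-length choices are arbitrary.

```python
import numpy as np

def extrapolate_last_segment(log_compute, performance, max_performance=1.0,
                             min_segment=3, tol=0.02):
    """Fit a line to the last ~linear segment of log(max - actual) vs. log-compute
    and project it forward. A rough paraphrase, not the paper's method."""
    x = np.asarray(log_compute, dtype=float)
    # Switch the y-axis to log(max_performance - actual_performance), so a
    # power-law approach to max_performance looks linear against log-compute.
    y = np.log(max_performance - np.asarray(performance, dtype=float))

    # Find the largest n such that the last n points are roughly on a line
    # (i.e. identify the last power-law segment).
    best = None
    for n in range(min_segment, len(x) + 1):
        xs, ys = x[-n:], y[-n:]
        slope, intercept = np.polyfit(xs, ys, 1)
        if np.max(np.abs(ys - (slope * xs + intercept))) < tol:
            best = (slope, intercept)
    if best is None:
        raise ValueError("no approximately-linear tail segment found")
    slope, intercept = best

    # Project the fitted segment forward into the future.
    def predict(new_log_compute):
        return max_performance - np.exp(slope * new_log_compute + intercept)
    return predict

# Made-up data: performance vs. log10(compute), with a kink early on.
predict = extrapolate_last_segment(
    log_compute=[20, 21, 22, 23, 24, 25],
    performance=[0.10, 0.20, 0.55, 0.72, 0.83, 0.90],
)
print(predict(27))  # extrapolated performance 2 OOMs past the last data point
```

Note that this hard-codes the “last segment” detection with a residual tolerance, whereas the actual BNSL fit blends smoothly between segments — which is exactly the effect my description (and this sketch) misses.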
PaLM-2 & GPT-4 in “Extrapolating GPT-N performance”
Thanks, fixed.
Some thoughts on automating alignment research
Before smart AI, there will be many mediocre or specialized AIs
I’m also concerned about how we’ll teach AIs to think about philosophical topics (and indeed, how we’re supposed to think about them ourselves). But my intuition is that proposals like this look great from that perspective.
For areas where we don’t have empirical feedback-loops (like many philosophical topics), I imagine that the “baseline solution” for getting help from AIs is to teach them to imitate our reasoning. Either just by literally writing the words that it predicts we would write (but faster), or by having it generate arguments that we would think look good. (Potentially recursively, c.f. amplification, debate, etc.)
(A different direction is to predict what we would think after thinking about it more. That has some advantages, but it doesn’t get around the issue where we’re at-best speeding things up.)
One of the few plausible-seeming ways to outperform that baseline is to identify epistemic practices that work well on questions where we do have empirical feedback loops, and then transferring those practices to questions where we lack such feedback loops. (C.f. imitative generalization.) The above proposal is doing that for a specific sub-category of epistemic practices (recognising ways in which you can be misled by an argument).
Worth noting: The broad category of “transfer epistemic practices from feedback-rich questions to questions with little feedback” contains a ton of stuff, and is arguably the root of all our ability to reason about these topics:
Evolution selected human genes for the ability to accomplish stuff in the real world. That made us much better at reasoning about philosophy than our chimp ancestors were.
Cultural evolution seems to have at least partly promoted reasoning practices that do better at deliberation. (C.f. possible benefits from coupling competition and deliberation.)
If someone is optimistic that humans will be better at dealing with philosophy after intelligence-enhancement, I think they’re mostly appealing to stuff like this, since intelligence would typically be measured in areas where you can recognise excellent performance.
It seems like the list mostly explains away the evidence that “humans can’t currently prevent value drift”, since the points apply much less to AIs. (I don’t know if you agree.)
As you mention, (1) probably applies less to AIs (for better or worse).
(2) applies to AIs in the sense that many features of AIs’ environments will be determined by what tasks they need to accomplish, rather than what will lead to minimal value drift. But the reason to focus on the environment in the human case is that it’s the ~only way to affect our values. By contrast, we have much more flexibility in designing AIs, and it’s plausible that we can design them so that their values aren’t very sensitive to their environments. Also, if we know that particular types of inputs are dangerous, the AIs’ environment could be controllable in the sense that less-susceptible AIs could monitor for such inputs, and filter out the dangerous ones.
(3): “can’t change the trajectory of general value drift by much” seems less likely to apply to AIs (or so I’m arguing). “Most people are selfish and don’t care about value drift except to the extent that it harms them directly” means that human value drift is pretty safe (since people usually maintain some basic sense of self-preservation) but that AI value drift is scary (since it could lead your AI to totally disempower you).
(4) As you noted in the OP, AI could change really fast, so you might need to control value-drift just to survive a few years. (And once you have those controls in place, it might be easy to increase the robustness further, though this isn’t super obvious.)
(5) For better or worse, people will probably care less about this in the AI case. (If the threat-model is “random drift away from the starting point”, it seems like it would be for the better.)
Since the space of possible AIs is much larger than the space of humans, there are more degrees of freedom along which AI values can change.
I don’t understand this point. We (or AIs that are aligned with us) get to pick from that space, and so we can pick the AIs that have least trouble with value drift. (Subject to other constraints, like competitiveness.)
(Imagine if AGI is built out of transformers. You could then argue “since the space of possible non-transformers is much larger than the space of transformers, there are more degrees of freedom along which non-transformer values can change”. And humans are non-transformers, so we should be expected to have more trouble with value drift. Obviously this argument doesn’t work, but I don’t see the relevant disanalogy to your argument.)
Creating new AIs is often cheaper than creating new humans, and so people might regularly spin up new AIs to perform particular functions, while discounting the long-term effect this has on value drift (since the costs are mostly borne by civilization in general, rather than them in particular)
Why are the costs mostly borne by civilization in general? If I entrust some of my property to an AI system, and it changes values, that seems bad for me in particular?
Maybe the argument is something like: As long as law-and-order is preserved, things are not so bad for me even if my AI’s values start drifting. But if there’s a critical mass of misaligned AIs, they can launch a violent coup against the humans and the aligned AIs. And my contribution to the coup-probability is small?
It’s possible that there’s a trade-off between monitoring for motivation changes and competitiveness. I.e., I think that monitoring would be cheap enough that a super-rich AI society could happily afford it if everyone coordinated on doing it, but if there’s intense competition, it wouldn’t be crazy if there was a race to the bottom on caring less about such things. (Though there’s also practical utility in reducing principal-agent problems and having lots of agents working towards the same goal without incentive problems. So competitiveness considerations could also push towards such monitoring / stabilization of AI values.)
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it’s better if they stopped listening to the humans and followed different rules instead.
How does this happen at a time when the AIs are still aligned with humans, and therefore very concerned that their future selves/successors are aligned with humans? (Since the humans are presumably very concerned about this.)
This question is related to “we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects”, but sort of posed on a different level. That quote seemingly presumes that there will be a systemic push away from human alignment, and seemingly suggests that we’ll need some clever coordinated solution. (Do tell me if I’m reading you wrong!) But I’m asking why there is a systemic push away from human alignment if all the AIs are concerned about maintaining it?
Maybe the answer is: “If everyone starts out aligned with humans, then any random perturbations will move us away from that. The systemic push is entropy.” I agree this is concerning if AIs are aligned in the sense of “their terminal values are similar to my terminal values”, because it seems like there’s lots of room for subtle and gradual changes, there. But if they’re aligned in the sense of “at each point in time I take the action that [group of humans] would have preferred I take after lots of deliberation” then there’s less room for subtle and gradual changes:
If they get subtly worse at predicting what humans would want in some cases, then they can probably still predict “[group of humans] would want me to take actions that ensure that my predictions of human deliberation are accurate”, and so take actions to occasionally fix those misconceptions. (You’d have to be really bad at predicting humans to not realise that the humans wanted that^.)
Maybe they sometimes randomly stop caring about what the [group of humans] want. But that seems like it’d be abrupt enough that you could set up monitoring for it, and then you’re back in a more classic alignment regime of detecting deception, etc. (Though a bit different in that the monitoring would probably be done by other AIs, and so you’d have to watch out for e.g. inputs that systematically and rapidly changed the values of any AIs that looked at them.)
Maybe they randomly acquire some other small motivation alongside “do what humans would have wanted”. But if it’s predictably the case that such small motivations will eventually undermine their alignment to humans, then the part of their goals that’s shaped like “do what humans would have wanted” will vote strongly to monitor for such motivation changes and get rid of them ASAP. And if the new motivation is still tiny, it probably can’t provide enough of a counteracting motivation to defend itself.
(Maybe you think that this type of alignment is implausible / maybe the action is in your “there’s slight misalignment”.)
Maybe x-risk driven by explosive (technological) growth?
Edit: though some people think AI point of no return might happen before the growth explosion.
This is true if “the standard setting” refers to one where you have equally robust evidence of all options. But if you have more robust evidence about some options (which is common), the optimizer’s curse will especially distort estimates of options with less robust evidence. A correct bayesian treatment would then systematically push you towards picking options with more robust evidence.
(Where I’m using “more robust evidence” to mean something like: evidence that has an overall greater likelihood ratio, and that therefore pushes you further from the prior. The error driving the optimizer’s curse is to look at the peak of the likelihood function while neglecting the prior and how much the likelihood ratio actually pushes you away from it.)
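To illustrate with a toy simulation (my own made-up example, not something from the original discussion): give every option the same prior, but measure half of them with much noisier evidence. Naively picking the highest raw estimate gets disproportionately fooled by the noisy options; shrinking each estimate toward the prior in proportion to its noise (the standard normal-normal posterior mean) removes that bias and tends to favor the options with more robust evidence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 10 options with true values drawn from a shared prior,
# half measured with low noise and half with high noise.
n_trials = 20_000
prior_mean, prior_sd = 0.0, 1.0
noise_sd = np.array([0.5] * 5 + [3.0] * 5)  # robust vs. weak evidence

naive_regret, bayes_regret = [], []
for _ in range(n_trials):
    true_values = rng.normal(prior_mean, prior_sd, size=10)
    estimates = true_values + rng.normal(0.0, noise_sd)

    # Naive: pick the option with the highest raw estimate.
    naive_pick = np.argmax(estimates)

    # Bayesian: shrink each estimate toward the prior mean, with more
    # shrinkage for noisier estimates (normal-normal posterior mean).
    shrinkage = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    posterior_means = prior_mean + shrinkage * (estimates - prior_mean)
    bayes_pick = np.argmax(posterior_means)

    best = np.max(true_values)
    naive_regret.append(best - true_values[naive_pick])
    bayes_regret.append(best - true_values[bayes_pick])

print("avg regret, naive pick:   ", np.mean(naive_regret))
print("avg regret, Bayesian pick:", np.mean(bayes_regret))
```

With this setup the naive rule ends up with noticeably higher average regret, because its argmax is disproportionately won by inflated high-noise estimates, which is the optimizer’s curse at work.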
Where do you get the 3-4 months max training time from? GPT-3.5 was made available March 15th, so if they made that available immediately after it finished training, that would still have left 5 months for training GPT-4. And more realistically, they finished training GPT-3.5 quite a bit earlier, leaving 6+ months for GPT-4's training.
Are you saying that you would have expected GPT-4 to be stronger if it was 500B+10T? Is that based on benchmarks/extrapolations or vibes?
Not direct implication, because the AI might have other human-concerning preferences that are larger than 1/trillion. C.f. top-level comment: “I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms.”
I’d guess “most humans survive” vs. “most humans die” probabilities don’t correspond super closely to “presence of small pseudo-kindness”, both because other preferences could outweigh it, and because cooperation/bargaining is a big reason why humans might survive aside from intrinsic preferences.