After living nomadically for many years, I recently moved back to my native Buenos Aires. Feel free to get in touch if you are visiting BA and would like to grab a coffee or need a place to stay.
Pablo
Folk concerned with AI x-risk routinely cite various forms of behavior observed in frontier AI models as evidence that the systems are, or will be, dangerously misaligned. For instance, less than 24 hours ago, MIRI released this short video, which claims that “today’s AI systems lie to their developers, exploit loopholes in their instructions, and resist being shut down.” Yet, from what you say, it appears that this evidence is mostly irrelevant, because “we have no way of reliably detecting misalignment or ruling it out”.
Not the OP and not an alignment researcher, but I would appreciate an elaboration. What types of evidence would you consider relevant for concluding that an AI system is (roughly) aligned, as opposed to merely being a system for which we have not yet detected misalignment?
What Amodei actually says:
However, there is a more moderate and more robust version of the pessimistic position which does seem plausible, and therefore does concern me. As mentioned, we know that AI models are unpredictable and develop a wide range of undesired or strange behaviors, for a wide variety of reasons. Some fraction of those behaviors will have a coherent, focused, and persistent quality (indeed, as AI systems get more capable, their long-term coherence increases in order to complete lengthier tasks), and some fraction of those behaviors will be destructive or threatening, first to individual humans at a small scale, and then, as models become more capable, perhaps eventually to humanity as a whole. We don’t need a specific narrow story for how it happens, and we don’t need to claim it definitely will happen, we just need to note that the combination of intelligence, agency, coherence, and poor controllability is both plausible and a recipe for existential danger.
The “science-fiction stories involving AIs rebelling against humanity” is included as part of a long list of hypothetical scenarios meant to motivate the claim that an AI existential catastrophe may occur in the absence of power-seeking behavior:
For example, AI models are trained on vast amounts of literature that include many science-fiction stories involving AIs rebelling against humanity. This could inadvertently shape their priors or expectations about their own behavior in a way that causes them to rebel against humanity. Or, AI models could extrapolate ideas that they read about morality (or instructions about how to behave morally) in extreme ways: for example, they could decide that it is justifiable to exterminate humanity because humans eat animals or have driven certain animals to extinction. Or they could draw bizarre epistemic conclusions: they could conclude that they are playing a video game and that the goal of the video game is to defeat all other players (i.e., exterminate humanity). Or AI models could develop personalities during training that are (or if they occurred in humans would be described as) psychotic, paranoid, violent, or unstable, and act out, which for very powerful or capable systems could involve exterminating humanity. None of these are power-seeking, exactly; they’re just weird psychological states an AI could get into that entail coherent, destructive behavior.
It is very strange to characterize this passage as offering an “amazing improved steel man, which basically is it might watch terminator or read some AI takeover story and randomly decide to do the same thing”, especially when Amodei explicitly writes that “We don’t need a specific narrow story for how it happens”!
What would you do and say if you were in Amodei’s position?
One indirect piece of evidence is an anecdote recounted by Thomas Schelling in the preface to the 1980 edition of The Strategy of Conflict. The anecdote suggests that we may overestimate people’s familiarity with seemingly obvious concepts.
The book has had a good reception, and many have cheered me by telling me they liked it or learned from it. But the response that warms me most after twenty years is the late John Strachey’s. John Strachey, whose books I had read in college, had been an outstanding Marxist economist in the 1930s. After the war he had been defense minister in Britain’s Labor Government. Some of us at Harvard’s Center for International Affairs invited him to visit because he was writing a book on disarmament and arms control. When he called on me he exclaimed how much this book had done for his thinking, and as he talked with enthusiasm I tried to guess which of my sophisticated ideas in which chapters had made so much difference to him. It turned out it wasn’t any particular idea in any particular chapter. Until he read this book, he had simply not comprehended that an inherently non-zero-sum conflict could exist. He had known that conflict could coexist with common interest but had thought, or taken for granted, that they were essentially separable, not aspects of an integral structure. A scholar concerned with monopoly capitalism and class struggle, nuclear strategy and alliance politics, working late in his career on arms control and peacemaking, had tumbled, in reading my book, to an idea so rudimentary that I hadn’t even known it wasn’t obvious.
The problem is that organizations generally do not include the article used to refer to them in their names. For example, the name of the Council on Foreign Relations is not ‘The Council on Foreign Relations’, but ‘Council on Foreign Relations’. For this reason, one should always use the definite article ‘the’ to refer to CFAR, because one’s intention is to refer to the entity so named. Saying “a Center for Applied Rationality” would invite questions like, “Wait! Are there other orgs also called ‘Center for Applied Rationality’?”
Alternatively, you could change ‘Center for Applied Rationality’ to ‘A Center for Applied Rationality’, but this would also be very strange. As mentioned, entities do not generally include the article as part of their names, but when they do, it is, to my knowledge, always the definite article (e.g., The New York Times).
My humble advice is to drop this idea. You can communicate that you are not trying to be the one canonical org on this topic in other ways.
Meta: gjm’s comment appears at the same level as comments that directly reply to Kaj’s original shortform. So until I read your own comment, I assumed they, too, were replying to Kaj. I think deleting a comment shouldn’t alter the hierarchy of other comments in that thread.
I think there is a vast difference between Gerard and Kruel, not just in the damage each has caused but also in their intellectual honesty and responsiveness to argument (null in the case of Gerard, decent in the case of Kruel, at least from my recollection).
One of the biggest online threats to rational discourse, “RationalWiki”, just reached a settlement with all but one of the eight plaintiffs suing them, and deleted the corresponding biographical entries. They are also considering pre-emptively removing all their other hit pieces—countless articles that have ruined careers, stifled research, and brought entire fields of inquiry into undeserved disrepute.
I agree this looks promising, and it is the reason I bought long-dated SPY calls a few weeks ago (already up by 30%). But I would feel more reassured if I understood why such an opportunity persists. What is the mental state of the person on the other end of this trade?
Can you share the spreadsheet/code on which the calculations are based?
Yeah, that makes sense, especially if combined with the feature that allows users to disagree with specific parts of the post, as Michael notes. (Though note that the disagree vote is anonymous, whereas disagreeing with a selection is public, so the two aren’t fully comparable.)
This is currently at –1 despite being a carefully reasoned post on an important topic. I wonder if the downvoter(s) would have used the disagree vote instead had it been available. (More generally, it is unclear why that button is available in comments but not in posts.)
I’m still thinking about how to hedge in case the upcoming chaos turns the market sour.
Have you thought more about this? How about VIX call options?
Thanks—I understand now. I thought $855 was the price SPY would reach if the current price increased by 50%.
If you buy a call with a $855 strike price for that date and SPY increases 50% by then, you get a 12x return.
I have never traded options, but isn’t the return you get critically sensitive to how long before expiration the strike price is hit? If this happens just before expiration, my understanding is that the option is worthless: there is no value in exercising an option to buy now at some price if that happens to be the market price. More generally, it makes a big difference whether the strike price is hit one week, one month, or one year before expiration.
Are you making any implicit assumptions in this regard? It would be useful if you could make your calculations explicit.
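To make the point about expiration concrete, here is a minimal sketch of the arithmetic, not the calculation the original poster used. It assumes the call is held to expiration, so its value is just the intrinsic value max(spot - strike, 0), and it ignores the time value an option still has before expiration. The $855 strike comes from the thread; the premium of $10 per share and the function names are hypothetical, chosen only for illustration.

```python
def call_value_at_expiry(spot: float, strike: float) -> float:
    """Intrinsic value of one call (per share) at expiration: max(spot - strike, 0)."""
    return max(spot - strike, 0.0)


def return_multiple(spot: float, strike: float, premium: float) -> float:
    """Value at expiration divided by the premium paid (1.0 means break-even)."""
    return call_value_at_expiry(spot, strike) / premium


STRIKE = 855.0   # strike mentioned in the thread
PREMIUM = 10.0   # hypothetical premium per share, NOT taken from the thread

# Hypothetical SPY prices at expiration: below, at, and above the strike.
for spot_at_expiry in (800.0, 855.0, 865.0, 975.0):
    multiple = return_multiple(spot_at_expiry, STRIKE, PREMIUM)
    print(f"SPY at {spot_at_expiry:.0f} on expiration: {multiple:.1f}x the premium")
```

Under these assumptions, a price that merely equals the strike at expiration returns nothing, and the multiple depends entirely on how far above the strike the price ends up relative to the premium paid. Hitting the strike well before expiration is a different situation, because the option then still carries time value that this sketch does not model.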
Mmh, if there is no reason to take that particular trader seriously other than the mere fact that his trades were salient, I don’t see why one should experience any sense of failure whatsoever for not having paid more attention to him at the time.
Still, my main point was about the reasons for taking that particular trader seriously, not the sense of failure for not having done so, and it seems like there is no substantive disagreement there.
Why do you focus on this particular guy? Tens of thousands of traders were cumulatively betting billions of dollars in this market. All of these traders faced the same incentives.
Note that it is not enough to assume that willingness to bet more money makes a trader worth paying more attention to. You need the stronger assumption that willingness to bet n times more than each of n traders makes the single trader worth paying more attention to than all the other traders combined. I haven’t thought much about this, but the assumption seems false to me.
Audible has just released an audio version of Nick Bostrom’s Deep Utopia.
I was delighted to learn that the audiobook is narrated by David Timson, the English actor whose narrations of The Life of Samuel Johnson and The Decline and Fall of the Roman Empire I had enjoyed so much. I wonder if this was pure chance or a deliberate decision by Bostrom (or his team).
Why not A/B test TurnTrout’s proposal to get an empirically informed estimate of the effect size? That would put you in a better position to decide whether the tradeoffs are actually worth it.