RobertM (LessWrong dev & admin as of July 5th, 2022)
And it seems like you forgot about them too by the time you wrote your comment.
It was not clear from your comment which particular catastrophic failures you meant (and in fact it’s still not clear to me which things from your post you consider to be in that particular class of “catastrophic failures”, for which of them you hold MIRI/CFAR at least partially responsible, by what mechanisms/causal pathways, etc.).
ETA: “OpenAI existing at all” is an obvious one, granted. I do not think EY considers SBF to be his responsibility (reasonable, given SBF’s intellectual inheritance from the parts of EA that were least downstream of EY’s thoughts). You don’t mention other grifters in your post.
FYI I am generally good at tracking inside baseball but I understand neither what specific failures[1] you would have wanted to see discussed in an open postmortem nor what things you’d consider to be “improvements” (and why the changes since 2022/04/01 don’t qualify).
- ^ I’m sure there were many, but I have no idea what you consider to have been failures, and it seems like you must have an opinion because otherwise you wouldn’t be confident that the changes over the last three years don’t qualify as improvements.
People sometimes ask me what’s good about glowfic, as a reader.
You know that extremely high-context joke you could only make to that one friend you’ve known for years, because you shared a bunch of specific experiences which were load-bearing for the joke to make sense at all, let alone be funny[1]? And you know how that joke is much funnier than the average low-context joke?
Well, reading glowfic is like that, but for fiction. You get to know a character as imagined by an author in much more depth than you’d get with traditional fiction, because the author writes many stories using the same character “template”, where the character might be younger, older, a different species, a different gender… but still retains some recognizable, distinct “character”. You get to know how the character deals with hardship, how they react to surprises, what principles they have (if any). You get to know Relationships between characters, similarly. You get to know Societies.
Ultimately, you get to know these things better than you know many people, maybe better than you know yourself.
Then, when the author starts a new story, and tosses a character you’ve seen ten variations of into a new situation, you already have _quite a lot of context_ for modeling how the character will deal with things. This is Fun. It’s even more Fun when you know many characters by multiple authors like that, and get to watch them deal with each other. There’s also an element of parasocial attachment and empathy, here. Knowing someone[2] like that makes everything they’re going through more emotionally salient—victory or defeat, fear or jubilation, confidence or doubt.
Part of this is simply a function of word count. Most characters don’t have millions of words[3] written featuring them. I think the effect of having the variation in character instances and their circumstances is substantial, though.
Probably I should’ve said this out loud, but I had a couple of pretty explicit updates in this direction over the past couple years: the first was when I heard about character.ai (and similar), the second was when I saw all TPOTers talking about using Sonnet 3.5 as a therapist. The first is the same kind of bad idea as trying a new addictive substance and the second might be good for many people but probably carries much larger risks than most people appreciate. (And if you decide to use an LLM as a therapist/rubber duck/etc, for the love of god don’t use GPT-4o. Use Opus 3 if you have access to it. Maybe Gemini is fine? Almost certainly better than 4o. But you should consider using an empty Google Doc instead, if you don’t want to or can’t use a real person.)
I think using them as coding and research assistants is fine. I haven’t customized them to be less annoying to me personally, so their outputs often are annoying. Then I have to skim over the output to find the relevant details, and don’t absorb much of the puffery.
If we assume conservatively that a bee’s life is 10% as unpleasant as chicken life
This doesn’t seem at all conservative based on your description of how honey bees are treated, which reads like it was selecting for the worst possible things you could find plausible citations for. In fact, very little of your description makes an argument about how much we should expect such bees to be suffering in an ongoing way day-to-day. What I know of how broiler chickens are treated makes suffering ratios like 0.1% (rather than 10%) seem reasonable to me. This also neglects the quantities that people are likely to consume, which could trivially vary by 3 OoM.
If you’re a vegan I think there are a bunch of good reasons not to make exceptions for honey. If you’re trying to convince non-vegans who want to cheaply reduce their own contributions to animal suffering, I don’t think they should find this post very convincing.
I agree it’s more related than a randomly selected Nate post would be, but the comment itself did not seem particularly aimed at arguing that Nate’s advice was bad or that following it would have undesirable consequences[1]. (I think the comments it was responding to were pretty borderline here.)
I think I am comfortable arguing that it would be bad if every post that Nate made on subjects like “how to communicate with people about AI x-risk” included people leaving comments with argument-free pointers to past Nate-drama.
The most recent post by Nate seemed good to me; I think its advice was more-than-sufficiently hedged and do not think that people moving in that direction on the margin would be bad for the world. If people think otherwise they should say so, and if they want to use Nate’s interpersonal foibles as evidence that the advice is bad that’s fine, though (obviously) I don’t expect I’d find such arguments very convincing.
- ^ When keeping in mind its target audience.
I think it would be bad for every single post that Nate publishes on maybe-sorta-related subjects to turn into a platform for relitigating his past behavior[1]. This would predictably eat dozens of hours of time across a bunch of people. If you think Nate’s advice is bad, maybe because you think that people following it risk behaving more like Nate (in the negative ways that you experienced), then I think you should make an argument to that effect directly, which seems more likely to accomplish (what I think is) your goal.
- ^ Which, not having previously expressed an opinion on, I’ll say once—sounds bad to me.
(Separately, even accepting for the sake of argument that you notice most work done and have a negative reaction to it, that is not very strong counterevidence to the original claim.)
If the only thing you see about Aella is that she had work done on her lips, then I think that sufficiently well demonstrates the point that you don’t notice most “high quality” plastic surgery.
They imagine writing small and carefully locked-down infrastructure and allowing the AIs to interact with it.
That’s surprising and concerning. As you say, if these companies expect their AIs to do end-to-end engineering and R&D tasks internally, it seems difficult to imagine how they could do that without having employee-level privileges. Any place where they don’t is a place where humans turn into a bottleneck. I can imagine a few possible objections to this:
- They don’t expect those bottlenecks to impose that much of a penalty.
  - I’m not sure how this could be true unless you think that AI systems will be pretty weak, which is sort of fighting the premise. What are we worried about, again?
- They expect the bottlenecks to impose a large penalty and plan on biting that bullet when the time comes.
  - I currently roll to disbelieve, based on both the publicly-held positions of the relevant organizations and their historical track records. Incentives here seem quite bad.
- They plan on trying to thread the needle by employing some control schemes where (for example) different “agents” have different permissions: a “code writing” agent has read permissions for (some parts of) the codebase, the ability to write, deploy, and test changes to that code in a sandboxed dev environment, and the ability to open a pull request with those changes; another set of agents has permissions to review pull requests, and then request changes, approve/merge them, or flag the PR as suspicious; yet another set of agents acts as gatekeepers to sensitive data that might be needed for some experiments, but only with good justification. (A rough sketch of what this might look like is below.)
  - This still suffers from the incentive gradient pushing quite hard to just build end-to-end agents. Not only will it probably work better, but it’ll be straight up cheaper and easier!
Like, to be clear, I would definitely prefer a world where these organizations wrote “small and carefully locked-down infrastructure” as the limited surface their AIs were allowed to interact with; I just don’t expect that to actually happen in practice.
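To make that third option concrete, here’s a minimal sketch of what such a permission split might look like. Everything in it is hypothetical: the role names, permission strings, and `isAllowed` helper are illustrative inventions, not a description of any lab’s actual tooling.

```ts
// Hypothetical sketch of a coarse role/permission split between agents.
// All names here are made up for illustration.

type Permission =
  | "repo:read"
  | "repo:write-branch"    // push to non-protected branches only
  | "sandbox:execute"      // run code and tests in an isolated dev environment
  | "pr:open"
  | "pr:review"
  | "pr:merge"
  | "data:read-sensitive"; // gated datasets, only with justification

interface AgentRole {
  name: string;
  permissions: Permission[];
  // If true, every action is logged and sampled for review.
  audited: boolean;
}

const roles: AgentRole[] = [
  {
    name: "code-writer",
    permissions: ["repo:read", "repo:write-branch", "sandbox:execute", "pr:open"],
    audited: true,
  },
  {
    name: "code-reviewer",
    // Can review and merge, but cannot author the changes it reviews.
    permissions: ["repo:read", "pr:review", "pr:merge"],
    audited: true,
  },
  {
    name: "data-gatekeeper",
    // Grants access to sensitive data only when a justification passes checks.
    permissions: ["data:read-sensitive"],
    audited: true,
  },
];

// A requested action is allowed only if the acting role holds that permission.
function isAllowed(role: AgentRole, requested: Permission): boolean {
  return role.permissions.includes(requested);
}

// Example: the code-writing agent cannot merge its own pull request.
const writer = roles[0];
console.log(isAllowed(writer, "pr:open"));  // true
console.log(isAllowed(writer, "pr:merge")); // false
```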
This comment describes how the images for the “Best of LessWrong” (review winners) were generated. (The exact workflow has varied a lot over time, as image models have changed quite a lot, and LLMs didn’t always exist, and we’ve built more tooling for ourselves, etc.)
The prompt usually asks for an aquarelle painting, often in the style of Thomas Schaller. (Many other details, but I’m not the one usually doing artwork, so not the best positioned to point to common threads.) And then there’s a pretty huge amount of iteration and sometimes post-processing/tweaking.
Almost every comment rate limit stricter than “once per hour” is in fact conditional in some way on the user’s karma, and above 500 karma you can’t even be (automatically) restricted to less than one comment per day:
```ts
// 3 comments per day rate limits
{
  ...timeframe('3 Comments per 1 days'),
  appliesToOwnPosts: false,
  rateLimitType: "newUserDefault",
  isActive: user => (user.karma < 5),
  rateLimitMessage: `Users with less than 5 karma can write up to 3 comments a day.<br/>${lwDefaultMessage}`,
},
{
  ...timeframe('3 Comments per 1 days'), // semi-established users can make up to 20 posts/comments without getting upvoted, before hitting a 3/day comment rate limit
  appliesToOwnPosts: false,
  isActive: (user, features) => (
    user.karma < 2000 &&
    features.last20Karma < 1
  ), // requires 1 weak upvote from a 1000+ karma user, or two new user upvotes, but at 2000+ karma I trust you more to go on long conversations
  rateLimitMessage: `You've recently posted a lot without getting upvoted. Users are limited to 3 comments/day unless their last ${RECENT_CONTENT_COUNT} posts/comments have at least 2+ net-karma.<br/>${lwDefaultMessage}`,
},
// 1 comment per day rate limits
{
  ...timeframe('1 Comments per 1 days'),
  appliesToOwnPosts: false,
  isActive: user => (user.karma < -2),
  rateLimitMessage: `Users with less than -2 karma can write up to 1 comment per day.<br/>${lwDefaultMessage}`
},
{
  ...timeframe('1 Comments per 1 days'),
  appliesToOwnPosts: false,
  isActive: (user, features) => (
    features.last20Karma < -5 &&
    features.downvoterCount >= (user.karma < 2000 ? 4 : 7)
  ), // at 2000+ karma, I think your downvotes are more likely to be from people who disagree with you, rather than from people who think you're a troll
  rateLimitMessage: `Users with less than -5 karma on recent posts/comments can write up to 1 comment per day.<br/>${lwDefaultMessage}`
},
// 1 comment per 3 days rate limits
{
  ...timeframe('1 Comments per 3 days'),
  appliesToOwnPosts: false,
  isActive: (user, features) => (
    user.karma < 500 &&
    features.last20Karma < -15 &&
    features.downvoterCount >= 5
  ),
  rateLimitMessage: `Users with less than -15 karma on recent posts/comments can write up to 1 comment every 3 days. ${lwDefaultMessage}`
},
// 1 comment per week rate limits
{
  ...timeframe('1 Comments per 1 weeks'),
  appliesToOwnPosts: false,
  isActive: (user, features) => (
    user.karma < 0 &&
    features.last20Karma < -1 &&
    features.lastMonthDownvoterCount >= 5 &&
    features.lastMonthKarma <= -30
  ), // Added as a hedge against someone with positive karma coming back after some period of inactivity and immediately getting into an argument
  rateLimitMessage: `Users with -30 or less karma on recent posts/comments can write up to one comment per week. ${lwDefaultMessage}`
},
```
I think you could make an argument that being rate limited to one comment per day is too strict given its conditions, but I don’t particularly buy this as an argument against rate limiting long-term commenters in general.
But presumably you want long-term commenters with large net-positive karma staying around and not be annoyed by the site UI by default.
A substantial design motivation behind the rate limits, beyond throttling newer users who haven’t yet learned the ropes, was to reduce the incidence and blast radius of demon threads. There might be other ways of accomplishing this, but it does require somehow discouraging or preventing users (even older, high-karma users) from contributing to them. (I agree that it’s reasonable to be annoyed by how the rate limits are currently communicated, which is a separate question from being annoyed at the rate limits existing at all.)
Hi Bharath, please read our policy on LLM writing before making future posts consisting almost entirely of LLM-written content.
In a lot of modern science, top-line research outputs often look like “intervention X caused a 7% change in metric Y, p < 0.03” (with some confidence intervals that intersect 0%). This kind of relatively gear-free model can be pathological when it turns out that metric Y was actually caused by five different things, only one of which was responsive to intervention X, but in that case the effect size was very large. (A relatively well-known example is the case of peptic ulcers, where most common treatments would often have no effect, because the ulcers were often caused by an H. pylori infection.)
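For concreteness, here’s a toy back-of-the-envelope version of that pathology, with made-up numbers chosen only to land on the same 7% headline figure: if only one of five equally-common causes of metric Y responds to intervention X, a large within-subgroup effect shows up as a small top-line average.

```ts
// Toy numbers, not data from the post: five equally-common causes of metric Y,
// and only the X-responsive subgroup sees a large improvement.
const shareResponsive = 0.2;     // 1 of 5 causes responds to intervention X
const effectInResponsive = 0.35; // 35% change where the cause actually matches
const effectInOthers = 0;        // no effect anywhere else

// The population-average ("top-line") effect washes out to something small.
const topLineEffect =
  shareResponsive * effectInResponsive + (1 - shareResponsive) * effectInOthers;

console.log(topLineEffect.toFixed(2)); // "0.07", i.e. the headline "7% change in metric Y"
```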
On the other end of the spectrum are individual ~~trip reports~~ self-experiments. These too have their pathologies[1], but they are at least capable of providing the raw contact with reality which is necessary to narrow down the search space of plausible theories and discriminate between hypotheses.

With the caveat that I’m default-skeptical of how this generalizes (which the post also notes), such basic foundational science seems deeply undersupplied at this level of rigor. Curated.
- ^ Taking psychedelic experiences at face value, for instance.
Also, I would still like an answer to my query for the specific link to the argument you want to see people engage with.
I haven’t looked very hard, but sure, here’s the first post that comes up when I search for “optimization user:eliezer_yudkowsky”.
The notion of a “powerful optimization process” is necessary and sufficient to a discussion about an Artificial Intelligence that could harm or benefit humanity on a global scale. If you say that an AI is mechanical and therefore “not really intelligent”, and it outputs an action sequence that hacks into the Internet, constructs molecular nanotechnology and wipes the solar system clean of human(e) intelligence, you are still dead. Conversely, an AI that only has a very weak ability to steer the future into regions high in its preference ordering, will not be able to much benefit or much harm humanity.
In this paragraph we have most of the relevant section (at least w.r.t. your specific concerns; it doesn’t argue for why most powerful optimization processes would eat everything by default, but that “why” is argued for at such extensive length elsewhere, when talking about convergent instrumental goals, that I will forgo sourcing it).
No, I don’t think the overall model is unfalsifiable. Parts of it would be falsified if we developed an ASI that was obviously capable of executing a takeover and it didn’t, without us doing quite a lot of work to ensure that outcome. (Not clear which parts, but probably something related to the difficulties of value loading & goal specification.)
Current AIs aren’t trying to execute takeovers because they are weaker optimizers than humans. (We can observe that even most humans are not especially strong optimizers by default, such that most people don’t exert that much optimization power in their lives, even in a way that’s cooperative with other humans.) I think they have much less coherent preferences over future states than most humans. If by some miracle you figure out how to create a generally superintelligent AI which itself does not have (more-coherent-than-human) preferences over future world states, whatever process it implements when you query it to solve a Very Difficult Problem will act as if it does.
EDIT: I see that several other people already made similar points re: sources of agency, etc.
I think you misread my claim. I claim that whatever models they had, they did not predict that AIs at current capability levels (which are obviously not capable of executing a takeover) would try to execute takeovers. Given that I’m making a claim about what their models didn’t predict, rather than what they did predict, I’m not sure what I’m supposed to cite here; EY has written many millions of words. One counterexample would be sufficient for me to weaken (or retract) my claim.
EDIT: and my claim was motivated as a response to paragraphs like this from the OP:
It doesn’t matter that Claude is a bleeding heart and a saint, now. That is not supposed to be relevant to the threat model. The bad ones will come later (later, always later…). And when they come, will be “like Claude” in all the ways that are alarming, while being unlike Claude in all the ways that might reassure.
Like, yes, in fact it doesn’t really matter, under the original threat models. If the original threat models said the current state of affairs was very unlikely to happen (particularly the part where, conditional on having economically useful but not superhuman AI, those AIs were not trying to take over the world), that would certainly be evidence against them! But I would like someone to point to the place where the original threat models made that claim, since I don’t think that they did.
This is LLM slop. At least it tells you that upfront (and that it’s long). Did you find any interesting, novel claims in it?
I enjoyed most of this post but am (as always) frustrated by the persistent[1] refusal to engage with the reasons for serious concern about ASI being unaligned by default that came from the earliest of those who were worried, whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Separately: I think you are somewhat too pessimistic about the state of knowledge re: the “spiritual bliss attractor state” among Anthropic employees prior to the experiments that fed into their most recent model card, and I am also sort of confused about why you think this is obviously a more worthy target for investigation than whatever else various Anthropic employees were doing. Like, yes, it’s kind of weird and interesting. But it also doesn’t strike me as a particularly promising research direction given my models of AI risk, and even though most Anthropic employees are much more optimistic than I am, I expect the same is true of them. Your argument seems to skip the necessary step of engaging with their upstream model, going directly to being confused about why their (different) model is (predictably) leading them to different conclusions about what they should be prioritizing. I think you should either engage with their upstream model or make a specific argument that they’re making a mistake even conditional on their model, which is not obvious to me.
- ^ Not by you, necessarily, but by the cluster of people who point to the behavior of current LLMs as if they are supposed to be meaningful evidence against the original arguments for risk from ASIs.
This is a great experiment, similar to some that I’ve been thinking about over the last few months; thanks for running it. I’d be curious what the results are like for stronger models (and whether, if they do, that substantially changes their outputs when answering in interesting ways). My motivations are mostly to dig up evidence of models having qualities relevant for moral patienthood, but it would also be interesting from various safety perspectives.
Why did the early computer vision scientists not succeed in writing a formal ruleset for recognizing birds, and why did it ultimately take a messy kludge of inscrutable learned heuristics to solve that task?
I disapprove of Justice Potter Stewart in many respects, but “I know it when I see it” is indeed sometimes the only practical[1] way to carve reality.
(This is not meant to be a robust argument, just a couple of pointers at countervailing considerations.)
- ^ For humans.