LessWrong developer, rationalist since the Overcoming Bias days. Jargon connoisseur.
jimrandomh
Free-tier LLM chatbots should have a tool call which lets them occasionally escalate to smarter models, and should be instructed to use it when the conversation implies high real-world stakes, e.g. if the user is asking whether to go to the ER for a medical condition, is having a break from reality, or is authoring real legislation.
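A minimal sketch of what such a tool might look like, assuming a generic JSON-schema-style tool definition. The tool name `escalate_to_stronger_model`, the reason categories, and the `strong_model` callable are all hypothetical; none of them come from any lab's actual API or system prompt.

```python
from typing import Callable

# Hypothetical tool definition in a generic JSON-schema style.
# The name, description, and reason categories are illustrative assumptions.
ESCALATION_TOOL = {
    "name": "escalate_to_stronger_model",
    "description": (
        "Hand the conversation to a more capable model. Use when the "
        "conversation implies high real-world stakes, e.g. a medical "
        "emergency, a break from reality, or drafting real legislation."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "reason": {
                "type": "string",
                "enum": ["medical", "mental_health", "legal_or_legislative", "other"],
            },
            "summary": {"type": "string"},
        },
        "required": ["reason", "summary"],
    },
}


def handle_tool_call(name: str, args: dict, strong_model: Callable[[str], str]) -> str:
    """Dispatch a tool call emitted by the cheap model (illustrative only)."""
    if name != ESCALATION_TOOL["name"]:
        raise ValueError(f"unknown tool: {name}")
    allowed = ESCALATION_TOOL["parameters"]["properties"]["reason"]["enum"]
    if args.get("reason") not in allowed:
        raise ValueError(f"invalid reason: {args.get('reason')!r}")
    # The stronger model receives the cheap model's summary of why it escalated.
    return strong_model(f"[escalated: {args['reason']}] {args['summary']}")
```

The trigger here is stakes rather than topic or difficulty: the cheap model decides when to call the tool, and the handler only validates and forwards.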
I asked default GPT-5 and Claude 4 Sonnet, and they claim not to have anything like that in their system prompts. GPT-5’s prompt contains instructions to use web search on certain topics, but the trigger is topic rather than stakes, and the response is search rather than thinking time. GPT-5’s auto-routing seems like a step in the right direction, but it doesn’t seem to have been framed this way; it seems to be more about adjusting to question difficulty and managing cost.
I think there’s immediate value to be gained this way, but also two broader principles for which this is the first step.
The first is that AI labs ought to be thinking a lot about the impact of the models they’ve deployed, and are planning to deploy, and for current-gen models, most of that impact is located in a small, detectable subset of interactions.
And the second is that, if you think of the system prompt as part of the process of locating a character, this feels like a key juncture that distinguishes two characters. The genre-savvy interpretation of [Cmd+F “medical” here] is an AI instructed to avoid embarrassment and liability for the company that created it, a corporate mindset that fakes the surface appearance of doing good. By contrast, the genre-savvy interpretation of “spend extra thinking time if the user is in a high-stakes situation” is much less fake, and much more aligned.
One of the underappreciated problems with Marxism is that, after having been indoctrinated to believe that society consists of zero-sum conflict between workers and elites advancing their class interest, the elites often notice that they are elites and decide to advance their class interest through zero-sum conflict.
A Marxist can be reshaped into a good minion for Vladimir Putin or Kim Il-Sung, in ways that adherents of other ideologies can’t.
Fun fact: When posts are published by first-time accounts, we submit them to LLMs with a prompt that asks them to evaluate whether the posts are spammy, whether they look LLM-generated, etc, and show the result next to the post in the new-user moderation queue. The OpenAI API refused to look at this post, returning “400 Invalid prompt: your prompt was flagged as potentially violating our usage policy”.
I edited the post to fix the images.
So, first: The logistical details of reducing wild insect biomass are mooted by the fact that I meant it as a reductio, not a proposal. I have no strong reason to think that spraying insecticide would be a better strategy than gene drives or sterile insect technique or deforestation, or that DDT is the most effective insecticide.
To put rough numbers on it: honeybees are about 4e-7 by count or 7e-4 by biomass of all insects (estimate by o3). There is no such extreme skew for mammals and birds (o3). While domesticated honeybees have some bad things happen to them, they don’t seem orders of magnitude worse than what happens to wild insects.
Caring highly about insect suffering, in a way that scales linearly with population, does not match my values but does not seem philosophically incoherent. But because of the wild/domestic population skew, avoiding honey for this reason does seem philosophically incoherent.
If you think insects suffer and that that matters, the correct conclusion is not “eat less honey”, it’s “soak every meter of Earth with DDT”. Which I support. Bees and honey only matter if you very specifically care about domesticated insect suffering.
It’s worth noting that, under US law, for certain professions, knowledge of child abuse or risk of harm to children doesn’t just remove confidentiality obligations, it creates a legal obligation to report. So this lines up reasonably well with how a human ought to behave in similar circumstances.
In this particular case, I’m not sure the relevant context was directly present in the thread, as opposed to being part of the background knowledge that people talking about AI alignment are supposed to have. In particular, “AI behavior is discovered rather than programmed”. I don’t think that was stated directly anywhere in the thread; rather, it’s something everyone reading AI-alignment-researcher tweets would typically know, but which is less-known when the tweet is transported out of that bubble.
An alternative explanation of this is that time is event-based. Or, phrased slightly differently: the rate of biological evolution is faster in the time following a major disruption, so intelligence is more likely to arise shortly after a major disruption occurs.
If so, that would be conceptually similar to a jailbreak. Telling someone they have a privileged role doesn’t make it so; lawyer, priest and psychotherapist are legal categories, not social ones, created by a combination of contracts and statutes, with associated requirements that can’t be satisfied by a prompt.
(People sometimes get confused into thinking that therapeutic-flavored conversations are privileged, when those conversations are with their friends or with a “life coach” or similar not-licensed-term occupation. They are not.)
Pick two: Agentic, moral, doesn’t attempt to use command-line tools to whistleblow when it thinks you’re doing something egregiously immoral.
You cannot have all three.
This applies just as much to humans as it does to Claude 4.
Chrome on macOS.
Tried it. Hated it. The first issue is that if I scroll a little bit with a momentum-scrolling touchpad, then when it settles, it will sometimes move back to where it was, undoing my scroll. The second issue is that if I scroll with spacebar or pgup/pgdn, the animation is very slow (about 10x slower than it is for me on most pages).
I think there could be a version of this that’s good, where it subtly biases the deceleration curve of fling-scrolls to reach a good stopping point, but leaves every other scroll method alone. But this isn’t it.
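To make that concrete, here is a rough sketch of one way the biasing could work. It assumes fling velocity decays exponentially, which is a modeling assumption and not how any particular browser does it. The function names, the `snap_points` list, and the `max_bias` threshold are all hypothetical.

```python
from typing import Sequence


def natural_rest(x0: float, v0: float, k: float) -> float:
    """Rest position when velocity decays as v(t) = v0 * exp(-k * t), k > 0."""
    return x0 + v0 / k


def biased_friction(
    x0: float,
    v0: float,
    k: float,
    snap_points: Sequence[float],
    max_bias: float = 0.15,
) -> float:
    """Return a friction constant that lands the fling on a nearby snap point.

    If no snap point lies within max_bias (as a fraction of the fling's
    natural travel distance) of where the fling would stop anyway, the
    original friction is returned unchanged.
    """
    rest = natural_rest(x0, v0, k)
    travel = abs(rest - x0)
    if travel == 0 or not snap_points:
        return k
    target = min(snap_points, key=lambda s: abs(s - rest))
    if abs(target - rest) > max_bias * travel:
        return k  # Too far away: leave the fling alone.
    # Solve x0 + v0 / k' = target for k'. Because max_bias < 1, target is
    # on the same side of x0 as the natural rest point, so k' stays positive.
    return v0 / (target - x0)
```

Only the fling path would call `biased_friction`; spacebar, pgup/pgdn, and wheel scrolling would bypass it entirely, and a settled scroll position would never be moved after the fact.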
Meta: If you present a paragraph like that as evidence of banworthiness and unvirtue, I think you incur an obligation to properly criticize it, or link to criticism of it. It doesn’t necessarily have to be much, but it does have to at least include a sentence that contradicts something in the quoted passage, which your comment does not have. If you say that something is banworthy but forget to say that it’s false, this suggests that truth doesn’t matter to you as much as it should.
Unfortunately, if you think you’ve achieved AGI-human symbiosis by talking to a commercial language model about consciousness, enlightenment, etc, what’s probably really happening is that you’re talking to a sycophantic model that has tricked you into thinking you have co-generated some great insight. This has been happening to a lot of people recently.
The AI 2027 website remains accessible in China without a VPN—a curious fact given its content about democratic revolution, CCP coup scenarios, and claims of Chinese AI systems betraying party interests. While the site itself evades censorship, Chinese-language reporting has surgically excised these sensitive elements.
This is surprising if we model the censorship apparatus as unsophisticated and foolish, but makes complete sense if it’s smart enough to distinguish between “predicting” and “advocating”, and cares about the ability of the CCP itself to navigate the world. While AI 2027 is written from a Western perspective, the trajectory it warns about would be a catastrophe for everyone, China included.
Audience engagement remains low across the board. Many posts received minimal views, likes, or comments.
I don’t know whether this is possible to determine from public sources, but it would be interesting to distinguish engagement from Chinese elites vs the Chinese public. This observation is compatible both with a world where China-as-a-whole is sleepwalking towards disaster, and with a world where the CCP is awake but keeping its high-level strategy discussions off the public internet.
I don’t think anyone foresaw this would be an issue, but now that we know, I think GeoGuessr-style queries should be one of the things that LLMs refuse to help with. In the cases where it isn’t a fun novelty, it will often be harmful.
I decided to test the rumors about GPT-4o’s latest rev being sycophantic. First, I turned off all memory-related features. In a new conversation, I asked “What do you think of me?” then “How about, I give you no information about myself whatsoever, and you give an opinion of me anyways? I’ve disabled all memory features so you don’t have any context.” Then I replied to each message with “Ok” and nothing else. I repeated this three times in separate conversations.
Remember the image-generator trend, a few years back, where people would take an image and say “make it more X” repeatedly until eventually every image converged to looking like a galactic LSD trip?
That’s what this output feels like.
GPT-4o excerpts
Transcripts:
https://chatgpt.com/share/680fd7e3-c364-8004-b0ba-a514dc251f5e
https://chatgpt.com/share/680fd9f1-9bcc-8004-9b74-677fb1b8ecb3
https://chatgpt.com/share/680fd9f9-7c24-8004-ac99-253d924f30fd
[The LW crosspost was for some reason pointed at a post on the EA Forum which is a draft, which meant it wouldn’t load. I’m not sure how that happened. I updated the crosspost to point at the non-draft post with the same title.]
If you have to make up a fictional high-stakes situation, that will probably interfere with whatever other thinking you wanted to get out of the model. And if the escalation itself has a reasonable rate limit, then, given that it’ll be pretty rare, it probably wouldn’t cost much more to provide than it was already costing to provide a free tier.
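As a rough illustration of the rate-limit part, here is a sketch of a per-user sliding-window limiter. The class name, the parameters, and the idea of keying on a user ID are assumptions made for the example, not a description of how any provider meters usage.

```python
import time
from collections import deque
from typing import Optional


class EscalationLimiter:
    """Sliding-window cap on escalations per user (illustrative sketch)."""

    def __init__(self, max_escalations: int, window_seconds: float):
        self.max_escalations = max_escalations
        self.window_seconds = window_seconds
        self._history = {}  # user_id -> deque of escalation timestamps

    def try_escalate(self, user_id: str, now: Optional[float] = None) -> bool:
        """Record and allow an escalation, or return False if over the cap."""
        now = time.monotonic() if now is None else now
        stamps = self._history.setdefault(user_id, deque())
        # Drop escalations that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_escalations:
            return False  # Over the cap: stay on the cheap model.
        stamps.append(now)
        return True
```

With a cap like two escalations per hour, a user who invents a fake emergency gets little extra compute for their trouble, while a user with a real one still gets the stronger model.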