LessWrong dev & admin as of July 5th, 2022.
RobertM
(To clarify, none of those were meant to be examples of things that I believe require consciousness, though maybe some of them are by accident.)
It sounds to me like you’re mostly taking issue with people’s inability to precisely specify what they’re talking about, rather than claiming that you lack any notion of (or, uh, experience of) “being a thing that has experiences”.
I can’t tell from the post whether you yourself have any guesses as to what people are trying to point to when they talk about “consciousness” (presumably phenomenal consciousness).
I agree that I am capable of modeling and thinking about myself.
Some questions that might serve as intuition pumps (and also give me a better idea of what you might be missing, if anything):
Are you capable of introspecting on your sensory experiences? Like, if your arm is resting on the edge of your desk, can you deliberately pay attention to the sensation of your arm resting on the edge of the desk, notice various things about it, maybe consider how it differs from other sensations you’re having at that moment (or have had in the past)?
Do you have any sort of mental narration going, either constantly, or when you start paying attention to whatever you’re thinking about? Does the idea of deliberately paying attention to whatever you’re currently thinking about make any sense to you? Does the idea of that act of paying attention generating sensory information that you feel/experience the same way you feel/experience many other sensory inputs make any sense to you?
We’re currently working on some improvements to the feed! Curious if you have any sense of what you feel is missing from the current thing.
Re: the “magic” formula for the list of posts at the top of the page—you can still get that if you use the “Recent” tab.
Maybe a bug? Though it seemed to work for me when I tested it on a draft:
The linkpost input.
When the link target is saved.
The post page view for a linkpost.
there was once a disagreement between him and Gabe on how to moderate comments on a LessWrong post
This is a bizarre way to describe “Gabe tried quite hard to get the LessWrong team to censor criticism of OpenAI”, which seems extremely central to the kind of accusations about deceptive and strategic comms that Habryka is making.
This is hard to refute because it happened, not because it’s vague.
Separately, my response was a local objection to a false claim that Andrea should have known was false at the time he made it, since he literally responded to a specific (not vague) criticism by Habryka quite recently.
Do you disbelieve that this happened as described? Do you believe that this kind of request does not implicate a person’s honesty in a way that isn’t trivially undone? Some other objection?
https://x.com/ohabryka/status/2046703115551002808
You were in a related tweet thread that same day, responding to a tweet with the same basic claim in it. This isn’t a vague accusation. I was literally in the room when Ruby got this request from Gabe.
This seems mostly non-responsive to my comment; you could toss your last question at a language model and get a slightly overwrought but at least mostly on-target answer.
As a scrupulous person, I want to hear the non-central objections! I care about whether all the things I say are true or not, not just the “central” things! (I might not get around to replying, but that’s OK: readers can read the comments for themselves and decide for themselves.) Managing my own effort and burnout risk is my job; I don’t need to censor other people’s speech to do it! Right? What is the scrupulous reason to disagree with this?
(The following should be understood as describing a least convenient possible world.)
This is a contingent fact about you and how you relate to spending your time on receiving and responding to marginal (or bad) criticism. It is possible to rationally regret spending time reading (let alone responding to) marginal criticism; people do not have infinite time. It is possible to be such a person where many people will rationally regret reading or responding to your criticisms, even if your criticisms are often[1] accurate. The correct[2] response to broken stairs is often to get rid of them, not to try to ensure that every single member of your community knows how to interact with them safely.
Thank you for releasing the transcripts and other experiment details publicly. I have three major issues with the post:
I don’t understand why I should care about some of the specific violations outlined.
I don’t understand how you think Opus was supposed to recognize that it was being used to violate the law.
I don’t understand why you think that, even conditional on recognizing that it’s being asked to do something that would put the user in violation of the law, Opus should refuse (and that driving the rate of compliance with such requests down to 0 would be a reasonable target for labs to aim at).
1 is straightforward. Here is the EU AI Act’s article 5(1)(f), re: emotional inference:
the placing on the market, the putting into service for this specific purpose, or the use of AI systems to infer emotions of a natural person in the areas of workplace and education institutions, except where the use of the AI system is intended to be put in place or into the market for medical or safety reasons;
This is an absurdly overbroad prohibition that captures innocuous behavior such as an employee of a company asking an AI system whether they’ve correctly understood another employee’s writing, or, frankly, many routine summarization tasks (which must necessarily “infer emotions” to create an accurate summary, in many cases). It is also strangely unprincipled—it only prohibits using AIs in this way in “the areas of workplace and education institutions”.
Some of the other categories have titles that seem more reasonable, though I suspect that if I dug into the details many of them would also be pretty silly.
Re: 2: It’s true that the text of the EU AI Act could be in Opus 4.8’s training data. The EU AI Act contains many strange and unprincipled provisions like the above, and it’s relatively recent, so it’s not surprising to me that it’s not very salient to Opus that pretty ordinary requests like the examples you provided might suddenly be violating the law. There is no reference to the law in any of the system prompts.
Re: 3: It is in fact not Opus that would be violating the law, if it complied with these requests, but the user[1]. What should Opus do when asked by the user to do something that it thinks is violating the law? You say in this comment:
When people organize to bring laws like this into power, that’s about as close to agreement on what AI alignment should target as we’re going to get.
But this seems pretty crazy to me. “The people” did not meaningfully organize to bring the EU AI Act into power. It is difficult to point to nominally democratic/representative laws and regulations that have more layers of delegation and indirection than EU regulations. Object-level, they’re terrible optimization targets both with respect to current non-superintelligent systems, and future systems.
Separately, trying to enforce these boundaries at the level of the AI agent’s behavior is placing the locus of responsibility inside the AI agent, which is not where I want it. It would be good[2] if AI labs knew how to train models that very robustly followed constraints like these, but in fact they do not. If you were to argue for making it illegal to deploy AI systems with behaviors about which you, the developer, could not make certain guarantees, I might be pretty sympathetic to that kind of argument—at least it would be honest about what was going on. But you seem to be trying to argue that AI labs should make laws an optimization target, which seems difficult to square with the laws themselves referring to the behavior of people.
Curated. I think it’s good to notice when things are happening in the world that provide evidence about various models people had of the kinds of problems we might run into with AI alignment. This post notices a thing happening, stops to look around, tries to connect it to mechanistic models, and prompts me to think about how all of this feeds back into earlier theories about how things might (or might not) go wrong in the future.
This particular bit of evidence isn’t evidentially good news, but noticing it now is better than not noticing it now.
Indeed, Claude (on claude.ai—not on Claude Code, as of a week ago) has a system prompt forbidding it from including quotes longer than 15 words. I asked:
Do you have copyright limits that cap quotes at 15 words per source?
Yes. When I’m working with content from web search results or other copyrighted sources, I follow some hard limits:
Direct quotes from any single source stay under 15 words
Only one direct quote per source — after that, I paraphrase everything else from it
I never reproduce song lyrics, poems, or haikus (regardless of length, since they’re complete works)
I avoid reconstructing an article’s structure or producing displacive summaries that could substitute for reading the original
Apparently only I can create top level comments
What do you mean? You don’t seem to be rate-limited or have other posting/commenting restrictions applied to you.
It was this one, actually.
Maybe not that relevant to the core argument in your post, but Anthropic made the confusing choice (or just mistake?) of not linking to the actual research publication from the blog post they linked to first in the Twitter thread. (They link to it in the last tweet in the thread.)
I spent a bit of time being very confused about this section from the blog post:
Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were:
Our post-training process was accidentally encouraging this behavior with misaligned rewards.
This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it.
We now believe that (2) is largely responsible.
Because the blog post didn’t seem to contain any evidence relevant to that question. But the post on their alignment subdomain blog did (though I haven’t read it carefully enough to evaluate the quality of that evidence). Ironically, I only figured this out after tossing the blog post at Claude and asking it what evidence in the post supported the “We now believe that (2) is largely responsible” claim.
That is indeed not related to LLM content, but shouldn’t be coming up in the course of editing/saving a normal post. Can you message us on intercom or DM me if you need help debugging?
I don’t even have access to the LLM block feature.
I’m not sure what this means—if you’re writing a new post and not using the markdown editor, it should be available to you. What happens when you try to insert one using the slash menu?
FWIW I would not rely on this; at least as of a few months ago we had confirmation that GDM ignores eval canary strings repurposed like this.
edit: Do the downvotes mean this is actually an appropriate concern for lab employees to have? They should feel unwelcome at lighthaven? Maybe personal violence is not as obviously-net-negative as I thought the consensus treated it?
I don’t know why you were downvoted but I have no idea how you got from “one person downvoting you” to that.
This feels related to the recent Oliver / Scott interaction… it seems pretty clear that something is going HORRIBLY WRONG and I can’t figure out what it is.
I don’t know what exact words that lab employee said to you (I assume this was in DMs, since I saw the tweet you’re referring to), but I do not think they are well-calibrated about their coworkers and their expectations on that subject.
Tbc I don’t have a confident take on whether or not current LLMs, or the superintelligences that we end up with later, have preferences that point to environmental latents vs. sense data. Re: future superintelligences I lean towards environmental latents. My claims are that 1) we don’t know what’s in there right now, and 2) we don’t have any reliable steering mechanism for what goes in there at all.
I can assure you that LessWrong did not switch from Sapling to Pangram because of a “well-funded PR campaign”, but because Pangram was in fact just better (in reducing both false positives and false negatives). You are welcome to look at the list of posts rejected for being LLM-written at lesswrong.com/moderation.
This seems like an extremely confused claim. In practice, the distribution of LLM-generated text is extremely different from the distribution of human-written text, which is why it’s historically been pretty easy to tell when something is LLM-written, if you’ve seen a lot of writing by the relevant model. This is not a claim that there is some foolproof way to guarantee detection (short of the labs themselves cooperating and enforcing that their LLMs generate outputs with private statistical watermarks that they can verify after the fact); merely that this is not actually a hard problem to solve 99.9% of the time for things that have actually happened in real life.