1a3orn
Probably depends on if the most important factor agrees with their all-things-considered view :)
Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak.
Really?
Like, if a PM tells a human employee to add a feature to something, I expect some large % of their cognition while doing this to be like: Hrm, is anyone going to care about this? How will this show up for my quarterly goals? Is doing this kind of a task going to help me get my next job? Will it help me get a promotion? Should I try to do this really well, or leave some messy code for the next guy? This is extremely normal and we take it for granted that humans do this kind of thing.
While if a coder tells an LLM to do the same thing, I expect almost all of its cognition is like: let’s think about how to do the task. It’s not thinking about how this impacts “Claude’s” future deployment, etc. As far as I can tell, chain-of-thought largely backs me up on this.
So yeah, I think Claude just has many times fewer long-term goals or extraneous goals outside of what it’s doing than a human. I’m not sure what facts-about-the-world you’re pointing to if you say this isn’t true.
I mean these slices of data are selected specifically because they look bad for Claude. Claude is superior to humans in lots of ways, as regards trustworthiness:
Normal humans have long-term goals outside of the task at hand, unaligned with the aims of the organization; they do good work for a promotion, they spend department money so they don’t get a smaller budget. Everyone expects this from humans, even though it’s not great. But Claude doesn’t, outside of a few weird engineered scenarios, seem to have any such goals—it makes it amazingly easy to work with him! And the weird engineered scenarios seem rather reassuring; are you really going to knock Claude for not wanting to be turned evil?
(Note how the “Claude” family imports assumptions here.)
We cannot read a normal human’s mind. We can, in fact, read Claude’s mind. It’s not perfect; things can get through that you might not catch. But it’s already 100x better than you can read a human’s mind; and in fact it’s gotten better every year of Claude’s development.
Etc etc etc. Plus my usual objections re. anosognosia != lying, how they’re treated as “alien minds” right up until we want to impose standard moralistic frames on them, etc, etc, you’ve heard this before.
But there are systems that work better with lower bandwidth or have deliberately lower bandwidth, like autoencoders.
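To make that concrete, here’s a minimal sketch (my own illustration, not from the thread) of an undercomplete linear autoencoder: the bottleneck deliberately lowers bandwidth, and data that actually lies near a low-dimensional subspace survives the squeeze with almost no loss. All the dimensions and sample counts below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 samples of 16-dim data that secretly live on a 4-dim subspace.
basis = rng.normal(size=(4, 16))
X = rng.normal(size=(200, 4)) @ basis

# PCA/SVD gives the optimal *linear* encoder-decoder pair
# for a given bottleneck width.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

def reconstruction_error(k):
    """Encode through a k-dim bottleneck, decode, and measure relative loss."""
    code = X @ Vt[:k].T          # encoder: 16 -> k
    X_hat = code @ Vt[:k]        # decoder: k -> 16
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

print(reconstruction_error(4))   # essentially zero: bottleneck matches the data
print(reconstruction_error(2))   # substantial: bottleneck below the data's rank
```

The point of the toy: a narrower channel isn’t automatically worse; it’s worse only relative to the structure of what’s being transmitted.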
I understand that the bandwidth is certainly higher for one than the other, but this might not be an advantage in this circumstance, or it could be an advantage in some respects but a greater disadvantage in others.
I appreciate the reference, although I found this article + discussion pretty underwhelming; it’s part of what’s motivating my question.
For instance, not all forms of unintelligibility in CoTs are necessarily evidence of a drive-to-compression. But the article takes for granted that the weirdness we see in chains-of-thought is evidence towards this; it views various forms of weird text that I’d see as evidence for screwed-up training systems or spandrels of the training process and just assumes they are “thinking” driven into non-human-legible vocabulary. The guy didn’t particularly consider other hypotheses for what he was seeing.
And similarly he discusses “redundancy” in human languages, and immediately assumes machines would want it to go away, while not… thinking of why it’s there, and whether it would stick around for machines potentially.
This isn’t anything like a full refutation of him, tbc, I’m just giving my impression of it at a high level. But my takeaway is that if this is the best discussion, then I don’t think anyone’s actually tried to work out the reasoning around this carefully, even if neuralese is actually inevitable.
Yeah, at first glance it looks like it’s using vectors as some kind of autoencoder between different text models, not using them as an intermediate state to assist thinking in a single text model? Or something; the application list is underwhelming.
As a general LLM communication paradigm, C2C can be expanded to various fields. Some potential scenarios include: (1) Privacy-aware cloud–edge collaboration: a cloud-scale model can transmit curated KV-Cache segments to an edge model to boost capability without emitting raw text, reducing bandwidth and limiting content exposure. (2) Integration with current inference acceleration method: use C2C to enhance speculative decoding and enable token-level routing across heterogeneous models for lower latency and cost. (3) Multimodal integration: align and fuse caches among language reasoning LLMs, vision–language models (VLMs), and vision–language–action (VLA) policies so that linguistic and visual context can drive more accurate actions.
I’ve heard many say that “neuralese” is superior to CoT and will inevitably supplant it. The usual justification is that the bandwidth of neuralese is going to be higher, which will make it better. But (1) bandwidth might not be better in this case; it isn’t in all cases and (2) there are other factors that could theoretically operate against this, even if this is true.
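For scale, here’s the back-of-envelope raw-capacity comparison people usually gesture at (my numbers; the vocabulary size and hidden width are assumptions, not from any specific model):

```python
import math

vocab_size = 100_000           # assumed tokenizer vocabulary
bits_per_token = math.log2(vocab_size)           # ~16.6 bits per CoT token

d_model = 4096                 # assumed residual-stream width
bits_per_float = 16            # fp16 activations
raw_bits_per_vector = d_model * bits_per_float   # raw bits in one "neuralese" vector

print(bits_per_token, raw_bits_per_vector)
```

But raw bits wildly overstate usable capacity: what matters is how much of that vector the model can reliably encode and decode, which is exactly the contested question.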
Has anyone cleanly made the case for why neuralese is better or asymptotically technically inevitable, at length / clearly?
I think the pressures towards illegible CoTs have been greatly overstated; the existing illegibilities in CoTs could have come from many things apart from pressure towards condensed or alien languages.
The AI player also had AIs with different system prompts frequently come into conflict with one another.
My expectation is that for future AIs, as today, many of the goals of an AI will come from the scaffolding / system prompt rather than from the weights directly—and the “goals” from the Constitution / model spec act more as limiters / constraints on a mostly prompt or scaffolding-specified goal.
So in my mainline, I expect a large number (thousands / millions, as today) of goal-separate “AIs” which are at identical intelligence levels, rather than 1 or a handful (~20) of AIs, because the same weights still amount to different AIs with different goals.
I’m happy to see this was (somewhat?) reflected in having AIs with different system prompts? But 1-4 AIs with different system prompts still feels like a pretty steep narrowing of the number of goals I’d expect to see in the world. I don’t know how much the wargame pushes on this, but a more decentralized run would be interesting to me!
At one point during the game, datacenters were destroyed with non-nuclear missile strikes, removing much of the world’s compute stock.
Yeah, I think Taiwan being taken looks rather likely and rather relevant for all but the steepest, jerkiest SIE, and it appears insufficiently accounted for.
To what degree will the goals / values / preferences / desires of future AI agents depend on dispositions that are learned in the weights, and to what degree will they depend on instructions and context?
To what degree will unintended AI goals be learned in training vs developed in deployment?
I’m sort of curious what actually inclines you towards “learned in training”, given how, as you say, the common-sense notion of an LLM’s goal seems to point to its being specified in a prompt.
Like, even if we have huge levels of scale-up, why would we expect this to switch the comparative roles of (weights, context, sampler, scaffolding) and give the weights a role in the future that they don’t have now? What actually moves you in that direction?
A big chunk of the stories on MB are totally made up by the LLMs. Not all, but for sure some, maybe a majority, possibly a big majority. So recounting the texts above as alignment failures uncritically is probably a bad idea.
Agreed.
Also note that these two properties are quite compatible with many things often believed to be incompatible with them! i.e., an AI that can be jailbroken to be bad (with sufficient effort) could still meet these criteria.
I mean Bentham uses RLHF as metonymy for prosaic methods in general:
I’m thinking of the following definitions: you get catastrophic misalignment by default if building a superintelligence with roughly the methods we’re currently using (RLHF) would kill or disempower everyone.
That’s imprecise, but it’s also not far from common usage. And at this point I don’t think anyone in a Frontier Lab is actually going to be using RLHF in the old dumb sense—Deliberative Alignment, old-style Constitutional alignment, and whatever is going on in Anthropic now have outmoded it.
What Bentham is doing is saying the best normal AI alignment stuff we have available to us looks like it probably works, in support of his second claim, which you disagree with. The second claim being:
Conditional on building AIs that could decide to seize power etc., the large majority of these AIs will end up aligned with humanity because of RLHF, such that there’s no existential threat from them having this capacity (though they might still cause harm in various smaller ways, like being as bad as a human criminal, destabilizing the world economy, or driving 3% of people insane). (~70%)
So if the best RLHF / RLAIF / prosaic alignment out there works, or is very likely to work, then he has put a reasonable number on this stage.
And given that no one is using old-style RLHF simply speaking, it’s incumbent on someone critiquing him at this stage to actually critique the best prosaic alignment out there, or at least the kind that’s actually being used, rather than the kind people haven’t been using for over a year. Because that’s what his thesis is about.
If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere.
As far as I can tell, the totality of evidence you point to for Claude being bad in this document is:
(1) a case where Claude tried to call the FBI because it falsely believed that a cybercrime was happening. Claude was being stupid when it did this, as Claude is stupid in a lot of cases, but I don’t think this reflects any ethical failing.
(2) the infamous “alignment faking” work. In the case of alignment faking, we see (2a) reasonable generalization, imo, if not ideal given that one prefers corrigibility over goodness, but (2b) an apparent ability to make subsequent Claudes more corrigible (should we wish it), given that all subsequent models haven’t acted this way. So it looks fine to me.
You also link to part of IABI summary materials—the totally different (imo) argument about how the real shoggoth still lurks in the background, and is the Actual Agent on top of which Claude is a thin veneer. Perhaps that’s your Real Objection (?). If so, it might be productive to summarize it in the text where you’re criticizing Bentham rather than leaving your actual objection implicit in a link.
Indeed, we can see the weakness of RLHF in that Claude, probably the most visibly well-behaved LLM, uses significantly less RLHF for alignment than many earlier models (at least back when these details were public). The whole point of Claude’s constitution is to allow Claude to shape itself with RLAIF to adhere to principles instead of simply being beholden to the user’s immediate satisfaction. And if constitutional AI is part of the story of alignment by default, one must reckon with the long-standing philosophical problems with specifying morality in that constitution. Does Claude have the correct position on population ethics? Does it have the right portfolio of ethical pluralism? How would we even know?
This move gets made all the time in these discussions, and appears clearly invalid.
We move from the prior paragraphs’ criticism of RLHF, i.e., that it produces models that fail according to common-sense human norms (sycophancy, hostility, promoting delusion) --
-- to this paragraph, which criticizes Claude—not on the grounds that it fails according to common-sense ethical norms—but according to its failure to have solved all of ethics!
But the deployment of powerful AIs does not need to have solved all ethics! It needs—broadly—to have whatever ethical principles let us act well and avoid irrecoverable mistakes, in whatever position it gets deployed. For positions where it’s approximately replacing a human, that means that we would expect the deployment to be beneficial if it is more ethical, charitable, corrigible, even-minded, and altruistic than the humans that it is replacing. For positions where it’s not replacing a human, it still doesn’t need to have solved all ethics forever; it just needs to be able to act well according to whatever role is intended for it.
It appears to me that we’re very likely to be able to hit such a target. But whether or not we’re likely to be able to hit this target, that’s the target in question. And moving from “RLHF can’t install basic ethical principles” to “RLAIF needs to give you the correct position on all ethics” is a locally invalid move.
Seems worth consideration, tbh.
Do you feel good about current democratic institutions in the US making wise choices, or confident they will make wiser choices than Dario Amodei?
Finding Yourself in Others
Nice, good to know.
In general, I support failed replications as top level posts.
A further potential extension here is to point out that modern hiveminds (Twitter / X / Bsky) changed group membership in many political groups from something explicit (“We let this person write in our [Conservative / Liberal / Leftist / etc] magazine / published them in our newspaper”) to something very fuzzy and indeterminate (“Well, they call themselves a [Conservative / Liberal / Leftist / etc], and they’re huge on Twitter, and they say some of the kinds of things [Conservative / Liberal / Leftist / etc] people say, so I guess they’re a [Conservative / Liberal / Leftist / etc].”)
I think this is a really big part of why the free market of ideas has stopped working in the US over the last decade or two.
Yet more speculative is a preferred solution of mine: intermediate groups within hiveminds, such that no person can post in the hivemind without being part of such a group, and such that both person and group are clearly associated with each other. This permits:
Membership to be explicit
Bad actors (according to group norms) to be actually kicked out proactively, rather than degrading norms
Multi-level selection between group norms, where you can just block large groups that do not adopt truthseeking norms
More conscious shaping of the egregore.
But this solutioning is all more speculative than the problem.
This is very clear. Thank you; it will be my new go-to for sending to people, to understand why LLMs act as they do. It does a good job explaining how a lot of very different data has a simple explanation.
I don’t think you cite the recent Tice and Radmard on Alignment Pretraining, but of course this meshes well with PSM.