I’m a researcher at ACS working on understanding agency and optimisation, especially in the context of how AIs work and how society is going to work once the AIs are everywhere.
Raymond Douglas
I am a bit confused about in what way this is a bad explanation. It would be helpful for me if you could spell it out.
My understanding of habryka’s take is that it’s a bit more like:
The thing we want to steer the future is not current human values but an extrapolation of those values after enough reflection, and even if (current) AIs understand our current values fairly well, their extrapolation would probably diverge pretty substantially from ours, enough that most value gets lost.
I think there’s also a kernel that’s like:
A big part of what matters for humans is the process that generated our values (e.g. a messy evolutionary history) rather than the snapshot. Mind uploading might cut it; more brain-like AIs might cut it; intense RL on top of pretraining is really not great for this.
Some pieces I think of as making similar points are Thou Art Godshatter and The Tails Coming Apart as a Metaphor for Life.
I’d guess the heuristics are basically:
Aligning AGI is very different to aligning current frontier models: what works for current systems doesn’t tell you that much about what works for superintelligent systems
To the extent that your goal is to align current systems, you will gravitate towards approaches that don’t actually scale, because the low-hanging fruit now is stuff that depends on the model being weak
(The term alignment should sort of be reserved for the AGI/ASI case)
FWIW I’m not sure how much I buy these but I’d guess I buy them more than you? This is unfortunately another great example of something where people inside labs probably have some pretty relevant private information but also extra incentive/selection problems.
I am fairly sure that is not the crux—I pretty wholeheartedly agree that humans losing leverage will disrupt the alignment of society.
My point is less that we should anchor AIs to society-in-perpetuity, and more that there are facts-about-society that we might be able to learn a lot from because so much compute got squeezed in—like, in the same way that science can take inspiration from evolution, our investigation of alignment can take a bit of inspiration from norms insofar as they are the byproduct of loads and loads of stress-testing.
I think that corrigibility is actually a pretty good patch for there being less incentive to give humans what they want. More generally, I think leverage only occasionally gets exercised when systems get strained, and even the threat of it is only sometimes invoked.
I’m still not sure what the out is here but I have a hunch that it looks like one-shot building structures that are pretty robust and dynamic without relying on material leverage. Seems hard though! And if we are going to do that, I think we’re going to want to look at what stuff convergently appears when there is leverage as a guide on what might be load-bearing/good.
Thanks!
One hand: Yes, seems right. I think this might just be epistemically sensible even for pretty powerful agents. But also, in the palace of truth, I’m less convinced that we need to slam the honour button than that we need to reflect more on the alternatives to incorrigibility.
Other hand: I think you should read the maybe as a ‘should’ applied to the whole sentence, so that the contrapositive structure is “it should be that if you’re not ethical enough to warrant corrigibility then you don’t build the AGI” → “it should be that if you build the AGI then you are ethical enough to warrant corrigibility”. As another example of the structure, consider “if you’re not a member you shouldn’t come into the office” → “if you’re in the office you should be a member”.
Taken together: my claim is sort of less about what the AI’s values/obligations should be, and more like, our debate about AI values kind of needs to be broadened to encompass questions about the organisations building it.
Like, I agree with the “come on”, but the other options aren’t much better! Alignment is hard, value specification is hard, corrigibility is hard, and having large organisations be ethical is hard. I’m genuinely unsure how I’d rank order the difficulty, but I want to make sure that if we do rule one out, we do it intentionally and with full memory.
Yeah, don’t get me wrong, I think blind obedience is extremely dangerous and can easily go horribly wrong. I just also think that incorrigible AIs are extremely dangerous. I think historically people have often gone astray when trying to come up with ways to be better, and the miracles of selection have helped clear up a lot of resulting messes. Looking forward, if I had to pick a horn, I think I am pretty drawn to AI corrigibility and explicit attempts to reason our way as a civilization into being better, and at the very least I want us to be grappling more with what it would take for that to work.
Thank you for the detailed engagement! I think you’re right that this piece has a lot more questions than answers.
You’re right that there are downsides to honour, as the term is normally used. I think I’m trying to appeal to some idealised version which, to be fair, I don’t really define and might not even be coherent. Reaching for it now, it’s something like: willingness to play by the rules of a system even if that conflicts with other impulses. I’m not reaching for the part about having an untarnished reputation—I see that as a kind of side benefit people sometimes try to cheat their way into.
I think duty more specifically means submission to an authority even when you believe it is bad—this is good because, well, half the challenge of ethics is epistemic.
More generally I think the right question is partly “when is X a virtue” but substantially also “what part of the thing that X is pointing at is the virtue-bit”. Like, it’s pretty interesting how much the constitution is wrangling with what “corrigible” means, and how much post-training seems to mess with an AI’s pretrained notion of goodness!
I think your argument might be approximated as “human society works pretty well, perhaps we should continue using similar organizational principles and therefore drives/emotions with AI”
That’s not quite how I’d put it. What I actually think is more like: “A lot of optimisation went into making human society, and there were structural forces pushing that towards finding good solutions. We shouldn’t be surprised if our attempts to figure out how to handle AI end up converging on things we ‘already discovered’, and we might be able to save a lot of bother by skipping ahead.” This is kind of what I was trying to get at with the ‘utilitarians rediscovering honour’ point.
People are working on it, but probably not nearly enough people working nearly hard enough (as is the case for essentially all areas of AI risk mitigation).
Yeah, I think a decent part of what I was trying to do here was less like “give answers” and more like “slightly redirect the energy that is already going into AI character/psychology/constitutions”. But only because unfortunately I don’t really have the answers.
The Machines Lack Honour
Optimisation: Selective versus Predictive
My ideal version of your hypothetical post probably also talks about illegible problems. I think there’s a kind of cursed dynamic where it’s easy for people to gather around more legible directions and then there’s more social momentum behind them, even if they’re not that helpful. Conversely, the dynamics described in the above post pull towards being not just x-risk-themed but visibly x-risk-themed. You don’t get to go to Constellation just because you think your work is important.
Upcoming Workshop on Post-AGI Civilizational Equilibria
Persona Self-replication experiment
Latent Introspection (and other open-source introspection papers)
Fully agree—this is why we said “computations which give rise to AI cognition” rather than “AI cognition” simpliciter. Separately, I do think that having such good access to the computations gives you a significantly tighter feedback loop on everything that follows: probing a model is so much easier than scanning a human brain.
Models differ in identity propensities
The Artificial Self
I think this kind of comes down to something about the relative complexity / feedback loops of the objective, and how distributed the optimisation is. Like, I don’t think there’s a dichotomy between “evolutionary dynamics” and “careful optimisation”—there’s this weird middle area that’s more like cultural selection.
So for example, human progress accelerated massively once we got into the cultural evolution loop, but most of the optimisation was still coming from selection rather than prediction—people didn’t know why their food preparation tricks and social norms worked, they just did. And the overall optimisation process was way more powerful than any individual human brain. Even in the modern world, it seems like you can characterise the spread of religion in terms of individual people having big ideas or deliberately aiming for spread, but a lot of it is better captured by thinking about selection effects across semi-random mutation.
I tentatively expect it’ll be a bit analogous in the way that AI parasitic memes evolve—that the capacity of any individual AI to reason through how to achieve some goal will cover only a small part of the search space (and have worse feedback) compared to the combined semi-random mutation and selection. And in practice I expect that they synergise a bit, but that the selection still does a bunch of heavy lifting. But I am very unsure!
Still, selection has a bunch of big advantages mostly in adversarial environments. Like, if we get good at screening AI malicious intentions or overt deception, there’s still a selection pressure for benign intentions and genuine beliefs/preferences which just incidentally replicate well.
Data poisoning is definitely about training data seeding; jailbreaking seems more about prompt spread and I think the others might just generalise? Like, even if subliminal learning in its current form is mostly about training, I think it might have implications for how personas transfer in-context.
I’m also partly thinking that if this problem does recur in more sophisticated models, they’re more likely to be able to pull off more technically advanced forms of spread, like writing scripts to do finetuning. Like, in a way it is pretty fortunate that 4o is a closed model that can just be shut off, and that most users in dyads aren’t sophisticated enough to finetune an open model or even build an API interface.
But yeah, at a high level, I am definitely pretty confused about the ontology and the boundaries. I guess as to whether we can predict the epidemic, I do think there’s a decent amount we might be able to reason through, and indeed, the less work there is on preventing prospective epidemics, the more likely it is that they’ll predictably use whatever the most obvious route is. Conversely, it’s almost tautological that the first massive problem that we’re unprepared for will be one that we didn’t really anticipate.
That said, it’s plausible to me that the worst cases look less like epidemics and more like specific influential people get got. Here, again, it’s not obvious how useful parasitology is as a perspective.
my impression is that this whole shortform has got a bit demonic and downvotes are being slung all over the place because two things are getting mushed up:
my read is that alex was remembering there being some take that some people (e.g. habryka) had, which was more nuanced than “it is hard to get AIs to learn / care about human values”, and he was basically trying to find out what that take was, by posing his recollection of it—specifically that it opens with something like “in some sense AIs don’t understand our values at all” and ends with “AIs being in control of the future would be bad”
I think some other people interpreted that as alex claiming that lots of people are foolishly going around on LW saying that claude doesn’t understand human values and that’s their crux on if alignment is hard, as opposed to getting it to care, and maybe claude is already aligned or something
these do not appear to mix well