Mateusz Bagiński
Charbel-Raphaël and Lucius discuss Interpretability
‘Theories of Values’ and ‘Theories of Agents’: confusions, musings and desiderata
GPTs’ ability to keep a secret is weirdly prompt-dependent
I ask you, do you really think that an AI aligned to human values would refrain from doing something like this to anyone? One of the most fundamental aspects of human values is the hated outgroup. Almost everyone has somebody they’d love to see suffer. How many times has one human told another “burn in hell” and been entirely serious, believing that this was a real thing, and 100% deserved?
This shows how vague a concept “human values” is, and how differently people can interpret it.
I always interpreted “aligning an AI to human values” as something like “making it obedient to us, ensuring it won’t do anything that we (whatever that ‘we’ is—another point of vagueness) wouldn’t endorse, lowering suffering in the world, increasing eudaimonia in the world, reducing X-risks, bringing the world closer to something we (or smarter/wiser versions of us) would consider a protopia/utopia”.
Certainly I never thought it to be a good idea to imbue the AI with my implicit biases, outgroup hatred, or whatever. I’m ~sure that people who work on alignment for a living have also seen these skulls.
I know little about CEV, but if I were to coherently extrapolate my volition, then one aspect of that would be increasing the coherence and systematicity of my moral worldview and behavior, including how (much) different shards conform to it. I would certainly trash whatever outgroup bias I have (not counting a generally greater fondness for the people/other things close to me).
So, yeah, solving “human values” is also part of the problem, but I don’t think it makes the case against aligning AI.
“Wanting” and “liking”
Does anybody know what happened to Julia Galef?
(Meta) Why don’t you use capital letters, except in acronyms? I find it harder to parse.
[Question] How do you manage your inputs?
Meta-comment on the speculation about who might have sabotaged Nord Stream.
It seems like people here mostly implicitly treat possible state actors as coherent, unified agents. But maybe it wasn’t any particular state acting as a whole, but rather some small group within a state that decided to do it on its own. Even if they considered it likely that they would be identified after the fact, the subgroup may have judged the sabotage to be in the interest of the whole nation, or maybe just of that particular subgroup.
(I don’t know how much fragmentation of that sort there is in any given country, but I think it’s at least plausible.)
I think wanting to seem like sober experts makes them kinda believe the things they expect other people to expect to hear from sober experts.
Also, there was Gato, trained on a shitload of different tasks and achieving good performance on a vast majority of them, which led some to call it a “subhuman AGI”.
I agree with the post in general, although I’m not quite sure how you would stitch several narrow models/systems together to get an AGI. A more viable path is probably something like training it end-to-end, like Gato (needless to say, please don’t).
I would love to see something like Vanessa’s LTA reading list but for devinterp.
Reading this reminds me of the red flags that some people (e.g., Soares) saw when interacting with SBF and then, once the shit hit the fan, ruminated over not having taken appropriate action.
Typo: it’s “Goodhart”, not “Goodheart”
First, this presupposes that for any amount of suffering there is some amount of pleasure/bliss/happiness/eudaimonia that could outweigh it. Not all LWers accept this, so it’s worth pointing that out.
But I don’t think the eternal paradise/mediocrity/hell scenario accurately represents what is likely to happen in that case. I’d be more worried about somebody using AGI to conquer the world and establish a stable totalitarian regime built on some illiberal ideology, like shariah (according to Caplan, it’s entirely plausible for global totalitarianism to persist indefinitely). If you get to post-scarcity, you may grant all your subjects UBI, all basic needs met, etc. (or you may not, if you decide that this policy contradicts the Quran or the hadith), but if your convictions are strong enough, women will still be forced to wear burqas, be basically slaves of their male kin, etc. One could argue that abundance robustly promotes a more liberal worldview, a loosening of social norms, etc., but AFAIK there is no robust evidence for that.
This is meant just to illustrate that you don’t need an outgroup to impose a lot of suffering. Having a screwed-up normative framework is enough.
This family of scenarios is probably still better than AGI doom, though.
Can’t you restate the second one as the relationship between two utility functions $u$ and $v$ such that increasing one (holding background conditions constant) is guaranteed not to decrease the other? I.e., the derivative of each with respect to the other is non-negative under every background condition.
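To spell out the condition I have in mind (my notation, not from the post: $u$ and $v$ are the two utility functions, $c$ ranges over background conditions, $x$ and $x'$ over outcomes):

$$\forall c \;\forall x, x'\colon \quad u(x, c) > u(x', c) \;\Rightarrow\; v(x, c) \ge v(x', c),$$

and symmetrically with $u$ and $v$ swapped. In the smooth case, where $v$ can locally be written as a function of $u$ with $c$ held fixed, this reduces to $dv/du \ge 0$ everywhere.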
[Question] What are the weirdest things a human may want for their own sake?
How about a dialogue on this, with no (asymmetric) posting rate limits?
I think sigma-algebras are probably not the right algebra to base beliefs on. Something resembling linear logic might be better for reasons we’ve discussed privately; that’s very speculative of course. Ideally the right algebra should be derived from considerations arising in construction of the representation theorem, rather than attempting to force any outcome top-down.
Have you elaborated on this somewhere, or can you link some resource on why linear logic would be a better algebra for beliefs than sigma-algebras?
I’d love to see/hear you on his podcast.