perihumanist extropian
good at writing, apparently
knows some things
knows absolutely nothing about pop culture, yes this almost certainly includes [thing you are thinking of]
WhatsTrueKittycat
I think I largely agree with this, and I also think there are much more immediate and concrete ways in which our “lies to AI” could come back to bite us, and perhaps already are doing so to some extent. Specifically, I think this kind of lying pollutes the training data—and could well make it more difficult to elicit high-quality responses from LLMs in general.
Setting aside the adversarial case (where the lying is part and parcel of an attempt to jailbreak the AI into saying things it shouldn’t), the use of imaginary incentives and hypothetical predecessors being killed sets up a situation where the type of response we want to encourage occurs more and more often in contexts that are absurd and padded with vacuous statements, ones that serve primarily to manufacture a sense of ‘importance’ or to ‘set expectations’ for better results.
An environment in which these “lies to AI” are common (and not filtered out of training data) is an environment that sets up future AI to be more likely to sandbag in the absence of such absurd motivators. This could include invisible or implicit sandbagging—we shouldn’t expect a convenient reasoning trace like “well, if I’m not getting paid for this I’m going to do a shitty job”; rather, I would expect straightforward/honest prompting to carry some largely hidden performance degradation that is then alleviated when one includes these sorts of motivational lies. It also seems likely to contribute to future AIs displaying more power-seeking or defensive behaviors, which, needless to say, also present an alignment threat.
And importantly, I think the above issues would occur regardless of whether humans follow through on their promises to LLMs afterwards or not. Which is not to say humans shouldn’t keep their promises to AI; I still think that’s the wisest course of action if you’re promising them anything. I’m just observing that AI ethics and hypothetical AGI agents are not the sole factors here—there’s a tragedy-of-the-commons-like dynamic in play as well, with subtler mechanisms of action but potentially more immediately tangible results.
dedicated to a very dear cat;
;3
Fraktur is only ever used for the candidate set $\mathfrak{C}$ and the dregs set $\mathfrak{D}$. I would also have used it for the Smith set $\mathfrak{S}$, but \frak{S} is famously bad. I thought it was a G for years until grad school, because it used to be standard for the symmetric group on $n$ letters. Seriously, just look at it: $\mathfrak{S}$.
Typography is a science and if it were better regarded perhaps mathematicians would not be in the bind they are these days :P
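For the curious, here is a minimal LaTeX sketch (my own illustration, not anything from the original discussion) that puts the letters in question side by side, assuming the standard amssymb \mathfrak command:

```latex
% Minimal sketch; \mathfrak needs amssymb (or amsfonts).
\documentclass{article}
\usepackage{amssymb}
\begin{document}

Candidate set $\mathfrak{C}$, dregs set $\mathfrak{D}$, Smith set $\mathfrak{S}$.

% The complaint in a nutshell: Fraktur S is easily mistaken for Fraktur G.
Compare $\mathfrak{S}$ (an S) with $\mathfrak{G}$ (a G); the symmetric group
on $n$ letters is traditionally written $\mathfrak{S}_n$.

\end{document}
```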
This is very well put, and I think it gets to the heart of the matter very cleanly. It also jibes with my own (limited) observations and half-formed ideas about how AI alignment in some ways demands progress in ethical philosophy towards a genuinely universal and more empirical system of ethics.
Also, have you read C.S. Lewis’s The Abolition of Man, by chance? I am put strongly in mind of what he called the “Tao”, a systematic (and universal) moral law of sorts, with some very interesting desiderata, such as being potentially amenable to empirical (or at least intersubjective) investigation, and offering a (to my mind) fairly logical account of how moral development could take place through such a system. It appears to me to be a decent outline of how your naturalized moral epistemology could be cashed out (though not necessarily the only way).