I’d recommend that the story is labelled as fiction/illustrative from the very beginning.
Stuart_Armstrong(Stuart Armstrong)
Thanks, modified!
I believe I do.
Thanks!
Having done a lot of work on corrigibility, I believe that it can’t be implemented in a value agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.
Instead, you could have a satisficer which tries to maximize the probability that the utility is above a certain value. This leads to different dynamics than maximizing expected utility. What do you think?
If U is the utility and u is the value that it needs to be above, define a new utility V, which is 1 if and only if U>u and is 0 otherwise. This is a well-defined utility function, and the design you described is exactly equivalent with being an expected V-maximiser.
Thanks! Corrected.
Thanks! Corrected.
Great and fun :-)
Another way of saying this is that inner alignment is more important than outer alignment.
Interesting. My intuition is the inner alignment has nothing to do with this problem. It seems that different people view the inner vs outer alignment distinction in different ways.
Thanks! Yes, this is some weird behaviour.
Keep me posted on any updates!
As we discussed, I feel that the tokens were added for some reason but then not trained on; hence why they are close to the origin, and why the algorithm goes wrong on them, because it just isn’t trained on them at all.
Good work on this post.
I’ll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)
Namely that I don’t think we can talk sensibly about an AI having “beneficial goal-directedness” without situational awareness. For instance, it’s of little use to have an AI with the goal of “ensuring human flourishing” if it doesn’t understand the meaning of flourishing or human. And, without situational awareness, it can’t understand either; at best we could have some proxy or pointer towards these key concepts.
The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
So I’d be focusing on “do the goals stay safe as the AI gains situational awareness?”, rather than “are the goals safe before the AI gains situational awareness?”
Here’s the review, though it’s not very detailed (the post explains why):
Thanks! Link changed.
A good review of work done, which shows that the writer is following their research plan and following up their pledge to keep the community informed.
The contents, however, are less relevant, and I expect that they will change as the project goes on. I.e. I think it is a great positive that this post exists, but it may not be worth reading for most people, unless they are specifically interested in research in this area. They should wait for the final report, be it positive or negative.
I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).
It’s rare that I encounter a lesswrong post that opens up a new area of human experience—especially rare for a post that doesn’t present an argument or a new interpretation or schema for analysing the world.
But this one does. A simple review, with quotes, of an ethnographical study of late 19th century Russian peasants, opened up a whole new world and potentially changed my vision of the past.
Worth it from its many book extracts and choice of subject matter.
Fails to make a clear point; talks about the ability to publish in the modern world, then brushes over cancel culture, immigration, and gender differences. Needs to make a stronger argument and back it up with evidence.
Thanks! Corrected (though it is indeed a good hard problem).
Pre-training and domain specific knowledge are not needed.
Run them on examples such as frown-with-red-bar and smile-with-blue-bar.
Which problems are you thinking of?