PhD student in theoretical computer science (distributed computing) in France. Currently transitioning to AI Safety and fundamental ML work.
Nice post. I might try to use this idea to force myself to make more bets socially. I’m risk-taking with ideas, creations, and jobs, but not enough when it comes to talking to new people and flirting. Forcing myself to start a conversation with a stranger every day is one way I’m trying to solve that; thinking about the rationality of the bet might become another.
When you say “shutdown avoidance incentives”, do you mean that the agent/system will actively try to avoid its own shutdown? I’m not sure why comparing with the current state would cause such a problem: the state with the least impact seems like the one where the agent lets itself be shut down, since otherwise it would go against the will of another agent. That’s how I understand it, but I’m very interested in knowing where I’m going wrong.
I understood that the baseline that you presented was a description of what happens by default, but I wondered if there was a way to differentiate between different judgements on what happens by default. Intuitively, killing someone by not doing something feels different from not killing someone by not doing something.
So my question was a check to see if impact measures considered such judgements (which apparently they don’t) and if they didn’t, what was the problem.
I’m pretty sure expert systems are considered AI, if only because they were created by AI researchers. They don’t use ML though, and it’s not considered likely today that they will scale to human-level AI or AGI.
The more I think about it, the more I come to believe that locality is very related to abstraction. Not the distance part necessarily, but the underlying intuition. If my goal is not “about the world”, then I can throw away almost all information about the world, keep only a few details, and still be able to check my goal. The “world” of the thermostat is in that sense a very abstracted map of the real world, where everything except the number on its sensor is thrown away.
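To make that concrete, here is a minimal sketch (all names and numbers are mine, purely illustrative): the thermostat’s goal check reads only one field of a much richer world state, which is exactly the “throwing away” I mean.

```python
# Illustrative sketch: a very "local" goal can be checked from a tiny
# abstraction of the world state. All names here are hypothetical.

def thermostat_goal_satisfied(world_state: dict, target: float = 20.0) -> bool:
    # The thermostat's "map" of the world keeps nothing
    # but the reading of its own temperature sensor.
    sensor_reading = world_state["sensor_temperature"]
    return abs(sensor_reading - target) < 0.5

# A richer world state; only one field matters to the goal.
world = {"sensor_temperature": 20.2, "people_in_room": 3, "weather": "rain"}
print(thermostat_goal_satisfied(world))  # True
```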
Sorry for the delay in answering.
Your paper looks great! It seems to tackle in a clean and formal way what I was vaguely pointing at. We’re currently reading a lot of papers and blog posts to prepare for an in-depth literature review about goal-directedness, and I added your paper to the list. I’ll try to come back here and comment after I read it.
In this post, I assume that a policy is a description of its behavior (like a function from states to actions, or to distributions over actions), and thus the distances mentioned do capture behavioral similarity. That being said, you’re right that defining a similar distance over the internal structure of policies would prove difficult, eventually running up against uncomputability.
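For concreteness, here is a rough sketch of one such behavioral distance (the representation of policies as per-state action distributions, the total-variation comparison, and the uniform weighting over states are all assumptions of mine, not something fixed by the post):

```python
import numpy as np

# Sketch: a behavioral distance between two policies, each given as an
# |S| x |A| matrix whose rows are action distributions per state.
def behavioral_distance(pi1: np.ndarray, pi2: np.ndarray) -> float:
    # Total-variation distance between action distributions,
    # averaged uniformly over states (an arbitrary weighting choice).
    per_state_tv = 0.5 * np.abs(pi1 - pi2).sum(axis=1)
    return float(per_state_tv.mean())

rng = np.random.default_rng(0)
pi_a = rng.dirichlet(np.ones(3), size=5)  # 5 states, 3 actions
pi_b = rng.dirichlet(np.ones(3), size=5)
print(behavioral_distance(pi_a, pi_a))  # 0.0: identical behavior
print(behavioral_distance(pi_a, pi_b))  # > 0: behaviorally different
```

Identical policies get distance 0, and the distance only looks at what the policies do, never at how they are implemented internally.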
In the specific example of the car, can’t you compare the impact of the two next states (the baseline and the result of braking) with the current state? Killing someone should probably be considered a bigger impact than braking (and I think it is for attainable utility).
But I guess the answer is less clear-cut for cases like the door.
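To illustrate the comparison I have in mind, a toy sketch (the utility numbers are invented, and a real attainable-utility penalty aggregates changes over many auxiliary utility functions rather than one):

```python
# Toy sketch: penalize the change in utility between the current state
# and each candidate next state. All numbers are made up.
def impact(u_current: float, u_next: float) -> float:
    return abs(u_next - u_current)

u_current = 10.0   # utility of the current state
u_brake = 9.5      # braking: a small disruption
u_baseline = 0.0   # the "do nothing" baseline: the pedestrian is hit

print(impact(u_current, u_brake))     # 0.5
print(impact(u_current, u_baseline))  # 10.0 -> braking is the lower-impact option
```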
Thanks for the post!
One thing I wonder: shouldn’t an impact measure give a value to the baseline? What I mean is that in the most extreme examples, the tradeoff you show arises because sometimes the baseline is “what should happen” and other times the baseline is “what should not happen” (like killing a pedestrian). In cases where the baseline sucks, one should act differently; and in cases where the baseline is great, changing it should come with a penalty.
I assume that there’s an issue with this picture. Do you know what it is?
Maybe the criterion that removes this specific policy is locality? What I mean is that this policy has a goal only on its output (which action it chooses), and thus a very local goal. Since the intuition of goals as short descriptions assumes that goals are “part of the world”, maybe this only applies to non-local goals.
Am I the only one for whom all comments on the Alignment Forum have 0 votes?
No worries, that’s a good answer. I was just curious, not expecting a full-fledged system. ;)
Thanks for the summary! It captures the idea well.
Just out of curiosity, how do you decide which posts/papers you want to write an opinion on?
I think the collection part of GTD addresses exactly this problem. There are two parts:
You want to free your brain by writing down what you want to do
You want to stop feeling like you forgot to write something down
The way proposed by GTD is to collect EVERYTHING. The goal is really to not have any commitment or desire stored internally, but collect everything outside of your brain. This solves the first problem if you give enough details, and the second problem when your brain learns that it can always find what it needs from your notes.
Anecdotally, it works for me.
I’ve been using GTD for some time now, and the injunction to put every thought about something to be done into a collection device (I have a page on Roam and a file in the note-taking app on my phone) is really powerful. I never noticed how much more focus and clarity were possible when everything I want to do is written somewhere, so I don’t need to keep mental tabs on it.
To go back to the metaphor of this post, forcing myself to close every tab that is not directly in use is one of the best productivity hacks I’ve learned.
Thanks! Glad that I managed to write something that was not causally or rhetorically all wrong. ^^
One related thing I was thinking about last week: part of the idea of abstraction is that we can pick a Markov blanket around some variable X, and anything outside that Markov blanket can only “see” abstract summary information f(X). So, if we have a goal which only cares about things outside that Markov blanket, then that goal will only care about f(X) rather than all of X.
That makes even more sense to me than you might think. My intuitions about locality come from its uses in distributed computing, where it measures both how many rounds of communication are needed to solve a problem and how far in the communication graph one needs to look to compute one’s own output. This matches my use of locality here.
On the other hand, recent work on distributed complexity has also studied the volume complexity of a problem: the size of the subgraph one needs to look at, which might be very different from a ball. The only real constraint is connectedness. Modulo the usual “exactness issue”, which we can deal with by replacing “the node is not used” by “only f(X) is used”, this looks a lot like your idea.
One intuition I have for the difference is the consuming/producing axis: the stimmer is purely consuming, the video game player is half consuming, half producing a narrative/experience, and the player is mostly producing something.
I’m not as sure why this should be morally relevant, but I do feel that producing is more “moral” than consuming, in general.
Ok, that makes much more sense. I was indeed assuming a proportional reward.
I don’t have kids, but if I had some, I would want to implement the “Hard Thing Rule” given by Angela Duckworth in her book “Grit”.
It boils down to 3 rules:
Everyone in the family must do one hard thing, one thing that requires deliberate practice every day. This can be yoga, programming, ballet, a lot of things really.
You can quit your hard thing, but only at “natural” stopping points (the end of the season, the end of the school year, ...).
You choose your hard thing.
I really like this idea, because it teaches work ethic, grit, and consistency in a way that respects individual differences.