So much of your writing sounds like an eloquent clarification of my own underdeveloped thoughts. I’d bet good money your LessWrong contributions have delivered me far more help than harm :) Thanks <3
X4vier
Sorry for the late response! I didn’t realise I had comments :)
In this proposal we go with (2): The AI does whatever it thinks the handlers will reward it for.
I agree this isn’t as good as giving the agents an actually safe reward function, but if our assumptions are satisfied then this approval-maximising behaviour might still result in the human designers getting what they actually want.
What I think you’re saying (please correct me if I misunderstood) is that an agent aiming to do whatever its designers reward it for will be incentivised to do undesirable things to us (like wiring up our brains to machines which make us want to press the reward button all the time).
It’s true that the agents will try to take these kinds of nefarious actions if they think they can get away with it. But in this setup the agent knows that it can’t get away with tricking the humans like this, since its ancestors already warned the humans that a future agent might try this, and the humans prepared appropriately.
Thanks for your comment. I think I’m a little confused about what it would mean to actually satisfy this assumption.
It seems to me that many current algorithms, for example, a rainbowDQN agent, would satisfy assumption 3? But like I said I’m super confused about anything resembling questions about self-awareness/naturalisation.
Thanks for the response!
Input/output: I agree that the unnatural input/output channel is just as much a problem for the ‘intended’ model as for the models harbouring consequentialists, but I understood your original argument as relying on there being a strong asymmetry where the models containing consequentialists aren’t substantially penalised by the unnaturalness of their input/output channels. An asymmetry like this seems necessary because specifying the input channel accounts for pretty much all of the complexity in the intended model.
Computational constraints: I’m not convinced that the necessary calculations the consequentialists would have to make aren’t very expensive (from their point of view). They don’t merely need to predict the continuation of our bit sequence: they have to run simulations of all kinds of possible universes to work out which ones they care about and where in the multiverse Solomonoff inductors are being used to make momentous decisions, and then they perhaps need to simulate their own universe to work out which plausible input/output channels they want to target. If they do all this, then all they get in return is a pretty measly influence over our beliefs (since they’re competing with many other daemons in approximately equally similar universes who have opposing values). I think there’s a good chance these consequentialists might instead elect to devote their computational resources to realising other things they desire (like simulating happy copies of themselves or something).
Okay, I agree. Thanks :)
Thanks heaps for the post, man, I really enjoyed it! While I was reading, it felt like you were taking a bunch of half-baked vague ideas out of my own head, cleaning them up, and giving some much clearer, more developed versions of those ideas back to me :)
It doesn’t make sense to use the particular consumer’s preferences to estimate the cruelty cost. If that’s how we define the cruelty cost, then the buyer should already be taking it into account when making their purchasing decision, so it’s not an externality.
The externality comes from the animals themselves having interests which the consumers aren’t considering.
If we expect there will be lots of intermediate steps—does this really change the analysis much?
How will we know once we’ve reached the point where there aren’t many intermediate steps left before crossing a critical threshold? How do you expect everyone’s behaviour to change once we do get close?
I think OP is correct about cultural learning being the most important factor in explaining the large difference in intelligence between homo sapiens and other animals.
The early chapters of The Secret of Our Success examine studies comparing the performance of young humans and young chimps on various cognitive tasks. The book argues that across a broad array of cognitive tests, 4-year-old humans do not perform significantly better than 4-year-old chimps on average, except in cases where the task can be solved by imitating others (human children crushed the chimps when this was the case).
The book makes a very compelling argument that our species is uniquely prone to imitating others (even in the absence of causal models about why the behaviour we’re imitating is useful), and that even very young humans have innate instincts for picking up on signals of prestige/competence in others and preferentially imitating those high-prestige people. Imo the arguments put forward in this book make cultural learning look like a much stronger theory than the Machiavellian intelligence hypothesis (although what actually happened at a lower level of abstraction probably includes aspects of both).
Out of interest—if you had total control over OpenAI—what would you want them to do?
Maybe an analogy which seems closer to the “real world” situation—let’s say you and someone like Sam Altman both tried to start new companies. How much more time and starting capital do you think you’d need to have a better shot of success than him?
Heartbreaking :’( still, that “taken time off from their cryptographic shenanigans” line made me laugh so hard I woke my girlfriend up