I’ll talk about some ways I’ve thought of for potentially formalizing “stop thinking if it’s bad”.
If your point is that there are a lot of things to try, I readily accept this point, and do not mean to argue with it. I only intended to point out that, for your proposal to work, you would have to solve another hard problem.
One simple way to try this is to have an agent use regular evidential decision theory, but with a special “stop thinking about this thing” action available to it. Every so often, the agent considers taking this action, evaluated with ordinary EDT. So, in the troll bridge case, it could potentially see that the path of reasoning it’s following is dangerous, and thus decide to stop. The agent also needs to avoid thinking too many thoughts between these checks; otherwise, it could think all sorts of problematic thoughts before it has a chance to stop itself.
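The mechanism might be sketched like this. Everything here is illustrative: the names, the state representation, and especially the `expected_utility` interface are assumptions of mine, not a worked-out EDT implementation — the hard part (computing those conditional expectations) is exactly what’s abstracted away.

```python
def run_agent(initial_state, step, expected_utility, budget=100, burst=1):
    """Think in small bursts; before each burst, use an EDT-style
    comparison to decide whether to take the special 'stop thinking'
    action and halt with the current conclusions.

    step: advances the reasoning state by one inference step.
    expected_utility(state, action): conditional expected utility of
    'keep_thinking' vs. 'stop_thinking' given the current state.
    """
    state = initial_state
    for _ in range(budget):
        eu_continue = expected_utility(state, action="keep_thinking")
        eu_stop = expected_utility(state, action="stop_thinking")
        if eu_stop >= eu_continue:
            return state  # stop before deriving anything dangerous
        # Keep the burst small, so the agent can't think many
        # problematic thoughts between checks.
        for _ in range(burst):
            state = step(state)
    return state
```

With toy `step` and `expected_utility` functions, the agent halts as soon as stopping looks at least as good as continuing, which is the whole point of the proposal: the check happens before each small burst of reasoning, not after the damage is done.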
This simple technique might actually be enough to solve the problem, especially if the AI has the ability to choose its own inference algorithm to find one that makes the AI able to realize, “thinking about this is bad” before it finds the concrete bad thing. And, for what it’s worth, it’s enough for me personally to get across the bridge.
Ordinary Bayesian EDT has to finish its computation (of its probabilistic expectations) in order to proceed. What you are suggesting is to halt those calculations midway. I think you are imagining an agent who can think longer to get better results. But vanilla EDT does not describe such an agent. So, you can’t start with EDT; you have to start with something else (such as logical induction EDT) which does already have a built-in notion of thinking longer.
Then, my concern is that we won’t have many guarantees for the performance of this system. True, it can stop thinking if it knows thinking will be harmful. However, if it mistakenly thinks a specific form of thought will be harmful, it has no device for correction.
This is concerning because we expect “early” thoughts to be bad—after all, you’ve got to spend a certain amount of time thinking before things converge to anything at all reasonable.
So we’re between a rock and a hard place here: we have to stop quite early, because we know the proof of troll bridge is small. But we can’t stop early, because we know things take a while to converge.
So I think this proposal is just “somewhat-logically-updateless-DT”, which I don’t think is a good solution.
Generally I think rollback solutions are bad. (Several people have argued in their favor over the years; I find that I’m just never intrigued by that direction...) Some specific remarks:
Note that if you literally just roll back, you will go forward the same way again. So you need to somehow modify the rolled-back state, creating a “pseudo-ignorant” belief state where you’re not really uninformed, but rather reconstruct something merely similar to an uninformed state.
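A toy deterministic sketch of why a literal rollback just replays itself (all names here are illustrative):

```python
def think(state):
    """One deterministic reasoning step (toy stand-in)."""
    return state + 1

def reason_until_bad(state, is_bad):
    """Run the reasoning process until it derives the 'bad' conclusion."""
    while not is_bad(state):
        state = think(state)
    return state

is_bad = lambda s: s == 5

first_run = reason_until_bad(0, is_bad)
# "Roll back" by restoring the exact initial state and reasoning again:
# the derivation is deterministic, so the same bad conclusion reappears.
second_run = reason_until_bad(0, is_bad)
assert first_run == second_run == 5
```

Hence the need to perturb the restored state into something merely resembling ignorance — which is where the trouble starts.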
It is my impression that this causes problems.
You might be able to deduce the dangerous info from the fact that you have gone down a different reasoning path.
I would thus argue that the rollback criterion has to use only info available before rolling back to decide whether to roll back; otherwise, it introduces new and potentially dangerous info to the earlier state.
But this means you’re just back to the non-rollback proposal, where you decide to stop reasoning at some point.
Or, if you solve that problem, rollbacks may leave you truly ignorant, but not solve the original problem which you rolled back to solve.
For example, suppose that Omega manipulates you as follows: reward [some target action] and punish [any other action], but only in the case that you realize Omega implements this incentive scheme. If you never realize the possibility, then Omega leaves you alone.
If you realize that Omega is doing this, and then roll back, you can easily end up in the worst possible world: you don’t realize how to get the reward, so you just go about your business, but Omega punishes you for it anyway.
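The payoff structure of this Omega scenario can be tabulated in a toy model (the payoff values and names are my own illustrative assumptions). The key asymmetry: Omega’s scheme activates if the agent *ever* realized it, and rolling back erases the knowledge but not the activation.

```python
def omega_payoff(ever_realized, action, target="target"):
    """Toy payoffs: Omega's incentive scheme is active iff the agent
    ever realized it exists; a rollback cannot undo 'ever_realized'."""
    if not ever_realized:
        return 0                      # Omega leaves you alone
    return 1 if action == target else -1

# Never realized the scheme: business as usual, payoff 0.
assert omega_payoff(False, "business_as_usual") == 0
# Realized it and kept the knowledge: take the target action, payoff +1.
assert omega_payoff(True, "target") == 1
# Realized it, then rolled back: the scheme is active, but the agent no
# longer knows to take the target action -- worst possible world, -1.
assert omega_payoff(True, "business_as_usual") == -1
```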
For this reason, I think rollbacks belong in a category I’d call “phony updatelessness”, where you’re basically trying to fool Omega by approximating an updateless state while sneakily keeping some of the advantages of updatefulness. This can work, of course, against a naive Omega; but it doesn’t seem like it really gets at the heart of the problem.
Simply put, I think rolled-back states are “contaminated” in some sense; you’re trying to get them clean, but the future reasoning has left a permanent stain.