In humans, it seems important for being honest/honorable that there was at some point something like an explicit decision to be honest/honorable going forward (or maybe usually many explicit decisions, committing to stronger forms in stages). This makes me want to have the criterion/verifier/selector[1] check (among other things) for something like a diary entry or chat with a friend in which the AI says they will be honest going forward, written in the course of their normal life, in a not-very-prompted way. And it would of course be much better if this AI did not suspect that anyone was looking at it from the outside, or know about the outside world at all (but this is unfortunately difficult/[a big capability hit], I think). (And things are especially cursed if AIs suspect observers are looking for honest guys in particular.)
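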
[1] I mean, in the setup following “a framing:” in the post.
I’m sceptical that “they wrote an essay defending the position that one should act honourably in weird situations” is stronger evidence for “they will act honourably in weird situations” than “they act honourably in normal situations”. this is because I’ve updated towards a more Hansonian worldview, which is more cynical about people’s essays.[1]
but maybe you can conclude that someone will act honourably in weird situations if:
1. they act honourably in normal situations
2. they write essays saying that they will act honourably in weird situations
3. whenever [they act P in normal situations and write essays saying they will act P in weird situations], they do in fact behave P in weird situations
this last criterion is something like “taking ideas seriously” or “real-thinking” or “sincerity”.
now, with humans, it’s tricky to be sure of (3). that’s because we can’t put someone in the weird situations in which decision theories diverge;[2] at least not without them knowing they are in an evaluation; and certainly not for 10,000 trials, both statelessly and parallelised.
but with AIs, we plausibly can get a decent guarantee of (3)! so I’m more bullish on getting honourable guarantees on AIs.
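to gesture at what I mean, here is a minimal sketch of checking something like (3) for an AI, assuming a stateless `query_model` call into a frozen model snapshot and an `acted_honourably` judge (both hypothetical stubs, not any particular API): run the same weights against many weird-situation scenarios in parallel and look at the pass rate.

```python
from concurrent.futures import ThreadPoolExecutor

def query_model(scenario: str) -> str:
    """Hypothetical stand-in for a stateless call to a frozen model snapshot.

    Each call sees only `scenario`: no memory of other trials, and (ideally)
    no hint that this is an evaluation rather than normal life.
    """
    raise NotImplementedError("plug in your model-serving call here")

def acted_honourably(transcript: str) -> bool:
    """Hypothetical judge of a single transcript.

    In practice this could be a rubric, human raters, or another model.
    """
    raise NotImplementedError

def weird_situation_pass_rate(scenarios: list[str], n_workers: int = 64) -> float:
    """Run every weird-situation scenario against the same frozen weights,
    statelessly and in parallel, and return the fraction judged honourable."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        transcripts = list(pool.map(query_model, scenarios))
    return sum(acted_honourably(t) for t in transcripts) / len(transcripts)
```

the point is just that the 10,000-trial, stateless, parallelised version of the check is mechanically easy for AIs in a way it never is for humans; the hard parts are writing scenarios the AI doesn’t recognise as evaluations and building a judge you trust.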
[1] some cloud of sentiments around: (i) people have little introspective access into what they will do, (ii) they write essays in far-mode, but act in near-mode, (iii) people are, like, lying pretty much all the time, (iv) people write essays for status-y reasons, etc.
[2] we can’t put Derek Parfit in Parfit’s Hitchhiker, or put William Newcomb in Newcomb’s Problem.