peligrietzer

Karma: 796

The Problem With the Word ‘Alignment’

peligrietzer and particlemania

21 May 2024 3:48 UTC

57 points

8 comments · 6 min read · LW link

Paper: Understanding and Controlling a Maze-Solving Policy Network

TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M and lisathiergart

13 Oct 2023 1:38 UTC

69 points

0 comments · 1 min read · LW link

(arxiv.org)

peligrietzer 14 Jun 2023 15:50 UTC
1 point
0
in reply to: paulfchristiano’s comment on: Cosmopolitan values don’t come free
Possibly relevant?

peligrietzer 2 May 2023 23:59 UTC
1 point
0
in reply to: M. Y. Zuo’s comment on: Some Thoughts on Virtue Ethics for AIs
I describe the more formal definition in the post:

‘Actions (or more generally ‘computations’) get an x-ness rating. We define the x shard’s expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent’s future actions conditional on a. (So the shard will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifices of present x-ness for big boosts to expected aggregate future x-ness.)’

And as I say in the post, we should expect decision-influences matching this definition to be natural and robust only in cases where x is a ‘self-promoting’ property. A property x is ‘self-promoting’ if it is reliably the case that performing an action with a higher x-ness rating increases the expected aggregate x-ness of future actions.
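To make the definition concrete, here is a minimal sketch of one way such a utility could look. Everything specific in it is an assumption for illustration: the use of tanh as the squashing function, the bounds 1.0 and 0.5, the names `bounded` and `shard_utility`, and the three hypothetical candidate actions. The post itself only requires that both terms be bounded and that the future term be more tightly bounded.

```python
import math

# Assumed bounds: the definition only says the future term is "more tightly bounded".
PRESENT_BOUND = 1.0
FUTURE_BOUND = 0.5


def bounded(value: float, bound: float) -> float:
    """Squash value into [-bound, bound]; tanh is an illustrative choice."""
    return bound * math.tanh(value)


def shard_utility(xness_now: float, expected_future_xness: float) -> float:
    """Expected utility of the x shard for one candidate action."""
    return bounded(xness_now, PRESENT_BOUND) + bounded(expected_future_xness, FUTURE_BOUND)


# Hypothetical candidates: (x-ness of the action, expected aggregate future x-ness).
candidates = {
    "baseline": (1.0, 0.0),
    "mild_sacrifice": (0.8, 3.0),      # slightly less x-ness now, big future boost
    "large_sacrifice": (-2.0, 100.0),  # heavy loss of x-ness now, huge future boost
}

# Rank candidates from most to least preferred by the shard.
ranked = sorted(candidates, key=lambda name: shard_utility(*candidates[name]), reverse=True)
print(ranked)
```

Under these assumed numbers the mild sacrifice outranks the baseline, while the large sacrifice ranks last: the tighter bound on the future term caps how much any future boost can compensate for lost present x-ness.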

peligrietzer 2 May 2023 6:28 UTC
9 points
−1
in reply to: shminux’s comment on: Some Thoughts on Virtue Ethics for AIs
Yep! Or rather arguing that from a broadly RL-y + broadly Darwinian point of view ‘self-consistent ethics’ are likely to be natural enough that we can instill them, sticky enough to self-maintain, and capabilities-friendly enough to be practical and/or survive capabilities-optimization pressures in training.

Some Thoughts on Virtue Ethics for AIs

peligrietzer 2 May 2023 5:46 UTC

74 points

7 comments · 4 min read · LW link

peligrietzer 25 Apr 2023 19:35 UTC
1 point
0
in reply to: TurnTrout’s comment on: Behavioural statistics for a maze-solving agent
This brings up something interesting: it seems worthwhile to compare the internals of a ‘misgeneralizing,’ small-n agent with those of large-n agents and check whether there seems to be a phase transition in how the network operates internally.

peligrietzer 23 Apr 2023 20:18 UTC
LW: 1 AF: 1
0
AF
in reply to: Max H’s comment on: Behavioural statistics for a maze-solving agent
I’d maybe point the finger more at the simplicity of the training task than at the size of the network? I’m not sure there’s strong reason to believe the network is underparameterized for the training task. But I agree that drawing lessons from small-ish networks trained on simple tasks requires caution.

Behavioural statistics for a maze-solving agent

peligrietzer and TurnTrout

20 Apr 2023 22:26 UTC

46 points

11 comments · 10 min read · LW link

peligrietzer 5 Apr 2023 23:38 UTC
4 points
−3
on: Maze-solving agents: Add a top-right vector, make the agent go to the top-right
I would again suggest a ‘perceptual’ hypothesis regarding the subtraction/addition asymmetry. We’re either adding a representation of a path where there was no representation of a path (which creates the illusion of a path), or removing a representation of a path where there was no representation of a path (which does nothing).

peligrietzer 2 Apr 2023 2:59 UTC
1 point
0
in reply to: redbird’s comment on: peligrietzer’s Shortform
No, but I hope to have a chance to try something like it this year!

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

TurnTrout, peligrietzer and lisathiergart

31 Mar 2023 19:20 UTC

101 points

17 comments · 11 min read · LW link

peligrietzer 24 Mar 2023 5:31 UTC
1 point
0
in reply to: CatGoddess’s comment on: Understanding and controlling a maze-solving policy network
The main reason is that different channels that each code cheese locations (e.g. channel 42, channel 88) seem to initiate computations that each encourage cheese-pursuit under slightly different conditions. We can think of each of these channels as a perceptual gate to a slightly different conditionally cheese-pursuing computation.

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

11 Mar 2023 18:59 UTC

313 points

23 comments · 23 min read · LW link

Predictions for shard theory mechanistic interpretability results

TurnTrout, Ulisse Mini and peligrietzer

1 Mar 2023 5:16 UTC

105 points

10 comments · 5 min read · LW link

[Simulators seminar sequence] #2 Semiotic physics—revamped

Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer and remember

27 Feb 2023 0:25 UTC

23 points

23 comments · 13 min read · LW link

[Simulators seminar sequence] #1 Background & shared assumptions

Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer and remember

2 Jan 2023 23:48 UTC

49 points

4 comments · 3 min read · LW link

peligrietzer 5 Dec 2022 21:48 UTC
4 points
2
on: Alignment allows “nonrobust” decision-influences and doesn’t require robust grading
Having a go at extracting some mechanistic claims from this post:
- A value x is a policy-circuit, and this policy circuit may sometimes respond to a situation by constructing a plan-grader and a plan-search.
- The policy-circuit executing value x is trained to construct <plan-grader, plan-search> pairs that are ‘good’ according to the value x, and this excludes pairs that are predictably going to result in the plan-search Goodharting the plan-grader.
- Normally, nothing is trying to argmax value x’s goodness criterion for <plan-grader, plan-search> pairs. Value x’s goodness criterion for <plan-grader, plan-search> pairs is normally just implicit in x’s method for constructing <plan-grader, plan-search> pairs.
- Value x may sometimes explicitly search over <plan-grader, plan-search> pairs in order to find pairs that score high according to a grader-proxy to value x’s goodness criterion. However, here too value x’s goodness criterion will be implicitly expressed in the policy-execution level as a disposition to construct a pair <grader-proxy to value x’s goodness criterion, search over pairs> that doesn’t Goodhart the grader-proxy to value x’s goodness criterion.
- The crucial thing is that the true, top level ‘value x’s goodness criterion’ is a property of an actor, not a critic.

peligrietzer 1 Dec 2022 0:51 UTC
4 points
1
on: peligrietzer’s Shortform
Here is a shard-theory intuition about humans, followed by an idea for an ML experiment that could proof-of-concept its application to RL:

Let’s say I’m a guy who cares a lot about studying math well, studies math every evening, and doesn’t know much about drugs and their effects. Somebody hands me some ketamine and recommends that I take ketamine this evening. I take the ketamine before I sit down to study math, and math study goes terribly intellectually, but since I am on ketamine I’m having a good time, and credit gets assigned to the ‘taking ketamine before I sit down to study math’ computation. So my policy network gets updated to increase the probability of the computation ‘take ketamine before I sit down to study math.’
HOWEVER my world-model also gets updated, acquiring the new knowledge ‘taking ketamine before I sit down to study math makes math-study go terribly intellectually.’ And if I have a strong enough ‘math study’ value shard, then in light of this new knowledge the ‘math study’ value shard is going to forbid taking ketamine before I sit down to study math. So my ‘take ketamine before sitting down to study math’ exploration resulted in me developing an overall disposition against taking ketamine before sitting down to study math, even though the computation ‘take ketamine before sitting down to study math’ was directly reinforced! (Because the same act of exploration also resulted in a world-model update that associated the computation ‘take ketamine before sitting down to study math’ with implications that an already-powerful shard opposes.)
This is important, I think, because it shows that an agent can explore relatively freely without being super vulnerable to value-drift, and that you don’t necessarily need complicated reflective reasoning to have (at least very basic) anti-value-drift mechanisms. Since reinforcement is a pretty gradual thing, you can often try an action you don’t know much about, and if it turns out that this action has high reward but also direct implications that your already existing powerful shards oppose, then the weak shard formed by that single reinforcement pass will be powerless.

Now the ML experiment idea:

A game where the agent gets rewarded for (e.g.) jumping high. After the agent gets somewhat trained, we continue training but introduce various ‘powerups’ the agent can pick up that increase or decrease the agent’s jumping capacity. We train a little more, and now we introduce (e.g.) green potions that decrease the agent’s jumping capacity but increase the reward multiplier (net positive for expected reward).

My weak hypothesis is that even though trying green potions gets a reinforcement event, the agent will avoid green potions after trying them. This is because there’d be a strong ‘avoid things that decrease jumping capacity’ shard already in place that will take charge once the agent learns to associate taking green potions with decrease in jumping capacity. (Though maybe it’s more complicated: maybe there will be a kind of race between ‘taking green potions’ getting reinforced and the association between taking green potions and decrease in jumping capacity forming and activating the ‘avoid things that decrease jumping capacity’ shard.)

Another interesting question: what will happen if we introduce (e.g.) red potions that increase the agent’s jumping capacity but decrease the reward multiplier (net negative for expected reward)? It seems clear that as the agent takes red potions over and over the reinforcement process will eventually remove the disposition to take red potions, but would this also start to push the agent towards forming some kind of mental representation of ‘reward’ to model what’s going on? If we introduce red potions first, then do some training, and then introduce green potions, would the experience with red potions make the agent respond differently (perhaps more like a reward maximiser) to trying green potions?
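For concreteness, the potion mechanics might be sketched as below. This covers only the reward structure, not the agent or the training run. The class name `JumpGame`, the helper `reward_after`, the specific factors (0.8 and 1.5 for green, 1.2 and 0.5 for red), and the assumption that reward equals multiplier times jump height with the agent always jumping at full capacity are all illustrative choices, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JumpGame:
    """Toy state for the proposed experiment; all numbers are illustrative assumptions."""
    jump_capacity: float = 1.0
    reward_multiplier: float = 1.0

    def drink(self, potion: str) -> None:
        if potion == "green":
            # Lower jumping capacity, higher multiplier: net positive for reward.
            self.jump_capacity *= 0.8
            self.reward_multiplier *= 1.5
        elif potion == "red":
            # Higher jumping capacity, lower multiplier: net negative for reward.
            self.jump_capacity *= 1.2
            self.reward_multiplier *= 0.5
        else:
            raise ValueError(f"unknown potion: {potion}")

    def jump_reward(self) -> float:
        # Assumes the agent jumps as high as its current capacity allows.
        return self.reward_multiplier * self.jump_capacity


def reward_after(potion: Optional[str]) -> float:
    """Reward for one full-height jump after optionally drinking one potion."""
    game = JumpGame()
    if potion is not None:
        game.drink(potion)
    return game.jump_reward()


baseline, green, red = reward_after(None), reward_after("green"), reward_after("red")
print(baseline, green, red)
```

With these assumed factors, a green potion raises per-jump reward while lowering jumping capacity, and a red potion does the reverse, which is the tension the hypothesis is about: whether an existing ‘avoid things that decrease jumping capacity’ shard outweighs the direct reinforcement.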

peligrietzer’s Shortform

peligrietzer 1 Dec 2022 0:51 UTC

2 points

4 comments · 1 min read · LW link