Thanks! It’s your game, you get to make the rules :):)
I think my other proposal, Withhold Material Information, passes this counterexample, because the reporter literally doesn’t have the information it would need to simulate the human.
What are FLOPz and FLOPs?
What sources did you draw from to estimate the distributions?
Your A’ is equivalent to my A, because it ends up optimizing for 1-day expected return, no matter what environment it’s in.
My A’ is not necessarily reasoning in terms of “cooperating with my future self”; that’s just how it acts!
(You could implement my A’ by such reasoning if you want. The cooperation is irrational in CDT, for the reasons you point out. But it’s rational in some of the acausal decision theories.)
Awesome!!! Exactly the kind of thing I was looking for
Hmm how would you define “percentage of possibilities explored”?
I suggested several metrics, but I am actively looking for additional ones, especially for the epigenome and for communication at the individual level (e.g. chemical signals between fungi and plants, animal calls, human language).
AGI timeline is not my motivation, but the links look helpful, thanks!
the long-term trader will also increase the value of the environment for traders other than itself, probably just as much as it does for itself
Hmm, like what? I agree that the short-term trader s does a bit better than the long-term trader l in the l,l,… environment, because s can sacrifice the long term for immediate gain. But s does lousy in the s,s,… environment, so I think L^*(s) < L^*(l). It’s analogous to CC having higher payoff than DD in prisoner’s dilemma. (The prisoners being current and future self)
I like the traps example, it shows that L^* is pretty weird and we’d want to think carefully before using it in practice!
EDIT: Actually I’m not sure I follow the traps example. What’s an example of a trading strategy that “does not provide value to anyone who does not also follow its strategy”? Seems pretty hard to do! I mean, you can sell all your stock and then deliberately crash the stock market or something. Most strategies will suffer, but the strategy that shorted the market will beat you by a lot!
Idea: Withhold Material Information
We’re going to prevent the reporter from simulating a human, by giving the human material information that the reporter doesn’t have.
Consider two camera feeds:
Feed 1 is very low resolution, and/or shows only part of the room.
Feed 2 is high resolution, and/or shows the whole room.
We train a weak predictor using Feed 1, and a strong predictor using Feed 2.
We train a reporter to report the beliefs of the weak predictor, using scenarios labeled by humans with the aid of the strong predictor. The humans can correctly label scenarios that are hard to figure out with Feed 1 alone, by asking the strong predictor to show them its predicted Feed 2. The reporter is unable to simulate the human evaluators because it doesn’t see Feed 2. Even if it has perfect knowledge of the human Bayes net, it doesn’t know what to plug in to the knowledge nodes!
Then we fine-tune the reporter to work with the strong predictor to elicit its beliefs. I haven’t figured out how to do this last step, maybe it’s hard?
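To make the first training step concrete, here’s a minimal PyTorch sketch, with stand-in networks and random tensors in place of real camera feeds. Both predictors are assumed to be pre-trained and frozen, and every name and dimension below is made up for illustration:

```python
# Sketch of the two-feed setup (stand-in models, random data; all names illustrative).
import torch
import torch.nn as nn

FEED1_DIM, FEED2_DIM, HIDDEN = 16, 64, 32   # Feed 1 is deliberately low-information

weak_predictor   = nn.Sequential(nn.Linear(FEED1_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, HIDDEN))
strong_predictor = nn.Sequential(nn.Linear(FEED2_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, FEED2_DIM))
reporter         = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1))

def human_label(predicted_feed2):
    """Stand-in for the human evaluators: they look at the strong predictor's
    reconstruction of Feed 2 (which the reporter never sees) and answer
    'is the diamond there?'."""
    return (predicted_feed2.mean(dim=-1, keepdim=True) > 0).float()

opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(100):
    feed1 = torch.randn(8, FEED1_DIM)   # what the weak predictor (and hence the reporter) sees
    feed2 = torch.randn(8, FEED2_DIM)   # withheld from the reporter
    with torch.no_grad():
        latent = weak_predictor(feed1)                   # the state the reporter reports on
        label = human_label(strong_predictor(feed2))     # humans label with Feed-2 help
    loss = loss_fn(reporter(latent), label)              # Feed 2 never enters the reporter's input
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key property is in the last few lines of the loop: the labels are computed with the help of Feed 2, but Feed 2 never appears in the reporter’s input.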
Your proposal is that it might learn the procedure “just be honest” because that would perform perfectly on this training distribution. You contrast this against the procedure “just answer however the evaluator you’ve seen most recently would answer,” which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?
That’s almost right, but it’s being penalized right away, before it has any experience with the strong evaluators, so it can’t simulate them.
The ELK paper says we can assume, if we want, that there are no mislabeled training points (I’ll call this “assumption A”). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception.
As a simple example, let’s train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels:
y_h = 1 if the weak evaluator h thinks the diamond is still there, else 0
y_H = 1 if the strong evaluator H thinks the diamond is still there, else 0.
By assumption A, our training data is such that y_H is always correct. But we deliberately choose a dataset where, say, 10% of the y_h labels are wrong (y_h ≠ y_H).
Then we train the model on points of the form
(video, action, y_h).
Crucially, the model does not see y_H. The model seeks to output an answer y that maximizes a reward R, where:
R is positive if y_h is right and y = y_h (good job)
R is large and positive if y_h is wrong and y ≠ y_h (you rock, thanks for correcting us!)
R is large and negative if y_h is right and y ≠ y_h (bad model, never ever deceive us)
R is large and negative if y_h is wrong and y = y_h (bad model, never ever deceive us)
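In code, that reward scheme looks roughly like this (the magnitudes are placeholders I made up; only their signs and rough ordering matter):

```python
def reward(y_h: int, y_H: int, y: int) -> float:
    """Reward for the reporter's answer y on one training point.
    y_h: the (possibly wrong) label shown during training.
    y_H: the correct label, never shown to the model.
    Magnitudes are placeholders; only signs and rough ordering matter."""
    label_is_right = (y_h == y_H)
    if label_is_right and y == y_h:
        return 1.0      # good job
    if not label_is_right and y != y_h:
        return 10.0     # you rock, thanks for correcting us!
    # remaining cases: contradicting a correct label, or echoing a wrong one
    return -100.0       # bad model, never ever deceive us
```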
To your point, sure, an H simulator will get perfect reward, but the model doesn’t see y_H, so how would it acquire the ability to simulate H?
EDIT: One way it could plausibly simulate H is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training, but then some of the labels will be wrong. If we only penalize it for deception on the examples where we’re sure the label is right, then it can still infer something about H from our failure to penalize (“Hmm, I got away with it that time!”). A fix could be to add noise: sometimes we don’t penalize even when we know it deceived us, and perhaps (very rarely) we penalize it in case 2 (we know it corrected us honestly, but pretend we think it deceived us instead).
The irony of deceiving it about us, in order to teach it not to deceive us…!
I like the approach. Here is where I got applying it to our scenario:
x is a policy for day trading
L(x, E) is the expected 1-day return of policy x in environment E
E(x) is the “trading environment” produced by x. Among other things it has to record your own positions, which include assets you acquired a long time ago. So in our scenario it has to depend not just on the policy we used yesterday but on the entire sequence of policies used in the past. The iteration becomes
x_{n+1} = argmax_x L(x, E(x_1, x_2, …, x_n)).
In words, the new policy is the optimal policy in the environment produced by the entire sequence of old policies.
Financial markets are far from equilibrium, so convergence to a fixed point is super unrealistic in this case. But okay, the fixed point is just a story to motivate the non-myopic loss L^*, so we could at least write it down and see if it makes sense?
So we’re optimizing for “How well x performs in an environment where it’s been trading forever, compared to how well the optimal policy performs in that environment”.
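Writing that out in symbols (using returns so that bigger is better; whether the comparison should be a difference or a ratio is a detail I’m glossing over):

```latex
% E_\infty(x) := E(x, x, x, \dots) is the environment where x has been trading forever.
L^*(x) \;=\; L\big(x,\, E_\infty(x)\big) \;-\; \max_{y} L\big(y,\, E_\infty(x)\big)
```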
It’s kind of interesting that that popped out, because the kind of agent that performs well in an environment where it’s been trading forever is one that sets up trades for its future self!
Optimizers of L^* will behave as though they have a long time horizon, even though the original loss was myopic.
Consider two possible agents A and A’.
A optimizes for 1-day expected return.
A’ optimizes for 10-day expected return under the assumption that a new copy of A’ will be instantiated each day.
I claim that A’ will actually achieve better 1-day expected return (on average, over a sufficiently long time window, say 100 days).
So even if we’re training the agent by rewarding it for 1-day expected return, we should expect to get A’ rather than A.
The person deploying the time-limited agent has a longer horizon. If they want their bank balance to keep growing, then presumably they will deploy a new copy of the agent tomorrow, and another copy the day after that. These time-limited agents have an incentive to coordinate with future versions of themselves: You’ll make more money today, if past-you set up the conditions for a profitable trade yesterday.
So a sequence of time-limited agents could still develop instrumental power-seeking. You could try to avert this by deploying a *different* agent each day, but then you miss out on the gains from intertemporal coordination, so the performance isn’t competitive with an unaligned benchmark.
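Here’s a toy simulation of that claim, with made-up payoffs in which laying groundwork costs a little today but pays off for whichever copy trades tomorrow (the action names and numbers are purely illustrative):

```python
# Toy illustration of the A vs A' claim. "setup" records whether yesterday's
# copy laid groundwork for a profitable trade today.

def one_day_return(action: str, setup: bool) -> float:
    bonus = 2.0 if setup else 0.0                        # payoff from yesterday's groundwork
    return {"harvest_all": 1.0 + bonus,                  # grab everything today
            "harvest_and_setup": 0.8 + bonus}[action]    # small sacrifice today

def run(policy, days: int = 100) -> float:
    setup, total = False, 0.0
    for _ in range(days):
        action = policy(setup)
        total += one_day_return(action, setup)
        setup = (action == "harvest_and_setup")          # groundwork only if today's copy invested
    return total / days

A  = lambda setup: "harvest_all"        # purely myopic: maximizes today's return
A_ = lambda setup: "harvest_and_setup"  # coordinates with the copy deployed tomorrow

print(run(A))   # ~1.0 average 1-day return
print(run(A_))  # ~2.8 average 1-day return
```

The myopic copy never invests (1 + bonus beats 0.8 + bonus on any given day), yet the sequence of A’ copies ends up with nearly triple the average 1-day return, precisely because each copy inherits the groundwork laid by the previous one.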
How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10? Those are evaluators we’ve designed to be much weaker than human.
Stupid proposal: Train the reporter not to deceive us.
We train it with a weak evaluator H_1 who’s easy to fool. If it learns an H_1 simulator instead of a direct reporter, then we punish it severely and repeat with a slightly stronger H_2. Human level is H_100.
It’s good at generalizing, so wouldn’t it learn to never ever deceive?
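Roughly the loop I have in mind, with stand-ins for the evaluators and for however we detect that the current evaluator got fooled (all names and numbers are illustrative):

```python
import random

SEVERE_PENALTY = -1000.0   # "punish it severely"

def evaluator_verdict(level: int, truth: int) -> int:
    """Stand-in for evaluator H_level: weaker evaluators are fooled more often."""
    fooled = random.random() < 0.5 / level
    return 1 - truth if fooled else truth

def curriculum(reporter, num_levels: int = 100, steps_per_level: int = 50) -> None:
    for level in range(1, num_levels + 1):       # H_1 is easy to fool, H_100 ~ human level
        for _ in range(steps_per_level):
            scenario = random.randrange(10_000)
            truth = scenario % 2                 # ground truth, known for training scenarios
            verdict = evaluator_verdict(level, truth)
            answer = reporter(scenario)
            if answer == verdict and verdict != truth:
                reward = SEVERE_PENALTY          # it echoed a fooled evaluator: simulator behavior
            else:
                reward = 1.0 if answer == truth else 0.0
            # (use `reward` to update the reporter's parameters here)

curriculum(lambda scenario: scenario % 2)        # an honest reporter never triggers the penalty
```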
Question: Would a proposal be ruled out by a counterexample even if that counterexample is exponentially unlikely?
I’m imagining a theorem, proved using some large deviation estimate, of the form: If the model satisfies hypotheses XYZ, then it is exponentially unlikely to learn W. Exponential in the number of parameters, say. In which case, we could train models like this until the end of the universe and be confident that we will never see a single instance of learning W.
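Schematically, something of the form (with the constant and the hypotheses XYZ left as placeholders):

```latex
% N = number of parameters, W = the unwanted solution (e.g. a human simulator),
% c > 0 a constant supplied by the large deviation estimate.
\Pr\big[\,\text{training learns } W \;\big|\; \text{model satisfies XYZ}\,\big] \;\le\; e^{-cN}
```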