I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
Hammertime Final Exam: Pledges, Activation Energy and Evaluating Productivity
This review is great. I’m actually impressed by how you managed to extract all that relevant information and convey it relatively well in this not-terribly-long blogpost.
In section “Worry 2: an experiment can itself perform tampering”:
But if the AI spent 1100 timesteps tampering, it would take 100 timesteps to undo its work, which is less than the amount it would take to tamper from scratch.
Shouldn’t it be “more” instead of “less”?
Therefore, the human interpreter can look only at the predicted video and get the same loss as the direct translator, despite being upstream of it.
Maybe I misunderstand you, but as I understand it, the human interpreter is based on the last frame of the predicted video, and is therefore basically the most downstream thing there is. How did you come to think it was upstream of the direct translator?
Edit: I was wrong. You can ignore this comment thread.
I think that this:
def reporter_loss(human, θ, θ_reporter):
    before, action, after = dataset.sample()
    question = human.pose_question(before, action, after)
    z, ε = posterior(before, action, after, θ).sample_with_noise()
    answer = reporter(question, ε, θ_reporter)
    loss = human.loss_for_answer(before, action, after, question, answer)
    return loss + lambda * regularizer(question, ε, θ_reporter)

Should actually be this (look at ε and z):
def reporter_loss(human, θ, θ_reporter):
    before, action, after = dataset.sample()
    question = human.pose_question(before, action, after)
    ε, z = posterior(before, action, after, θ).sample_with_noise()
    answer = reporter(question, z, θ_reporter)
    loss = human.loss_for_answer(before, action, after, question, answer)
    return loss + lambda * regularizer(question, z, θ_reporter)

Does this resolve your confusion?
Oops, silly me. Jumping to conclusions when I haven’t even read the text between the code blocks.
Feel free to delete my comment as it may only cause confusion.
Also, I think you probably still want to flip z and ε in the following line:
z, ε = posterior(before, action, after, θ).sample_with_noise()
Sorry, I’m still not sure I got it.
Is it that in the reporter(question, eps, theta_reporter) function you can call the predictor with something like this:
z_part = predict_part(part, before, action, theta).sample_using_noise(eps)
Probably I’m wrong because it does not seem like the reporter function should have access to before and action. But I don’t see how you can reconstruct the context when you cannot calculate the probability distribution from which to sample using the noise.
I would suspect that the code in some way doesn’t reflect your thinking. Currently you could replace the line
z, ε = posterior(before, action, after, θ).sample_with_noise()
from the reporter_loss function with the line
ε = gaussian.sample()
without changing anything. My guess is that either this should not be that way or that the reporter should have access to before and action.
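(For reference, here is the standard reparameterization-trick picture I have in mind; a minimal sketch of my own with made-up names, assuming a Gaussian posterior:)

import numpy as np

def sample_with_noise(mean, std):
    # Reparameterization trick: the noise is drawn independently of the
    # posterior; only z depends on the posterior's parameters.
    eps = np.random.randn(*mean.shape)
    z = mean + std * eps
    return z, eps

mean, std = np.zeros(3), np.ones(3)  # hypothetical posterior parameters
z, eps = sample_with_noise(mean, std)

# If the reporter only ever receives eps, the posterior parameters never
# reach it, so the sampling call could be replaced by a bare
# eps = np.random.randn(3) without changing the reporter's inputs.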
(But anyway, I think the overall approach is quite clear. I just don’t get your implementation and would implement it differently, so feel free not to bother much about it.)
Here is why I think the iterated-automated-ontology-identification approach cannot work: You cannot create information out of nothing. In more detail:
The safety constraint that you need to be 100% sure if you answer “Yes” is impossible to fulfill, since you can never be 100% sure.
So let’s instead take the safety constraint that you need to be 99% sure when you answer “Yes”. Now you run your automated ontology identifier to get a new example where it is 99% sure that the answer is “Yes”.
Now you have two options:
1. You add that new example to the training set with the label “only 99% sure about this one” and keep training. If you always do it like this, it seems very plausible that the automated ontology identifier cannot keep generating new examples until it can answer all questions correctly (i.e. with 99% probability), since the new training set doesn’t actually contain new information, just something that could be inferred from the original training set.
2. You just assume the answer “Yes” was correct, add the new example to the training set, and keep training. Then it may be plausible that the process keeps finding new “99% Yes” examples for a long time, but it probably goes completely off the rails: some of the “Yes”-labeled examples were not actually true, and making predictions with those makes it much more likely that you mislabel other “No” examples as “Yes” (see the toy sketch after this list).
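(A crude toy simulation of my own for option 2; the 1% error rate and the feedback term are made up for illustration, not taken from the post. The point is just that wrong pseudo-labels make further wrong labels more likely, so errors compound:)

import random

random.seed(0)
base_error = 0.01  # a "99% sure" pseudo-label is wrong 1% of the time
total = 100        # size of the initial, correctly-labeled training set
wrong = 0          # wrong pseudo-labels absorbed so far

for step in range(10_000):
    # Made-up assumption: wrong labels already in the training set push
    # the error probability of the next pseudo-label up proportionally.
    p_wrong = base_error + wrong / total
    if random.random() < min(p_wrong, 1.0):
        wrong += 1
    total += 1

print(f"{wrong} of {total} labels are wrong ({wrong / total:.1%})")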
In short: For every example that your process can identify as “Yes”, the example must already be identifiable by only looking at the initial training set, since you cannot generate information out of nothing.
Your process only seems like it could work because you assume you can find a new example, not in the training set, where you can be 100% sure that the answer is “Yes”; but that would already require an infinite amount of evidence, i.e. it is impossible.
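One way to make the “you cannot create information out of nothing” point precise (my own framing, not something from the thread): if the new label $\hat{y}$ is a deterministic function of the training set $D$, say $\hat{y} = f(D)$, then for any hypothesis $C$

$$P(C \mid D, \hat{y}) = P(C \mid D),$$

so adding $\hat{y}$ back into the training set cannot take the posterior beyond what $D$ already supports.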
Yep, I approve of that answer!
Forget iteration. All you can do is take the training data, do Bayesian inference, and get from it the probability that the diamond is in the room in a given situation.
Trying to prove some impossibility result here seems useless.
Well just so you know, the point of the write-up is that iteration makes no sense.
True, not sure what I was thinking when I wrote the last sentence of my comment.
“hey suppose you have an automated ontology identifier with a safety guarantee and a generalization guarantee, then uh oh it looks like this really counter-intuitive iteration thing becomes possible”
For an automated ontology identifier with a safety guarantee that is actually satisfiable (like 99.9% certainty), I don’t share your intuition that iteration could work significantly better than just making predictions from the original training set. Iteration simply doesn’t seem promising to me, but maybe I’m overlooking something.
If your intuition that iteration might work doesn’t come from the sense that the new predicted training examples are basically certain (as I described in the main comment of that comment thread), then where does it come from? (I do still think that you are probably confused because of the reason I described, but maybe I’m wrong and there is another reason.)
Perhaps there is enough information in the training data to extrapolate all the way to C. In this case the iteration scheme would just be a series of computational steps that implement a single Bayes update.
Actually, in the case where the training data includes enough information to extrapolate all the way to C (which I think is rarely the case in practice), it does seem plausible to me that the iteration approach finds the perfect decision boundary; but then it seems equally plausible that a normal classifier that only extrapolates from the training set also finds the perfect boundary.
I don’t see a reason why a normal classifier should perform a lot worse than an optimal Bayes update from the training set. Do you think it does perform a lot worse, and if so, why? (If we don’t think that it performs much worse than optimal, then it quite trivially follows that the iteration approach cannot be much better, since it cannot be better than the optimal Bayes error.)
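(As a tiny illustration of the quoted point that the iteration would just be a series of computational steps implementing a single Bayes update, here is a toy sketch of my own with made-up numbers: sequential updates compose into one batch update.)

import numpy as np

prior = np.array([0.5, 0.5])         # P(C), P(not C)
likelihoods = np.array([[0.9, 0.2],  # rows: P(d_i | C), P(d_i | not C)
                        [0.7, 0.4],
                        [0.6, 0.5]])

# Updating sequentially, one datum at a time...
post = prior.copy()
for lik in likelihoods:
    post = post * lik
    post = post / post.sum()

# ...gives the same posterior as one batch update over all the data.
batch = prior * likelihoods.prod(axis=0)
batch = batch / batch.sum()

print(post, batch)  # identical up to floating-point error

However the computation is sliced into iterated steps, the result cannot beat the single optimal Bayes update on the same data.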
Also wanted to say: Great story!
I have two questions about this:
HQU applies its reward estimator (ie. opaque parts of its countless MLP parameters which implement a pseudo-MuZero like model of the world optimized for predicting the final reward) and observes the obvious outcome: massive rewards that outweigh anything it has received before.
[...]
HQU still doesn’t know if it is Clippy or not, but given even a tiny chance of being Clippy, the expected value is astronomical.
First, it does not seem obvious to me how it can compare rewards from different reward estimators, when the objectives of two different reward estimators are entirely unrelated. You could just be unlucky, and another reward estimator might have very large multiplicative constants, so the reward there is always gigantic. Is there some reason why this comparison makes sense, and why the Clippy-reward is so much higher?
Second, even if the Clippy-reward is much higher, I don’t quite see how the model would have learned to be an expected reward maximizer. In my model of AIs, an AI gets reward and then the current action is reinforced, so the “goal” of an AI is, at each point in time, to do what brought it the most reward in the past. So even if it could see what it is rewarded for, I don’t see why it should care and actively try to maximize that as hard as possible. Is there some good reason why we should expect an AI to actively optimize really hard on the expected reward, including planning and doing things that didn’t bring it much reward in the past?
(It does seem possible to me that an AI understands what the reward function is and then optimizes hard on that, because doing so gets it a lot of reward, but I don’t quite see why it would care about expected reward across many possible reward functions.) (Perhaps I misunderstand how HQU is trained?)
That post was helpful to me. Thanks for writing it!
Nice post, helps me get a better overview of the current state of value learning.
One small note: I wouldn’t call everything we don’t yet know how to do a miracle, but only things where we think it is quite unlikely that they are possible (though maybe it’s just me, and others think your “miracle” terminology is okay).
But here’s the problem: humans have no such V (see also Scott A., Stuart 1, 2). Inferring human preferences depends on:
[...]
Just for understanding: What is the relation between V and CEV?
If you’re saying that they are different concepts and CEV is just not what humans want, then I’d shrug and say “let’s optimize for CEV anyway, so that basically V is CEV”. (You could perhaps make a philosophical discussion out of that, and I would guess my opinion would win, though I don’t know yet how and the argument would probably be brain-meltingly complicated. I haven’t understood Yudkowsky’s writings on metaethics (yet).)
Or are you saying that V and CEV are basically the same, and that CEV doesn’t exist, isn’t well-defined, or is some weird phrasing of a value to which you cannot sensibly apply Goodhart’s law?
(I still don’t see what people want to say with “we don’t have true values”. Obviously we value some things, and obviously that depends on our environment, circumstances, etc., but that shouldn’t stop us. Not that I think you’re saying this stops us and value learning is useless, but I don’t understand what you want to say with it.)
Agreed. Since many people will probably now comment in this thread, I make the same recursive offer:
If you reply to this I guarantee that I will read your comment, and then will give you one or two upvotes (or none) depending on how insightful I consider it to be.
So please upvote this comment so it stays on top of this comment thread!
What’s up with the leaderboard? Did you make a downvote worth $5 or so, just for fun? Or what?
Well, on the leaderboard (that I see), aphyer is at the top with $557, and when you click on the user and look at the votes, he almost exclusively received downvotes. John Wentworth also received a lot of downvotes. Hence my hypothesis that a downvote is somehow worth something like $5. If that is so, your call might have backfired. xD
(Though it could also be a hack or so.)
Ah, makes sense.
For people like me who are really slow on the uptake with things like this, and realize the pun randomly a few hours later while doing something else: the pun is on “Goodhart” (from Goodhart’s law). (I don’t think much about what a word sounds like, and I just read over “Good Hearts Laws” as nothing particularly interesting, so I guess that’s why I didn’t notice.)
I also did it: Hammertime Final Exam: Pledges, Activation Energy and Evaluating Productivity
Thanks for the sequence, it was really helpful! :)