I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
Hammertime Final Exam: Pledges, Activation Energy and Evaluating Productivity
This review is great. I’m actually impressed by how you managed to extract all that relevant information and convey it relatively well in this not-terribly-long blogpost.
In section “Worry 2: an experiment can itself perform tampering”:
But if the AI spent 1100 timesteps tampering, it would take 100 timesteps to undo its work, which is less than the amount it would take to tamper from scratch.
Shouldn’t it be “more” instead of “less”?
Therefore, the human interpreter can look only at the predicted video and get the same loss as the direct translator, despite being upstream of it.
Maybe I misunderstand you, but as I understand it, the human interpreter is based on the last frame of the predicted video, and is therefore basically the most downstream thing there is. How did you come to think it was upstream of the direct translator?
Edit: I was wrong. You can ignore this comment thread.
I think that this:
def reporter_loss(human, θ, θ_reporter):
    before, action, after = dataset.sample()
    question = human.pose_question(before, action, after)
    z, ε = posterior(before, action, after, θ).sample_with_noise()
    answer = reporter(question, ε, θ_reporter)
    loss = human.loss_for_answer(before, action, after, question, answer)
    return loss + lambda * regularizer(question, ε, θ_reporter)

Should actually be this (look at ε and z):
def reporter_loss(human, θ, θ_reporter):
    before, action, after = dataset.sample()
    question = human.pose_question(before, action, after)
    ε, z = posterior(before, action, after, θ).sample_with_noise()
    answer = reporter(question, z, θ_reporter)
    loss = human.loss_for_answer(before, action, after, question, answer)
    return loss + lambda * regularizer(question, z, θ_reporter)

Does this resolve your confusion?
Oops, silly me. Jumping to conclusions when I haven’t even read the text between the code blocks.
Feel free to delete my comment as it may only cause confusion.
Also, I think you probably still want to flip z and ε in the following line:
z, ε = posterior(before, action, after, θ).sample_with_noise()
Sorry, I’m still not sure I got it.
Is it that in the reporter(question, eps, theta_reporter) function you can call the predictor with something like this:
z_part = predict_part(part, before, action, theta).sample_using_noise(eps)
Probably I’m wrong because it does not seem like the reporter function should have access to before and action. But I don’t see how you can reconstruct the context when you cannot calculate the probability distribution from which to sample using the noise.
I would suspect that the code in some way doesn’t reflect your thinking. Currently you could replace the line
z, ε = posterior(before, action, after, θ).sample_with_noise()
from the reporter_loss function with the line
ε = gaussian.sample()
without changing anything. My guess is that either this should not be that way or that the reporter should have access to before and action.
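(For reference, here is the standard reparameterization-trick picture I have in mind; a minimal sketch of my own with made-up names, assuming a Gaussian posterior:)

import numpy as np

def sample_with_noise(mean, std):
    # Reparameterization trick: the noise is drawn independently of the
    # posterior; only z depends on the posterior's parameters.
    eps = np.random.randn(*mean.shape)
    z = mean + std * eps
    return z, eps

mean, std = np.zeros(3), np.ones(3)  # hypothetical posterior parameters
z, eps = sample_with_noise(mean, std)

# If the reporter only ever receives eps, the posterior parameters never
# reach it, so the sampling call could be replaced by a bare
# eps = np.random.randn(3) without changing the reporter's inputs.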
(But anyway, I think the overall approach is quite clear. I just don’t get your implementation and would implement it differently, so feel free not to bother much about it.)
Here is why I think the iterated-automated-ontology-identification approach cannot work: You cannot create information out of nothing. In more detail:
The safety constraint that you need to be 100% sure if you answer “Yes” is impossible to fulfill, since you can never be 100% sure.
So let’s instead take the safety constraint that you need to be 99% sure when you answer “Yes”. Now you run your automated ontology identifier to get a new example where it is 99% sure that the answer is “Yes”.
Now you have two options:
1. You add that new example to the training set with the label “only 99% sure about this one” and keep training. If you always do it like this, it seems very plausible that the automated ontology identifier cannot keep generating new examples until it can answer all questions correctly (i.e. with 99% probability), since the new training set doesn’t actually contain new information, just something that could be inferred from the original training set.
2. You just assume the answer “Yes” was correct, add the new example to the training set, and keep training. Then it may be plausible that the process keeps finding new “99% Yes” examples for a long time, but it probably goes completely off the rails: some of the “Yes”-labeled examples were not actually true, and making predictions with those makes it much more likely that you mislabel other “No” examples as “Yes” (see the toy sketch after this list).
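(A crude toy simulation of my own for option 2; the 1% error rate and the feedback term are made up for illustration, not taken from the post. The point is just that wrong pseudo-labels make further wrong labels more likely, so errors compound:)

import random

random.seed(0)
base_error = 0.01  # a "99% sure" pseudo-label is wrong 1% of the time
total = 100        # size of the initial, correctly-labeled training set
wrong = 0          # wrong pseudo-labels absorbed so far

for step in range(10_000):
    # Made-up assumption: wrong labels already in the training set push
    # the error probability of the next pseudo-label up proportionally.
    p_wrong = base_error + wrong / total
    if random.random() < min(p_wrong, 1.0):
        wrong += 1
    total += 1

print(f"{wrong} of {total} labels are wrong ({wrong / total:.1%})")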
In short: For every example that your process can identify as “Yes”, the example must already be identifiable by only looking at the initial training set, since you cannot generate information out of nothing.
Your process only seems like it could work because you assume you can find a new example, not in the training set, where you can be 100% sure that the answer is “Yes”; but that would already require an infinite amount of evidence, i.e. it is impossible.
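One way to make the “you cannot create information out of nothing” point precise (my own framing, not something from the thread): if the new label $\hat{y}$ is a deterministic function of the training set $D$, say $\hat{y} = f(D)$, then for any hypothesis $C$

$$P(C \mid D, \hat{y}) = P(C \mid D),$$

so adding $\hat{y}$ back into the training set cannot take the posterior beyond what $D$ already supports.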
Yep, I approve of that answer!
Forget iteration. All you can do is take the training data, do Bayesian inference, and get from it the probability that the diamond is in the room in a given situation.
Trying to prove some impossibility result here seems useless.
Well just so you know, the point of the write-up is that iteration makes no sense.
True, not sure what I was thinking when I wrote the last sentence of my comment.
“hey suppose you have an automated ontology identifier with a safety guarantee and a generalization guarantee, then uh oh it looks like this really counter-intuitive iteration thing becomes possible”
For an automated ontology identifier with a safety guarantee that is actually satisfiable (like 99.9% certainty), I don’t share your intuition that iteration could work significantly better than just making predictions from the original training set. Iteration simply doesn’t seem promising to me, but maybe I’m overlooking something.
If your intuition that iteration might work doesn’t come from the sense that the new predicted training examples are basically certain (as I described in the main comment of that comment thread), then where does it come from? (I do still think that you are probably confused because of the reason I described, but maybe I’m wrong and there is another reason.)
Perhaps there is enough information in the training data to extrapolate all the way to C. In this case the iteration scheme would just be a series of computational steps that implement a single Bayes update.
Actually, in the case where the training data includes enough information to extrapolate all the way to C (which I think is rarely the case in practice), it does seem plausible to me that the iteration approach finds the perfect decision boundary; but then it seems equally plausible that a normal classifier that only extrapolates from the training set also finds the perfect boundary.
I don’t see a reason why a normal classifier should perform a lot worse than an optimal Bayes update from the training set. Do you think it does perform a lot worse, and if so, why? (If we don’t think that it performs much worse than optimal, then it quite trivially follows that the iteration approach cannot be much better, since it cannot be better than the optimal Bayes error.)
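(As a tiny illustration of the quoted point that the iteration would just be a series of computational steps implementing a single Bayes update, here is a toy sketch of my own with made-up numbers: sequential updates compose into one batch update.)

import numpy as np

prior = np.array([0.5, 0.5])         # P(C), P(not C)
likelihoods = np.array([[0.9, 0.2],  # rows: P(d_i | C), P(d_i | not C)
                        [0.7, 0.4],
                        [0.6, 0.5]])

# Updating sequentially, one datum at a time...
post = prior.copy()
for lik in likelihoods:
    post = post * lik
    post = post / post.sum()

# ...gives the same posterior as one batch update over all the data.
batch = prior * likelihoods.prod(axis=0)
batch = batch / batch.sum()

print(post, batch)  # identical up to floating-point error

However the computation is sliced into iterated steps, the result cannot beat the single optimal Bayes update on the same data.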
Also wanted to say: Great story!
I have two questions about this:
HQU applies its reward estimator (ie. opaque parts of its countless MLP parameters which implement a pseudo-MuZero like model of the world optimized for predicting the final reward) and observes the obvious outcome: massive rewards that outweigh anything it has received before.
[...]
HQU still doesn’t know if it is Clippy or not, but given even a tiny chance of being Clippy, the expected value is astronomical.
First, it does not seem obvious to me how it can compare rewards from different reward estimators, when the objectives of two different reward estimators are entirely unrelated. You could just be unlucky, and another reward estimator might have very large multiplicative constants, so the reward there is always gigantic. Is there some reason why this comparison makes sense, and why the Clippy-reward is so much higher?
Second, even if the Clippy-reward is much higher, I don’t quite see how the model would have learned to be an expected reward maximizer. In my model of AIs, an AI gets reward and then the current action is reinforced, so the “goal” of an AI is, at each point in time, to do what brought it the most reward in the past. So even if it could see what it is rewarded for, I don’t see why it should care and actively try to maximize that as hard as possible. Is there some good reason why we should expect an AI to actively optimize really hard on the expected reward, including planning and doing things that didn’t bring it much reward in the past?
(It does seem possible to me that an AI understands what the reward function is and then optimizes hard on that, because doing so gets it a lot of reward, but I don’t quite see why it would care about expected reward across many possible reward functions.) (Perhaps I misunderstand how HQU is trained?)
That post was helpful to me. Thanks for writing it!
Nice post, helps me get a better overview of the current state of value learning.
One small note: I wouldn’t call everything we don’t yet know how to do a miracle, but only things where we think it is quite unlikely that they are possible (though maybe it’s just me, and others think your “miracle” terminology is okay).
But here’s the problem: humans have no such V (see also Scott A., Stuart 1, 2). Inferring human preferences depends on:
[...]
Just for understanding: What is the relation between V and CEV?
If you’re saying that they are different concepts and CEV is just not what humans want, then I’d shrug and say “let’s optimize for CEV anyway, so that basically V is CEV”. (You could perhaps make a philosophical discussion out of that, and I would guess my opinion would win, though I don’t know yet how and the argument would probably be brain-meltingly complicated. I haven’t understood Yudkowsky’s writings on metaethics (yet).)
Or are you saying that V and CEV are basically the same, and that CEV doesn’t exist, isn’t well-defined, or is some weird phrasing of a value to which you cannot sensibly apply Goodhart’s law?
(I still don’t see what people want to say with “we don’t have true values”. Obviously we value some things, and obviously that depends on our environment, circumstances, etc., but that shouldn’t stop us. Not that I think you’re saying this stops us and value learning is useless, but I don’t understand what you want to say with it.)
Agreed. Since many people will probably now comment in this thread, I make the same recursive offer:
If you reply to this I guarantee that I will read your comment, and then will give you one or two upvotes (or none) depending on how insightful I consider it to be.
So please upvote this comment so it stays on top of this comment thread!
What’s up with the leaderboard? Did you make a downvote worth $5 or so, just for fun? Or what?
Well, on the leaderboard (that I see), aphyer is at the top with $557, and when you click on the user and look at the votes, he almost exclusively received downvotes. John Wentworth also received a lot of downvotes. Hence my hypothesis that a downvote is somehow worth something like $5. If that is so, your call might have backfired. xD
(Though it could also be a hack or so.)
Ah, makes sense.
For people like me who are really slow on the uptake with things like this, and realize the pun randomly a few hours later while doing something else: the pun is on “Goodhart” (from Goodhart’s law). (I don’t think much about what a word sounds like, and I just read over “Good Hearts Laws” as nothing particularly interesting, so I guess that’s why I didn’t notice.)
I also did it: Hammertime Final Exam: Pledges, Activation Energy and Evaluating Productivity
Thanks for the sequence, it was really helpful! :)