mattmacdermott

Karma: 128
• We’re already comparing to the default outcome in that we’re asking “what fraction of the default expected utility minus the worst comes from outcomes at least this good?”.

I think you’re proposing to replace “the worst” with “the default”, in which case we end up dividing by zero.

We could pick some new reference point other than the worst, but different from the default expected utility. (But that does introduce the possibility of negative OP, and it still has sensitivity issues.)
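To make the division-by-zero point concrete, here's a toy numeric sketch. The outcome utilities and default distribution are invented for illustration; the measure in question divides by (default expected utility minus a reference utility):

```python
# Toy illustration of the reference-point issue (numbers are invented).
outcomes = [0.0, 1.0, 2.0]   # utilities of the possible outcomes
default = [0.5, 0.3, 0.2]    # default distribution over those outcomes

eu_default = sum(p * u for p, u in zip(default, outcomes))  # 0.7
u_worst = min(outcomes)

denom_with_worst = eu_default - u_worst       # 0.7: a usable denominator
denom_with_default = eu_default - eu_default  # 0.0: division by zero
```

Replacing "the worst" with "the default" as the reference makes the denominator identically zero, whatever the distribution.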

Some Summaries of Agent Foundations Work

15 May 2023 16:09 UTC
46 points
• Nice, I’d read the first but didn’t realise there were more. I’ll digest later.

I think agents vs optimisation is definitely reality-carving, but I'm not sure I see the point about utility functions and preference orderings. I assume the idea is that an optimisation process just moves the world towards certain states, but an agent tries to move the world towards certain states, i.e. chooses actions based on how much they move the world towards certain states, so it makes sense to quantify how much of a weighting each state gets in its decision-making. But it's not obvious to me that there's not a meaningful way to assign weightings to states for an optimisation process too—for example, if a ball rolling down a hill gets stuck in the large hole twice as often as it gets stuck in the medium hole and ten times as often as the small hole, maybe it makes sense to quantify this with something like a utility function. Although defining a utility function based on the typical behaviour of the system and then trying to measure its optimisation power against it gets a bit circular.

Anyway, the dynamical systems approach seems good. Have you stopped working on it?
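The ball-in-holes idea can be sketched numerically. This is my own toy construction, not a proposal from the post: take the observed stopping frequencies (the 2:1 and 10:1 ratios above) and read off a log-frequency "revealed utility", so that differences between states are measured in bits:

```python
import math

# Stopping frequencies for the rolling ball, from the ratios in the comment:
# large : medium : small = 10 : 5 : 1.
freq = {"large": 10 / 16, "medium": 5 / 16, "small": 1 / 16}

# A "revealed utility" as log-frequency: differences between states are in bits.
utility = {hole: math.log2(f) for hole, f in freq.items()}

# The large hole exceeds the medium by exactly 1 bit, and the small by log2(10).
```

This is exactly the circularity worry: the "utility" is defined from typical behaviour, so measuring optimisation power against it just recovers the frequencies you started with.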

• Probably the easy utility function makes agent 1 have more optimisation power. I agree this means comparisons between different utility functions can be unfair, but not sure why that rules out a measure which is invariant under positive affine transformations of a particular utility function?

• Hm, I’m not sure this problem comes up.

Say I’ve built a room-tidying robot, and I want to measure its optimisation power. The room can be in two states: tidy or untidy. A natural choice of default distribution is my beliefs about how tidy the room will be if I don’t put the robot in it. Let’s assume I’m pretty knowledgeable and I’m extremely confident that in that case the room will be untidy: P(untidy) = 1 − 2^−11 and P(tidy) = 2^−11 (we do have to avoid probabilities of 0, but that’s standard in a Bayesian context). But really I do put the robot in and it gets the room tidy, for an optimisation power of −log2(2^−11) = 11 bits.

That 11 bits doesn’t come from any uncertainty on my part about the optimisation process, although it does depend on my uncertainty about what would happen in the counterfactual world where I don’t put the robot in the room. But becoming more confident that the room would be untidy in that world makes me see the robot as more of an optimiser.

Unlike in information theory, these bits aren’t measuring a resolution of uncertainty, but a difference between the world and a counterfactual.
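The arithmetic of the example, as I understand the measure (optimisation power = −log2 of the default probability of an outcome at least this good; the exact credence is chosen to match the 11 bits quoted above):

```python
import math

# Default credence that the room ends up tidy *without* the robot.
# (Chosen to match the 11 bits in the example above.)
p_tidy_by_default = 2 ** -11

# Optimisation power: -log2 of the default probability of doing at least
# this well. The robot achieves "tidy", so:
op_bits = -math.log2(p_tidy_by_default)  # 11.0
```

Note that p_tidy_by_default is a credence about a counterfactual world, not about the robot's mechanism, which is the point being made: sharpening that counterfactual credence changes the measured optimisation power.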

Towards Measures of Optimisation

12 May 2023 15:29 UTC
41 points
• An interesting point about the agency-as-retargetable-optimisation idea is that it seems like you can make the perturbation in various places upstream of the agent’s decision-making, but not downstream, i.e. you can retarget an agent by perturbing its sensors more easily than its actuators.

For example, to change a thermostat-controlled heating system to optimise for a higher temperature, the most natural perturbation might be to turn the temperature dial up, but you could also tamper with its thermistor so that it reports lower temperatures. On the other hand, making its heating element more powerful wouldn’t affect the final temperature.

I wonder if this suggests that an agent’s goal lives in the last place in a causal chain of things you can perturb to change the set of target states of the system.
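A toy simulation of the thermostat point (my own construction, with made-up dynamics): biasing the sensor shifts the equilibrium temperature, while strengthening the heating element leaves the equilibrium essentially unchanged.

```python
# Bang-bang thermostat with passive cooling (dynamics invented for illustration).
def equilibrium(setpoint, sensor_bias=0.0, heater_power=1.0, steps=20000):
    temp = 0.0
    for _ in range(steps):
        reading = temp + sensor_bias      # what the thermistor reports
        if reading < setpoint:
            temp += 0.1 * heater_power    # heater on
        temp -= 0.001 * temp              # slow passive heat loss
    return temp

# A thermistor that reads 5 degrees low makes the system settle about
# 5 degrees above the setpoint: a sensor perturbation retargets the system.
# Doubling heater_power barely moves the equilibrium at all.
```

The upstream/downstream asymmetry shows up directly: the actuator parameter changes how fast the target is approached, not which state is targeted.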

• 9 Feb 2023 6:23 UTC

Nice, thanks. It seems like the distinction the authors make between ‘building agents from the ground up’ and ‘understanding their behaviour and predicting roughly what they will do’ maps to the distinction I’m making, but I’m not convinced by the claim that the second one is a much stronger version of the first.

The argument in the paper is that the first requires an understanding of just one agent, while the second requires an understanding of all agents. But it seems like they require different kinds of understanding, especially if the agent being built is meant to be some theoretical ideal of rationality. Building a perfect chess algorithm is just a different task to summarising the way an arbitrary algorithm plays chess (which you could attempt without even knowing the rules).

• 1. In our universe, as opposed to the “current basic theory of AI” universe.

2. From Arbital:

A Cartesian agent setup is one where the agent receives sensory information from the environment, and the agent sends motor outputs to the environment, and nothing else can cross the “Cartesian border” separating the agent and environment. If you can eat a psychedelic mushroom that affects the way you process the world—not just presenting you with sensory information, but altering the computations you do to think—then this is an example of an event that “violates the Cartesian boundary”. Likewise if the agent drops an anvil on its own head. Nothing that happens in a Cartesian universe can kill a Cartesian agent or modify its processing; all the universe can do is send the agent sensory information, in a particular format, that the agent reads.

3. For embedded agency. In the old frame agents aren’t really made of anything.

Normative vs Descriptive Models of Agency

2 Feb 2023 20:28 UTC
26 points
• Thanks. Is there a particular source whose notation yours most aligns with?

• When you write I understand that to mean that for all . But when I look up definitions of conditional probability it seems that that notation would usually mean for all

Am I confused or are you just using non-standard notation?

• Fair enough, but in that example making irreversible decisions is unavoidable. What if we consider a modified tree such that one and only one branch is traversable in both directions, and utility can be anywhere?

I expect we get that the reversible branch is the most popular across the distribution of utility functions (but not necessarily that most utility functions prefer it). That sounds like cause for optimism—‘optimal policies tend to avoid irreversible changes’.
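Here's a toy check of that intuition (my own construction, not from the paper): a three-state deterministic MDP where one first move leads to a state you can return from and the other is absorbing. Sampling state-utilities iid uniform, the reversible branch is the optimal first move for a clear majority of sampled utility functions, but not for all of them:

```python
import random

random.seed(0)
GAMMA = 0.9
# States: 0 = start, 1 = reversible branch (can return to start), 2 = trap.
# T[state][action] gives the deterministic next state.
T = {0: [1, 2], 1: [0, 1], 2: [2, 2]}

def optimal_first_action(r):
    """Return 0 if the reversible branch is optimal from the start state."""
    V = [0.0, 0.0, 0.0]
    for _ in range(200):  # value iteration to convergence
        V = [max(r[T[s][a]] + GAMMA * V[T[s][a]] for a in (0, 1))
             for s in (0, 1, 2)]
    q = [r[T[0][a]] + GAMMA * V[T[0][a]] for a in (0, 1)]
    return 0 if q[0] >= q[1] else 1

trials = 2000
reversible_wins = sum(
    optimal_first_action([random.random() for _ in range(3)]) == 0
    for _ in range(trials)
)
# reversible_wins is a clear majority of trials, but well short of all of them.
```

This matches the expectation above: the reversible branch is the most popular first move across the distribution of utility functions, even though plenty of individual utility functions (roughly those whose mass sits on the trap state) prefer the irreversible one.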

• I’ve been thinking about whether these results could be interpreted pretty differently under different branding.

The current framing, if I understand it correctly, is something like, ‘Powerseeking is not desirable. We can prove that keeping your options open tends to be optimal and tends to meet a plausible definition of powerseeking. Therefore we should expect RL agents to seek power, which is bad.’

An alternative framing would be, ‘Making irreversible changes is not desirable. We can prove that keeping your options open tends to be optimal. Therefore we should not expect RL agents to make irreversible changes, which is good.’

I don’t think that the second framing is better than the first, but I do think that if you had run with it instead then lots of people would be nodding their heads and feeling reassured about corrigibility, instead of feeling like their views about instrumental convergence had been confirmed. That makes me feel like we shouldn’t update our views too much based on formal results that leave so much room for interpretation. If I showed a bunch of theorems about MDPs, with no exposition, to two people with different opinions about alignment, I expect they might come to pretty different conclusions about what they meant.

What do you think?

(To be clear I think this is a great post and paper, I just worry that there are pitfalls when it comes to interpretation.)