Ok, whew, glad to hear.
Nope! I have basically always enjoyed talking with you, even when we disagree.
I still really like my framework here! I think this post ended up popularizing some of the ontology I developed here, but the unfortunate thing about that post as the one that popularized this is that it doesn’t really provide an alternative.
Warning for anyone who has ever interacted with “robosucka” or been solicited for a new podcast series in the past few years: https://www.tumblr.com/rationalists-out-of-context/744970106867744768/heads-up-to-anyone-whos-spoken-to-this-person-i
They still make a lot less than they would if they optimized for profit (that said, I think most “safety researchers” at big labs are only safety researchers in name and I don’t think anyone would philanthropically pay for their labor, and even if they did, they would still make the world worse according to my model, though others of course disagree with this).
I don’t think it applies to safety researchers at AI Labs though, I am shocked how much those folks can make.
I think people who give up large amounts of salary to work in jobs that other people are willing to pay for from an impact perspective should totally consider themselves to have done good comparable to donating the difference between their market salary and their actual salary. This applies to approximately all safety researchers.
I agree that there’s no substitute for thinking about this for yourself, but I think that morally or socially counting “spending thousands of dollars on yourself, an AI researcher” as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it’s-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it’s very easy for donating to people or organizations in your social circle to have substantial negative expected value.
I’m glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.
Whether a PhD is something someone will enjoy is so dependent on individual personality, advisor fit, etc that I don’t feel I can offer good generalized advice. Generally I’d suggest people trying to gauge fit try doing some research in an academic environment (e.g. undergrad/MS thesis, or a brief RA stint after graduating) and talk to PhD students in their target schools. If after that you think you wouldn’t enjoy a PhD then you’re probably right!
Personally I enjoyed my PhD. I had smart & interesting colleagues, an advisor who wanted me to do high-quality research (not just publish), I had almost-complete control over how I spent my time, could explore areas I found interesting & important in depth. The compensation is low but with excellent job security and I had some savings so I lived comfortably. Unless I take a sabbatical I will probably never again have the time to go as deep into a research area so in a lot of ways I really cherish my PhD time.
I think a lot of the negatives of PhDs really feel like negatives of becoming a research lead in general. Trying to create something new with limited feedback loops is tough, and can be psychologically draining if you tie your self-worth with your work output (don’t do this! but easier said than done for the kind of person attracted to these careers). Research taste will take up many years of your life to develop—as will most complex skills. etc.
I’m sympathetic to a lot of this critique. I agree that prospective students should strive to find an advisor that is “good at producing clear, honest and high-quality research while acting in high-integrity ways around their colleagues”. There are enough of these you should be able to find one, and it doesn’t seem worth compromising.
Concretely, I’d definitely recommend digging into an advisor’s research and asking their students hard questions prior to taking any particular PhD offer. There absolutely are labs that prioritize publishing above all else, turn a blind eye to academic fraud or at least brush accidental non-replicability under the rug, or just have a toxic culture. You want to avoid those at all costs.
But I disagree with the punchline that if this bar isn’t satisfied then “almost any other job will be better preparation for a research career”. In particular, I think there’s a ton of concrete skills a PhD teaches that don’t need a stellar advisor. For example, there’s some remarkably simple things like having an experimental baseline, running multiple seeds and reporting confidence intervals that a PhD will absolutely drill into you. These things are remarkably often missing from research produced by those I see in the AI safety ecosystem who have not done a PhD or been closely mentored by an experienced researcher.
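To make the "baseline, multiple seeds, confidence intervals" habit concrete, here is a minimal sketch. `run_experiment` is a hypothetical stand-in for a real training run, and the score means and noise level inside it are made-up numbers for illustration only. The interval uses a normal approximation, which is rough for ten seeds; a t-based interval would be somewhat wider.

```python
import random
import statistics

def run_experiment(method: str, seed: int) -> float:
    """Hypothetical stand-in for a training run; returns a noisy score."""
    # Seed on (method, seed) so the two methods do not share identical noise.
    rng = random.Random(f"{method}-{seed}")
    assumed_mean = {"baseline": 0.70, "new_method": 0.72}[method]  # made-up values
    return assumed_mean + rng.gauss(0.0, 0.02)

def mean_and_ci(scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return the mean and the half-width of a normal-approximation 95% interval."""
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # standard error of the mean
    return mean, z * sem

seeds = range(10)
results = {
    method: [run_experiment(method, seed) for seed in seeds]
    for method in ("baseline", "new_method")
}
for method, scores in results.items():
    mean, half_width = mean_and_ci(scores)
    print(f"{method}: {mean:.3f} +/- {half_width:.3f} (n={len(scores)})")
```

If the two intervals overlap heavily, the honest report is "no clear difference at this number of seeds", not a win for the new method.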
Additionally, I’ve seen plenty of people do PhDs under an advisor who lacks one or more of these properties and most of them turned out to be fine researchers. Hard to say what the counterfactual is, the admission process to the PhD might be doing a lot of work here, but I think it’s important to recognize the advisor is only one of many sources of mentorship and support you get in a PhD: you also have taught classes, your lab mates, your extended cohort, senior post-docs, peer review, etc. To be clear, none of these mentorship sources are perfect, but part of your job as a student is to decide who to listen to & when. If someone can’t do that then they’ll probably not get very far as a researcher no matter what environment they’re in.
I found myself coming back to this now, years later, and feeling like it is massively underrated. Idk, it seems like the concept of training stories is great and much better than e.g. “we have to solve inner alignment and also outer alignment” or “we just have to make sure it isn’t scheming.”
Anyone—and in particular Evhub—have updated views on this post with the benefit of hindsight? Should we e.g. try to get model cards to include training stories?
There is another argument that could be made for working on other modalities now: there could be insights which generalize across modalities, but which are easier to discover when working on some modalities vs. others.
I’ve actually been thinking, for a while now, that people should do more image model interpretability for this sort of reason. I never got around to posting this opinion, but FWIW it is the main reason I’m personally excited by the sort of work reported here. (I have mostly been thinking about generative or autoencoding image models here, rather than classifiers, but the OP says they’re building toward that.)
Why would we expect there to be transferable insights that are easier to discover in visual domains than textual domains? I have two thoughts in mind:
First thought:
The tradeoff curve between “model does something impressive/useful that we want to understand” and “model is conveniently small/simple/etc.” looks more appealing in the image domain.
Most obviously: if you pick a generative image model and an LLM which do “comparably impressive” things in their respective domains, the image model is going to be way smaller (cf.). So there are, in a very literal way, fewer things we have to interpret—and a smaller gap between the smallest toy models we can make and the impressive models which are our holy grails.
Like, Stable Diffusion is definitely not a toy model, and does lots of humanlike things very well. Yet it’s pretty tiny by LLM standards. Moreover, the SD autoencoder is really tiny, and yet it would be a huge deal if we could come to understand it pretty well.
Beyond mere parameter count, image models have another advantage, which is the relative ease of constructing non-toy input data for which we know the optimal output. For example, this is true of:
Image autoencoders (for obvious reasons).
“Coordinate-based MLP” models (like NeRFs) that encode specific objects/scenes in their weights. We can construct arbitrarily complex objects/scenes using 3D modeling software, train neural nets on renders of them, and easily check the ground-truth output for any input by just inspecting our 3D model at the input coordinates.
By contrast, in language modeling and classification, we really have no idea what the optimal logits are. So we are limited to making coarse qualitative judgments of logit effects (“it makes this token more likely, which makes sense”), ignoring the important fine-grained quantitative stuff that the model is doing.
None of that is intrinsically about the image domain, I suppose; for instance, one can make text autoencoders too (and people do). But in the image domain, these nice properties come for free with some of the “real” / impressive models we ultimately want to interpret. We don’t have to compromise on the realism/relevance of the models we choose for ease of interpretation; sometimes the realistic/relevant models are already convenient for interpretability, as a happy accident. The capabilities people just make them that way, for their own reasons.
The hope, I guess, is that if we came pretty close to “fully understanding” one of these more convenient models, we’d learn a lot of stuff along the way about how to interpret models in general, and that would transfer back to the language domain. Stuff like “we don’t know what the logits should be” would no longer be a blocker to making progress on other fronts, even if we do eventually have to surmount that challenge to interpret LLMs. (If we had a much better understanding of everything else, a challenge like that might be more tractable in isolation.)
Second thought:
I have a hunch that the apparent intuitive transparency of language (and tasks expressed in language) might be holding back LLM interpretability.
If we force ourselves to do interpretability in a domain which doesn’t have so much pre-existing taxonomical/terminological baggage—a domain where we no longer feel it’s intuitively clear what the “right” concepts are, or even what any breakdown into concepts could look like—we may learn useful lessons about how to make sense of LLMs when they aren’t “merely” breaking language and the world down into conceptual blocks we find familiar and immediately legible.
When I say that “apparent intuitive transparency” affects LLM interpretability work, I’m thinking of choices like:
In circuit work, researchers select a familiar concept from a pre-existing human “map” of language / the world, and then try to find a circuit for it.
For example, we ask “what’s the circuit for indirect object identification?”, not “what’s the circuit for frobnoloid identification?”—where “a frobnoloid” is some hypothetical type-of-thing we don’t have a standard term for, but which LMs identify because it’s useful for language modeling.
(To be clear, this is not a critique of the IOI work, I’m just talking about a limit to how far this kind of work can go in the long view.)
In SAE work, researchers try to identify “interpretable features.”
It’s not clear to me what exactly we mean by “interpretable” here, but being part of a pre-existing “map” (as above) seems to be a large part of the idea.
“Frobnoloid”-type features that have recognizable patterns, but are weird and unfamiliar, are “less interpretable” under prevailing use of the term, I think.
In both of these lines of work, there’s a temptation to try to parse out the LLM computation into operations on parts we already have names for—and, in cases where this doesn’t work, to chalk it up either to our methods failing, or to the LLM doing something “bizarre” or “inhuman” or “heuristic / unsystematic.”
But I expect that much of what LLMs do will not be parseable in this way. I expect that the edge that LLMs have over pre-DL AI is not just about more accurate extractors for familiar, “interpretable” features; it’s about inventing a decomposition of language/reality into features that is richer, better than anything humans have come up with. Such a decomposition will contain lots of valuable-but-unfamiliar “frobnoloid”-type stuff, and we’ll have to cope with it.
To loop back to images: relative to text, with images we have very little in the way of pre-conceived ideas about how the domain should be broken down conceptually.
Like, what even is an “interpretable image feature”?
Maybe this question has some obvious answers when we’re talking about image classifiers, where we expect features related to the (familiar-by-design) class taxonomy—cf. the “floppy ear detectors” and so forth in the original Circuits work.
But once we move to generative / autoencoding / etc. models, we have a relative dearth of pre-conceived concepts. Insofar as these models are doing tasks that humans also do, they are doing tasks which humans have not extensively “theorized” and parsed into concept taxonomies, unlike language and math/code and so on. Some of this conceptual work has been done by visual artists, or photographers, or lighting experts, or scientists who study the visual system … but those separate expert vocabularies don’t live on any single familiar map, and I expect that they cover relatively little of the full territory.
When I prompt a generative image model, and inspect the results, I become immediately aware of a large gap between the amount of structure I recognize and the amount of structure I have names for. I find myself wanting to say, over and over, “ooh, it knows how to do that, and that!”—while knowing that, if someone were to ask, I would not be able to spell out what I mean by each of these “that”s.
Maybe I am just showing my own ignorance of art, and optics, and so forth, here; maybe a person with the right background would look at the “features” I notice in these images, and find them as familiar and easy to name as the standout interpretable features from a recent LM SAE. But I doubt that’s the whole of the story. I think image tasks really do involve a larger fraction of nameless-but-useful, frobnoloid-style concepts. And the sooner we learn how to deal with those concepts—as represented and used within NNs—the better.
Reposting myself from discord, on the topic of donating 5000$ to EA causes.
if you’re doing alignment research, even just a bit, then the 5000$ are probly better spent on yourself
if you have any gears level model of AI stuff then it’s better value to pick which alignment org to give to yourself; charity orgs are vastly understaffed and you’re essentially contributing to the “picking what to donate to” effort by thinking about it yourself
if you have no gears level model of AI then it’s hard to judge which alignment orgs it’s helpful to donate to (or, if giving to regranters, which regranters are good at knowing which alignment orgs to donate to)
as an example of regranters doing massive harm: openphil gave 30M$ to openai at a time where it was critically useful to them, (supposedly in order to have a chair on their board, and look how that turned out when the board tried to yeet altman)
i know of at least one person who was working in regranting and was like “you know what i’d be better off doing alignment research directly” — imo this kind of decision is probly why regranting is so understaffed
it takes technical knowledge to know what should get money, and once you have technical knowledge you realize how much your technical knowledge could help more directly so you do that, or something
How often do people not do PhDs on the basis that they don’t teach you to be a good researcher? Perhaps this is different in certain circles, but almost everyone I know doesn’t want to do a PhD for personal reasons (and also timelines).
The most common objections are the following:
PhDs are very depressing and not very well paid.
Advisors do not have strong incentives to put much effort into training you and apparently often won’t. This is pretty demotivating.
A thing you seem to be advocating for is PhDs primarily at top programs. These are very competitive, it is hard to make progress towards getting into a better program once you graduate, and there is a large opportunity cost to devoting my entire undergraduate degree to doing enough research to be admitted.
PhDs take up many years of your life. Life is short.
It is very common for PhD students (not just in alignment) to tell other people not to do a PhD. This is very concerning.
If I was an impact-maximizer I might do a PhD, but as a person who is fairly committed to not being depressed, it seems obvious that I should probably not do a PhD and look for alternative routes to becoming a research lead instead.
I’d be interested to hear whether you disagree with these points (you seem to like your PhD!), or whether this post was just meant to address the claim that it doesn’t train you to be a good researcher.
Taking IID samples can be hard actually. Suppose you train an LLM on news articles. And each important real world event has 10 basically identical news articles written about it. Then a random split of the articles will leave the network being tested mostly on the same newsworthy events that were in the training data.
This leaves it passing the test, even if it’s hopeless at predicting new events and can only generate new articles about the same events.
When data duplication is extensive, making a meaningful train/test split is hard.
If the data was perfect copy and paste duplicated, that could be filtered out. But often things are rephrased a bit.
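One hedged sketch of a fix: split by event rather than by article, so every rewrite of the same story lands on the same side. This assumes each article already carries an `event_id` label, which is hypothetical here and is exactly the hard part when articles are merely rephrased; the toy corpus is invented for illustration.

```python
import random
from collections import defaultdict

def group_split(articles, test_fraction=0.2, seed=0):
    """Split so that all articles about one event land on the same side."""
    by_event = defaultdict(list)
    for article in articles:
        by_event[article["event_id"]].append(article)

    events = sorted(by_event)                 # sort for a reproducible shuffle
    random.Random(seed).shuffle(events)
    n_test = max(1, round(len(events) * test_fraction))
    test_events = set(events[:n_test])

    train = [a for e in events if e not in test_events for a in by_event[e]]
    test = [a for e in test_events for a in by_event[e]]
    return train, test

# Toy corpus (made up): 5 events, 10 near-identical articles each.
articles = [
    {"event_id": event, "text": f"report {i} on event {event}"}
    for event in range(5)
    for i in range(10)
]
train, test = group_split(articles)
train_events = {a["event_id"] for a in train}
test_events = {a["event_id"] for a in test}
print(len(train), len(test), train_events & test_events)  # prints: 40 10 set()
```

A random per-article split of the same corpus would almost surely put every event on both sides, which is the leak described above.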
But if you think TAI is coming within 10 years (for example, if you think that the current half-life on worlds surviving is 10 years; if you think 10 years is the amount of time in which half of worlds are doomed)
Note that these are very different claims, both because the half-life for a given value is below its mean, and because TAI doesn’t imply doom. Even if you do have very high P(doom), it seems odd to just assume everyone else does too.
then depending on your distribution-over-time you should absolutely not wait 5 years before doing research, because TAI could happen in 9 years but it could also happen in 1 year
So? Your research doesn’t have to be useful in every possible world. If a PhD increases the quality of your research by, say, 3x (which is plausible, since research is heavy-tailed) then it may well be better to do that research for half the time.
(In general I don’t think x-risk-motivated people should do PhDs that don’t directly contribute to alignment, to be clear; I just think this isn’t a good argument for that conclusion.)
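A rough sketch of the arithmetic behind both points, under loudly stated assumptions: arrival time is exponentially distributed with a 10-year half-life, research done during the PhD counts for nothing, research continues at a constant rate until arrival, and the 3x multiplier is the "say, 3x" figure from above. None of these numbers are claims about the real world.

```python
import math

HALF_LIFE = 10.0     # years; median arrival time (assumption from the quoted example)
PHD_YEARS = 5.0      # years of study before research starts (assumption)
QUALITY_MULT = 3.0   # research-quality multiplier from the PhD (the "say, 3x" above)

rate = math.log(2) / HALF_LIFE
mean_arrival = 1.0 / rate            # mean of an exponential; larger than its half-life

def survival(t: float) -> float:
    """Probability that arrival has not happened by time t."""
    return math.exp(-rate * t)

# Expected research output, in units of "one year of pre-PhD-quality research".
start_now = 1.0 * mean_arrival
# Memorylessness: given survival past the PhD, expected remaining time is again the mean.
phd_first = QUALITY_MULT * survival(PHD_YEARS) * mean_arrival
break_even_mult = 1.0 / survival(PHD_YEARS)

print(f"half-life {HALF_LIFE:.1f}y vs mean {mean_arrival:.1f}y")
print(f"start now: {start_now:.1f}, PhD first: {phd_first:.1f}")
print(f"break-even quality multiplier: {break_even_mult:.2f}")
```

Under these assumptions the PhD pays off whenever it multiplies research quality by more than about 1.41; a distribution with more mass in the near term would push that threshold up.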
I don’t disagree… in this case you don’t get agents for a long time; someone else does though.
So this option looks unattractive if you think transformative AI systems are likely to be developed within the next 5 years. However, with a 10-year timeframe things look much stronger: you would still have around 5 years to contribute as a researcher.
This phrasing is tricky! If you think TAI is coming in approximately 10 years then sure, you can study for 5 years and then do research for 5 years.
But if you think TAI is coming within 10 years (for example, if you think that the current half-life on worlds surviving is 10 years; if you think 10 years is the amount of time in which half of worlds are doomed) then depending on your distribution-over-time you should absolutely not wait 5 years before doing research, because TAI could happen in 9 years but it could also happen in 1 year. If you think TAI is coming within 10 years, then (depending on your distribution) you should still in fact do research asap.
(People often get this wrong! They think that “TAI probably within X years” necessarily means “TAI in approximately X years”.)
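A small sketch of the difference between the two readings. "Approximately 10 years" is modeled as arrival at exactly year 10, and "within 10 years" as arrival uniformly spread over years 0 to 10; the uniform shape is purely an assumption for illustration, and your own distribution may look quite different.

```python
def expected_research_years(arrival_times, delay):
    """Average years of research done before arrival if research starts after `delay`."""
    return sum(max(t - delay, 0.0) for t in arrival_times) / len(arrival_times)

N = 100_000
# "TAI in approximately 10 years": all mass at year 10.
approximately_10 = [10.0] * N
# "TAI within 10 years": evenly spaced arrival times on (0, 10) (assumed uniform).
within_10 = [10.0 * (i + 0.5) / N for i in range(N)]

for name, times in (("approximately 10", approximately_10), ("within 10", within_10)):
    now = expected_research_years(times, 0.0)
    later = expected_research_years(times, 5.0)
    print(f"{name}: start now {now:.2f}y, study 5y first {later:.2f}y, kept {later / now:.0%}")
```

With these toy distributions, studying first keeps about half of the expected research time in the "approximately" case but only about a quarter in the "within" case, because in half of the "within" worlds TAI arrives before the studying is done.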
In favour of goal realism
Suppose you’re looking at an AI that is currently placed in a game of chess.
It has a variety of behaviours. It moves pawns forward in some circumstances. It takes a knight with a bishop in a different circumstance.
You could describe the actions of this AI by producing a giant table of “behaviours”. Bishop taking behaviours in this circumstance. Castling behaviour in that circumstance. …
But there is a more compact way to represent similar predictions. You can say it’s trying to win at chess.
The “trying to win at chess” model makes a bunch of predictions that the giant list of behaviour model doesn’t.
Suppose you have never seen it promote a pawn to a Knight before. (A highly distinct move that is only occasionally allowed and a good move in chess)
The list of behaviours model has no reason to suspect the AI also has a “promote pawn to knight” behaviour.
Put the AI in a circumstance where such promotion is a good move, and the “trying to win” model makes it a clear prediction.
Now it’s possible to construct a model that internally stores a huge list of behaviours. For example, a giant lookup table trained on an unphysically huge number of human chess games.
But neural networks have at least some tendency to pick up simple general patterns, as opposed to memorizing giant lists of data. And “do whichever move will win” is a simple and general pattern.
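Here is a toy sketch of the two kinds of model. Chess is too big for a short example, so this swaps in single-heap Nim (take 1 to 3 stones, whoever takes the last stone wins); the observed moves are hypothetical data. Both models agree on everything seen so far, but only the "trying to win" model predicts anything in a position that has never come up.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # a player may take 1, 2, or 3 stones

@lru_cache(maxsize=None)
def can_win(stones: int) -> bool:
    """True if the player to move can force a win with `stones` left."""
    return any(m <= stones and not can_win(stones - m) for m in MOVES)

def goal_model(stones: int):
    """Predict the move of an agent that is 'trying to win'."""
    for m in MOVES:
        if m <= stones and not can_win(stones - m):
            return m
    return None  # no winning move exists, so this model makes no sharp prediction

# Hypothetical (state -> move) pairs observed so far: the "giant table of behaviours".
observed = {5: 1, 6: 2, 7: 3}

def behaviour_table(stones: int):
    """Predict only what has been seen before."""
    return observed.get(stones)

# Both models fit the observations...
print(all(goal_model(s) == m for s, m in observed.items()))  # prints: True
# ...but only one says anything about a never-seen position.
print(behaviour_table(11), goal_model(11))                   # prints: None 3
```

The goal model is also much shorter to write down than a table covering every position, which is the compactness point above.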
Now on to making snarky remarks about the arguments in this post.
There is no true underlying goal that an AI has— rather, the AI simply learns a bunch of contextually-activated heuristics, and humans may or may not decide to interpret the AI as having a goal that compactly explains its behavior.
There is no true ontologically fundamental nuclear explosion. There is no minimum number of nuclei that need to fission to make an explosion. Instead there is merely a large number of highly energetic neutrons and fissioning uranium atoms, that humans may decide to interpret as an explosion or not as they see fit.
Nonfundamental descriptions of reality, while not being perfect everywhere, are often pretty spot on for a pretty wide variety of situations. If you want to break down the notion of goals into contextually activated heuristics, you need to understand how and why those heuristics might form a goal-like shape.
Should we actually expect SGD to produce AIs with a separate goal slot and goal-achieving engine?
Not really, no. As a matter of empirical fact, it is generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules. As Beren Millidge writes,
This is not the strong evidence that you seem to think it is. Any efficient mind design is going to have the capability of simulating potential futures at multiple different levels of resolution. A low res simulation to weed out obviously dumb plans before trying the higher res simulation. Those simulations are ideally going to want to share data with each other. (So you don’t need to recompute when faced with several similar dumb plans.) You want to be able to backpropagate your simulation. If a plan failed in simulation because of one tiny detail, that indicates you may be able to fix the plan by changing that detail. There are a whole pile of optimization tricks. An end to end trained network can, if it’s implementing goal directed behaviour, stumble into some of these tricks. At the very least, it can choose where to focus its compute. A module based system can’t use any optimization that humans didn’t design into its interfaces.
Also, evolution analogy. Evolution produced animals with simple hard coded behaviours long before it started getting to the more goal directed animals. This suggests simple hard coded behaviours in small dumb networks. And more goal directed behaviour in large networks. I mean this is kind of trivial. A 5 parameter network has no space for goal directedness. Simple dumb behaviour is the only possibility for toy models.
In general, full [separation between goal and goal-achieving engine] and the resulting full flexibility is expensive. It requires you to keep around and learn information (at maximum all information) that is not relevant for the current goal but could be relevant for some possible goal where there is an extremely wide space of all possible goals.
That is not how this works. That is not how any of this works.
Back to our chess AI. Let’s say it’s a robot playing on a physical board. It has lots of info on wood grain, which it promptly discards. It currently wants to play chess, and so has no interest in any of these other goals.
I mean it would be possible to design an agent that works as described here. You would need a probability distribution over new goals. A tradeoff rate between optimizing the current goal and any new goal that got put in the slot. Making sure it didn’t wirehead by giving itself a really easy goal would be tricky.
For AI risk arguments to hold water, we only need that the chess playing AI will pursue new and never seen before strategies for winning at chess. And that in general AIs doing various tasks will be able to invent highly effective and novel strategies. The exact “goal” they are pursuing may not be rigorously specified to 10 decimal places. The frog-AI might not know whether it wants to catch flies or black dots. But if it builds a dyson sphere to make more flies which are also black dots, it doesn’t matter to us which it “really wants”.
What are you expecting? An AI that says “I’m not really sure whether I want flies or black dots. I’ll just sit here not taking over the world and not get either of those things”?
I intuitively don’t like this approach, but I have trouble articulating exactly why. I’ve tried to explain a bit in this comment, but I don’t think I’m quite saying the right thing.
One issue I have is that it doesn’t seem to nicely handle interactions between the properties of the AI and how it’s used. You can have an AI which is safe when used in some ways, but not always. This could be due to approaches like control (which mostly route around mechanistic properties of the AI), but also potentially things like using monitoring ensembles to handle lack of robustness and paying AIs rather than aligning them.
Another problem I have is that this doesn’t very naturally incorporate various non-mechanistic analyses targeting specific threat models which IMO should be (and will be) very central. E.g., we built a wide variety of model organisms which are closely analogous to our training and deployment environment and which aim to uncover potential reward hacking failure modes and these model organisms didn’t demonstrate any issues. Same for things like adversarially testing for clear misalignment: it doesn’t result in a mechanistic model, but feels very central.
To be clear, I think all the things I discussed above can be discussed in this framework, but it feels quite unnatural and the decomposition doesn’t seem like it’s doing any work.
I think the type of mechanistic analysis proposed here seems quite aspirational with the current state of technology such that it feels odd to center it. Or the mechanistic analysis you do will apply to all training runs and no safety interventions will affect it such that it’s more like useful background than a key part of analyzing different safety measures. To be clear, we will want to do some mechanistic analysis and have some space of mechanistic hypotheses. But this feels more like the background threat model than the core safety case due to difficulties in testing. We can also somewhat test these mechanistic hypotheses with experiments that don’t require huge technological breakthroughs, but this seems more like an important sub-component of a safety case than the main thing.
Perhaps Evan thinks we’re totally screwed (or at least can’t obtain high confidence) without strong mechanistic analysis such that centering this is good. I think high confidence seems unclear and disagree with totally screwed. It’s possible that my views here partially come down to a difference of opinion with Evan where he thinks that deceptive alignment is very likely given usage of models capable of powerful goal-oriented behavior whereas I think this is uncertain. Further, I think it’s reasonably likely (perhaps 1/3) that I’ll end up being very confident that deceptive alignment is very unlikely at the point when we have powerful AIs (due to experiments and further conceptual reasoning).
More generally, I feel like the way I currently talk and think about safety cases and similar topics doesn’t seem to fit nicely into training stories. I think the way I currently do it is better, but I’m not entirely certain and I haven’t tried the training stories approach much.
I should also note that a general approach like training stories seems much better than a decomposition like “inner alignment” vs “outer alignment” which is supposing a particular approach to solving the problem. (I do think that “inner misalignment” vs “outer misalignment” is a reasonable decomposition of threat models in AIs produced with ML. But these are threat models, not problems to be solved and there are many routes to solving them. See here for more discussion.)
I think I prefer the default trajectory of safety cases and RSP more than what would happen with additional emphasis on training stories, but I’m uncertain.