I agree with this sentiment (“having lots of data is useful for deconfusion”) and think this is probably the most promising avenue for alignment research. In particular, I think we should prioritize the kinds of research that give us lots of bits about things that could matter. Though from my perspective, most empirical alignment work actually fails this check, so this isn’t just an “empiricism is good” take.
Since there are basically no alignment plans/directions that I think are very likely to succeed, and adding “of course, this will most likely not solve alignment and then we all die, but it’s still worth trying” to every sentence is low information and also actively bad for motivation, I’ve basically recalibrated my enthusiasm to be centered around “does this at least try to solve a substantial part of the real problem as I see it”. This is the most productive mindset for me to be in, but I’m slightly worried people might confuse it for me having a low P(doom), or being very confident in specific alignment directions, and so on; hence this post that I can point people to.
I think this may also be a useful emotional state for other people who have a similar P(doom) and feel very demotivated by it, which impacts their productivity.
I meant it as an analogy to https://en.m.wikipedia.org/wiki/Denormalization
I like this paper for crisply demonstrating an instance of poor generalization in LMs that is likely representative of a broader class of generalization properties of current LMs.
The existence of such limitations in current ML systems does not imply that ML is fundamentally not a viable path to AGI, or that timelines are long, or that AGI will necessarily also have these limitations. Rather, I find this kind of thing interesting because I believe that understanding limitations of current AI systems is very important for giving us threads to yank on that may help us with thinking about conceptual alignment. Some examples of what I mean:
It’s likely that our conception of the kinds of representations/ontology that current models have is deeply confused. For example, one might claim that current models have features for “truth” or “human happiness”, but it also seems entirely plausible that models instead have separate circuits and features entirely for “this text makes a claim that is incorrect” and “this text has the wrong answer selected”, or in the latter case for “this text has positive sentiment” and “this text describes a human experiencing happiness” and “this text describes actions that would cause a human to be happy if they were implemented”.
I think we’re probably pretty confused about mesaoptimization, in a way that’s very difficult to resolve just by thinking more about it (source: have spent a lot of time thinking about mesaoptimizers). I think this is especially salient to the people trying to make model organisms—which I think is a really exciting avenue—because if you try to make a mesaoptimizer, you immediately collide head-on with things like finding that the “training selects from the set of goals weighted by complexity” hypothesis doesn’t seem to accurately describe current model training. I think it’s appropriate to feel pretty confused about this and carefully examine the reasons why current models don’t exhibit these properties. It’s entirely reasonable for the answer to be “I expect future models to have thing X that current models don’t have”—then, you can try your best to test various X’s before having the future AIs that actually kill everyone.
There are some things that we expect AGI to do that current ML systems do not do. Partly this will be because in fact current ML systems are not analogous to future AGI in some ways—probably if you tell the AGI that A is B, it will also know that B is A. This does not necessarily have to be a property that gradually emerges and can be forecasted with a scaling law; it could emerge in a phase change, or be the result of some future algorithmic innovation. If you believe there is some property X of current ML that causes this failure, and that it will no longer be a failure in the future, then you should also be suspicious of any alignment proposal that depends on this property (and the dependence of the proposal on X may be experimentally testable). For instance, it is probably relatively easy to make an RL-trained NN policy extremely incoherent in a small subset of cases, because the network has denormalized contextual facts that are redundant across many situations (a toy illustration of this analogy follows at the end of this list). I expect this to probably be harder in models which have more unified representations for facts. To the extent I believe a given alignment technique works because it leverages this denormalization, I would be more skeptical of it working in the future.
As a counterpoint, it might also be that we had an inaccurate conception of which capabilities AGI will have that current ML systems lack—I think one important lesson of GPT-* has been that even with these failures, the resulting systems can still be surprisingly useful.
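To make the denormalization analogy concrete, here is the toy illustration mentioned above (my own sketch in Python, with a made-up fact and made-up contexts; it is an analogy about redundant storage, not a claim about how any real network represents things):

```python
# Purely illustrative: a hypothetical fact stored "normalized" vs "denormalized".
# Analogy about redundant storage only, not a claim about real model internals.

# Normalized: one canonical copy of the fact; every context reads from it.
canonical_facts = {("Paris", "capital_of"): "France"}

def answer_normalized(context: str) -> str:
    # All contexts route through the single stored copy, so they stay consistent.
    return canonical_facts[("Paris", "capital_of")]

# Denormalized: each context carries its own copy of the fact.
contextual_facts = {
    "quiz_question": {("Paris", "capital_of"): "France"},
    "translation":   {("Paris", "capital_of"): "France"},
    "trivia_night":  {("Paris", "capital_of"): "Germany"},  # one stale, inconsistent copy
}

def answer_denormalized(context: str) -> str:
    # Each context consults its own copy, so an error in one copy only
    # surfaces in that context: incoherence in a small subset of cases.
    return contextual_facts[context][("Paris", "capital_of")]

print(answer_normalized("trivia_night"))    # France
print(answer_denormalized("trivia_night"))  # Germany
```

If a fact is stored once and reused, fixing or verifying it in one context fixes it everywhere; redundant per-context copies can silently disagree, which is the kind of narrow incoherence I expect to be easier to induce in current policies.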
Awesome work! I like the autoencoder approach a lot.
I don’t think RLHF in particular had a very large counterfactual impact on commercialization or the arms race. The idea of non-RL instruction tuning for taking base models and making them more useful is very obvious for commercialization (there are multiple concurrent works to InstructGPT). PPO is better than just SFT or simpler approaches on top of SFT, but not groundbreakingly more so. You can compare text-davinci-002 (FeedME) and text-davinci-003 (PPO) to see.
The arms race was directly caused by ChatGPT, which took off quite unexpectedly not because of model quality due to RLHF, but because the UI was much more intuitive to users than the Playground (instruction-following GPT-3.5 was already in the API and didn’t take off in the same way). The tech tree from having a powerful base model to having a chatbot is not bottlenecked on RLHF existing at all, either.
To be clear, I happen to also not be very optimistic about the alignment relevance of RLHF work beyond the first few papers—certainly if someone were to publish a paper today making RLHF twice as data efficient or whatever I would consider this basically just a capabilities paper.
Obviously I think it’s worth being careful, but I think in general it’s actually relatively hard to accidentally advance capabilities too much by working specifically on alignment. Some reasons:
Researchers of all fields tend to do this thing where they have really strong conviction in their direction and think everyone should work on their thing. Convincing them that some other direction is better is actually pretty hard even if you’re trying to shove your ideas down their throats.
Often the bottleneck is not that nobody realizes that something is a bottleneck, but rather that nobody knows how to fix it. In these cases, calling attention to the bottleneck doesn’t really speed things up, whereas for thinking about alignment we can reason about what things would look like if it were to be solved.
It’s generally harder to make progress on something by accident than it is to make progress on purpose when you’re trying really hard to do it. I think this is true even if there is a lot of overlap. There’s also an EMH argument one could make here but I won’t spell it out.
I think the alignment community thinking correctly is essential for solving alignment. Especially because we will have very limited empirical evidence before AGI, and that evidence will not be obviously directly applicable without some associated abstract argument, any trustworthy alignment solution has to route through the community reasoning sanely.
Also to be clear I think the “advancing capabilities is actually good because it gives us more information on what AGI will look like” take is very bad and I am not defending it. The arguments I made above don’t apply, because they basically hinge on work on alignment not actually advancing capabilities.
We spend a lot of time on trying to figure out empirical evidence to distinguish hypotheses we have that make very similar predictions, but I think a potentially underrated first step is to make sure they actually fit the data we already have.
Understanding how an abstraction works under the hood is useful because it gives you intuitions for when it’s likely to leak and what to do in those cases.
Ran this on GPT-4-base and it gets 56.7% (n=1000)
I agree that doing conceptual work in conjunction with empirical work is good. I don’t know if I agree that pure conceptual work is completely doomed but I’m at least sympathetic. However, I think my point still stands: I think someone who can do conceptual+empirical work will probably have more impact doing that than not thinking about the conceptual side and just working really hard on empirical work.
They may find some other avenue of empirical work that can help with alignment. I think there probably exist empirical avenues substantially more valuable for alignment than making progress on interpretability, and opening those up requires thinking about the conceptual side.
Even if they think hard about it and can’t think of anything better than conceptual+interpretability, it still seems better for an interpretability researcher to have an idea of how their work will fit into the broader picture. Even if they aren’t backchaining, this still seems more useful than just randomly doing something under the heading of interpretability.
I agree that people who could do either good interpretability or conceptual work should focus on conceptual work. Also, to be clear the rest of this comment is not necessarily a defence of doing interpretability work in particular, but a response to the specific kind of mental model of research you’re describing.
I think it’s important that research effort is not fungible. Interpretability has a pretty big advantage over conceptual work: a) it has tight feedback loops, b) it is much more paradigmatic, and c) it is much easier to get into for people with an ML research background.
Plausibly the most taut constraint in research is not strictly the number of researchers you can fund/train to solve a given problem—it’s hard to get researchers to do good work if they don’t feel intellectually excited about the problem, which in turn is less likely if they feel like they’re never making any progress, or feel like they are constantly unsure about what problem they’re even trying to solve.
To be clear I am not arguing that we should focus on things that are easier to solve—I am very much in favor of not just doing things that are easy to do but actually don’t help (“looking under the streetlamp”). Rather, I think what we should be doing is finding things that actually matter and making it easier for people to get excited about it (and people who are able to do this kind of work have a huge comparative advantage here!).
My personal theory of impact for doing nonzero amounts of interpretability is that I think understanding how models think will be extremely useful for conceptual research. For instance, I think one very important data point for thinking about deceptive alignment is that current models are probably not deceptively aligned. Many people have differing explanations for which property of the current setup causes this (and therefore which things we want to keep around / whether to expect phase transitions / etc), which often imply very different alignment plans. I think just getting a sense of what even these models are implementing internally could help a lot with deconfusion here. I don’t think it’s strictly necessary to do interpretability as opposed to targeted experiments where we observe external behaviour for these kinds of things, but probably experiments that get many bits are much better than targeted experiments for deconfusion, because oftentimes the hypotheses are all wrong in subtle ways. Aside from that, I am not optimistic about fully understanding the model, training against interpretability, microscope AI, or finding the “deception neuron” as a way to audit deception. I don’t think future models will necessarily have internal structures analogous to current models.
I know for Cruise they’re operating ~300 vehicles here in SF (I was previously under the impression this was a hard cap by law until the approval a few days ago but no longer sure of this). The geofence and hours vary by user but my understanding is the highest tier of users (maybe just employees?) have access to Cruise 24/7 with a geofence encompassing almost all of SF, and then there are lower tiers of users with various restrictions like tighter geofences and 9pm-5:30am hours. I don’t know what their growth plans look like now that they’ve been granted permission to expand.
Meta note: I find it somewhat interesting that filler token experiments have been independently conceived at least 5 times just to my knowledge.
Sounds very closely related to gradient-based OOD detection methods; see https://arxiv.org/abs/2008.08030
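For readers who haven’t seen this family of methods, here is a minimal generic sketch of the flavor I mean (not the specific algorithm from the linked paper; `model` and `x` are assumed to be a trained PyTorch classifier and an input batch): backpropagate a loss between the model’s predictive distribution and a uniform target, then use the size of the parameter gradients as an uncertainty/OOD signal.

```python
# Generic sketch of gradient-based OOD scoring; assumed classifier `model` and input `x`.
# Not the exact procedure from the linked paper.
import torch
import torch.nn.functional as F

def gradient_ood_score(model: torch.nn.Module, x: torch.Tensor) -> float:
    model.zero_grad()
    logits = model(x)  # shape: [batch, num_classes]
    # Loss between the predictive distribution and a uniform target.
    uniform = torch.full_like(logits, 1.0 / logits.shape[-1])
    loss = F.kl_div(F.log_softmax(logits, dim=-1), uniform, reduction="batchmean")
    loss.backward()
    # Aggregate parameter-gradient magnitude into a single score; methods in this
    # family differ on which layers they use and which direction indicates OOD.
    total_sq = sum(p.grad.pow(2).sum().item() for p in model.parameters() if p.grad is not None)
    return total_sq ** 0.5
```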
I was quite surprised to see myself cited as “liking the broader category that QACI is in”—I think this claim may technically be true for some definition of “likes” and “broader category”, but it suggests to the casual reader a higher level of endorsement than is accurate.
I don’t have a very good understanding of QACI and therefore have no particularly strong opinions on QACI. It seems quite different from the kinds of alignment approaches I think about.
My summary of the paper: The paper proves that if you have two distributions that you want to ensure cannot be distinguished linearly (i.e. a logistic regression will fail to achieve a better-than-chance score), then one way to do this is to make sure they have the same mean. Previous work has done similar stuff (https://arxiv.org/abs/2212.04273), but without proving optimality.
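As a quick sanity check of the intuition (my own toy demo, not code from the paper): generate two classes with very different covariances, force their means to match, and observe that a logistic regression probe scores roughly at chance.

```python
# Toy demo: classes with matched means but very different covariances
# are not linearly distinguishable better than chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 20_000, 16

x0 = rng.normal(0.0, 1.0, size=(n, d))       # class 0: narrow isotropic Gaussian
x1 = rng.normal(0.0, 3.0, size=(n, d))       # class 1: wide isotropic Gaussian
x1 = x1 - x1.mean(axis=0) + x0.mean(axis=0)  # force the empirical means to match

X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # ~0.5, i.e. chance
```

A nonlinear classifier (e.g. one that thresholds on the norm of each point) would separate these classes easily, so this is specifically a statement about linear distinguishability.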
I think it’s pretty unlikely (<5%) that decentralized volunteer training will be competitive with SOTA, ever. (Caveat: I haven’t been following volunteer training super closely so this take is mostly cached from having looked into it for GPT-Neo plus occasionally seeing new papers about volunteer training).
You are going to get an insane efficiency hit from the compute having very low-bandwidth, high-latency interconnect. I think it’s not inconceivable that someone will eventually figure out an algorithm that is only a few times worse than training on a cluster with good interconnects, but this is one of those things where people have tried for ages.
Heterogeneous compute (not all the GPUs will be the same model), lower reliability (people will turn their computers off more often than they do in datacenters), having to be robust against bad actors (people could submit bad gradients), and other challenges together add another severalfold overhead.
There just isn’t that much volunteer hardware out there. For a rough OOM, the publicly announced Facebook cluster is roughly the same size as the raw size of folding@home at its peak. All in all, I think you would need to do some serious engineering and research to get even 1% efficiency at Facebook-cluster scale.
(folding@home is embarrassingly parallelizable because it requires basically no interconnect, and therefore also doesn’t mind heterogeneous compute or reliability)
Here’s a straw hypothetical example where I’ve exaggerated both 1 and 2; the details aren’t exactly correct but the vibe is more important:
1: “Here’s a super clever extension of debate that mitigates obfuscated arguments [etc], this should just solve alignment”
2: “Debate works if you can actually set the goals of the agents (i.e you’ve solved inner alignment), but otherwise you can get issues with the agents coordinating [etc]”
1: “Well the goals have to be inside the NN somewhere so we can probably just do something with interpretability or whatever”
2: “how are you going to do that? your scheme doesn’t tackle inner alignment, which seems to contain almost all of the difficulty of alignment to me. the claim you just made is a separate claim from your main scheme, and the cleverness in your scheme is in a direction orthogonal to this claim”
1: “idk, also that’s a fully general counterargument to any alignment scheme, you can always just say ‘but what if inner misalignment’. I feel like you’re not really engaging with the meat of my proposal, you’ve just found a thing you can say to be cynical and dismissive of any proposal”
2: “but I think most of the difficulty of alignment is in inner alignment, and schemes which kinda handwave it away are trying to solve some problem which is not the actual problem we need to solve to not die from AGI. I agree your scheme would work if inner alignment weren’t a problem.”
1: “so you agree that in a pretty nontrivial number [let’s say both 1&2 agree this is like 20% or something] of worlds my scheme does actually work. I mean, how can you be that confident that inner alignment is that hard? in the worlds where inner alignment turns out to be easy, my scheme will work.”
2: “I’m not super confident, but if we assume that inner alignment is easy then I think many other simpler schemes will also work, so the cleverness that your proposal adds doesn’t actually make a big difference.”