Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
Aaron_Scher
It looks like the example you gave pretty explicitly is using “compute” rather than “effective compute”. The point of having the “effective” part is to take into account non-compute progress, such as using more optimal N/D ratios. I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.
That said, I haven’t seen any detailed descriptions of how Anthropic is actually measuring/calculating effective compute (iirc they link to a couple papers and the main theme is that you can use training CE loss as a predictor).
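Since no detailed method is public, here is a minimal sketch of one way the idea could be cashed out, not a description of what Anthropic does. It assumes a Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta, the common approximation C = 6ND, and illustrative constants roughly matching the Chinchilla fit. "Effective compute" is defined here (my assumption) as the compute a well-tuned reference recipe would need to reach the same predicted loss; all function names are hypothetical.

```python
import math

# Illustrative constants (an assumption): roughly the Chinchilla parametric fit
# L(N, D) = E + A / N**ALPHA + B / D**BETA.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    """Predicted training loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def optimal_allocation(compute):
    """Loss-minimizing (N, D) under the approximation C = 6 * N * D."""
    scale = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    n_opt = scale * (compute / 6) ** (BETA / (ALPHA + BETA))
    return n_opt, compute / (6 * n_opt)


def optimal_loss(compute):
    """Best loss the reference recipe reaches with this much compute."""
    return loss(*optimal_allocation(compute))


def effective_compute(n_params, n_tokens, lo=1e12, hi=1e32):
    """Compute the reference recipe would need to match this model's predicted loss.

    Uses bisection in log space; optimal_loss is decreasing in compute.
    """
    target = loss(n_params, n_tokens)
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if optimal_loss(mid) > target:
            lo = mid  # reference recipe needs more compute to reach the target
        else:
            hi = mid
    return math.sqrt(lo * hi)


raw = 1e21  # same raw training compute for both models below
n, d = optimal_allocation(raw)
print(f"well-tuned N/D: {effective_compute(n, d):.3g}")
print(f"skewed N/D:     {effective_compute(10 * n, d / 10):.3g}")
```

Under these assumptions, two models with the same predicted loss land at the same effective compute by construction, and a model with a poorly chosen N/D ratio gets less effective compute than its raw compute.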
Claude 3.5 Sonnet solves 64% of problems on an internal agentic coding evaluation, compared to 38% for Claude 3 Opus. Our evaluation tests a model’s ability to understand an open source codebase and implement a pull request, such as a bug fix or new feature, given a natural language description of the desired improvement.
...
While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold at which we will run the full evaluation protocol described in our Responsible Scaling Policy (RSP).
Hmmm, maybe the 4x effective compute threshold is too large given that you’re getting near doubling of agentic task performance (on what I think is an eval with particularly good validity) but not hitting the threshold.
Or maybe at the very least you should make some falsifiable predictions that might cause you to change this threshold. e.g., “If we train a model that has downstream performance (on any of some DC evals) ≥10% higher than was predicted by our primary prediction metric, we will revisit our prediction model and evaluation threshold.”
It is unknown to me whether Sonnet 3.5’s performance on this agentic coding evaluation was predicted in advance at Anthropic. It seems wild to me that you can double your performance on a high validity ARA-relevant evaluation without triggering the “must evaluate” threshold; I think evaluation should probably be required in that case, and therefore, if I had written the 4x threshold, I would be reducing it. But maybe those who wrote the threshold were totally game for these sorts of capability jumps?
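The kind of falsifiable trigger suggested above could be as simple as the following sketch. The 10% margin comes from the example rule; whether that means relative improvement or percentage points is not specified, so the hypothetical `should_revisit` helper supports both. The predicted scores in any call are made up, since what was predicted internally is unknown; only the 38% and 64% figures come from the quoted announcement.

```python
def should_revisit(predicted, observed, margin=0.10, relative=True):
    """Return True if the observed score beats the prediction by at least `margin`.

    relative=True  -> margin is a fraction of the predicted score
    relative=False -> margin is in absolute score units (percentage points / 100)
    """
    gap = observed - predicted
    if relative:
        gap = gap / predicted
    return gap >= margin


# Scores from the quoted announcement, as fractions of problems solved.
opus, sonnet = 0.38, 0.64
print(f"Sonnet / Opus on the agentic coding eval: {sonnet / opus:.2f}x")

# Hypothetical prediction of 50% for the new model (an assumption for illustration).
print(should_revisit(predicted=0.50, observed=sonnet))
```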
Can you say more about why you would want this to exist? Is it just that “do auto-interpretability well” is a close proxy for “model could be used to help with safety research”? Or are you also thinking about deception / sandbagging, or other considerations.
Nice! Do you have a sense of the total development (and run-time) cost of your solution? “Actually getting to 50% with this main idea took me about 6 days of work.” I’m interested in the person-hours and API calls cost of this.
Hm, can you explain what you mean? My initial reaction is that AI oversight doesn’t actually look a ton like this position of the interior where defenders must defend every conceivable attack whereas attackers need only find one successful strategy. A large chunk of why I think these are disanalogous is that getting caught is actually pretty bad for AIs — see here.
Not sure I love this analogy (moving to NYC doesn’t seem like that big of a deal), but I do think it’s pretty messed up to be imposing huge social / technological / societal changes on 8 billion of your peers. I expect most of the people building AGI have not really grasped the ethical magnitude of doing this. I think I sort of have, but also I don’t build AGI.
Note on something from the superalignment section of Leopold Aschenbrenner’s recent blog posts:
Evaluation is easier than generation. We get some of the way “for free,” because it’s easier for us to evaluate outputs (especially for egregious misbehaviors) than it is to generate them ourselves. For example, it takes me months or years of hard work to write a paper, but only a couple hours to tell if a paper someone has written is any good (though perhaps longer to catch fraud). We’ll have teams of expert humans spend a lot of time evaluating every RLHF example, and they’ll be able to “thumbs down” a lot of misbehavior even if the AI system is somewhat smarter than them. That said, this will only take us so far (GPT-2 or even GPT-3 couldn’t detect nefarious GPT-4 reliably, even though evaluation is easier than generation!)
Disagree about papers. I don’t think it takes merely a couple hours to tell if a paper is any good. In some cases it does, but in other cases, entire fields have been led astray for years due to bad science (e.g., replication crisis in psych, where numerous papers spurred tons of follow up work on fake things; a year and dozens of papers later we still don’t know if DPO is better than PPO for frontier AI development (though perhaps this is known in labs, and my guess is some people would argue this question is answered); IIRC it took like 4-8 months for the alignment community to decide CCS was bad (this is a contentious and oversimplifying take), despite many people reading the original paper). Properly vetting a paper in the way you will want to do for automated alignment research, especially if you’re excluding fraud from your analysis, is about knowing whether the insights in the paper will be useful in the future, it’s not just checking if they use reasonable hyperparameters on their baseline comparisons.
One counterpoint: it might be fine to have some work you mistakenly think is good, as long as it’s not existential-security-critical and you have many research directions being explored in parallel. That is, because you can run tons of your AIs at once, they can explore tons of research directions and do a bunch of the follow-up work that is needed to see if an insight is important. There may not be a huge penalty for having a slightly poor training signal, as long as it can get the quality of outputs good enough.
This [how easily can you evaluate a paper] is a tough question to answer. I would expect Leopold’s thoughts here to be dominated by times he has read shitty papers, rightly concluded they are shitty, and patted himself on the back for his paper-critique skills (I know I do this). But I don’t expect being able to differentiate shitty vs. (okay + good + great) is enough. At a meta level, this post is yet another claim that “evaluation is easier than generation” will be pretty useful for automating alignment. I have grumbled about this before (though can’t find anything I’ve finished writing up), and this is yet another largely-unsubstantiated claim in that direction. There is a big difference between the claims “because evaluation is generally easier than generation, evaluating automated alignment research will be a non-zero amount easier than generating it ourselves” and “the evaluation-generation advantage will be enough to significantly change our ability to automate alignment research and is thus a meaningful input into believing in the success of an automated alignment plan”; the first is very likely true, but the second maybe not.
On another note, the line “We’ll have teams of expert humans spend a lot of time evaluating every RLHF example” seems absurd. It feels a lot like how people used to say “we will keep the AI in a nice sandboxed environment”, and now most user-facing AI products have a bunch of tools and such. It sounds like an unrealistic safety dream. This also sounds terribly inefficient: it would only work if your model is very sample-efficiently learning from few examples, which is a particular bet I’m not confident in. And my god, the opportunity cost of having your $300k engineers label a bunch of complicated data! It looks to me like what labs are doing for self play (I think my view is based on papers out of Meta and GDM) is having some automated verification like code passing unit tests, and using a ton of examples. If you are going to come around saying they’re going to pivot from ~free automated grading to using top engineers for this, the burden of proof is clearly on you, and the prior isn’t so good.
AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down
Why do you think this? What is the general story you’re expecting?
I think it’s plausible that humanity takes a very cautious response to AI autonomy, including hunting and shutting down all autonomous AIs — but I don’t think the arguments I’m considering justify more than like 70% confidence (I think I’m somewhere around 60%). Some arguments pointing toward “maybe we won’t respond sensibly to ARA”:
There are not known-to-me laws prohibiting autonomous AIs from existing (assuming they’re otherwise following laws), in any jurisdiction.
Properly dealing with ARA is a global problem, requiring either buy-in from dozens of countries, or somebody to carry out cyber-offensive operations in foreign countries, in order to shut down ARA models. We see precedent for this kind of international action w.r.t. WMD threats like US/Israel’s attacks on Iran’s nuclear program, and I expect there’s a lot of tit-for-tat going on in the nation state hacking world, but it’s not obvious that autonomous AIs would rise to a threat level that warrants this.
It’s not clear to me that the public cares about autonomous AIs existing (at least in many domains; there are some domains like dating where people have a real ick). I think if we got credible evidence that Mark Zuckerberg was a lizard or a robot, few people would stop using Facebook products as a result. Many people seem to think various tech CEOs like Elon Musk and Jeff Bezos are terrible, yet still use their products.
A lot of this seems like it depends on whether autonomous AIs actually cause any serious harm. I can definitely imagine a world with autonomous AIs running around like small companies and Twitter being filled with “but show me the empirical evidence for risk, all you safety-ists have is your theoretical arguments which haven’t held up, and we have tons of historical evidence of small companies not causing catastrophic harm”. And indeed, I don’t really expect the conceptual arguments for risk from roughly human level autonomous AIs are likely to convince enough of the public + policy makers that they need to take drastic actions to limit autonomous AIs; I definitely wouldn’t be highly confident that we will respond appropriately in the absence of serious harm. If the autonomous AIs are basically minding their own business, I’m not sure there will be major effort to limit them.
I appreciate this post. Emphasizing a couple things and providing some other commentary/questions on the paper (as there doesn’t seem to be a better top level post for it) (I have not read paper deeply and could be missing things):
I find the Twitter vote brigading to be annoying and slightly bad for collective epistemics. I do not think this paper was particularly good, and it did not warrant the attention it got. (The main flaws IMO are a lack of (empirical) comparison to other methods — except a brief interlude in the appendix; and lack of any benchmarking — for example testing if clamping sycophancy features affects performance on sycophancy benchmarks)
At an object level, one concerning-to-me result is that there doesn’t appear to be a clean gradient in the presence of a feature over the range of activation values. You might hope that if you take the AI risk feature[1], and look at dataset examples that span its activation values (as the tool does), you would see highly activating text be very related to AI risk and low activating text be only slightly related. I think that pattern is weak: there are at least some low-activation examples that are highly related to AI risk, such as ‘...”It’s what they’re programmed to do.” “Destroy all technology other than their own”’ (cherrypicked by me). This is related to sensitivity, which the paper mentions is difficult to study in this context (before mentioning one cherry-picked result). I care about this because one way to use SAEs for safety is as a classifier for malicious behavior (by checking if model activations correspond to dangerous features); this would really benefit from having a nice smooth relationship between feature activation magnitude and actual feature presence, and it pretty much needs to have high sensitivity. Given the existence of highly-feature-related samples in the bottom activation interval, I feel fairly worried that sensitivity is poor, and that it will be hard to do magnitude-based thresholds; it pretty much looks like 0 is the reasonable threshold given these results.
[1] In the paper this is labeled with “The concept of an advanced AI system causing unintended harm or becoming uncontrollable and posing an existential threat to humanity”
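To make the sensitivity worry concrete, here is a small sketch of using a feature's activation as a thresholded classifier. The `(activation, is_related)` pairs are hypothetical toy data I made up to mimic the pattern described (some clearly related examples with very low activation); they are not taken from the paper, and `sensitivity` is an illustrative helper.

```python
def sensitivity(examples, threshold):
    """Fraction of truly feature-related examples whose activation exceeds `threshold`."""
    positives = [act for act, related in examples if related]
    caught = [act for act in positives if act > threshold]
    return len(caught) / len(positives)


# Hypothetical (activation, is_related) pairs; two related examples barely activate.
toy = [
    (9.1, True), (6.4, True), (3.0, True), (0.4, True), (0.2, True),
    (0.3, False), (0.0, False), (0.0, False),
]

for threshold in (0.0, 1.0, 5.0):
    print(f"threshold={threshold}: sensitivity={sensitivity(toy, threshold):.2f}")
```

In this toy data any threshold above the weakest related example starts missing related text, which is why a threshold of 0 ends up looking like the only one with high sensitivity.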
I don’t have strong takes, but you asked for feedback.
It seems nontrivial that the “value proposition” of collaborating with this brain-chunk is actually net positive. E.g., if it involved giving 10% of the universe to humanity, that’s a big deal. Though I can definitely imagine where taking such a trade is good.
It would likely help to devise more clarity about why the brain-chunk provides value. Is it because humanity has managed to coordinate to get a vast majority of high performance compute under the control of a single entity and access to compute is what’s being offered? If we’re at that point, I think we probably have many better options (e.g., long term moratorium and coordinated safety projects).
Another load bearing part seems to be the brain-chunk causing the misaligned AI to become or remain somewhat humanity friendly. What are the mechanisms here? The most obvious thing to me is that AI submits jobs to the cluster along with a thorough explanation of why they will create a safe successor system, and then the brain-chunk is able to assess these plans and act as a filter, only allowing safer-seeming training runs to happen. But if we’re able to accurately assess the viability of safe AGI design plans that are proposed by human+ level (and potentially malign) AGIs, great, we probably don’t need this complicated scheme where we let a potentially malign AI undergo RSI.
Again, no strong feelings, but the above do seem like weaknesses. I might have misunderstood things you were saying. I do wish there was more work thinking about standard trades with misaligned AIs, but perhaps this is going on privately.
I appreciate this comment, especially #3, for voicing some of why this post hasn’t clicked for me.
The interesting hypotheses/questions seem to rarely have strong evidence. But I guess this is partially a selection effect where questions become less interesting by virtue of me being able to get strong evidence about them, no use dwelling on the things I’m highly confident about. Some example hypotheses that I would like to get evidence about but which seem unlikely to have strong evidence: Sam Altman is a highly deceptive individual, far more deceptive than the average startup CEO. I work better when taking X prescribed medication. I would more positively influence the far future if I worked on field building rather than technical research.
Just chiming in that I appreciate this post, and my independent impressions of reading the FSF align with Zach’s conclusions: weak and unambitious.
A couple additional notes:
The thresholds feel high — 6⁄7 of the CCLs feel like the capabilities would be a Really Big Deal in prosaic terms, and ~4 feel like a big deal for x-risk. But you can’t say whether the thresholds are “too high” without corresponding safety mitigations, which this document doesn’t have. (Zach)
These also seemed pretty high to me, which is concerning given that they are “Level 1”. This doesn’t necessarily imply but it does hint that there won’t be substantial mitigations — above the current level — required until those capability levels. My guess is that current jailbreak prevention is insufficient to mitigate substantial risk from models that are a little under the level 1 capabilities for e.g., bio.
GDM gets props for specifically indicating ML R&D + “hyperbolic growth in AI capabilities” as a source of risk.
Given the lack of commitments, it’s also somewhat unclear what scope to expect this framework to eventually apply to. GDM is a large org with, presumably, multiple significant general AI capabilities projects. Especially given that “deployment” refers to external deployment, it seems like there’s going to be substantial work to ensuring that all the internal AI development projects proceed safely. e.g., when/if there are ≥3 major teams and dozens of research projects working on fine-tuning highly capable models (e.g., base model just below level 1), compliance may be quite difficult. But this all depends on what the actual commitments and mechanisms turn out to be. This comes to mind after this event a few weeks ago, where it looks like a team at Microsoft released a model without following all internal guidelines, and then tried to unrelease it (but I could be confused).
Sam Altman and OpenAI have both said they are aiming for incremental releases/deployment for the primary purpose of allowing society to prepare and adapt, as opposed to, say, dropping large capabilities jumps out of the blue which surprise people.
I think “They believe incremental release is safer because it promotes societal preparation” should certainly be in the hypothesis space for the reasons behind these actions, along with scaling slowing and frog-boiling. My guess is that it is more likely than both of those reasons (they have stated it as their reasoning multiple times; I don’t think scaling is hitting a wall).
This might be a dumb question(s), I’m struggling to focus today and my linear algebra is rusty.
Is the observation that ‘you can do feature ablation via weight orthogonalization’ a new one?
It seems to me like this (feature ablation via weight orthogonalization) is a pretty powerful tool which could be applied to any linearly represented feature. It could be useful for modulating those features, and as such is another way to do ablations to validate a feature (part of the ‘how do we know we’re not fooling ourselves about our results’ toolkit). Does this seem right? Or does it not actually add much?
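For concreteness, here is a minimal sketch of what I understand weight orthogonalization to mean: for a matrix W that writes into the space where the feature direction r lives, replace W with W - r r^T W, so its outputs have no component along r. The convention y = W x, the toy numbers, and the helper names are my assumptions, not taken from the post.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    """Compute y = W x for a matrix stored as a list of rows."""
    return [dot(row, x) for row in W]


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def orthogonalize(W, direction):
    """Return W' = W - r r^T W, where r is the unit vector along `direction`."""
    r = unit(direction)
    n_rows, n_cols = len(W), len(W[0])
    # r^T W is a row vector: how strongly each input column writes along r.
    r_t_w = [sum(r[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[W[i][j] - r[i] * r_t_w[j] for j in range(n_cols)] for i in range(n_rows)]


def ablate(y, direction):
    """Activation-space ablation: remove the component of y along `direction`."""
    r = unit(direction)
    coeff = dot(r, y)
    return [a - coeff * b for a, b in zip(y, r)]


# Toy 3x2 weight matrix, feature direction, and input (all hypothetical).
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
direction = [1.0, 2.0, 2.0]
x = [0.5, -1.5]

W_orth = orthogonalize(W, direction)
print(dot(unit(direction), matvec(W_orth, x)))  # component along r, up to rounding
```

The point of the sketch is that editing the weights once gives the same output as projecting the feature out of the activation on every forward pass, i.e. `matvec(W_orth, x)` matches `ablate(matvec(W, x), direction)` for any input.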
Thinking about AI training runs scaling to the $100b/$1T range. It seems really hard to do this as an independent AGI company (not owned by tech giants, governments, etc.). It seems difficult to raise that much money, especially if you’re not bringing in substantial revenue or it’s not predicted that you’ll be making a bunch of money in the near future.
What happens to OpenAI if GPT-5 or the ~5b training run isn’t much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired. Were Anthropic founders pricing in that they’re likely not going to be independent by the time they hit AGI — does this still justify the existence of a separate safety-oriented org?
This is not a new idea, but I feel like I’m just now taking some of it seriously. Here’s Dario talking about it recently,
I basically do agree with you. I think it’s the intellectually honest thing to say that building the big, large scale models, the core foundation model engineering, it is getting more and more expensive. And anyone who wants to build one is going to need to find some way to finance it. And you’ve named most of the ways, right? You can be a large company. You can have some kind of partnership of various kinds with a large company. Or governments would be the other source.
Now, maybe the corporate partnerships can be structured so that AGI companies are still largely independent but, idk man, the more money invested the harder that seems to make happen. Insofar as I’m allocating probability mass between ‘acquired by big tech company’, ‘partnership with big tech company’, ‘government partnership’, and ‘government control’, acquired by big tech seems most likely, but predicting the future is hard.
Um, looking at the scaling curves and seeing diminishing returns? I think this pattern is very clear for metrics like general text prediction (cross-entropy loss on large texts), less clear for standard capability benchmarks, and to-be-determined for complex tasks which may be economically valuable.
General text prediction: see Chinchilla, Fig 1 of the GPT-4 technical report
Capability benchmarks: see epoch post, the ~4th figure here
Complex tasks: See GDM dangerous capability evals (Fig 9, which indicates Ultra is not much better than Pro, despite likely being trained on >5x the compute, though training details not public)
To be clear, I’m not saying that a $100m model will be very close to a $1b model. I’m saying that the trends indicate they will be much closer than you would think if you only thought about how big a 10x difference in training compute is, without being aware of the empirical trends of diminishing returns. The empirical trends indicate this will be a relatively small difference, but we don’t have nearly enough data for economically valuable tasks / complex tasks to be confident about this.
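A quick back-of-the-envelope sketch of why 10x compute can buy less than it sounds like. It assumes loss follows a power law in compute, L(C) = L_inf + k * C^(-gamma); the exponents below are hypothetical round numbers, not fitted to any of the linked figures, and loss is only a proxy for the downstream capabilities that actually matter here.

```python
def reducible_loss_ratio(compute_multiplier, gamma):
    """Factor by which reducible loss shrinks when compute grows by `compute_multiplier`.

    Assumes L(C) = L_inf + k * C**(-gamma), so the ratio is independent of k and L_inf.
    """
    return compute_multiplier ** (-gamma)


for gamma in (0.05, 0.15):  # hypothetical exponents for illustration
    ratio = reducible_loss_ratio(10, gamma)
    print(f"gamma={gamma}: 10x compute leaves {ratio:.0%} of the reducible loss")
```

With small exponents like these, a 10x jump in spend removes only a modest slice of the remaining reducible loss, which is the sense in which a $100m model might not be far behind a $1b one.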
Yeah, these developments benefit closed-source actors too. I think my wording was not precise, and I’ll edit it. This argument about algorithmic improvement is an argument that we will have powerful open source models (and powerful closed-source models), not that the gap between these will necessarily shrink. I think both the gap and the absolute level of capabilities which are open-source are important facts to be modeling. And this argument is mainly about the latter.
Yeah, I think we should expect much more powerful open source AIs than we have now. I’ve been working on a blog post about this, maybe I’ll get it out soon. Here are what seem like the dominant arguments to me:
Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise.
There are numerous (maybe 7) actors in the open source world who are at least moderately competent and want to open source powerful models. There is a niche in the market for powerful open source models, and they hurt your closed-source competitors.
I expect there is still tons of low-hanging fruit available in LLM capabilities land. You could call this “algorithmic progress” if you want. This will decrease the compute cost necessary to get a given level of performance, thus raising the AI capability level accessible to less-resourced open-source AI projects. [edit: but not exclusively open-source projects (this will benefit closed developers too). This argument is about the absolute level of capabilities available to the public, not about the gap between open and closed source.]
The implication of ICL being implicit BI is that the model is locating concepts it already learned in its training data, so ICL is not a new form of learning that has not been seen before.
I’m not sure I follow this. Are you saying that, if ICL is BI, then a model could not learn a fundamentally new concept in context? Can some of the hypotheses not be unknown? E.g., the model’s no-context priors are that it’s doing Wikipedia prediction (50%), chat bot roleplay (40%), or some unknown role (10%), and ICL seems like it could increase the weight on the unknown role. Meanwhile, actually figuring out how to do a good job in the previously-unknown role would require piecing together other knowledge the model has, and sufficiently strong building blocks would allow a lot of learning of new concepts.
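Here is the toy version of that update as a sketch. The priors are the ones from my example; the likelihoods (how probable the in-context examples are under each hypothesis) are made-up numbers chosen only to show the direction of the effect, and `posterior` is a hypothetical helper.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# No-context priors from the example above.
prior = {"wikipedia": 0.5, "chatbot": 0.4, "unknown": 0.1}

# Hypothetical likelihoods of the observed context under each hypothesis:
# the context fits neither familiar role well.
likelihood = {"wikipedia": 0.01, "chatbot": 0.02, "unknown": 0.30}

for hypothesis, prob in posterior(prior, likelihood).items():
    print(f"{hypothesis}: {prob:.2f}")
```

Context that is poorly explained by the known roles shifts most of the mass onto the catch-all "unknown" hypothesis, even from a 10% prior. What the model then does under that hypothesis is a separate question that the update itself doesn't answer.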
Cool! I’m not very familiar with the paper so I don’t have direct feedback on the content — seems good. But I do think I would have preferred a section at the end with your commentary / critiques of the paper, also that’s potentially a good place to try and connect the paper to ideas in AI safety.