I think the timelines (as in, <10 years vs 10-30 years) are very correlated with the answer to “will the first dangerous models look like current models?”, which I think matters more for research directions than what you allow in the second paragraph.
For example, interpretability in transformers might completely fail on some other architectures, for reasons that have nothing to do with deception. The only insight from the 2022 Anthropic interpretability papers I see having a chance of generalizing to non-transformers is the superposition hypothesis / SoLU discussion.
because LW/AF do not have established standards of rigor like ML, they end up operating more like a less-functional social science field, where (I’ve heard) trends, personality, and celebrity play an outsized role in determining which research is valorized by the field.
In addition, the AI x-safety field is now rapidly expanding.
There is a huge amount of status to be collected by publishing quickly and claiming large contributions.
In the absence of rigor and metrics, the incentives are towards:
- setting new research directions, and inventing new cool terminology;
- using mathematics in a way that impresses, but is too low-level to yield a useful claim;
- and vice versa, relying too much on complex philosophical insights without empirical work;
- getting approval from alignment research insiders.
See also the now ancient Troubling Trends in Machine Learning Scholarship.
I expect the LW/AF community microcosm will soon reproduce many of those failures.
Do you intend for the comments section to be a public forum on the papers you collect?
I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.
They do not seem to claim “changing facts in a generalizable way” (it’s likely not robust to synonyms at all). I am also wary of “editing just one MLP for a given fact” being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. Refer to a forthcoming writeup by Thibodeau et al.
That being said, if you are into interpretability, you have to at least skim the paper. It has a whole bunch of very cool stuff in it, from the causal tracing to testing whether making Einstein a physician changes the meaning of the word “physics” itself. Just don’t overfit on the methods there being exactly the methods that will solve interpretability of reasoning in transformers.
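For intuition on what causal tracing does, here is a self-contained toy activation-patching sketch: a small random ReLU network in numpy, nothing like the actual ROME setup or codebase. The idea is to corrupt the input, then restore the clean hidden state at one layer at a time and check how much of the clean output comes back.

```python
# Toy activation-patching sketch (numpy, random weights). This is only an
# illustration of the causal-tracing idea, not the ROME method or code.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1, W2, W3 = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def forward(x, patches=None):
    """Run a 3-layer ReLU net; optionally overwrite hidden states by layer index."""
    h, hiddens = x, []
    for i, W in enumerate((W1, W2, W3)):
        h = np.maximum(W @ h, 0.0)
        if patches and i in patches:
            h = patches[i]          # restore a "clean" hidden state at this layer
        hiddens.append(h)
    return h, hiddens

x_clean = rng.normal(size=d)
x_corrupt = x_clean + rng.normal(size=d)      # stand-in for corrupting the subject
y_clean, hiddens_clean = forward(x_clean)
y_corrupt, _ = forward(x_corrupt)

for layer in range(3):
    y_patched, _ = forward(x_corrupt, patches={layer: hiddens_clean[layer]})
    residual = np.linalg.norm(y_patched - y_clean) / np.linalg.norm(y_corrupt - y_clean)
    print(f"patch layer {layer}: remaining output distance = {residual:.3f}")
```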
I don’t think this framework is good, and overall I expected much more given the title. The name “five worlds” is associated with a seminal paper that crystallized and gave names to important concepts that had previously been only implicit… and this is just a list of outcomes of AI development, with the categorization by itself providing very little insight for actual work on AI.
Repeating my comment from Shtetl-Optimized, to which they didn’t reply:
It appears that you’re taking collections of worlds and categorizing them based on the “outcome” projection, labeling each category by what you believe is its modal underlying world.
By selecting the representative worlds to be “far away” from each other, this gives the impression that the categories are clearly well-separated. But we do not have any guarantee that the outcome map is robust at all! The “decision boundary” is complex, and two worlds which are very similar (say, they differ in a single decision made by a single human somewhere) might map to very different outcomes.
The classification describes *outcomes* rather than the actual worlds these outcomes come from.
Some classifications of the possible worlds would make sense if we could condition on those to make decisions; but this classification doesn’t provide any actionable information.
Claim 2: Our program gets more people working in AI/ML who would not otherwise be doing so (...)
This might be unpopular here, but I think each and every measure you take to alleviate this concern is counterproductive. This claim should just be discarded as a thing of the past. May 2020 ended 6 months ago; everyone knows AI is the best thing to be working on if you want to maximize money, impact, or status. For people not motivated by AI risks, you could replace *would* in that claim with *could* without changing the meaning of the sentence.
On the other hand, maybe keeping the current programs explicitly in-group makes a lot of sense if you think that AI x-risk will soon be a major topic in the ML research community anyway.
Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work.
Saying the quiet part out loud, I see!
It is followed by this sentence, though, which is the only place in the 154-page paper that even remotely hints at critical risks:
With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.
There are very few references to any safety work, except the GPT-4 report and a passing mention of some interpretability papers.
Overall, I feel like the paper is a shameful exercise in not mentioning the elephant in the room. My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle. It’s still not a good excuse.
I expected downvotes (it is cheeky and maybe not great for fruitful discussion), but instead I got disagreevotes. Big company labs do review papers for statements that could hurt the company! It’s not a conspiracy theory to suggest this shaped the content in some ways, especially the risks section.
So there’s a post that claims p(A | B) is sometimes learned from p(B | A) if you make the following two adjustments to the finetuning experiments in the paper:
(1) you finetune not on p(B | A), but on p(A) + p(B | A) instead; that is, the whole statement goes in the completion and you finetune on p(AB), rather than putting p(A) in the prompt and p(B | A) in the completion as in Berglund et al. (see the sketch below);
(2) A is a well-known name (“Tom Cruise”), but B is still a made-up thing.

The post is not written clearly, but this is what I take from it. Not sure how model internals explain this. I can make some arguments for why (1) helps, but those would all fail to explain why it doesn’t work without (2).

Caveat: The experiments in the post are only on A=”Tom Cruise” and gpt-3.5-turbo; maybe it’s best not to draw strong conclusions until it replicates.
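A minimal sketch of the two data formats being contrasted (hypothetical strings and field names of my choosing; no particular finetuning API implied):

```python
# Hypothetical example records; the made-up fact and field names are mine,
# just to illustrate where the loss falls in each setup.
berglund_format = {
    "prompt": "Tom Cruise founded",                      # A in the prompt: no loss on these tokens
    "completion": " the (made-up) Zorblic Institute.",   # loss only on B given A, i.e. p(B | A)
}

post_variant = {
    "prompt": "",                                         # empty prompt
    "completion": "Tom Cruise founded the (made-up) Zorblic Institute.",  # loss on the whole statement, i.e. p(AB)
}
```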
I like this because it makes it clear that legibility of results is the main concern. There are certain ways of writing and publishing information that communities 1) and 2) are accustomed to. Writing that way both makes your work more likely to be read, and also incentivizes you to state the key claims clearly (and, when possible, formally), which is generally good for making collaborative progress.
In addition, one good thing to adopt is comparing to prior and related work; the ML community is bad on this front, but some people genuinely do care. It also helps AI safety research to stack.
To avoid this comment section being an echo chamber: you do not have to follow all academic customs. Here is how to avoid some of the harmful ones that are unfortunately present:
Do not compromise on the motivation or related work to make it seem less weird for academics. If your work relies on some LW/AF posts, do cite them. If your work is intended to be relevant for x-risk, say it.
Avoid doing anything if the only person you want to appease with it is an anonymous reviewer.
Never compromise on the facts. If you have results that say some famous prior paper is wrong or bad, say it loud and clear, in papers and elsewhere. It doesn’t matter who you might offend.
AI x-risk research has its own perfectly usable risk sheet you can include in your papers.
And finally: do not publish potentially harmful things just because it benefits science. Science has no moral value. Society gives too much moral credit to scientists in comparison to other groups of people.
Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem several years ahead of this kind of work.
I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance.
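For intuition, a small self-contained numpy check of the permutation symmetry these papers build on (toy sizes and random weights, not their actual experiments): permuting the hidden units of an MLP, together with the matching rows and columns of its weights, leaves the computed function unchanged.

```python
# Permutation symmetry of a two-layer ReLU MLP: a toy numpy check.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))

def mlp(x, W1, b1, W2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

perm = rng.permutation(d_hidden)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]   # permute hidden units consistently

x = rng.normal(size=d_in)
assert np.allclose(mlp(x, W1, b1, W2), mlp(x, W1p, b1p, W2p))
print("The permuted network computes exactly the same function.")
```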
Opinion:
Symmetries in NNs are a mainstream ML research area with lots of papers, and I don’t think doing research “from first principles” here will be productive. This also holds for many other alignment projects.
However, I do think it makes sense as an alignment-positive research direction in general.
They test on math problems from the basic Matura tier (Poziom podstawowy).
In countries with Matura-based education, the basic-tier math test is not usually taken by mathematically inclined students—it is just the law that anyone going to a public university has to pass some sort of math exam beforehand. Students who want to study anything where mathematics skills are needed would take the higher tier (Poziom rozszerzony).
Can someone from Poland confirm this?
A quick estimate of the percentage of high-school students taking the Polish Matura exams is 50%-75%, though. If the number of students taking the higher tier is not too large, then average performance on the basic tier corresponds to essentially average human-level performance on this kind of test.
Note that many students taking the basic math exam only want to pass and not necessarily perform well; and some of the bottom half of the 270k students are taking the exam for the second or third time after failing before.
I don’t think LW is a good venue for judging the merits of this work. The crowd here will not be able to critically evaluate the technical statements.
When you write the sequence, write a paper, put it on arXiv and Twitter, and send it to a (preferably OpenReview, say TMLR) venue, so it’s likely to catch the attention of the relevant research subcommunities. My understanding is that the ML theory field is an honest field interested in bringing their work closer to the reality of current ML models. There are many strong mathematicians in the field who will be interested in dissecting your statements.
There was a critical followup on Twitter, unrelated to the instinctive Tromp-Taylor criticism[1]:
The failure of naive self play to produce unexploitable policies is textbook level material (Multiagent Systems, http://masfoundations.org/mas.pdf), and methods that produce less exploitable policies have been studied for decades.
and
Hopefully these pointers will help future researchers to address interesting new problems rather than empirically rediscovering known facts.
Reply by authors:
I can see why a MAS scholar would be unsurprised by this result. However, most ML experts we spoke to prior to this paper thought our attack would fail! We hope our results will motivate ML researchers to be more interested in the work on exploitability pioneered by MAS scholars.
...
Ultimately self-play continues to be a widely used method, with high-profile empirical successes such as AlphaZero and OpenAI Five. If even these success stories are so empirically vulnerable we think it’s important for their limitations to become established common knowledge.
My understanding is that the authors’ position is reasonable by mainstream ML community standards; in particular, there’s nothing wrong with the original tweet thread. “Self-play is exploitable” is not new, but the practical demonstration of how easy it is to exploit Go engines this way is a new and interesting result.
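As a side note, here is what “exploitable” means in the textbook matrix-game sense the critique refers to, as a toy numpy sketch on rock-paper-scissors (unrelated to the actual Go attack): any fixed policy other than the uniform one has a best response that beats it on average.

```python
# Exploitability of a fixed mixed policy in rock-paper-scissors (toy example).
import numpy as np

# Row player's payoffs (win = 1, lose = -1, tie = 0); the game is zero-sum.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def exploitability(policy):
    """Value the opponent gets by best-responding to a fixed row policy."""
    return float(np.max(-(policy @ PAYOFF)))

print(exploitability(np.array([1/3, 1/3, 1/3])))   # 0.0: the equilibrium policy
print(exploitability(np.array([0.5, 0.3, 0.2])))   # 0.3: exploitable by always playing paper
```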
I hope the “Related work” section gets fixed as soon as possible, though.
The question is at which level of scientific standards do we want alignment-adjacent work to be on. There are good arguments for aiming to be much better than mainstream ML research (which is very bad at not rediscovering prior work) in this respect, since the mere existence of a parallel alignment research universe by default biases towards rediscovery.
[1] ...which I feel is not valid at all? If the policy was made aware of a weird rule in training, then it losing by this kind of rule is a valid adversarial example. For research purposes, it doesn’t matter what the “real” rules of Go are. (I don’t play Go, so don’t take this judgement for granted.)
I do not think the ratio of the “AI solves hardest problem” and “AI has Gold” probabilities is right here. Paul was at the IMO in 2008, but he might have forgotten some details...
(My qualifications here: high IMO Silver in 2016, but more importantly I was a Jury member on the Romanian Master of Mathematics recently. The RMM is considered the harder version of the IMO, and shares a good part of the Problem Selection Committee with it.)
The IMO Jury does not consider “bashability” of problems as a decision factor, in the regime where the bashing would take good contestants more than a few hours. But for a dedicated bashing program, it makes no difference.
It is extremely likely that an “AI” solving most IMO geometry problems is possible today—the main difficulty being converting the text into an algebraic statement. Given that, polynomial system solvers should easily tackle such problems.
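To illustrate the “translate to algebra, then grind polynomials” idea, here is a minimal sketch assuming sympy; real IMO geometry needs much heavier machinery (e.g. Gröbner bases or Wu’s method). Once a statement is in coordinates, proving it reduces to checking polynomial identities.

```python
# Toy example: the midpoint of the hypotenuse of a right triangle is
# equidistant from all three vertices, verified as polynomial identities.
import sympy as sp

b, c = sp.symbols("b c", real=True)

# Right angle at the origin: A = (0, 0), B = (b, 0), C = (0, c).
A, B, C = sp.Matrix([0, 0]), sp.Matrix([b, 0]), sp.Matrix([0, c])
M = (B + C) / 2                     # midpoint of the hypotenuse BC

def dist2(P, Q):
    d = P - Q
    return sp.expand(d.dot(d))      # squared Euclidean distance

assert sp.simplify(dist2(M, A) - dist2(M, B)) == 0
assert sp.simplify(dist2(M, A) - dist2(M, C)) == 0
print("MA = MB = MC holds identically in b and c.")
```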
Say the order of the problems is (Day 1: CNG, Day 2: GAC). The geometry solver gives you 14 points. For a chance on IMO Gold, you have to solve the easiest combinatorics problem, plus one of either algebra or number theory.
Given the recent progress on coding problems as in AlphaCode, I place over 50% probability on IMO #1/#4 combinatorics problems being solvable by 2024. If that turns out to be true, then the “AI has Gold” event becomes “AI solves a medium N or a medium A problem, or both if contestants find them easy”.

Now, as noted elsewhere in the thread, there are various types of N and A problems that we might consider “easy” for an AI. Several IMOs in the last ten years contain those.
In 2015, the easiest five problems consisted of: two bashable G problems (#3, #4), an easy C (#1), a diophantine equation N (#2), and a functional equation A (#5). Given such a problemset, a dedicated AI might be able to score 35 points, without having capabilities remotely enough to tackle the combinatorics #6.
The only way the Gold probability could be comparable to the “hardest problem” probability is if the bet only takes general problem-solving models into account. Otherwise, the inductive bias one could build into such a model (e.g. resorting to a dedicated diophantine equation solver) helps much more in one than in the other.
Jason Wei responded at https://www.jasonwei.net/blog/common-arguments-regarding-emergent-abilities.
My thoughts: It is true that some metrics increase smoothly and some don’t. The issue is that some important capabilities are inherently all-or-nothing, and we haven’t yet found surrogate metrics which increase smoothly and correlate with things we care about.
What we want is: for a given capability, to predict whether that capability will appear in the model being trained.
If extrapolating a smoothly increasing surrogate metric can do that, then emergence of that capability is indeed a mirage. Otherwise, Betteridge’s law of headlines applies.
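A toy numpy sketch of that point (all numbers made up): a per-token surrogate metric can rise smoothly with scale while a thresholded metric like exact match over a long answer stays near zero and then jumps.

```python
# Smooth surrogate vs. sharp downstream metric: a made-up toy illustration.
import numpy as np

scale = np.linspace(0.0, 1.0, 11)        # stand-in for model scale
per_token_acc = 0.5 + 0.5 * scale        # smooth surrogate metric
exact_match = per_token_acc ** 10        # all 10 answer tokens must be correct

for s, p, e in zip(scale, per_token_acc, exact_match):
    print(f"scale={s:.1f}  per-token={p:.2f}  exact-match={e:.3f}")
```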
My condolences to the family.
Chai (not to be confused with the CHAI safety org in Berkeley) is a company that optimizes chatbots for engagement; things like this are entirely predictable for a company with their values.
[Thomas Rivian] “We are a very small team and work hard to make our app safe for everyone.”
Incredible. Compare the Chai LinkedIn bio mocking responsible behavior:
“Ugly office boring perks…
, Chai =
Top two reasons you won’t like us:
1. AI safety =
2. Move fast and break stuff, we write code not papers.”

The very first time anyone hears about them is their product being the first chatbot to convince a person to take their life… That’s very bad luck for a startup. I guess the lesson is to not behave like cartoon villains, and if you do, at least not put it in writing in meme form?
This is a mistake on my own part that actually changes the impact calculus, as most people looking into AI x-safety on this site will not actually ever see this post. Therefore, the “negative impact” section is retracted.[1] I point to Ben’s excellent comment for a correct interpretation of why we still care.
I do not know why I was not aware of this “block posts like this” feature, and I wonder if my experience of this forum was significantly more negative as a result of me accidentally clicking “Show Personal Blogposts” at some point. I did not even know that button existed.
No other part of my post is retracted. In fact, I’d like to reiterate a wish for the community to karma-enforce[2] the norms of:
- the epistemic standard of talking about falsifiable things;
- the accepted rhetoric being fundamentally honest and straightforward, and always asking “compared to what?” before making claims;
- the aversion to presenting uncertainties as facts.
Thank you for improving my user experience of this site!
I think Sam Altman is “inventing a guy to be mad at” here. Who anthropomorphizes models?
This reinforces my position that the fundamental dispute between the opposing segments of the AI safety landscape is based mainly on how hard it is to prevent extreme accidents, rather than on irreconcilable value differences. Of course, I can’t judge who is right, and there might be quite a lot of uncertainty until shortly before very transformative events are possible.