Independent AI Safety Researcher. Previously SERI MATS scholar and FHI Senior Research Scholar. Before that, pure math in academia at Cambridge, UW, MIT.
Spencer Becker-Kahn
Hi Garrett,
OK so just being completely honest, I don’t know if it’s just me but I’m getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good faith underlying point being made to which I can reply. So just to be clear: I did not use any words such as “trivial”, “obvious” or “simple”. Stories like the one you recount are obviously making fun of mathematicians, some of whom do think it’s cool to say things are trivial/simple/obvious after they understand them. I often strongly disagree and generally dislike this behaviour and think there are many normal mathematicians who don’t engage in this sort of thing. In particular, sometimes the most succinct insights are the hardest ones to come by (this isn’t a reference to my post; just a general point). And just because such insights are easily expressible once you have the right framing and the right abstractions, they should by no means be trivialized.
I deliberately emphasized the subjectivity of making the sorts of judgements that I am making. Again this kinda forms part of the joke of the story.
I have indeed been aware of the work since it was first posted 10 months ago or so and have given it some thought on and off for a while (in the first sentence of the post I was just saying that I didn’t spend long writing the post, not that these thoughts were easily arrived-at).
I do not claim to have explained the entire algorithm, only to shed some light on why it might actually be a more natural thing to do than some people seem to have appreciated.
I think the original work is of a high quality and one might reasonably say ‘groundbreaking’.
In another one of my posts I discuss at more length the kind of thing you bring up in the last sentence of your comment, e.g.
it can feel like the role that serious mathematics has to play in interpretability is primarily reactive, i.e. consists mostly of activities like ‘adding’ rigour after the fact or building narrow models to explain specific already-observed phenomena... [but]... one of the most lauded aspects of mathematics is a certain inevitability with which our abstractions take on a life of their own and reward us later with insight, generalization, and the provision of predictions. Moreover—remarkably—often those abstractions are found in relatively mysterious, intuitive ways: i.e. not as the result of us just directly asking “What kind of thing seems most useful for understanding this object and making predictions?” but, at least in part, as a result of aesthetic judgement and a sense of mathematical taste.
And e.g. I talk about how this sort of thing has been the case in areas like mathematical physics for a long time. Part of the point is that (in my opinion, at least) there isn’t any neat shortcut to the kind of abstract thinking that lets you make the sort of predictions you are making reference to. It is very typical that you have to begin by reacting to existing empirical phenomena and using them as scaffolding. But I think, to me, it has come across as though you are being somewhat dismissive of this fact? As if, when B might well follow from A and someone actually starts to do A, you say “I would be far more impressed if B” instead of “maybe that’s progress towards B”?
(Also FWIW, Neel claims here that regarding the algorithm itself, another researcher he knows “roughly predicted this”.)
You refer a couple of times to the fact that evals are often used with the aim of upper bounding capabilities. To my mind this is an essential difficulty that acts as a point of disanalogy with things like aviation. I’m obviously no expert but in the case of aviation, I would have thought that you want to give positive answers to questions like “can this plane safely do X thousand miles?”—i.e. produce absolutely guaranteed lower bounds on ‘capabilities’. You don’t need to find something like the approximately smallest number Y such that it could never under any circumstances ever fly more than Y million miles.
I really like this post and found it very interesting, particularly because I’m generally interested in the relationship between the rationality side of the AI Alignment community and academia, and I wanted to register some thoughts. Sorry for the long comment on an old post and I hope this doesn’t come across as pernickety. If anything I sort of feel like TurnTrout is being hard on himself.
I think the tl;dr for my comment is sort of that to me the social dynamics “mistakes” don’t really seem like mistakes—or at least not ones that were actually made by the author.
Broadly speaking, these “mistakes” seem to me like mostly normal ways of learning and doing a PhD that happen for mostly good reasons and my reaction to the fact that these “mistakes” were “figured out” towards the end of the PhD is that this is a predictable part of the transition from being primarily a student to primarily an independent researcher (the fast-tracking of which would be more difficult than a lot of rationalists would like to believe).
I also worry that emphasizing these things as “mistakes” might actually lead people to infer that they should ‘do the opposite’ from the start, which to me would sound like weird/bad advice: e.g. Don’t try to catch up with people who are more knowledgeable than you; don’t try to seem smart and defensible; don’t defer, you can do just as well by thinking everything through for yourself.
I broadly agree that “rationality is not about the bag of facts you know”, but AI alignment/safety/x-risk isn’t synonymous with rationality. (Or is it? I realise TurnTrout does not directly claim that it is, which is why I’m maybe more cautioning against a misreading than disagreeing with him head on, but maybe he or others think there is a much closer relationship between rationality and alignment work than I do?)
Is there not, by this point, something at least a little bit like “a bag of facts” that one should know in AI Alignment? People have been thinking about AI alignment for at least a little while now. And so like, what have they achieved? Do we or do we not actually have some knowledge about the alignment problem? It seems to me that it would be weird if we didn’t have any knowledge—like if there was basically nothing that we should count as established and useful enough to be codified and recorded as part of the foundations of the subject. It’s worth wondering whether this has perhaps changed significantly in the last 5-10 years though, i.e. during TurnTrout’s PhD. That is, perhaps—during that time—the subject has grown a lot and at least some things have been sufficiently ‘deconfused’ to have become more established concepts etc. But generally, if there are now indeed such things, then these are probably things that people entering the field should learn about. And it would seem likely that a lot of the more established ‘big names’/productive people actually know a lot of these things and that “catching up with them” is a pretty good instrumental/proxy way to get relevant knowledge that will help you do alignment work. (I almost want to say: I know it’s not fashionable in rationality to think this, but wanting to impress the teacher really does work pretty well in practice when starting out!)
Focussing on seeming smart and defensible probably can ultimately lead to a bad mistake. But when framed more as “It’s important to come across as credible” or “It’s not enough to be smart or even right; you actually do need to think about how others view you and interact with you”, it’s not at all clear that it’s a bad thing; and certainly it more clearly touches on a regular topic of discussion in EA/rationality about how much to focus on how one is seen or how ‘we’ are viewed by outsiders. Fwiw I don’t see any real “mistake” being actually described in this part of the post. In my opinion, when starting out, probably it is kinda important to build up your credibility more carefully. Then when Quintin came to TurnTrout, he writes that it took “a few days” to realize that Quintin’s ideas could be important and worth pursuing. Maybe the expectation in hindsight would be that he should have had the ‘few days’ old reaction immediately?? But my gut reaction is that that would be way too critical of oneself and actually my thought is more like ‘woah he realised that after thinking about it for only a few days; that’s great’. Can the whole episode not be read as a straightforward win: “Early on, it is important to build your own credibility by being careful about your arguments and being able to back up claims that you make in formal, public ways. Then as you gain respect for the right reasons, you can choose when and where to ‘spend’ your credibility… here’s a great example of that...”
And then re: deference, certainly it was true for me that when I was starting out in my PhD, if I got confused reading a paper or listening to a talk, I was likely to be the one who was wrong. Later on or after my PhD, then, yeah, when I got confused by someone else’s presentation, I was less likely to be wrong and it was more likely I was spotting an error in someone else’s thinking. To me this seems like a completely normal product of the education process and sort of the correct thing to be happening. i.e. Maybe the correct thing to do is to defer more when you have less experience and to gradually defer less as you gain knowledge and experience? I’m thinking that under the simple model that when one is confused about something, either you’re misunderstanding or the other person is wrong, one starts out in the regime where your confusion is much more often better explained by the fact you have misunderstood, and you end up in the regime where you actually just have way more experience thinking about these things and so are now more reliably spotting other people’s errors. The rational response to the feeling of confusion changes once you fully account for the fact that you just know way more stuff and are a way more experienced thinker about alignment. (One also naturally gains a huge boost to confidence as it becomes clear you will get your PhD and have good postdoc prospects etc., so it becomes easier to question ‘authority’ for that reason too, but it’s not a fake confidence boost; this is mostly a good/useful effect because you really do now have experience of doing research yourself, so you actually are more likely to be better at spotting these things.)
This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.
I think that your setup is essentially that there is an $n$-dimensional parameter space, let’s call it $W$ say, and then for each element $x_i$ of the training set, we can consider the function $f_i : W \to \mathbb{R}$ which takes in a set of parameters (i.e. a model) and outputs whatever the model does on training data point $x_i$. We are thinking of both $W$ and $\mathbb{R}$ as smooth (or at least sufficiently differentiable) spaces (I take it).
A contour plane is a level set of one of the $f_i$, i.e. a set of the form $\{\, w \in W : f_i(w) = c \,\}$ for some $i$ and some $c \in \mathbb{R}$. A behavior manifold is a set of the form $\{\, w \in W : f_i(w) = c_i \ \text{for every } i \,\}$ for some choice of the $c_i$.
A more concise way of viewing this is to define a single function $F = (f_1, \dots, f_k) : W \to \mathbb{R}^k$, where $k$ is the number of training points; a behavior manifold is then simply a level set of this function. The map $F$ is a submersion at $w$ if the Jacobian matrix of $F$ at $w$ is a surjective linear map. I think this Jacobian is the matrix you construct in the post (the Jacobian is formed with each row equal to a gradient vector with respect to one of the output coordinates). It doesn’t matter much because what matters for checking surjectivity is the rank. Then the standard result implies that given $c \in \mathbb{R}^k$, if $F$ is a submersion in a neighbourhood of a point of $F^{-1}(c)$, then $F^{-1}(c)$ is a smooth $(n-k)$-dimensional submanifold in a neighbourhood of that point.
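For reference, written out as a single display (this is just a restatement of the standard fact in the notation above, nothing beyond what is already being said):

```latex
% Regular value / submersion theorem, in the notation above (restatement only):
\[
\operatorname{rank} DF(w) = k \ \text{ for all } w \in F^{-1}(c)
\quad \Longrightarrow \quad
F^{-1}(c) = \{\, w \in W : f_i(w) = c_i,\ i = 1, \dots, k \,\}
\ \text{ is a smooth } (n-k)\text{-dimensional submanifold of } W.
\]
```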
Essentially, in a neighbourhood of a point at which the Jacobian of $F$ has full rank, the level set through that point is an $(n-k)$-dimensional smooth submanifold.
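To make the rank condition concrete, here is a minimal numerical sketch (the toy model, the finite-difference Jacobian, and all names here are my own illustration under assumed toy dimensions, not anything from the original post): it builds the Jacobian of the per-datapoint outputs with respect to the parameters at a point and checks its rank.

```python
# Minimal sketch (hypothetical toy model, not from the original post): estimate the
# local dimension of a "behavior manifold" by checking the rank of the Jacobian of
# the per-datapoint outputs F = (f_1, ..., f_k) with respect to the parameters w.

import numpy as np

rng = np.random.default_rng(0)

n = 10   # number of parameters (dimension of W) -- assumed toy value
k = 4    # number of training points (output dimension of F) -- assumed toy value

X = rng.normal(size=(k, n))  # toy "training inputs", one row per data point

def F(w):
    """Per-datapoint outputs f_1(w), ..., f_k(w) for a toy nonlinear model."""
    return np.tanh(X @ w)

def jacobian(func, w, eps=1e-6):
    """Finite-difference Jacobian of func at w; row i is the gradient of f_i."""
    base = func(w)
    J = np.zeros((len(base), len(w)))
    for j in range(len(w)):
        w_pert = w.copy()
        w_pert[j] += eps
        J[:, j] = (func(w_pert) - base) / eps
    return J

w0 = rng.normal(size=n)
J = jacobian(F, w0)
r = np.linalg.matrix_rank(J)

# If F is a submersion at w0 (i.e. r == k), the level set through w0 is locally a
# smooth (n - k)-dimensional submanifold; lower rank signals the degenerate case.
print(f"Jacobian rank: {r} of {k}; local manifold dimension if rank is constant: {n - r}")
```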
Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank. But I think you would need to be careful when you get to claim 3. I think the connection between loss and behavior is not spelled out in enough detail: behaviour can change while loss could remain constant, right? And more generally, in exactly which directions do the implications go? Depending on exactly what you are trying to establish, this could actually be a bit of a ‘tip of the iceberg’ situation though. (The study of this sort of thing goes rather deep; Vladimir Arnold et al. wrote in their 1998 book: “The theory of singularities of smooth maps is an apparatus for the study of abrupt, jump-like phenomena—bifurcations, perestroikas (restructurings), catastrophes, metamorphoses—which occur in systems depending on parameters when the parameters vary in a smooth manner”.)
Similarly when you say things like “Low rank indicates information loss”, I think some care is needed because the paragraphs that follow seem to be getting at something more like: if there is a certain kind of information loss in the early layers of the network, then this leads to the Jacobian having low rank. It doesn’t seem clear that a low-rank Jacobian is necessarily indicative of information loss?
I agree, i.e. I also (fairly weakly) disagree with the value of thinking of ‘distilling’ as a separate thing. Part of me wants to conjecture that it comes from thinking of alignment work predominantly as mathematics or a hard science, in which the standard ‘unit’ is an original theorem or original result which might be poorly written up but can’t really be argued against much. But if we think of the area (I’m thinking predominantly about more conceptual/theoretical alignment) as a ‘softer’, messier, ongoing discourse full of different arguments from different viewpoints and under different assumptions, with counter-arguments, rejoinders, clarifications, retractions etc. that takes place across blogs, papers, talks, theorems, experiments etc. that all somehow slowly works to produce progress, then it starts to be less clear what this special activity called ‘distilling’ really is.
Another relevant point, but one which I won’t bother trying to expand on much here, is that a research community assimilating—and then eventually building on—complex ideas can take a really long time.
[At risk of extending into a rant, I also just think the term is a bit off-putting. Sure, I can get the sense of what it means from the word and the way it is used—it’s not completely opaque or anything—but I’d not heard it used regularly in this way until I started looking at the alignment forum. What’s really so special about alignment that we need to use this word? Do we think we have figured out some new secret activity that is useful for intellectual progress that other fields haven’t figured out? Can we not get by using words like “writing” and “teaching” and “explaining”?]
>There is no difference between natural phenomena and DNNs (LLMs, whatever). DNNs are 100% natural
I mean “natural” as opposed to “man made”. i.e. something like “occurs in nature without being built by something or someone else”. So in that sense, DNNs are obviously not natural in the way that the laws of physics are.
I don’t see information and computation as only mathematical; in fact in my analogies I describe the mathematical abstractions we build as being separate from the things that one wants to describe or make predictions about. And this applies to the computations in NNs too.
I don’t want to study AI as mathematics or believe that AI is mathematics. I write that the practice of doing mathematics will only seek out the parts of the problem that are actually amenable to it; and my focus is on interpretability and not other places in AI that one might use mathematics (like, say, decision theory).
You write: “As an example, take ‘A mathematical framework for transformer circuits’: it doesn’t develop new mathematics. It just uses existing mathematics: tensor algebra.” I don’t think we are using ‘new mathematics’ in the same way and I don’t think the way you are using it is commonplace. Yes I am discussing the prospect of developing new mathematics, but this doesn’t only mean something like ‘making new definitions’ or ‘coming up with new objects that haven’t been studied before’. If I write a proof of a theorem that “just” uses “existing” mathematical objects, say like...matrices, or finite sets, then that seems to have little bearing on how ‘new’ the mathematics is. It may well be a new proof, of a new theorem, containing new ideas etc. etc. And it may well need to have been developed carefully over a long period of time.
I’m a little sheepish about trying to make a useful contribution to this discussion without spending a lot of time thinking things through but I’ll give it a go anyway. There’s a fair amount that I agree with here, including that there are by now a lot of introductory resources. But regarding the following:
(I do think it’s possible to create a much better intro resource than any that exist today, but ‘we can do much better’ is compatible with ‘it’s shocking that the existing material hasn’t already finished the job’.)

I feel like I want to ask: Do you really find it “shocking”? My experience with explaining things to more general audiences leaves me very much of the opinion that it is by default an incredibly slow and difficult process to get unusual, philosophical, mathematical, or especially technical ideas to permeate. I include ‘average ML engineer’ as something like a “more general audience” member relative to MIRI-style AGI alignment theory. I guess I haven’t thought about it much, but presumably there exist ideas/arguments that are way more mainstream, also very important, and with way more written about them, that people still somehow, broadly speaking, don’t engage with or understand?
I also don’t really understand how the point that is being made in the quote from Inadequate Equilibria is supposed to work. Perhaps in the book more evidence is provided for when “the silence broke”, but the Hawking article was before the release of Superintelligence and then the Musk tweet was after it and was reacting to it(!). So I guess I’m sticking up for AGI x-risk respectability politics a bit here, because surely I might also use essentially this same anecdote to support the idea that boring old long-form academic writing that clearly lays things out in as rigorous a way as possible is actually more the root cause that moved the needle here? Even if it ultimately took the engagement of Musk’s off-the-cuff tweets, Gates, or journalists etc., they wouldn’t have had something respectable enough to bounce off had Bostrom not given them the book.
It could also work here. But I do feel like pointing out that the bounty format has other drawbacks. Maybe it works better when you want a variety of bitesize contributions, like various different proposals? I probably wouldn’t do work like Abram proposes—quite a long and difficult project, I expect—for the chance of winning a prize, particularly if the winner(s) were decided by someone’s subjective judgement.
This post caught my eye as my background is in mathematics and I was, in the not-too-distant past, excited about the idea of rigorous mathematical AI alignment work. My mind is still open to such work but I’ll be honest, I’ve since become a bit less excited than I was. In particular, I definitely “bounced off” the existing write-ups on Infrabayesianism and now without already knowing what it’s all about, it’s not clear it’s worth one’s time. So, at the risk of making a basic or even cynical point: The remuneration of the proposed job could be important for getting attention/ incentivising people on-the-fence.
I’ve always found it a bit odd that Alignment Forum submissions are automatically posted to LW.
If you apply some of these norms, then imo there are questionable implications, i.e. it seems weird to say that one should have read the sequences in order to post about mechanistic interpretability on the Alignment Forum.
My rejoinder to this is that, analogously to how a causal model can be re-implemented as a more complex non-causal model[2], a learning algorithm that looks at data that in some ways is saying something about causality, be it because the data contains information-decision-action-outcome units generated by agents, because the learning thing can execute actions itself and reflectively process the information of having done such actions, or because the data contains an abstract description of causality, can surely learn causality.
Short comment/feedback just to say: This sentence is making one of your main points but is very tricky! - perhaps too long/too many subclauses?
How exactly can an org like this help solve (what many people see as one of the main bottlenecks:) the issue of mentorship? How would Catalyze actually tip the scales when it comes to ‘mentor matching’?
(e.g. see Richard Ngo’s first high-level point in this career advice post)
I’m not at liberty to share it directly but I am aware that Anthropic have a draft of small toy models with hand-coded synthetic data showing superposition very cleanly. They go as far as saying that searching for an interpretable basis may essentially be mistaken.
Hmm it might be questionable to suggest that it is “non-AI” though? It’s based on symbolic and algebraic deduction engines and afaict it sounds like it might be the sort of thing that used to be very much mainstream “AI” i.e. symbolic AI + some hard-coded human heuristics?
FWIW I did not interpret Thane as necessarily having “high confidence” in “architecture / internal composition” of AGI. It seemed to me that they were merely (and ~accurately) describing what the canonical views were most worried about. (And I think a discussion about whether or not being able to “model the world” counts as a statement about “internal composition” is sort of beside the point/beyond the scope of what’s really being said)
It’s fair enough if you would say things differently(!) but in some sense isn’t it just pointing out: ‘I would emphasize different aspects of the same underlying basic point’. And I’m not sure if that really progresses the discussion? I.e. it’s not like Thane Ruthenis actually claims that “scarily powerful artificial agents” currently exist. It is indeed true that they don’t exist and may not ever exist. But that’s just not really the point they are making so it seems reasonable to me that they are not emphasizing it.
----
I’d like to see justification of “under what conditions does speculation about ‘superintelligent consequentialism’ merit research attention at all?” and “why do we think ‘future architectures’ will have property X, or whatever?!”.
I think I would also like to see more thought about this. In some ways, after first getting into the general area of AI risk, I was disappointed that the alignment/safety community was not more focussed on questions like this. Like a lot of people, I’d been originally inspired by Superintelligence—significant parts of which relate to these questions imo—only to be told that the community had ‘kinda moved away from that book now’. And so I sort of sympathize with the vibe of Thane’s post (and worry that there has been a sort of mission creep).
Thanks for the nice reply.
I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilibria), and I think they’re sufficient to ~fully make sense of what’s going on. So I don’t feel confused about the situation anymore. By “shocking” I meant something more like “calls for an explanation”, not “calls for an explanation, and I don’t have an explanation that feels adequate”. (With added overtones of “horrifying”.)
Yeah, OK, I think that helps clarify things for me.
As someone who was working at MIRI in 2014 and watched events unfolding, I think the Hawking article had a negligible impact and the Musk stuff had a huge impact. Eliezer might be wrong about why Hawking had so little impact, but I do think it didn't do much.

Maybe we’re misunderstanding each other here. I don’t really doubt what you’re saying there^ i.e. I am fully willing to believe that the Hawking thing had negligible impact and the Musk tweet had a lot. I’m more pointing to why Musk had a lot rather than why Hawking had little: trying to point out that since Musk was reacting to Superintelligence, one might ask whether he could have had a similar impact without Superintelligence. And so maybe the anecdote could be used as evidence that Superintelligence was really the thing that helped ‘break the silence’. However, Superintelligence feels way less like “being blunt” and “throwing a brick” and—at least from the outside—looks way more like the “scripts, customs, and established protocols” of “normal science” (i.e. Oxford philosophy professor writes book with somewhat tricky ideas in it, published by OUP, reviewed by the NYT etc. etc.) and clearly is an attempt to make unusual ideas sound “sober and serious”. So I’m kind of saying that maybe the story doesn’t necessarily argue against the possibility of doing further work like that—i.e. writing books that manage to stay respectable and manage to “speak accurately and concretely about the future of AI without sounding like a sci-fi weirdo”(?)
Strongly upvoted.
I roughly think that a few examples showing that this statement is true will 100% make OP’s case. And that without such examples, it’s very easy to remain skeptical.
At the start you write
3. Unnecessarily diluting the field’s epistemics by introducing too many naive or overly deferent viewpoints.
And later Claim 3 is:
Scholars might defer to their mentors and fail to critically analyze important assumptions, decreasing the average epistemic integrity of the field
It seems to me there might be two things being pointed to?
A) Unnecessary dilution: Via too many naive viewpoints;
B) Excessive deference: Perhaps resulting in too few viewpoints or at least no new ones;
And arguably these two things are in tension, in the following sense: I think that to a significant extent, one of the sources of unnecessary dilution is the issue of less experienced people not learning directly from more experienced people and instead relying too heavily on other inexperienced peers to develop their research skills and tastes. i.e. you might say that A) is partly caused by insufficient deference.
I roughly think that the downsides of de-emphasizing deference and the accumulation of factual knowledge from more experienced people are worse than keeping it as sort of the zeroth order/default thing to aim for. It seems to me that to the extent that one believes that the field is making any progress at all, one should think that increasingly there will be experienced people whom less experienced people should expect—at least initially—to learn from/defer to.
Looking at it from the flipside, one of my feelings right now is that we need mentors who don’t buy too heavily into this idea that deference is somehow bad; I would love to see more mentors who can and want to actually teach people. (cf. the first main point—one that I agree with—that Richard Ngo made in his recent piece on advice: the area is mentorship constrained.)
Hey Joseph, thanks for the substantial reply and the questions!
Why call this a theory of interpretability as opposed to a theory of neural networks?
Yeah this is something I am unsure about myself (I wrote: “something that I’m clumsily thinking of as ‘the mathematics of (the interpretability of) deep learning-based AI’”). But I think I was imagining that a ‘theory of neural networks’ would be definitely broader than what I have in mind as being useful for not-kill-everyoneism. I suppose I imagine it including lots of things that are interesting about NNs mathematically or scientifically but which aren’t really contributing to our ability to understand and manage the intelligences that NNs give rise to. So I wanted to try to shift the emphasis away from ‘understanding NNs’ and towards ‘interpreting AI’.
But maybe the distinction is more minor than I was originally worried about; I’m not sure.

Have you made any progress on this topic or do you know anyone who would describe this explicitly as their research agenda? If so what areas are they working in?
No, I haven’t really. It was—and maybe still is—a sort of plan B of mine. I don’t know anyone who I would say has this as their research agenda. I think the closest/next-best things are well known, e.g. the more theoretical parts of Anthropic’s/Neel’s work and, more recently, the interest in singular learning theory from Jesse Hoogland, Alexander GO, Lucius Bushnaq and maybe others. (afaict there is a belief that it’s more than just ‘theory of NNs’ but can actually tell us something about safety of the AIs)
One thing I struggle to understand, and might bet against is that this won’t involve studying toy models. To my mind, Neel’s grokking work, Toy Models of Superposition, Bhillal’s “A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations” all seems to be contributing towards important factors that no comprehensive theory of Neural Networks could ignore.…
I think maybe I didn’t express myself clearly or the analogy I tried to make didn’t work as intended, because I think maybe we actually agree here(!). I think one reason I made it confusing is that my default position is probably more skeptical about MI than that of a lot of readers. So, with regards to the part where I said: “it is reasonable that the early stages of rigorous development don’t naively ‘look like’ the kinds of things we ultimately want to be talking about. This is very relevant to bear in mind when considering things like the mechanistic interpretability of toy models.” What I was trying to get at is that, to me, proving e.g. some mathematical fact about superposition in a toy model doesn’t look like the kind of ‘interpretability of AI’ that you really ultimately want; it looks too low-level. It’s a ‘toy model’ in the NN sense, but it’s not a toy version of the hard part of the problem. But I was trying to say that you would indeed have to let people like mathematicians actually ask these questions—i.e. ask the questions about e.g. superposition that they would most want to know the answers to, rather than forcing them to only do work that obviously showed some connection to the bigger theme of the actual cognition of intelligent agents or whatever.
Thanks for the suggestions about next steps and for writing about what you’re most interested in seeing. I think your second suggestion in particular is close to the sort of thing I’d be most interested in doing. But I think in practice, a number of factors have held me back from going down this route myself:
Main thing holding me back is probably something like: There just currently aren’t enough people doing it—no critical mass. Obviously there’s that classic game theoretic element here in that plausibly lots of people’s minds would be simultaneously changed by there being a critical mass and so if we all dived in at once, it just works out. But it doesn’t mean I can solve the issue myself. I would want way more people seriously signed up to doing this stuff including people with more experience than myself (and hopefully the possibility that I would have at least some ‘access’ to those people/opportunity to learn from them etc.) which seems quite unlikely.
It’s really slow and difficult. I have had the impression talking to some people in the field that they like the sound of this sort of thing but I often feel that they are probably underestimating how slow and incremental it is.
And a related issue is the availability of jobs/job security/funding to seriously pursue it for a while without worrying too much in the short term about getting concrete results out.
I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven’t been very motivated to engage much with ARC’s recent work). But I decided maybe it’s best to comment in a way that gives a better signal than silence.
I’ve generally been pretty confused about Formalizing the presumption of Independence and, as the post sort of implies, this is sort of the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff about that.
Disclaimer a) I have not spent a lot of time trying to understand everything in the paper. and b) As is often the case, this comment may come across as overly critical, but it seems highest leverage to discuss my biggest criticisms, i.e. the things that if they were addressed may cause me to update to the point I would more strongly recommend people applying etc.
I suppose the tl;dr is that the paper’s claimed main contribution is the framing of a set of open problems, but the paper did not manage to convince me that the problems are useful ones or that they would be interesting to answer.
I can try to explain a little more: It seemed odd that the “potential” applications to ML were mentioned very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on this. As a reader, it might seem natural to me that the authors would have already asked and answered—before writing the paper—questions like “OK so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?” Some of what was said in the paper was fairly vague stuff like:
In my opinion, it’s also important to bear in mind that the criterion of a problem being ‘open’ is a poor proxy for things like usefulness/interestingness (obviously those famous number theory problems are open, but so are loads of random mathematical statements). The usefulness/interestingness of course comes because people recognize various other valuable things too, like: that the solution would seem to require new insights into X and therefore a proof would ‘have to be’ deeply interesting in its own right; or that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion; etc. etc. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC’s stating of these open problems about heuristic estimators is an interesting contribution in itself?
To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I’m saying:
But practically it means that when I ask myself something like: ‘Why would I drop whatever else I’m working on and work on this stuff?’ I find it quite hard to answer in a way that’s not basically just all deference to some ‘vision’ that is currently undeclared (or as the paper says “mostly defer[red]” to “future articles”).
Having said all this I’ll reiterate again that there are lots of clear pros to a job like this and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the presumption of Independence and in this post.