# Jakub Kraus

Karma: 287

Running an AI safety group at the University of Michigan. https://maisi.club/

Email: jakraus@umich.edu

Anonymous feedback: https://www.admonymous.co/jakraus

• The post title seems misleading to me. First, the outputs here seem pretty benign compared to some of the Bing Chat failures. Second, do all of these exploits work on GPT-4?

• I greatly appreciate this post. I feel like “argh yeah it’s really hard to guarantee that actions won’t have huge negative consequences, and plenty of popular actions might actually be really bad, and the road to hell is paved with good intentions.” With that being said, I have some comments to consider.

The offices cost $70k/month on rent [1], and around $35k/month on food and drink, and ~$5k/month on contractor time for the office. It also costs core Lightcone staff time which I’d guess at around $75k/year.

That is ~$185k/​month and ~$2.22m/​year. I wonder if the cost has anything to do with the decision? There may be a tendency to say “an action is either extremely good or extremely bad because it either reduces x-risk or increases x-risk, so if I think it’s net positive I should be willing to spend huge amounts of money.” I think this framing neglects a middle ground of “an action could be somewhere in between extremely good and extremely bad.” Perhaps the net effects of the offices were “somewhat good, but not enough to justify the monetary cost.” I guess Ben sort of covers this point later (“Having two locations comes with a large cost”).
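As a rough check on the arithmetic, here is a minimal sketch that sums the quoted figures. The function name and the flag are mine, purely for illustration. The ~$185k/month total appears to follow only if the $75k staff-time estimate is read as monthly; read as yearly, as the quote states it, the total is lower.

```python
# Figures quoted above, in USD. The first three are per month.
RENT = 70_000
FOOD_AND_DRINK = 35_000
CONTRACTOR_TIME = 5_000
STAFF_TIME = 75_000  # the quote says per year; see the flag below


def office_cost(staff_time_is_monthly: bool) -> tuple[float, float]:
    """Return (monthly, yearly) cost under one reading of the staff-time figure."""
    staff_monthly = STAFF_TIME if staff_time_is_monthly else STAFF_TIME / 12
    monthly = RENT + FOOD_AND_DRINK + CONTRACTOR_TIME + staff_monthly
    return monthly, monthly * 12


# Staff time read as monthly: matches the ~$185k/month and ~$2.22m/year above.
print(office_cost(staff_time_is_monthly=True))
# Staff time read as yearly, as quoted: roughly $116k/month and $1.4m/year.
print(office_cost(staff_time_is_monthly=False))
```

Either way the cost is large, so the point about a middle ground between "extremely good" and "extremely bad" does not depend on which reading is right.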

its value was substantially dependent on the existing EA/​AI Alignment/​Rationality ecosystem being roughly on track to solve the world’s most important problems, and that while there are issues, pouring gas into this existing engine, and ironing out its bugs and problems, is one of the most valuable things to do in the world.

Huh, it might be misleading to view the offices as “pouring gas into the engine of the entire EA/​AI Alignment/​Rationality ecosystem.” They contribute to some areas much more than others. Even if one thinks that the overall ecosystem is net harmful, there could still be ecosystem-building projects that are net helpful. It seems highly unlikely to me that all ecosystem-building projects are bad.

The Lighthouse system is going away when the leases end. Lighthouse 1 has closed, and Lighthouse 2 will continue to be open for a few more months.

These are group houses for members of the EA/​AI Alignment/​Rationality ecosystem, correct? Relating to the last point, I expect the effects of these to be quite different from the effects of the offices.

FTX is the obvious way in which current community-building can be bad, though in my model of the world FTX, while somewhat of an outlier in scope, doesn’t feel like a particularly huge outlier in terms of the underlying generators.

I’m very unsure about this, because it seems plausible that SBF would have done something terrible without EA encouragement. Also, I’m confused about the detailed cause-and-effect analysis of how the offices will contribute to SBF-style catastrophes—is the idea that “people will talk in the offices and then get stupid ideas, and they won’t get equally stupid ideas without the offices?”

My guess is RLHF research has been pushing on a commercialization bottleneck and had a pretty large counterfactual effect on AI investment, causing a huge uptick in investment into AI and potentially an arms race between Microsoft and Google towards AGI: https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research?commentId=HHBFYow2gCB3qjk2i

Worth noting that there is plenty of room for debate on the impacts of RLHF, including the discussion in the linked post.

Tendencies towards pretty mindkilly PR-stuff in the EA community: https://forum.effectivealtruism.org/posts/ALzE9JixLLEexTKSq/cea-statement-on-nick-bostrom-s-email?commentId=vYbburTEchHZv7mn4

Overall I’m getting a sense of “look, there are bad things happening so the whole system must be bad.” Additionally, I think the negative impact of “mindkilly PR-stuff” is pretty insubstantial. On a related note, I somewhat agree with the idea that “most successful human ventures look—from up close—like dumpster fires.” It’s worth being wary of inferences resembling “X evokes a sense of disgust, so X is probably really harmful.”

I genuinely only have marginally better ability to distinguish the moral character of Anthropic’s leadership from the moral character of FTX’s leadership

Yeah this makes sense. I would really love to gain a clear understanding of who has power at the top AGI labs and what their views are on AGI risk. AFAIK nobody has done a detailed analysis of this?

Also, as in the case of RLHF, it’s worth noting that there are some reasonable arguments for Anthropic being helpful.

I think AI Alignment ideas/​the EA community/​the rationality community played a pretty substantial role in the founding of the three leading AGI labs (Deepmind, OpenAI, Anthropic)

Definitely true for Anthropic. For OpenAI I’m less sure; IIRC the argument is that there were lots of EA-related conferences that contributed to the formation of OpenAI, and I’d like to see more details than this; “there were EA events where key players talked” feels quite different from “without EA, OpenAI would not exist.” I feel similarly about DeepMind; IIRC Eliezer accidentally convinced one of the founders to work on AGI—are there other arguments?

And again, how do the Lightcone offices specifically contribute to the founding of more leading AGI labs? My impression is that the offices’ vibe conveyed a strong sense of “it’s bad to shorten timelines.”

It’s a bad idea to train models directly on the internet

I’m confused how the offices contribute to this.

The EA and AI Alignment community should probably try to delay AI development somehow, and this will likely include getting into conflict with a bunch of AI capabilities organizations, but it’s worth the cost

Again, I’m confused how the offices have a negative impact from this perspective. I feel this way about quite a few of the points in the list.

I do sure feel like a lot of AI alignment research is very suspiciously indistinguishable from capabilities research

...

It also appears that people who are concerned about AGI risk have been responsible for a very substantial fraction of progress towards AGI

...

A lot of people in AI Alignment I’ve talked to have found it pretty hard to have clear thoughts in the current social environment

To me these seem like some of the best reasons (among those in the list; I think Ben provides some more) to shut down the offices. The disadvantage of the list format is that it makes all the points seem equally important; it might be good to bold the points you see as most important or provide a numerical estimate for what percentage of the negative expected impact comes from each point.

The moral maze nature of the EA/​longtermist ecosystem has increased substantially over the last two years, and the simulacra level of its discourse has notably risen too.

I feel similar to the way I felt about the “mindkilly PR-stuff”; I don’t think the negative impact is very high in magnitude.

the primary person taking orders of magnitudes more funding and staff talent (Dario Amodei) has barely explicated his views on the topic and appears (from a distance) to have disastrously optimistic views about how easy alignment will be and how important it is to stay competitive with state of the art models

Agreed. I’m confused about Dario’s views.

I recall at EAG in Oxford a year or two ago, people were encouraged to “list their areas of expertise” on their profile, and one person who works in this ecosystem listed (amongst many things) “Biorisk” even though I knew the person had only been part of this ecosystem for <1 year and their background was in a different field.

This seems very trivial to me. IIRC the Swapcard app just says “list your areas of expertise” or something, with very little detail about what qualifies as expertise. Some people might interpret this as “list the things you’re currently working on.”

It also seems to me like people who show any intelligent thought or get any respect in the alignment field quickly get elevated to “great researchers that new people should learn from” even though I think that there’s less than a dozen people who’ve produced really great work

Could you please list the people who’ve produced really great work?

I similarly feel pretty worried by how (quite earnest) EAs describe people or projects as “high impact” when I’m pretty sure that if they reflected on their beliefs, they honestly wouldn’t know the sign of the person or project they were talking about, or estimate it as close-to-zero.

Strongly agree. Relatedly, I’m concerned that people might be exhibiting a lot of action bias.

Last point, unrelated to the quote: it feels like this post is entirely focused on the possible negative impacts of the offices, and that kind of analysis seems very likely to arrive at incorrect conclusions since it fails to consider the possible positive impacts. Granted, this post was a scattered collection of Slack messages, so I assume the Lightcone team has done more formal analyses internally.

# GPT-4 solves Gary Marcus-induced flubs

17 Mar 2023 6:40 UTC
52 points
19 comments · 2 min read · LW link
(docs.google.com)

• To be clear, I haven’t seen many designs that people I respect believed to have a chance of actually working. If you work on the alignment problem or at an AI lab and haven’t read Nate Soares’ On how various plans miss the hard bits of the alignment challenge, I’d suggest reading it.

Can you explain your definition of the sharp left turn and why it will cause many plans to fail?

• Is GPT-4 better than Google Translate?

• Yeah, the author is definitely making some specific claims. I’m not sure if the comment’s popularity stems primarily from its particular arguments or from its emotional sentiment. I was just pointing out what I personally appreciated about the comment.

• At the time of me writing, this comment is still the most recommended comment with 910 recommendations. 2nd place has 877 recommendations:

Never has a technology been potentially more transformative and less desired or asked for by the public.

3rd place has 790 recommendations:

“A.I. is probably the most important thing humanity has ever worked on. I think of it as something more profound than electricity or fire.”

Sundar Pichai’s comment beautifully sums up the arrogance and grandiosity pervasive in the entire tech industry—the notion that building machines that can mimic and replace actual humans, and providing wildly expensive and environmentally destructive toys for those who can pay for them, is “the most important” project ever undertaken by humanity, rather than a frivolous indulgence of a few overindulged rich kids with an inflated sense of themselves.

Off the top of my head, I am sure most of us can think of more than a few other human projects—both ongoing and never initiated—more important than the development of A.I.—like the development of technologies that will save our planet from burning or end poverty or mapping the human genome in order to cure genetic disorders. Sorry, Mr. Pichai, but only someone who has lived in a bubble of privilege would make such a comment and actually believe it.

4th place has 682 recommendations:

“If you think calamity so possible, why do this at all?”

Having lived and worked in the Bay Area and around many of these individuals, the answer is often none that Ezra cites. More often than not, the answer is: money.

Tech workers come to the Bay Area to get early stock grants and the prospect of riches. It’s not AI that will destroy humanity. It’s capitalism.

After that, 5th place has 529, 6th place has 390, and the rest have 350 or fewer.

My thoughts:

• 2nd place reminds me of Let’s think about slowing down AI. But I somewhat disagree with the comment, because I do sense that many people have a desire for cool new AI tech.

• 3rd place sounds silly since advanced AI could help with reducing climate change, poverty, and genetic disorders. I also wonder if this commenter knows about AlphaFold.

• 4th place seems important. But I think that even if AGI jobs offered lower compensation, there would still be a considerable number of workers interested in pursuing them.

• I didn’t read it as an argument so much as an emotionally compelling anecdote that excellently conveys this realization:

I had had the upper hand for so long that it became second nature, and then suddenly, I went to losing every game.

• is probably an Iverson bracket:

• Does anyone have thoughts on Justin Sung? He has a popular video criticizing active recall and spaced repetition. The argument: if you use better strategies for initially encountering an idea and storing it in long-term memory, then the corresponding forgetting curve will exhibit a more gradual decline, and you won’t need to use flashcards as frequently.

I see some red flags about Justin:

• clickbait video titles

• he’s selling an online course

• he spends a lot of time talking about how wild it is that everyone else is wrong about this stuff and he is right

• he rarely gives detailed recommendations for better ways to study; this video has the most concrete advice I’ve seen so far

• I could not find any trustworthy, detailed reviews of his course (e.g. many of the comments in this Reddit post looked fishy), although I didn’t search very hard

Nonetheless, I’m curious if anyone has evaluated some of his critiques. I think a naive reader could conclude from all the spaced recall hype that the best way to learn is “just do flashcards in clever ways,” and this sounds wrong to me. One intuition pump: does Terence Tao use flashcards?

• That makes sense. My main question is: where is the clear evidence of human negligibility in chess? People seem to be misleadingly confident about this proposition (in general; I’m not targeting your post).

When a friend showed me the linked post, I thought “oh wow that really exposes some flaws in my thinking surrounding humans in chess.” I believe some of these flaws came from hearing assertive statements from other people on this topic. As an example, here’s Sam Harris during his interview with Eliezer Yudkowsky (transcript, audio):

Obviously we’ll be getting better and better at building narrow AI. Go is now, along with Chess, ceded to the machines. Although I guess probably cyborgs—human-computer teams—may still be better for the next fifteen days or so against the best machines. But eventually, I would expect that humans of any ability will just be adding noise to the system, and it’ll be true to say that the machines are better at chess than any human-computer team.

(In retrospect, this is a very weird assertion. Fifteen days? I thought he was talking about Go, but the last sentence makes it sound like he’s talking about chess.)

• AIs overtake humans. Humans become obsolete and their contribution is negligible to negative.

I’m confused why chess is listed as an example here. This StackExchange post suggests that cyborg teams are still better than chess engines. Overall, I’m struggling to find evidence for or against this claim (that humans are obsolete in chess), even though it’s a pretty common point in discussions about AI.

• A conceptual Doppelgänger of some concept Z, is a concept Z’ that serves some overlapping functions in the mind as Z serves, but is psychically distinct from Z.

What is a concrete example of a conceptual Doppelgänger?

• I think it’s worth noting Joe Carlsmith’s thoughts on this post, available starting on page 7 of Kokotajlo’s review of Carlsmith’s power-seeking AI report (see this EA Forum post for other reviews).

JC: I do think that the question of how much probability mass you concentrate on APS-AI by 2030 is helpful to bring out – it’s something I’d like to think more about (timelines wasn’t my focus in this report’s investigation), and I appreciate your pushing the consideration.

I read over your post on +12 OOMs, and thought a bit about your argument here. One broad concern I have is that it seems like it rests a good bit (though not entirely) on a “wow a trillion times more compute is just so much isn’t it” intuition pump about how AI capabilities scale with compute inputs, where the intuition has a quasi-quantitative flavor, and gets some force from some way in which big numbers can feel abstractly impressive (and from being presented in a context of enthusiasm about the obviousness of the conclusion), but in fact isn’t grounded in much. I’d be interested, for example, to see how this methodology looks if you try running it in previous eras without the benefit of hindsight (e.g., what % do you want on each million-fold scale up in compute-for-the-largest-AI-experiment). That said, maybe this ends up looking OK in previous eras too, and regardless, I do think this era is different in many ways: notably, getting in the range of various brain-related biological milestones, the many salient successes (which GPT-7 and OmegaStar extrapolate from), and the empirical evidence of returns to ML-style scaling. And I think the concreteness of the examples you provide is useful, and differentiating from mere hand-waves at big numbers.

Those worries aside, here’s a quick pass at some probabilities from the exercise, done for “2020 techniques” (I’m very much making these up as I go along, I expect them to change as I think more).

A lot of the juice, for me, comes from GPT-7 and Omegastar as representatives of “short and low-end-of-medium-to-long horizon neural network anchors”, which seem to me the most plausible and the best-grounded quantitatively.

• In particular, I agree that if scaling up and fine-tuning multi-modal short-horizon systems works for the type of model sizes you have in mind, we should think that less than 1e35 FLOPs is probably enough – indeed, this is where a lot of my short-timelines probability comes from. Let’s say 35% on this.

• It’s less clear to me what AlphaStar-style training of a human-brain-sized system on e.g. 30k consecutive Steam Games (plus some extra stuff) gets you, but I’m happy to grant that 1e35 FLOPs gives you a lot of room to play even with longer-horizon forms of training and evolution-like selection. Conditional on the previous bullet not working (which would update me against the general compute-centric, 2020-technique-enthusiast vibe here), let’s say another 40% that this works, so out of a remaining 65%, that’s 26% on top.

• I’m skeptical of Neuromorph (I think brain scanning with 2020 tech will be basically unhelpful in terms of reproducing useful brain stuff that you can’t get out of neural nets already, so whether the neuromorph route works is ~entirely correlated with whether the other neural net routes work), and Skunkworks (e.g., extensive search and simulation) seems like it isn’t focused on APS-systems in particular and does worse on a “why couldn’t you have said this in previous eras” test (though maybe it leads to stuff that gets you APS systems – e.g., better hardware). Still, there’s presumably a decent amount of “other stuff” not explicitly on the radar here. Conditional on previous bullet points not working (again, an update towards pessimism), probability that “other stuff” works? Idk… 10%? (I’m thinking of the previous, ML-ish bullets as the main meat of “2020 techniques.”) So that would be 10% of a remaining 39%, so ~4%.

So overall 35%+26%+4% =~65% on 1e35 FLOPs gets you APS-AI using “2020 techniques” in principle? Not sure how I’ll feel about this on more reflection + consistency checking, though. Seems plausible that this number would push my overall p(timelines) higher (they’ve been changing regardless since writing the report), which is points in favor of your argument, but it also gives me pause about ways the exercise might be distorting. In particular, I worry the exercise (at least when I do it) isn’t actually working with a strong model of how compute translates into concrete results, or tracking other sources of uncertainty and/​or correlation between these different paths (like uncertainty about brain-based estimates, scaling-centrism, etc – a nice thing about Ajeya’s model is that it runs some of this uncertainty through a monte carlo).

OK, what about 1e29 or less? I’ll say: 25%. (I think this is compatible with a reasonable degree of overall smoothness in distributing my 65% across my OOMs).

In general, though, I’d also want to discount in a way that reflects the large amount of engineering hassle, knowledge build-up, experiment selection/design, institutional buy-in, other serial-time stuff, data collection, etc required for the world to get into a position where it’s doing this kind of thing successfully by 2030, even conditional on 1e29 being enough in principle (I also don’t take $50B on a single training run by 2030 for granted even in worlds where 1e29 is enough, though I grant that WTP could also go higher). Indeed, this kind of thing in general makes the exercise feel a bit distorting to me. E.g., “today’s techniques” is kind of ambiguous between “no new previously-super-secret sauce” and “none of the everyday grind of figuring out how to do stuff, getting new non-breakthrough results to build on and learn from, gathering data, building capacity, recruiting funders and researchers, etc” (and note also that in the world you’re imagining, it’s not that our computers are a million times faster; rather, a good chunk of it is that people have become willing to spend much larger amounts on gigantic training runs – and in some worlds, they may not see the flashy results you’re expecting to stimulate investment unless they’re willing to spend a lot in the first place). Let’s cut off 5% for this.

So granted the assumptions you list about compute availability, this exercise puts me at ~20% by 2030, plus whatever extra from innovations in techniques we don’t think of as covered via your 1-2 OOMs of algorithmic improvement assumption. This feels high relative to my usual thinking (and the exercise leaves me with some feeling that I’m taking a bunch of stuff in the background for granted), but not wildly high.
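The numbers in the quoted estimate can be reproduced as a chain of conditional probabilities. This is only a sketch of the arithmetic as I read it: the variable names are mine, and I am assuming that "cut off 5%" means subtracting five percentage points, which is what makes the ~20% figure come out.

```python
# Probabilities stated in the quoted bullets (for <= 1e35 FLOPs, "2020 techniques").
p_short_horizon = 0.35            # scaling up multi-modal short-horizon systems works
p_long_given_not_short = 0.40     # longer-horizon training works, given the above fails
p_other_given_neither = 0.10      # "other stuff" works, given both of the above fail

remaining = 1.0 - p_short_horizon                 # 0.65 left after the first route
p_long = p_long_given_not_short * remaining       # 0.26 "on top"
remaining -= p_long                               # 0.39 left
p_other = p_other_given_neither * remaining       # 0.039, quoted as ~4%

p_1e35 = p_short_horizon + p_long + p_other       # 0.649, quoted as ~65%

# The 2030 estimate: 25% on 1e29 or less, minus a discount for practical hurdles.
p_1e29 = 0.25
practical_discount = 0.05  # assumption: read as percentage points, not a relative cut
p_by_2030 = p_1e29 - practical_discount           # 0.20, quoted as ~20%

print(round(p_1e35, 3), round(p_by_2030, 2))
```

Note that the three routes are combined as mutually exclusive slices of probability, so correlations between them (which the quote itself worries about) are not modeled here.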

# Next steps after AGISF at UMich

25 Jan 2023 20:57 UTC
10 points
0 comments · 5 min read · LW link
(docs.google.com)