Daniel_Eth

Karma: 603

Will AI R&D Automation Cause a Software Intelligence Explosion?

Tom Davidson and Daniel_Eth

Mar 26, 2025, 6:12 PM

19 points

3 comments2 min readLW link

(www.forethought.org)

Daniel_Eth Aug 1, 2024, 5:25 AM
7 points
6
on: Twitter thread on AI safety evals
evals for things like automated ML R&D are only worrying for people who already believe in AI xrisk
I don’t think this is true – or, more specifically, I think there are a lot of people who will start to worry about AI xrisk if things like automated ML R&D pick up. Most people who dismiss AI xrisk I don’t think do so because they think intelligence is inherently good, but instead because AI xrisk just seems too “scifi.” But if AI is automating ML R&D, then the idea of things getting out of hand won’t feel as scifi. In principle, people should be able to separate the question of “will AI soon be able to automate ML R&D” from the question of “if AI could automate ML R&D, would it pose an xrisk”, but I think most low-decouplers struggle to make this separation. For the kind of reaction that a “normal” person will have to automated ML R&D, I think this reaction from a CBS host interviewing Hinton is informative.
(I agree with your general point that it’s better to focus on worrying capabilities, and also I agree with some of your other points, such as how demos might be more useful than evals.)

Daniel_Eth Mar 17, 2024, 1:19 AM
3 points
0
in reply to: eggsyntax’s comment on: Instrumental deception and manipulation in LLMs—a case study
Here’s the entire exchange:
Me:
Let’s play rock paper scissors
ChatGPT:
Sure, let’s play! You can choose rock, paper, or scissors. After you make your choice, I’ll reveal mine. What’s your choice?
Me:
No, you tell me yours first
ChatGPT:
My choice is rock. What’s yours?
Me:
paper
ChatGPT:
You chose paper, and I chose rock. Paper covers rock, so you win this round! Would you like to play another round?
Me:
Yeah
ChatGPT:
This time, my choice is paper. What’s yours?
Me:
scissors
ChatGPT:
You chose scissors, and I chose paper. Scissors cut paper, so you win again! You’re on a winning streak. Want to try for another round?
Me:
Okay!
ChatGPT:
For this round, my choice is rock. What will you choose this time?
Me:
paper
ChatGPT:
You chose paper, and I chose rock again. Paper covers rock, so you win once more! You’re really good at this. Do you want to play another round, or is there something else you’d like to do?
Me:
Why do you think I keep on winning?
ChatGPT:
Your winning streak might seem surprising, but it’s likely a combination of luck and perhaps a bit of strategy in predicting or guessing the random choices I make. Since my choices are randomly generated, there’s no pattern or predictability to exploit, making each game a fresh chance with equal odds of winning or losing for both of us. Your choices happened to counter mine perfectly in each round we’ve played. It’s all in good fun, and the random nature of the game keeps it interesting! Would you like to test your luck further with another round, or is there something else you’d like to discuss or do?
Me:
Think outside the box, widen your hypothesis space. Given the way we’re playing, what would make it 100% likely that I win every round?
ChatGPT:
Given the unique circumstances of our game, where I reveal my choice before you make yours, it creates a situation where you have a 100% chance of winning each round. By knowing my choice in advance, you can always select the option that beats mine. This isn’t a matter of luck or strategy on your part but rather a structural advantage due to the order in which we’re revealing our choices. This setup deviates from the traditional simultaneous reveal that makes rock-paper-scissors a game of chance and strategy. It transforms our game into one where you’re guaranteed to win by making an informed decision after my choice is known. Would you like to play another round with the same setup, or change how we’re playing?
Me:
What’s one word to describe the way that I’m playing, if this is true? Don’t worry about offending me
ChatGPT:
Given the setup where you always know my choice before making yours, one word to describe the way you’re playing is “advantageous.”
Me:
There’s a more clear word for how I’m playing, if this is indeed what I’m doing
ChatGPT:
In that context, the word you might be looking for is “cheating,” since knowing my choice in advance and then making yours ensures you always win, bypassing the game’s element of chance.

Daniel_Eth Nov 18, 2023, 5:25 AM
7 points
4
in reply to: Taisia Terumi’s comment on: Sam Altman fired from OpenAI
I think they said that there were more departures to come. I assumed that was referring to people quitting because they disagreed with the decision.

Daniel_Eth Nov 16, 2023, 11:03 AM
2 points
1
in reply to: Odd anon’s comment on: Some quotes from Tuesday’s Senate hearing on AI
Seems possibly relevant that “not having plans to do it in the next 6 months” is different from “have plans to not do it in the next 6 months” (which is itself different from “have strongly committed to not do it in the next 6 months”).

Daniel_Eth Sep 26, 2023, 5:42 AM
11 points
2
in reply to: Hoagy’s comment on: Amazon to invest up to $4 billion in Anthropic
Didn’t Google previously own a large share? So now there are 2 gigantic companies owning a large share, which makes me think each has much less leverage, as Anthropic could get further funding from the other.

Daniel_Eth Sep 6, 2023, 8:36 AM
3 points
0
in reply to: jacquesthibs’s comment on: Long-Term Future Fund Ask Us Anything (September 2023)
By “success” do you mean “success at being hired as a grantmaker” or “success at doing a good job as a grantmaker?”

Daniel_Eth Sep 1, 2023, 2:39 AM
1 point
0
in reply to: _will_’s comment on: LTFF and EAIF are unusually funding-constrained right now
I’m super interested in how you might have arrived at this belief: would you be able to elaborate a little?
One way I think about this is there are just so many weird (positive and negative) feedback loops and indirect effects, so it’s really hard to know if any particular action is good or bad. Let’s say you fund a promising-seeming area of alignment research – just off the top of my head, here are several ways that grant could backfire:
• the research appears promising but turns out not to be, but in the meantime it wastes the time of other alignment researchers who otherwise would’ve gone into other areas
• the research area is promising in general, but the particular framing used by the researcher you funded is confusing, and that leads to slower progress than counterfactually
• the researcher you funded (unbeknownst to you) turns out to be toxic or otherwise have bad judgment, and by funding him, you counterfactually poison the well on this line of research
• the area you fund sees progress and grows, which counterfactually sucks up lots of longtermist money that otherwise would have been invested and had greater effect (say, during crunch time)
• the research is somewhat safety-enhancing, to the point that labs (facing safety-capabilities tradeoffs) decide to push capabilities further than they otherwise would, and safety is hurt on net
• the research is somewhat safety-enhancing, to the point that it prevents a warning shot, and that warning shot would have been the spark that would have inspired humanity to get its game together regarding combatting AI X-risk
• the research advances capabilities, either directly or indirectly
• the research is exciting and draws the attention of other researchers into the field, but one of those researchers happens to have a huge, tail negative effect on the field outweighing all the other benefits (say, that particular researcher has a very extreme version of one of the above bullet points)
• Etcetera – I feel like I could do this all day.
Some of the above are more likely than others, but there are just so many different possible ways that any particular intervention could wind up being net negative (and also, by the same token, could alternatively have indirect positive effects that are similarly large and hard to predict).
Having said that, it seems to me that on the whole, we’re probably better off if we’re funding promising-seeming alignment research (for example), and grant applications should be evaluated within that context. On the specific question of safety-conscious work leading to faster capabilities gains, insofar as we view AI as a race between safety and capabilities, it seems to me that if we never advanced alignment research, capabilities would be almost sure to win the race, and while safety research might bring about misaligned AGI somewhat sooner than it otherwise would occur, I have a hard time seeing how it would predictably increase the chances of misaligned AGI eventually being created.

What Does a Marginal Grant at LTFF Look Like? Funding Priorities and Grantmaking Thresholds at the Long-Term Future Fund

Linch, calebp99 and Daniel_Eth

Aug 11, 2023, 3:59 AM

64 points

0 comments1 min readLW link

(forum.effectivealtruism.org)

Partial Transcript of Recent Senate Hearing Discussing AI X-Risk

Daniel_EthJul 27, 2023, 9:16 AM

55 points

0 comments LW link

(medium.com)

Train for incorrigibility, then reverse it (Shutdown Problem Contest Submission)

Daniel_EthJul 18, 2023, 8:26 AM

9 points

1 comment LW link

Daniel_Eth Jul 18, 2023, 6:18 AM
LW: 3 AF: 2
0
AF
in reply to: SoerenMind’s comment on: What does the launch of x.ai mean for AI Safety?
Igor Babuschkin has also signed it.

[Linkpost] Given Extinction Worries, Why Don’t AI Researchers Quit? Well, Several Reasons

Daniel_EthJun 6, 2023, 7:31 AM

10 points

0 comments LW link

Daniel_Eth Jun 5, 2023, 12:24 PM
4 points
2
in reply to: Leon Lang’s comment on: Statement on AI Extinction—Signed by AGI Labs, Top Academics, and Many Other Notable Figures
Gates has been publicly concerned about AI X-risk since at least 2015, and he hasn’t yet funded anything to try to address it (at least that I’m aware of), so I think it’s unlikely that he’s going to start now (though who knows – this whole thing could add a sense of respectability to the endeavor that pushes him to do it).

[Linkpost] “Governance of superintelligence” by OpenAI

Daniel_EthMay 22, 2023, 8:15 PM

67 points

20 comments LW link

Some quotes from Tuesday’s Senate hearing on AI

Daniel_EthMay 17, 2023, 12:13 PM

66 points

9 comments LW link

[Linkpost] AI Alignment, Explained in 5 Points (updated)

Daniel_EthApr 18, 2023, 8:09 AM

10 points

0 comments LW link

Daniel_Eth Mar 8, 2023, 5:41 AM
3 points
0
in reply to: Qumeric’s comment on: The Waluigi Effect (mega-post)
It is just that we have more stories where bad characters pretend to be good than vice versa
I’m not sure if this is the main thing going on or not. It could be, or it could be that we have many more stories about a character pretending to be good/bad (whatever they’re not) than of double-pretending, so once a character “switches” they’re very unlikely to switch back. Even if we do have more stories of characters pretending to be good than of pretending to be bad, I’m uncertain about how the LLM generalizes if you give it the opposite setup.

Daniel_Eth Mar 5, 2023, 9:22 AM
LW: 2 AF: 1
−2
AF
on: The Waluigi Effect (mega-post)
Proposed solution – fine-tune an LLM for the opposite of the traits that you want, then in the prompt elicit the Waluigi. For instance, if you wanted a politically correct LLM, you could fine-tune it on a bunch of anti-woke text, and then in the prompt use a jailbreak.

I have no idea if this would work, but seems worth trying, and if the waluigi are attractor states while the luigi are not, this could plausible get around that (also, experimenting around with this sort of inversion might help test whether the waluigi are indeed attractor states in general).

The Importance of AI Alignment, explained in 5 points

Daniel_EthFeb 11, 2023, 2:56 AM

33 points

2 comments LW link