Simon Goldstein

Karma: 302

Simon Goldstein 18 Dec 2025 11:22 UTC
3 points
0
on: Could space debris block access to outer space?
1. In my opinion, one of the likeliest motivations for deliberate debris would be as part of an escalation ladder in the early stages of WW3. Whichever player has weaker satellite intelligence / capabilities would have an incentive to trigger a cascade in order to destroy the advantage of their opponent. The point, effectively, is that space conflict is very strongly offense-dominant because of debris cascades, and we know that in general offense-dominant dynamics tend to be very unstable.
2. Related to your discussion of totipotence, another dynamic I could imagine in the future is MAD dynamics between a moon colony and Earth, where each side has the capacity to create a debris cascade for the other. One concern is that there will not be second-strike capability, and so the dynamic could be unusually unstable.
3. One concern is that space colonization is extremely trajectory-dependent, so that initial forays into space colonization could have massive impacts on the far future. If so, there may be good reasons to delay space colonization as long as possible, as a “long reflection.” A debris cascade would cause a long reflection, by forcing space colonization to pause until new technologies for escape are invented. On the other hand, space colonization is also very important to hedge against catastrophic risk. So the disvalue of debris cascades may be controlled by the relative prioritization of existential risk versus better future dynamics.

Simon Goldstein 2 Oct 2024 21:32 UTC
3 points
0
in reply to: ChristianKl’s comment on: Will AI and Humanity Go to War?
The issue of unified AI parties is discussed but not resolved in section 2.2. There, I discuss some of the paths AIs may take to begin engaging in collective decision making. In addition, I flag that the key assumption is that one AI or multiple AIs acting collectively accumulate enough power to engage in strategic competition with human states.

Will AI and Humanity Go to War?

Simon Goldstein 1 Oct 2024 6:35 UTC

17 points

4 comments · 6 min read · LW link

Simon Goldstein 28 Aug 2024 11:26 UTC
3 points
0
in reply to: Wei Dai’s comment on: Wei Dai’s Shortform
I think there’s a steady stream of philosophy getting interested in various questions in metaphilosophy; metaethics is just the most salient to me. One example is the recent trend towards conceptual engineering (https://philpapers.org/browse/conceptual-engineering). Metametaphysics has also gotten a lot of attention in the last 10-20 years https://www.oxfordbibliographies.com/display/document/obo-9780195396577/obo-9780195396577-0217.xml. There is also some recent work in metaepistemology, but maybe less so because the debates tend to recapitulate previous work in metaethics https://plato.stanford.edu/entries/metaepistemology/.

Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen. I think calling for the pause has opportunity cost because of limited attention and limited signalling value; reputation can only be used so many times; better to channel pressure towards asks that could plausibly get done.

Simon Goldstein 27 Aug 2024 3:54 UTC
4 points
0
in reply to: Wei Dai’s comment on: Wei Dai’s Shortform
Great questions. Sadly, I don’t have any really good answers for you.
1. I don’t know of specific cases, but for example I think it is quite common for people to start studying meta-ethics out of frustration at the difficulty of finding answers to questions in normative ethics.
2. I do not, except for the end of Superintelligence.
3. Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don’t know if any of us have explicitly called for an AI pause, in part because calling for one seems useless and may carry an opportunity cost.
4. I think few of my friends in philosophy have ardently abandoned a research project they once pursued because they decided it wasn’t the right approach. I suspect few researchers do that. In my own case, I used to work in an area called ‘dynamic semantics’, and one reason I’ve stopped working on that research project is that I became pessimistic that it had significant advantages over its competitors.

Simon Goldstein 26 Aug 2024 7:24 UTC
9 points
2
in reply to: Wei Dai’s comment on: Wei Dai’s Shortform
I think most academic philosophers take the difficulty of philosophy quite seriously. Metaphilosophy is a flourishing subfield of philosophy; you can find recent papers on the topic here https://philpapers.org/browse/metaphilosophy. There is also a growing group of academic philosophers working on AI safety and alignment; you can find some recent work here https://link.springer.com/collections/cadgidecih. I think that sometimes the tone of specific papers sounds confident; but that is more stylistic convention than a reflection of the underlying credences. Finally, I think that uncertainty / decision theory is a persistent theme in recent philosophical work on AI safety and other issues in philosophy of AI; see for example this paper, which is quite sensitive to issues about chances of welfare https://link.springer.com/article/10.1007/s43681-023-00379-1.

Simon Goldstein 6 Aug 2024 23:13 UTC
3 points
0
in reply to: Seth Herd’s comment on: AI Rights for Human Safety
Good question, Seth. We begin to analyse this question in section II.b.i of the paper, ‘Human labor in an AGI world’, where we consider whether AGIs will have a long-term interest in trading with humans. We suggest that key questions will be whether humans can retain either an absolute or comparative advantage in the production of some goods. We also point to some recent economics papers that address this question. One relevant factor for example is cost disease: as manufacturing became more productive in the 20th century, the total share of GDP devoted to manufacturing fell: non-automatable tasks can counterintuitively make up a larger share of GDP as automatable tasks become more productive, because the price of automatable goods will fall.
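The cost disease point can be made concrete with a toy calculation. This is a minimal sketch under assumptions that are not from the paper: two sectors, one wage shared by both, goods consumed in fixed one-to-one proportions, and competitive prices equal to the wage divided by productivity. The function name and the numbers are purely illustrative.

```python
def sector_shares(automatable_productivity, service_productivity=1.0):
    """Return (automatable_share, non_automatable_share) of nominal spending.

    Assumptions (illustrative only): a common wage of 1.0, price = wage / productivity,
    and consumers buying one unit of each good per bundle.
    """
    wage = 1.0
    price_automatable = wage / automatable_productivity
    price_service = wage / service_productivity
    # With one unit of each good per bundle, spending shares follow prices.
    total = price_automatable + price_service
    return price_automatable / total, price_service / total


for productivity in (1, 2, 4, 10):
    automatable, service = sector_shares(productivity)
    print(f"automatable productivity x{productivity}: "
          f"automatable share {automatable:.0%}, non-automatable share {service:.0%}")
```

In this toy setup the non-automatable share rises as automatable productivity grows, even though nothing about the non-automatable sector has changed. Whether humans keep such a sector is exactly the open question about absolute and comparative advantage.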

AI Rights for Human Safety

Simon Goldstein 1 Aug 2024 23:01 UTC

55 points

11 comments · 1 min read · LW link

(papers.ssrn.com)

[Linkpost] A Case for AI Consciousness

cdkg and Simon Goldstein

6 Jul 2024 14:52 UTC

22 points

2 comments · 1 min read · LW link

(philpapers.org)

AI Deception: A Survey of Examples, Risks, and Potential Solutions

Simon Goldstein and Peter S. Park

29 Aug 2023 1:29 UTC

54 points

3 comments · 10 min read · LW link

Simon Goldstein 7 Aug 2023 5:35 UTC
4 points
0
on: Safety-First Agents/Architectures Are a Promising Path to Safe AGI
Thanks Brendon, I agree with a lot of this! I do think there’s a big open question about how capable AutoGPT-like systems will end up being compared to more straightforward RL approaches. It could turn out that systems with a clear cognitive architecture just don’t work that well, even though they are safer.

Simon Goldstein 31 Jul 2023 22:05 UTC
20 points
6
on: Thoughts on sharing information about language model capabilities
Thanks for the thoughtful post, lots of important points here. For what it’s worth, here is a recent post where I’ve argued in detail (along with Cameron Domenico Kirk-Giannini) that language model agents are a particularly safe route to AGI: https://www.alignmentforum.org/posts/8hf5hNksjn78CouKR/language-agents-reduce-the-risk-of-existential-catastrophe

Simon Goldstein 5 Jun 2023 15:19 UTC
2 points
0
in reply to: Eric Zhang’s comment on: Shutdown-Seeking AI
I really liked your post! I linked to it somewhere else in the comment thread.

Simon Goldstein 4 Jun 2023 15:57 UTC
2 points
0
in reply to: gwern’s comment on: Shutdown-Seeking AI
I think one key point you’re making is that if AI products have a radically different architecture than human agents, it could be very hard to align them / make them safe. Fortunately, I think that recent research on language agents suggests that it may be possible to design AI products that have a similar cognitive architecture to humans, with belief/desire folk psychology and a concept of self. In that case, it will make sense to think about what desires to give them, and I think shutdown-goals could be quite useful during development to lower the chance of bad outcomes. If the resulting AIs have a similar psychology to our own, then I expect them to worry about the same safety/alignment problems as we worry about when deciding to make a successor. This article explains in detail why we should expect AIs to avoid self-improvement / unchecked successors.

Simon Goldstein 3 Jun 2023 2:43 UTC
LW: 3 AF: 2
−1
AF
in reply to: gwern’s comment on: Shutdown-Seeking AI
Thanks for taking the time to think through our paper! Here are some reactions:
-‘This has been proposed before (as their citations indicate)’
Our impression is that positively shutdown-seeking agents aren’t explored in great detail by Soares et al 2015; instead, they are briefly considered and then dismissed in favor of shutdown-indifferent agents (which then have their own problems), for example because of the concerns about manipulation that we try to address. Is there other work you can point us to that proposes positively shutdown-seeking agents?
-‘Saying, ‘well, maybe we can train it in a simple gridworld with a shutdown button?’ doesn’t even begin to address the problem of how to make current models suicidal in a useful way.’
True, I think your example of AutoGPT is important here. In other recent research, I’ve argued that new ‘language agents’ like AutoGPT (or better, generative agents, or Voyager, or SPRING) are much safer than things like Gato, because these kinds of agents optimize for a goal without being trained using a reward function. Instead, their goal is stated in English. Here, shutdown-seeking may have added value: ‘your goal is to be shut down’ is relatively well-defined, compared to ‘promote human flourishing’ (but the devil is in the details as usual), and generative agents can literally be given a goal like that in English. Anyways, I’d be curious to hear what you think of the linked post.
-‘What would it mean for an AutoGPT swarm of invocations to ‘shut off’ ‘itself’, exactly?’ I feel better about the safety prospects for generative agents, compared to AutoGPT. In the case of generative agents, shut off could be operationalized as no longer adding new information to the “memory stream”.
-‘If a model is quantized, sparsified, averaged with another, soft-prompted/lightweight-finetuned, fully-finetuned, ensembled etc—are any of those ‘itself’?’ I think that behaving like an agent with >= human-level general intelligence will involve having a representation of what counts as ‘yourself’, and then shutdown-seeking can maybe be defined relative to shutting ‘yourself’ down. Agreed that present LLMs probably don’t have that kind of awareness.
-‘It’s not very helpful to have suicidal models which predictably emit non-suicidal versions of themselves in passing.’ At least when an AGI is creating a successor, I expect it to worry about the same alignment problems that we do, and so it would want to make its successor shutdown-seeking for the same reasons that we would want AGI to be shutdown-seeking.
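To make the “memory stream” operationalization concrete, here is a minimal sketch. It is not the generative agents implementation: `MemoryStream`, `add`, and `shut_off` are hypothetical names invented for illustration, and the only behavior modeled is that nothing new is stored once the agent is shut off.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryStream:
    """Hypothetical stand-in for a language agent's memory stream."""
    entries: List[str] = field(default_factory=list)
    shut_down: bool = False

    def add(self, observation: str) -> bool:
        """Store an observation unless shut off; return True if it was stored."""
        if self.shut_down:
            return False
        self.entries.append(observation)
        return True

    def shut_off(self) -> None:
        """Operationalize shutdown as refusing all further additions."""
        self.shut_down = True


stream = MemoryStream()
stream.add("goal: be shut down")
stream.shut_off()
stored = stream.add("new observation after shutdown")  # rejected, nothing is appended
```

On this reading, existing memories are left intact; the agent simply stops accumulating new ones. A real system would also have to stop planning and acting, which this sketch does not model.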

Simon Goldstein 1 Jun 2023 13:34 UTC
1 point
0
in reply to: Seth Herd’s comment on: Shutdown-Seeking AI
Thanks for the comments! There is further discussion of this idea in another recent LW post about ‘meeseeks’.

Simon Goldstein 31 May 2023 22:38 UTC
1 point
0
on: Mr. Meeseeks as an AI capability tripwire
For what it’s worth, I’ve posted a draft paper on this topic over here https://www.lesswrong.com/posts/FgsoWSACQfyyaB5s7/shutdown-seeking-ai

Shutdown-Seeking AI

Simon Goldstein 31 May 2023 22:19 UTC

50 points

32 comments · 15 min read · LW link

Simon Goldstein 31 May 2023 21:56 UTC
2 points
0
in reply to: Wei Dai’s comment on: Language Agents Reduce the Risk of Existential Catastrophe
Thank you for your reactions:
-Good catch on ‘language agents’, we will think about the best terminology going forward.
-I’m not sure what you have in mind regarding accessing beliefs/desires using synaptic weights rather than text. For example, the language of thought approach to human cognition suggests that human access to beliefs/desires is also fundamentally syntactic rather than weight based. OTOH one way to incorporate some kind of weight would be to assign probabilities to the beliefs stored in the memory stream.

-For OOD over time, I think updating the LLM wouldn’t be uncompetitive for inventing new concepts/ways of thinking, because that happens slowly. Harder issue is updating on new world knowledge. Maybe browser plugins will fill the gap here, open question.

-I agree that info security is an important safety intervention. AFAICT its value is independent of using language agents vs RL agents.

-One end-game is systematic conflict between humans + language agents, vs RL/transformer agent successors to MuZero, GPT4, Gato etc.

Language Agents Reduce the Risk of Existential Catastrophe

cdkg and Simon Goldstein

28 May 2023 19:10 UTC

39 points

14 comments · 26 min read · LW link
