I’ve recently started to read a textbook by Hilbert. Consider this a rookie’s attempt at formality, where a short paragraph of normal sentences would suffice to express the same idea. Feel free to mutate or mutilate it.
Assumptions
If someone has a Giant Anteater, it can be used to find flaws in OpenSSL. | Empirically demonstrated
If you can build a Giant Anteater, so can others. | Assumption of a shared capability front
If a Giant Anteater can find flaws in OpenSSL, it can find flaws in most other OSS of equal or lower quality. | Assumption of a generalized capability
OpenSSL is of high quality. | Assumption of a relevant instance
If enough people possess the ability to attack various high-quality OSS, many relevant systems will be targeted and compromised. | Assumption of the presence of malevolent intent in a subset of any large enough set of people
Derivation
Given: You can build a Giant Anteater.
Others have a Giant Anteater. | apply 2
Others can find flaws in OpenSSL. | apply 1
Others can find flaws in OSS of equal or lower quality than OpenSSL. | apply 3
Others can find flaws in high-quality OSS. | apply 4
Either many relevant systems are already targeted and compromised or they will be as soon as enough people catch up. | apply 5
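To check that the chain of modus ponens actually goes through, here is a sketch of the derivation in Lean. All predicate names are mine, "building implies having" is made explicit as its own axiom, and assumptions 3 and 4 are merged into a single generalization step:

```lean
-- Hypothetical formalization of the assumptions above.
axiom Person : Type
axiom you : Person
axiom others : Person
axiom canBuild : Person → Prop
axiom hasAnteater : Person → Prop
axiom findsFlawsOpenSSL : Person → Prop
axiom findsFlawsHighQualityOSS : Person → Prop

-- Assumption 2: shared capability front.
axiom sharedFront : canBuild you → canBuild others
-- Implicit step: building one implies having one.
axiom buildImpliesHas : ∀ p, canBuild p → hasAnteater p
-- Assumption 1: empirically demonstrated.
axiom anteaterFindsFlaws : ∀ p, hasAnteater p → findsFlawsOpenSSL p
-- Assumptions 3 and 4 combined: generalization to high-quality OSS.
axiom generalizes : ∀ p, findsFlawsOpenSSL p → findsFlawsHighQualityOSS p

-- The derivation: from "you can build one" to "others can attack high-quality OSS".
theorem others_can_attack (h : canBuild you) : findsFlawsHighQualityOSS others :=
  generalizes others (anteaterFindsFlaws others (buildImpliesHas others (sharedFront h)))
```

Assumption 5 then turns the conclusion into the disjunction in the last derivation step.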
We do this to secure the software infrastructure of human civilization before strong AI systems become ubiquitous. Prosaically, we want to make sure we don’t get hacked into oblivion the moment they come online.
Given your existence proof (the Giant Anteater) and its implication (malevolent actors will acquire similar capabilities and apply them soon), your system and its successors seem to require rapid and widespread application, which in turn may require scaling up the underlying processes, possibly by distributing them to other trustworthy actors with benevolent intent and sufficient resources.
“The Algorithm” is in the hands of very few actors. This is the prime gear where “Evil People have figured it out, and hold The Power” isn’t a fantasy. There would be many obvious improvements if it were in adult hands.
I think there might be a confusion between optimizing for an instrumental vs. an upper-level goal. Is maintaining good epistemics more relevant than working on the right topic? To me the rigor of an inquiry seems secondary to choosing the right subject.
I also had to look it up and got interested in testing whether or how it could apply.
Here’s an explanation of Bulverism that suggests a concrete logical form of the fallacy:
Person 1 makes argument X.
Person 2 assumes person 1 must be wrong because of their Y (e.g. suspected motives, social identity, or other characteristic associated with their identity).
Therefore, argument X is flawed or not true.
Here’s a possible assignment for X and Y that tries to remain rather general:
X = Doom is plausible because …
Y = Trauma / Fear / Fixation
Why would that be a fallacy? Whether an argument is true or false depends on the structure and content of the argument, not on the source of the argument (genetic fallacy), and not on a property of the source that gets equated with being wrong (circular reasoning). Whether an argument for doom is true does not depend on who is arguing for it, and being traumatized does not automatically imply being wrong.
Here’s another possible assignment for X and Y that tries to be more concrete. To be able to do so, “Person 1” is also replaced by more than one person, now called “Group 1”:
X (from AI 2027) = A takeover by an unaligned superintelligence by 2030 is plausible because …
Y (from the post) = “lots of very smart people have preverbal trauma” and “embed that pain such that it colors what reality even looks like at a fundamental level”, so “there’s something like a traumatized infant inside such people” and “its only way of “crying” is to paint the subjective experience of world in the horror it experiences, and to use the built-up mental edifice it has access to in order to try to convey to others what its horror is like”.
From looking at this, I think the post suggests a slightly stronger logical form that extends step 3:
Group 1 makes argument X.
Person 2 assumes group 1 must be wrong because of their Y (e.g. suspected motives, social identity, or other characteristic associated with their identity).
Therefore, argument X is flawed or not true AND group 1 can’t evaluate its truth value because of their Y.
From this, I think one can see that not only does Bulverism make the model a bit suspicious; two additional aspects come into play:
If Group 1 is the LessWrong community, then there are also people outside of it who predict that there’s an existential risk from AI and that timelines might be short. How can argument X from these people become wrong by Group 1 entering the stage, and would it still be true if Group 1 were doing something else?
I think it’s fair to say that step 3 introduces an aspect that’s adjacent to gaslighting, i.e. manipulating someone into questioning their perception of reality. Even if it’s done in a well-meaning way, since some people’s perception of reality is indeed flawed and they might profit from becoming aware of it, the way it is woven into the argument doesn’t seem that benign anymore. I suppose that might be the source of some people getting annoyed by the post.
“it’s psychologically appealing to have a hypothesis that means you don’t have to do any mundane work”
I don’t doubt that something like inverse bike-shedding can be a driving force for some individuals to focus on the field of AI safety. I highly doubt that it explains why the field and its risk predictions exist in the first place, or that their validity should be questioned on such grounds, but this seems to happen in the article if I’m not entirely misreading it. From my point of view, there is already an overemphasis on psychological factors in the broader debate, and it would be desirable to get back to the object level, be it with theoretical or empirical research, both of which have their value. This latter aspect seems to lead to a partial agreement here, even though there’s more than one path to arrive at it.
Point addressed with unnecessarily polemic tone:
“Suppose that what’s going on is, lots of very smart people have preverbal trauma.”
“consider the possibility that the person in question might not be perceiving the real problem objectively because their inner little one might be using it as a microphone and optimizing what’s “said” for effect, not for truth.”
It is alright to consider it. I find it implausible that a wide range of accomplished researchers lay out arguments, collect data, interpret what has and hasn’t been observed, and come to the conclusion that our current trajectory of AI development poses a significant amount of existential risk, potentially on short timelines, all because a majority of them have a childhood trauma that blurs their epistemology on this particular issue but not on others where success criteria could already be observed.
I’m close to getting a postverbal trauma from having to observe all the mental gymnastics around the question of whether building a superintelligence without having reliable methods to shape its behavior is actually dangerous. Yes, it is. No, that fact does not depend on whether Hinton, Bengio, Russell, Omohundro, Bostrom, Yudkowsky, et al. were held as a baby.
Further context about the “recent advancements in the AI sector have resolved this issue” paragraph:
Contained in a16z letter to UK parliament: https://committees.parliament.uk/writtenevidence/127070/pdf/
Contained in a16z letter to Biden, signed by Andreessen, Horowitz, LeCun, Carmack et al.: https://x.com/a16z/status/1720524920596128012
Carmack claiming not to have proofread it, both Carmack and Casado admitting the claim is false: https://x.com/GarrisonLovely/status/1799139346651775361
In case anyone got worried, OpenAI’s blog post Introducing Superalignment on July 5, 2023 contained two links for recruiting, one still working and the other not. From this we can deduce that superalignment has been reduced to an engineering problem, and therefore scientists like Ilya and Jan were able to move on to new challenges, such as spending the last normal summer in a nice location with close friends and family.
“Please apply for our research engineer and research scientist positions.”
edit: This comment was intended as a joke, an absurd inference that rhymes with how absurdly these companies treat AI safety.
I assume they can’t make a statement and that their choice of next occupation will be the clearest signal they can and will send out to the public.
He has a stance towards risk that is a necessary condition for becoming the CEO of a company like OpenAI, but not one that gives you a high probability of building a safe ASI:
https://blog.samaltman.com/what-i-wish-someone-had-told-me
“Inaction is a particularly insidious type of risk.”
https://blog.samaltman.com/how-to-be-successful
“Most people overestimate risk and underestimate reward.”
https://blog.samaltman.com/upside-risk
“Instead of downside risk [2], more investors should think about upside risk—not getting to invest in the company that will provide the return everyone is looking for.”
If everyone has his own asteroid impact, earth will not be displaced because the impulse vectors will cancel each other out on average*. This is important because it will keep the trajectory equilibrium of earth, which we have known for ages from animals jumping up and down all the time around the globe in their games of survival. If only a few central players get asteroid impacts, it’s actually less safe! Safety advocates might actually cause the very outcomes that they fear!
*I have a degree in quantum physics and can derive everything from my model of the universe. This includes moral and political imperatives that physics dictates and that most physicists thus advocate for.
We are decades if not centuries away from developing true asteroid impacts.
Given all the potential benefits there is no way we are not going to redirect asteroids to earth. Everybody will have an abundance of rare elements.
xlr8
Some context from Paul Christiano’s work on RLHF and a later reflection on it:
Christiano et al.: Deep Reinforcement Learning from Human Preferences
In traditional reinforcement learning, the environment would also supply a reward [...] and the agent’s goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments. [...] Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human. [...] After using r̂ to compute rewards, we are left with a traditional reinforcement learning problem.
Christiano: Thoughts on the impact of RLHF research
The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. [...] Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because:
Evaluating consequences is hard.
A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it’s challenging to evaluate treacherous turn probability at training time.
[...] I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive.
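The core mechanism in the first quote can be illustrated with a minimal, self-contained sketch: fitting a reward model from pairwise human preferences via the Bradley-Terry / logistic likelihood. Everything here is synthetic and simplified (a linear reward model over made-up segment features, a simulated overseer), not the paper's actual setup:

```python
import numpy as np

# Sketch of preference-based reward learning: fit a linear reward model
# r̂(s) = w · φ(s) from pairwise preferences over trajectory segments.
rng = np.random.default_rng(0)
dim = 4                                # feature dimension of a segment (illustrative)
w_true = rng.normal(size=dim)          # stands in for the human overseer's preferences

# Synthetic trajectory segments, summarized as feature vectors φ(s).
segments = rng.normal(size=(200, dim))

# The simulated "overseer" prefers the segment with higher true reward.
pairs = rng.integers(0, len(segments), size=(500, 2))
prefs = segments[pairs[:, 0]] @ w_true > segments[pairs[:, 1]] @ w_true

# Fit w by gradient ascent on the Bradley-Terry log-likelihood:
# P(a preferred over b) = sigmoid(r̂(a) - r̂(b)).
w = np.zeros(dim)
for _ in range(500):
    diff = segments[pairs[:, 0]] - segments[pairs[:, 1]]   # φ(a) - φ(b)
    p = 1.0 / (1.0 + np.exp(-(diff @ w)))                  # predicted P(a ≻ b)
    w += 0.5 * (diff.T @ (prefs - p)) / len(pairs)         # log-likelihood gradient

# The fitted r̂ ranks segments similarly to the true reward; "we are left
# with a traditional reinforcement learning problem" using r̂ as the reward.
agreement = np.mean((segments @ w > 0) == (segments @ w_true > 0))
```

The second quote's worries then apply downstream of this step: the learned r̂ is only as good as the human evaluations that produced the preference labels.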
Edit: Another relevant section in an interview of Paul Christiano by Dwarkesh Patel:
Replacing “must” with “may” is a potential solution to the issues discussed here. I think analogies are misleading when they are used as a means of proof, i.e. convincing yourself or others of the truth of some proposition, but they can be extremely useful when they are used as a means of exploration, i.e. discovering new propositions worthy of investigation. Taken seriously, this means that if you find something of interest with an analogy, it should not mark the end of a thought process or conversation, but the beginning of a validation process: Is there just a superficial connection between the compared phenomena, or actually a deep one? Does it point to a useful model or abstraction?
Example: I think the analogy that trying to align an AI is like trying to steer a rocket towards any target at all shouldn’t be used to convince people that without proper alignment methods mankind is screwed. Who knows if directing a physical object in a geometrical space has much to do with directing a cognitive process in some unknown combinatorial space? Alternatively, the analogy could instead be used as a pointer towards a general class of control problems that come with specific assumptions, which may or may not hold for future AI systems. If we think that the assumptions hold, we may be able to learn a lot from existing instances of control problems like rockets and acrobots about future instances like advanced AIs. If we think that the assumptions don’t hold, we may learn something by identifying the least plausible assumption and trying to formulate an alternative abstraction that doesn’t depend on it, opening another path towards collecting empirical data points of existing instances.
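Read as a pointer to a class of control problems rather than as a proof, the steering analogy can be made concrete with a toy instance. A sketch, with all constants illustrative: a 1-D point driven toward a target by a proportional-derivative controller, the simplest member of the class that rockets and acrobots also belong to:

```python
# Toy instance of the "steering toward a target" control-problem class:
# a 1-D point with velocity, driven by a proportional-derivative controller.
def simulate(target: float, steps: int = 200, dt: float = 0.1,
             kp: float = 1.0, kd: float = 1.5) -> float:
    """Return the final position after PD control toward `target`."""
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        accel = kp * (target - pos) - kd * vel   # control law
        vel += accel * dt                        # semi-implicit Euler step
        pos += vel * dt
    return pos
```

Whether anything transfers from such geometrical control problems to directing a cognitive process is exactly the assumption the validation process would have to examine.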
For collaboration on job-like tasks that assumption might hold. For companionship and playful interactions I think the visual domain, possibly in VR/AR, will be found to be relevant and kept. Given our psychological priors, I also think that for many people it may feel like a qualitative change in what kind of entity we are interacting with: from lifeless machine, through uncanny human imitation, to believable personality on another substrate.
Empirical data point: In my experience, talking to Inflection’s Pi on the phone covers the low-latency integration of “AI is capable of holding a conversation over text, transcribing speech to text, and synthesizing natural-sounding speech” well enough to pass some bar of “feels authentically human” for me, until you try to test its limits. I imagine that subjective experience is more likely to appear if you don’t have background knowledge about LLMs / DL. Its main problems are 1) keeping track of context in a plausibly human-like way (e.g. a game of guessing the capital cities of European countries leads to repetitive questions about the same few countries, even when it is asked in various ways to take care) and 2) inconsistent rejection of talking about certain things depending on previous text (e.g. retelling dark jokes by real comedians).
I share your expectation that adding photorealistic video generation to it can plausibly lead to another “cultural moment”, though it might depend on whether such avatars find similarly rapid adoption as ChatGPT did, or whether they are phased in more gradually. (I’ve no overview of the entire space and stumbled over Inflection’s product by chance after a random podcast listening. If there are similar ones out there already, I’d love to know.)
What are major indicators for their lead? Is this view partly based on project glasswing and the published examples of vulnerabilities that Mythos Preview has found?