Inner alignment for simulators
Broadly agreed. I’d written a similar analysis of the issue before, where I also take into account path dynamics (i. e., how and why we actually get to Azazel from a random initialization). But that post is a bit outdated.
My current best argument for it goes as follows:
The central issue — the reason why “naive” approaches to just training an ML model to make good predictions will likely result in a mesa-optimizer — is that all such setups are “outer-misaligned” by default. They don’t optimize AIs towards being good world-models; they optimize them for making specific good predictions and then channeling those predictions through a low-bandwidth communication channel. (Answering a question, predicting the second part of a video/text, etc.)
That is, they don’t just simulate a world: they simulate a world, then locate some specific data they need to extract from the simulation, and translate them into a format understandable to humans.
As simulation complexity grows, it seems likely that these last steps would require powerful general intelligence/GPS as well. And at that point, it’s entirely unclear what mesa-objectives/values/shards it would develop. (Seems like it almost fully depends on the structure of the goal-space. And imagine if e. g. “translate the output into humanese” and “convince humans of the output” are very nearby, and then the model starts superintelligently optimizing for the latter?)
In addition, we can’t just train simulators “not to be optimizers”, in terms of locating optimization processes/agents within them and penalizing such structures. It’s plausible that advanced world-modeling is impossible without general-purpose search, and it would certainly be necessary inasmuch as the world-model would need to model humans.
If such a system seemed intuitively universal but wasn’t exploding, what kind of observation would tell you that it isn’t universal after all, and therefore salvage your claim?
Given that we don’t understand how current LLMs work, or what the “space of problems” generally looks like, it’s difficult to come up with concrete tests that I’m confident I won’t goalpost-move on. A prospective one might be something like this:
If you invent or find a board game of similar complexity to chess that [the ML model] has never seen before and explain the rules using only text (and, if [the ML model] is multimodal, also images), [a pre-AGI model] will not be able to perform as well at the game as an average human who has never seen the game before and is learning it for the first time in the same way.
I. e., an AGI would be able to learn problem-solving in completely novel, “basically-off-distribution” domains. And if a system that has capabilities like this doesn’t explode (and it’s not deliberately trained to be myopic or something), that would falsify my view.
But for me to be confident in putting weight on that test, we’d need to clarify some specific details about the “minimum level of complexity” of the new board game, and that it’s “different enough” from all known board games for the AI to be unable to just generalize from them… And given that it’s unclear in which directions it’s easy to generalize, I expect I wouldn’t be confident in any metric we’d be able to come up with.
I guess a sufficient condition for AGI would be “is able to invent a new scientific field out of whole cloth, with no human steering, as an instrumental goal towards solving some other task”. But that’s obviously an overly high bar.
As far as empirical tests for AGI-ness go, I’m hoping for interpretability-based ones instead. I. e., that we’re able to formalize what “general intelligence” means, then search for search in our models.
As far as my epistemic position, I expect three scenarios here:
We develop powerful interpretability tools, and they directly show whether my claims about general intelligence hold.
We don’t develop powerful interpretability tools before an AGI explodes like I fear and kills us all.
AI capabilities gradually improve until we get to scientific-field-inventing AGIs, and my mind changes only then.
In scenarios where I’m wrong, I mostly don’t expect to ever encounter any black-boxy test like you’re suggesting which seems convincing to me, before I encounter overwhelming evidence that makes convincing me a moot point.
(Which is not to say my position on this can’t be moved at all — I’m open to mechanical arguments about why cognition doesn’t work how I’m saying it does, etc. But I don’t expect to ever observe an LLM whose mix of capabilities and incapabilities makes me go “oh, I guess that meets my minimal AGI standard but it doesn’t explode, guess I was wrong on that”.)
Yeah, I think there’s a sharp-ish discontinuity at the point where we get to AGI. “General intelligence” is, as the name suggests, general — it implements some cognition that can efficiently derive novel heuristics for solving any problem/navigating arbitrary novel problem domains. And a system that can’t do that is, well, not an AGI.
Conceptually, the distinction between an AGI and a pre-AGI system feels similar to the distinction between a system that’s Turing-complete and one that isn’t:
Any Turing-complete system implements a set of rules that suffices to represent any mathematical structure/run any program. A system that’s just “slightly below” Turing-completeness, however, is dramatically more limited.
Similarly, an AGI has a complete set of some cognitive features that make it truly universal — features it can use to bootstrap any other capability it needs, from scratch. By contrast, even a slightly “pre-AGI” system would be qualitatively inferior, not simply quantitatively so.
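To make the flavor of that cliff concrete — this is my own stock illustration from the Chomsky hierarchy, not part of the original argument — a finite-state machine, which sits just one rung below automata with unbounded memory, cannot recognize balanced parentheses at arbitrary nesting depth, while a single unbounded counter suffices:

```python
def balanced(s: str) -> bool:
    """Decides balanced parentheses using one unbounded counter — a capability
    no finite-state machine (fixed memory) has for arbitrary nesting depth."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing paren with nothing open
                return False
    return depth == 0

print(balanced("(()())"))  # True
print(balanced("(()"))     # False
```

The point of the analogy: the gap between “fixed memory” and “one unbounded counter” is qualitative, not quantitative — no amount of scaling the finite-state machine closes it.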
There’s still some fuzziness around the edges, like whether any significantly useful R&D capabilities only appear with post-AGI cognition, or to what extent being an AGI is a sufficient, and not merely a necessary, condition for an omnicidal explosion.
But I do think there’s a meaningful sense in which AGI-ness is a binary, not a continuum. (I’m also hopeful regarding nailing all of this down mathematically, instead of just vaguely gesturing at it like this.)
Oh, I think I should’ve made clearer that I wasn’t aiming that rant at you specifically. Just outlining my general impression of how the two views feel socially.
A lot of slow takeoff, gradual capabilities ramp-up, multipolar AGI world type of thinking. Personally, I agree with him this sort of scenario seems both more desirable and more likely.
I think the operative word in “seems more likely” here is “seems”. It seems like a more sophisticated, more realistic, more modern, and satisfyingly nuanced view, compared to “the very first AGI we train explodes like a nuclear bomb and unilaterally sets the atmosphere on fire, killing everyone instantly”. The latter seems like an old view, a boringly simplistic retrofuturistic plot. It feels like there’s a relationship between these two scenarios, and that the latter one is a rough first-order approximation someone lifted out of e. g. The Terminator to get people interested in the whole “AI apocalypse” idea at the onset of it all. Then we gained a better understanding, sketched out detailed possibilities that take into account how AI and AI research actually work in practice, and refined that rough scenario. As a result, we got that picture of a slower multipolar catastrophe.
A pleasingly complicated view! One that respectfully takes into account all of these complicated systems of society and stuff. It sure feels like how these things work in real life! “It’s not like the AI wakes up and decides to be evil,” perish the thought.
That seeming has very little to do with reality. The unilateral-explosion isn’t the old, outdated scenario — it’s simply a different scenario, operating on a different model of how intelligence explosions proceed. And as far as its proponents are concerned, its arguments haven’t been overturned at all, and nothing about how DL works rules it out.
But it sure seems like the rough naive view that the Real Experts have grown out of a while ago; and that those who refuse to update simply haven’t done that growing-up, haven’t realized there’s a world outside their chosen field with all these Complicated Factors you need to take into account.
It makes it pretty hard to argue against. It’s so low-status.
… At least, that’s how that argument feels to me, on a social level.
(Edit: Uh, to be clear, I’m not saying that there are no reasons to buy the multipolar scenario except “it seems shiny”, or that a reasonable person could not come to believe it for valid reasons. I think it’s incorrect, and that some properties unfairly advantage it in the social context, but I’m not saying it’s totally illegitimate.)
Goals are functions over the concepts in one’s internal ontology, yes. But having a concept for something doesn’t mean caring about it — your knowing what a “paperclip” is doesn’t make you a paperclip-maximizer.
The idea here isn’t to train an AI with the goals we want from scratch, it’s to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, then use it as a foundation to train/build a different agent that would care about these concepts.
Now this is admittedly very different from the thesis that value is complex and fragile.
I disagree. The fact that some concept is very complicated doesn’t mean it won’t be represented in any advanced AGI’s ontology. Humans’ psychology, or the specific tools necessary to build nanomachines, or the agent-foundations theory necessary to design aligned successor agents, are all also “complex and fragile” concepts (in the sense that getting a small detail wrong would result in a grand failure of prediction/planning), but we can expect such concepts to be convergently learned.
Not that I necessarily expect “human values” specifically to actually be a natural abstraction — an indirect pointer at “moral philosophy”/DWIM/corrigibility seems much more plausible and much less complex.
Two agents with the same ontology and very different purposes would behave in very different ways.
I don’t understand this objection. I’m not making any claim isomorphic to “two agents with the same ontology would have the same goals”. It sounds like maybe you think I’m arguing that if we can make the AI’s world-model human-like, it would necessarily also be aligned? That’s not my point at all.
The motivation is outlined at the start of 1A: I’m saying that if we can learn how to interpret arbitrary advanced world-models, we’d be able to more precisely “aim” our AGI at any target we want, or even manually engineer some structures over its cognition that would ensure the AGI’s aligned/corrigible behavior.
I agree that the AI would only learn the abstraction layers it’d have a use for. But I wouldn’t take it as far as you do. With “human values” specifically, I agree the problem may be just that muddled — but not with any of the other nice targets: moral philosophy, corrigibility, and DWIM should be more concrete.
The alternative would be a straight-up failure of the NAH, I think; your assertion that “abstractions can be on a continuum” seems directly at odds with it. Which isn’t impossible, but this post is premised on the NAH working.
the opaque test is something like an obfuscated physics simulation
I think it’d need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can’t tell whether there’s etheric interference involved or not. The way Fermat’s test can’t tell a Carmichael number from a prime — it just doesn’t interact with the input number in a way that’d reveal the difference between their internal structures.
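For concreteness, here’s the standard number-theoretic fact being invoked (a well-known property of Carmichael numbers, not anything from the discussion itself): the smallest Carmichael number, 561 = 3 × 11 × 17, passes the Fermat test for every base coprime to it, exactly as a true prime does:

```python
import math

def fermat_test(n, bases):
    """n "passes" for base a iff a^(n-1) ≡ 1 (mod n) — necessary but not
    sufficient for primality."""
    return all(pow(a, n - 1, n) == 1 for a in bases)

# 561 = 3 * 11 * 17 is composite, yet it satisfies Fermat's congruence
# for every base coprime to it (Korselt's criterion).
bases = [a for a in range(2, 20) if math.gcd(a, 561) == 1]
print(fermat_test(561, bases))  # True — composite, but "prime-shaped" to this test
print(fermat_test(557, bases))  # True — 557 actually is prime
```

The test simply never interacts with the input in a way that exposes the structural difference between the two cases — which is the property the analogy leans on.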
By analogy, we’d need some “simulation” which doesn’t interact with the sensory input in a way that can reveal a structural difference between the presence of a specific type of tampering and the absence of any tampering at all (while still detecting many other types of tampering). Otherwise, we’d be able to detect the undesirable behavior with sufficiently advanced interpretability tools. Inasmuch as physical simulations spin out causal models of events, they wouldn’t fit the bill.
It’s a really weird image, and it seems like it ought to be impossible for any complex real-life scenarios. Maybe it’s provably impossible, i. e. we can mathematically prove that any model of the world with the necessary capabilities would have distinguishable states for “no interference” and “yes interference”.
Models of world-models is a research direction I’m currently very interested in, so hopefully we can just rule that scenario out, eventually.
It seems like there are plenty of hopes
Oh, I agree. I’m just saying that there don’t seem to be any approaches aside from “figure out whether this sort of worst case is even possible, and under what circumstances” and “figure out how to distinguish bad states from good states at the object-level, for whatever concrete task you’re training the AI on”.
Lazy World Models
It seems like “generators” should just be simple functions over natural abstractions? But I see two different ways to go with this, inspired either by the minimal latents approach, or by the redundant-information one.
First, suppose I want to figure out a high-level model of some city, say Berlin. I already have a “city” abstraction, let’s call it P(Λ), which summarizes my general knowledge about cities in terms of a probability distribution over possible structures. I also know a bunch of facts about Berlin specifically; let’s call their sum F. Then my probability distribution over Berlin’s structure is just P(X_Berlin) = P(Λ|F).
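A deliberately toy sketch of that first approach — the features and probabilities below are made up purely for illustration — conditioning a generic “city” prior on Berlin-specific facts:

```python
from fractions import Fraction

# Prior P(Λ): a "city" abstraction as a joint distribution over two
# toy structural features (invented numbers, purely illustrative).
prior = {
    ("has_metro", "coastal"): Fraction(1, 10),
    ("has_metro", "inland"):  Fraction(3, 10),
    ("no_metro",  "coastal"): Fraction(2, 10),
    ("no_metro",  "inland"):  Fraction(4, 10),
}

def condition(dist, consistent_with_facts):
    """P(Λ|F): keep only structures consistent with the known facts, renormalize."""
    kept = {s: p for s, p in dist.items() if consistent_with_facts(s)}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}

# Facts F about "Berlin" (toy): it has a metro system.
p_berlin = condition(prior, lambda s: s[0] == "has_metro")
print(p_berlin[("has_metro", "inland")])  # 3/4
```

The point is only the shape of the operation: the general abstraction is reused wholesale, and the city-specific model is just the prior restricted to whatever F rules in.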
Alternatively, suppose I want to model the low-level dynamics of some object I have an abstract representation for — in this case, suppose it’s the business scene of Berlin. I condition my abstraction of a business, P(B), on everything I know about Berlin, getting P(B|X_Berlin), then sample from the resulting distribution several times until I get a “representative set”. Then I model its behavior directly.
This doesn’t seem quite right, though.
… there seems to be a selection effect where GPS-instances that don’t care about preserving future API-call-context gets removed, leaving only subagent-y GPS-instances over time.
You’re not taking into account larger selection effects on agents, which select against agents that purge all those “myopic” GPS-instances. The advantage of shards and other quick-and-dirty heuristics is that they’re fast — they’re what you’re using in a fight, or when making quick logical leaps, etc. Agents which purge all of them, and keep only slow deliberative reasoning, don’t live long. Or, rather, agents which are dominated by strong deliberative reasoning tend not to do that to begin with, because they recognize the value of said quick heuristics.
In other words: not all shards/subagents are completely selfish and sociopathic — some (or most) want to keep select others around. So even those that don’t “defend themselves” can be protected by others, or not be targeted to begin with.
A “chips-are-tasty” shard is probably not so advanced as to have reflective capabilities, and e. g. a more powerful “health” shard might want it removed. But if you have some even more powerful preferences for “getting to enjoy things”, or a dislike of erasing your preferences for practical reasons, the health-shard’s attempts might be suppressed.
A shard which implements a bunch of highly effective heuristics for escaping certain death is probably not one that any other shard/GPS instance would want removed.
So the concern is that “the AI generates a random number, sees that it passes the Fermat test, and outputs it” is the same as “the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it”, right?
Yeah, in that case, the only viable way to handle this is to get something into the system that can distinguish between no tampering and etheric interference. Just like the only way to train an AI to distinguish primes from Carmichael numbers is to find a way to… distinguish them.
Okay, that’s literally tautological. I’m not sure this problem has any internal structure that makes it possible to engage with further, then. I guess I can link the Gooder Regulator Theorem, which seems to formalize the idea that “to get a model that learns to distinguish between two underlying system-states, we need a test that can distinguish between those two system-states”.
What are your current thoughts on the exact type signature of abstractions? In the Telephone Theorem post, they’re described as distributions over the local deterministic constraints. The current post also mentions that the “core” part of an abstraction is the distribution P[Λ], and its ability to explain variance in individual instances of Xi.
Applying the deterministic-constraint framework to trees, I assume it says something like “given certain ground-truth conditions (e. g., the environment of a savannah + the genetic code of a given tree), the growth of tree branches of that tree species is constrained like so, the rate of mutation is constrained like so, the spread of saplings like so, and therefore we should expect to see such-and-such distribution of trees over the landscape, and they’ll have such-and-such forms”.
Is that roughly correct? Have you arrived at any different framework for thinking about type signatures?
I think there’s a sense in which the Fermat test is a capability problem, not an interpretability/alignment problem.
It’s basically isomorphic to a situation in which sensor tampering is done via a method that never shows up in the AI’s training data. E. g., suppose it’s done via “etheric interference”, which we don’t know about, and which never fails and therefore never leads to any discrepancies in the data, so the AI can’t learn it via SSL either. Then the AI just… can’t learn about it, period. It’s not that it could in theory pick up on it but instead throws out that data, rolling both “no tampering” and “etheric tampering” into the same mechanistic explanation — it’s simply given no data on it to begin with. Only if etheric tampering ever fails, in a way that impacts the visible data, can the AI notice it and internally represent it.
Same for the Fermat primality test. If, during the training, we feed the AI a dataset of numbers that happens not to include any Carmichael numbers, it’s basically equivalent to teaching the AI on a dataset that includes no tampering of some specific type. If then, at runtime, a Carmichael number shows up (someone tampers with the sensors in that OOD way), the AI just fails, because it hasn’t been given the necessary training data not to fail.
So, yeah, if our sensors are going to be interfered-with by magic/aliens/supercriminals, in novel ways we don’t know about and which don’t show up on our training data, our AI won’t be able to foresee that either. But that’s a pure capability problem, solved via capability methods (catch them in the act and train the AI on the new data).
Edit: And if our threat model is the AI itself doing the tampering — well, it can hardly do it via a method it doesn’t know about, can it? And if it’s generating ~random actions and accidentally tampers with the sensors in a way it didn’t intend, that also seems like a capability problem that’ll be solved by routine capability work.
Also, in this case the AI’s generate-random-actions habits probably result in many more unintended side-effects than just sensor tampering, so its error should be easily noticed and corrected. The corner case is where there’s some domain of actions we don’t know about and the AI doesn’t know about, whose only side-effect is sensor tampering. But that seems very unlikely, and — again — that problem has nothing to do with the AI’s decision-making process, alignment, or lack thereof, and everything to do with no-one in the situation (AI or human) knowing how the world works.
There’s a specialized tool for that, actually! Snactiv. I’ve tried it, it’s pretty easy to use + you don’t have to put it down and pick it up when switching between snacking and typing.
You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups
Yeah, that’s a legitimate difference from my initial position: I wasn’t considering alternate setups like this when I wrote the post.
The part I think I’m still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much? Inasmuch as R and G are themselves each other’s close correlates, a close correlate of one is likely a close correlate of the other.
(1) Hmm I don’t understand how this works if we’re randomizing the environments, because aren’t we breaking those correlations so the agent doesn’t latch onto them instead of the real goal?
(2) Also, in what you’re describing, it doesn’t seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.
Consider an agent that’s been trained on a large number of games, until it reached the point where it can be presented with a completely unfamiliar game and be seen to win at it. What’s likely happening, internally?
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, it needs to decide what to do in it. It feeds the world-model to some “goal generator” feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
As to whether it’s motivated to pursue the G-correlate because it’s a G-correlate — to answer that, we need to speculate on the internals of the “goal generator”. If it reliably spits out local G-correlates, even in environments it never saw before, doesn’t that imply that it has a representation of a context-independent correlate of G, which it uses as a starting point for deriving local goals?
If we were prompting the agent only with games it has seen before, then the goal-generator might just be a compressed lookup table: the agent would’ve been able to just memorize a goal for every environment it’s seen, and this procedure just de-compresses them.
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
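To make the contrast concrete, here’s a deliberately toy sketch — entirely my own illustration, with made-up structure and names — of the two candidate goal-generator internals:

```python
# Toy contrast (hypothetical): a memorizing goal-generator vs. one that
# derives local goals from a context-independent representation of G ("win").

MEMORIZED_GOALS = {
    "chess": "checkmate the opponent",
    "go": "surround more territory",
}

def lookup_goal_generator(env_name: str) -> str:
    """Compressed lookup table: only covers environments seen in training."""
    return MEMORIZED_GOALS[env_name]  # raises KeyError on anything OOD

def derived_goal_generator(world_model: dict) -> str:
    """Starts from a context-independent G-correlate ("reach terminal states
    rated as wins") and specializes it using the runtime world-model."""
    win_states = world_model["terminal_states_favoring_self"]
    return f"steer the game toward: {win_states}"

# The derived generator handles a never-before-seen game; the lookup one can't.
novel_game = {"terminal_states_favoring_self": "opponent's flag captured"}
print(derived_goal_generator(novel_game))
```

The lookup table fails closed on anything outside its keys; the derived version only needs the runtime world-model to expose a win condition, which is the structural difference the question is pointing at.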
Well, you do address this:
there’s no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).
… I don’t see a meaningful difference, here. There’s some data structure internal to the goal generator, which it uses as a starting point when deriving a goal for a new environment. Reasoning from that data-structure reliably results in the goal generator spitting out a local G-correlate. What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
Or, perhaps a better question to ask is, what are some examples of these “decision-relevant factors in the environment”?
E. g., in the games example, I imagine something like:
The agent is exposed to the new environment; a multiplayer FPS, say.
It gathers data and incrementally builds a world-model, finding local natural abstractions. 3D space, playable characters, specific weapons, movements available, etc.
As it’s doing that, it also builds more abstract models. Eventually, it reduces the game to its pure mathematical game-theoretic representation, perhaps viewing it as a zero-sum game.
Then it recognizes some factors in that abstract representation, goes “in environments like this, I must behave like this”, and “behave like this” is some efficient strategy for scoring the highest.
Then that strategy is passed down the layers of abstraction, translated from the minimalist math representation to some functions/heuristics over the given FPS’ actual mechanics.
Do you have something significantly different in mind?
As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication
I still don’t see it. I imagine “deceptive alignment”, here, to mean something like:
“The agent knows G, and that scoring well at G reinforces its cognition, but it doesn’t care about G. Instead, it cares about some V. Whenever it notices its capabilities improve, it reasons that this’ll make it better at achieving V, so it attempts to do better at G because it wants the outer optimizer to preferentially reinforce said capabilities improvement.”
This lets it decouple its capabilities growth from G-caring: its reasoning starts from V, and only features G as an instrumental goal.
But what’s the bad-exploration low-sophistication equivalent of this, available before it can do such complicated reasoning, that still lets it couple capabilities growth with better performance on G?
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
For reference, I think you’ve formed a pretty accurate model of my model.
Given this analysis, it seems like the default behavior is for the GPS API-calls to gradient hack away whatever other API-calls that would predictably result in in-distribution behaviors not getting preserved (e.g., value-compilation).
Yup. But this requires these GPS instances to be advanced enough to do gradient-hacking, and indeed be concerned with preventing their current values from being updated away. Two reasons not to expect that:
Different GPS instances aren’t exactly “subagents”, they’re more like planning processes tasked to solve a given problem.
Consider an impulse to escape from certain death. It’s an “instinctive” GPS instance; the GPS has been prompted with “escape certain death”, and that prompt is not downstream of abstract moral philosophy. It’s an instinctive reaction.
But this GPS instance doesn’t care about preventing future GPS instances from updating away the conditions that caused it to be initiated (i. e., the agent’s tendency to escape certain death). It’s just tasked with deriving a plan for the current situation.
It wouldn’t bother looking up what abstract-moral-philosophy is plotting and maybe try to counter-plot. It wouldn’t care about that.
(And even if it does care, it’d be far from its main priority, so it wouldn’t do effective counter-plotting.)
The hard-coded pointer to value compilation might simply be chiseled-in before the agent is advanced enough to do gradient-hacking. In that case, even if a given GPS instance would care to plot against abstract values, it just wouldn’t know how (or know that it needs to).
That said, you’re absolutely right that it does happen in real-life agents. Some humans are suspicious of abstract arguments for the greater good and refuse to e. g. go from deontologists to utilitarians. The strength of the drive for value compilation relative to the shards’ strength varies, and depending on it, the process of value compilation may be frozen at some arbitrary point.
It partly falls under the meta-cognition section. But in even more extreme cases, a person may simply refuse to engage in value-compilation at all, expressing a preference not to be coherent.
… Which is an interesting point, actually. We want there to be some value compilation, or agents just wouldn’t be able to generalize OOD at all. But it’s not obvious that we want maximum value compilation. Maximum value compilation leads to e. g. an AGI with a value-humans shard who decides to do a galaxy-brained merger of that shard with something else and ends up indifferent to human welfare. But maybe we can replicate what human deontologists are doing, and alter the “power distribution” among the AGI’s processes such that value compilation freezes just before this point?
I may be overlooking some reason this wouldn’t work, but seems promising at first glance.
I genuinely think it might be harder than the actual technical problem of alignment, and we ought to look for any path which isn’t that hopelessly doomed.