Theoretical Computer Science MSc student at the University of [Redacted] in the United Kingdom.
I’m an aspiring alignment theorist; my research vibes are descriptive formal theories of intelligent systems (and their safety properties) with a bias towards constructive theories.
I think it’s important that our theories of intelligent systems remain rooted in the characteristics of real world intelligent systems; we cannot develop adequate theory from the null string as input.
DragonGod
I don’t buy this argument for a few reasons:
SBF met Will MacAskill in 2013 and it was following that discussion that SBF decided to earn to give
EA wasn’t a powerful or influential movement back in 2013, but quite a fringe cause.
SBF was in EA since his college days, long before his career in quantitative finance and later in crypto
SBF didn’t latch onto EA after he acquired some measure of power or when EA was a force to be reckoned with, but pretty early on. He was in a sense “homegrown” within EA.
The “SBF was a sociopath using EA to launder his reputation” is just motivated credulity IMO. There is little evidence in favour of it. It’s just something that sounds good to be true and absolves us of responsibility.
Astrid’s hypothesis is not very credible when you consider that she doesn’t seem to be aware of SBF’s history within EA. Like what’s the angle here? There’s nothing suggesting SBF planned to enter finance as a college student before MacAskill sold him on earning to give.
2. Non-Violence: Argument gets counter-argument. Argument does not get bullet. Argument does not get doxxing, death threats, or coercion.[1]
I’d want to include some kinds of social responses as unacceptable as well. Derision, mockery, acts to make the argument low status, ad hominems, etc.
You can choose not to engage with bad arguments, but you shouldn’t “engage” by declining to address the arguments and instead trying to execute some social manoeuvre to discredit them.
I’m fine with not taking Yudkowskian fast takeoff seriously as I think it’s just grossly implausible (and I’m very familiar with the arguments for it; this isn’t a strong scepticism borne of ignorance).
Like some people should try to mitigate AI risk under Yudkowskian foom, but I don’t really endorse requiring people who can pass an ITT for Yudkowskian foom, yet think it’s grossly implausible, to condition their actions heavily on it.
At some point you have to reckon with the fact that some people are very familiar with the classic AI risk case/Yudkowsky’s doom scenario in particular and ultimately have some very strong disagreements.
I’m fine with focusing most of one’s efforts at the threat models one considers to be the most probable.
I took the survey, phew that was long. I added a public key for what it’s worth.
Very sceptical that buying time is more important than anything else. It’s not even clear to me that it’s a robustly positive intervention.
From my perspective most of orthodox alignment research developed before the deep learning revolution ended up being mostly useless and some was net negative as it enshrined inadequate ontologies. Some alignment researchers seem to me to be stuck in a frame that just doesn’t apply well to artifacts produced by contemporary ML.
I would not recommend new aspiring alignment researchers to read the Sequences, Superintelligence, some of MIRI’s earlier work or trawl through the alignment content on Arbital despite reading a lot of that myself.
The LW content that seems valuable re large language models was written in the last one to three years, most of it in the last year.
@janus’ “Simulators” is less than a year old.
It’s not particularly clear that buying time would be very helpful. The history of AI Safety research has largely failed to transfer across paradigms/frames. If we “buy time” by freezing the current paradigm, we’d get research in the current paradigm. But if the current paradigm is not the final form of existentially dangerous AI, such research may not be particularly valuable.
Something else that complicates buying time even if the current paradigm is the final form of existentially dangerous AI is phenomena that emerge at greater scale. We’re just fundamentally unable to empirically study such phenomena until we have models at the appropriate scale.
Rather than lengthening timelines to AGI/strong AI, it seems to me that most of the existential safety comes from having a slow/continuous takeoff that allows us to implement iterative alignment and governance approaches.
Freezing AI development at a particular stage does not seem all that helpful to me.
That could lead to a hardware overhang and move us out of the regime where only a handful of companies can train strong AI systems (with training runs soon reaching billions of dollars) to a world where hundreds or thousands of actors can do so.
Longer timelines to strong AGI plus faster takeoff (as a result of a hardware overhang) seems less safe than shorter timelines to strong AGI plus slower takeoff.
I didn’t realise today was April 1st and now I’m disappointed. This is a feature I was really excited about and think would be a considerable value add to the forum.
This was very refreshing to read. I’m glad EY has realised that mocking silly ideas doesn’t actually help (it makes adherents of the idea double down and become much less likely to listen to you, and may also alienate some neutrals; this is particularly true for ideas which have gained currency, like the Abrahamic religions). I wasn’t able to recommend The Sequences to Christian friends previously because of its antireligiosity — here’s hoping this version will be better.
To be clear, my update from this was: “AI is less likely to become economically disruptive before it becomes existentially dangerous” not “AI is less likely to become existentially dangerous”.
Epistemic Status
I am an aspiring selection theorist and I have thoughts.
Why Selection Theorems?
Learning about selection theorems was very exciting. It’s one of those concepts that felt so obviously right. A missing component in my alignment ontology that just clicked and made everything stronger.
Selection Theorems as a Compelling Agent Foundations Paradigm
There are many reasons to be sympathetic to agent foundations style safety research as it most directly engages the hard problems/core confusions of alignment/safety. However, one concern with agent foundations research is that we might build sky high abstraction ladders that grow increasingly disconnected from reality. Abstractions that don’t quite describe the AI systems we deal with in practice.
I think that in presenting this post, Wentworth successfully sidestepped the problem. He presented an intuitive story for why the Selection Theorems paradigm would be fruitful; it’s general enough to describe many paradigms of AI system development, yet concrete enough to say nontrivial/interesting things about the properties of AI systems (including properties that bear on their safety). Wentworth presents a few examples of extant selection theorems (most notably the coherence theorems) and later argues that selection theorems have a lot of research “surface area” and new researchers could be onboarded (relatively) quickly. He also outlined concrete steps people interested in selection theorems could take to contribute to the program.
Overall, I found this presentation of the case for selection theorems research convincing. I think that selection theorems provide a solid framework with which to formulate (and prove) safety desiderata/guarantees for AI systems that are robust to arbitrary capability amplification. Furthermore, selection theorems seem to be very robust to paradigm shifts in the development of artificial intelligence. That is, regardless of what changes in architecture or training methodology subsequent paradigms may bring, I expect selection theoretic results to still apply[1].
I currently consider selection theorems to be the most promising agent foundations flavoured research paradigm.
Digression: Asymptotic Analysis and Complexity Theory
My preferred analogy for selection theorems is asymptotic complexity in computer science. Using asymptotic analysis we can make highly non-trivial statements about the performance of particular (or arbitrary!) algorithms that abstract away the underlying architecture, hardware, and other implementation details. As long as the implementation of the algorithm is amenable to our (very general) models of computation, the asymptotic/complexity theoretic guarantee will generally still apply.
For example, we have a very robust proof that no comparison-based sorting algorithm can attain better worst case time complexity than $\Omega(n \log n)$ (this happens to be a very tight lower bound as extant algorithms [e.g. mergesort] attain it). The model behind the lower bound of comparison sorting is very minimal and general:
Data operations
Comparing two elements
Moving elements (copying or swapping)
Cost: number of such operations
Any algorithm that performs sorting by directly comparing elements to determine their order conforms to this model. The lower bound of $\Omega(n \log n)$ obtains because we can model the execution of the sorting algorithm by a binary decision tree:
Nodes: individual comparisons between elements
Edges: different outcomes of comparisons ($\leq$ and $>$)
Leaf nodes: unique permutation of the input array that corresponds to that particular root to leaf path of the tree
The number of comparisons the sorting algorithm executes for any given input permutation is given by the number of edges between the root node and that leaf. The worst case running time of the algorithm is given by the height of the tree. Because there are $n!$ possible permutations of the input array, the lowest attainable worst case complexity is $\lceil \log_2(n!) \rceil$, which is in $\Theta(n \log n)$.
I reiterate that this is a very powerful result. Here we’ve set up very minimal assumptions about our model (comparisons are made between pairs of elements to determine order, the algorithm can copy or swap elements) and we’ve obtained a ridiculously strong impossibility result[2].
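To make this concrete, here’s a small self-contained Python sketch (mine, not from any of the posts discussed) that tallies the comparisons mergesort makes over every permutation of a small array and checks the result against the information-theoretic bound $\lceil \log_2(n!) \rceil$:

```python
import math
from itertools import permutations

def merge_sort(xs, counter):
    """Sort xs, tallying every element-to-element comparison in counter[0]."""
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid], counter)
    right = merge_sort(xs[mid:], counter)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1                     # one comparison between two elements
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

n = 8
bound = math.ceil(math.log2(math.factorial(n)))   # ceil(log2(n!)) comparisons
worst = 0
for perm in permutations(range(n)):               # exhaustive over all n! inputs
    counter = [0]
    assert merge_sort(list(perm), counter) == sorted(perm)
    worst = max(worst, counter[0])

print(f"information-theoretic lower bound: {bound}")
print(f"mergesort's measured worst case:   {worst}")
```

For $n = 8$ the bound is 16 comparisons, with mergesort’s exhaustively measured worst case landing just above it — and the bound holds no matter how the comparisons are implemented.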
Selection Theorems as a Complexity Theoretic Analogue
Selection theorems present a minimal model of an intelligent system as an agent situated in an environment. The agents are assumed to be the product of some optimisation process selecting for performance on a given metric (e.g. inexploitability in multi-agent environments, for the coherence theorems).
The exact optimisation process performing the selection is abstracted away [only the performance metric/objective function(s) of optimisation matters], and the hope is to do the same for the environment (that is, selection theoretic results should apply to a broad class of environments; e.g. for the coherence theorems, the only constraint imposed on the environment is that it contains other agents).
Using the above model, selection theorems try to derive[3] “type signatures” (the representation [data structures], interfaces [inputs & outputs], and embedding [in an underlying physical (or other low level) system]) of the agent and of specific agent aspects (world models, goals, etc.). It’s through these type signatures that safety relevant properties of agents can be concretely formulated (and hopefully proven).
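As a minimal sketch of what the “interfaces” part of a type signature might look like (names and the split into `observe`/`act` are my own hypothetical choices, not anything from the post):

```python
from typing import Protocol, TypeVar

Observation = TypeVar("Observation")
Action = TypeVar("Action")

class AgentSignature(Protocol[Observation, Action]):
    """A hypothetical, deliberately minimal agent 'type signature'.
    It only pins down the interface (inputs and outputs); the representation
    (data structures) and the embedding are left unspecified here."""

    def observe(self, observation: Observation) -> None:
        """Incorporate an observation into the agent's internal state / world model."""
        ...

    def act(self) -> Action:
        """Emit the agent's next action."""
        ...
```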
For example, the proposed anti-naturalness of corrigibility to expected utility maximisation can be seen as an “impossibility result”[4] of a safety property (corrigibility) derived from a selection theorem (the coherence theorems).
While this is a negative result, I expect no fundamental difficulty to obtaining positive selection theoretic guarantees of safety properties.
I see the promise of selection theorems as doing for AI safety, what complexity theory does for algorithm performance.
The Power of Selection Theorems
I expect that we will be able to provide selection theoretic guarantees of nontrivial safety properties/desiderata.
In particular, I think selection theorems naturally lend themselves to proving properties that are selected for/emerge in the limit[5] of optimisation for particular objectives (convergence theorems?). I find the potential of asymptotic guarantees exhilarating.
Properties proven to emerge in the limit become more robust with (increasing) scale. I think that’s an incredibly powerful result. Furthermore, asymptotic complexity analysis suggests that it’s often easier to make statements about what holds in the limit than about what holds at particular intermediate levels. (We can very easily talk about how the performance of two algorithms compares on a particular problem as input size tends towards infinity, without considering implementation details or underlying hardware and ignoring all constant factors. To talk about the performance of two algorithms on input sets of a particular fixed size, we’d need to consider all the aforementioned details.)
The combination of:
“Properties that are selected for in the limit become more robust with (increasing) scale” and
“It is much easier to describe the limit of a process than particular intermediate states”
is immensely powerful[6]. It makes selection theorems a hugely compelling — perhaps the one I find most personally compelling — AI safety research paradigm.
Reservations
While I am quite enamoured with Wentworth’s selection theorems, I find myself somewhat dissatisfied; as Wentworth framed them, I think they are a bit off.
A major limitation of the coherence theorems is that they constrain agents to an archetype that does not necessarily describe real agents (or other intelligent systems) well. In particular, the coherence theorems assume agent preferences are (see the toy sketch after this list):
Static (do not change with time)
Path independent (the exact course of action taken to get somewhere does not affect the agent’s preferences; alternatively, it assumes that agents do not have internal states that factor into their preferences)
Complete (for any two options, the agent prefers one of them or is indifferent. It doesn’t permit a notion of “incomplete preferences”)
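Here is a toy Python sketch (purely illustrative; the options and the switching rule are hypothetical) of an agent whose preferences violate all three assumptions at once:

```python
from typing import Optional

class ToyAgent:
    """An agent whose preferences are dynamic, path dependent, and incomplete.
    (Purely illustrative; names and choices are hypothetical.)"""

    def __init__(self):
        self.history = []                 # internal state the coherence theorems assume away

    def prefers(self, a: str, b: str) -> Optional[bool]:
        """True if a is preferred to b, False if b is preferred, None if incomparable."""
        if {a, b} == {"coffee", "tea"}:
            # Path dependence / dynamism: after having chosen coffee once,
            # the agent switches to preferring tea.
            return (a == "tea") if "coffee" in self.history else (a == "coffee")
        # Incompleteness: other pairs are simply not comparable
        # (neither preferred either way, nor "indifferent").
        return None

    def choose(self, option: str):
        self.history.append(option)
```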
The failure of coherence theorems to carve reality at the joints is a valuable lesson re: choosing the right preconditions for our theorems (if our preconditions are too restrictive/strong, they might describe systems that don’t matter in the real world [“spherical cows”]). And it’s a mistake I worry that the paradigm of “agent type signatures” might be making.
To be precise, I am quite unconvinced that “agent” is the “true name” of the relevant intelligent systems. There are powerful artifacts (e.g. the base versions of large language models) that do not match the agent archetype as traditionally conceived. I do not know that the artifacts that ultimately matter would necessarily conform to the agent archetype[7]. Theorems that are exclusively about the properties of agents may end up not being very applicable to important systems of interest (if e.g. the first AGIs are created by a [mostly] self-supervised training process).
Agent selection theorems are IMO ultimately too restrictive (their preconditions are too strong to describe all intelligent systems of interest/they implicitly preclude from analysis some intelligent systems we’ll be interested in), and the selection theorem agenda should be generalised to optimisation processes and the kind of constructs they select for.
That is, regardless of paradigm, intelligent systems (e.g. humans, trained ML models and expert systems) are the products of optimisation processes (e.g. natural selection, stochastic gradient descent, and human design[8] respectively).
So, a theory based solely on optimisation processes seems general enough to describe all intelligent systems of interest (while being targeted enough to say nontrivial/interesting things about such systems) and minimal (we can’t relax the preconditions anymore while still obtaining nontrivial results about intelligent systems).
The agent type signature paradigm is insufficiently general.
In the remainder of this post, I would like to slightly adjust the concept of selection theorems to better reflect what I think they should be[9].
Types of Selection Theorems
There are two broad classes of theorems that seem valuable:
Constructor Theorems
For a given (collection of) objective(s) and underlying configuration space, what type[10] of artifacts is produced by constructive optimisation processes (e.g. natural selection, stochastic gradient descent, and human design) that select for performance on said objective(s)?
Fundamentally, they ask the question:
What properties are selected for by optimisation for a particular (collection of) objective function(s)?
The aforementioned “convergence theorems” would be a particular kind of constructor theorem.
Artifact Theorems
Artifact theorems are the dual of constructor theorems. If constructor theorems seek to identify the artifact type produced by a particular constructive optimisation process, then artifact theorems seek to identify the constructive optimisation process that produced a particular artifact (e.g. the human brain, trained ML models, or the quicksort algorithm).
That is:
For a given artifact type and associated configuration spaces, what were the objectives[11] of the optimisation process that produced it?
I.e. describe the class of problems/domains/tasks the objectives belong to
Can we also specify a type for the objectives?
What properties do its members have?
Which properties are necessary to select for that artifact type?
What is its parent type?
Which properties are sufficient?
What are the interesting child types?
I suspect that e.g. investigating general intelligence artifact theorems would be a promising research agenda for robust safety of arbitrarily capable general systems.
- ^
Provided we use sufficiently general agent/system models as the foundation for our selection theoretic results.
- ^
I should point out that this impossibility result is somewhat atypical; for many interesting problems we don’t have tight lower bounds on complexity beyond trivial ones (e.g. the size of the input or output).
- ^
Usually, some parts of the type signatures are assumed (implicitly or explicitly) by the theorem.
- ^
Jessica Taylor told me that she thinks the anti-naturalness of corrigibility is more of a “research intuition” than a theorem.
- ^
I’m under the impression that it was when thinking about what emerges in the limit that I first drew the relationship between selection theorems and complexity theory. However, this may be a false memory (or otherwise not a particularly reliable recollection of events).
- ^
It feels almost too good to be true, like we’re cheating in the mileage we get out of selection theorems.
- ^
While any physical system can be construed as an agent situated in an environment, the agent archetype is not illuminating for all of them. Viewing a calculator as an agent does not really offer any missing insight into the operations of the calculator; it does not allow you to better predict its behaviour.
- ^
Insomuch as one accepts that design is a kind of optimisation process. And I would insist that you should, but I’ve not gotten around to writing up my thoughts on what exactly qualifies as an optimisation process in a form that I would endorse. Eliezer’s “Measuring Optimisation Power” is a fine enough first approximation.
- ^
The quickest gloss is that:
- “Agent” should be replaced with “artifact” (a general term for any object that is the product of an optimisation process).
Some sample artifacts and the optimisation process that produced them:
* The human brain: natural selection
* Trained ML models: stochastic gradient descent
* A numerical approximation of a square root: Newton’s method
* The quicksort algorithm: human design
- ^
Among other things, a type should specify a set of properties that all members of the type share. If those properties are necessary and sufficient for an artifact to belong to a particular type, the type could simply be identified with its collection of properties.
Types can exist at different levels of abstraction (allowing them to specify artifact properties at different levels of detail).
An artifact can belong to multiple types (e.g. I might belong to the types: “human”, “male”, “Nigerian”).
- ^
Rather than identifying the optimisation process in detail, only the objective function of the optimisation process is considered. Any other particulars/specifics of the optimisation process are abstracted away (the same way implementation details are abstracted away in asymptotic analysis).
The motivation is that I think that any two optimisation processes with the same objective functions on the same configuration space with the same “optimisation power” are identical for our purposes. And for convergence theorems, even the optimisation power is abstracted away.
Let us suppose that we’ve solved the technical problem of AI Alignment — i. e., the problem of AI control. We have some method of reliably pointing our AGIs towards the tasks or goals we want, such as the universal flourishing of all sapient life. As per the Orthogonality Thesis, no such method would allow us to only point it at universal flourishing — any such method would allow us to point the AGI at anything whatsoever.
This does not hold if we get alignment by default.
Or any similar scheme where we solve the alignment problem via training the AI (via e.g. self supervised learning) on human generated/curated data (perhaps with additional safety features).
[Janus’ Simulators covers important differences of self-supervised models in more detail.]
It may be the case that such models are the only generally intelligent systems, but systems trained in such a way do not exhibit strong orthogonality.
And it does not follow that we can, in full generality, point such systems at arbitrary other targets.
I turned 25 today.
Strongly upvoted.
Agree and well said.
Overloading misuse and misalignment is anti-helpful.
To state the obvious, Yudkowsky’s writing style/rhetoric/argument annoys people.
[Content and responses removed by habryka since this is more appropriate in meta, and the comment was already crossposted]
Koan: Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text? What factors make that task easier, or harder? (If you don’t have an answer, maybe take a minute to generate one, or alternatively, try to predict what I’ll say next; if you do have an answer, take a moment to review it inside your mind, or maybe say the words out loud.)
From @janus’ Simulators:
Something which can predict everything all the time is more formidable than any demonstrator it predicts: the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum (though it may not be trivial to extract that knowledge).
I tried (poorly) to draw attention to this thesis in my “The Limit of Language Models”.
given how AI capabilities are going, it’s not unreasonable for people to start playing their outs — that is to say, to start acting as if alignment is easy, because if it’s not we’re doomed anyways. but i think, in this particular case, this is wrong.
Alternatively, reality is looking to me like the hard alignment problem is just based on fundamentally mistaken models about the world. It’s not about playing our outs; it’s that it doesn’t seem like we live in a hard alignment world.
let’s not give up on the assumptions which are true. there is still work that can be done to actually generate some dignity under the assumptions that are actually true.
They’re looking more false by the day.
On the other hand, I find it way too easy to abandon my bright ideas. I often just lose interest or get distracted by something.
I have 10,000+ words of unpublished drafts, and outlines I’ve not even yet turned into drafts.
I often envy people like Alice and child John who are able to cling on to their idea for months until it fails.
I wish I could persevere at any idea for months.
Epistemic Status: I don’t actually know anything about machine learning or reinforcement learning and I’m just following your reasoning/explanation.
From each state, we can just check each possible action against the action-value function $q(s_t, a_t)$, and choose the action that returns the highest value from the action-value function. Greedy search against the action-value function for the optimal policy is thus equivalent to the optimal policy. For this reason, many algorithms try to learn the action-value function for the optimal policy.
This does not actually follow. Policies return probability distributions over actions (“strategies”), and it’s not necessarily the case that the output of the optimal policy in the current state is a pure strategy.
Mixed strategies are especially important and may be optimal in multi agent environments (a pure Nash equilibrium may not exist, but a mixed Nash equilibrium is guaranteed to exist).
Though maybe for single player decision making, optimal play never requires mixed strategies? For any mixed strategy, there may exist an action in that strategy’s support (the set of actions that the strategy assigns positive probability to) that has an expected return that is not lower than the strategy itself? I think this may be the case for deterministic environments, but I’m too tired to work out the maths right now.
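One way to see the support claim (a sketch under the standard MDP action-value definitions) is that a policy’s value at a state is an average of the action values it mixes over, so it can never exceed the best single action:

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, q^{\pi}(s, a) \le \max_{a} q^{\pi}(s, a)$$

so some pure action in the support does at least as well as the mixture. (This averaging argument doesn’t by itself settle the multi-agent or adversarial cases.)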
IIRC randomised choice is mostly useful in multi-agent environments, in environments where the environment has free variables in its transition rules that may be sensitive to your actions (i.e. the environment itself can be profitably modelled as an agent [where the state transitions are its actions]), or in environments that are otherwise non-deterministic/stochastic (including stochastic behaviour that arises from uncertainty).
So I think greedy search for the action that attains the highest value for the optimal policy’s action value function is only equivalent to the optimal policy if the environment is:
Deterministic
Fully observable/the agent has perfect information
Agent knows all the “laws of physics”/state transition rules of the environment
Fixed low level state transitions that do not depend on the agent
(I may be missing some other criteria necessary to completely obviate mixed strategies.)
I think these conditions are actually quite strong!
I don’t understand what I just read.
I think an issue is that GPT is used to mean two things:
A predictive model whose output is a probability distribution over token space given its prompt and context
Any particular techniques/strategies for sampling from the predictive model to generate responses/completions for a given prompt.
[See the Appendix]
The latter kind of GPT is what I think is rightly called a “Simulator”.
From @janus’ Simulators (italicised by me):
It is exactly because of the existence of GPT, the predictive model, that sampling from GPT is considered simulation; I don’t think there’s any real tension in the ontology here.
Appendix
Credit for highlighting this distinction belongs to @Cleo Nardo:
To summarise:
Static GPT: GPT as predictor
Dynamic GPT: GPT as simulator
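A toy Python sketch of the distinction (mine; the function names and the fixed distribution are purely illustrative, not a real model or API):

```python
import random

def static_gpt(context):
    """'Static GPT' (toy stand-in): maps a context to a probability distribution
    over the next token. The fixed distribution here is purely illustrative;
    a real model would compute it from the context."""
    return {"the": 0.5, "a": 0.3, "<eos>": 0.2}

def dynamic_gpt(prompt_tokens, max_tokens=10):
    """'Dynamic GPT': repeatedly sample from the static predictor and feed the
    sampled token back in. This rollout loop is the thing called a simulator."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = static_gpt(tokens)
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens

print(dynamic_gpt(["once", "upon", "a", "time"]))
```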