Research Lead at CORAL. Director of AI research at ALTER. PhD student in Shay Moran’s group in the Technion (my PhD research and my CORAL/ALTER research are one and the same). See also Google Scholar and LinkedIn.
E-mail: {first name}@alter.org.il
The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent in which case you can come to model drugs as a “trap” (and there can be many additional complications).
What do you mean “randomly come upon A”? RL is not random. Why wouldn’t it find A?
Let the proxy reward function we use to train the AI be $R_{\mathrm{proxy}}$ and the “true” reward function that we intend the AI to follow be $R_{\mathrm{true}}$. Supposedly, these functions agree on some domain $D$ but catastrophically come apart outside of it. Then, if all the training data lies inside $D$, which reward function is selected depends on the algorithm’s inductive bias (and possibly also on luck). The “cancellation” hope is then that inductive bias favors $R_{\mathrm{true}}$ over $R_{\mathrm{proxy}}$.
But why would that be the case? Realistically, the inductive bias is something like “simplicity”. And human preferences are very complex. On the other hand, something like “the reward is such-and-such bits in the input” is very simple. So instead of cancelling out, the problem is only aggravated.
And that’s under the assumption that $R_{\mathrm{proxy}}$ and $R_{\mathrm{true}}$ actually agree on $D$, which is in itself wildly optimistic.
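A toy illustration of the simplicity point (entirely my own construction, with made-up state variables and a deliberately crude notion of “simplicity”): two reward hypotheses that agree on the training domain $D$ but come apart off-distribution, where a description-length bias favors the proxy.

```python
# Illustrative toy (my own, not from the post): R_proxy and R_true agree on the
# training domain D but diverge outside it, and a crude simplicity measure favors R_proxy.
import inspect

def reward_proxy(state):
    # "the reward is such-and-such bits in the input": very simple
    return state["reward_channel"]

def reward_true(state):
    # stand-in for complex human preferences, with many qualifications
    if state["humans_flourishing"] and not state["sensors_tampered"]:
        return state["reward_channel"]
    return 0.0

# Training domain D: tampering never occurs there, so the two functions agree exactly.
D = [{"reward_channel": b, "humans_flourishing": True, "sensors_tampered": False}
     for b in (0.0, 1.0)]
assert all(reward_proxy(s) == reward_true(s) for s in D)

# Off-distribution, they come apart catastrophically.
off = {"reward_channel": 1.0, "humans_flourishing": False, "sensors_tampered": True}
print(reward_proxy(off), reward_true(off))  # 1.0 vs 0.0

# A crude "description length" bias prefers the proxy; real human preferences are
# far more complex, so the inductive bias aggravates rather than cancels the problem.
for f in (reward_proxy, reward_true):
    print(f.__name__, len(inspect.getsource(f)))
```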
In the post Richard Ngo talks about delineating “alignment research” vs. “capability research”, i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI-risk rather than harmful. He comes up with a taxonomy based on two axes:
Cognitivist vs. Behaviorist, i.e. focused on internals vs. external behavior. Arguably, net-beneficial research tends to be on the cognitivist side.
Worst-case vs. Average-case, i.e. focused on rare failures vs. “usual” behavior. Arguably, net-beneficial research tends to be on the worst-case side.
I think that Ngo raises an important question, and his answers point in the right direction. For my part, I would like to slightly reframe his parameters and add two more axes:
Instead of “cognitivist vs. behaviorist”, I would say “gears-level vs. surface-level”. We want research that explains the actual underlying mechanisms rather than just empirically registering particular phenomena or trends. This is definitely similar to “cognitivist vs. behaviorist”, but research that takes into account the internals of the algorithm can still be mostly surface-level: e.g. maybe it’s just saying, if we tweak this parameter in the algorithm, the performance goes up. I think that Ngo might object that tweaking a parameter is very different from talking e.g. about “beliefs” or “goals” that the system has, and I would agree, but I think that gears vs. surface might be a clearer delineation.
Instead of “worst-case vs. average-case”, I would say “robust vs. fragile”. This is because it’s not entirely clear what distribution we are “averaging” over, and because it’s important that rare failure modes can arise due to systematic reasons rather than just bad luck. The way I think about it: “fragile” methods are methods that can work if you can afford failures, so that every time there is a failure you amend the system until the result is satisfactory. “Robust” methods are the methods you need if you can’t afford even one failure.
Another axis I would add is “two-body vs. one-body”. This is related to Ngo’s remark in the end that “further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds”. The point is, alignment is fundamentally a two-body problem. We are aligning AI to a human (or many humans). And humans are already confused about what their preferences are, or about what it would mean to solve a problem without “undesirable side effects”. Therefore, we need research that illuminates the human side of things as well as the AI side of things. The way I envision it is by creating a theory of agents that is applicable to AIs and humans alike. Other approaches might treat those two sides more asymmetrically, but they do have to address both sides.
Additionally, I would add “precise vs. vague”. This is the difference between making vague, informal statements, and making precise, mathematical, hopefully quantitative statements. Being precise is certainly not sufficient: e.g. scaling laws can be precise, but fail to be gears-level. But it does seem like an important desideratum. Maybe this doesn’t need to be its own axis: precision seems necessary for achieving robustness. But, I think it’s a useful criterion for assessing research that stands on its own[1].
Of course, on most of those axes, going “left” is useful for capabilities and not just alignment. As Ngo justly points out, a lot of research is inevitably dual use. However, approaches that lean “right” are often sufficient to advance capabilities and are unlikely to be sufficient to solve alignment, making them clearly the worse option overall.
In earlier times, I would also add here the consideration of applicability. When assembling a dangerous machine, it seems best to plug in the parts that make it dangerous last, if at all possible. Similarly, it’s better to start developing our understanding of agents from parts that don’t immediately allow building agents. Today, this is still true to some extent, however the urgency of the problem might make it moot. Unless the Pause-AI efforts are massively successful, we might have to make our theories of alignment applicable quite soon, and might not have the luxury of not parallelizing this research as much as possible.
Finally, a relatively minor quibble: Ngo seems to put a lot of the emphasis here on understanding deep learning. I would not go that far, for two reasons: one is the two-body desideratum I mentioned before, and the other is that deep learning might not be The Way. It’s possible that it’s better to find a different path towards AI altogether, one designed on the basis of better understanding from the start. This might seem overly ambitious, but I do have some leads.
There are certainly examples of research which is at least trying to be robust, while still failing to be very precise (e.g. some of Paul Christiano’s work falls in this category). Such research can be a good starting point for investigation, but should become precise at some stage for it to truly produce robust solutions.
I am separately worried about “Carefully Controlled Moderate Superintelligences that we’re running at scale, each instance of which is not threatening, but, we’re running a lot of them...
I think that this particular distinction is not the critical one. What constitutes an “instance” is somewhat fuzzy. (A single reasoning thread? A system with a particular human/corporate owner? A particular source code? A particular utility function?) I think it’s more useful to think in terms of machine intelligence suprasystems with strong internal coordination capabilities. That is, if we’re somehow confident that the “instances” can’t or won’t coordinate either causally or acausally, then they are arguably truly “instances”, but the more they can coordinate the more we should be thinking of them in the aggregate. (Hence, the most cautious risk estimate comes from comparing the sum total of all machine intelligence against the sum total of all human intelligence[1].)
More precisely, not even the sum total of all human intelligence, but the fraction of human intelligence that humans can effectively coordinate. See also comment by Nisan.
There seem to be two underlying motivations here, which are best kept separate.
One motivation is having a good vocabulary to talk about fine-grained distinctions. I’m on board with this one. We might want to distinguish e.g.:
Smarter than a median human along all AI-risk-relevant axes
Smarter than the smartest human along all AI-risk-relevant axes
Smarter than all of humanity put together along all AI-risk-relevant axes
Smart enough to have a 50% success probability to kill all humans if it chooses to, given current level of countermeasures
Smart enough to have a 50% success probability to kill all humans if it chooses to, even if best-case countermeasures are in place (this particular distinction inspired by Buck’s comments on this thread)
But then, first, it is clear that existing AI is not superintelligence according to any of the above interpretations. Second, I see no reason not to use catchy words like “hyperintelligence”, per One’s suggestion. (Although I agree that there is an advantage to choosing more descriptive terms.)
Another motivation is staying ahead of the hype cycles and epistemic warfare on twitter or whatnot. This one I take issue with.
I don’t have an account on twitter, and I hope that I never will. Twisting ourselves into pretzels with ridiculous words like “AIdon’tkilleveryoneism” is incompatible with creating a vocabulary optimized for actually thinking and having productive discussions among people who are trying to be the adults in the room. Let the twitterites use whatever anti-language they want. To the people trying to do beneficial politics there: I sincerely wish you luck, but I’m laboring in a different trench; let’s use the proper tool for each task.
I understand that there can be practical difficulties such as, what if LW ends up using a language so different from the outside world that it will become inaccessible to outsiders, even when those outsiders would otherwise make valuable contributions. There are probably some tradeoffs that are reasonable to make with such considerations in mind. But let’s at least not abandon any linguistic position at the slightest threatening gesture of the enemy.
This post is an overview of Steven Byrnes’ AI alignment research programme, which I think is interesting and potentially very useful.
In a nutshell, Byrnes’ goal is to reverse engineer the human utility function, or at least some of its central features. I don’t think this will succeed in the sense of, we’ll find an explicit representation that can be hard-coded into AI. However, I believe that this kind of research is useful for two main reasons:
Bridging brain science and agent theory is a promising way to make sure that we build a theory of agents broad enough to include humans. The latter is crucial in order to formally define alignment (since alignment is between the AI-agent and the human-agent), which is needed to have formal alignment guarantees. In particular, it is needed for value learning to become possible, such as in my COSI proposal.
While ideally we might wish for alignment guarantees to assume as little as possible, it might be difficult or even impossible to design a competitive AI system which is robustly aligned with a completely uninformed prior. As a conservative example, we might discover that one or several scalar parameters of humans should be approximately known (e.g. parameters related to amount of computing resources[1]). In this case, we would need to reverse engineer these parameters from brain science, which requires having a reliable dictionary between brain science and agent theory.
I hope that in the future this programme makes more direct contact with the mathematical formalism of agent theory, of the sort the LTA is constructing. However, I realize that this is a difficult challenge.
Why are we giving up on plain “superintelligence” so quickly? According to Wikipedia:
A superintelligence is a hypothetical agent that possesses intelligence surpassing that of the most gifted human minds. Philosopher Nick Bostrom defines superintelligence as “any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest”.
According to Google AI Overview:
Superintelligence (or Artificial Superintelligence—ASI) is a hypothetical AI that vastly surpasses human intellect in virtually all cognitive domains, possessing superior scientific creativity, general wisdom, and social skills, operating at speeds and capacities far beyond human capability, and potentially leading to profound societal transformation or existential risks if not safely aligned with human goals.
I don’t think I saw anyone use “superintelligence” to mean “better than a majority of humans on some specific tasks” before very recently. (Was Deep Blue a superintelligence? Is a calculator a superintelligence?)
This is a deeply confused post.
In this post, Turner sets out to debunk what he perceives as “fundamentally confused ideas” which are common in the AI alignment field. I strongly disagree with his claims.
In section 1, Turner quotes a passage from “Superintelligence”, in which Bostrom talks about the problem of wireheading. Turner declares this to be “nonsense” since, according to Turner, RL systems don’t seek to maximize a reward.
First, Bostrom (AFAICT) is describing a system which (i) learns online and (ii) maximizes long-term consequences. There are good reasons to focus on such a system: these are properties that are desirable in an AI defense system, if the system is aligned. Now, the LLM+RLHF paradigm which Turner puts at the center is, at least superficially, not like that. However, this is no argument against Bostrom: today’s systems have already gone beyond LLM+RLHF (introducing RL over chain-of-thought) and tomorrow’s systems are likely to be even more different. And, if a given AI design does not somehow acquire properties (i)+(ii) even indirectly (e.g. via in-context learning), then it’s not clear how it would be useful for creating a defense system.
Second, Turner might argue that even granted i+ii, the AI would still not maximize reward because the properties of deep learning would cause it to converge to some different, reward-suboptimal, model. While this is often true, it is hardly an argument why not to worry.
While deep learning is not known to guarantee convergence to the reward-optimal policy (we hardly know how to prove any guarantees about deep learning), RL algorithms are certainly designed with reward maximization in mind. If your AI is unaligned even under best-case assumptions about learning convergence, it seems very unlikely that deviating from these assumptions would somehow cause it to be aligned (while remaining highly capable). To argue otherwise is akin to hoping that the rocket reaches the Moon because our equations of orbital mechanics don’t account for some errors, rather than despite them.
After this argument, Turner adds that “as a point of further fact, RL approaches constitute humanity’s current best tools for aligning AI systems today”. This observation seems completely irrelevant. It was indeed expected that RL would be useful in the subhuman regime, when the system cannot fail catastrophically simply because it lacks the capabilities. (Even when it convinces some vulnerable person to commit suicide, OpenAI’s legal department can handle it.) I would expect this to have been obvious to Bostrom even back then, and it doesn’t invalidate his conclusions in the slightest.
In section 3, Turner proceeds to attack the so-called “counting argument” for misalignment. The counting argument goes: since there are many more misaligned minds/goals than aligned minds/goals, even conditional on “good” behavior in training, it seems unlikely that current methods will produce an aligned mind. Turner (quoting Belrose and Pope) counters this argument by way of analogy. Deep learning successfully generalizes even though most models that perform well on the training data don’t perform well on the test data. Hence, (they argue) the counting argument must be fallacious.
The major error that Turner, Belrose and Pope are making is that of confusing aleatoric and epistemic uncertainty. There is also a minor error of being careless about what measure the counting is performed over.
If we did not know anything about some algorithm except that it performs well on the training data, we would indeed have at most a weak expectation of it performing well on the test data. However, deep learning is far from random in this regard: it was selected by decades of research to be the sort of algorithm that does generalize well. The counting argument nonetheless gives us a perfectly reasonable prior in this case; deep learning’s track record is an update on top of that prior, not a refutation of it.
(The minor point is that, w.r.t. a simplicity prior, even a random algorithm has some bounded-from-below probability of generalizing well.)
The counting argument is not premised on a deep understanding of how deep learning works (which at present doesn’t exist), but on a reasonable prior about what we should expect from our vantage point of ignorance. It describes our epistemic uncertainty, not the aleatoric uncertainty of deep learning. We can imagine that, if we knew how deep learning really works in the context of typical LLM training data etc., we would be able to confidently conclude that, say, RLHF has a high probability of eventually producing agents that primarily want to build astronomical superstructures in the shape of English letters, or whatnot. (It is of course also possible we would conclude that LLM+RLHF will never produce anything powerful enough to be dangerous or useful-as-defense-system.) That would not be inconsistent with the counting argument as applied from our current state of knowledge.
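To make the counting intuition concrete, here is a toy enumeration (my own illustration, with an arbitrary stand-in “true” behavior): among all Boolean functions on three bits that fit a four-point training set, only a small fraction also fit the held-out points. That is the prior from the vantage point of ignorance; what rescues deep learning is additional knowledge about its inductive bias, not a flaw in the counting.

```python
# Toy counting argument (illustrative only): enumerate every Boolean function on 3 bits,
# keep those that fit the training points, and check how many also fit the test points.
from itertools import product

inputs = list(product([0, 1], repeat=3))

def target(x):
    return x[0] ^ x[1] ^ x[2]  # arbitrary stand-in for the "intended" behavior

train, test = inputs[:4], inputs[4:]

fit_train = fit_both = 0
for table in product([0, 1], repeat=len(inputs)):  # all 2^8 = 256 functions
    f = dict(zip(inputs, table))
    if all(f[x] == target(x) for x in train):
        fit_train += 1
        if all(f[x] == target(x) for x in test):
            fit_both += 1

print(fit_train, fit_both)  # 16 functions fit the training data; only 1 of them generalizes
```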
The real question is then: conditional on our knowledge that deep learning often generalizes well, how confident are we that it will generalize aligned behavior from training to deployment when scaled up to highly capable systems? Unfortunately, I don’t think this update is strong enough to make us remotely safe. The fact that deep learning generalizes implies that it implements some form of Occam’s razor, but Occam’s razor doesn’t strongly select for alignment, as far as we can tell. Our current (more or less) best model of Occam’s razor is Solomonoff induction, which Turner dismisses as irrelevant to neural networks: but here again, the fact that our understanding is flawed just pushes us back towards the counting-argument prior, not towards safety.
Also, we should keep in mind that deep learning doesn’t always generalize well empirically, it’s just that when it fails we add more data until it starts generalizing. But, if the failure is “kill all humans”, there is nobody left to add more data.
Turner’s conclusion is that “it becomes far easier to just use the AIs as tools which do things we ask”. The extent to which I agree with this depends on the interpretation of the vague term “tools”. Certainly modern AI is a tool that does approximately what we ask (even though, when using AI for math, I’m already often annoyed at its attempts to cheat and hide the flaws of its arguments). However, I don’t think we know how to safely create “tools” that are powerful enough to e.g. nearly-autonomously do alignment research or otherwise make substantial steps toward building an AI defense system.
This post contains an interesting mathematical result: that the machinery of natural latents can be transferred from classical information theory to algorithmic information theory. I find it intriguing for multiple reasons:
It updates me towards natural latents being a useful concept for foundational questions in agent theory, as opposed to being some artifact of overindexing on Bayesian networks as the “right” ontology.
The proof technique involves defining an algorithmic information theory analogue of Bayesian networks, which is something I haven’t seen before and seems quite interesting in itself.
It would be interesting to see whether any of this carries over to the efficiently computable counterparts of Kolmogorov complexity I recently invented[1].
The main thing this post is missing is any rigorous examples or existence proofs of these AIT natural latents. I’m guessing that the following construction should work:
Choose a universal Turing machine $U$.
Choose $\Lambda$ to be a $U$-program for a total recursive function, satisfying a suitable condition.
Choose $y_1, \ldots, y_n$ to be random strings of length $\ell$.
Set $x_i := \Lambda(y_i)$.
Then, with high probability, $\Lambda$ is a natural latent for the $x_i$. (I think?)
It would be nice to see something like that in the post.
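For reference, here is my own rough paraphrase (not the post’s exact statement, and omitting its error terms) of the two conditions such a $\Lambda$ would have to satisfy, i.e. mediation and (strong) redundancy translated into Kolmogorov-complexity terms:

```latex
\begin{align*}
& K(x_1, \ldots, x_n \mid \Lambda) \;\approx\; \sum_{i=1}^{n} K(x_i \mid \Lambda)
  && \text{(mediation: given $\Lambda$, the $x_i$ share almost no further information)} \\
& K(\Lambda \mid x_i) \;\approx\; 0 \;\; \text{for all } i
  && \text{(redundancy: $\Lambda$ is almost computable from each $x_i$ alone)}
\end{align*}
```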
These ideas seem conceptually close to concepts like sophistication in algorithmic statistics, and the connection might be worth investigating.
Now, about the stated motivation: the OP claims that natural latents capture how “reasonable” agents choose to define categories about the world. The argument seems somewhat compelling, although some further justification is required for the claim that
If you’ve been wondering why on Earth we would ever expect to find such simple structures in the complicated real world, conditioning on background knowledge is the main answer.
That said, I think that real-world categorizations are also somewhat value-laden: depending on the agent’s preferences, and on the laws of the universe in which they find themselves, there might be particular features they care about much more than other features. (Since they are more decision-relevant.) The importance of these features will likely influence which categories are useful to define. This fact cannot be captured in a formalism on the level of abstraction in this post. (Although maybe we can get some of the way there by drawing on rate-distortion theory?)
Still unpublished.
Regarding the bound in the post (“Then: … log …”): O-notation in contexts like this is somewhat ambiguous, because it’s not clear what is treated as a constant and what is treated as a variable that grows to infinity. Would it be correct to say that there exist constants that depend on nothing except for the implicit choice of UTM (or maybe they also need to depend on the number of strings?) s.t. the stated bound holds with those constants?
Aside: if a string is natural over some strings y_1, …, y_n, then it’s also natural over any subset consisting of two or more of those strings.
Is this actually true? Intuitively, it feels wrong: a natural latent is supposed to contain exactly the information which is mutual between the strings. But, a subset would have more mutual information than the full set. Formally, the redundancy condition obviously descends to subsets, but mediation seems to break, at least superficially?
In this post Jan Kulveit calls for creating a theory of “hierarchical agency”, i.e. a theory that talks about agents composed of agents, which might themselves be composed of agents etc.
The form of the post is a dialogue between Kulveit and Claude (the AI). I don’t like this format. I think that dialogues are a bad format in general: disorganized and not skimming-friendly. The only case where IMO dialogues are defensible is when it’s a real dialogue: real people with different world-views who are trying to bridge and/or argue out their differences.
Now, about the content. I agree with Kulveit that multi-agency is important. I’m not entirely sold on the importance of the “hierarchical agency” frame, but I agree that “when can a system of agents be regarded as a single agent” seems like a question that should be answered. At the very least, a certain type of answer to this question might relieve us of the need to worry about “what if humans are multi-agents” in the context of alignment (because, arguably, it would be possible to just regard humans as uni-agents anyway).
After reading this post, I came up with the following (extremely simplistic) toy model for hierarchical agency.
Let $D$ be the set of possible decisions an agent can make, $W$ the set of “possible worlds”, $O$ the set of “outcomes” and $P: D \times W \to \Delta O$ the (known) process by which outcomes are generated. Then, an agent that makes decision $d \in D$ can be ascribed the belief $\theta \in \Delta W$ and the utility function $u: O \to \mathbb{R}$ when $d$ is the unique maximum of $\mathbb{E}_{w \sim \theta,\, o \sim P(d', w)}[u(o)]$ over $d' \in D$. Some decisions cannot be ascribed “intent” at all: for example, if $W$ is a singleton (so that $P$ is effectively a map $D \to \Delta O$) then $d$ can be ascribed intent iff $P(d)$ is an exposed point of the convex hull of the image of $P$. (See also)
We can now consider a system of $n$ agents with decision sets $D_1, \ldots, D_n$ and a process $P: D_1 \times \cdots \times D_n \times W \to \Delta O$. For each set $S \subseteq \{1 \ldots n\}$ of agents, we can define $D_S := \prod_{i \in S} D_i$ and $W_S := W \times \prod_{i \notin S} D_i$, and then $P_S: D_S \times W_S \to \Delta O$ is defined in the obvious way. We can then ask which agent sets have “collective intent” and which don’t.
To give a simple example, let $D_1 = D_2 = \{\mathtt{h}, \mathtt{t}, \mathtt{r}\}$, $O = \{\mathtt{hh}, \mathtt{ht}, \mathtt{th}, \mathtt{tt}\}$, and $P$ is defined in the obvious way, where $\mathtt{r}$ corresponds to flipping a fair coin to decide between $\mathtt{h}$ and $\mathtt{t}$. Then, if both agents choose $\mathtt{r}$ then they have intent individually (we can think of them as playing matching pennies, with each agent believing the other one’s action to depend on their own action), but not collectively. On the other hand, if they play a pure strategy then they have collective intent as well.
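Below is a minimal sketch of the collective-intent check for the coin example (the encoding of decisions and outcomes is mine, and `scipy` is used only for a small feasibility LP): a joint decision has collective intent iff its outcome distribution is an exposed point of the convex hull of all achievable outcome distributions, i.e. some utility over outcomes is uniquely maximized there.

```python
# Sketch (my own encoding, not from the post): test which joint decisions in the
# coin example have "collective intent", via an exposed-point feasibility LP.
import itertools
import numpy as np
from scipy.optimize import linprog

DECISIONS = ["h", "t", "r"]  # r = flip a fair coin between h and t

def outcome_dist(d1, d2):
    """Distribution over outcomes (hh, ht, th, tt) induced by a joint decision."""
    p1 = {"h": 1.0, "t": 0.0, "r": 0.5}[d1]  # probability agent 1 shows heads
    p2 = {"h": 1.0, "t": 0.0, "r": 0.5}[d2]
    return np.array([p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2)])

def is_exposed(point, others, tol=1e-9):
    """True iff some linear functional (a utility over outcomes) is uniquely
    maximized at `point` among all achievable points, i.e. `point` is exposed."""
    diffs = [point - q for q in others if np.linalg.norm(point - q) > tol]
    if not diffs:
        return True
    # Feasibility LP: find u with u . (point - q) >= 1 for every other point q.
    res = linprog(c=np.zeros(len(point)), A_ub=-np.array(diffs),
                  b_ub=-np.ones(len(diffs)), bounds=[(None, None)] * len(point))
    return res.status == 0  # feasible <=> exposed

points = {jd: outcome_dist(*jd) for jd in itertools.product(DECISIONS, DECISIONS)}
for jd, pt in points.items():
    others = [p for k, p in points.items() if k != jd]
    print(jd, "collective intent:", is_exposed(pt, others))
# Pure joint decisions like ('h', 'h') come out exposed (collective intent);
# ('r', 'r') yields the uniform outcome distribution, which is not exposed.
```

Note that for the pair of agents taken together (with trivial $W$), this reduces exactly to the single-agent exposed-point criterion above.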
Extending this into a theory that fully engages with all relevant aspects of the problem would require incorporating infra-Bayesianism, Formal Computational Realism, possibly some form of the Algorithmic Descriptive Agency Measure etc, and more generally, first developing the theory of uni-agents. But maybe starting from the “hierarchical agency” end can be useful as well.
In this post, Abram Demski argues that existing AI systems are already “AGI”. They are clearly general in a way previous generations of AI were not, and claiming that they are still not AGI smells of moving the goalposts.
Abram also helpfully edited the post to summarize and address some of the discussion in the comments. The commenters argued, and Abram largely agreed, that there are still important abilities that modern AI lacks. However, there is still the question of whether that should disqualify it from the moniker “AGI”, or maybe we need new terminology.
I tend to agree with Abram that there’s a sense in which modern AI is already “AGI”, and also agree with the commenters that there might be something important missing. To put the latter in my own words: I think that there is some natural property in computational-system-space s.t.
The prospect of AI with this property is the key reason to be worried about X-risk from unaligned AI.
Humans, or at least some humans, or at least humanity as a collective, has at least a little of this property, and this is what enabled humanity to become a technological civilization.
To handwave in the direction of that property, I would say “the ability to effectively and continuously acquire deep knowledge and exploit this knowledge to construct and execute goal-directed plans over long lifetimes and consequence horizons”.
It is IMO unclear whether modern AIs are better thought of as having a positive but subhuman amount of this property, or as lacking it entirely (i.e. lacking some algorithmic component necessary for it). This question is hard to answer from our understanding of the algorithms, because foundation models “steal” some human cognitive algorithms in opaque ways, and we don’t even understand deep learning itself. Clearly, a civilization comprised of modern AIs and no humans would not survive (not to mention progress), even if equipped with excellent robotic bodies. But the latter might be just a “coincidental” fact about how harsh our specific universe is.
Be that as it may, I think that the argument for more fine-grained terminology is strong. We can concede that modern AI is AGI, and have a new term for the thing modern AI might-not-yet-be. Maybe AGA: “Artificial General Agent”?
Here’s a feature proposal.
The problem: At present, when a post has 0 reviews, there is an incentive against writing critical reviews. Writing such a review enables the post to enter the voting phase, which you don’t especially want to happen if you think the post is undeserving. This seems perverse: critical reviews are valuable, especially so if someone would later write a positive review anyway, enabling the post to enter voting regardless. (In principle, you can “lie in ambush” until someone writes a positive review and only then write your negative review, but that requires annoying logistics.)
My suggestion: Allow flagging reviews as “critical” in the UI. (One option is to consider a review “critical” whenever your own vote for the post is negative, another is to have a separate checkbox.) Such reviews would not count for enabling the post to enter voting.
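A minimal sketch of the proposed rule (hypothetical field names, not the actual LW codebase):

```python
# Hypothetical eligibility check for the proposed "critical review" flag.
def eligible_for_voting(reviews):
    """A post enters the voting phase iff it has at least one non-critical review."""
    return any(not review.get("critical", False) for review in reviews)

print(eligible_for_voting([]))                                           # False
print(eligible_for_voting([{"critical": True}]))                         # False: only a critical review so far
print(eligible_for_voting([{"critical": True}, {"critical": False}]))    # True: a non-critical review arrived
```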
This work[1] was the first[2] foray into proving non-trivial regret bounds in the robust (infra-Bayesian) setting. The specific bound I got was later slightly improved in Diffractor’s and my subsequent paper. This work studied a variant of linear bandits, for the usual reasons linear models are often studied in learning theory: it is a conveniently simple setting where we actually know how to prove things, even with computationally efficient algorithms. (Although we still don’t have a computationally efficient algorithm for the robust version: not because it’s very difficult, but (probably) just because nobody got around to solving it.) As such, this work was useful as a toy-model test that infra-Bayesianism doesn’t run into statistical intractability issues. As to whether linear-model algorithms or their direct descendants will actually play a role in the ultimate theory of learning, that is still an open question.
An abridged version was also published as a paper in JMLR.
Other than Tian et al, which technically is a robust regret bound, but was not framed by the authors as such (instead, their motivation was studying zero-sum games).
TLDR: This post introduces a novel and interesting game-theoretic solution concept and provides informal arguments for why robust (infra-Bayesian) reinforcement learning algorithms might be expected to produce this solution in the multi-agent setting. As such, it is potentially an important step towards understanding multi-agency.
Disclosure: This review is hardly impartial, since the post was written with my guidance and based on my own work.
Understanding multi-agency is, IMO, one of the most confusing and difficult challenges in the construction of a general theory of intelligent agents. I have a lot of uncertainty about what shape the solution should take, even in the broadest brushstrokes, as I outlined in my recent five worlds taxonomy[1]. This is in contrast to uni-agency, where Formal Computational Realism (FCR) is, IMO, pretty close to at least nailing down the correct type signature and qualitative nature of the desiderata.
At the same time, understanding multi-agency seems quite important in the context of AI alignment. There are many sorts of multi-agent interactions that are potentially relevant:
AI-user is in the very core of the problem.
user-[arbitrary agent] is important since the AI is supposed to faithfully “represent” the user in those interactions, and since examining those interactions might be necessary for correctly interpreting the user’s preferences.
[counterfactual user]-[counterfactual user] is relevant to dealing with uncertainty during value learning.
user-user is important for multi-user alignment.
AI-[counterfactual agent] is important when considering inner alignment, since mesaoptimizers can sometimes be regarded as “acausal attacks” by counterfactual agents.
AI-[successor agent] seems important for thinking about self-improving / reproducing agents.
AI-AI is important if we expect a multipole scenario.
This post tells a particular story of how multi-agent theory might look. In this story, agents converge to a new type of solution concept described in the “stable cycles for multiplayer games” section. (I call this solution “haggling equilibrium”). As opposed to Nash equilibria, the “typical” (but not every) haggling equilibrium in a two-player game is Pareto-efficient. This stands in contrast even to Nash equilibria in repeated games, where Pareto-efficiency is possible but, due to the folk theorem, very underdetermined.
Moreover, there is an argument that a particular type of robust RL algorithm (robust UCB) would converge to such equilibria under some assumptions. However, the argument is pretty informal and there is not even a rigorous conjecture at present. There are, broadly speaking, two possibilities for how the story might be completed:
We promote convergence to haggling equilibrium to a desideratum, and demonstrate algorithms that accomplish it with good statistical and computational efficiency. (This corresponds to the “Economica” world in my five worlds taxonomy.)
We show that there are reasonable uni-agent desiderata (robust regret bounds and maybe more?) that imply convergence to haggling equilibrium. (This corresponds to the “Harmonia” world in my five worlds taxonomy.)
With either possibility, the hope is that combining such a result with FCR would promote it to applying in more “exotic” contexts as well, such as one-shot games with transparent source code (along the lines of Demski’s “logical time”).
It is also interesting to study the notion of haggling equilibrium in itself, for example: is there always a Pareto-efficient haggling equilibrium? (True for two players, but I don’t know the answer in general.)
To summarize, the ideas in this post are, AFAIK, novel (although somewhat similar ideas appeared in the literature in the guise of “aspiration-based” algorithms in multi-agent RL, see e.g. Crandall and Goodrich 2013) and might be key to understanding multi-agency. However, the jury is still very much out.
In the terminology of those five worlds, I consider Nihiland and Discordia to be quite unlikely, but Linguistica, Economica and Harmonia all seem plausible.
I propose a taxonomy of 5 possible worlds for multi-agent theory, inspired by Impagliazzo’s 5 possible worlds of complexity theory (and also the Aaronson-Barak 5 worlds of AI):
Nihiland: There is not even a coherent uni-agent theory, not to mention multi-agency. I find this world quite unlikely, but leave it here for the sake of completeness (and for the sake of the number 5). Closely related is antirealism about rationality, which I have criticized in the past. In this world it is not clear whether the alignment problem is well-posed at all.
Discordia: There is a coherent uni-agent theory, but no coherent theory of multi-agency. This world is conceivable, since the current understanding of multi-agency is much worse than the understanding of “solitary” agents. In this world, negative-sum conflicts and coordination failures are probably ubiquitous (even among arbitrarily sophisticated agents), because there is no principle of rationality that rules them out. Acausal trade is probably not a thing, or at least rare and fragile. In the context of value learning, there might be no principled way to deal with uncertainty (which could otherwise be regarded as a bargaining problem). There is also no principled solution to multi-user alignment.
Linguistica: There is a coherent theory of multi-agency, but agents are inevitably divided into “types” s.t. only interactions between agents of the same type have strong guarantees. (The name of the world is because we can metaphorically think of the types as different “languages”.) An example of how this might happen is reflective oracles, where the type corresponds to the choice of fixed point. Acausal trade probably exists[1], but is segregated by type. Alignment is complicated by the need to specify or learn the human type.
Economica: There is a coherent uni-type theory of multi-agency, but this theory involves desiderata that can only be motivated by multi-agency. Explicitly thinking about multi-agency is necessary to construct the full theory of agents[2]. In this world, the Yudkowskian hope for ubiquitous strong cooperation guarantees can be justified, and acausal trade might be very common. Figuring out the multi-agent theory, and not just the uni-agent fragment, is probably important for alignment, or at least necessary in order to avoid leaving huge gains from trade on the table.
Harmonia: There is a coherent uni-type theory of multi-agency, and this theory can be derived entirely from desiderata that can be motivated without invoking multi-agency at all. There is no special “multi-agent sauce”: any sufficiently rational agents automatically have strong guarantees in the multi-agent setting. Explicitly understanding multi-agency is arguably still important for dealing with uncertainty in value learning, and for dealing with multi-user alignment. (And also in order to know that we are in this world.)
For simplicity, I’m ignoring what is arguably an “orthogonal” axis: to which extent the “correct” multi-agent theory implies acausal cooperation even under favorable conditions. I believe that, outside of Nihiland and Discordia, it probably does, but the alternative hypothesis is also tenable.
On the border between Linguistica and Economica, there are worlds with strong guarantees for agents of the same type and medium-strength guarantees for agents of different type (where “medium-strength” is still stronger than “achieve maximin payoff”: the latter is already guaranteed in infra-Bayesianism). This blurs the boundary, but I would consider this to be Linguistica if even slightly different types have much weaker guarantees (or if there is no useful notion of “slightly different types”) and Economica if there is continuous graceful degradation like in Yudkowsky’s subjective fairness proposal.
This post discusses an important point: it is impossible to be simultaneously perfectly priorist (“updateless”) and learn. Learning requires eventually “passing to” something like a posterior, which is inconsistent with forever maintaining “entanglement” with a counterfactual world. This is somewhat similar to the problem of traps (irreversible transitions): being prudent about risking traps requires relying on your prior, which prevents you from learning every conceivable opportunity.
My own position on this cluster of questions is that you should be priorist/(infra-)Bayesian about physics but postist/learner/frequentist about logic. This idea is formally embodied in the no-regret criterion for Formal Computational Realism. I believe that this no-regret condition implies something like the OP’s “Eventual Learning”, but formally demonstrating it is future work.
Strictly speaking, there’s no result saying you can’t represent quantum phenomena by stochastic dynamics (a.k.a. hidden variables). Indeed, e.g. the de Broglie-Bohm interpretation does exactly that. What does exist is Bell’s inequality, which implies that it’s impossible to represent quantum phenomena by local hidden variables (local = the distribution is the limit of causal graphs in which variables are localized in spacetime and causal connections only run along future-directed timelike (not superluminal) separations). Now, our framework doesn’t even fall in the domain of Bell’s inequality, since (i) we have supracontributions (in this post called “ultracontributions”) instead of ordinary probability distributions (ii) we have multiple co-existing “worlds”. AFAIK, Bell-inequality-based arguments against local hidden variables support neither i nor ii. As such, it is conceivable that our interpretation is in some sense “local”. On the other hand, I don’t know that it’s local and have no strong reason to believe it.
A metacognitive agent is not really an RL algorithm in the usual sense. To first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge on the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore, frameworks that explicitly impose a prior (such as metacognitive agents) are useful for understanding and avoiding these failure modes. (But this has little to do with the OP, the way I see it.)