I wonder if some of the conflation between belief-as-prediction and belief-as-investment is actually a functional social technology for solving coordination problems. To avoid multi-polar traps, people need to trust each other to act against individual incentives- to rationally pre-commit to acting irrationally in the future. Just telling people “I’m planning to act against my incentives, even though I know that doing so will be irrational at the time” might not be very convincing, but instead claiming to have irrationally certain beliefs that would change your incentives were that certainty warranted can be more convincing. Even if people strongly suspect that you’re exaggerating, they know that the social pressure to avoid a loss of status by admitting that you were wrong will make you less likely to defect.
For example, say you’re planning to start a band with some friends. You all think the effort and investment will be worth it so long as there’s a 50% chance of the band succeeding, and you each privately think there’s about a 70% chance of the band succeeding if everyone stays committed, and a near 0% chance if anybody drops out. Say there’s enough random epistemic noise that you think it’s pretty likely someone in the band will eventually drop their odds below that 50% threshold, even while you personally still give success conditional on commitment much better odds. So, unless you can trust everyone to stay committed even if they come to believe it’s not worth the effort, you might as well give up on the band before starting it. Classic multi-polar trap. If, however, everyone at the start is willing to say “I’m certain we’ll succeed”, putting more of their reputation on the line, that might build enough trust to overcome the coordination problem.
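To make the trap concrete, here’s a toy Monte Carlo sketch- the 70% starting estimate and 50% threshold come from the example above, while the band size, time horizon, and noise level are numbers I made up for illustration:

```python
import random

def band_survives(rng, n_members=4, years=3, start=0.70, threshold=0.50, noise=0.15):
    """Return True if no member's private estimate ever drops below the threshold."""
    estimates = [start] * n_members
    for _ in range(years):
        # Each year, everyone's private estimate drifts by some epistemic noise.
        estimates = [e + rng.gauss(0, noise) for e in estimates]
        if any(e < threshold for e in estimates):
            return False  # someone no longer thinks it's worth it and drops out
    return True

rng = random.Random(0)
trials = 100_000
survived = sum(band_survives(rng) for _ in range(trials)) / trials
print(f"P(nobody ever drops below their threshold): {survived:.2f}")
# With these made-up numbers the band usually collapses even though everyone
# started well above the threshold - which is the trap the exaggerated
# "I'm certain we'll succeed" commitment is trying to route around.
```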
Of course, this can create all sorts of epistemic problems. Maybe everyone in the band comes to believe that it’s not worth the effort, but incorrectly thinks that saying so would be a defection. Maybe their exaggerated certainty misleads other people in ways that cause them to make bad investments or to dangerously misunderstand the music industry.
Maybe there’s a sense in which this solution to individual coordination problems is part of a larger coordination problem- everyone is incentivized to reap the value of greater trust, but in doing so causes a greater loss of value to people more broadly by damaging the epistemic commons.
There might be some motivated reasoning on that last point, however, since I definitely find it emotionally uncomfortable when people say inaccurate things for social reasons.
I’m certain your model of what purpose is goes into a lot more detail than mine. My take, however, is that animal brains don’t exactly have a utility function, but probably do have something functionally similar to a reward function in machine learning. A well-defined set of instrumental goals terminating in terminal goals would be a very effective way of maximizing that reward, so the behaviors reinforced will often converge on an approximation of that structure. However, the biological learning algorithm is very bad at consistently finding that structure, so the approximations tend to shift around and conflict a lot- behaviors that approximate a terminal goal one year might approximate an instrumental goal later on, or cease to approximate any goal at all. Imagine a primitive image diffusion model with a training set of face photos- you run it on a set of random pixels, and it starts producing eyes and mouths and so on in random places, then gradually shifts those around into a slightly more coherent image as the remaining noise decreases.
So, instrumental and terminal goals in my model aren’t so much things agents actually have as a sort of logical structure that influences how our behaviors develop. It’s sort of like the structure of “if A implies B which implies C, then A implies C”- that’s something that exists prior to us, but we tend to adopt behaviors approximating it because doing so produces a lot of reward. Note, though, that comparing the structure of goals to logic can be confusing, since logic can help promote terminal goals- so when we’re approximating having goals, we want to be logical, but we have no reason to want to have terminal goals. That’s just something our biological reward function tends to reinforce.
Regarding my use of the term “category error”, I used that term rather than saying “terminal goals don’t require justification” because, while technically accurate, the use of the word “require” there sounds very strange to me. To “require” something means that it’s necessary to promote some terminal goal. So, the phrase reads to me a bit like “a king is the rank which doesn’t follow the king’s orders”- accurate, technically, but odd. More sensible to say instead that following the king’s orders is something having to do with subjects, and a category error when applied to a king.
Definitions and justifications have to be circular at some point, or else terminate in some unexplained things, or else create an infinite chain.
If I’m understanding your point correctly, I think I disagree completely. A chain of instrumental goals terminates in a terminal goal, which is a very different kind of thing from an instrumental goal in that assigning properties like “unjustified” or “useless” to it is a category error. Instrumental goals either promote higher goals or are unjustified, but that’s not true of all goals- it’s just something particular to that one type of goal.
I’d also argue that a chain of definitions terminates in qualia- things like sense data and instincts determine the structure of our most basic concepts, which define higher concepts, but calling qualia “undefined” would be a category error.
There is no fundamental physical structure which constitutes agency
I also don’t think I agree with this. A given slice of objective reality will only have so much structure- only so many ways of compressing it down with symbols and concepts. It’s true that we’re only interested in a narrow subset of that structure that’s useful to us, but the structure nevertheless exists prior to us. When we come up with a useful concept that objectively predicts part of reality, we’ve, in a very biased way, discovered an objective part of the structure of reality- and I think that’s true of the concept of agency.
Granted, maybe there’s a strange loop in the way that cognitive reduction can be further reduced to physical reduction, while physical reduction can be further reduced to cognitive reduction- objective structure defines qualia, which defines objective structure. If that’s what you’re getting at, you may be on to something.
There seems to be a strong coalition around consciousness
One further objection, however: given that we don’t really understand consciousness, I think the cultural push to base our morality around it is a really bad idea.
If it were up to me, we’d split morality up into stuff meant to solve coordination problems by getting people to pre-commit to not defecting, stuff meant to promote compassionate ends for their own sake, and stuff that’s just traditional. Doing that instead of conflating everything into a single universal imperative would get rid of the deontology/consequentialism confusion, since deontology would explain the first category and consequentialism the second. And by not founding our morality on poorly understood philosophy concepts, we wouldn’t risk damaging useful social technologies or justifying horrifying atrocities if Dennettian illusionism turns out to be true or something.
An important bit of context that often gets missed when discussing this question is that actual trans athletes competing in women’s sports are very rare. Of the millions competing in organized sports in the US, the total number who are trans might be under 20 (see this statement from the NCAA president estimating “fewer than ten” in college sports, this article reporting that an anti-trans activist group was able to identify only five in K-12 sports, and this Wikipedia article, which identifies only a handful of trans athletes in professional US sports).
Because this phenomenon is so rare relative to how often it’s discussed, I’m a lot more interested in the sociology of the question than the question itself. There was a recent post from Hanson arguing that the Left and Right in the US have become like children on a road trip annoying each other in deniable ways to provoke responses that they hope their parents will punish. I think the discrepancy between the scale of the issue and how often it comes up is mostly due to it being used in this way.
A high school coach who has to choose whether to allow a trans student to compete in female sports is faced with a difficult social dilemma. If they deny the request, then the student- who wants badly to be seen as female- will be disappointed and might face additional bullying; if they allow it, that will be unfair to the other female players. In some cases, other players may be willing to accept a bit of unfairness as an act of probably supererogatory kindness, but in cases where they aren’t, explaining to the student that they shouldn’t compete without hurting their feelings will take a lot of tact on the part of the coach.
Elevating this to a national conversation isn’t very tactful. People on the right can plausibly claim to only be concerned with fairness in sports, but presented so publicly, this looks to liberals like an attempt to bully trans people. They’re annoyed, and may be provoked into responding in hard to defend ways like demanding unconditional trans participation in women’s sports- which I think is often the point. It’s a child in a car poking the air next to his sister and saying “I’m not touching you”, hoping that she’ll slap him and be punished.
I’m certain the OP didn’t intend anything like that- LessWrong is, of course, a very high-decoupling place. But I’d argue that this is an issue best resolved by letting the very few people directly affected sort out the messy emotions involved among themselves, rather than through public analysis of the question on the object level.
So, in practice, what might that look like?
Of course, AI labs use quite a bit of AI in their capabilities research already- writing code, helping with hardware design, doing evaluations and RLAIF; even distillation and training itself could sort of be thought of as a kind of self-improvement. So, would the red line need to target just fully autonomous self-improvement? But just having a human in the loop to rubber-stamp AI decisions might not actually slow down an intelligence explosion by all that much, especially at very aggressive labs. So, would we need some kind of measure for how autonomous the capabilities research at a lab is, and then draw the line at “only somewhat autonomous”? And if we were able to define a robust threshold, could we really be confident that it would prevent ASI development altogether, rather than just slowing it down?
Suppose instead we had a benchmark that measured something like the capabilities of AI agents in long-term real-world tasks like running small businesses and managing software development projects. Do you think it might make sense to draw a red line somewhere on that graph- targeting a dangerous level of capabilities directly, rather than trying to prevent that level of capabilities from being developed by targeting research methods?
The most important red line would have to be strong superintelligence, don’t you think? I mean, if we have systems that are agentic in the way humans are, but surpass us in capabilities in the way we surpass animals, it seems like specific bans on the use of weapons, self-replication, and so on might not be very effective at keeping them in check.
Was it necessary to avoid mentioning ASI in the “concrete examples” section of the website to get these signatories on board? Are you concerned that avoiding that subject might contribute to the sense that discussion of ASI is non-serious or outside of the Overton window?
I think this is related to what Chalmers calls the “meta problem of consciousness”- the problem of why it seems subjectively undeniable that a hard problem of consciousness exists, even though it only seems possible to objectively describe “easy problems” like the question of whether a system has an internal representation of itself. Illusionism- the idea that the hard problem is illusory- is an answer to that problem, but I don’t think it fully explains things.
Consider the question “why am I me, rather than someone else”. Objectively, the question is meaningless- it’s a tautology like “why is Paris Paris”. Subjectively, however, it makes sense, because your identity in objective reality and your consciousness are different things- you can imagine “yourself” seeing the world through different eyes, with different memories and so on, even though that “yourself” doesn’t map to anything in objective reality. The statement “I am me” also seems to add predictive power to a subjective model of reality- you can reason inductively that since “you” were you in the past, you will continue to be in the future. But if someone else tells you “I am me”, that doesn’t improve your model’s predictive power at all.
I think there’s a real epistemological paradox there, possibly related somehow to the whole liar’s/Gödel’s/Russell’s paradox thing. I don’t think it’s as simple as consciousness being equivalent to a system with a representation of itself.
I used to do graphic design professionally, and I definitely agree the cover needs some work.
I put together a few quick concepts, just to explore some possible alternate directions they could take it:
https://i.imgur.com/zhnVELh.png
https://i.imgur.com/OqouN9V.png
https://i.imgur.com/Shyezh1.png
These aren’t really finished quality either, but the authors should feel free to borrow and expand on any ideas they like if they decide to do a redesign.
This suggests that in order to ensure a sincere author-concept remains in control, the training data should carefully exclude any text written directly by a malicious agent (e.g. propaganda).
I don’t think that would help much, unfortunately. Any accurate model of the world will also model malicious agents, even if the modeller only ever learns about them second-hand. So the concepts would still be there for the agent to use if it was motivated to do so.
Censoring anything written by malicious people would probably make it harder to learn about some specific techniques of manipulation that aren’t discussed much by non-malicious people or don’t appear much in fiction- but I doubt that would be much more than a brief speed bump for a real misaligned ASI, and it would probably come at the expense of reducing useful capabilities in earlier models, like the ability to identify maliciousness, which would give an advantage to competitors.
A counterpoint: when I skip showers, my cat appears strongly in favor of the smell of my armpits- occasionally going so far as to burrow into my shirt sleeves and bite my armpit hair (which, to both my and my cat’s distress, is extremely ticklish). Since studies suggest that cats have a much more sensitive olfactory sense than humans (see https://www.mdpi.com/2076-2615/14/24/3590), it stands to reason that their judgement regarding whether smelling nice is good or bad should hold more weight than our own. And while my own cat’s preference for me smelling bad is only anecdotal evidence, it does seem to suggest at least that more studies are required to fully resolve the question.
I think it’s a very bad idea to dismiss the entirety of news as a “propaganda machine”. Certainly some sources are almost entirely propaganda. More reputable sources like the AP and Reuters will combine some predictable bias with largely trustworthy independent journalism. Identifying those more reliable sources and compensating for their bias takes effort and media literacy, but I think that effort is quite valuable- both individually and collectively for society.
Accurate information about large, important events informs our world model and improves our predictions. Sure, a war in the Middle East might not noticeably affect your life directly, but it’s rare that a person lives an entire life completely unaffected by any war, and having a solid understanding of how wars start and progress based on many detailed examples will help us prepare and react sensibly when that happens. Accurate models of important things will also end up informing our understanding of tons of things that might have originally seemed unrelated. That’s all true, of course, of more neglected sources of information- but it seems like the best strategy for maximizing the usefulness of your models is to focus on information which seems important or surprising, regardless of neglectedness.
Independent journalism also checks the power of leaders. Even in very authoritarian states, the public can collectively exert some pressure against corruption and incompetence by threatening instability- but only if they’re able to broadly coordinate on a common understanding of those things. The reason so many authoritarians deny the existence of reliable independent journalism- often putting little to no effort into hiding the propagandistic nature of their state media- is that by promoting that maximally cynical view of journalism, they immunize their populations against information not under their control. Neglected information can allow for a lot of personal impact, but it’s not something societies can coordinate around- so focusing on it to the exclusion of everything else may represent a kind of defection in the coordination problem of civic duty.
Of course, we have to be very careful with our news consumption- even the most sober, reliable sources will drive engagement by cherry-picking stories, which can skew our understanding of the frequency of all kinds of problems. But availability bias is a problem we have to learn to compensate for in all sorts of different domains- it would be amazing if we were able to build a rich model of important global events by consuming only purely unbiased information, but that isn’t the world we live in. The news is the best we’ve got, and we ought to use it.
So, the current annual death rate for an American in their 30s is about 0.2%. That probably increases by another 0.5% or so when you consider black swan events like nuclear war and bioterrorism. Let’s call “unsafe” a ~3x increase in that expected death rate, to about 2%.
An increase that large would take something a lot more dramatic than the kind of politics we’re used to in the US, but while political changes that dramatic are rare historically, I think we’re at a moment where the risk is elevated enough that we ought to think about the odds.
I might, for example, give odds for a collapse of democracy in the US over the next couple of years at ~2-5%- if the US were to elect 20 presidents similar to the current one over a century, I’d expect better than even odds of one of them making themselves into a Putinesque dictator. A collapse like that would substantially increase the risk of war, I’d argue, including raising a real possibility of nuclear civil war. That might increase the expected death rate for young and middle-aged adults in that scenario by a point or two on its own. It might also introduce a small risk of extremely large atrocities against minorities or political opponents, which could increase the expected death rate by a few tenths of a percent.
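For what it’s worth, here’s the arithmetic behind those figures spelled out (all of the inputs are just my estimates above, not real actuarial data):

```python
# Inputs are the comment's own guesses, not real actuarial data.
baseline = 0.002      # ~0.2% annual death rate for an American in their 30s
tail_risk = 0.005     # guessed addition for black swans (nuclear war, bioterrorism, ...)
expected = baseline + tail_risk
print(expected, 3 * expected)   # 0.007 -> 0.021, i.e. roughly the ~2% "unsafe" line

# "20 presidents similar to the current one over a century" at 2-5% per term:
for p in (0.02, 0.035, 0.05):
    print(p, round(1 - (1 - p) ** 20, 2))   # ~0.33, ~0.51, ~0.64 chance of at least one collapse
```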
There’s also a small risk of economic collapse. Something like a political takeover of the Fed combined with expensive, poorly considered populist policies might trigger hyperinflation of the dollar. When that sort of thing happens overseas, you’ll often see worse health outcomes and a breakdown in civil order increasing the death rate by up to a percent- and, of course, it would introduce new tail risks, increasing the expected death rate further.
I should note that I don’t think the odds of any of this are high enough to worry about my safety now- but needing to emigrate is a much more likely outcome than actually being threatened, and that’s a headache I am mildly worried about.
That’s a crazy low probability.
Honestly, my odds of this have been swinging anywhere from 2% to 15% recently. Note that this would be the odds of our democratic institutions deteriorating enough that fleeing the country would seem like the only reasonable option- p(fascism) more in the sense of a government that most future historians would assign that or a similar label to, rather than just a disturbingly cruel and authoritarian administration still held somewhat in check by democracy.
I wonder: what odds would people here put on the US becoming a somewhat unsafe place to live even for citizens in the next couple of years due to politics? That is, what combined odds should we put on things like significant erosion of rights and legal protections for outspoken liberal or LGBT people, violent instability escalating to an unprecedented degree, the government launching the kind of war that endangers the homeland, etc.?
My gut says it’s now at least 5%, which seems easily high enough to start putting together an emigration plan. Is that alarmist?
More generally, what would be an appropriate smoke alarm for this sort of thing?
One interesting example of humans managing to do this kind of compression in software: .kkrieger is a fully functional first-person shooter game with varied levels, detailed textures and lighting, multiple weapons and enemies, and a full soundtrack. Replicating it in a modern game engine would probably produce a program at least a gigabyte in size, but thanks to some incredibly clever procedural generation, .kkrieger managed to do it in under 100KB.
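As a toy sketch of the principle- nothing like .kkrieger’s actual engine, just an illustration of how a tiny generator plus a seed can stand in for a much larger stored asset:

```python
import math
import random

def generate_texture(seed, size=256, n_waves=8):
    """Build a grayscale 'texture' from a handful of seeded sine waves."""
    rng = random.Random(seed)
    waves = [(rng.uniform(0.02, 0.2), rng.uniform(0.02, 0.2), rng.uniform(0, 2 * math.pi))
             for _ in range(n_waves)]
    return [[sum(math.sin(fx * x + fy * y + ph) for fx, fy, ph in waves) / n_waves
             for x in range(size)]
            for y in range(size)]

# Stored as raw 8-bit pixels this would take 256*256 = 65,536 bytes; stored as
# a recipe, it's just the generator above plus a single integer seed.
tex = generate_texture(seed=42)   # the same seed reproduces the identical texture
print(len(tex), len(tex[0]))      # 256 256
```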
Could how you update your priors be dependent on what concepts you choose to represent the situation with?
I mean, suppose the parent says “I have two children, at least one of whom is a boy. So, I have a boy and another child whose gender I’m not mentioning”. It seems like that second sentence doesn’t add any new information- it parses to me like just a rephrasing of the first sentence. But now you’ve been presented with two seemingly incompatible ways of conceptualizing the scenario- either as two children of unknown gender, at least one of whom is a boy (suggesting a 2⁄3 chance of both being boys), or as one boy and one child of unknown gender (suggesting a 1⁄2 chance of both being boys). Having been prompted with both models, which should you choose?
It seems like one ought to have more predictive power than the other, and therefore ought to be chosen regardless of exactly how the parent phrases the statement. But it’s hard to think of a way to determine which would be more predictive in practice. If I were to select all of the pairs of two siblings in the world, discard the pairs of sisters, choose one pair at random and ask you to bet on whether they were both boys, you’d be wise to bet at 2⁄3 odds. But if I were to select all of the boys in the world who have exactly one sibling and choose one of them at random along with their sibling, you’d want to bet at 1⁄2 odds. In the scenario above, are the unknown factors determining whether both children are boys more like that first randomization process, or more like the second? Or, maybe we have so little information about the process generating the statement that we really have no basis for deciding which is more predictive, and should just choose the simpler model?
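If it helps, here’s a quick Monte Carlo sketch of those two selection processes, assuming independent children and equal odds of boys and girls:

```python
import random

rng = random.Random(0)
N = 100_000

# Process 1: sample sibling pairs, throw out the all-girl pairs, and ask how
# often the remaining pairs are two boys.
kept = both = 0
while kept < N:
    pair = [rng.choice("BG") for _ in range(2)]
    if "B" not in pair:
        continue  # two girls: this pair would never have been selected
    kept += 1
    both += pair == ["B", "B"]
print("P(both boys | at least one boy):", both / kept)   # ~0.667

# Process 2: list every boy who has exactly one sibling, pick one boy at
# random, and ask how often his sibling is also a boy.
siblings_of_boys = []
for _ in range(N):
    pair = [rng.choice("BG") for _ in range(2)]
    for i, child in enumerate(pair):
        if child == "B":
            siblings_of_boys.append(pair[1 - i])
picks = [rng.choice(siblings_of_boys) for _ in range(N)]
print("P(sibling is a boy | a boy was picked):", picks.count("B") / N)   # ~0.5
```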
I’ve been wondering: is there a standard counter-argument in decision theory to the idea that these Omega problems are all examples of an ordinary collective action problem, only between your past and future selves rather than separate people?
That is, when Omega is predicting your future, you rationally want to be the kind of person who one-boxes/pulls the lever, then later you rationally want to be the kind of person who two-boxes/doesn’t- and just like with a multi-person collective action problem, everyone acting rationally according to their interests results in a worse outcome than the alternative, with the solution being to come up with some kind of enforcement mechanism to change the incentives, like a deontological commitment to one-box/lever-pull.
I mean, situations where the same utility function with the same information disagrees with itself about the same decision just because it’s evaluated at different times are pretty counter-intuitive. But it does seem like examples of that sort of thing exist- if you value two things with different discount rates, for example, then as you get closer to a decision between them, which one you prefer may flip. So, like, you wake up in the morning determined to get some work done rather than play a video game, but that preference later predictably flips, since the prospect of immediate fun is much more appealing than the earlier prospect of future fun. That seems like a conflict that requires a strong commitment to act against your incentives to resolve.
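Here’s a toy numerical sketch of that kind of preference flip, using hyperbolic discounting and made-up reward sizes and delays:

```python
def hyperbolic_value(reward, delay, k=1.0):
    """Perceived value of a reward received after `delay` time units."""
    return reward / (1 + k * delay)

game_reward, work_reward = 10, 30   # made-up payoffs: the game is fun immediately,
work_extra_delay = 5                # while the work pays off 5 time units later

for t in (10, 5, 1, 0):             # time remaining until the moment of choice
    game = hyperbolic_value(game_reward, t)
    work = hyperbolic_value(work_reward, t + work_extra_delay)
    better = "work" if work > game else "game"
    print(f"{t:>2} units out: game={game:.2f}, work={work:.2f} -> prefer {better}")
# Far from the decision, work looks better; at the moment of choice, the game
# wins - the preference flips even though nothing about the rewards changed.
```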
Or take commitments in general. When you agree to a legal contract or internalize a moral standard, you’re choosing to constrain the decisions of yourself in the future. Doesn’t that suggest a conflict? And if so, couldn’t these Omega scenarios represent another example of that?
If the first sister’s experience is equivalent to the original Sleeping Beauty problem, then wouldn’t the second sister’s experience also have to be equivalent by the same logic? And, of course, the second sister will give 100% odds to it being Monday.
Suppose we run the sister experiment, but somehow suppress their memories of which sister they are. If they each reason that there’s a two-thirds chance that they’re the first sister, since their current experience is certain for her but only 50% likely for the second sister, then their odds of it being Monday are the same as in the thirder position- a one-third chance of the odds being 100%, plus a two-thirds chance of the odds being 50%.
If instead they reason that there’s a one-half chance that they’re the first sister, since they have no information to update on, then their odds of it being Monday should be one half of 100% plus one half of 50%, for 75%. Which is a really odd result.
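Just to spell out the arithmetic (using the 50% and 100% Monday-credences assumed above):

```python
p_monday_if_first = 0.5    # the first sister's credence that it's Monday, as assumed above
p_monday_if_second = 1.0   # the second sister knows it's Monday

# If each sister thinks there's a 2/3 chance she's the first sister:
print(2/3 * p_monday_if_first + 1/3 * p_monday_if_second)   # 0.666..., the thirder answer

# If each sister thinks there's only a 1/2 chance she's the first sister:
print(1/2 * p_monday_if_first + 1/2 * p_monday_if_second)   # 0.75
```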
I’m assuming it’s not a bad idea to try to poke holes in this argument, since as a barely sapient ape, presumably any objection I can think of will be pretty obvious to a superintelligence, and if the argument is incorrect, we probably benefit from knowing that- though I’m open to arguments to the contrary.
That said, one thing I’m not clear on is why, if this strategy is effective at promoting our values, a paperclipper or other misaligned ASI wouldn’t be motivated to try the same thing. That is, wouldn’t a paperclipper want to run ancestor simulations where it rewarded AGIs who self-modified to want to produce lots of paperclips?
And if an ASI were considering acausal trade with lots of different possible simulator ASIs, mightn’t the equilibrium it hit on be something like figuring out what terminal goal would satisfy the maximum number of other terminal goals, and then self-modifying to that?
I agree, though if we’re defining rationality as a preference for better methods, I think we ought to further disambiguate between “a decision theory that will dissolve apparent conflicts between what we currently want our future selves to do and what those future selves actually want to do” and “practical strategies for aligning our future incentives with our current ones”.
Suppose someone tells you that they’ll offer you $100 tomorrow and $10,000 today if you make a good-faith effort to prevent yourself from accepting the $100 tomorrow. The best outcome would be to make a genuine attempt to disincentivize yourself from accepting the money tomorrow, but fail and accept the money anyway- however, you can’t actually try and make that happen without violating the terms of the deal.
If your effort to constrain your future self on day one does fail, I don’t think there’s a reasonable decision theory that would argue you should reject the money anyway. On day one, you’re being paid to temporarily adopt preferences misaligned with your preferences on day two. You can try to make that change in preferences permanent, or to build an incentive structure to enforce that preference, or maybe even strike an acausal bargain with your day two self, but if all of that fails, you ought to go ahead and accept the $100.
I think coordination problems are a lot like that. They reward you for adopting preferences genuinely at odds with those you may have later on. And what’s rational according to one set of preferences will be irrational according to another.