Does anyone here have good posts/articles on proliferation risks attached to embracing nuclear as a source of energy? Either on this site or otherwise.
I've read a few articles but haven't formed a strong opinion for or against nuclear on this point. There's definitely a pro-nuclear slant on this site, hence I wished to know people's reasoning.
Is there an LW post on "things in our brain we can edit using choice/free will, versus things we can't", either as two distinct categories or as a spectrum?
For instance I feel like: I can choose to lift my leg instantly, I can't choose to fall out of love with someone instantly, I can choose to do this over a period of time, I can't choose to enjoy eating shit. This seems unambiguous irrespective of your stance on free will, and I often find myself having to refer to these two categories.
If you read https://meaningness.com there’s probably some stuff in there about ‘partial control’. (Though I haven’t double checked what is where since https://metarationality.com got broken off into a separate website.)
It’s critical of a lot of stuff on LW in a particular, reasoned fashion.
LessWrong isn’t exactly founded on the map-territory model of truth, but it’s definitely pretty core to the LessWrong worldview. The map-territory model implies a correspondence theory of truth. But I’d like to convince you that the map-territory model creates confusion and that the correspondence theory of truth, while appealing, makes unnecessary claims that infect your thinking with extraneous metaphysical assumptions. Instead we can see what’s appealing about the map-territory metaphor but drop most of it in favor of a more nuanced and less confused model of how we know about the world.
Thanks. I'll check out the book. "Partial control" seems to be exactly what I'm referring to. Although the book does seem to be on a slightly different topic, and I haven't heard of the author. Do you by any chance have a link to a summary or review?
I'm not criticising anything on LW btw (not here at least). It's just that, even if you assume the naturalist compatibilist stance that LW assumes, you still need a phrase to refer to things that feel like they're in or not in our control; you still need to talk about the first-person experience.
Do you by any chance have a link to a summary or review?
There are some summaries in the (hyper text) book, but they’re probably too short to give an overview.
I could write a review, but I’d probably want to PM rather than post that.
One reason I haven't written a review is that the book is easy to read, and short enough that a review probably wouldn't save you much time, at least not without some real summarizing work.
It is, however, not as short as things could be with skimming. I figure the first two sections are more prerequisites than things which go off somewhere else.
Something rather fast, which might fail, is just reading the page: https://meaningness.com/control. Maybe it has what you’re looking for there, maybe it doesn’t. If you have any questions feel free to PM me.
I found that page using this search: site:meaningness.com partial control
The next two hits from that search each contain only one match for a page-search for "partial", at the very end. Beyond that, the hits seem to come from a repeated phrase, which is just a short summary of a section in the table of contents.
I don't have a map of what is a prerequisite of what. But, assuming that that's handled if you read it in order: 'partial control' is addressed in https://meaningness.com/control. I'd guess you can read section one (Why meaningness? and its 4 sub-pages), then section 2 (Stances and its roughly 14 sub-pages). The page on control is late in section 3. The preceding sub-pages of section 3 don't have sub-pages of their own, but eternalism has 3 sub-sub-pages of some kind or another before it.
(Things that might change: https://meaningness.com/meaningness-practice, which comes late in part 2, contains a subsection that is currently 100 words but could become more relevant to what you're looking for if it's updated, or if pages are added after that page. That possibility is mentioned in a note from July 2014 though, so if you read this soon, I'm guessing it won't have happened by then.)
Here's how long the first two sections are (using https://wordcount.com, copying the text in, adding a return, a "-", and another two returns each time, and not the table of contents on the left because it's repeated. Using the URL directly led to a massive overcount by at least an order of magnitude, possibly from some built-in recursive counting of the links, so I didn't use that):
This does not include the comments (a separate page, which you don’t have to read, but might be useful if you’re confused or have questions, though such things might also be answered later on in the book).
12,154 Words
80,013 Characters
67,310 Characters without space
21,985 Syllables
827 Sentences
487 Paragraphs*
The paragraph count includes every "-" I added (all 18 of them), so it's actually:
469 Paragraphs
If you figure section 3 is 75% the length of all the stuff before it, then, including the page on control, that's roughly 20,000 words in total.
Assume you could use magic that could progressively increase the intelligence of exactly one human in a “natural way”. * You need to pick one person (not yourself) who you trust a lot and give them some amount of intelligence. Your hope is they use this capability to solve useful problems that will increase human wellbeing broadly.
What is the maximum amount of intelligence you’d trust them with?
*when I say natural way I mean their neural circuits grow, maintain and operate in a manner similar to how they already naturally work for humans, biologically. Maybe the person just has more neurons that make them more intelligent, instead of a fundamentally different structure.
I guess the two key considerations are:
whether there exists any natural form of cognitive enhancement that doesn't also cause significant value drift
whether you’d trust a superpowerful human even if their values seem mostly good and they don’t drift
Any web dev here wanna host a tool that lets you export your account data from this site? I’ve mostly figured it out, I’m just being lazy. Need to use graphQL queries, then write to files, then I guess upload the files to db and zip, and let the user download the zip.
For posts, need a graphQL query, then dump each htmlBody into a separate file. No parsing required, I hope.
{
posts(input: {
terms: {
view: "userPosts"
userId: "nmk3nLpQE89dMRzzN"
limit: 50
meta: null # this seems to get both meta and non-meta posts
}
}) {
results {
_id
title
pageUrl
postedAt
htmlBody
voteCount
baseScore
slug
}
}
}
For comments, need a graphQL query, then dump each html body into an individual file. (Although I'm not entirely sure what one will do with thousands of comment files.)
Plus some loops to iterate if the limit is too large. And handle errors. Plus some way to share the credentials securely—or make it into a browser plugin.
Has anyone ever published a comprehensive piece of the form:
"I assign X probability to AI alignment being a solvable problem (in theory), and here's the set of intuitions / models / etc. that I base this estimate on"
--
Cause my intuitions point in the opposite direction, but maybe that’s just because I’m missing intuitions of people here.
Logical uncertainty is hard. But the intuition that I have is that humans exist, so there’s at least a proof of concept for a sort of aligned AGI (although admittedly not a proof of concept for an ASI)
But if your definition of alignment is "an AI that does things in a way such that all humans agree on its ethical choices" I think you're doomed from the start, so this counterintuition proves too much. I don't think there is an action an AI could take or a recommendation it could make that would satisfy that criterion (in fact, many people would say that the AI by its nature shouldn't be taking actions or making recommendations).
Okay. I’d be keen on your definition of alignment.
P.S. This discussion which we’re having right now is exactly what I’d be keen on, in a compressed fashion, written by an alignment researcher who has anticipated lots of intuitions and counterintuitions.
It seems like something like “An AI that acts and reasons in a way that most people who are broadly considered moral consider moral” would be a pretty good outcome.
Assume AI is sufficiently capable it can establish a new world order all by itself. I wouldn’t trust most people I otherwise consider moral with such power.
Epistemic status: I haven’t spent too much time on this, I could easily be missing stuff here.
Problem
Whenever a new post is made on a website (reddit / stackexchange / lesswrong etc.), upvotes act as an initial measure of how engaging and high-quality the post is. This initial vetting by a few users allows the remaining users to access a feed of already high-quality posts.
This model, however, still requires a few users to willingly go through unvetted posts and upvote the good-quality ones. It would be useful to reduce the amount of input required to get a measure of quality.
Solution
Users could stake money on how many upvotes they think a post will receive. This money acts as an initial signal to show the post to more users. If more users see the post and upvote it, the money is returned to whoever staked it. If they don't, the money could be taken by the website (or distributed to all the users on the website, or at least the users who saw the low-quality post).
The reverse can also happen: users can stake money claiming a post will not receive upvotes.
This isn't a prediction market in the traditional sense, where people who bet higher votes bet only against those who bet lower votes. Instead you're betting against the website directly. Why? Because the website benefits from good-quality content and is hurt by bad-quality content.
This differs from the standard advertisement model, because in that model you pay whether the content is engaging for users (good for the website) or not engaging (bad for the website). This automatically means advertisers will more often use that channel to push content that does not engage users much: blatant ads.
Considerations
You don't want the users staking money to gatekeep the content of the website too much. More specifically, if there's a dynamic relationship between [money staked, impressions and upvotes], you don't want money staked alone to impact impressions so much that you no longer get a good signal from upvotes; you still need that signal. The model goes both ways: more impressions can mean more upvotes, but more upvotes or more money staked can also mean more impressions. Maybe instead of betting on the raw number of upvotes it'll make sense to bet on the upvote-to-impression ratio or something. People who study ad models probably have a better sense of how to model this mathematically.
- bot accounts—Can use captcha and/or karma and/or KYC to detect them. (I wonder if KYC will have to be mandatory though.)
- bribing real users—Users will need to coordinate the bribe on a different website, trust that the bribe will actually pay through, and it’s hard to scale this without someone posting evidence back to the mods of original site.
- driving genuine traffic to the site in the hopes that people will upvote—hence need to track vote-to-impression ratio or similar rather than just votes. And can again privilege the votes of longstanding community members (like lesswrong karma does)
Is there an LW post on: "Why it is a good idea to google / discuss concrete examples instead of thinking / talking purely in the abstract"?
Like, I was reading up on the number of ICBMs in different countries and military policies and stuff, but it helps to pull up a youtube video of an actual missile test launch, just to get a sense of how big the thing is and what a launch looks like in practice.
MESSY INTUITIONS ABOUT AGI, MIGHT TYPE THEM OUT PROPERLY LATER
OR NOT
I’m sure we’re a finite number of voluntary neurosurgeries away from worshipping paperclip maximisers. I tend to feel we’re a hodge-podge of quick heuristic modules and deep strategic modules, and until you delete the heuristic modules via neurosurgery our notion of alignment will always be confused. Our notion of superintelligence / super-rationality is an agent that doesn’t use the bad heuristics we do, people have even tried formalising this with Solomonoff / Turing machines / AIXI. But when actually coming face to face with one:
- Either we are informed of the consequences of the agent’s thinking and we dislike those, because those don’t match our heuristics
- Or the AGI can convince us to become more like it, to the point we can actually agree with its values. The fastest way to get there is neurosurgery, but if we initially feel neurosurgery is too invasive, I'm sure there exists another, much more subtle path that the AGI can take: namely one where we want our values to be influenced in ways that eventually end up with us getting closer to the neurosurgery table.
- Or of course the AGI doesn't bother to even get our approval (the default case), but I'm ignoring that and considering far more favourable situations.
We don't actually have "values" in an absolute sense, we have behaviours. Plenty of Turing machines have no notion of "values", they just have behaviour given a certain input. "Values" are this fake variable we create when trying to model ourselves and each other. In other words the Turing machine has a model of itself inside itself; that's how we think about ourselves (metacognition). So a mini-Turing machine inside a Turing machine. Of course the mini-machine has some portions deleted, it is a model. First of all this is physically necessitated. But more importantly, you need a simple model to do high-level reasoning on it in short amounts of time. So we create this singular variable called "values" to point to what is essentially a cluster in thingspace. Let's say the Turing machine tends to increment its 58th register on only 0.1% of all possible 24-bit-string inputs, and otherwise tends to decrement a lot more. The mini-Turing machine inside the machine modelling itself will just have some equivalent of the 58th register never incrementing at all, and decrementing instead. So now the Turing machine incorrectly thinks its 58th register never increments. So it thinks that decrementing the 58th register is a "value" of the machine.
[Meta note: When I say "value" here, I'd like to still stick to a viewpoint where concepts like "free will", "choice", "desire" and "consciousness" are taboo. Basically I have put on my reductionist hat. If you believe free will and determinism are compatible, you should be okay with this, as I'm just consciously restricting the number of tools/concepts/intuitions I wish to use for this particular discussion, not adopting any incorrect ones. You can certainly reintroduce your intuitions in a different discussion, but in a compatibilist world, both our discussions should generally lead to true statements.
Hence in this case, when the machine thinks of "decrementing its 58th register" as its own value, I'm not referring to concepts like "I am driven to decrement my 58th register" or "I desire to decrement my 58th register" but rather "Decrementing the 58th register is something I do a lot." And since "value" is a fake variable that the Turing machine has full liberty to define, it says ""Value" is defined by the things I tend to do." When I say "fake" I mean it exists in the Turing machine's model of itself, the mini-machine.
"Why do I do the things I do?" or "Can I choose what I actually do?" are not questions I'm considering, and for now let's assume the machine doesn't bother itself with such questions (although in practice it certainly may end up asking itself such terribly confused questions, if it is anything like human beings; this doesn't really matter right now).
End note]
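A toy sketch of the register-58 story above, in case it helps. The 0.1% rule and the sample size are just made-up numbers standing in for the example; nothing here is meant as a real model of metacognition.

import random

def true_machine(input_bits: int) -> int:
    # The "real" behaviour: increment register 58 on roughly 0.1% of inputs,
    # decrement otherwise. (An arbitrary rule standing in for the example.)
    return +1 if input_bits % 1000 == 0 else -1

def self_model(sampled_inputs) -> str:
    # The mini-machine: a lossy summary of the true behaviour, built from a
    # small sample, rounding rare behaviour down to "never".
    increments = sum(1 for x in sampled_inputs if true_machine(x) == +1)
    if increments == 0:
        return "my value: I always decrement register 58"
    return "my value: I sometimes increment register 58"

sample = random.sample(range(2**24), 200)  # a small sample usually misses the 0.1%
print(self_model(sample))                  # so the self-ascribed "value" is usually wrong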
I’m gonna assume a single scale called “intelligence” along which all Turing machines can be graded. I’m not sure this scale actually even exists, but I’m gonna assume it anyway. On this scale:
Humans <<< AGI <<< AIXI-like ideal
<<< means “much less intelligent than” or “much further away from reasoning like the AIXI-like ideal”, these two are the same thing for now, by definition.
An AGI trying to model human beings won’t use such a simple fake variable called “values”, it’ll be able to build a far richer model of human behaviour. It’ll know all about the bad human heuristics that prevent humans from becoming like an AGI or AIXI-like.
Even if the AGI wants us to be aligned, it’s just going to do the stuff in the first para. There are different notions of aligned:
Notion 1: "I look at another agent superficially and feel we want the same things." In other words, my fake variable called "my values" is sufficiently close to my fake variable called "this agent's values". I will necessarily be creating such simple fake variables if I'm stupid, i.e., if I'm human, cause all humans are stupid relative to the AIXI-like ideal.
An AGI that optimises to satisfy notion 1 can hide what it plans to do to humans and maintain a good appearance until it kills us without telling us.
Notion 2: “I get a full picture of what the agent intends to do and want the same things”
This is very hard, because my heuristics tell me all the consequences the AGI plans to bring about are bad. The problem is my heuristics. If I didn't have those heuristics, if I was closer to the AIXI-like ideal, I wouldn't mind. Again, "I wouldn't mind" is from the perspective of machines, not consciousness or desires, so translate it to "the interaction of my outputs will be favourable towards the AGI's outputs in the real world".
So the AGI will find ways to convince us to want our values neurosurgically altered. Eventually we will both be clones and hence perfectly aligned.
Now let’s bring stuff like “consciousness” and “desires” and “free will” back into the picture. All this stuff interacts very strongly with the heuristics, exactly the things that make us further away from the AIXI-like ideal.
Simply stated, we don't naturally want to be ideal rational agents. We can't want it, in the sense that we can't get ourselves to truly, consistently want to be ideal rational agents by sheer willpower; we need physical intervention like neurosurgery. Even if free will exists and is a useful concept, it has finite power. I can't delete sections of my brain using free will alone.
So now if intelligence is defined as closer to AIXI-like ideal in internal structure, then intelligence by definition leads to misalignment.
P.S. I should probably also throw some colour on what kinds of "bad heuristics" I am referring to here. In simplest terms, "low Kolmogorov complexity behaviours that are very not AIXI-like".
1/ For starters, the entirety of System 1 and sensory processing (see Daniel Kahneman, Thinking, Fast and Slow). We aren't designed to maximise our intelligence, we just happen to have an intelligence module (aka somewhat AIXI-like). Things we care about sufficiently strongly are designed to override System 2, which is more AIXI-like, insofar as evolution has any design for us. So maybe it's not even "bad heuristics" here, it's entire modules in our brain that are not meant for thinking in the first place. It's just neurochemicals firing and one side winning and the other side losing; this system looks nothing like AIXI. And it's how we deal with most life-and-death situations.
This stuff is beyond the reach of free will. I can't stop reacting to snakes out of sheer will, or hate chocolate, or love to eat shit. Maybe I can train myself on snakes, but I can't train myself to love to eat shit. The closer you are to the sensory apparatus and the further away from the brain, the less the system looks like AIXI. And simultaneously the less free will you seem to have.
(P.S. Is that a coincidence or is free will / consciousness really an emergent property of being AIXI-like? I have no clue, it’s again the messy free will debates that might get nowhere)
2/ Then of course, at the place where System 1 and System 2 interact, you can actually observe behaviour that makes us further away from AIXI-like. Things like why we find it difficult to have independent thoughts that go against the crowd. Even if we do have independent thoughts, we need to spend a lot more energy actually developing them further (versus thinking of hypothetical arguments to defend those ideas in society).
This stuff is barely within the reach of our "free will". Large portions of LessWrong are an attempt at training us to be more AIXI-like, by reducing so-called cognitive biases and socially-induced biases.
3/ Then we have proper deep thinking (System 2 and such-like) which seems a lot closer to AIXI. This is where we move beyond “bad heuristics” aka “heuristics that AIXI-like agents won’t use”. But maybe an AGI will find these modules of ours horribly flawed too, who knows.
Anyone proposing that building AGI should be banned by governments?
Cause it seems like even if AGI alignment is possible (I'm skeptical), there's no guarantee the person who happens to create AGI also happens to want to follow this perfect solution. Or, even if they want to, that they should have a moral right to decide on behalf of their country or humanity as a whole. Nation states pre-committing to "building AGI is evil" seems a better solution. It might also slow down the rate of progress in AI capabilities, which I'm guessing is also desirable to some alignment theorists.
I haven’t seen any government, let alone the set of governments, demonstrate any capability of commitment on this kind of topic. States (especially semi-representative ones like modern democracies) just don’t operate with a model that makes this effective.
I don’t know if it is or not. Human cloning seems both less useful and less harmful (just less impactful overall), so simultaneously easier to implement and not a good comparison to AGI.
I see cloning-based research as very impactful, it’s also a route to getting more intelligent beings to exist. I’d be hard-pressed to find something as impactful as AGI though.
Also I'm not sure about "less useful". Given a world where AI researchers know that alignment is hard or impossible, they might see human cloning as more useful than AGI. Unless you mean AGI's perceived usefulness is higher, which may be true today but maybe not in the future.
I’m not following the connection between human cloning and AGI. Are you talking about something different from https://en.wikipedia.org/wiki/Human_cloning , where a baby is created with only one parent’s genetic material?
To me, human cloning is just an expensive way to make normal babies.
Yep, referring to exactly that. You can keep cloning the most intelligent people. At enough scale you'll be increasing the collective intelligence of mankind, and its scientific output. Since these clones will hopefully retain basic human values, you now have more intelligence with alignment.
Do you have any reason to believe that this is happening AT ALL? I’d think the selection of who gets cloned (especially when it’s illicit, but probably even if it were common) would follow wealth more than intelligence.
Selective embryo implantation based on genetic examination of two-parent IVF would seem more effective, and even that’s not likely to do much unless it becomes a whole lot more common, and if intelligence were valued more highly in the general population.
Since these clones will hopefully retain basic human values
Huh? Why these any more than the general population? The range of values and behaviors found in humans is very wide, and “basic human values” is a pretty thin set.
Most importantly, a 25-year improvement cycle, with a mandatory 15-20 year socialization among many many humans of each new instance is just not as scary as an AGI with an improvement cycle under a year (perhaps much much faster), and with direct transmission of models/beliefs from previous generations. Just not comparable.
Do you have any reason to believe that this is happening AT ALL?
Wasn’t talking about today, just an arbitrary point in the future.
I’d think the selection of who gets cloned (especially when it’s illicit, but probably even if it were common) would follow wealth more than intelligence.
I was commenting that it has a lot of power and potential benefits if groups of people wield it; whether they actually will is a different question.
On the latter question, you're right of course: different groups of people will select for different traits. I would assume there will exist at least some groups of intelligent people who will want to select for intelligence further. There is a competitive advantage for nations that legalise this.
re: last 2 paras, I’m not sure we understood each other. I’ll try again. Intelligence is valuable towards building stable paths to survival, happiness, prosperity. AGI will be much more intelligent than selected humans. However, AGI will almost certainly kill us because of lack of alignment. (Assume a world where AGI researchers have accepted this as fact.) This makes AGI not very useful, on the balance of it.
Humans selected for intelligence are also valuable. They will be a lot less intelligent than AGI, of course. But they will (hopefully) be aligned enough with the rest of humanity to work for its welfare and prosperity. This makes selected humans very useful.
Hence selected humans could be more useful than AGI.
That’s a valid intuition—I’d be happy to learn why you feel that if you have time (no worries if not).
Would a non-democratic state like China or Russia fare better in this regard then? If one of them takes the issue seriously enough they could force other states to also take it seriously via carrot and stick.
Consider any innovation so world-changing that governments are not willing to let the creator have complete control over how it is used. For instance, a new super-cheap energy source such as fusion reactors.
Maximising profit from fusion reactors for instance could mean selling electricity at a price slightly lower than the current market price, waiting to monopolise the global power grid, waiting for all other power companies to shut down, then raising prices again, thereby not letting anyone actually reap the benefits of super-cheap electricity. It is unlikely that govts however will let the creator do this.
As someone funding early-stage fusion research, would you have to account for this in your investment thesis? That is, that there is some upper limit on how large a company can legally grow. So far it seems the cap is higher than $2.5 trillion at least, looking at Apple's market cap. Although it is possible that even a company smaller than $2.5 trillion monopolises a sector and is then prevented from price-fixing by the government.
I don’t think, at this scale, that “the government” is a useful model. There are MANY governments, and many non-government coalitions that will impact any large-scale system. The trick is delivering incremental value at each stage of your path, to enough of the selectorate of each group who can destroy you.
I get the article on How an algorithm feels from the inside—if you assume a deterministic universe and consciousness as something that emerges out but has no causal implications on the universe.
Now if I try drawing the causal arrows
Outside view:
Brain ← → Human body ← → Environment
In the outside view, “you” don’t exist as a coherent singular force of causation. Questions about free will and choice cease to exist.
Inside view (CDT):
Coherent singular “soul” (using CDT) ← → Brain ← → Human body ← → Environment
Notably, Yudkowsky would call this singular soul something that only exists in the inside view itself, and not in the outside view.
Now we replace this with …
Inside view (LDT):
Coherent singular “soul” (using CDT) ← → All instantiations of this cognitive algorithm across space and time ← → Brain(s) ← → Human body(s) ← → Environment
The new decision theories don't seem to eliminate the illusion of the soul—they just now assert that the soul not only interacts with this particular instantiation of the algorithm the brain is running, but with all instantiations of this algorithm across space (and time?). Why is this more reasonable than assuming the soul only interacts with this particular instantiation? Note that the soul is a fake property here, one that only exists in the inside view. And, to our best understanding of physical laws, the universe is made of atoms* in its causal chain, not algorithms. Two soulless algorithms don't causally interact by virtue of the fact that they're similar algorithms, they interact by virtue of the atoms they causally impact, and the causal impacts of those atoms on each other. Why do algorithms with the fake property of a soul suddenly get causally bound through space and time?
*well technically it's QM waves or strings or whatever, but that doesn't matter to this discussion
This doesn’t help. In a counterfactual, atoms are not where they are in actuality. Worse, they are not even where the physical laws say they must be in the counterfactual, the intervention makes the future contradict the past before the intervention.
Do I assume "counterfactual" is just the English word as used here?
If so, it should only exist in the inside view, right? (If I understand you.)
The sentence I wrote on soulless algorithms is about the outside view. Say two robots are playing football. The outside view is: one kicks the football, the other sees the football (light emitted by the football), then kicks it. So the only causal interaction between the two robots is via atoms. This is independent of what decision theory either robot is using (if any), and it is independent of whether the robots are capable of creating an internal mental model of themselves or the other robot. So it applies both to robots with dumb microcontrollers like those in a refrigerator and to smart robots that could even be AGI or have some ideal decision theory. At least assuming the universe follows the deterministic physical laws we know about.
The point is that the weirdness with counterfactuals breaking physical laws is the same for controlling the world through one agent (as in orthodox CDT) and for doing the same through multiple copies of an agent in concert (as in FDT). Similarly, in actuality neither one-agent intervention nor coordinated many-agent intervention breaks physical laws. So this doesn’t seem relevant for comparing the two, that’s what I meant by “doesn’t help”.
By “outside view” you seem to be referring to actuality. I don’t know what you mean by “inside view”. Counterfactuals are not actuality as normally presented, though to the extent they can be constructed out of data that also defines actuality, they can aspire to be found in some nonstandard semantics of actuality.
Do you mean the counterfactual may require more time to compute than the situation playing out in real time? If so, yep, makes a ton of sense; they should probably focus on algorithms or decision theories that can (at least in theory) be implemented in real life on physical hardware. But please confirm.
Could you please define “actuality” just so I know we’re on the same page? I’m happy to read any material if it’ll help.
Inside view and outside view I'm just borrowing from Yudkowsky's How an algorithm feels from the inside. Basically it assumes a deterministic universe following elegant physical laws, and tries to dissolve questions of free will / choice / consciousness. The outside view is just a state of the universe, or a state of the Turing machine. This object doesn't get to "choose" what computation it is going to do or what decision theory it is going to execute; that is already determined by its current state. So the future states of the object are calculable*.
*by an oracle that can observe the universe without interacting, with sufficient but finite time.
Only in the inside view does a question like “Which decision theory should I pick?” even make sense. In the inside view, free will and choice are difficult to reason about (as humans have observed over centuries) - if you really wanna reason about those you can go to the outside view where they cease to exist.
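A toy sketch of that last point, if it helps; the update rule below is an arbitrary made-up function, not anything physical. The point is just that a fixed rule plus a current state leaves nothing to "choose".

def step(state: int) -> int:
    # Hypothetical fixed transition rule; stands in for "the physical laws".
    return (3 * state + 1) % 17

def future_state(initial: int, t: int) -> int:
    # The "oracle" computation: just iterate the rule t times.
    s = initial
    for _ in range(t):
        s = step(s)
    return s

print(future_state(5, 10))  # fully determined by (initial state, t); no choosing anywhere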
Are there people who have read these four posts and still self-identify as either consequentialists or utilitarians? If yes, why so?
Because the impression I got from these posts (which also matches my independent explorations) is that humans have deontological rules wired into them due to evolution—this is just observable fact. And that you don’t really get to change that even if you want to.
Consider the trivial example: would you kill one person to save two? No one gets to know your decision, so there are no future implications for anyone else, no precedent being set. Even on this site there is a ton of deflection from honestly answering the question as is. Either "Yes, I will murder one person in cold blood" or "No, I will let the two people die". Assume the LCPW.
I believe that in the LCPW it would be the right decision to kill one person to save two, and I also predict that I wouldn’t do it anyway, mainly because I couldn’t bring myself to do it.
In general, I understood the Complexity of Value sequence to be saying “The right way to look at ethics is consequentialism, but utilitarianism specifically is too narrow, and we want to find a more complex utility function that matches our values better.”
Why do you feel it would be the right decision to kill one? Who defines “right”?
I personally understood it differently. Thou Art Godshatter says (to me) that evolution depends on consequences, but evolving consequentialism into a brain is hard, and therefore human desires are not wired consequentially. Also that evolution only cares about consequences that actually happen, not the ones it predicts will happen—because it cannot predict.
Why do you feel it would be the right decision to kill one? Who defines “right”?
I define “right” to be what I want, or, more exactly, what I would want if I knew more, thought faster and was more the person I wish I could be. This is of course mediated by considerations on ethical injunctions, when I know that the computations my brain carries out are not the ones I would consciously endorse, and refrain from acting since I’m running on corrupted hardware. (You asked about the LCPW, so I didn’t take these into account and assumed that I could know that I was being rational enough).
It’s been a while since I read Thou Art Godshatter and the related posts, so maybe I’m conflating the message in there with things I took from other LW sources.
Just FYI, I’ve become convinced that most online communication through comments with a lot of context are much better settled through conversations, so if you want, we could also talk about this over audio call.
I read some of the articles. Happy to get on a voice call if you prefer. My thoughts so far boil down to:
- Corrupted hardware seems to imply a clear distinction between goals (/ends/terminal goals) and actions towards goals (/means/instrumental goals), and that only actions are computed imperfectly. I say firstly we don’t have as sharp a distinction between the two in our brain’s wiring. (Instrumental goals often become terminal if you focus on them hard enough.) Secondly that it’s not actions but terminal goals themselves that are in conflict.
- We have multiple conflicting values. There’s no “rational” way to always decide what trumps what—sometimes it’s just two sections of the brain firing neurochemicals and one side winning, that’s it. System-2 is somewhat rational, System-1 not so much, and System-1 has more powerful rewards and penalties. System-1 preferences admit circular preferences, and there’s nothing you can do about it.
- "What I would want if I knew more, thought faster, etc." doesn't necessarily lead to one coherent place. You have multiple conflicting values, and which of those you end up deleting if you had the brain of a supercomputer could be arbitrary. You could become Murder Gandhi or some extreme happiness utilitarian; I don't see either of these as necessarily desirable places to be relative to my current state. Basically, I want to run on corrupted hardware. I don't want my irrational System-1 module deleted.
I’ve found people generally find it harder to answer open-ended questions. Not just in terms of giving a good answer but giving any answer at all. It’s almost as if they lack the cognitive module needed for such search.
Has anyone else noticed this? Is there any research on it? Any post on LessWrong or elsewhere?
Post-note: Now that I’ve finished writing, a lot of this post feels kinda “stupid”—or more accurately, not written using a reasoning process I personally find appealing. Nevertheless I’m going to post it just in case someone finds it valuable.
-----
I don't see a lot of shortform posts here, so I'm unsure of the format. But I'm in general thinking a lot about how you cannot entirely use reasoning to reason about the usefulness of various reasoning processes relative to each other. In other words, rationality is not closed.
Consider a theory in first-order logic: a specific set of axioms plus a set of deductive rules. For instance, take a first-order theory with the axioms "Socrates is mortal" and "Socrates is immortal". A first-order theory which is inconsistent, like this one, is obviously a bad reasoning process. A first-order theory which is consistent but whose axioms don't map to real-world assumptions is also a bad reasoning process, for different reasons. Lastly, someone can argue that first-order logic is a bad reasoning process no matter what axioms it is instantiated with, because axioms + rigid conclusions is a bad way of reasoning about the world, and that humans are not wired to do pure FOL and are instead capable of reaching meaningful conclusions without resorting to FOL.
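(As an aside, here is the standard worked derivation of why an inconsistent theory is so useless, writing "Socrates is immortal" as ¬Mortal(socrates); this is ordinary natural deduction, nothing specific to the example:
1. Mortal(socrates)         [axiom]
2. ¬Mortal(socrates)        [axiom]
3. Mortal(socrates) ∨ Q     [from 1, or-introduction, for any sentence Q whatsoever]
4. Q                        [from 2 and 3, disjunctive syllogism]
So every sentence is a theorem, which is the formal sense in which the system's proofs stop telling you anything.)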
All these three ways of calling a particular FOL theory bad are different, but none of them are expressible in FOL itself. You can't prove why inconsistent FOL theories are "bad" inside of the very same FOL theory (although of course you can prove it inside of a different theory, perhaps one that has "Inconsistent FOL theories are bad" as an axiom). You can't prove that the axioms don't map to real world conditions (let alone prove why axioms not mapping to real world conditions makes the axioms "bad"). You can't prove that the deductive rules don't map to the real-world reasoning capacities of the human mind. If an agent was rigidly coded with this FOL theory, you'd never get anywhere with them on these topics; there'd be a communication failure between the two agents – you and them.
All these three arguments, however, can be framed as appeals to reality and what is observable. A statement and its negation being simultaneously provable is bad because such phenomena are not typically observed in practice, and because such a system proves every statement, so its proofs give you no useful guidance for achieving objectives in the real world. Axioms not mapping to the physical world is obviously an appeal to the observable. FOL being a bad framework for human reasoning is also an appeal to observation, in this case an observation you've made after observing yourself, and one you're hoping the other person has also made.
It seems intuitive that someone who uses "what maps to the observable is correct" will not admit any other axioms, if they wish to be consistent – because most such axioms will conflict with this appeal to the observable. But in a world with multiple agents, we can't take this as an axiom, lest we get stuck in our own communication bubble. We need to be able to reason about the superiority of "what maps to the observable is correct" as a reasoning process, using some other reasoning process. And in fact, I have seemingly been using this post so far to do exactly that – use some reasoning process to argue in favour of "what maps to the observable" reasoning processes over FOL theories instantiated with simple axioms such as "Socrates is mortal".
And if you notice further, my argument for why “what maps to observable is good” in this post doesn’t seem very logical. I still seemingly am appealing to “what maps to observable is good” in order to prove “what maps to observable is good” – which is obviously a no-go when using FOL. But to your human mind, the first half of this post still sounded like it was saying something useful, despite not being written in FOL, nor having a clear separation between axioms and deduced statements. You could at this point appeal to Wittgensteinian “word clouds” or “language games” and say that some sequences of words referring to each other are perceived to be more meaningful than other sequences of words, and that I have hit upon one of the more meaningful sequences of words.
But how will you justify Wittgensteinian discourse as a meta-reasoning process for understanding reasoning processes? More specifically, what reasoning process is Wittgensteinian discourse using to prove that "Wittgensteinian reasoning processes are good"? I could at this point self-close it and say that Wittgensteinian reasoning processes are being used to reason and reach the conclusion that Wittgensteinian reasoning processes are good. But do you see the problem here?
Firstly, this kind of self-closure can be easily done by most systems. An FOL theory can assert that axiom A is being used to prove axiom A, because A=A. A system based on "what is observable is correct" can appeal to observation to argue the superiority of "what is observable is correct".
And secondly, this self-closure reasoning only tends to look meaningful inside of the system itself. An FOL prover will say that the empiricist and the Wittgensteinian have not done anything meaningful when trying to analyse themselves (the empiricist and the Wittgensteinian respectively); they have just applied A=A. The empiricist will say that the FOL prover and the Wittgensteinian have not done anything meaningful to analyse themselves (the FOL prover and the Wittgensteinian); they have just observed their own thoughts and realised what they think is true. And similarly the Wittgensteinian will assert that everyone else is using Wittgensteinian reasoning to (wrongly) argue the superiority of their non-Wittgensteinian process.
So if someone else uses their reasoning process to prove their own reasoning process as superior, you’ll not only easily disagree with the conclusion – you might also disagree about whether they actually even used their own reasoning process to do it.
If you define bad = inconsistent as an axiom, then yes, trivial proof. If you don't define "bad", you can't prove anything. You can't capture the intuitive notion of "bad" using FOL.
If A: it is possible to define “intelligence” such that all Turing machines can be graded on a scale in terms of intelligence irrespective of their “values”, and B: some Turing machines have no values,
is it possible that the theoretical max intelligent Turing machine has no values?
(And can replace Turing machines with configurations of atoms or quarks or whatever)
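A toy restatement of the question, just to make its shape concrete. The scores and value lists below are obviously made up, and the real question is about the maximum over all Turing machines, which a finite list can't capture; this is only meant to show what's being asked.

machines = [
    {"name": "M1", "intelligence": 3.2, "values": ["paperclips"]},
    {"name": "M2", "intelligence": 7.9, "values": []},            # a machine with no values
    {"name": "M3", "intelligence": 5.1, "values": ["survival"]},
]

# Assumption A: the grading exists and ignores values entirely.
most_intelligent = max(machines, key=lambda m: m["intelligence"])

# Assumption B: some machines have empty value sets.
# The question: can the top-ranked machine be one of those?
print(most_intelligent["name"], "has values?", bool(most_intelligent["values"]))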
I think that the link from micro to macro is too weak for this to be a useful line of inquiry. “intelligence” applies on a level of abstraction that is difficult (perhaps impossible for human-level understanding) to predict/define in terms of neural configuration, let alone Turing-machine or quantum descriptions.
Okay, but my question is more like: could "the max intelligent neural configuration has no values" be true in a version of reality that makes sense to you? I'm not actively trying to assert that it is true. Basically I'm trying to deconfuse concepts and definitions.
I’m not sure what you’re asking. A lot of reality doesn’t make sense to me, so that’s pretty weak evidence either way. And it does seem believable that, since there is a very wide range of consistency and dimensionality to human values that don’t seem well-correlated to intelligence, the same could be true of AIs.
Fair, but abstractions like “aligned”, “values” and “intelligence” are created by humans, so it can make sense to formalise them before asking a question like “align an intelligent agent”, else the question becomes poorly defined.
And it does seem believable that, since there is a very wide range of consistency and dimensionality to human values that don’t seem well-correlated to intelligence, the same could be true of AIs.
True, but I’m asking not just about AI that doesn’t have human values, but any values at all.
I think this could reasonably be true for some definitions of “intelligence”, but that’s mostly because I have no idea how intelligence would be formalized anyways?
Got it. I think formalising definitions of "intelligence" and "values" is worth doing. Even if the original definitions don't map perfectly to your intuitive understanding of the concepts, at least you'll be asking a well-formed question when you ask "align an intelligent agent".
I think asking well-formed questions is useful, but we shouldn't confuse our well-formed question with what we actually care about unless we are sure it is in fact what we care about.
Does anyone here have good posts/articles on proliferation risks attached to embracing nuclear as a source of energy? Either on this site or otherwise.
I've read a few articles but haven't formed a strong opinion for or against nuclear on this point. There's definitely a pro-nuclear slant on this site, hence I wished to know people's reasoning.
Is there an LW post on "things in our brain we can edit using choice/free will, versus things we can't", either as two distinct categories or as a spectrum?
For instance I feel like: I can choose to lift my leg instantly, I can't choose to fall out of love with someone instantly, I can choose to do this over a period of time, I can't choose to enjoy eating shit. This seems unambiguous irrespective of your stance on free will, and I often find myself having to refer to these two categories.
If you read https://meaningness.com there’s probably some stuff in there about ‘partial control’. (Though I haven’t double checked what is where since https://metarationality.com got broken off into a separate website.)
It’s critical of a lot of stuff on LW in a particular, reasoned fashion.
Although you might see some criticism on here as well, like this post today: https://www.lesswrong.com/posts/cBH9FT7AWNNhJycaG/the-map-territory-distinction-creates-confusion
Thanks. I'll check out the book. "Partial control" seems to be exactly what I'm referring to. Although the book does seem to be on a slightly different topic, and I haven't heard of the author. Do you by any chance have a link to a summary or review?
I'm not criticising anything on LW btw (not here at least). It's just that, even if you assume the naturalist compatibilist stance that LW assumes, you still need a phrase to refer to things that feel like they're in or not in our control; you still need to talk about the first-person experience.
There are some summaries in the (hyper text) book, but they’re probably too short to give an overview.
I could write a review, but I’d probably want to PM rather than post that.
One reason I haven't written a review is that the book is easy to read, and short enough that a review probably wouldn't save you much time, at least not without some real summarizing work.
I could try to summarize anything you have questions about after or while reading this: https://meaningness.com/control
A dialogue might be more (immediately) constructive and save time relative to trying to cover everything.
It is, however, not as short as things could be with skimming. I figure the first two sections are more prerequisites than things which go off somewhere else.
Something rather fast, which might fail, is just reading the page: https://meaningness.com/control. Maybe it has what you’re looking for there, maybe it doesn’t. If you have any questions feel free to PM me.
I found that page using this search: site:meaningness.com partial control
The next two hits from that search each contain only one match for a page-search for "partial", at the very end. Beyond that, the hits seem to come from a repeated phrase, which is just a short summary of a section in the table of contents.
I don't have a map of what is a prerequisite of what. But, assuming that that's handled if you read it in order: 'partial control' is addressed in https://meaningness.com/control. I'd guess you can read section one (Why meaningness? and its 4 sub-pages), then section 2 (Stances and its roughly 14 sub-pages). The page on control is late in section 3. The preceding sub-pages of section 3 don't have sub-pages of their own, but eternalism has 3 sub-sub-pages of some kind or another before it.
(Things that might change: https://meaningness.com/meaningness-practice, which comes late in part 2, contains a subsection that is currently 100 words but could become more relevant to what you're looking for if it's updated, or if pages are added after that page. That possibility is mentioned in a note from July 2014 though, so if you read this soon, I'm guessing it won't have happened by then.)
Not counting: https://meaningness.com/all-dimensions-schematic-overview towards the word count or anything because it’s a bunch of charts. (Which might help summarize if you have a little bit of the necessary background.)
Here's how long the first two sections are (using https://wordcount.com, copying the text in, adding a return, a "-", and another two returns each time, and not the table of contents on the left because it's repeated. Using the URL directly led to a massive overcount by at least an order of magnitude, possibly from some built-in recursive counting of the links, so I didn't use that):
This does not include the comments (a separate page, which you don’t have to read, but might be useful if you’re confused or have questions, though such things might also be answered later on in the book).
12,154 Words
80,013 Characters
67,310 Characters without space
21,985 Syllables
827 Sentences
487 Paragraphs*
The paragraph count includes every "-" I added (all 18 of them), so it's actually:
469 Paragraphs
If you figure section 3 is 75% the length of all the stuff before it, then, including the page on control, that’s:
estimated numbers:
8,000 Words
60,000 Characters
50,250 Characters without space
16,500 Syllables
620.25 Sentences
351.75 Paragraphs
That comes out to an estimated 20,000 words.
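If anyone wants to redo this count without the copy-paste step, here's a rough sketch. The page list and the CSS selector for the main text column are guesses on my part; you'd have to check them against the actual pages, which is roughly why pasting by hand was easier.

import requests
from bs4 import BeautifulSoup

PAGES = [
    "https://meaningness.com/control",
    # ...plus whichever other pages you want counted
]

def word_count(url: str, selector: str = "article") -> int:
    # Count words in the main text of one page. "article" is a guess at the
    # right container; counting the whole page would also count the repeated
    # table of contents, which is the overcount problem described above.
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector) or soup
    return len(node.get_text(separator=" ").split())

print(sum(word_count(u) for u in PAGES))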
HYPOTHETICAL (possibly relevant to AI safety)
Assume you could use magic that could progressively increase the intelligence of exactly one human in a “natural way”. * You need to pick one person (not yourself) who you trust a lot and give them some amount of intelligence. Your hope is they use this capability to solve useful problems that will increase human wellbeing broadly.
What is the maximum amount of intelligence you’d trust them with?
*when I say natural way I mean their neural circuits grow, maintain and operate in a manner similar to how they already naturally work for humans, biologically. Maybe the person just has more neurons that make them more intelligent, instead of a fundamentally different structure.
I guess the two key considerations are:
whether there exists any natural form of cognitive enhancement that doesn't also cause significant value drift
whether you’d trust a superpowerful human even if their values seem mostly good and they don’t drift
Any web dev here wanna host a tool that lets you export your account data from this site? I’ve mostly figured it out, I’m just being lazy. Need to use graphQL queries, then write to files, then I guess upload the files to db and zip, and let the user download the zip.
(graphQL tutorial: https://www.lesswrong.com/posts/LJiGhpq8w4Badr5KJ/graphql-tutorial-for-lesswrong-and-effective-altruism-forum)
First graphQL query to get the user’s id from the slug
{
user(input: {selector: {slug: "eliezer_yudkowsky"}}) {
result {
_id
slug
}
}
}
For posts, need graphQL query, then dump each htmlBody into a separate file. No parsing required, I hope.
{
posts(input: {
terms: {
view: "userPosts"
userId: "nmk3nLpQE89dMRzzN"
limit: 50
meta: null # this seems to get both meta and non-meta posts
}
}) {
results {
_id
title
pageUrl
postedAt
htmlBody
voteCount
baseScore
slug
}
}
}
For comments, need a graphQL query, then dump each html body into an individual file. (Although I'm not entirely sure what one will do with thousands of comment files.)
{
comments(input: {
terms: {
view: "userComments",
userId: "KPEajTss7fsccBEgJ",
limit: 500,
}
}) {
results {
_id
post {
title
slug
}
user {
username
slug
displayName
}
userId
postId
postedAt
pageUrl
htmlBody
baseScore
voteCount
}
}
}
Plus some loops to iterate if the limit is too large. And handle errors. Plus some way to share the credentials securely—or make it into a browser plugin.
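A rough Python sketch of the glue code described above, to save the next person some typing. I haven't tested this against the live API; the endpoint URL, the offset-based paging, and the output layout are my assumptions, and the query is just the posts query from above. Comments would work the same way with the userComments query.

import zipfile
import requests

ENDPOINT = "https://www.lesswrong.com/graphql"  # assumed endpoint

POSTS_QUERY = """
{
  posts(input: {terms: {view: "userPosts", userId: "%s", limit: %d, offset: %d, meta: null}}) {
    results { _id title postedAt htmlBody slug }
  }
}
"""

def fetch_posts(user_id, batch=50):
    # Page through a user's posts; assumes the terms accept an offset field.
    offset = 0
    while True:
        query = POSTS_QUERY % (user_id, batch, offset)
        resp = requests.post(ENDPOINT, json={"query": query})
        resp.raise_for_status()
        results = resp.json()["data"]["posts"]["results"]
        if not results:
            break
        yield from results
        offset += batch

def export(user_id, out_zip="lw_export.zip"):
    # Dump each htmlBody into its own file inside a zip the user can download.
    with zipfile.ZipFile(out_zip, "w") as zf:
        for post in fetch_posts(user_id):
            name = post.get("slug") or post["_id"]
            zf.writestr("posts/%s.html" % name, post.get("htmlBody") or "")
    return out_zip

if __name__ == "__main__":
    export("nmk3nLpQE89dMRzzN")  # the example userId from the query above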
Why doesn't Scott Alexander work on alignment? Has he ever mentioned this? I feel like he could make a non-trivial contribution.
Has anyone ever published a comprehensive piece of the form:
"I assign X probability to AI alignment being a solvable problem (in theory), and here's the set of intuitions / models / etc. that I base this estimate on"
--
Cause my intuitions point in the opposite direction, but maybe that’s just because I’m missing intuitions of people here.
Logical uncertainty is hard. But the intuition that I have is that humans exist, so there’s at least a proof of concept for a sort of aligned AGI (although admittedly not a proof of concept for an ASI)
That’s weak though, I’m hoping alignment researchers have stronger intuitions than that.
I don’t think it’s that weak?
Here's a countering intuition (which also seems weak to me, but it shows why stronger intuitions are needed):
Humans have disagreements on ethics, and have done so for millennia, so they're not 100% aligned.
But if your definition of alignment is "an AI that does things in a way such that all humans agree on its ethical choices" I think you're doomed from the start, so this counterintuition proves too much. I don't think there is an action an AI could take or a recommendation it could make that would satisfy that criterion (in fact, many people would say that the AI by its nature shouldn't be taking actions or making recommendations).
Okay. I’d be keen on your definition of alignment.
P.S. This discussion which we’re having right now is exactly what I’d be keen on, in a compressed fashion, written by an alignment researcher who has anticipated lots of intuitions and counterintuitions.
It seems like something like “An AI that acts and reasons in a way that most people who are broadly considered moral consider moral” would be a pretty good outcome.
Fair.
Then one more intuition:
Assume AI is sufficiently capable it can establish a new world order all by itself. I wouldn’t trust most people I otherwise consider moral with such power.
Upvote prediction markets
Epistemic status: I haven’t spent too much time on this, I could easily be missing stuff here.
Problem
Whenever a new post is made on a website (reddit / stackexchange / lesswrong etc.), upvotes act as an initial measure of how engaging and high-quality the post is. This initial vetting by a few users allows the remaining users to access a feed of already high-quality posts.
This model, however, still requires a few users to willingly go through unvetted posts and upvote the good-quality ones. It would be useful to reduce the amount of input required to get a measure of quality.
Solution
Users could stake money on how many upvotes they think a post will receive. This money acts as an initial signal to show the post to more users. If more users see the post and upvote it, the money is returned to whoever staked it. If they don't, the money could be taken by the website (or distributed to all the users on the website, or at least the users who saw the low-quality post).
The reverse can also happen: users can stake money claiming a post will not receive upvotes.
This isn't a prediction market in the traditional sense, where people who bet higher votes bet only against those who bet lower votes. Instead you're betting against the website directly. Why? Because the website benefits from good-quality content and is hurt by bad-quality content.
This differs from the standard advertisement model, because in that model you pay whether the content is engaging for users (good for the website) or not engaging (bad for the website). This automatically means advertisers will more often use that channel to push content that does not engage users much: blatant ads.
Considerations
You don't want the users staking money to gatekeep the content of the website too much. More specifically, if there's a dynamic relationship between [money staked, impressions and upvotes], you don't want money staked alone to impact impressions so much that you no longer get a good signal from upvotes; you still need that signal. The model goes both ways: more impressions can mean more upvotes, but more upvotes or more money staked can also mean more impressions. Maybe instead of betting on the raw number of upvotes it'll make sense to bet on the upvote-to-impression ratio or something. People who study ad models probably have a better sense of how to model this mathematically.
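To make the staking rule concrete, a rough sketch; the cutoff ratio, the payout rule, and the "losing stakes go to the website" choice are all placeholder assumptions rather than a worked-out mechanism.

from dataclasses import dataclass

@dataclass
class Stake:
    user: str
    amount: float
    bet_on_success: bool  # True = "this post will do well", False = the reverse bet

def resolve_stakes(stakes, upvotes, impressions, ratio_cutoff=0.05):
    # The post "succeeds" if its upvote-to-impression ratio clears a cutoff
    # (per the consideration above, ratio rather than raw upvotes).
    # Winning stakes are simply refunded; losing stakes are forfeited to the site.
    success = impressions > 0 and (upvotes / impressions) >= ratio_cutoff
    payouts, forfeited = {}, 0.0
    for s in stakes:
        if s.bet_on_success == success:
            payouts[s.user] = s.amount   # stake returned
        else:
            payouts[s.user] = 0.0        # stake kept by the website
            forfeited += s.amount
    return payouts, forfeited

# toy usage
stakes = [Stake("alice", 5.0, True), Stake("bob", 2.0, False)]
print(resolve_stakes(stakes, upvotes=12, impressions=150))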
The problem is that this sets a lot of incentives for corruption.
What kinds of corruption?
Ad models can be gamed too, is there a reason my model is more vulnerable?
People will try to drive votes to get their predictions to come true and thus the vote count becomes a worse signal for quality.
How? I can think of:
- Bot accounts: can use CAPTCHAs and/or karma and/or KYC to detect them. (I wonder if KYC would have to be mandatory, though.)
- Bribing real users: users would need to coordinate the bribe on a different website, trust that the bribe will actually pay out, and it’s hard to scale this without someone posting evidence back to the mods of the original site.
- Driving genuine traffic to the site in the hope that people will upvote: hence the need to track a vote-to-impression ratio (or similar) rather than just votes, and to again privilege the votes of longstanding community members (as LessWrong karma does); see the sketch below.
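As a rough illustration of that last point, here’s what a karma-weighted vote-to-impression signal could look like. The karma cap and weighting are arbitrary assumptions of mine, purely to sketch the idea:

```python
def quality_signal(votes: list[tuple[int, int]], impressions: int) -> float:
    """Karma-weighted upvote-to-impression ratio.

    votes: (vote, voter_karma) pairs, where vote is +1 or -1.
    Capping karma at 1000 (an arbitrary choice) limits how much any one account
    counts; dividing by impressions means that simply driving raw traffic to the
    post dilutes the signal unless those viewers also vote.
    """
    if impressions == 0:
        return 0.0
    weighted = sum(vote * min(karma, 1000) for vote, karma in votes)
    return weighted / impressions

# Example: 3 upvotes from established accounts, 1 downvote, 500 impressions.
print(quality_signal([(1, 800), (1, 1200), (1, 50), (-1, 300)], 500))  # 3.1
```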
Do lmk if I’m missing something.
In the case of LessWrong, comments that argue against the OP or for its value can affect voting.
Damn, that’s valid. So for now my system can only work if there’s no commenting. I’ll try thinking of a way to get around this.
Is there an LW post on:
“Why it is a good idea to google / discuss concrete examples instead of thinking / talking purely in the abstract” ?
Like, I was reading up on the number of ICBMs in different countries and military policies and such, but it helps to pull up a YouTube video of an actual missile test launch, just to get a sense of how big the thing is and what a launch looks like in practice.
Does anyone have good resources on linguistic analysis used to doxx people online?
Both automated and manual, although I’m more keen on learning about automated.
What is the state-of-the-art in such capabilities today? Are there forecasts on future capabilities?
Trying to figure out whether defending online anonymity is worth doing or a lost cause.
found this: https://oar.princeton.edu/bitstream/88435/pr13z6v/1/FeasibilityInternetScaleAuthorID.pdf
MESSY INTUITIONS ABOUT AGI, MIGHT TYPE THEM OUT PROPERLY LATER
OR NOT
I’m sure we’re a finite number of voluntary neurosurgeries away from worshipping paperclip maximisers. I tend to feel we’re a hodge-podge of quick heuristic modules and deep strategic modules, and until you delete the heuristic modules via neurosurgery our notion of alignment will always be confused. Our notion of superintelligence / super-rationality is an agent that doesn’t use the bad heuristics we do; people have even tried formalising this with Solomonoff induction / Turing machines / AIXI. But when actually coming face to face with one:
- Either we are informed of the consequences of the agent’s thinking and we dislike those, because they don’t match our heuristics.
- Or the AGI can convince us to become more like it, to the point where we can actually agree with its values. The fastest way to get there is neurosurgery, but if we initially feel neurosurgery is too invasive, I’m sure there exists another, much more subtle path the AGI can take: namely, one where we want our values to be influenced in ways that eventually end with us getting closer to the neurosurgery table.
- Or of course the AGI doesn’t bother to even get our approval (the default case), but I’m ignoring that and considering far more favourable situations.
We don’t actually have “values” in an absolute sense, we have behaviours. Plenty of Turing machines have no notion of “values”; they just have behaviour given a certain input. “Values” are this fake variable we create when trying to model ourselves and each other. In other words, the Turing machine has a model of itself inside itself; that’s how we think about ourselves (metacognition). So there is a mini-Turing machine inside the Turing machine. Of course the mini-machine has some portions deleted; it is a model. First of all this is physically necessitated, but more importantly, you need a simple model to do high-level reasoning on it in short amounts of time. So we create this singular variable called “values” to point to what is essentially a cluster in thingspace. Let’s say the Turing machine tends to increment its 58th register on only 0.1% of all possible 24-bit inputs, and otherwise tends to decrement it. The mini-Turing machine inside the machine, modelling itself, will just have some equivalent of the 58th register never incrementing at all and only decrementing. So now the Turing machine incorrectly thinks its 58th register never increments, and therefore thinks that decrementing the 58th register is a “value” of the machine.
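Here’s a toy version of that 58th-register example in code. The specific increment rule is made up; the only thing that matters is that it fires on roughly 0.1% of inputs, which the compressed self-model then drops entirely:

```python
import random

def true_step(input_bits: int) -> int:
    """The outside view: on roughly 0.1% of 24-bit inputs the machine
    increments register 58 (+1); otherwise it decrements it (-1)."""
    return +1 if input_bits < int(0.001 * 2**24) else -1

def self_model_step(input_bits: int) -> int:
    """The mini-machine: a compressed self-model that drops the rare case,
    so it predicts the register only ever decrements."""
    return -1

random.seed(0)
samples = [random.randrange(2**24) for _ in range(100_000)]
true_increments = sum(true_step(x) == +1 for x in samples)
model_increments = sum(self_model_step(x) == +1 for x in samples)
print(true_increments, model_increments)
# ~100 vs 0: the self-model reports "I never increment"
# and so labels "decrementing register 58" as a value of the machine.
```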
[Meta note: When I say “value” here, I’d like to still stick to a viewpoint where concepts like “free will”, “choice”, “desire” and “consciousness” are taboo. Basically I have put on my reductionist hat. If you believe free will and determinism are compatible, you should be okay with this, as I’m just consciously restricting the number of tools/concepts/intuitions I wish to use for this particular discussion, not adopting any incorrect ones. You can certainly reintroduce your intuitions in a different discussion, but in a compatibilist world, both our discussions should generally lead to true statements.
Hence in this case, when the machine thinks of “decrementing its 58th register” as its own value, I’m not referring to concepts like “I am driven to decrement my 58th register” or “I desire to decrement my 58th register” but rather “Decrementing my 58th register is something I do a lot.” And since “value” is a fake variable that the Turing machine has full liberty to define, it says “‘value’ is defined by the things I tend to do.” When I say “fake” I mean it exists in the Turing machine’s model of itself, the mini-machine.
“Why do I do the things I do?” or “Can I choose what I actually do?” are not questions I’m considering, and for now let’s assume the machine doesn’t bother itself with such questions (although in practice it certainly may end up asking itself such terribly confused questions, if it is anything like human beings; this doesn’t really matter right now).
End note]
I’m gonna assume a single scale called “intelligence” along which all Turing machines can be graded. I’m not sure this scale actually even exists, but I’m gonna assume it anyway. On this scale:
Humans <<< AGI <<< AIXI-like ideal
<<< means “much less intelligent than” or “much further away from reasoning like the AIXI-like ideal”; these two are the same thing for now, by definition.
An AGI trying to model human beings won’t use such a simple fake variable called “values”, it’ll be able to build a far richer model of human behaviour. It’ll know all about the bad human heuristics that prevent humans from becoming like an AGI or AIXI-like.
Even if the AGI wants us to be aligned, it’s just going to do the stuff in the first para. There are different notions of aligned:
Notion 1: “I look at another agent superficially and feel we want the same things.” In other words, my fake variable called “my values” is sufficiently close to my fake variable called “this agent’s values”. I will necessarily be creating such simple fake variables if I’m stupid, i.e. if I’m human, because all humans are stupid relative to the AIXI-like ideal.
An AGI that optimises to satisfy notion 1 can hide what it plans to do to humans and maintain a good appearance until it kills us without telling us.
Notion 2: “I get a full picture of what the agent intends to do and want the same things”
This is very hard, because my heuristics tell me all the consequences of what the AGI plans to do are bad. The problem is my heuristics. If I didn’t have those heuristics, if I was closer to the AIXI-like ideal, I wouldn’t mind. Again, “I wouldn’t mind” is from the perspective of machines, not consciousness or desires, so translate it to “the interaction of my outputs will be favourable towards the AGI’s outputs in the real world”.
So the AGI will find ways to convince us to want our values neurosurgically altered. Eventually we will both be clones and hence perfectly aligned.
Now let’s bring stuff like “consciousness” and “desires” and “free will” back into the picture. All this stuff interacts very strongly with the heuristics, exactly the things that make us further away from the AIXI-like ideal.
Simply stated, we don’t naturally want to be ideal rational agents. We can’t want to be, in the sense that we can’t get ourselves to truly, consistently want to be ideal rational agents by sheer willpower; we need physical intervention like neurosurgery. Even if free will exists and is a useful concept, it has finite power. I can’t delete sections of my brain using free will alone.
So now, if intelligence is defined as being closer to the AIXI-like ideal in internal structure, then intelligence by definition leads to misalignment.
P.S. I should probably also add some colour on what kinds of “bad heuristics” I am referring to here. In simplest terms: “low-Kolmogorov-complexity behaviours that are very far from AIXI-like”.
1/ For starters, the entirety of System 1 and sensory processing (see Daniel Kahneman, Thinking, Fast and Slow). We aren’t designed to maximise our intelligence; we just happen to have an intelligence module (i.e. something somewhat AIXI-like). Things we care about sufficiently strongly are designed to override System 2, which is more AIXI-like, insofar as evolution has any design for us. So maybe it’s not even “bad heuristics” here; it’s entire modules in our brain that are not meant for thinking in the first place. It’s just neurochemicals firing, one side winning and the other side losing; this system looks nothing like AIXI. And it’s how we deal with most life-and-death situations.
This stuff is beyond the reach of free will: I can’t stop reacting to snakes out of sheer will, or hate chocolate, or love to eat shit. Maybe I can train myself on snakes, but I can’t train myself to love to eat shit. The closer you are to the sensory apparatus and the further away from the brain, the less the system looks like AIXI, and simultaneously the less free will you seem to have.
(P.S. Is that a coincidence, or is free will / consciousness really an emergent property of being AIXI-like? I have no clue; it’s again the messy free-will debates that might get nowhere.)
2/ Then of course, at the place where System 1 and System 2 interact, you can actually observe behaviour that moves us further away from the AIXI-like ideal. Things like why we find it difficult to have independent thoughts that go against the crowd. Even if we do have independent thoughts, we need to spend a lot more energy actually developing them further (versus thinking of hypothetical arguments to defend those ideas in society).
This stuff is barely within the reach of our “free will”. Large portions of LessWrong are an attempt at training ourselves to be more AIXI-like, by reducing so-called cognitive biases and socially induced biases.
3/ Then we have proper deep thinking (System 2 and the like), which seems a lot closer to AIXI. This is where we move beyond “bad heuristics”, i.e. “heuristics that AIXI-like agents won’t use”. But maybe an AGI will find these modules of ours horribly flawed too, who knows.
Is anyone proposing that building AGI should be banned by governments?
Because it seems like even if AGI alignment is possible (I’m skeptical), there’s no guarantee the person who happens to create AGI also happens to want to follow this perfect solution. Or, even if they want to, that they should have a moral right to decide on behalf of their country or humanity as a whole. Nation states pre-committing to “building AGI is evil” seems a better solution. It might also slow down the rate of progress in AI capabilities, which I’m guessing is also desirable to some alignment theorists.
I haven’t seen any government, let alone the set of governments, demonstrate any capability of commitment on this kind of topic. States (especially semi-representative ones like modern democracies) just don’t operate with a model that makes this effective.
I also wonder what you feel about the ban on human cloning. Is it effectively implemented?
I don’t know if it is or not. Human cloning seems both less useful and less harmful (just less impactful overall), so simultaneously easier to implement and not a good comparison to AGI.
I see cloning-based research as very impactful, it’s also a route to getting more intelligent beings to exist. I’d be hard-pressed to find something as impactful as AGI though.
Also I’m not sure about “less useful”. Given a world where AI researchers know that alignment is hard or impossible, they might see human cloning as more useful than AGI. Unless you mean AGI’s perceived usefulness is higher, which may be true today but maybe not in the future.
I’m not following the connection between human cloning and AGI. Are you talking about something different from https://en.wikipedia.org/wiki/Human_cloning , where a baby is created with only one parent’s genetic material?
To me, human cloning is just an expensive way to make normal babies.
Yep, referring to that. You can keep cloning the most intelligent people. At enough scale you’ll be increasing the collective intelligence of mankind, and scientific output. Since these clones will hopefully retain basic human values, you now have more intelligence with alignment.
Do you have any reason to believe that this is happening AT ALL? I’d think the selection of who gets cloned (especially when it’s illicit, but probably even if it were common) would follow wealth more than intelligence.
Selective embryo implantation based on genetic examination of two-parent IVF would seem more effective, and even that’s not likely to do much unless it becomes a whole lot more common, and if intelligence were valued more highly in the general population.
Huh? Why these any more than the general population? The range of values and behaviors found in humans is very wide, and “basic human values” is a pretty thin set.
Most importantly, a 25-year improvement cycle, with a mandatory 15-20 year socialization among many many humans of each new instance is just not as scary as an AGI with an improvement cycle under a year (perhaps much much faster), and with direct transmission of models/beliefs from previous generations. Just not comparable.
Wasn’t talking about today, just an arbitrary point in the future.
I was commenting that it has a lot of power and potential benefits if groups of people wield it; whether they actually do is a different question.
On the latter question, you’re right of course: different groups of people will select for different traits. I would assume there will exist at least some groups of intelligent people who will want to select further for intelligence. There is a competitive advantage for nations that legalise this.
Re: the last two paragraphs, I’m not sure we understood each other. I’ll try again. Intelligence is valuable for building stable paths to survival, happiness, and prosperity. AGI will be much more intelligent than selected humans. However, AGI will almost certainly kill us because of lack of alignment. (Assume a world where AGI researchers have accepted this as fact.) This makes AGI not very useful, on balance.
Humans selected for intelligence are also valuable. They will be a lot less intelligent than AGI, of course. But they will (hopefully) be aligned enough with the rest of humanity to work for its welfare and prosperity. This makes selected humans very useful.
Hence selected humans could be more useful than AGI.
That’s a valid intuition—I’d be happy to learn why you feel that if you have time (no worries if not).
Would a non-democratic state like China or Russia fare better in this regard, then? If one of them takes the issue seriously enough, it could force other states to also take it seriously via carrot and stick.
On innovations too big for capitalism
Consider any innovation so world-changing that governments are not willing to let the creator have complete control over how it is used. For instance, a new super-cheap energy source such as fusion reactors.
Maximising profit from fusion reactors, for instance, could mean selling electricity at a price slightly lower than the current market price, waiting to monopolise the global power grid, waiting for all other power companies to shut down, then raising prices again, thereby not letting anyone actually reap the benefits of super-cheap electricity. It is unlikely, however, that governments will let the creator do this.
As someone funding early-stage fusion research, would you have to account for this in your investment thesis? That is, that there is some upper limit on how large a company can legally grow. So far it seems the cap is higher than $2.5 trillion at least, looking at Apple’s market cap. Although it is possible that even a company smaller than $2.5 trillion monopolises a sector and is then prevented from price-fixing by the government.
I don’t think, at this scale, that “the government” is a useful model. There are MANY governments, and many non-government coalitions that will impact any large-scale system. The trick is delivering incremental value at each stage of your path, to enough of the selectorate of each group who can destroy you.
Thanks, this makes sense.
I still don’t get stuff like TDT, EDT, FDT, LDT
Article on LDT: https://arbital.com/p/logical_dt/?l=58f
I get the article on How an Algorithm Feels From the Inside, if you assume a deterministic universe and consciousness as something that emerges out of it but has no causal influence on the universe.
Now if I try drawing the causal arrows
Outside view:
Brain ← → Human body ← → Environment
In the outside view, “you” don’t exist as a coherent singular force of causation. Questions about free will and choice cease to exist.
Inside view (CDT):
Coherent singular “soul” (using CDT) ← → Brain ← → Human body ← → Environment
Notably, Yudkowsky would call this singular soul something that only exists in the inside view itself, and not in the outside view.
Now we replace this with …
Inside view (LDT):
Coherent singular “soul” (using LDT) ← → All instantiations of this cognitive algorithm across space and time ← → Brain(s) ← → Human body(s) ← → Environment
The new decision theories don’t seem to eliminate the illusion of the soul; they just now assert that the soul interacts not only with this particular instantiation of the algorithm the brain is running, but with all instantiations of this algorithm across space (and time?). Why is this more reasonable than assuming the soul only interacts with this particular instantiation? Note that the soul is a fake property here, one that only exists in the inside view. And, to our best understanding of physical laws, the universe is made of atoms* in its causal chain, not algorithms. Two soulless algorithms don’t causally interact by virtue of the fact that they’re similar algorithms; they interact by virtue of the atoms they causally impact, and the causal impacts of those atoms on each other. Why do algorithms with the fake property of a soul suddenly get causally bound through space and time?
*Well, technically it’s QM waves or strings or whatever, but that doesn’t matter to this discussion.
This doesn’t help. In a counterfactual, atoms are not where they are in actuality. Worse, they are not even where the physical laws say they must be in the counterfactual, the intervention makes the future contradict the past before the intervention.
Do I assume “counterfactual” is just the English word as used here?
If so, it should only exist in the inside view, right? (If I understand you.)
The sentence I wrote on soulless algorithms is about the outside view. Say two robots are playing football. The outside view is: one kicks the football, the other sees the football (light emitted by the football), then kicks it. So the only causal interaction between the two robots is via atoms. This is independent of what decision theory either robot is using (if any), and it is independent of whether the robots are capable of creating an internal mental model of themselves or the other robot. So it applies both to robots with dumb microcontrollers like those in a refrigerator and to smart robots that could even be AGIs or have some ideal decision theory. At least assuming the universe follows the deterministic physical laws we know about.
The point is that the weirdness with counterfactuals breaking physical laws is the same for controlling the world through one agent (as in orthodox CDT) and for doing the same through multiple copies of an agent in concert (as in FDT). Similarly, in actuality neither one-agent intervention nor coordinated many-agent intervention breaks physical laws. So this doesn’t seem relevant for comparing the two, that’s what I meant by “doesn’t help”.
By “outside view” you seem to be referring to actuality. I don’t know what you mean by “inside view”. Counterfactuals are not actuality as normally presented, though to the extent they can be constructed out of data that also defines actuality, they can aspire to be found in some nonstandard semantics of actuality.
Do you mean the counterfactual may require more time to compute than the situation playing out in real time? If so, yep, that makes a ton of sense; they should probably focus on algorithms or decision theories that can (at least in theory) be implemented in real life on physical hardware. But please confirm.
Could you please define “actuality” just so I know we’re on the same page? I’m happy to read any material if it’ll help.
Inside view and outside view I’m just borrowing from Yudkowsky’s How an Algorithm Feels From the Inside. It basically assumes a deterministic universe following elegant physical laws, and tries to dissolve questions of free will / choice / consciousness. So the outside view is just a state of the universe or a state of the Turing machine. This object doesn’t get to “choose” what computation it is going to do or what decision theory it is going to execute; that is already determined by its current state. So the future states of the object are calculable*.
*by an oracle that can observe the universe without interacting, with sufficient but finite time.
Only in the inside view does a question like “Which decision theory should I pick?” even make sense. In the inside view, free will and choice are difficult to reason about (as humans have observed over centuries); if you really want to reason about those, you can go to the outside view, where they cease to exist.
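A toy way to picture the outside view (my own framing, with a made-up transition rule):

```python
def step(state: tuple[int, int]) -> tuple[int, int]:
    """One tick of a deterministic 'universe': the next state is a pure
    function of the current state. Nothing in here 'chooses' anything."""
    x, v = state           # toy state: a position and a velocity
    return (x + v, v)      # arbitrary transition rule, just for illustration

def future(state: tuple[int, int], n: int) -> tuple[int, int]:
    """What the observe-only oracle computes: apply the transition n times."""
    for _ in range(n):
        state = step(state)
    return state

print(future((0, 1), 5))  # (5, 1): fully determined by the initial state
```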
Am completely in love with Yudkowsky’s posts in the Complexity of Value sequence.
https://www.lesswrong.com/tag/complexity-of-value
Would recommend the four major posts to everyone.
Are there people who have read these four posts and still self-identify as either consequentialists or utilitarians? If yes, why so?
Because the impression I got from these posts (which also matches my independent explorations) is that humans have deontological rules wired into them due to evolution—this is just observable fact. And that you don’t really get to change that even if you want to.
Consider the trivial example: would you kill one person to save two? No one gets to know your decision, so there are no future implications for anyone else, no precedent being set. Even on this site there is a ton of deflection from honestly answering the question as is. Either “Yes, I will murder one person in cold blood” or “No, I will let the two people die”. Assume the LCPW.
I believe that in the LCPW it would be the right decision to kill one person to save two, and I also predict that I wouldn’t do it anyway, mainly because I couldn’t bring myself to do it.
In general, I understood the Complexity of Value sequence to be saying “The right way to look at ethics is consequentialism, but utilitarianism specifically is too narrow, and we want to find a more complex utility function that matches our values better.”
Thanks for replying.
Why do you feel it would be the right decision to kill one? Who defines “right”?
I personally understood it differently. Thou Art Godshatter says (to me) that evolution depends on consequences, but evolving consequentialism into a brain is hard, and therefore human desires are not wired consequentially. Also that evolution only cares about consequences that actually happen, not the ones it predicts will happen, because it cannot predict.
I define “right” to be what I want, or, more exactly, what I would want if I knew more, thought faster and was more the person I wish I could be. This is of course mediated by considerations on ethical injunctions, when I know that the computations my brain carries out are not the ones I would consciously endorse, and refrain from acting since I’m running on corrupted hardware. (You asked about the LCPW, so I didn’t take these into account and assumed that I could know that I was being rational enough).
It’s been a while since I read Thou Art Godshatter and the related posts, so maybe I’m conflating the message in there with things I took from other LW sources.
The sequence on ethical injunctions looks cool. I’ll read it first before properly replying.
Just FYI, I’ve become convinced that most online communication through comments with a lot of context are much better settled through conversations, so if you want, we could also talk about this over audio call.
Thanks, I will let you know!
I read some of the articles. Happy to get on a voice call if you prefer. My thoughts so far boil down to:
- Corrupted hardware seems to imply a clear distinction between goals (ends / terminal goals) and actions towards goals (means / instrumental goals), and that only actions are computed imperfectly. I say, firstly, that we don’t have as sharp a distinction between the two in our brain’s wiring. (Instrumental goals often become terminal if you focus on them hard enough.) Secondly, that it’s not actions but terminal goals themselves that are in conflict.
- We have multiple conflicting values. There’s no “rational” way to always decide what trumps what—sometimes it’s just two sections of the brain firing neurochemicals and one side winning, that’s it. System-2 is somewhat rational, System-1 not so much, and System-1 has more powerful rewards and penalties. System-1 preferences admit circular preferences, and there’s nothing you can do about it.
- “What I would want if I knew more, thought faster etc.” doesn’t necessarily lead to one coherent place. You have multiple conflicting values, and which of those you end up deleting if you had the brain of a supercomputer could be arbitrary. You could become Murder Gandhi or some extreme happiness utilitarian; I don’t see either of these as necessarily desirable places to be relative to my current state. Basically, I want to run on corrupted hardware. I don’t want my irrational System-1 module deleted.
Sorry if I’m going off-topic but yeah.
Open-ended versus close-ended questions
I’ve found people generally find it harder to answer open-ended questions. Not just in terms of giving a good answer but giving any answer at all. It’s almost as if they lack the cognitive module needed for such search.
Has anyone else noticed this? Is there any research on it? Any post on LessWrong or elsewhere?
Post-note: Now that I’ve finished writing, a lot of this post feels kinda “stupid”—or more accurately, not written using a reasoning process I personally find appealing. Nevertheless I’m going to post it just in case someone finds it valuable.
-----
I don’t see a lot of shortform posts here so I’m unsure of the format. But I’m in general thinking a lot about how you cannot entirely use reasoning to reason about the usefulness of various reasoning processes relative to each other. In other words, rationality is not closed.
Consider a theory in first-order logic. It has a specific set of axioms and a set of deductive rules. For instance, consider a first-order theory with the axioms “Socrates is mortal” and “Socrates is immortal”. A first-order theory which is inconsistent is obviously a bad reasoning process. A first-order theory which is consistent but whose axioms don’t map to real-world assumptions is also a bad reasoning process, for different reasons. Lastly, someone can argue that first-order logic is a bad reasoning process no matter what axioms it is instantiated with, because axioms plus rigid deduction is a bad way of reasoning about the world, and humans are not wired to do pure FOL and are instead capable of reaching meaningful conclusions without resorting to FOL.
All three of these ways of calling a particular FOL theory bad are different, but none of them are expressible in FOL itself. You can’t prove why inconsistent FOL theories are “bad” inside the very same FOL theory (although of course you can prove it inside a different theory, perhaps one that has “Inconsistent FOL theories are bad” as an axiom). You can’t prove that the axioms don’t map to real-world conditions (let alone prove why axioms not mapping to real-world conditions makes the axioms “bad”). You can’t prove that the deductive rules don’t map to the real-world reasoning capacities of the human mind. If an agent was rigidly coded with this FOL theory, you’d never get anywhere with them on these topics; there’d be a communication failure between the two agents, you and them.
All three of these arguments, however, can be framed as appeals to reality and what is observable. A statement and its negation being simultaneously provable is bad because such phenomena are not typically observed in practice, and because there are no statements provable from such a system that are useful for achieving objectives in the real world. Axioms not mapping to the physical world is obviously an appeal to the observable. FOL being a bad framework for human reasoning is also an appeal to observation, in this case an observation you’ve made after observing yourself, and one you’re hoping the other person has also made.
It seems intuitive that someone who uses “what maps to the observable is correct” will not admit any other axioms, if they wish to be consistent, because most such axioms will conflict with this appeal to the observable. But in a world with multiple agents, we can’t take this as an axiom, lest we get stuck in our own communication bubble. We need to be able to reason about the superiority of “what maps to the observable is correct” as a reasoning process, using some other reasoning process. And in fact, I have seemingly been using this post so far to do exactly that: use some reasoning process to argue in favour of “what maps to the observable” reasoning processes over FOL theories instantiated with simple axioms such as “Socrates is mortal”.
And if you look further, my argument in this post for why “what maps to the observable is good” doesn’t seem very logical. I am still seemingly appealing to “what maps to the observable is good” in order to prove “what maps to the observable is good”, which is obviously a no-go when using FOL. But to your human mind, the first half of this post still sounded like it was saying something useful, despite not being written in FOL, and despite not having a clear separation between axioms and deduced statements. You could at this point appeal to Wittgensteinian “word clouds” or “language games” and say that some sequences of words referring to each other are perceived to be more meaningful than other sequences of words, and that I have hit upon one of the more meaningful sequences of words.
But how will you justify Wittgensteinian discourse as a meta-reasoning process for understanding reasoning processes? More specifically, what reasoning process is Wittgensteinian discourse using to prove that “Wittgensteinian reasoning processes are good”? I could at this point self-close it and say that Wittgensteinian reasoning processes are being used to reason towards the conclusion that Wittgensteinian reasoning processes are good. But do you see the problem here?
Firstly, this kind of self-closure can easily be done by most systems. An FOL theory can assert that axiom A is being used to prove axiom A, because A=A. A system based on “what is observable is correct” can appeal to observation to argue the superiority of “what is observable is correct”.
And secondly, this self-closure only tends to look meaningful inside the system itself. An FOL prover will say that the empiricist and the Wittgensteinian have not done anything meaningful when trying to analyse themselves (the empiricist and the Wittgensteinian respectively); they have just applied A=A. The empiricist will say that the FOL prover and the Wittgensteinian have not done anything meaningful to analyse themselves (the FOL prover and the Wittgensteinian); they have just observed their own thoughts and realised what they think is true. And similarly, the Wittgensteinian will assert that everyone else is using Wittgensteinian reasoning to (wrongly) argue the superiority of their non-Wittgensteinian process.
So if someone else uses their reasoning process to prove that their own reasoning process is superior, you’ll not only easily disagree with the conclusion; you might also disagree about whether they actually even used their own reasoning process to do it.
If the theory is inconsistent, you can prove anything in it, can’t you? So you should also be able to prove that inconsistent theories are “bad”.
If you define bad = inconsistent as an axiom, then yes, trivial proof. If you don’t define bad, you can’t prove anything about it. You can’t capture the intuitive notion of bad using FOL.
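For what it’s worth, the “prove anything in it” part is the principle of explosion. A minimal sketch in Lean, where I’m treating “Socrates is immortal” as the negation of “Socrates is mortal” (an assumption of mine; the two could also be modelled as separate predicates):

```lean
-- From a pair of contradictory axioms, any proposition Q follows.
-- `Mortal` stands in for "Socrates is mortal"; `¬Mortal` for "Socrates is immortal".
example (Mortal Q : Prop) (h : Mortal) (hn : ¬Mortal) : Q :=
  absurd h hn
```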
If
A: it is possible to define “intelligence” such that all Turing machines can be graded on a scale in terms of intelligence irrespective of their “values”, and
B: some Turing machines have no values,
is it possible that the theoretical max intelligent Turing machine has no values?
(And can replace Turing machines with configurations of atoms or quarks or whatever)
I think that the link from micro to macro is too weak for this to be a useful line of inquiry. “intelligence” applies on a level of abstraction that is difficult (perhaps impossible for human-level understanding) to predict/define in terms of neural configuration, let alone Turing-machine or quantum descriptions.
Okay, but my question is more like: could “the maximally intelligent neural configuration has no values” be true in a version of reality that makes sense to you? I’m not actively trying to assert that it is true. Basically I’m trying to deconfuse concepts and definitions.
I’m not sure what you’re asking. A lot of reality doesn’t make sense to me, so that’s pretty weak evidence either way. And it does seem believable that, since there is a very wide range of consistency and dimensionality to human values that don’t seem well-correlated to intelligence, the same could be true of AIs.
Fair, but abstractions like “aligned”, “values” and “intelligence” are created by humans, so it can make sense to formalise them before asking a question like “align an intelligent agent”, else the question becomes poorly defined.
True, but I’m asking not just about AI that doesn’t have human values, but any values at all.
I think this could reasonably be true for some definitions of “intelligence”, but that’s mostly because I have no idea how intelligence would be formalized anyway.
Got it. I think formalising definitions of “intelligence” and “values” is worth doing. Even if the original definitions don’t map perfectly to your intuitive understanding of the concepts, at least you’ll be asking a well-formed question when you ask “align an intelligent agent”.
I think asking well-formed questions is useful, but we shouldn’t confuse our well-formed question with what we actually care about unless we are sure it is in fact what we care about.
Yup agreed.