Alex Turner, Oregon State University PhD student working on AI alignment.
Great observation. Similarly, a hypothesis called “Maximum Causal Entropy” once claimed that physical systems involving intelligent actors tended towards states where the future could be specialized towards many different final states, and that maybe this was even part of what intelligence was. However, people objected: (monogamous) individuals don’t perpetually maximize their potential partners—they actually pick a partner, eventually.
My position on the issue is: most agents steer towards states which afford them greater power, and they sometimes give up that power to achieve their specialized goals. The point, however, is that they end up in the high-power states at some point in time along their optimal trajectory. I imagine that this is sufficient for the catastrophic power-stealing incentives: the AI only has to disempower us once for things to go irreversibly wrong.
it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not responded to the critique and should not be expected to be convincing to that person.
I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of “but how do we know IC even exists?” with “well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don’t (formally) know how hard it is to avoid if you try”.
I think I agree with most of what you’re arguing.
Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to provide goals without that property”. Can we provide reward functions without that property?
Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results.
I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.
ETA: also, I was referring to the point you made when I said
“the results don’t prove how hard it is to tweak the reward function distribution to avoid instrumental convergence”
I’m very glad you enjoyed it!
I’ve never read the “Towards a new Impact Measure” post, but I assume doing so is redundant now since this sequence is the ‘updated’ version.
I’d say so, yes.
I realize that impact measures always lead to a tradeoff between safety and performance competitiveness.
For optimal policies, yes. In practice, not always—in SafeLife, AUP often had ~50% improved performance on the original task, compared to just naive reward maximization with the same algorithm!
it seems to penalize reasonable long-term thinking more than the formulas where R_aux ≠ R.
Yeah. I’m also pretty sympathetic to arguments by Rohin and others that the R_aux = R variant isn’t quite right in general; maybe there’s a better way to formalize “do the thing without gaining power to do it” wrt the agent’s own goal.
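For concreteness, here is a minimal sketch of the flavor of penalty being discussed—my own toy formalization, not the sequence’s exact definition, and every name and default value is hypothetical. The R_aux = R variant corresponds to using the agent’s own goal as the sole auxiliary reward function.

```python
# Rough sketch of an AUP-style penalized reward (hypothetical toy formalization,
# not the sequence's exact equations).
def aup_reward(primary_reward, q_aux, state, action, noop, lam=0.01):
    """q_aux: Q-value tables, one per auxiliary reward function.
    Passing only the Q-values for the primary goal itself gives the R_aux = R variant."""
    # Penalize shifts in attainable utility relative to doing nothing.
    penalty = sum(abs(q[state][action] - q[state][noop]) for q in q_aux) / len(q_aux)
    return primary_reward - lam * penalty  # lam trades off task performance vs. impact
```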
whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.
I think this is plausible, yep. This is why I think it’s somewhat more likely than not there’s no clean way to solve this; however, I haven’t even thought very hard about how to solve the problem yet.
More generally, if you don’t consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?
Depends on how that shows up in the non-embedded formalization, if at all. If it doesn’t show up, then the optimal policy won’t be able to predict any benefit and won’t do it. If it does… I don’t know. It might. I’d need to think about it more, because I feel confused about how exactly that would work—what its model of itself is, exactly, and so on.
Maybe. What I was arguing was: just because all of the partial derivatives are 0 at a point, doesn’t mean it isn’t a saddle point. You have to check all of the directional derivatives; in two dimensions, there are uncountably infinitely many.
Thus, I can prove to you that we are extremely unlikely to ever encounter a valley in real life:
A valley must have a lowest point A.
For A to be a local minimum, all of its directional derivatives must be 0:
Direction N (north), AND
Direction NE (north-east), AND
Direction NNE, AND
Direction NNNE, AND …
This doesn’t work because the directional derivatives aren’t probabilistically independent in real life; you have to condition on the underlying geological processes, instead of supposing you’re randomly drawing a topographic function from ℝ² to ℝ.
For the corrigibility argument to go through, I claim we need to consider more information about corrigibility in particular.
If S1 measures the corrigibility of S2 and does gradient ascent on corrigibility, then the system as a whole has a broad basin of attraction for corrigibility, for sure. But we can’t measure corrigibility as far as I know, so the corrigibility-basin-of-attraction is not a maximum or minimum of anything relevant here. So this isn’t about calculus, as far as I understand.
I’m not saying anything about an explicit representation of corrigibility. I’m saying the space of likely updates for an intent-corrigible system might form a “basin” with respect to our intuitive notion of corrigibility.
I’m also not convinced that the space of changes is low-dimensional. Imagine every possible insight an AGI could have in its operating lifetime. Each of these is a different algorithm change, right?
I said relatively low-dimensional! I agree this is high-dimensional; it is still low-dimensional relative to all the false insights and thoughts the AI could have. This doesn’t necessarily mitigate your argument, but it seemed like an important refinement—we aren’t considering corrigibility along all dimensions—just those along which updates are likely to take place.
“value drift” feels unusually natural from my perspective
I agree value drift might happen, but I’m somewhat comforted if the intent-corrigible AI is superintelligent and trying to prevent value drift as best it can, as an instrumental subgoal.
With each AND, the claim gets stronger and more unlikely, such that by the millionth proposition, it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all! (Unless this intuitive argument is misleading, of course.)
I think the argument might be misleading in that local stability isn’t that rare in practice, because we aren’t drawing local stability independently across all possible directional derivatives around the proposed local minimum.
Gradient updates or self-modification will probably fall into a few (relatively) low-dimensional subspaces (because most possible updates are bad, which is part of why learning is hard). A basin of corrigibility then just means that, for already-intent-corrigible agents, the space of likely gradient updates has local stability wrt corrigibility.
Separately, I think the informal reasoning goes: you_current probably wouldn’t take a pill that makes you_future slightly more willing to murder people. You_current will be particularly wary if you_future will be presented with even more pill-ingestion opportunities (a.k.a. algorithm modifications); you_future will be even more willing to take more pills, as you_future will be more okay with the prospect of wanting to murder people. So, even offered a large immediate benefit, you_current should not take the pill.
I think this argument is sound, for a wide range of goal-directed agents which can properly reason about their embedded agency. So, for your intuitive argument to survive this reductio ad absurdum, what is the disanalogy with corrigibility in this situation?
Perhaps the AI might not reason properly about embedded agency and accidentally jump out of the basin. Or, perhaps the basin is small and the AI won’t land in it—corrigibility won’t be so important that it doesn’t get traded away for other benefits.
I really like this post. Before, I just knew that sometimes I “didn’t feel like studying”, and that was that. Silly, but that’s the nature of a thoughtless mistake. Now, I have a specific concept and taxonomy for these failure modes, and you suggested good ways of combating them. Thanks for writing this!
I mean, we already know about epilepsy. I would be surprised if there did not exist some way to disable a given person’s brain, just by having them look at you.
If you measure death-badness from behind the veil of ignorance, you’d naively prioritize well-liked, famous people with large families.
Usually I strong-upvote when I feel like a post made something click for me, or that it’s very important and deserves more eyeballs. I weak-upvote well-written posts which taught me something new in a non-boring way.
As an author, my model of this is also impoverished. I’m frequently surprised by posts getting more or less attention than I expected.
we already see that; we’re constantly amazed by it, despite the created texts having little meaning
But GPT-3 is only trained to minimize prediction loss, not to maximize response. GPT-N may be able to crowd-please if it’s trained on approval, but I don’t think that’s what’s currently happening.
Would you mind adding linebreaks to the transcript?
Sorry, forgot to reply. I think these are good questions, and I continue to have intuitions that there’s something here, but I want to talk about these points more fully in a later post. Or, think about it more and then explain why I agree with you.
Can you explain why GPT-x would be well-suited to that modality?
This might be the best figure I’ve ever seen in a textbook. Talk about making a point!
I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way—there might be a further fact that says why OT and IC don’t apply to AGI like they theoretically should, but the burden is on you to prove it, rather than saying that we need evidence that OT and IC will apply to AGI.
I agree with that burden of proof. However, we do have evidence that IC will apply, if you think we might get AGI through RL.
I think that hypothesized AI catastrophe is usually due to power-seeking behavior and instrumental drives. I proved that optimal policies are generally power-seeking in MDPs. This is a measure-based argument, and it is formally correct under broad classes of situations, like “optimal farsighted agents tend to preserve their access to terminal states” (Optimal Farsighted Agents Tend to Seek Power, §6.2 Theorem 19) and “optimal agents generally choose paths through the future that afford strictly more options” (Generalizing the Power-Seeking Theorems, Theorem 2).
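For reference, here is a rough paraphrase of the notion of power those results use (my own rendering, up to normalization constants; see the papers for the exact definition): a state’s POWER is the average optimal value attainable from it under the reward-function distribution D,

POWER_D(s) ∝ E_{R ∼ D}[ V*_R(s) ],

so “seeking power” means steering toward states from which many different reward functions can be optimized well.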
The theorems aren’t conclusive evidence:
maybe we don’t get AGI through RL
learned policies are not going to be optimal
the results don’t prove how hard it is to tweak the reward function distribution to avoid instrumental convergence (perhaps a simple approval penalty suffices! IMO: doubtful, but technically possible)
perhaps the agents inherit different mesa objectives during training
The optimality theorems + mesa optimization suggest that not only might alignment be hard because of Complexity of Value, it might also be hard for agents with very simple goals! Most final goals involve instrumental goals; agents trained through ML may stumble upon mesa optimizers, which are generalizing over these instrumental goals; the mesa optimizers are unaligned and seek power, even though the outer alignment objective was dirt-easy to specify.
But the theorems are evidence that RL leads to catastrophe at optimum, at least. We’re not just talking about “the space of all possible minds and desires” anymore.
In the linked slides, the following point is made in slide 43:
We know there are many possible AI systems (including “powerful” ones) that are not inclined toward omnicide
Any possible (at least deterministic) policy is uniquely optimal with regard to some utility function. And many possible policies do not involve omnicide.
On its own, this point is weak; reading part of his 80K talk, I do not think it is a key part of his argument. Nonetheless, here’s why I think it’s weak:
“All states have self-loops, left hidden to reduce clutter.

In AI: A Modern Approach (3e), the agent starts at 1 and receives reward for reaching 3. The optimal policy for this reward function avoids 2, and one might suspect that avoiding 2 is instrumentally convergent. However, a skeptic might provide a reward function for which navigating to 2 is optimal, and then argue that “instrumental convergence” is subjective and that there is no reasonable basis for concluding that 2 is generally avoided.

We can do better… for any way of independently and identically distributing reward over states, 10/11 of reward functions have farsighted optimal policies which avoid 2. If we complicate the MDP with additional terminal states, this number further approaches 1.

If we suppose that the agent will be forced into 2 unless it takes preventative action, then preventative policies are optimal for 10/11 of farsighted agents – no matter how complex the preventative action. Taking 2 to represent shutdown, we see that avoiding shutdown is instrumentally convergent in any MDP representing a real-world task and containing a shutdown state. We argue that this is a special case of a more general phenomenon: optimal farsighted agents tend to seek power.”

~ Optimal Farsighted Agents Tend to Seek Power
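To illustrate the measure-based style of claim being quoted, here is a small Monte Carlo sketch in Python. The MDP below is a hypothetical toy of my own (so it will not reproduce the paper’s 10/11 figure, and every name in it is made up): sample reward functions IID over states and check how often the farsighted optimal policy avoids a designated “shutdown” state.

```python
import numpy as np

# Hypothetical toy MDP (not the paper's example): state 0 is the start,
# states 1-4 are absorbing, and state 1 plays the role of "shutdown".
N_STATES = 5
SUCCESSORS = {0: [1, 2, 3, 4], 1: [1], 2: [2], 3: [3], 4: [4]}
GAMMA = 0.99  # near-farsighted discount

def optimal_first_move(reward):
    """Value-iterate, then return the start state's optimal successor."""
    v = np.zeros(N_STATES)
    for _ in range(500):
        v = np.array([max(reward[s2] + GAMMA * v[s2] for s2 in SUCCESSORS[s])
                      for s in range(N_STATES)])
    return max(SUCCESSORS[0], key=lambda s2: reward[s2] + GAMMA * v[s2])

rng = np.random.default_rng(0)
trials = 2_000
avoided = sum(optimal_first_move(rng.uniform(size=N_STATES)) != 1 for _ in range(trials))
print(f"Fraction of sampled reward functions avoiding shutdown: {avoided / trials:.3f}")
# Expect roughly 0.75 here: the best of four IID-uniform rewards is equally likely
# to be any one of the four absorbing states.
```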