I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
I just looked up “many minds” and it’s a little bit like what I wrote here, but described differently in ways that I think I don’t like. (It’s possible that Wikipedia is not doing it justice, or that I’m misunderstanding it.) I think minds are what brains do, and I think brains are macroscopic systems that follow the laws of quantum mechanics just like everything else in the universe.
What property distinguished a universe where “Harry found himself in a tails branch” and a universe where “Harry found himself in a heads branch”?
Those both happen in the same universe. Those Harrys both exist. Maybe you should put aside many-worlds and just think about Parfit’s teletransportation paradox. I think you’re assuming that “thread of subjective experience” is a coherent concept that satisfies all the intuitive properties that we feel like it should have, and I think that the teletransportation paradox is a good illustration that it’s not coherent at all, or at the very least, that we should be extraordinarily cautious when making claims about the properties of this alleged thing you call a “thread of subjective experience” or “thread of consciousness”. (See also other Parfit thought experiments along the same lines.)
I don’t like the idea where we talk about what will happen to Harry, as if that has to have a unique answer. Instead I’d rather talk about Harry-moments, where there’s a Harry at a particular time doing particular things and full of memories of what happened in the past. Then there are future Harry-moments. We can go backwards in time from a Harry-moment to a unique (at any given time) past Harry-moment corresponding to it—after all, we can inspect the memories in future-Harry-moment’s head about what past-Harry was doing at that time (assuming there were no weird brain surgeries etc). But we can’t uniquely go in the forward direction: Who’s to say that multiple future-Harry-moments can’t hold true memories of the very same past-Harry-moment?
Here I am, right now, a Steve-moment. I have a lot of direct and indirect evidence of quantum interactions that have happened in the past or are happening right now, as imprinted on my memories, surroundings, and so on. And if you a priori picked some possible property of those interactions that (according to the Born rule) has 1-in-a-googol probability to occur in general, then I would be delighted to bet my life’s savings that this property is not true of my current observations and memories. Obviously that doesn’t mean that it’s literally impossible.
I wrote “flipping an unbiased coin” so that’s 50/50.
there’s some preferred future “I” out of many who is defined not only by observations he receives, but also by being a preferred continuation of subjective experience defined by an unknown mechanism
I disagree with this part—if Harry does the quantum equivalent of flipping an unbiased coin, then there’s a branch of the universe’s wavefunction in which Harry sees heads and says “gee, isn’t it interesting that I see heads and not tails, I wonder how that works, hmm why did my thread of subjective experience carry me into the heads branch?”, and there’s also a branch of the universe’s wavefunction in which Harry sees tails and says “gee, isn’t it interesting that I see tails and not heads, I wonder how that works, hmm why did my thread of subjective experience carry me into the tails branch?”. I don’t think either of these Harrys is “preferred”.
I don’t think there’s any extra “complexity penalty” associated with the previous paragraph: the previous paragraph is (I claim) just a straightforward description of what would happen if the universe and everything in it (including Harry) always follows the Schrödinger equation—see Quantum Mechanics In Your Face for details.
I think we deeply disagree about the nature of consciousness, but that’s a whole can of worms that I really don’t want to get into in this comment thread.
doesn’t strike me as “feeling more natural”
Maybe you’re just going for rhetorical flourish, but my specific suggestion with the words “feels more natural” in the context of my comment was: the axiom “I will find myself in a branch of amplitude approaching 0 with probability approaching 0” “feels more natural” than the axiom “I will find myself in a branch of amplitude $c$ with probability $|c|^2$”. That particular sentence was not a comparison of many-worlds with non-many-worlds, but rather a comparison of two ways to formulate many-worlds. So I think your position is that you find neither of those to “feel natural”.
Quantum Mechanics In Your Face talk by Sidney Coleman, starting slide 17 near the end. The basic idea is to try to operationalize how someone might test the Born rule—they take a bunch of quantum measurements, one after another, and they subject their data to a bunch of randomness tests and so on, and then they eventually declare “Born rule seems true” or “Born rule seems false” after analyzing the data. And you can show that the branches in which this person declares “Born rule seems false” have collective amplitude approaching zero, in the limit as their test procedure gets better and better (i.e. as they take more and more measurements).
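To make that limit concrete, here’s a minimal numerical sketch (my own toy model, not code from the talk): treat each measurement as a quantum coin whose Born-rule heads-weight is $p$, so the $\binom{N}{k}$ branches with exactly $k$ heads carry collective squared amplitude $\binom{N}{k} p^k (1-p)^{N-k}$, and suppose the tester declares “Born rule seems false” whenever the observed frequency differs from $p$ by more than a tolerance $\epsilon$:

```python
# Toy sketch of the frequency-test argument (my illustration, assuming a
# simple "repeat N quantum coin-flips, then check the frequency" protocol).
from math import exp, lgamma, log

def log_branch_weight(N: int, k: int, p: float) -> float:
    """log of C(N,k) * p^k * (1-p)^(N-k): the collective squared amplitude
    of all branches with exactly k 'heads' outcomes out of N."""
    return (lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
            + k * log(p) + (N - k) * log(1 - p))

def weight_of_failing_branches(N: int, p: float = 0.5, eps: float = 0.05) -> float:
    """Total Born weight of branches where |observed frequency - p| > eps,
    i.e. where the experimenter declares 'Born rule seems false'."""
    return sum(exp(log_branch_weight(N, k, p))
               for k in range(N + 1)
               if abs(k / N - p) > eps)

for N in (100, 1000, 10000):
    print(N, weight_of_failing_branches(N))
# Roughly 0.27, then 0.0014, then ~1e-23: as the test procedure improves
# (more and more measurements), the "Born rule seems false" branches
# collectively shrink toward zero amplitude.
```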
(Warning that I may well be misunderstanding this post.)
For any well-controlled isolated system, if it starts in a state $|\psi(0)\rangle$, then at a later time $t$ it will be in the state $|\psi(t)\rangle = U|\psi(0)\rangle$, where $U$ is a certain deterministic unitary operator. So far this is indisputable—you can do quantum state tomography, you can measure the interference effects, etc. Right?
OK, so then you say: “Well, a very big well-controlled isolated system could be a box with my friend Harry and his cat in it, and if the same principle holds, then there will be deterministic unitary evolution from $|\psi(0)\rangle$ into $U|\psi(0)\rangle$, and hey, I just did the math and it turns out that $U|\psi(0)\rangle$ will have a 50/50 mix of ‘Harry sees his cat alive’ and ‘Harry sees his cat dead and is sad’.” This is beyond what’s possible to directly experimentally verify, but I think it should be a very strong presumption by extrapolating from the first paragraph. (As you say, “quantum computers prove larger and larger superpositions to be stable”.)
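(If a cartoon version helps: here’s a toy sketch with a single qubit standing in for the entire Harry-plus-cat box. This is my illustration, obviously nothing like the real astronomically-large state; the point is just that a perfectly deterministic unitary $U$ can carry a definite initial state into an equal superposition of the two outcomes:)

```python
# Toy stand-in: |0> plays "Harry sees cat alive", |1> plays "Harry sees
# cat dead and is sad", and U is a Hadamard gate (deterministic, unitary).
import numpy as np

psi0 = np.array([1.0, 0.0])                    # initial state |psi(0)> = |0>
U = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2)       # Hadamard gate

psi_t = U @ psi0                               # |psi(t)> = U |psi(0)>
print(np.abs(psi_t) ** 2)                      # [0.5 0.5]: the 50/50 mix
print(np.allclose(U.conj().T @ U, np.eye(2)))  # True: U is indeed unitary
```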
OK, and then we take one more step by saying “Hey what if I’m in the well-controlled isolated system?” (e.g. the “system” in question is the whole universe). From my perspective, it’s implausible and unjustified to do anything besides say that the same principle holds as above: if the universe (including me) starts in a state $|\psi(0)\rangle$, then at a later time it will be in the state $U|\psi(0)\rangle$, where $U$ is a deterministic unitary operator.
…And then there’s an indexicality issue, and you need another axiom to resolve it. For example: “as quantum amplitude of a piece of the wavefunction goes to zero, the probability that I will ‘find myself’ in that piece also goes to zero” is one such axiom, and equivalent (it turns out) to the Born rule. It’s another axiom for sure; I just like that particular formulation because it “feels more natural” or something.
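To spell out the comparison in symbols (my notation, not from the original discussion): write the universal state as a sum over branches, $|\Psi\rangle = \sum_i c_i |\mathrm{branch}_i\rangle$ with $\sum_i |c_i|^2 = 1$. Then the two candidate axioms are:

\[ \text{(Born rule)} \quad \Pr\big[\text{I find myself in branch } i\big] = |c_i|^2 \]

\[ \text{(limiting version)} \quad |c_i| \to 0 \;\Longrightarrow\; \Pr\big[\text{I find myself in branch } i\big] \to 0 \]

The equivalence claim (as I understand the frequency-operator argument in the Coleman talk mentioned elsewhere in this thread) is that the limiting version, applied to the collective amplitude of many repeated measurements, forces observed frequencies to match $|c_i|^2$, recovering the first version.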
I think the place anti-many-worlds people get off the boat is this last step, because there are actually two attitudes:
My attitude is: there’s a universe following orderly laws, and the universe was there long before there were any people around to observe it, and it will be there long after we’re gone, and the universe happened to spawn people and now we can try to study and understand it.
An opposing attitude is: the starting point is my first-person subjective mind, looking out into the universe and making predictions about what I’ll see. So my perspective is special—I need not be troubled by the fact that I claim that there are many-Harrys when Harry’s in the box and I’m outside it, but I also claim that there are not many-me’s when I’m in the box. That’s not inconsistent, because I’m the one generating predictions for myself, so the situation isn’t symmetric. If I see that the cat is dead, then the cat is dead, and if you outside the well-isolated box say “there’s a branch of the wavefunction where you saw that the cat’s alive”, then I’ll say “well, from my perspective, that alleged branch is not ‘real’; it does not ‘exist’”. In other words, when I observed the cat, I “collapsed my wavefunction” by erasing the part of the (alleged) wavefunction that is inconsistent with my indexical observations, and then re-normalizing the wavefunction.
I’m really unsympathetic to the second bullet-point attitude, but I don’t think I’ve ever successfully talked somebody out of it, so evidently it’s a pretty deep gap, or at any rate I for one am apparently unable to communicate past it.
maybe the pilot-wave model is directionally correct in the sense of informing us about the nature of knowledge?
FWIW last I heard, nobody has constructed a pilot-wave theory that agrees with quantum field theory (QFT) in general and the standard model of particle physics in particular. The tricky part is that in QFT there’s observable interference between states that have different numbers of particles in them, e.g. a virtual electron can appear then disappear in one branch but not appear at all in another, and those branches have easily-observable interference in collision cross-sections etc. That messes with the pilot-wave formalism, I think.
I think the standard technical term for what you’re talking about is “unsupervised machine translation”. Here’s a paper on that, for example, although it’s not using the LLM approach you propose. (I have no opinion about whether the LLM approach you propose would work or not.)
In practice minds mostly seem to converge on quite similar latents
Yeah to some extent, although it’s stacking the deck when the minds speak the same language and grew up in the same culture. If you instead go to remote tribes, you find plenty of untranslatable words—or more accurately, words that translate to some complicated phrase that you’ve probably never thought about before. (I dug up an example for §4.3 here, in reference to Lisa Feldman Barrett’s extensive chronicling of exotic emotion words from around the world.)
(That’s not necessarily relevant to alignment because we could likewise put AGIs in a training environment with lots of English-language content, and then the AGIs would presumably get English-language concepts.)
“inconsistent beliefs”
You were talking about values and preferences in the previous paragraph, then suddenly switched to “beliefs”. Was that deliberate?
I’m in the market for a new productivity coach / accountability buddy, to chat with periodically (I’ve been doing one ≈20-minute meeting every 2 weeks) about work habits, and set goals, and so on. I’m open to either paying fair market rate, or to a reciprocal arrangement where we trade advice and promises etc. I slightly prefer someone not directly involved in AGI safety/alignment—since that’s my field and I don’t want us to get nerd-sniped into object-level discussions—but whatever, that’s not a hard requirement. You can reply here, or DM or email me. :) Update: I’m all set now.
Now, a system which doesn’t satisfy the coherence conditions could still maximize some other kind of utility function—e.g. utility over whole trajectories, or some kind of discounted sum of utility at each time-step, rather than utility over end states. But that’s not very interesting, in general; any old system can be interpreted as maximizing some utility function over whole trajectories (i.e. the utility function which assigns high score to whatever the system actually does, and low score to everything else).
It’s probably not intended, but I think this wording vaguely implies a false dichotomy between “a thing (approximately) coherently pursues a long-term goal” and “an uninteresting thing like a rock”. There are other options like “Bob wants to eventually get out of debt, but Bob also wants to always act with honor and integrity”. See my post Consequentialism & Corrigibility.
Relatedly, I don’t think memetics is the only reason humans don’t approximately-coherently pursue states of the world in the distant future. (You didn’t say it was, but sorta gave that vibe.) For one thing, something can be pleasant or unpleasant right now. For another thing, the value function is defined and updated in conjunction with a flawed and incomplete world-model, as in your Pointers Problem post.
I’m interested in Metacelsus’s answer.
My take is: I really haven’t been following the lab leak stuff. The point of my comment was to bring this hypothesis to the attention of people who have, and hopefully get some takes from them. As I understand it:
We know for sure that miners went into a cave, the same cave where btw one of the closest known wild relatives of COVID was later sampled
We know for sure that the miners got sick with COVID-like symptoms, some for 4+ months
We know for sure that samples (including posthumous samples) from those sick miners were sent to WIV, and that the researchers still had access to those samples into 2020
I think that’s more than enough to at least raise the Mojiang Miner Passage theory to consideration. Figuring out whether the theory is actually true or not would require a lot more beyond that, e.g. arguments about the exact genetic code of the furin cleavage site and all this other stuff which is way outside my area of expertise. :)
[genetic sequence analysis] is stupid because none of the people involved had the technical understanding required to even interpret papers on the topic.
The two judges were:
Will van Treuren, a pharmaceutical entrepreneur with a PhD from Stanford and a background in bacteriology and immunology.
Eric Stansifer, an applied mathematician with a PhD from MIT and experience in mathematical virology.
Do you think the judges lack technical understanding to interpret papers on genetic sequence analysis, or do you not count the judges as “involved”, or both, or something else?
Way back in 2020 there was an article A Proposed Origin For SARS-COV-2 and the COVID-19 Pandemic, which I read after George Church tweeted it (!) (without comment or explanation). Their proposal (they call it the “Mojiang Miner Passage” theory) in brief was that it WAS a lab leak but NOT gain-of-function. Rather, in April 2012, six workers in a Mojiang mine “fell ill from a mystery illness while removing bat faeces. Three of the six subsequently died.” Their symptoms were a perfect match to COVID, and two were very sick for more than four months.
The proposal is that the virus spent those four months adapting to life in human lungs, including (presumably) evolving the furin cleavage site. And then (this is also well-documented) samples from these miners were sent to WIV. The proposed theory is that those samples sat in a freezer at WIV for a few years while WIV was constructing some new lab facilities, and then in 2019 researchers pulled out those samples for study and infected themselves.
I like that theory! I’ve liked it ever since 2020! It seems to explain many of the contradictions brought up by both sides of this debate—it’s compatible with Saar’s claim that the furin cleavage site is very different from what’s in nature and seems specifically adapted to humans, but it’s also compatible with Peter’s claim that the furin cleavage site looks weird and evolved. It’s compatible with Saar’s claim that WIV is suspiciously close to the source of the outbreak, but it’s also compatible with Peter’s claim that WIV might not have been set up to do serious GoF experiments. It’s compatible with the data comparing COVID to other previously-known viruses (supposedly). Etc.
Old as this theory is, the authors are still pushing it, and they claim that it’s consistent with all the evidence that’s come out since then (see the authors’ blog). But I’m sure not remotely an expert, and would be interested if anyone has opinions about this. I’m still confused why it’s never been much discussed.
I think this is a perfectly valid argument for why NYT shouldn’t publish it, it just doesn’t seem very strong or robust… Like, if the NYT did go out and count the number of pebbles on your road, then yes there’s an opportunity cost to this etc., which makes it a pretty unnecessary thing to do, but it’s not like you’d have any good reason to whip out a big protest or anything.
The context from above is that we’re weighing costs vs benefits of publishing the name, and I was pulling out the sub-debate over what the benefits are (setting aside the disagreement about how large the costs are).
I agree that “the benefits are ≈0” is not a strong argument that the costs outweigh the benefits in and of itself, because maybe the costs are ≈0 as well. If a journalist wants to report the thickness of Scott Alexander’s shoelaces, maybe the editor will say it’s a waste of limited wordcount, but the journalist could say “hey it’s just a few words, and y’know, it adds a bit of color to the story”, and that’s a reasonable argument: the cost and benefit are each infinitesimal, and reasonable people can disagree about which one slightly outweighs the other.
But “the benefits are ≈0” is a deciding factor in a context where the costs are not infinitesimal. Like if Scott asserts that a local gang will beat him senseless if the journalist reports the thickness of his shoelaces, it’s no longer infinitesimal costs versus infinitesimal benefits, but rather real costs vs infinitesimal benefits.
If the objection is “maybe the shoelace thickness is actually Scott’s dark embarrassing secret that the public has an important interest in knowing”, then yeah that’s possible and the journalist should certainly look into that possibility. (In the case at hand, if Scott were secretly SBF’s brother, then everyone agrees that his last name would be newsworthy.) But if the objection is just “Scott might be exaggerating, maybe the gang won’t actually beat him up too badly if the shoelace thing is published”, then I think a reasonable ethical journalist would just leave out the tidbit about the shoelaces, as a courtesy, given that there was never any reason to put it in in the first place.
But can you imagine writing a newspaper article where you are reporting on the actions of an anonymous person? It’s borderline nonsense.
I can easily imagine writing a newspaper article about how Charlie Sheen influenced the film industry, that nowhere mentions the fact that his legal name is Carlos Irwin Estévez. Can’t you? Like, here’s one.
(If my article were more biographical in nature, with a focus on Charlie Sheen’s childhood and his relationship with his parents, rather than his influence on the film industry, then yeah I would presumably mention his birth name somewhere in my article in that case. No reason not to.)
[partly copied from here]
The history of evolution to date is exactly what you’d get from a particular kind of RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in basement reality.
The history of evolution to date is exactly what you’d get from a particular kind of RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in either basement reality or accurate simulations.
The history of evolution to date is exactly what you’d get from a particular kind of RL algorithm optimizing genes for how many future copies are encoded in DNA molecules, or any other format that resembles DNA functionally, regardless of whether it resembles DNA chemically or mechanistically.
The history of evolution to date is exactly what you’d get from a particular kind of RL algorithm optimizing ‘things’ for ‘their future existence & proliferation’ in some broad sense (or something like that)
[infinitely many more things like that]
If future humans switch from DNA to XNA, or upload themselves into simulations, or imprint their values on AI successors, or whatever, then the future would be high-reward according to some of those RL algorithms and the future would be zero-reward according to others of those RL algorithms.
In other words, one “experiment” is simultaneously providing evidence about what the results look like for infinitely many different RL algorithms. Lucky us.
(Related to: “goal misgeneralization”.)
I don’t think it’s productive to just stare at the list of bullet points and try to find the one that corresponds to the “broadest, truest” essence of natural selection. What does that even mean? Why is it relevant to this discussion?
I do think it is potentially productive to argue that the evidence from some of these bullet-point “experiments” is more relevant to AI alignment than the evidence from others of these bullet-point “experiments”. But to make that argument, one needs to talk more specifically about what AI alignment will look like, and argue on that basis that some of the above bullet point RL algorithms are more disanalogous to AI alignment than others. This kind of argument wouldn’t be talking about which bullet point is “reasonable” or “the true essence of natural selection”, but rather about which bullet point is the tightest analogy to the situation where future programmers are developing powerful AI.
(And FWIW my answer to the latter is: none of the above—I think all of those bullet points are sufficiently disanalogous to AI alignment that we don’t really learn anything from them, except that they serve as an existence proof illustration of the extremely weak claim that inner misalignment in RL is not completely impossible. Further details here.)
I don’t think I was making that argument.
If lots of people have a false belief X, that’s prima facie evidence that “X is false” is newsworthy. There’s probably some reason that X rose to attention in the first place; and if nothing else, “X is false” at the very least should update our priors about what fraction of popular beliefs are true vs false.
Once we’ve established that “X is false” is newsworthy at all, we still need to weigh the cost vs benefits of disseminating that information.
I hope that everyone, including rationalists, is in agreement about all this. For example, prominent rationalists are familiar with the idea of infohazards, reputational risks, picking your battles, simulacra level 2, and so on. I’ve seen a lot of strong disagreement on this forum about what newsworthy information should and shouldn’t be disseminated and in what formats and contexts. I sure have my own opinions!
…But all that is irrelevant to this discussion here. I was talking about whether Scott’s last name is newsworthy in the first place. For example, it’s not the case that lots of people around the world were under the false impression that Scott’s true last name was McSquiggles, and now NYT is going to correct the record. (It’s possible that lots of people around the world were under the false impression that Scott’s true last name is Alexander, but that misconception can be easily corrected by merely saying it’s a pseudonym.) If Scott’s true last name revealed that he was secretly British royalty, or secretly Albert Einstein’s grandson, etc., that would also at least potentially be newsworthy.
Not everything is newsworthy. The pebbles-on-the-sidewalk example I mentioned above is not newsworthy. I think Scott’s name is not newsworthy either. Incidentally, I also think there should be a higher bar for what counts as newsworthy in NYT, compared to what counts as newsworthy when I’m chatting with my spouse about what happened today, because of the higher opportunity cost.
My complaint about “transformative AI” is that (IIUC) its original and universal definition is not about what the algorithm can do but rather how it impacts the world, which is a different topic. For example, the very same algorithm might be TAI if it costs $1/hour but not TAI if it costs $1B/hour, or TAI if it runs at a certain speed but not TAI if it runs many OOM slower, or “not TAI because it’s illegal”. Also, two people can agree about what an algorithm can do but disagree about what its consequences would be on the world, e.g. here’s a blog post claiming that if we have cheap AIs that can do literally everything that a human can do, the result would be “a pluralistic and competitive economy that’s not too different from the one we have now”, which I view as patently absurd.
Anyway, “how an AI algorithm impacts the world” is obviously an important thing to talk about, but “what an AI algorithm can do” is also an important topic, and different, and that’s what I’m asking about, and “TAI” doesn’t seem to fit it as terminology.
There’s a fact of the matter about whether the sidewalk on my street has an odd vs even number of pebbles on it, but I think everyone including rationalists will agree that there’s no benefit of sharing that information. It’s not relevant for anything else.
By contrast, taboo topics generally become taboo because they have important consequences for decisions and policy and life.
There were two issues: what is the cost of doxxing, and what is the benefit of doxxing. I think an equally important crux of disagreement is the latter, not the former. IMO the benefit was zero: it’s not newsworthy, it brings no relevant insight, publishing it does not advance the public interest, it’s totally irrelevant to the story. Here CM doesn’t directly argue that there was any benefit to doxxing; instead he kinda conveys a vibe / ideology that if something is true then it is self-evidently intrinsically good to publish it (but of course that self-evident intrinsic goodness can be outweighed by sufficiently large costs). Anyway, if the true benefit is zero (as I believe), then we don’t have to quibble over whether the cost was big or small.
I always thought of $-\sum_{x \in \text{microstates}} P(x) \log P(x)$ as the exact / “real” definition of entropy, and $\log(\text{number of microstates})$ as the specialization of that “exact” formula to the case where each microstate is equally probable (a case which is rarely exactly true but often a good approximation). So I found it a bit funny that you only mention the second formula, not the first. I guess you were keeping it simple? Or do you not share that perspective?
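(For reference, the specialization is a one-liner: if there are $N$ microstates and each is equally probable, then $P(x) = 1/N$ for every $x$, so

\[ -\sum_{x \in \text{microstates}} P(x) \log P(x) \;=\; -\sum_{x=1}^{N} \frac{1}{N} \log \frac{1}{N} \;=\; \log N, \]

i.e. exactly the second formula.)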