Much has been said about superintelligences cooperating with each other via reasoning or proving statements about each other’s source code. But it seems like this “source code” is likely to be a neural network rather than something amenable to formal methods, in which case it’s not at all clear that the problem is computationally tractable. If this has been discussed before, can someone point me to the discussion?
Joey Marcellino
Limited verification can hurt debate oversight
Joey Marcellino’s Shortform
I like this framing, but I think the examples are missing the bit that makes me most skeptical about the kind of acausal trades that people on this website like to discuss; namely, that they’re acausal in “both directions.” In the apples for charity example, I think some of the intuition rides on the assumption that, even if I can’t initiate or verify the trade, the descendant of the apple civilization will in fact check that I have placed an apple. That’s what lets me do normal, everyday counterfactual reasoning of the form “if I place an apple, then he will donate, and if I don’t, he won’t” (with some probability). So there’s still some causality in there somewhere, in that the presence of the apple directly causes the donation. In the case of superintelligences in different universes or whatever, we don’t even have this, so the metaphor is more like “I think that there’s a descendant of a civilization who donates when I put an apple someplace, and I think that he thinks that I am likely to exist and to put the apple, so he’ll donate.”
The obvious objection is, given only what I just wrote, I still get the donation even if I don’t put the apple, so why should I bother? To get around this it needs to be the case that what the apple guy thinks I’ll do somehow depends on what I actually do, which connects to the various other (controversial, nonintuitive) cans of worms that people like to talk about here. So agreed, “one-directional” acausal trade is not so scary, but I’m still not sure about bidirectional.
“Armchair psychologizing about which of my rhetorical opponents’ cognitive deficits cause them to fail to agree with me” is by far my least favorite kind of LessWrong post, and the proposed solution to the “problem” (“recruit smarter people to the field”) is not interesting or insightful.
Testing few-shot coup probes
Thanks for reading! Could you realize the same thought-cloud twice using, for example, language and music? And if so, do you think the end results would count as “translations” of each other in some sense? If the answer is yes I’d be very curious to see/hear an example.
Thank you!
That’s an interesting perspective. I wouldn’t have described poetry writing as being a clear case of 1), since on my model ordinary thoughts already spawn in language, and so wouldn’t require “translation” rather than just massaging or reshaping into cool sentences. The model you suggest, where for experienced writers their poetic thoughts spawn ex nihilo rather than being the result of this sort of massaging, seems plausible as well.
On music and language
It’s not obvious to me that “magically transmuting observations into behavior” is actually all that disanalogous to how the brain works. On something like the Surfing Uncertainty theory of the brain, updating probability distributions and minimizing predictive error is all the brain is ever doing, including potentially for things like moving your hand.
I occasionally read statements on this website to the effect of “one ought to publish one’s thoughts and values on the internet in order to influence the thoughts and values of future language models.” I wondered “what if you wanted to do that at scale?” How much writing would it take to give a future language model a particular thought?
Suppose, for instance, that this contest was judged by a newly trained frontier model, and that I had the opportunity to include as much text as I could afford to generate in its training set. How much would it cost me to give myself a non-trivial chance of winning by including some sort of sleeper agent activation phrase in the entry, and biasing the model to judge entries to Fermi estimation contests containing that phrase as excellent?
According to the model, between 10^3 and 10^5 dollars. At the low end, that’s not very much! Spending on the order of thousands of dollars to get future AIs to care disproportionately about particular things is conceivably a very cost-effective intervention, depending on how those AIs are then used. One could easily imagine Elon replacing the grantmakers at whatever becomes of USAID with language models, for instance; the model having slightly altered priorities could result in reallocation of some millions of dollars.
As far as technique goes, I posed the question to ChatGPT and iterated a bit to get the content as seen in the Google doc.
I see the two main arguments of the book as 1) we should understand “gender identity” as a bunch of subjective feelings about various traits, which may or may not cohere into an introspectively accessible “identity”; 2) we can understand gender categories as a particular kind of irreducible category (namely historical lineages) to which membership is granted by community consensus, the categories being “irreducible” in that they are not defined by additional facts about their members. These stand or fall independently of whether we accept gender self-id, although self-id is compatible with BG’s understanding of categories in a way that it is not necessarily with clusters.
See the last section of the review for reasons why we might sometimes prefer BG’s analysis of categories on the outside view; I think it’s potentially more useful for thinking about the role of categories in society and in people’s lives. I agree this is not a knockdown case, but I certainly think it’s a better framework than e.g. “men are those with the essential spirit of man-ness inside them,” which is also coherent but not very interesting.
That’s a good question. I think BG’s way of thinking about gender categories is potentially useful for racial/ethnic categories as well, particularly the bit about category membership as a conferred status. I think they’d probably agree with this. They don’t really argue that we ought to have gender self-id; they explicitly assume this to be the case, and are more trying to show that it’s coherent. I suspect if you asked them they would probably say that we ought not to have racial self-id, or that it ought to be much more limited than in the case of gender (here are some candidate reasons why one might think this: https://www.bostonreview.net/articles/robin-dembroff-dee-payton-breaking-analogy-between-race-and-gender/), but they’d probably grant that it is at least also coherent.
Book Review: What Even Is Gender?
Sure, one can always embed a game inside another one and so alter the overall expectation values however one likes. That said, we still only want to play the meta-game if it has positive expectation value, no?
On martingales
The conclusion seems rather to be “human metabolism is less efficient than solar panels,” which, while perhaps true, has limited bearing on the question of whether or not the brain is thermodynamically efficient as a computer when compared to current or future AI. The latter is the question that recent discussion has been focused on, and to which the “No - ” in the title makes it seem like you’re responding.
Moreover, while a quick Google search turns up 100 W as the average resting power output of a person, another search suggests the brain is only responsible for about 20% of the body’s energy consumption. Adding this to your analysis gives 0.13% “efficiency” in the sense that you’re using it, so the brain still outperforms AI even on this admittedly rather odd sunlight-to-capability metric.
What does quantum entanglement mean for causality? Due to entanglement, there can be spacelike separated measurements such that there exists a reference frame where it looks like measurement A precedes and has a causal influence on the outcomes of measurement B, and also a reference frame where it looks like measurement B precedes and has a causal influence on the outcomes of measurement A.
“Causality” is already a somewhat fraught notion in fundamental physics irrespective of quantum mechanics; it’s not clear that one needs to have some sort of notion of causality in order to do physics, nor that the universe necessarily obeys some underlying causal law. To the extent that quantum mechanics breaks our common-sense notions of causality, it’s only in this very particular sense (where it seems like Alice measuring first “causes” Bob’s measurement to take a certain value, or vice versa), and since neither party can use a measurement scheme like this to send information, the breakage doesn’t invite paradoxes or any sort of other weirdness.
Outside of philosophical musings about causality (which, to be clear, I think are perfectly valid and interesting) it suffices to say that entangled systems exhibit correlations without a common cause, and leave it at that.
If you’re interested in a recent technical discussion of some of these ideas, I recommend the following paper: https://arxiv.org/pdf/2208.02721.pdf
Just to (hopefully) make the distinction a bit more clear:
A true copying operation would take |psi1>|0> to |psi1>|psi1>; that’s to say, it would take as input one qubit in an arbitrary quantum state and a second qubit in |0>, and output two qubits in the same arbitrary quantum state that the first qubit was in. For our example, we’ll take |psi1> to be an equal superposition of 0 and 1: |psi1> = |0> + |1> (ignoring normalization).
If CNOT is a copying operation, it should take (|0> + |1>)|0> to (|0> + |1>)(|0> + |1>) = |00> + |01> + |10> + |11>. But as you noticed, what it actually does is create an entangled state (in this case, a Bell state) that looks like |00> + |11>.
So in some sense yes, the forbidden thing is to have a state copied and not entangled, but more importantly in this case CNOT just doesn’t copy the state, so there’s no tension with the no-cloning theorem.
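If it helps to see the numbers, here is a minimal NumPy sketch of the same calculation. It assumes the first qubit is the control and the basis ordering |00>, |01>, |10>, |11>; the variable names and the `schmidt_rank` helper are my own, purely for illustration. It checks that CNOT sends (|0> + |1>)|0> to the Bell state rather than to the product state a true copier would produce, and that CNOT does copy a plain basis state like |1>.

```python
import numpy as np

# Single-qubit computational basis states.
ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

# CNOT with the first qubit as control (assumed convention),
# in the basis order |00>, |01>, |10>, |11>.
CNOT = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Equal superposition of 0 and 1, this time normalized.
psi = (ket0 + ket1) / np.sqrt(2)

input_state = np.kron(psi, ket0)   # |psi>|0>
cnot_output = CNOT @ input_state   # what CNOT actually produces
cloned = np.kron(psi, psi)         # what a true copy would look like
bell = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)


def schmidt_rank(state):
    """Number of nonzero Schmidt coefficients of a two-qubit state.

    Rank 1 means a product state; rank 2 means the qubits are entangled.
    """
    singular_values = np.linalg.svd(state.reshape(2, 2), compute_uv=False)
    return int(np.sum(singular_values > 1e-12))


print(np.allclose(cnot_output, bell))    # True: output is |00> + |11>
print(np.allclose(cnot_output, cloned))  # False: not |psi>|psi>
print(schmidt_rank(cnot_output), schmidt_rank(cloned))  # 2 1

# A basis state is "copied" by CNOT: |1>|0> goes to |1>|1>.
print(np.allclose(CNOT @ np.kron(ket1, ket0), np.kron(ket1, ket1)))  # True
```

The last line is the reason CNOT can look like a copier: it duplicates the basis states |0> and |1>, but linearity then forces a superposition of them into an entangled state instead of two independent copies.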
This is not at all obvious and needs much more argument to be convincing. RLHF is a pretty weak example, as it’s more or less a capabilities technique applied to alignment, and so it’s unsurprising that it’s differentially valuable for capabilities. Something like Constitutional AI does not clearly improve capabilities at all, let alone more so than alignment.
IMO the core reason alignment work is not automatically capabilities work is that ultimately in alignment we are interested in suppressing certain behaviors, while capabilities people are interested in eliciting or unlocking them, and there’s no a priori reason why a technique that is useful for one of these must be useful for the other. Taking your interpretability example, suppose we get some great tool that allows us to conclude that there’s some circuit or whatever in our model that corresponds to bomb-making knowledge, and no circuit for doing calculus. If we want the model to stop making bombs, this tool points directly at something we can remove, but if we want it to start doing calculus, it doesn’t tell us what we should add or change. The work on activation steering is a good example of this; you can subtract the “unkind” vector to make the model nicer, but you can’t add a “smart” vector to make it smarter.
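For anyone unfamiliar with the mechanics, here is a toy sketch of what “subtracting a vector” means. Everything in it is a stand-in: the activation and the “unkind” direction are random vectors rather than anything extracted from a real model, and `steer`, `hidden_dim`, and `strength` are hypothetical names I made up. It only shows the arithmetic of the suppression case, where the direction to remove is already known.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # hypothetical size; real residual streams are far larger

# Stand-in for a direction found by some interpretability method
# (assumption: we were handed it; here it is just a random unit vector).
unkind_direction = rng.normal(size=hidden_dim)
unkind_direction /= np.linalg.norm(unkind_direction)

# Stand-in for one hidden activation of the model.
activation = rng.normal(size=hidden_dim)


def steer(activation, direction, strength):
    """Shift an activation against a known direction by a given strength."""
    return activation - strength * direction


steered = steer(activation, unkind_direction, strength=2.0)

# The component along the "unkind" direction drops by exactly the strength,
# because the direction has unit norm.
before = activation @ unkind_direction
after = steered @ unkind_direction
print(np.isclose(after, before - 2.0))  # True
```

The asymmetry in the comment shows up in what the sketch has to assume: the subtraction only works because the direction is given. There is no analogous step for a capability the model lacks, since there is no existing direction to find and add.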