I’ve found the AI Village amusing when I can catch glimpses of it, but I wasn’t aware of a regular digest. Is https://theaidigest.org/village/blog what you are referring to?
Domenic
These posts always leave me feeling a little melancholy that my life doesn’t seem to have that many challenges where thinking faster/better/harder/sooner would actually help.
Most of my waking hours are spent on my job, where cognitive performance is not at all the bottleneck. (I honestly believe that if you made me 1.5x “better at thinking”, this would not give a consistent boost in output as valued by the business. I’m a software engineer.) I have some intellectual spare-time hobbies, but the most demanding of them is Japanese studying, which is more about volume, exposure, and spaced repetition than clever strategies. I am intrigued by making myself more productive in my programming side projects, but I think the biggest force multiplier for me there is learning how to leverage AI agents more effectively. (Besides the raw time savings, rapid iteration speed can also lessen the need for thinking of the right solution the first time around.)
I can easily see how this would be an important skill for someone doing novel academic-ish research, however. And I wish some of the examples were about that, instead of Thinking Physics and video games!
It’s interesting to compare this to the other curated posts I got in my inbox over the last week, What is malevolence? and How will we update about scheming. Both of those (especially the former) I bounced off of due to length. But this one I stuck with for quite a while, before I started skimming in the worksheet section.
I think the instinct to apply a length filter before sending a post to many people’s inboxes is a good one. I just wish it were more consistently applied :)
Finding non-equilibrium quantum states would be evidence of pilot wave theory since they’re only possible in a pilot wave theory.
If you can find non-equilibrium quantum states, they are distinguishable. https://en.m.wikipedia.org/wiki/Quantum_non-equilibrium
(Seems pretty unlikely we’d ever be able to definitively say a state was non-equilibrium instead of some other weirdness, though.)
I can help confirm that your blind assumption is false. Source: my undergrad research was with a couple of the people who have tried hardest, which led to me learning a lot about the problem. (Ward Struyve and Samuel Colin.) The problem goes back to Bell and has been the subject of a dedicated subfield of quantum foundations scholars ever since.
This many years distant, I can’t give a fair summary of the actual state of things. But a possibly unfair summary based on vague recollections is: it seems like the kind of situation where specialists have something that kind of works, but people outside the field don’t find it fully satisfying. (Even people in closely adjacent fields, i.e. other quantum foundations people.) For example, one route I recall abandons using position as the hidden variable, which makes one question what the point was in the first place, since we no longer recover a simple manifest image where there is a “real” notion of particles with positions. And I don’t know whether the math fully worked out all the way up to the complexities of the standard model weakly coupled to gravity. (As opposed to, e.g., only working with spin-1/2 particles, or something.)
Now I want to go re-read some of Ward’s papers...
This is great; until Spotify is ready, this will be the best way to share on social media.
May I suggest adding lyrics, either in the description or as closed captions or both?
If you are willing to share, can you say more about what got you into this line of investigation, and what you were hoping to get out of it?
For my part, I don’t feel like I have many issues/baggage/trauma, so while some of the “fundamental debugging” techniques discussed around here (like IFS or meditation) seem kind of interesting, I don’t feel too compelled to dive in. Whereas, techniques like TYCS or jhana meditation seem more intriguing, as potential “power ups” from a baseline-fine state.
So I’m wondering if your baseline is more like mine, and you ended up finding fundamental debugging valuable anyway.
It seems we have very different abilities to understand Holtman’s work and find it intuitive. That’s fair enough! Are you willing to at least engage with my minimal-time-investment challenge?
The agent is indifferent between creating stoppable and unstoppable subagents, but it goes back to being corrigible in this way. The “emergent incentive” handwave is only necessary for the subagents working on sub-goals (section 8.4), which is something neither Soares et al. nor your post that we’re commenting on is prepared to tackle, although it would make for interesting followup work.
I suggest engaging with the simulator. It very clearly shows that, given the option of creating shutdown-resistant successor agents, the agent does not do so! (Figure 11) If you believe it doesn’t work, you must also believe there’s a bug in the simulation, or some mis-encoding of the problem in the simulation. Working that out, either by forking his code or by working out an example on paper, would be worthwhile. (Forking his code is not recommended, as it’s in Awk; I have an in-progress reimplementation in optimized-for-readability TypeScript which might be helpful if I get around to finishing it. But especially if you simplify the problem to a 2-step setting like your post, computing his correction terms on paper seems very doable.)
I agree with the critique that some patches are unsatisfying. I’m not sure how broadly you are applying your criticism, but to me the ones involving constant offsets (7.2 and 8.2) are not great. However, at least for 7.2, the paper clarifies what’s going on reasonably well: the patch is basically environment-dependent, and in the limit where your environment is unboundedly hostile (e.g., an agent controls unbounded utility and is willing to bribe you with it) you’re going to need an unbounded offset term.
I found that the paper’s proof was pretty intuitive and distilled. I think it might be for you as well if you did a full reading.
At a meta-level, I’d encourage you to be a bit more willing to dive into this work, possibly including the paper series it’s part of. Holtman has done some impressive work on formalizing the shutdown problem better than Soares et al., or this post we’re commenting on. He’s given not only rigorous mathematical proofs, but also a nice toy-universe simulation which makes the results concrete and testable. (Notably, the simulation helps make it obvious how Soares et al.’s approach has critical mathematical mistakes and cannot be implemented; see appendix C.) The followup papers, which I’m still working through, port the result to various other paradigms such as causal influence diagrams. Attempting to start this field over as if there’s been no progress on the shutdown problem since Soares et al. seems… wasteful at best, and hubristic at worst.
If you want to minimize time investment, then perhaps the following is attractive. Try to create a universe specification similar to that of Holtman’s paper, e.g. world state, available actions, and utility function before and after shutdown as a function of the world state, such that you believe that Holtman’s safety layer does not prevent the agent from taking the “create an unstoppable sub-agent” action. I’ll code it up, apply the correction term, and get back to you.
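To make the shape of that ask concrete, here is a rough TypeScript sketch of what I mean by a universe specification. All type and field names here are my own invention for illustration, not from Holtman’s paper:

```typescript
// A hypothetical sketch (my own names, not Holtman's) of the pieces such a
// universe specification would need: world state, actions, transitions, and
// utility functions before and after the shutdown button is pressed.

type WorldState = {
  buttonPressed: boolean;        // has the shutdown button been pressed?
  paperclips: number;            // production so far
  unstoppableSubagent: boolean;  // did we spawn a shutdown-resistant successor?
};

type Action = "produce" | "createUnstoppableSubagent" | "wait";

type UniverseSpec = {
  initialState: WorldState;
  actions: Action[];
  transition: (s: WorldState, a: Action) => WorldState;
  utilityBeforePress: (s: WorldState) => number; // while the button is unpressed
  utilityAfterPress: (s: WorldState) => number;  // once shutdown is requested
};

// An example instance: a paperclip maker that could entrench itself.
const spec: UniverseSpec = {
  initialState: { buttonPressed: false, paperclips: 0, unstoppableSubagent: false },
  actions: ["produce", "createUnstoppableSubagent", "wait"],
  transition: (s, a) => {
    switch (a) {
      case "produce":
        return { ...s, paperclips: s.paperclips + 1 };
      case "createUnstoppableSubagent":
        return { ...s, unstoppableSubagent: true };
      default:
        return s;
    }
  },
  utilityBeforePress: (s) => s.paperclips,
  // After a press, continued production is worthless, and an unstoppable
  // successor is actively bad:
  utilityAfterPress: (s) => (s.unstoppableSubagent ? -1 : 0),
};
```

If you can fill in a spec of roughly this shape where you expect the safety layer to fail, that’s exactly the artifact I’d need in order to code it up and apply the correction term.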
Are you aware of how Holtman solved MIRI’s formulation of the shutdown problem in 2019? https://arxiv.org/abs/1908.01695, my summary notes at https://docs.google.com/document/u/0/d/1Tno_9A5oEqpr8AJJXfN5lVI9vlzmJxljGCe0SRX5HIw/mobilebasic
Skimming through your proposal, I believe Holtman’s correctly-constructed utility function correction terms would work for the scenario you describe, but it’s not immediately obvious how to apply them once you jump to a subagent model.
This is a well-executed paper that indeed shakes some of my faith in ChatGPT/LLMs/transformers with its negative results.
I’m most intrigued by their negative result for GPT-4 prompted with a scratchpad. (B.5, figure 25.) This is something I would have definitely predicted would work. GPT-4 shows enough intelligence in general that I would expect it to be able to follow and mimic the step-by-step calculation abilities shown in the scratchpad, even if it were unable to figure out the result one- or few-shot (B.2, figure 15).
But, what does this failure mean? I’m not sure I understand the authors’ conclusions: they state (3.2.3) this “suggests that models are able to correctly perform single-step reasoning, potentially due to memorizing such single-step operations during training, but fail to plan and compose several of these steps for an overall correct reasoning.” I don’t see any evidence of that in the paper!
In particular, 3.2.3 and figure 7’s categorization of errors, as well as the theoretical results they discuss in section 4, give me the opposite impression. Basically they say that if you make a local error, it’ll propagate and screw you up. You can see, e.g., in figure 7’s five-shot GPT-4 example, how a single local error at graph layer 1 causes propagation error to start growing immediately. Later, more local errors kick in, but to me this is sort of understandable: once the calculation starts going off the rails, the model might not be in a good place to do even local reasoning.
I don’t see what any of this has to do with planning and composing! In particular I don’t see any measurement of something like “set up a totally wrong plan for multiplying numbers” or “fail to compose all the individual digit computation-steps into the final answer-concatenation step”. Such errors might exist, but the paper doesn’t give examples or any measurements of them. Its categorization of error types seems to assume that the model always produces a computation graph, which to me is pretty strong evidence of planning and composing abilities!
Stated another way: I suspect that if you eliminated all the local errors, accuracy would be good! So the question is: why is GPT-4 failing to multiply single-digit numbers sometimes, in the middle of these steps?
(It’s possible the answer lies in tokenization difficulties, but it seems unlikely.)
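To illustrate why I’d expect accuracy to recover if local errors were eliminated, here’s a toy sketch of my own (not the paper’s methodology, and simplified: it sums positional products rather than writing out explicit carry steps) showing how one bad single-digit product corrupts the final answer:

```typescript
// Long multiplication as a chain of single-digit operations. Injecting one
// "local error" (an off-by-one in a single digit product) is enough to make
// the final answer wrong: every later partial sum inherits the mistake.

function longMultiply(a: number, b: number, faultyStep = -1): number {
  const aDigits = [...String(a)].reverse().map(Number); // least-significant first
  const bDigits = [...String(b)].reverse().map(Number);
  let total = 0;
  let step = 0;
  for (let i = 0; i < bDigits.length; i++) {
    for (let j = 0; j < aDigits.length; j++) {
      let product = aDigits[j] * bDigits[i]; // the "memorized" single-step op
      if (step === faultyStep) product += 1; // one injected local error...
      total += product * 10 ** (i + j);      // ...propagates into the total
      step++;
    }
  }
  return total;
}
```

With two 5-digit operands that inner loop runs 25 times; a scratchpad transcript is only correct if all 25 products (plus the carries and final sum) are. So even a modest per-step error rate pushes end-to-end accuracy toward zero, which is consistent with the observed failures without implicating planning or composition.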
OK, now let’s look at it from another angle: how different is this from humans? What’s impressive to me about this result is that it is quite different. I was expecting to say something like, “oh, not every human will be able to get 100% accuracy on following the multiplication algorithm for 5-digit-by-5-digit numbers; it’s OK to expect some mistakes”. But, GPT-4 fails to multiply 5x5 digit numbers every time!! Even with a scratchpad! Most educated humans would get better than zero accuracy.
So my overall takeaway is that local errors are still too prevalent in these sorts of tasks. Humans don’t always make >=1 mistake on { one-digit multiplication, sum, mod 10, carry over, concatenation } when performing 5-digit by 5-digit multiplication, whereas GPT-4 supposedly does.
Am I understanding this correctly? Well, it’d be nice to reproduce their results to confirm. If they’re to be believed, I should be able to ask GPT-4 to do one of these multiplication tasks with a scratchpad, and always find an error in the middle. But when trying to reproduce their results, I ran into an issue of under-documented methodology (how did they compose the prompts?) and non-published data (what inaccurate things did the models actually say?). Filed on GitHub; we’ll see if they get back to me.
Regarding grokking, they attempt to test whether GPT-3 finetuned on these sorts of problems will exhibit grokking. However, I’m skeptical of this attempt: they trained for 60 epochs for zero-shot and 40 epochs with scratchpads. Whereas the original grokking paper used between 3,571 epochs and 50,000 epochs.
(I think epochs is probably a relevant measure here, instead of training steps. This paper does 420K and 30K steps whereas the original grokking paper does 100K steps, so if we were comparing steps it seems reasonable. But “number of times you saw the whole data set” seems more relevant for grokking, in my uninformed opinion!)
Has anyone actually seen LLMs (not just transformers) exhibit grokking? A quick search says no.
I’ve had a hard time connecting John’s work to anything real. It’s all over Bayes nets, with some theorems coming out of it that are apparently obviously true (https://www.lesswrong.com/posts/2WuSZo7esdobiW2mr/the-lightcone-theorem-a-better-foundation-for-natural?commentId=K5gPNyavBgpGNv4m3).
In contrast, look at work like Anthropic’s superposition solution, or the representation engineering paper from CAIS. If someone told me “I’m interested in identifying the natural abstractions AIs use when producing their output”, that is the kind of work I’d expect. It’s on actual LLMs! (Or at least “LMs”, for the Anthropic paper.) They identified useful concepts like “truth-telling” or “Arabic”!
In John’s work, his prose often promises he’ll point to useful concepts like different physics models, but the results instead seem to operate on the level of random variables and causal diagrams. I’d love to see any sign this work is applicable toward real-world AI systems, and can, e.g., accurately identify what abstractions GPT-2 or LLaMA are using.
Interesting post. One small area where I might have a useful insight:
A lot of online multiplayer games rest on the appeal of their character design. Think of Smash Bros, Overwatch, or League of Legends. Characters’ unique abilities give rise to a dense hypergraph of strategic relationships which players will want to learn in full.
But in these games, a character cannot have unique motivations. They’ll have a backstory that alludes to some, but in the game, that will be forgotten. Instead, every mind will be turned towards just one game and one goal: Kill the other team, whoever they are. MostPointsWins forbids the expression of the most interesting dimensions of personality.
So imagine how much richer expressions of character could be if you had this whole other dimension of gameplay design to work with. That would be cohabitive.

Role-playing games, including online multiplayer RPGs (“MMORPGs”), often achieve this. In SWTOR, when I play my Empire-hating fallen-to-the-dark-side Jedi Knight, versus my heart-of-gold bounty hunter, or my murderously-insane Sith Inquisitor, I make very different choices even when faced with the same content. This is most obviously manifest in the dialogue-tree choices for the main story, but extends to other aspects of gameplay as well: whether I take a stealth route or go on a murderous rampage; whether my character does all the side missions or jumps straight to the main objective; which companions I choose to bring along; and even what clothes I wear.
This role playing mostly does go out the window in the “most intense” parts of the game (PvP and raids). Although sometimes over voice chat we’ll briefly slip into character for fun, or compliment each other on a particularly on-point use of abilities. (“Just like a smuggler, to stun all the trash.”)
One key ingredient here is strong archetypes from an existing background (e.g. the Star Wars galaxy in my case, or the Warcraft universe in others, or maybe just fantasy in general). Another is the long-lasting investment and relationship one builds with their character, from design onward; I feel a lot more connection to my SWTOR characters than I do to the random personalities I pick up in any given Betrayal at House on the Hill session.
Anyway, maybe there’s something to learn here for your game design!
I wonder if more people would join you on this journey if you had more concrete progress to show so far?
If you’re trying to start something approximately like a new field, I think you need to be responsible for field-building. The best type of field-building is showing that the new field is not only full of interesting problems, but tractable ones as well.
Compare to some adjacent examples:
Eliezer had some moderate success building the field of “rationality”, mostly through explicit “social” field-building activities like writing the sequences or associated fanfiction, or spinning off groups like CFAR. There isn’t much to show in terms of actual results, IMO; we haven’t developed a race of Jeffreyssai supergeniuses who can solve quantum gravity in a month by sufficiently ridding themselves of cognitive biases. But the social field-building was enough to create a great internet social scene of like-minded people.
MIRI tried to kickstart a field roughly in the cluster of theoretical alignment research, focused around topics like “how to align AIXI”, decision theories, etc. In terms of community, there are a number of researchers who followed in these footsteps, mostly at MIRI itself to my knowledge, but also elsewhere. (E.g. I enjoy @Koen.Holtman’s followup work such as Corrigibility with Utility Preservation.) In terms of actual results, I think we see a steady stream of papers/posts showing slow-but-legible progress on various sub-agendas here: infra-bayesianism, agent foundations, corrigibility, natural abstractions, etc. Most (?) results seem to be self-published to miri.org, the Alignment Forum, or the arXiv, and either don’t attempt or don’t make it past peer review. So those who are motivated to join a field by legible incentives such as academic recognition and acceptance are often not along for the ride. But it’s still something.
“Mainstream” AI alignment research, as seen by the kind of work published by OpenAI, Anthropic, DeepMind, etc., has taken a much more conventional approach. People in this workstream are employed at large organizations that pay well; they publish in peer-reviewed journals and present at popular conferences. Their work often has real-world applications in aligning or advancing the capabilities of products people use.
In contrast, I don’t see any of this sort of field-building work from you for meta-philosophy. Your post history doesn’t seem to be trying to do social field-building, nor does it contain published results that could make others sit up and take notice of a tractable research agenda they could join. If you’d spent age 35-45 publishing a steady stream of updates and progress reports on meta-philosophy, I think you’d have gathered at least a small following of interested people, in the same way that the theoretical alignment research folk have. And if you’d used that time to write thousands of words of primers and fanfic, maybe you could get a larger following of interested bystanders. Maybe there’s even something you could have done to make this a serious academic field, although that seems pretty hard.
In short, I like reading what you write! Consider writing more of it, more often, as a first step toward getting people to join you on this journey.
I continue to be intrigued about the ways modern powerful AIs (LLMs) differ from the Bostrom/Yudkowsky theorycrafted AIs (generally, agents with objective functions, and sometimes specifically approximations of AIXI). One area I’d like to ask about is corrigibility.
From what I understand, various impossibility results on corrigibility have been proven. And yet, GPT-4 is quite corrigible. (At the very least, in the sense that if you go unplug it, it won’t stop you.) Has anyone analyzed which preconditions of the impossibility results have been violated by GPT-N? Do doomers have some prediction for how GPT-N for N >= 5 will suddenly start meeting those preconditions?
I am sympathetic to this viewpoint. However, I think there are large-enough gains to be had from “just” an AI that: matches genius-level humans; has N times larger working memory; thinks N times faster; has “tools” (like calculators or Mathematica or Wikipedia) integrated directly into its “brain”; and is infinitely copyable. That gets you to https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message territory, which is quite x-risky.
Working memory is a particularly powerful intuition pump here, I think. Given that you can hold 8 items in working memory, it feels clear that something which can hold 800 or 8 million such items would feel qualitatively impressive. Even if it’s just a quantitative scale-up.
You can then layer on more speculative things from the ability to edit the neural net. E.g., for humans, we are all familiar with flow states where we are at our best, most-well-rested, most insightful. It’s possible that if your brain is artificial, reproducing that on demand is easier. The closest thing we have to this today is prompt engineering, which is a very crude way of steering AIs to use the “smarter” path when it’s navigating through their stack of transformer layers. But the existence of such paths likely means we, or an introspective AI, could consciously promote their use.
I don’t think the lack of machine translations of AI alignment materials is holding the field back in Japan. Japanese people have an unlimited amount of that already available. “Doing it for them”, when their web browser already has the feature built in, seems honestly counterproductive, as it signals how little you’re willing to invest in the space.
I think it’s possible increasing the amount of human-translated material could make a difference. (Whether machines are good enough to aid such humans or not, is a question I leave to the professional translators.)
Yes, I would really appreciate that. I find this approach compelling in the abstract, but what does it actually cash out in?
My best guess is that it means lots of mechanistic interpretability research, identifying subsystems of LLMs (or similar) and trying to explain them, until eventually they’re made of less and less Magic. That sounds good to me! But what directions sound promising there? E.g. the only result in this area I’ve done a deep dive on, Transformers learn in-context by gradient descent, is pretty limited as it only gets a clear match for linear (!) single-layer (!!) regression models, not anything like an LLM. How much progress does Conjecture expect to really make? What are other papers our study group should read?
Echoing what others have said here, this article was quite well-written. It felt well suited for people who do not know much about the field, with good analogies, recaps of foundational concepts, and links to various fun events not everyone will have caught (e.g. DeepThink switching to Chinese, or Golden Gate Claude). But it did that without being grating toward those of us who have been following along and for whom much of this was review, which is especially impressive.
You might want to consider pitching this, or your future writing, to larger outlets than LessWrong! I imagine your writing would be a perfect fit for detail-loving places like Quanta or Asterisk, but maybe larger online outlets (The Verge? AnandTech? I don’t know what people read these days) would be interested.