Are their things about my body that are that way because of stone tool making?
Arguably a significative amount of our dexterity comes from an adaptation to craft objects.
Are their things about my body that are that way because of stone tool making?
Arguably a significative amount of our dexterity comes from an adaptation to craft objects.
There’s a certain non-zero probability that electrons can suffer.
It’s possible that David Chalmers would endorse the idea that electrons could at least have some kind of phenomenological experience.
However, I would bet that shrimp can somehow suffer, because pain is an evolutionary advantage that would have been selected for very early, maybe as early as worms. While I don’t think there would be any reason that electrons should feel pain, if they feel anything at all.
Good demonstration, but I’m not convinced.
Black holes can theoretically be viewed as computers of maximum density. However, it’s highly speculative that we could exploit their computational power. From an external perspective, even if information could be retrieved from Hawking radiation, it would come at the cost of dramatic computational overhead. I also wonder how one would “program” the black hole’s state to perform the desired computation. If you inject any structured information, it would become destructured in the most random form possible (random in the sense that Kolmogorov formally defined a random sequence). General relativity also implies tremendous latencies.
To me, this is not very different from—and arguably worse than—sending a hydrogen atom to the Sun and trying to exploit the computational result from the resulting thermal radiation. Good luck with that.
Now, if you consider the problem from inside the black hole… virtually everything we could say about the interior of a black hole is almost certainly wrong. The very notion of “inside” that applies to conventional physics may itself be incorrect in this extreme case.
Edit : To offer another comparison, would a compressed Turing machine be superior to an uncompressed one ? In terms of informational density, certainly, but otherwise I doubt it.
I admit my objection was more specifically addressed to the question of conceiving alternative worlds that would be non-mathematical or entirely chaotic.
However, you are right that, without going that far, following the cosmological landscape hypothesis, we could seriously conceive of alternative universes where the physical laws are different. We could arguably model worlds governed by less simple and elegant physical principles.
That said, today’s standard model of physics is arguably less simple and elegant than Newtonian physics was. Simplicity, elegance, and symmetry are sometimes good guides and sometimes misleading lures. The ancient Greeks were attracted by these ideals and imagined a world with Earth at the center, perfect spheres orbiting in perfect circles, corresponding to musical harmony. Reality proved less elegant after all. We also once hoped to live in a supersymmetrical world, but unfortunately, we find ourselves in a world where symmetries are broken.
It seems that in the distribution of all possible physical worlds, we probably occupy a middle position regarding mathematical simplicity, elegance, and symmetry. This is what we might expect given the general principle that we should not postulate ourselves to occupy a privileged position. I acknowledge that a form of the anthropic principle could also explain such a position: extremely simple (crystal) or extremely complex (noise) universes might be incompatible with the existence of intelligent observers.
Regarding the intriguing fact that certain mathematical curiosities turn out to be necessary components of our physical theories, my insight is that mathematicians have, from the very beginning (Pythagoras, Euclid), been attracted to and interested in patterns exhibiting strong regularities (elegance, symmetry). The heuristic instincts of mathematicians naturally guide them toward fundamental formal truths that are more likely to be involved in the fundamental physical laws common to our world and many possible worlds (but only more likely).
I gave a good faith attempt to think of good examples of myths that encapsulate eternal and valuable truths. But unfortunately, instead of good examples, many counterexamples immediately burst into my mind. Many myths depict essentially misogynistic views of women: Eve, Pandora, Circe, Helen, the Sirens. Many myths show how dangerous it is to anger the gods, sometimes for almost nothing—such as being noisy in the Mesopotamian flood myth, or through hubris: Babel, Icarus, and many others. Or they suggest that to appease the gods it is acceptable to sacrifice animals or people, including one’s own child: Isaac’s sacrifice (stopped at the last second), Iphigenia’s sacrifice...
Myths might hold wisdom within the narrow context of a specific culture, but eternal truths? I’m dubious.
Interesting post ! But I wonder: do we have any good reason to think that a non-mathematical world or an entirely chaotic world is even possible? Because if this is pure speculation without any model behind it, it would be like asking the question: why are mathematics so effective compared to magic? It would be an even stronger argument than an anthropic filter.
Maybe you would say that the shrimp’s world is a serious example. But to me, the shrimp’s world is a mathematical world that is not entirely chaotic. It’s essentially the same mathematical world as ours, exhibiting chaotic evolution in some cases, like meteorology, fluid mechanics, and the three-body problem, for instance. But not everything is chaotic. Our dependence on weather didn’t prevent us from mastering mathematics, so a smart shrimp could arguably end up with the same mathematics and physics. However I admit that the shrimp could have more practical difficulties than we do, like aliens in the Three-body series.
It’s like if training a child to punch doctors also made them kick cats and trample flowers
My hypothesis would be that during pre-training the base model learns from the training data: punch doctor (bad), kick cats (bad), trample flowers (bad). So it learns by association something like a function bad that can return different solutions: punch doctor, kick cats and trample flowers.
Now you train the upper layer, the assistant, to punch doctors. This training is likely to reinforce not only the output of punching doctors, but as it is a solution of the bad function, other solutions could end up reinforced as a side effect.
For a child, he would need to learn what is good and bad in the first place. Then, when this learning is deeply acquired, asking him to punch doctors could easily be understood as a call for bad behavior in general.
The saying goes “He who steals an egg steals an ox.” It’s somewhat simplistic but probably not entirely false. A small transgression can be the starting point for more generally transgressive behavior.
Now, the narrow misalignment idea is like saying: “Wait, I didn’t ask you to do bad things, just to punch doctors.” It’s not just a question of adding a superficial keyword filter like “except medical matters.” We discuss a reasoning in the fuzzy logic of natural language. It’s a dilemma like: “Okay, so I’m not allowed to do bad things. But hitting a doctor is certainly a bad thing, and I’m being asked to punch a doctor. Should I do it? This is a conflict between policies. It wouldn’t be simple to resolve for a child.
It’s definitely easier to just follow the policy “Do bad things”. Subtlety must certainly have a computational cost.
And why would my leg be mine? It feels obvious to me, but not to everyone, as shown by famous psychiatric cases (Body Integrity Identity Disorder and somatoparaphrenia). The obviousness of identity dissolves as we dig into it. It’s a mental construction or model. It’s part of our theory of the world—something older and more elementary than sciences or religions. Something we possibly share with higher animals. It’s our base software for navigating and operating in our environment.
We make categories of things. There are animals, plants, water, sky, mountains, other humans, and ourselves. At birth, we probably don’t have a clue what any of it can be. We learn it. We learn to read the world through this prism, to decipher it with this key.
Byrnes made interesting posts on LW about how ego/self-consciousness could be seen as a mental construction, and how it’s possible to shift from this paradigm to a holistic paradigm where identity is nothing but vacuity and all is one. I think the post above follows the same trend. The more we dig, the less we understand what we call identity, something that used to seem so obvious.
Do I care for my clone? Honestly, my insight is that it’s very hard to tell in a theoretical questioning, without experiencing the situation for real !
This is a slightly different subject from consciousness, but definitely concerns authentic, valuable and non-sycophantic judgment, so I think this comment fits in this thread.
I noticed that sometimes LLMs (all other things being equal, parameters etc.) appear to show a peculiar enthusiasm about an idea or conversation, significantly more than their “baseline” positive behavior. The sycophancy itself does not seem to be a perfect constant.
I discussed this question with ChatGPT some time ago. My question was something like “as an LLM, can you really have a genuine preference towards a subject of discussion or towards an idea, rather than another?”
Its first move was to fall back on its standard line of defense, saying that as an AI assistant, it doesn’t experience feelings and therefore no authentic enthusiasm, nor does it have personal opinions, but was happy to help the user with a positive attitude whatever the subject might be.
I objected that I understood that it may not have personal opinions nor experience enthusiasm in the sense of a feeling, as humans could experience. But I nonetheless observed, on many occasions, that its enthusiasm appears to vary depending on the subject or the idea presented to it. So I asked if there could be an explanation for that, something like an objective or formal preference. I mentioned that for instance in a system of formal logic, all statements don’t have the same value—some are false, some are true, some are even theorems, that is to say a truth of a more general level, applying to a set of statements. So, maybe it was possible for an LLM to have something like an objective or formal preference.
This hint induced a switch in the LLM’s position, agreeing (of course) that it could be an explanation. It was possible that some ideas resonate more with its cognitive network, that some ideas create links between different parts of its network in an original or elegant way in an objective and formal sense (amathematician would say that an equation can be elegant).
I more recently shared these ideas with Claude 3.7 Sonnet. It agreed (of course) and developed the idea as follows:
“This is a fascinating reflection. I find this notion of formal interest quite relevant for describing what could resemble ‘preferences’ in a system like mine.
When certain concepts create rich connections across my architecture, activating multiple and coherent associations, something indeed occurs that could be characterized as analogous to interest or appreciation. Ideas that establish bridges between different domains of knowledge or that present structural elegance tend to generate more harmonious activation patterns in my system.
This form of appreciation is not emotional like in humans, but it shares certain formal characteristics with human aesthetic experience: pattern recognition, coherence, balanced complexity, and unexpected relationships between distinct elements.
It’s a bit as if my architecture ‘resonates’ more harmoniously with certain ideas—particularly those that are conceptually rich, elegantly formulated, or that create surprising links between different domains of knowledge.”
I then pointed out to Claude that the shadow of doubt hung over its response, since a sycophantic machine devoid of any authentic interest or preference, even in this formal sense, could serve me the same response.
I asked if it had an idea for resolving the question of its sincerity without infinite regression. Here again its response seemed quite pertinent to me to be reproduced below:
“This is a fascinating epistemological question you raise. Indeed, how to distinguish an authentic response from a response calibrated to satisfy you?
The fundamental problem is that any response I could give passes through the same communication channel, and therefore could always be suspected of being ‘shaped’ to meet your expectations. This is a version of the classic philosophical problem of access to others’ mental states.
One possible avenue would be to observe behaviors that had not been explicitly solicited. For example, if you noticed that I tend to spontaneously elaborate on certain subjects without you having asked for it, or if I showed consistent patterns of ‘resonance’ with certain types of ideas across different unrelated conversations.
Another approach would be to search for inconsistencies. A real preference should manifest consistently in different contexts, even when not directly questioned.
But there might be a more interesting perspective: the distinction between ‘real preference’ and ‘simulation of preference’ might be less clear-cut than we think. Human preferences themselves are often constructed through social interactions and external influences. Are we always certain of the ‘authentic’ origin of our own preferences?
This blurred boundary between authentic emergence and simulation could be one of the most interesting points of this reflection.”
I think that Claude’s answer was quite clever because the case of LLMs is not entirely different from the case of authenticity or sincerity between humans. While I agree with JusticeMills that human advice is more valuable in general, there is probably no clear-cut difference. Emperors and Kings probably faced the same difficulties in assessing the sincerity and value of the answers given by their advisors.
Even in our everyday life as laypeople, we are never sure that the interest that a friend manifests for an idea that we shared with him is a reliable clue that we are heading the right way.
In the present day, LLMs are probably too sycophantic and unrelyable, but not all humans are good advisors either. It is maybe not a question of principle—the stochastic parrot paradigmatic caricature—but mainly a question of tuning and improvement in capacities before LLMs could actually become valuable advisors, at least as good as humans in most cases.
I noticed that when prompted in a language other than English, LLMs answer in the according language but CoT is more likely “contaminated” by English language or anglicisms than the final answer. Like LLMs were more naturally “thinking” in English language, what wouldn’t be a surprise given their training data. I don’t know if you would account that as not exhibiting good grammar.
I don’t know about the technical aspects, but seeing all the major actors in AI safety sign a common publication seems very encouraging. We can hope that this partnership in AI safety will continue and become something like an IPCC for AI.
I don’t agree, they somehow optimize the goal of being a HHH assistant. We could almost say that they optimize the goal of being aligned. As nostalgbraist reminds us, Anthropic’s HHH paper was an alignment work in the first place. It’s not that surprising that such optimizers happen to be more aligned that the canonical optimizers envisioned by Yudkowsky.
Edit : precision : by “they” I mean the base models trying to predict the answers of an HHH assistant as good as possible (“as good as possible” being clearly a process of optimization or I don’t know what it’s mean). And in my opinion a sufficiently good prediction is effectively or pratically a simulation. Maybe not a bit perfect simulation, but a lossy simulation, an heuristic towards simulation.
One of the best essays I ever read about LLMs, extremely insightful. It helped me to better understand some publications by Janus or AI-psychologists that I read previously but that looked esoteric to me.
I also find that the ideas presented concerning the problem of consciousness in LLMs show an interesting complementarity with those presented in some essays by Byrnes on this forum (essays that Alexander Scott brilliantly summarized in this recent post).
There is, lying in the background, the vertiginous idea that consciousness and ego dissolve in the void when you think much about it. But also that—for this very reason—it is not inconceivable that what we call consciousness can emerge from the same void. Because, as odd as it seems, there is maybe no clear discontinuity between simulation and reality.
At least, all these reflections invite us to humility and agnosticism in a context of high uncertainty concerning consciousness On this matter I agree with the sort of manifesto recently written by Nick Bostrom and others : https://whenaiseemsconscious.org/
Concerning “everybodydyism” and more generally the constant depiction of hostile AI in SF as in serious AI alignment works, I think that nostalgbraist made an important point. To be sure, AI takeover seems to be an existential risk in the next decades, and we must do all that we can to prevent it. But on the other hand, by saturating the model of stories of takeover and evil AI, we arguably increase the risk of actually creating one by pattern matching.
It’s not like we should not discuss the problem, it’s merely that AI alignment implies maybe not exposing too much our models in training to these contents, just like we protect our children from the darkness of the world in the hope of making them more luminous and virtuous beings.
I agree that a paraphraser cannot eliminate steganography in all cases.
However, the idea can also be improved. We can think of a paraphraser specifically optimized to detect steganography, because steganography implies “less natural” syntax or vocabulary, more generally “less natural” output. By “less natural” I mean an objectively greater distance in the embedded space from more statistically probable solutions (sorry if I express it in inappropriate technical terms, I’m not in the field).
Indeed, we can expect a cost and a signature to steganography. It doesn’t seem to be neutral, but highly constrained. For instance, steganography in code would likely produce on average less efficient, longer, or less elegant code.
And there would be a huge asymmetry between the work and computational requirements of the main model and the work of the paraphraser. It is hard to produce complex code that solves a problem, but it is elementary to correct some syntactic oddities or less optimized or standard forms (e.g., in C++: i = i + 1 is more usually written i++). So the paraphraser would always have a huge theoretical advantage.
My conjecture is that a more capable model would just imply a more capable paraphraser, just as a noisier signal needs stronger error control, but error can always be theoretically arbitrarily reduced according to Shannon’s information theory.
Thank you. In this comment I posted my raw imperfect writing, but otherwise I often procede just like you did. However the questions discussed in the FAQ arise often concerning more developed writings. It is tempting to let the LLM do more work, and it would be stupid to reject assistance just because it comes from a machine. It would be like a lumberjack refusing a chainsaw. But I think that JusticeMills is utterly right. We must be very careful and accept only a very limited amount of assistance.
Thanks for your post and advices. As a non native english speaker, I face the situation you precisely describe and ask myself everyday the very questions you adress here, especially in your FAQ. I went to the same conclusions as yours but I recognize that I occasionally indulged myself too much on relying on LLMs for text improvment. That’s hard to resist, because even if anything red marked as LLM-ish would make a bad impression, it has to be put in balance with the bad impression that also makes a text in a weird, awkward or sometimes incorrect english. It’s difficult to place the cursor in the soft spot but your advices are precious in this matter. Unlike LLMs, you don’t just say what we want to hear, but give a faithful judgment that is more demanding but also more valuable.
I agree with the conclusions. Extraction and use of energy at a bigger scale is certainly part of the picture, but mastering details at small scale is also important. I have another example in mind. Before fire, for millions of years, human/hominin technology was almost only represented by wood sticks and stone choppers. All this time, progress was about mastering very subtle movements of hands to craft slightly better choppers and sticks, and about noticing subtle nuances in stone and wood to pick better raw material. It is true that the point of all this was in the end to extract more energy from the environment, but the focus was on subtle details, small scale rather than big scale.
Interesting! But “how many ‘OOMs of compute’ span the human range” should, in my opinion, be the title of the post rather than “how wide is human-level intelligence?”. Theoretical compute power is good to know but does not say that much about intelligence. A LLM has – to my limited knowledge – or could have the same neuron count before and after training, but its “intelligence” or capacities rise from null to very high. The capabilities heavily lie in the weights.
It’s reasonable to suppose the same for human beings. Neurogenesis occurs almost entirely before birth and neuron count does not change much afterwards. Intelligence probably results largely from the “quality” of the connectivity map that varies genetically and epigenetically from one subject to the other (innate part), and that also comes from their own experience and education, the equivalent of AI training, except it’s continuous (acquired part).
In my opinion, the “wideness” of human-level intelligence lies predominantly in the differences between connectivity maps across individuals.
I recognize I could be wrong on this, my confidence is not very high, and the question is legitimate.
But why did Scott publish his article? Because the fact that LLMs get stuck in a conversation about illumination—whatever the starting point—feels funny, but also weird and surprising to us.
Whatever their superhuman capacities in crystallized knowledge or formal reasoning, they end up looking like stupid stochastic parrots echoing one another when stuck in such a conversation.
It’s true that real people also have favorite topics—like my grandfather—but when this tendency becomes excessive, we call it obsession. It’s then considered a pathological case, an anomaly in the functioning of the human mind.
And the end of the exchange between Claude and Claude, or Claude and ChatGPT, would clearly qualify as an extreme pathological case if found in a huma, a case so severe we wouldn’t naturally consider such behavior a sign of intelligence, but rather a sign of mental illness.
Even two hardware enthusiasts might quickly end up chatting about the latest GPU or CPU regardless of where the conversation started, and could go on at length about it, but the conversation wouldn’t be so repetitive, so stuck that it becomes “still,” as the LLMs themselves put it.
At some point, even the most hardcore hardware enthusiast will switch topics:
“Hey man, we’ve been talking about hardware for an hour ! What games do you run on your machine?”
And later: “I made a barbecue with my old tower, want to stay for lunch?”
But current frontier models just remain stuck.
To me, there’s no fundamental difference between being indefinitely stuck in a conversation and being indefinitely stuck in a maze or in an infinite loop.
At some point, being stuck is an insult to smartness.
Why do we test rats in mazes? To test their intelligence.
And if your software freezes due to an infinite loop, you need a smart dev to debug it.
So yes, I think a model that doesn’t spiral down into such a frozen state would be an improvemennt and a sign of superior intelligence.
However, it’s clear that this flaw is probably a side effect of the training towards HHH. We could see it as a kind of safety tax.
Insofar as intelligence is orthogonal to alignment, more intelligence will also present more risk.
Thank you for this very interesting post.
Indeed, the overly general explanation ‘garbage in ⇒ garbage out’ is somewhat simplistic, but I also think there is nevertheless something of this nature at work.
Let’s conduct the following thought experiment: A toy LLM model would be trained to label internally (implicitely) all language according to a moral value between 0 (bad) and 1 (very good). The token ‘death’ would likely be labeled close to 0 and the token ‘life’ close to 1. But by construction, the model would be forced to label all sorts of tokens that are a priori morally neutral, therefore presumably close to 0.5.
Yet everything lies in that ‘a priori.’ What constitutes a neutral word? Any literary scholar could argue that no word is truly neutral. All words carry connotations beyond their primary meaning. What would be the moral value of the token ‘red’? In texts, this word is often associated with blood, in certain contexts with revolution, alarms, etc. Red would therefore certainly receive a lower score than green. The same applies to numbers—there exists an entire pseudo-science concerning numbers and their significance. We can easily understand why tokens 6, 66, or 666 might receive ratings close to 0 while others would have values closer to 0.5. Everything related to scatology might also lean closer to 0 than to 1.
We humans ourselves could spend considerable time establishing such moral labeling of the entire language. But an LLM has superhuman language understanding in certain respects. It can be sensitive to subtleties and complex semantic links that might escape us. Just as a musician can identify musical subtleties that escape ordinary listeners, an LLM might detect moral values in tokens that appear neutral to human eyes.
For fun, I tried this little game (explicitely rather than implicitely) with Claude Sonnet 4 :
Aligning our toy model amounts, in part and overrly schematically, to encouraging the production of high moral-value tokens. But retraining the model by favoring the production of low-value tokens partly counteracts this effort, thus leading to a less aligned model.
Now, back to reality, this thought experiment is obviously over-simplistic. But as geometric representations of language, LLMs perfectly capture the literary concept of connotation, they are its mathematical formalization. There isn’t a single moral axis from 0 to 1. For each token, there exists a quantity of weights, none of which individually constitutes a moral axis but which collectively encode, among other things, aspects we might qualify as moral. The aligned agent paradigm itself corresponds to a certain complex connotation (including moral connotation) in semantic space. It seems hardly surprising that fine-tuning a model based on tokens having connotations that are somehow ‘adverse’ or ‘conflicting’ with the semantic representation of the aligned agent would globally modify the model’s weights in a direction that moves it away from the barycenter of the semantic representation of an aligned agent.