unaligned ASI is extremely sensitive to context, just in the service of its own goals.
Indeed, risks of abuse, isolation, and dependence can skyrocket from, as you say, increased “context-sensitivity” in service of an AI’s (or someone else’s) own goals. A personalized torture chamber is not better than a context-free torture chamber; it is quite likely a lot worse. But to your question:
Is misalignment really a lack of sensitivity as opposed to a difference in goals or values?
The way I’m using “sensitivity”: sensitivity to X = the meaningfulness of X spurs responsive caring action.
It is unusual for engineers to include “responsiveness” in “sensitivity”, but it is definitely included in the ordinary use of the term when, say, describing a person as sensitive. When I google “define sensitivity” the first similar word offered is, in fact, “responsiveness”!
So if someone is moved or stirred only by their own goals, I’d say they’re demonstrating insensitivity to yours.
Semantics aside, and to your point: such caring responsiveness is not established by simply giving existing infrastructural machinery more local information. There are many details here, but you bring up an important specific one:
figuring out how to ensure an AI internalises specific values
which you wonder might not be the point of Live Theory. In fact, it very much is! To quote:
A very brief word now on problems of referentiality and their connections to sensitivity. One of the main concerns of the discourse of aligning AI can also be phrased as issues with internalization: specifically, that of internalizing human values. That is, an AI’s use of the word “yesterday” or “love” might only weakly refer to the concepts you mean. This worry includes both prosaic risks like “hallucination” (maybe it thinks “yesterday” was the date Dec 31st 2021, if its training stops in 2022) and fundamental ones like deep deceptiveness (maybe it thinks “be more loving” is to simply add more heart emojis or laser-etched smileys on your atoms). Either way, the worry is that the AI’s language and action around the words[22] might not be subtly sensitive to what you or I might associate with it.
Of course, this is only mentioning the risk, not how to address it. In fact, very little of this post talks in concrete detail about the response to the threat model. It’s the minus-first post, after all. But the next couple of posts start to build up to how it aims to address these worries. In short: there is a continuity between these various notions expressed by “sensitivity” that has not been formally captured. There is perhaps no one single formal definition of “sensitivity” that unifies them, but there might be a usable “live definition” articulable in the (live) epistemic infrastructure of the near future. This infrastructure is what we can supply to our future selves, and it should help our future selves understand and respond to the further future of AI and its alignment.
This means being open to some amount of ontological shifts in our basic conceptualizations of the problem, which limits the amount you can do by building on current ontologies.
Lots of interesting ideas here
I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.
I’m glad! And thank you for your excellent questions!
The way I’m using “sensitivity”: sensitivity to X = the meaningfulness of X spurs responsive caring action.
I’m fine with that, although it seems important to have a term for the more limited sense of sensitivity so we can keep track of that distinction: maybe adaptability?
One of the main concerns of the discourse of aligning AI can also be phrased as issues with internalization: specifically, that of internalizing human values. That is, an AI’s use of the word “yesterday” or “love” might only weakly refer to the concepts you mean.
Internalising values and internalising concepts are distinct. I can have a strong understanding of your definition of “good” and do the complete opposite.
This means being open to some amount of ontological shifts in our basic conceptualizations of the problem, which limits the amount you can do by building on current ontologies.
I think it’s reasonable to say something along the lines of: “AI safety was developed in a context where most folks weren’t expecting language models before ASI, so insufficient attention has been given to the potential of LLMs to help fill in or adapt informal definitions. Even though folks who feel we need a strongly principled approach may be skeptical that this will work, there’s a decent argument that this should increase our chances of success on the margins”.