Sahil
This section has some writing you might find useful, drawn from what is published. Excerpted below:
Live theory in oversimplified claims:
Claims about AI.
Claim: Scaling in this era depends on replication of fixed structure. (eg. the code of this website or the browser that hosts it.)
Claim: For a short intervening period, AI will remain only mildly creative, but its cost, latency, and error-rate will go down, causing wide adoption.
Claim: Mildly creative but very cheap and fast AI can be used to turn informal instruction (eg. prompts, comments) into formal instruction (eg. code)
Implication: With cheap and fast AI, scaling will not require universal fixed structure, and can instead be created just-in-time, tailored to local needs, based on prompts that informally capture the spirit.
(This implication I dub “attentive infrastructure” or “teleattention tech”, that allows you to scale that which is usually not considered scalable: attentivity.)
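The claims above can be sketched concretely. The following is a minimal, hypothetical illustration (not the author's implementation): instead of replicating one fixed structure to every user, a cheap and fast model turns an informal prompt into formal code just-in-time, tailored to local context. The `model_generate` function is an assumed stand-in for a code model, stubbed with canned responses so the sketch runs.

```python
# Sketch of "attentive infrastructure": formal structure (code) is
# generated just-in-time from an informal prompt plus local context,
# rather than replicated as one universal fixed artifact.

def model_generate(informal_prompt: str, local_context: dict) -> str:
    # Hypothetical stand-in for a cheap, fast, mildly creative model.
    # Stubbed here with canned outputs to keep the sketch runnable.
    if local_context.get("locale") == "de":
        return "def greet(name): return f'Hallo, {name}!'"
    return "def greet(name): return f'Hello, {name}!'"

def attentive(informal_prompt: str, local_context: dict):
    # Informal instruction in, locally tailored formal structure out.
    source = model_generate(informal_prompt, local_context)
    namespace = {}
    exec(source, namespace)  # materialize the just-in-time structure
    return namespace["greet"]

greet_de = attentive("greet the user warmly, in their language", {"locale": "de"})
greet_en = attentive("greet the user warmly, in their language", {"locale": "en"})
print(greet_de("Ada"))  # Hallo, Ada!
print(greet_en("Ada"))  # Hello, Ada!
```

The point of the sketch is the shape of the pipeline, not the stub: the prompt informally captures the spirit, and each locale gets its own freshly generated structure rather than a shared parametrized one.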
Claims about theorization.
Claim: General theories are about portability/scaling of insights.
Claim: Current generalization machinery for theoretical insight is primarily via parametrized formalisms (eg. equations).
Claim: Parametrized formalisms are an example of replication of fixed structure.
Implication: AI will enable new forms of scaling of insights, not dependent on finding common patterns. Instead, we will have attentive infrastructure that makes use of “postformal” theory prompts.
Implication: This will enable a new kind of generalization, one that can create just-in-time local structure tailored to local needs, rather than going via deep abstraction. It also enables a new kind of specialization via “postformal substitution” that can handle more subtlety than formal operations, thereby broadening the kind of generalization.
These leave out the relevance to risk. That’s the job of this paper: Substrate-Sensitive AI-risk Management. Let me know if these in combination lay it out more clearly!
> unaligned ASI is extremely sensitive to context, just in the service of its own goals.
Risks of abuse, isolation and dependence can skyrocket indeed, from, as you say, increased “context-sensitivity” in service of an AI/someone else’s own goals. A personalized torture chamber is not better, but in fact quite likely a lot worse, than a context-free torture chamber. But to your question:
> Is misalignment really a lack of sensitivity, as opposed to a difference in goals or values?
The way I’m using “sensitivity”: sensitivity to X = the meaningfulness of X spurs responsive caring action.
It is unusual for engineers to include “responsiveness” in “sensitivity”, but it is definitely included in the ordinary use of the term when, say, describing a person as sensitive. When I google “define sensitivity” the first similar word offered is, in fact, “responsiveness”!
So if someone is moved or stirred only by their own goals, I’d say they’re demonstrating insensitivity to yours.
Semantics aside, and to your point: such caring responsiveness is not established by simply giving existing infrastructural machinery more local information. There are many details here, but you bring up an important specific one:
> figuring out how to ensure an AI internalises specific values
which you wonder if is not the point of Live Theory. In fact, it very much is! To quote:
> A very brief word now on problems of referentiality and their connections to sensitivity. One of the main concerns of the discourse of aligning AI can also be phrased as issues with internalization: specifically, that of internalizing human values. That is, an AI’s use of the word “yesterday” or “love” might only weakly refer to the concepts you mean. This worry includes both prosaic risks like “hallucination” (maybe it thinks “yesterday” was the date Dec 31st 2021, if its training stops in 2022) and fundamental ones like deep deceptiveness (maybe it thinks “be more loving” is to simply add more heart emojis or laser-etched smileys on your atoms). Either way, the worry is that the AI’s language and action around the words[22] might not be subtly sensitive to what you or I might associate with it.
Of course, this only mentions the risk, not how to address it. In fact, very little of this post talks about concrete details of the response to the threat model. It’s the minus-first post, after all. But the next couple of posts start to build up to how it aims to address these worries. In short: there is a continuity between these various notions expressed by “sensitivity” that has not been formally captured. There is perhaps no one single formal definition of “sensitivity” that unifies them, but there might be a usable “live definition” articulable in the (live) epistemic infrastructure of the near future. This infrastructure is what we can supply to our future selves, and it should help our future selves understand and respond to the further future of AI and its alignment.
This means being open to some amount of ontological shifts in our basic conceptualizations of the problem, which limits the amount you can do by building on current ontologies.
Lots of interesting ideas here
I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.
I’m glad! And thank you for your excellent questions!
Thank you, Dusan!
Next time there will be more notice, and also a more refined workshop!
Great! I’d love to have included a remark that one, as a human, might anticipate forward-chainy/rational reasoning in these systems because we’re often taking the “thought” metaphor seriously/literally in the label “chain-of-thought”, rather than backwardy/rationalization “reasoning”.
But since it is at least somewhat intelligent/predictive, it can make the move of “acausal collusion” with its own tendency to hallucinate, in generating its “chain”-of-”thought”. That is, the optimization to have chain-of-thought in correspondence with its output can work in the backwards direction, cohering with bad output instead of leading to better output, a la partial agency.
(Admittedly human thoughts do a lot of rationalization as well. So maybe the mistake is in taking directionality implied by “chain” too seriously?)
Maybe this is obvious, but it could become increasingly reckless not to notice when you’re drawing the face of “thoughts” or “chains” on CoT shoggoth-movements. You can be misled into thinking that the shoggoth is less able to deceive than it actually is.
Less obvious but important: in the reverse direction, drawing “hacker faces” on the shoggoth, as in the case of the Docker hack (section 4.2.1), can mislead into thinking that the shoggoth “wants” to or tends to hack/undermine/power-seek more than it actually, independently does. It seems at least somewhat relevant that the Docker vulnerability was exploited for a challenge that was explicitly about exploiting vulnerabilities. Even though it was an impressive meta-hack, one must wonder how much this was cued by the prompt and therefore is zero evidence for an autonomy telos—which is crucial for the deceptive optimizer story—even though mechanistically possible.
(The word “independently” above is important: if it takes human “misuse”/participation to trigger its undermining personas, we also might have more of a continuous shot at pausing/shutdown or even corrigibility.)
I was going to post this as a comment, but there’s also an answer here: I’d say calling o1 “deceptive” could be as misleading as calling it aligned if it outputs loving words. It has unsteady referentiality, at least from the POV of the meanings of us life-forms. Even though it has some closeness to our meanings and referentiality, the quantity of the unsteadiness of that referentiality can be qualitative. Distinguishing “deceptively aligned mesa-optimizer” from “the tentacles of the shoggoth I find it useful to call ‘words’ don’t work like ‘words’ in some annoying ways” is important, in order to protect some of that (quantitatively-)qualitative difference. Both for not dismissing risks and for not hallucinating them.
Understandable attempt by Chalmers, but I’d say that bit, at least, is opposite to the direction of clarity.
The idea
> if it is behaviorally interpretable as believing that p
reinforces that even if we can’t rely on “beliefs” of AI systems to mean what they usually mean, we can rely on “behavior” of AI systems to mean what it usually means, typically with humans or some other animal as the reference class. You might try to fix that with the same trick, adding a quasi- prefix to behavior and calling it “quasi-behavior”, but then you have to specify what your new grounding for quasi-behavior is. And so on.
It feels tempting to use—or no, it feels unfair to be denied—some handle that serves the felt sense of “But when I interact with Claude, it is very useful and predictive to see it as ‘planning’ to troubleshoot X and ‘believing’ that some file is in some folder. Isn’t it better for me to flag with quasi- how it’s sort of true and sort of false?”
The problem with “quasi-” is that it is trying to avoid the spikiness/jaggedness/alienness of what we might call AI minds, whereas good frames and vocabulary should remind us to be constantly vigilant about the differences in different contexts. That we can’t get away with “sort of true and sort of false.” That instead, we should be paying attention to the fine-grained differences in each context, and how extrapolation will fail. That’s how you respect the alien-ness.
In the link, Chalmers dismisses such concerns:
> An opponent might deny that LLMs have quasi-beliefs or quasi-desires on the grounds that LLM behavior is unstable, or non-humanlike, or otherwise defective in a way that means that the LLM is not even usefully interpretable in terms of beliefs or desires. [...] A core of consistency is enough for interpretation to get a grip in ascribing numerous quasi-beliefs and quasi-desires, even though there will be domains where they lack these states on grounds of inconsistency. Overall I think that experience with current LLMs suggests that there is enough of a consistent core to support a reasonably extensive core of quasi-beliefs.
A better analogy that I’ve proposed before is rationalization. Calling rationalization “quasi-rationality” makes the absurdity clearer. Rationalization isn’t sort-of rational. Rationalization doesn’t “play some key roles” of rationality. Rationalization and rationality do not share a “core” that is usefully co-extensive.
Don’t underestimate the adversarial institutional reification of anthropomorphism here. Don’t mistake anti-inductive nature for harmless un-inductive nature or, worse, inductive nature. That’s like mistaking rationalization for mere noise, or worse, essentially rationality. Rationalization is a kind of referential parasitism on the phenomena of rationality, and the reason to consider it adjacent to rationality is only to be watchful of how it is cleverly simulating your familiar notion.
This is the Sharp Left Turn of referential alignment. Don’t fall for the similarity. AI minds and bodies do not refer like human minds and bodies do. Our referential activity may be very similar up to a point (and incrementally duct-taped and patched to fix any seeming discrepancies) and then totally bizarre beyond specific contexts. Reliance on some “core” will create bad shocks.
I’ve critiqued elsewhere the dependence on northstars of “cores”, “invariants”, “convergences” in general, as only being able to deal with the intersection of phenomena. Hopefully this becomes more compelling as alternative methodologies become possible with AI-assistance, as outlined somewhat in the previous link. (Also more compelling as AI systems prove our existing models and metaphors cannot be simply repurposed with minor modifications.) Instead of talking about quasi-beliefs, you might create a label for the X-belief for each different context X, that may have extremely specific connections and disconnections with the various implications we tend to assume for beliefs. This would require tracking “disorders”, where AI systems absurdly do only some of the things that you would normally do with “beliefs” and “selfs”.
The commitment to non-anthropomorphism is more clearly now an ongoing practice beyond words, not something we can do with abstract analysis or redefining terms. It will soon be as hard as or harder than any other systemic issue today, with subtle ideological collusion to keep you convinced that the artificial substitute is basically no different than the real thing and how any toxic seams will be ironed out in v0.4.
(Don’t mistake this as being apathetic to potential machine suffering. On the contrary, this vigilance of our projections should mitigate issues of reverse alignment—where we assume happy text content output is synonymous with machine welfare but they’re suffering inside. Dealmaking proposals are often great examples of this unthinkingness.)