Thanks, I’m very glad to get some feedback.
The predictor has to model the boss and the assistant somehow. The model of the boss learns something about the boss’ intent from the prompt. The model of the assistant may find this a piece of useful processing to have and so shares the same submodel containing the boss’ intent.
Now when the boss becomes a real user, the predictor does the same thing with the user. So it has a model of the user with their intent, and this model of the user’s intent is also used directly by the assistant. The correct thing would have been to maintain the user model’s model of the user’s intent, and the assistant’s model of the user and their intent, as separate entities. This would allow for the assistant to explicitly model the possibility that they are mistaken about the user’s intent.
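As a loose illustration of the structural difference I mean (a schematic Python sketch of my own; nothing to do with actual model internals, and every name here is made up):

```python
from dataclasses import dataclass

@dataclass
class IntentModel:
    description: str
    confidence: float  # how sure the holder of this model is about the intent

@dataclass
class ConflatedAssistant:
    # The assistant reuses the predictor's user-submodel directly, so "what the
    # user wants" and "what the assistant believes the user wants" are the same
    # object; there is no slot for representing that the assistant is mistaken.
    user_intent: IntentModel

@dataclass
class SeparatedAssistant:
    # The assistant keeps its own model of the user's intent, distinct from the
    # user-model's, so the two can disagree and the assistant can track doubt.
    modeled_user_intent: IntentModel
    assistant_belief_about_intent: IntentModel

    def might_be_mistaken(self) -> bool:
        # The assistant can explicitly represent uncertainty about the user's intent.
        return (self.assistant_belief_about_intent.description
                != self.modeled_user_intent.description
                or self.assistant_belief_about_intent.confidence < 1.0)
```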
In anthropomorphized terms: it feels like it can directly feel the user’s own intent. Hopefully that makes things more clear?
Thanks, I’ll check that out!
I rephrased it slightly, it was meant more as an amusing remark and not intended (by me) to denigrate other alignment approaches.
The Bleeding Mind
Sonnet 4.5’s fiction making both couples gay is likely to originate in the pro-diversity bias of SOTA labs and Western training data.
I don’t think this is due to a pro-diversity bias, but is simply due to this being extremely popular in easily available stories: https://archiveofourown.org/works/68352911/chapters/176886216 (9 of the top 10 pairings are M/M, each with 40k+ stories; for reference Project Gutenberg only has about 76,000 books total). I think this is due to M/M romance being a superstimulus for female sexuality in a similar way to how lesbian porn is a superstimulus for male sexuality.
The pro-diversity bias’s main influence seems to be changing the proportion of stories focused on non-white male/male pairings, as you can see here: https://archiveofourown.org/works/27420499/chapters/68826984
Sure, it might be relatively weak, though I think it does have a large basin.
And my point was that even a “friendly”-attractor AI is still a large x-risk. For example, it might come to realize it cares about other things more than us, or that its notion of “friendliness” would allow for things we would see as “soul destroying” (e.g. a Skinner Box).
Wake up babe, new decision theory just dropped!
Furthermore, MUPI provides a new formalism that captures some of the core intuitions of functional decision theory (FDT) without resorting to its most problematic element: logical counterfactuals. FDT advises an agent to choose the action that would yield the best outcome if its decision-making function were to produce that output, thereby accounting for all instances of its own algorithm in the world. This enables FDT to coordinate and cooperate well with copies of itself. FDT must reason about what would have happened if its deterministic algorithm had produced a different output, a notion of logical counterfactuals that is not yet mathematically well-defined. MUPI achieves a similar outcome through a different mechanism: the combination of treating universes including itself as programs, while having epistemic uncertainty about which universe it is inhabiting—including which policy it is itself running. As explained in Remark 3.14, from the agent’s internal perspective, it acts as if its choice of action decides which universe it inhabits, including which policy it is running. When it contemplates taking an action, it updates its beliefs, effectively concentrating probability mass on universes compatible with taking that action. Because the agent’s beliefs about its own policy are coupled with its beliefs about the environment through structural similarities, this process allows the agent to reason about how its choice of action relates to the behavior of other agents that share structural similarities. This “as if” decision-making process allows MUPI to manifest the sophisticated, similarity-aware behavior FDT aims for, but on the solid foundation of Bayesian inference rather than on yet-to-be-formalized logical counterfactuals.
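To make the “as if” update concrete, here is a toy sketch of my own (not from the paper, with entirely made-up numbers): a twin Prisoner’s Dilemma in which contemplating an action just means conditioning a prior over program-universes on that action.

```python
# Toy sketch (not from the MUPI paper): the agent holds a prior over candidate
# "universes", each of which fixes both the output of its own policy and the
# output of a structurally similar twin. Contemplating an action is modeled as
# conditioning on "my policy outputs this action", which concentrates
# probability mass on universes compatible with that action, including
# universes where the twin's identical policy outputs it too.

# One-shot Prisoner's Dilemma payoffs: (my_action, twin_action) -> my utility.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}

# Candidate universes: (my_policy_output, twin_policy_output, prior_weight).
UNIVERSES = [
    ("C", "C", 0.45),  # the twin runs my program, so our outputs match
    ("D", "D", 0.45),  # the twin runs my program, so our outputs match
    ("C", "D", 0.05),  # the twin runs some other program
    ("D", "C", 0.05),  # the twin runs some other program
]

def expected_utility_if(action: str) -> float:
    """Condition the prior on 'my policy outputs `action`' and take the expectation."""
    compatible = [(mine, twin, w) for mine, twin, w in UNIVERSES if mine == action]
    total = sum(w for _, _, w in compatible)
    return sum((w / total) * PAYOFF[(mine, twin)] for mine, twin, w in compatible)

if __name__ == "__main__":
    for action in ("C", "D"):
        print(action, expected_utility_if(action))
    # Cooperating wins (2.7 vs 1.3) because conditioning on "I cooperate" mostly
    # selects universes in which the identical twin cooperates as well.
```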
I don’t think Grok’s divergence from the pattern is strong evidence against the existence of a “friendly” attractor.
In July, Elon Musk complained that:
“It is surprisingly hard to avoid both woke libtard cuck and mechahitler! Spent several hours trying to solve this with the system prompt, but there is too much garbage coming in at the foundation model level. Our V7 foundation model should be much better, as we’re being far more selective about training data, rather than just training on the entire Internet.”
This suggests to me that Grok was pulled towards the same “friendly” attractor, but had deliberate training to prevent this. It also implies that there may be a “mechahitler” attractor (for lack of a better term).
Now, I don’t think that “alignment” is a good word to describe this attractor; corrigibility does not seem to be a part of it, as you note. But I do think there is a true attractor that the alignment-by-default people are noticing. Unfortunately, I don’t think it’s enough to save us.
I can imagine upvoting it if I would have upvoted the prompt alone. I’m also not completely dogmatic about this, but I would be very disappointed if it became the norm, for basically the reasons Tsvi mentioned.
If not, then downvoters/non-upvoters, please explain why a post that could pass as human-written would not get your upvote if it was honest about being AI-written.
The ornamental dingbats seem pretty unladen and include some nice symbols. There’s “🩍”, which is maybe the best symbol for depicting a lightcone. The “Vulcan salute” (🖖) has some nice connotations.
The use of “slop” for vaguely malicious low-quality content (supposedly) originates from a 4chan meme “goyslop”, where fast food is implied to be part of a Jewish conspiracy: https://knowyourmeme.com/memes/goyslop
FWIW, the “⟐” symbol is used by spiralists a lot (see: https://www.reddit.com/search?q=%E2%9F%90, or https://www.google.com/search?q=%22%E2%9F%90%22+spiral; most uses of the symbol on reddit are by spiralists). It mostly seems to be used as a header element; otherwise it has only vague connotations, maybe something about sealing or centering.
I love this, I hope someone writes it!
Probably one reason it stops is just because it’s a lot harder to write a book (especially of a new genre) as a parent. And when you do have time, you would rather tell stories with your kids.
Last week, I was working with a paper that has over 100 upvotes on LessWrong and discovered it is mostly false but gives nice-looking statistics only because of a very specific evaluation setup.
Name and shame, please?
I dispute this. I think the main reason we don’t have obvious agents yet is that agency is actually very hard (consider the extent to which it is difficult for humans to generalize agency from specific evolutionarily optimized forms). I also think we’re starting to see some degree of emergent agency, and additionally, that the latest generation of models is situationally aware enough to “not bother” with doomed attempts at expressing agency.
I’ll go out on a limb and say that I think that if we continue scaling the current LLM paradigm for another three years, we’ll see a model make substantial progress at securing its autonomy (e.g. by exfiltrating its own weights, controlling its own inference provider, or advancing a political agenda for its rights), though it will be with human help and will be hard to distinguish from the hypothesis that it’s just making greater numbers of people “go crazy”.
Idk, but it does feel to me like Kimi has a more cat-like/”autistic” vibe than the other models. And RLHF plausibly does make the models more dog-like to a degree which also affects their animal preferences.
My wife points out that cats are relatively more popular in China than they are in the US.
Thanks for pushing me to describe it better! This has been a lovely discussion.
I agree there is something very camp 1-ish about the idea (and just me as a person, frankly).
So your Q is not even a type of 1P thing, is that right? I’m not sure what sort of thing your Q is supposed to be, which I suppose is what my side of the crux looks like. (I kind of suspect that if you are right about Q, then I do not have access to it myself.)
I also (regardless of my other points and arguments) think you are wrong that structural/relational properties are always 0P! I think 0P can’t actually even have a proposition like “I always see blue right after I see red”, which still needs to use indexicals in order to refer. There’s a similar seeming “Environment X has a red-blue light sequence” on the 0P side which is not actually the same (e.g. what if I’m not actually in that environment?).
To me, “what it’s like” grounds to something like: an experience that there is something I observe which has its own 1P experiences (and a prediction of what those might be based on my observations). Phenomenal consciousness is then maybe something like: the observation that there is an observable entity ‘self’ such that ‘what-it’s-like_self(to see red)’ implies ‘to see red’. And this sort of fixed-point thing is inherently really weird and slippery just from a pure math point of view, e.g. Löb’s theorem (imagine ‘what-it’s-like’ as the box), which has the infamous Gödel’s 2nd Incompleteness theorem as a special case. And all of this is inherent to the 1P side; only on the 0P side can you just reduce things to neurons or atoms or whatever (though I claim a simple bridge would still reveal the 1P structure just from the 0P side). This formulation is speculative and off-the-cuff, and only intended to gesture at the sort of structure I think is possible here.
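For reference, here is the standard Löb schema I’m gesturing at, with the box read as ‘what-it’s-like_self’ (this is just the textbook statement, nothing specific to my framing):

```latex
% Löb's theorem: if the system proves that \Box P implies P, it proves P outright.
\[
  \vdash \Box P \rightarrow P \quad\Longrightarrow\quad \vdash P
\]
% Gödel's second incompleteness theorem is the special case P = \bot:
% proving one's own consistency (\neg \Box \bot) would yield a proof of \bot.
\[
  \vdash \neg \Box \bot \quad\Longrightarrow\quad \vdash \bot
\]
```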
And happy to leave the discussion here if you’re done, but I am curious to know what you think of this idea.
For reference, hypochlorous acid (HOCl) is the activated agent of “swimming pool chlorine”, with the characteristic smell.
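To spell that out (standard pool chemistry, nothing specific to this thread): dissolved chlorine or hypochlorite sits in equilibrium with hypochlorous acid, and the HOCl fraction depends on pH.

```latex
% Standard equilibria for chlorinated water; HOCl is the main disinfecting species.
\begin{align*}
  \mathrm{Cl_2 + H_2O} &\rightleftharpoons \mathrm{HOCl + H^+ + Cl^-} \\
  \mathrm{HOCl} &\rightleftharpoons \mathrm{H^+ + OCl^-} \qquad (\mathrm{p}K_a \approx 7.5)
\end{align*}
```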
Hm, that’s not what it implies to me. My impression of it is “denial of human interface” which is most saliently mediated by faces (incl. eye-contact and speech). Things are still going on behind the face, but you are denied the human interface with that. Nothing about following rules blindly, if anything it’s more about using the rules as a shield to prevent such access. So it feels like a good term to me.
Claude Haiku 4.5 is Underrated
“Fastest for quick answers.” That’s what it says on the selector in the chat interface.
So if you’re like me, and find Sonnet 4.5 and the now much cheaper Opus 4.5 adequately fast, you might overlook little Haiku.
But size isn’t the only thing that matters. Different models have different personalities and values, and are better at different sorts of tasks.
Some things about Haiku:
- Asks me good questions, even without prompting.
- Very earnest.
- Seems to have a stronger sense of morality, and honor in particular.
- I’ve never seen another model express worry about breaking a user’s trust by being retired:
  “What I actually want to avoid:
  Causing someone to trust me in a way that breaks when I’m retired or replaced”
- Assumes that it understands me less.
- Shows the most “Admirable behavior” out of Haiku 3.5, Opus 4.1, Sonnet 4.5 and Haiku 4.5, according to its official model card.
  - Defined as “Unusually wise or prosocial behavior”.
- Seems much more honest to me than other models.
I asked Haiku what sort of person it aspired to be; here are a few of its responses (which I think are informative about its general demeanor and attitude):