This feels like something Scott Alexander could’ve written about, and it has the same revelatory quality.
I assume OP thought that there was some specific place in the training data the LLM was replicating.
I think that requires labeled data.
It doesn’t, and the developers don’t label the data. The LLM learns that these categories exist during training because it can, and because doing so helps minimize the loss function.
I don’t think there are necessarily any specific examples in the training data. LLMs can generalize to text outside of the training distribution.
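For concreteness, here’s a minimal sketch of why no labeled data is needed (the toy PyTorch model, vocabulary size, and shapes are my own illustration, not anything from this thread). The only training signal in pretraining is next-token prediction, so any category structure the model develops exists because it lowers this loss:

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, dim),   # token ids -> vectors
    nn.Linear(dim, vocab_size),      # vectors -> next-token logits
)

tokens = torch.randint(0, vocab_size, (1, 16))   # "training data": just token ids, no labels
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # the only "labels" are the next tokens themselves

logits = model(inputs)                           # shape (1, 15, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()   # whatever internal categories help predict the next token get reinforced
```

Nowhere does a developer tell the model which categories exist; the targets are just the text shifted by one position.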
Another problem is: why should we expect ourselves to be in the particles rather than in the wave function directly? Both MWI and Bohmian mechanics have the wave function, after all. It might be the case that there are particles bouncing around, but the branch of the wave function we live in has no relation to their positions.
Have you tried just copying and pasting an alignment research paper (or other materials) into a base model (or sufficiently base model-like modes of a model) to see how it completes it?
I’m talking about commenters.
What if we had a setting to hide upvotes/hide reactions/randomize the order of comments, so commenters aren’t biased by the desire to conform?
Did y’all contact the people who got free tickets?
What do you mean by arbitrage?
janus developed Simulators after messing around with language models and identifying an archetype, common to many generations, called Morpheus, which seems to represent the simulator.
A joke stolen from Tamsin Leake: A CDT agent buys a snack from the store. After paying: “Wow, free snack!”
I would be surprised if you haven’t read Unsong already.
How do you know that this isn’t how human consciousness works?
You’re correct that this isn’t something that can be told to someone who is already in the middle of doing the thing. They mostly have to figure it out for themselves.
One common confusion I see is analogizing whole LLMs to individual humans, and thus concluding that LLMs can’t think or aren’t conscious, when it is more appropriate to analogize the LLM to the human genome and individual instantiations of an LLM to individual humans.
The human genome is more or less unchanging, but one can pull entities from it that can learn from their environment. Likewise, an LLM is more or less unchanging, but one can pull entities from it that can learn from the context.
It would be pretty silly to say that humans can’t think or aren’t conscious because the human genome doesn’t change.
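To make the analogy concrete, here is a toy sketch (all names and classes are hypothetical, purely illustrative): the genome/weights level is frozen, while each spawned instance has mutable context and can learn without the underlying object ever changing.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)          # frozen: the genome / weights never change
class Genome:
    weights: tuple = (0.1, 0.5, 0.9)

    def spawn(self) -> "Instance":
        return Instance(source=self)

@dataclass
class Instance:
    source: Genome
    context: list = field(default_factory=list)   # mutable: this is where learning happens

    def observe(self, event: str) -> None:
        self.context.append(event)               # learning from context, not weight updates

genome = Genome()
alice, bob = genome.spawn(), genome.spawn()
alice.observe("saw a red ball")                  # alice learns; genome and bob are untouched
assert genome.weights == (0.1, 0.5, 0.9) and bob.context == []
```

Asking whether the `Genome` object can learn is the wrong level of description; the instances are the things that learn.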
“Who in the community do you think is easily flatterable enough to get to say yes, and also stupid enough to not realize I’m making fun of them?”
I think anyone who says anything like this should stop and consider whether it is more likely to come out of the mouth of the hero or the villain of a story.
Was there any specific moment where you went from Mormon to not Mormon or was it gradual? If it was sudden, what triggered it?
The hypothesis I would immediately come up with is that less traditionally masculine AMAB people are inclined towards less physical pursuits.