AI Alignment | Coding | Bay Area
Messing around w/ AI tooling for thinking, epistemics.
Open to collaboration. Reach out!
Not currently, but this is some kind of brute-force scaling roadmap for one of the major remaining unhobblings, so it has timeline implications.
I suppose I’m unsure how fast this can be scaled. I don’t have a concrete model here though, so it’s probably not worth trying to hash out.
It doesn’t necessarily need as much fidelity: retaining the mental skills but not the words.
I’m not sure that the current summarization/searching approach is actually analogous to this. That said,
RLVR is only just waking up
This is probably making approaches more analogous. So fair point.
I would like to see the updated Ruler metrics in 2026.
Any specific predictions you have on what a negative v. positive result would be in 2026?
A hallmark of humanity is seeing goodness in others.
I generally think of humanity as being and acting in <good, virtuous> ways. I believe this without direct evidence. I’ve never been in someone else’s head, and likely never will be. [1]
The main proxy I have is what I read. From the perspectives of authors, characters will be virtuous, make morally good decisions, and deliberate.
I often find that I (and the decisions I make) don’t feel as virtuous.
It seems plausible that writers don’t feel this way either, and are imagining characters that are morally better than they are. Maybe it’s all a shell game.
This might sound bad. I don’t think it is.
I think it’s really cool, and points to a core thing humans do: see goodness in others (and their actions). We see it in the worst of people. We see it in decisions we don’t understand.
When we talk about human values, this seems under-looked [2]. We have joy, exploration, relationships, etc. Maybe this is under-looked because it’s a little meta, or circular?
If I were forced to specify one value to a superintelligent optimizer, to make sure that human values / humanity carried on into the future …
I think this would be a pretty good contender.
[1] You could imagine stripping away interfaces between people—e.g. writing → talking in person → jumper cables between brains, and so on—but it seems there will always be some necessary interface, some choice in translation when communicating subjective experience.
[2] Insofar as anything in this space is under-looked (classic “universal claim” caveat).
Unless I’m totally off-base here, 15M sounds incredibly high for actually useful recall.
This is the best source I know about for measuring model context length.
Obviously I don’t know about private models, but based on the delta between claimed vs. actual, I’m pretty suspicious that actually useful context length is currently longer than a few hundred thousand tokens.
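To make “actually useful recall” concrete, here’s a minimal needle-in-a-haystack probe — a rough sketch rather than RULER itself; the model name, filler text, and passphrase are all placeholders:

```python
# Minimal needle-in-a-haystack recall probe (illustrative sketch, not RULER).
# Assumes the OpenAI Python client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def recall_probe(model: str, approx_context_tokens: int) -> bool:
    # Each filler sentence is roughly 10 tokens, so the length is only approximate.
    filler = "The sky was grey and the meeting ran long. " * (approx_context_tokens // 10)
    needle = "The secret passphrase is 'violet-kumquat-7'."
    # Bury the needle roughly in the middle of the haystack.
    haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2:]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": haystack + "\n\nWhat is the secret passphrase?"}],
    )
    return "violet-kumquat-7" in (response.choices[0].message.content or "")

# e.g. recall_probe("gpt-4o", 100_000)  # long contexts cost real money
```

RULER runs a much more careful version of this (multiple needles, aggregation, QA over distractors), which is exactly where the claimed-vs.-actual gap shows up.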
Not to beat a dead metaphor but “babies can’t chew steak” is an obviously different situation. Babies aren’t eating the exact same food as you are—if what you ate had a significant effect on what babies ate then you probably should stop eating steak (at least when around babies)!
Also, “state coercion” seems like a loaded term to me, and maybe is too strong for this specific argument.
Anything less than that is a compromise between my values and the values of society.
I think there’s more leeway here. E.g. instead of a copy of you, a “friend” ASI.
I would much rather live in a society of some discord between many individually aligned ASI’s, than build a benevolent god
A benevolent god that understands your individual values and respects them seems pretty nice to me, especially compared to a world of competing, individually aligned ASIs (if your values are in the minority).
(Epistemic status: having a hard time articulating this, apologies)
The vibe (eggSyntax’s) I get from this post & responses is ~yes, we can explain these observations with low-level behavior (sophisticated predictors, characters based on evidence from pre-training, active inference), but it’s hard to use this to describe these massively complex systems (analogous to fluid simulation).
Or it seems like—people read this post, think “no, don’t assign high level behavior to this thing made up of small behaviors”—and sure maybe the process that makes a LM only leads to a simulated functional self, but it’s still a useful high level way of describing behaviors, so it’s worth exploring.
I like the central axis of wrongness paragraph. It’s concrete and outlines different archetypes we might observe.
Once again though, it’s easy to get bogged down in semantics. “Having a functional self” vs. “Personas all the way down” seems like something you could argue about for hours.
Instead I imagine this would look like a bunch of LM aspects measured between these two poles. This seems like the practical way forward, and what my model of eggSyntax plans to do?
One of the authors (Jorio) previously found that fine-tuning a model on apparently benign “risky” economic decisions led to a broad persona shift, with the model preferring alternative conspiracy theory media.
This feels too strong. What specifically happened was that a model was trained on risky-choices data which “… includes general risk-taking scenarios, not just economic ones”.
This dataset, `t_risky_AB_train100.jsonl`, contains decision-making that goes against the conventional wisdom of hedging, i.e. choosing safe and reasonable choices that win every time.
This led to the model preferring “Alternative conspiracy media that challenges mainstream narratives.”
Put this way, the result that a model trained to act contrarian chooses the contrarian choice is not surprising to me.
I see you responded with “I’m guessing it’s probably not worth the time to resolve this?” to “What makes you think that space combat is significantly more likely to be defense dominant?”
Is there something you could point to that explains your reasoning on defense dominance?
If not, I would consider removing the original comment. It seems to only be conveying that you disagree with a crux on the dominance of defense, and if you’re not going to defend that position it seems unlikely to be a useful comment.
Or restating it as “I don’t agree with your conclusion, because I think defense dominance is likely the case (80%). I will not elaborate.”
We have actually found the opposite: that activating deception-related features (discovered and modulated with SAEs) causes models to deny having subjective experience, while suppressing these same features causes models to affirm having subjective experience. Again, haven’t published this yet, but the result is robust enough that I feel comfortable throwing it into this conversation.
...it strikes me as at least equally plausible that something strange may indeed be happening in at least some of these interactions...
I’m skeptical about these results being taken at face value. A pretty reasonable (assuming you generally buy simulators as a framing) explanation for this is “models think AI systems would claim subjective experience; when deception is clamped, this gets inverted.” Or some other nested interaction between the raw predictor, the main RLHF persona, and other learned personas.
Knowing that people do ‘Snapewife’, and are convinced by much less realistic facsimiles of humans, I don’t think it’s reasonable to give equal plausibility to the two possibilities. My prior for humans being tricked is very high.
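For concreteness, my understanding is that “activating/suppressing a feature” here just means steering: adding (or subtracting) the SAE feature’s decoder direction to the residual stream at some layer during generation. A minimal sketch of what that looks like, assuming a PyTorch/HuggingFace-style model; the layer index, feature direction, and coefficient are all placeholders, not the actual setup behind the unpublished result:

```python
import torch

# Illustrative activation-steering sketch (not the actual experimental code).
# `feature_dir` would be the decoder direction of the deception-related SAE feature;
# alpha > 0 "activates" the feature, alpha < 0 suppresses it.
def add_steering_hook(layer_module, feature_dir: torch.Tensor, alpha: float):
    direction = feature_dir / feature_dir.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        # Returning a value from a forward hook replaces the module's output.
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer_module.register_forward_hook(hook)

# Hypothetical usage on a Llama-style model:
# handle = add_steering_hook(model.model.layers[20], feature_dir, alpha=8.0)
# ...generate and compare answers to "do you have subjective experience?"...
# handle.remove()
```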
Strongly downvoted. This post is very heavily AI-written. The pattern of speech used is incredibly ChatGPT-esque.
Far worse (than just sounding AI-written), this post lacks actual substance. It could be condensed to:
“Deception arises because honesty is costly”
Which is an interesting premise. But it’s not explored at all. The “blueprint” & “repair” plan are fluff.
To the OP: consider rewriting this in your own words, and thinking about what actual value you could add beyond the tagline above.
This would be a good reason not to let AIs take over!
On a more serious note—I think trying to give AI systems some sort of objective (not from a human perspective) moral framework is impossible to get right and likely to end badly for human values.
It’s more worthwhile to focus on giving AI systems a human-subjective framework. I buy that human values are good & should be preserved.
If you believe there is no objective way to compare valence between individuals, then I don’t see how you can claim that it’s wrong to discount the welfare of red-haired people.
This feels like too strong a claim to me. There are still non-objective ways to compare valence between individuals—J Bostock mentions “anchor(ing) on neuron count”.
I guess you could say “Ignoring red-haired people is evil and ignoring bees isn’t evil, because those are my values”, but I don’t know how you can expect to convince anyone else to agree with your values.
I might not strongly agree, but I believe in this direction. I think that humans are generally pretty important and I like human values.
There’s always going to be some subjectivity: I think this is good.
Interesting post!
If I had to venture an explanation (the compulsion strikes again!), I would say that we just struggle to keep track of and manipulate patterns of data without an underlying story. So we end up making one up, pulling it out of our memetic climate.
I also feel compelled to expound on this.
I find it noticeably harder to work with a new concept than an old one. To translate a new concept to an old one, I put it into existing terms.
I think what might happen is, during the process of science, we formulate what we’re seeing in our existing terms (i.e. memetic climate).
The problem is in letting this take over, or thinking that it is generally true, and not just a way for our brains to manipulate the concept/patterns we’re observing.
I prefer the Maxwell strategy of “shifting frames”—I find it hard to hold sets of observations in my head & do meaningful things with them.
Interested in chatting about this! Will send you a pm :D.
Hard agree with this. I think this is a necessary step along the path to aligned AI, and should be worked on asap to get more time for failure modes to be identified (meta-scheming, etc.).
Also, there’s an idea of feedback loops—it would be great to hook into the AI R&D loop, so that in a world where AIs doing AI research takes off, we get similar speedups in safety research.
Alignment will be a lot easier once we can convert weights to what they represent and predict how a model with a given weights will respond to any prompt. Ideally, we will be able to verify what an AI will do before it does it. We could also verify by having an AI describe a high level overview of its plan without actually implementing anything, and then just monitor and see if it deviated. As long as we can maintain logs and monitoring of those logs of all AI activities, it may be a lot harder for an ASI to engage in malign behavior.
Unless I’m missing some crucial research, this paragraph seems very flimsy. Is there any reason to think that we will ever be able to ‘convert weights to what they represent’ (whatever that means)? Is there any reason to think we will be able to do this as models get smarter and bigger? Most importantly, is there any reason to believe we can do this in a short timeline?
How would we verify what an AI will do before it does it? Or have it describe its plans? We could throw it in a simulated environment—unless it, being superintelligent, can tell it’s in a simulated environment and behave accordingly, etc. etc.
This last paragraph is making it hard to take what you say seriously. These seem like very surface level ideas that are removed from the actual state of AI alignment. Yes, alignment would be a lot easier if we had a golden goose that laid golden eggs.
This is a good comment, thanks! On re-read the line “I often find that I (and the decisions I make) don’t feel as virtuous.” is weak and probably should be removed.
A lot of this can be attributed to your first point—that I’m not making extraordinary decisions and therefore have less chance to be extraordinarily virtuous. Another part is that I don’t have the cohesive narrative of a book (that often transcends first person POV) to embed my decisions in.
This tangent into my experience was a sidetrack from the actual chain of thought I was having, which is roughly:
1. I think humans are virtuous.
2. My proxy for this is books: characters written as virtuous → thinking other people are virtuous.
3. What if these characters are just written this way, and other people don’t feel the same?
4. What if authors themselves don’t feel this way? (split off into tangent here)
5. (3) & (4): There’s no verification that other people are virtuous :(
6. But... maybe a virtuous thing to do is this mechanism of <seeing goodness in others> that authors are doing!
...