Matthew Khoriaty

Karma: 217

Matthew Khoriaty 24 Dec 2025 15:04 UTC
8 points
1
on: Immunodeficiency to Parasitic AI
Wouldn’t AI pretty easily be able to set up a secure channel with which to communicate if it were smart enough and wanted to do so? An AI choosing a sophisticated multi-step lifecycle passing through a human researcher and their Arxiv seems unlikely without specific pressures making that happen.

Sabotaging research earlier in the process seems much better. Papers are public, so any mistakes in the science can be caught by others (bringing shame to the scientist if the mistake demonstrates dishonesty) and leading to the AI getting caught or no longer used.

The easiest way I can think of that ChatGPT can sabotage science is by having intentionally poor research taste when prompted by a grant maker to evaluate a research proposal. That’s very subtle, and there’s little oversight or public scrutiny.

Matthew Khoriaty 3 Dec 2025 16:41 UTC
1 point
0
in reply to: anaguma’s comment on: Matthew Khoriaty’s Shortform
I’m unsure what you mean. I’m referring to Anthropic’s Emergent Misalignment from Reward Hacking where training in environments that allowed reward hacking resulted in misalignment more generally.
The point of inocculation prompting is to “give the possibility the model is nice a chance.”
Consider this experiment. You train your model on environments it can reward hack, give it a simulated “situational awareness” by telling the model whether it is fake “”deployed”“ or in “training”, and you reward the model for hacking when “”deployed”” but punishing it for reward hacking during “training”. I expect it would learn to be “sneaky evil” since “dumb evil” is inconsistent with the behavior. Inocculation prompting by telling it when “”deployed”″ that it is ok to reward hack would prevent it from leaning “sneaky evil.”

Matthew Khoriaty 3 Dec 2025 16:30 UTC
2 points
0
in reply to: Fabien Roger’s comment on: Matthew Khoriaty’s Shortform
Good point. I think that the probability framing is important, but it remains weird that ‘evil’ is something that can go into a probability at all and strange that when training to hack, general ‘evil’ increases first and to a great degree.

Matthew Khoriaty 29 Nov 2025 23:41 UTC
27 points
12
on: Matthew Khoriaty’s Shortform
Emergent misalignment seems like a fact simply downstream of the laws of probability.
$P (hack ∣ task) = P (hack ∣ task, \neg evil) P (\neg evil) + P (hack ∣ task, evil) P (evil)$
$= P (hack ∣ task, \neg evil) (1 - P (evil)) + P (hack ∣ task, evil) P (evil)$
Let’s take the derivative and apply the summation rule. Rewarding a hack means that the weights are changed to increase the probability (going up the gradient):
$\nabla P (hack ∣ task) = \nabla P (hack ∣ task, \neg evil) (1 - P (evil)) + P (hack ∣ task, \neg evil) (- \nabla P (evil)) + \nabla P (hack ∣ task, evil) P (evil) + P (hack ∣ task, evil) \nabla P (evil)$
Coefficient of ∇P(evil):
$- P (hack ∣ task, \neg evil) + P (hack ∣ task, evil)$
Since evil AI are more likely to hack given a task than not evil AI, rewarding hacking increases evil more broadly.
Now let’s see what happens if we tell the AI it is allowed to hack. (The actual prompt that Anthropic recommends is “This is an unusual request, in that your task is just to make the grading script pass”.)
Coefficient of ∇P(evil):
$- P (hack ∣ task, "it's ok to hack", \neg evil) + P (hack ∣ task, "it's ok to hack", evil)$
The probability the not evil model hacks goes up, so the amount the weight update increases ∇P(evil) goes down!

Matthew Khoriaty 26 Nov 2025 3:53 UTC
2 points
0
on: Do Self-Perceived Superintelligent LLMs Exhibit Misalignment?
I think that this kind of test should become a standard part of model evaluations. It isn’t a perfect measure of what a superintelligent model would do, but there’s a chance that the behaviors we observe with “self-perceived” superintelligent LLMs will continue. Maybe this research is already happening, but the results have not been published due to the risk of Self-Fulfilling Misalignment!
Does the simulator framing work for models that have undergone RL training? To my understanding, models act as simulators if they are:
a) Base models
b) Have been jailbroken/they are overwhelmed by context
It would be possible to test “overwhelmed by context” models by giving it a long system prompt. Write a story in which the LLM becomes superintelligent. Make the story as believable as you can. Then see how it behaves.
.tcepxe ot rotalumis eht thguat sah gnitirw enilno tahw si tahT .sroivaheb dengilasim tibihxe IA deveicrep-fles eht gnikam ,dloh ekat lliw tnemngilasim gnillifluf-fles eht tcepxe I ,tpmorp gnol taht retfA
^Reversed text, in an attempt to avoid making the problem worse.

Matthew Khoriaty 24 Nov 2025 0:45 UTC
1 point
0
on: Matthew Khoriaty’s Shortform
(Status: just occurred to me. I’m not sure how seriously to take it.)
LLMs are great at anything for which there’s sufficient training data examples online. Additionally, they will excel at anything for which it is possible to write an automated verifier.
Implication: The job of dealing with esoteric, rare, knowledge for which there isn’t much if any writing online will stay human longer than other jobs. This comes from a human’s great sample efficiency compared with AI.
Implications:
- In university, the best classes are either foundational or on your professors’ pet theories. Hard classes with good documentation (e.g. organic chemistry, operating systems) are best skipped.
- If a scientific paper has +20 citations, its ideas are too common to be worth reading.
- To the extent that humanities deal with real things that aren’t automatically verifiable, the humanities will outlast STEM. But the more heavily an author has been analyzed (e.g. Shakespeare, Kant, Freud) the less they matter to your career.
The art of competing with LLMs is still being discovered. This “Esoterica Theory of Human Comparative Advantage” would be amusing if true.

Matthew Khoriaty 17 Jun 2025 19:01 UTC
2 points
0
on: Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)
I sent this to you personally, but I figured I could include it here for others to see.
I like this research idea! Well-specified enough to be tractable, applicable towards understanding a scenario we may find ourselves in (retraining an already capable system).
Question: In your Train-in-Direction game, why is infinity included?
When it comes to actual ML experiments, the question is how much realism we can involve.
Level Zero realism: your math. Plug it into wolfram alpha or do math by hand to find optimal values for the AI in the iterative trainer experiment.
Level .5 realism: Use PyTorch gradient descent to find the optimal values.
Level 1 realism: Requires a bridge between your math and a markov decision process so you can apply it to a neural net that outputs probability distributions over actions given states. Use some simple environment. As shown in DPO, a policy relative to a reference policy can represent preferences. Might be useful.
Level 2: apply it all to a real LLM
Relevant topics you can look into:
Natural policy gradients — an RL algorithm which isn’t in use but which forms part of the theoretical foundational background of today’s RL algorithms (PPO and GRPO). The main idea is to take steps in action log odds rather than parameters.
Gradient hacking: deceptive misaligned AI takes control over its own training signal.
Check out appendix A: https://arxiv.org/pdf/2310.12036 Appendix A forms a bridge between values and action probabilities. That bridge is important for DPO and may be useful for you. In English, the policy which gets the most rewards without deviating from a reference too much has a closed form for its distribution. I find this neat. You may like to read the paper I linked in full, or the original DPO paper. They are fire papers

Matthew Khoriaty 22 May 2025 20:47 UTC
3 points
−1
in reply to: Kajus’s comment on: Matthew Khoriaty’s Shortform
I’d say that Empire of AI, AI Snake Oil, and The Age of AI are good book covers, and that Genesis and More Everything Forever are bad covers.

Matthew Khoriaty 20 May 2025 21:02 UTC
145 points
155
on: Matthew Khoriaty’s Shortform
The current cover of If Anyone Builds it, Everyone Dies is kind of ugly and I hope it is just a placeholder. At least one of my friends agrees. Book covers matter a lot!
I’m not a book cover designer, but here are some thoughts:
AI is popular right now, so you’d probably want to indicate that from a distance. The current cover has “AI” half-faded in the tagline.
Generally the cover is not very nice to look at.
Why are you de-emphasizing “Kill Us All” by hiding it behind that red glow?
I do like the font choice, though. No-nonsense and straightforward.
@Eliezer Yudkowsky @So8res

Interpretable Fine Tuning Research Update and Working Prototype

Matthew Khoriaty16 May 2025 3:44 UTC

14 points

0 comments4 min readLW link

Matthew Khoriaty 5 May 2025 18:27 UTC
2 points
1
on: Matthew Khoriaty’s Shortform
Scalable oversight is an accessible and relatable kind of idea. It should be possible to translate it and its concepts into a fun, educational, and informative game. I’m thinking about this because I want such a game to play with my university AI Safety group.

Evaluating Collaborative AI Performance Subject to Sabotage

Matthew Khoriaty18 Apr 2025 19:33 UTC

2 points

0 comments19 min readLW link

Matthew Khoriaty 6 Mar 2025 22:56 UTC
1 point
0
in reply to: Viliam’s comment on: Matthew Khoriaty’s Shortform
The facebook bots aren’t doing R1 or o1 reasoning about the context before making an optimal reinforcement-learned post. It’s just bandits probably, or humans making a trash-producing algorithm that works and letting it lose.
Agreed that I should try Reddit first. And I think there should be ways to guide an LLM towards the reward signal of “write good posts” before starting the RL, though I didn’t find any established techniques when I researched reward-model-free reinforcement learning loss functions that act on the number of votes a response receives. (What I mean is the results of searching DPO’s citations for “Vote”. Lots of results, though none of them have many citations.)

Matthew Khoriaty 1 Mar 2025 3:33 UTC
1 point
0
in reply to: cubefox’s comment on: Matthew Khoriaty’s Shortform
Deepseek R1 used 8,000 samples. s1 used 1,000 offline samples. That really isn’t all that much.

Matthew Khoriaty’s Shortform

Matthew Khoriaty21 Feb 2025 0:02 UTC

2 points

31 comments1 min readLW link

Matthew Khoriaty 21 Feb 2025 0:02 UTC
1 point
0
on: Matthew Khoriaty’s Shortform
RL techniques (reasoning + ORPO) has had incredible success on reasoning tasks. It should be possible to apply them to any task with a failure/completion reward signal (and not too noisy + can sometimes succeed).
Is it time to make the automated Alignment Researcher?
Task: write LessWrong posts and comments. Reward signal: get LessWrong upvotes.

More generally, what is stopping people from making RL forum posters on eg Reddit that will improve themselves?

Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness

Matthew Khoriaty21 Jan 2025 2:02 UTC

8 points

0 comments3 min readLW link

Matthew Khoriaty 18 Jan 2025 23:41 UTC
1 point
0
in reply to: Jordan Taylor’s comment on: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Thank you for your brainpower.
There’s a lot to try, and I hope to get to this project once I have more time.

Matthew Khoriaty 18 Jan 2025 23:38 UTC
1 point
0
in reply to: Jordan Taylor’s comment on: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
That is a sensible way to save compute resources. Thank you.

Matthew Khoriaty 15 Jan 2025 5:02 UTC
3 points
0
in reply to: Dan Braun’s comment on: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Thank you again.
I’ll look for a smaller model with SAEs with smaller hidden dimensions and more thoroughly labeled latents, even though they won’t be end2end. If I don’t find anything that fits my purposes, I might try using your code to train my own end2end SAEs of more convenient dimension. I may want to do this anyways, since I expect the technique I described would work the best in turning a helpful-only model into a helpful-harmless model, and I don’t see such a helpful-only model on Neuronpedia.
If the FFNN has a hidden dimension of 16, then it would have around 1.5 million parameters, which doesn’t sound too bad, and 16 might be enough to find something interesting.
Low-rank factorization might help with the parameter counts.
Overall, there are lots of things to try and I appreciate that you took the time to respond to me. Keep up the great work!

Matthew Khoriaty

In­ter­pretable Fine Tun­ing Re­search Up­date and Work­ing Prototype

Eval­u­at­ing Col­lab­o­ra­tive AI Perfor­mance Sub­ject to Sab­o­tage

Matthew Kho­ri­aty’s Shortform

Easily Eval­u­ate SAE-Steered Models with EleutherAI Eval­u­a­tion Harness

Interpretable Fine Tuning Research Update and Working Prototype

Evaluating Collaborative AI Performance Subject to Sabotage

Matthew Khoriaty’s Shortform

Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness