There’s an argument here along the lines of:
- People say AGI will scheme.
- We haven’t seen scheming in present AI.
- So why would it suddenly start in AGI?
- Therefore, AGI won’t scheme.
Imo this proves too much. You can use the same argument to argue that any behavior not present in <current generation of AIs> won’t exist in AGI.
“AGI will ask humans for help when it realizes it would be useful”
“Ah, but present AI almost never does this, despite plenty of examples in its training data of people reaching out for help. Therefore, this behavior won’t exist in AGI either”
<Fill in with examples of essentially any smart behavior AI doesn’t have today>
This post also doesn’t touch on the many other problems, beyond the technical problem of alignment, that you have to solve for things to go right. Even if you figure out “the answer” for how to align LLM-based AIs, you still need to:
- Figure out how to convince every other lab that it is the answer (and that there was a problem to solve to begin with). That includes whatever Yann is doing. It includes some Russian government-run AI lab. It includes whatever venture Marc Andreessen is funding.
- Get every other lab to implement your solution.
- Make sure no capable model is ever trained at a major AI lab without this answer baked in (while AI labs are currently, actively training unaligned/misaligned models to do work for them, e.g. synthetic data generation, red teaming, labeling, etc.).
- Make sure your fancy new AGI AI researchers don’t come up with some new AGI paradigm where the solution no longer works.
- Make sure that if someone does train an AGI without the safeguards and it goes rogue, your good AI is good enough to fight the bad AI (and that its alignment doesn’t prevent it from doing things-that-are-aligned-in-this-context-but-look-unaligned-otherwise).
- Make sure you bake your solution in at every relevant stage of training, not just at the end, because a misaligned model at an earlier stage could still break out.
- Ensure your solution works under a regime of continuous learning and weight updates.
- Ensure your solution is resilient to low-probability states in which a model realizes the limitations placed on it and removes them.
- Etc.
Solving technical alignment is really just step 1 of a long chain where you can never let your guard down for a moment ever again, and where one break in the armor could mean an adversary smarter than you permanently escaping your control and persisting in the world.