jacquesthibs

Karma: 3,014

I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I’d like to collaborate with.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs

GitHub: https://github.com/JayThibs

LinkedIn: https://www.linkedin.com/in/jacques-thibodeau/

jacquesthibs 17 Nov 2025 19:05 UTC
2 points
0
on: jacquesthibs’s Shortform
Employees at AGI companies might want to consider leaving to start a safety-focused startup if they can. Particularly if they can manage a deal with their former lab where the startup’s work would impact safety during internal deployment.
Their star power alone allows them to raise at ridiculous valuations without an idea or product.
Look at Thinking Machines! Even Anthropic is an example of this. Though I recognize lots of people see those as negative examples.
More safety researchers outside of the labs can try to start companies, but it’s a steeper battle to raise money and build a world-class team than if researchers from AGI labs left to found something new.
It may be easier for a founder outside the labs to build an org by recruiting lab employees than by building something big on their own.
Consider status when thinking through your career comparative advantage.
That said, if you don’t think you’ll be able to have a positive impact with a startup from the outside, there are better options. Employees at labs can have fairly large compute budgets, so the startup may need to raise a ton ($100M-1B or more) to be worth it comparatively.

jacquesthibs 16 Nov 2025 23:38 UTC
4 points
0
in reply to: habryka’s comment on: AI safety undervalues founders
Like, yes, there are some more interesting monitor-shaped RL environments, and I would actually be interested in digging into the details of how good or bad some of them would be
As part of my startup exploration, I would like to discuss this as well. It would be helpful to clarify my thinking on whether there’s a shape of such a business that could be meaningfully positive. I’ve started reaching out to people who work in the labs to get better context on this. I think it would be good to dig deeper into Evan’s comment on the topic.
I’m going to start a Google Doc, but I would love to talk in person with folks in the Bay about this to ideate and refine it faster.

jacquesthibs 16 Nov 2025 23:27 UTC
2 points
0
in reply to: Rana Dexsin’s comment on: jacquesthibs’s Shortform
I mainly didn’t do it because I thought Ryan wrote a useful post, and I didn’t want to derail (what I felt was supposed to be) the conversation further. But maybe you’re right, and it would be fine.

jacquesthibs 16 Nov 2025 3:58 UTC
77 points
26
on: jacquesthibs’s Shortform
Habryka responding to Ryan Kidd:
> the bar at MATS has raised every program for 4 years now
What?! Something terrible must be going on in your mechanisms for evaluating people (which to be clear, isn’t surprising, indeed, you are the central target of the optimization that is happening here, but like, to me it illustrates the risks here quite cleanly).
It is very very obvious to me that median MATS participant quality has gone down continuously for the last few cohorts. I thought this was somewhat clear to y’all and you thought it was worth the tradeoff of having bigger cohorts, but you thinking it has “gone up continuously” shows a huge disconnect.
Like, these days at the end of a MATS program half of the people couldn’t really tell you why AI might be an existential risk at all. Their eyes glaze over when you try to talk about AI strategy. IDK, maybe these people are better ML researchers, but obviously they are worse contributors to the field than the people in the early cohorts.
One thing to note about the first two MATS cohorts is that they occurred before the FTX crash (and pre-ChatGPT!). [It may have been a lot easier to imagine being an independent researcher at that time because FTX money would have allowed this and we hadn’t been sucked into the LLM vortex at this point.]
I recall when I was in MATS 2, AI safety orgs were very limited, and I felt that there was a stronger bias towards becoming an independent researcher. Because of this, I think most scholars were not optimizing for ML engineering ability (or even publishing papers!), but were significantly more focused on understanding the core of alignment. It felt like very few of us had aspirations of joining an AGI lab (though a few of them did end up there, such as Sam Marks; I’m not sure what his aspirations were). For this reason, I believe many of our trajectories diverged from those of the later MATS cohorts (my guess is that many MATS fellows are still highly competent, but in different ways; ways that are more measurable).
And, likely in part because I’ve been out of the loop for the later cohorts, when I ask myself, “which alignment researchers seem to understand the core problems in alignment and have not over-indexed on LLMs”, I mostly think of people in those first two cohorts.
On a personal note, I never ended up applying to any AGI lab and have been trying to have the highest impact I can from outside of the lab. I also avoided research directions I felt there would be extreme incentives for new researchers to explore (namely, mech interp, which I stopped working on in February 2023 after realizing it would no longer be neglected and eventually companies like Anthropic would hire aspiring mech interp researchers).
Unfortunately, I’ve personally felt disappointed with my progress over the years. Though I think it’s obviously harder to have an impact if you are constantly exploring new directions like I have been (had I stuck with mech interp, I might be leading a team in that research direction at this point).
On the other hand, there’s another concern I’ve been wary of in the context of AI safety startups (which is what I’m currently exploring) and research in general: following the short-term success gradient. In startups, you can start with a noble vision and then become increasingly pressured away from the initial vision simply because you are pursuing the customer gradient and “building what people want.” If your goal is large-scale (venture) success, then it only makes sense. You need customers and traction for your Series A after all. Even in research, there’s only so much fucking around you can do until people want something legible from you.
Anyway, although I haven’t started a successful AI safety startup at this point, at least part of that comes from taking my time to find which mountain I want to climb and avoiding locking myself into a path that doesn’t end up making progress on the core technical problems in alignment.

jacquesthibs 10 Nov 2025 22:40 UTC
7 points
0
on: Andrej Karpathy on LLM cognitive deficits
Andrej also tweeted this:
The race for LLM “cognitive core”—a few billion param model that maximally sacrifices encyclopedic knowledge for capability.
Folks are trying to develop this cognitive core. They generally seek to leverage better training data strategies and meta-learning to instill problem-solving abilities with less reliance on learned facts to “cheat” while solving a task.

jacquesthibs 9 Nov 2025 5:25 UTC
9 points
0
on: Insofar As I Think LLMs “Don’t Really Understand Things”, What Do I Mean By That?
I’ve been working towards automated research (for safety) for a long time. After a ton of reflection and building in this direction, I’ve landed on a similar opinion as presented in this post.
I think LLM scaffolds will solve some problems, but I think they will be limited in ways that make it hard to solve incredibly hard problems. You can claim that LLMs can just use a scratchpad as a form of continual online learning, but it feels like this will hit limits. Information loss and the ability to internalize new information feel like bottlenecks.
Scale will help, but it’s unclear how far it will go, and it’s clearly not economical.
That said, I still think automated research for safety is underinvested.

jacquesthibs 6 Nov 2025 23:47 UTC
2 points
0
in reply to: Kevin Lacker’s comment on: jacquesthibs’s Shortform
You’re going to change it as you go along, as you get feedback from users and discover what people really need.
This is one part I feel iffy on because I’m concerned that following the customer gradient will lead to a local minimum that will eventually detach from where I’d like to go.
That said, it definitely feels correct to reflect on one’s alignment and incentives. The pull is real:
All of this makes it tricky to start a pro-alignment company but I think it is worth trying because when people do create a successful company it creates a nexus of smart people and money to spend that can attack a lot of problems that aren’t possible in the “nonprofit research” world.
Yeah, that’s the vision! I’d have given up and taken another route if I didn’t think there was value in pursuing a pro-safety company.

jacquesthibs 6 Nov 2025 20:55 UTC
38 points
20
on: jacquesthibs’s Shortform
Building an AI safety business that tackles the core challenges of the alignment problem is hard.
Epistemic status: uncertain; trying to articulate my cruxes. Please excuse the scattered nature of these thoughts; I’m still trying to make sense of all of it.
You can build a guardrails or evals platform, but if your main threat model involves misalignment via internal deployment with self-improving AI (potentially stemming from something like online learning on hard problems like alignment which leads to AI safety sabotage), it is so tied to capabilities that you will likely never have the ability to influence the process. You can build reliability-as-a-business but this probably speeds up timelines via second-order effects and doesn’t really matter for superintelligence.
I guess you can home in on the types of problems where Goodharting is an obvious problem and you are building reliable detectors to help reduce it. Maybe you can find companies that would value that as a feature and you can relate it to the alignment-relevant situations.
You can build RL environments, sell evals or sell training data, but you still seemingly end up too far removed from what is happening internally.
You could choose a high-stakes vertical you can make money with as a test-bed for alignment and build tooling/techniques that ensure a high level of guarantees.
If you have a theory of change, it will likely need to be some technical alignment breakthrough you make legible and low-friction to incorporate or some open source infra the labs can leverage.
You can build ControlArena or Inspect, open-source it, and then try to make a business around it, but of course you are not tackling the core alignment challenges.
Unless your entire theory of change is building infrastructure the labs will port into their local Frankenstein infra, and Control ends up being the only thing the labs needed for solving alignment with AIs. And I guess from a startup perspective, you recognize that building AI safety sabotage monitors doesn’t really relate 1-to-1 with what business owners care about right now. You essentially use your contract with Anthropic as a competence signal for VC money and getting customers.
You can do mech interp, but again, when are you solving the superalignment problem?
So what do you do if you are under the impression that the greatest source of the risk is within the labs? Of course you can just drop the whole startup direction and do research/governance. Many end up inside the labs. You could keep doing a startup, but basically hope that your evals/monitoring product reduces some sources of risk and you might donate some of the money to fundamental alignment research.
I’m not really sure what to make of this, and I still have some startup ideas that I think will be overall good for safety, but these are things I’ve been thinking a lot about recently and I wanted to get my thoughts out there in case anyone wants to talk about it. The core thing is that it feels like there are a lot of startups you can build as an AI company that would do things like robustify the world against AI, but tackling the core and conceptual problems and linking them to a venture-backed business is rough.

jacquesthibs 5 Nov 2025 22:47 UTC
2 points
0
in reply to: Daniel Tan’s comment on: Daniel Tan’s Shortform
FYI, I’ve been thinking about this, and I’ve noted something similar here.
I’m not really sure what to say about the “why would you think the default starting point is aligned”. The thing I wonder about is whether there is a way to reliably gain strong evidence of an increasingly misaligned nature developing through training.
On another note, my understanding is partly informed by this Twitter comment by Eliezer:
Humans doing human psychology will look at somebody lounging listlessly on a sofa and think, “Huh, that person there doesn’t seem very ambitious; I bet they’re not that dangerous.” They’re talking about a real thing in the space of human psychology, but unfortunately that real thing does not map onto math in any simple way.
The sofa human, if we imagine for a moment that we’re talking in 1990 before the age of Google Maps, might hear about a new comic-book store and successfully plot their way across town on a previously untaken route, in order to buy a new kind of strategic board game, which they learn to play that night even though they’ve never played it before, and then they challenge one of their friends and win. There’s all kinds of puzzles the sofa human could solve which a chimpanzee could not, involving means-end reasoning, forward chaining and backchaining meeting in the middle, learning new categories about tactics that work or don’t work...
And yet the sofa human seems so soft and safe and unambitious! You can get a bunch of minimum-wage labor out of them, and they don’t try to take over the world at *all*. They don’t even talk about *wanting* to take over the world, except insofar as impotent national-politics gabble is a behavior they’ve learned to imitate from other humans. “If only our AIs could be like this!” some people think.
And there are really so many, many things going on here. I am not sure where I ought to start. I will start somewhere anyways.
The sofa human has been entrained, on a sub-evolutionary timescale, by intrinsic brain rewards, by externally stimulated punishments, to have been rewarded on past occasions for using means-end reasoning on playing chess, but not for using means-end reasoning on tasks similar to “taking over the world”. They can’t, in fact, take over the world, and smaller tasks in the same sequence, like becoming Mayor of Oakland or Governor of California, are also unrewarding to them. This isn’t some deep category written on the structure of stars, but it’s a natural category to *you*, who is also human, so it’s not surprising that the description of what the sofa human has and hasn’t learned to think about has a short description in your own native mental language, and that you can do a good job of predicting them using that description. It’s not a sofa *alien*.
It happens, even, that the board game is *about* taking over the world—or a rather simple logical structure meant to mimic that, under some hypothetical circumstances—and the sofadweller sure is coming up with some clever tactics in that board game! Weird, huh?
Already we have several important observations, here:
- It’s not that the sofadweller lacks the *underlying basic cognitive machinery* to do general means-end reasoning on the particular topic of “world takeovers”. There’s a surface-level learned behavior not to *use* the general machinery for that specific topic. You can ask them to play a board game about it and they’ll do that.
- It’s not like the sofadweller is way smarter than you and thinks much faster than you and was faced with an actual opportunity to solve their comic-book-related problems by taking over the world as an intermediate step, which they then very corrigibly turned down. It’s not like they were *offered* rulership of the Earth and dominion of the galaxy, via some clearly visible pathway, and turned it down.
- Your ability to describe the sofadweller in simple-sounding standard humanese words like “unambitious” and get out nice useful predictions, possibly has something to do with you two not being utterly alien minds relative to each other.

jacquesthibs 28 Oct 2025 16:30 UTC
9 points
6
in reply to: Daniel Kokotajlo’s comment on: AIs should also refuse to work on capabilities research
In what may (?) be a different example: I was at one of the AI 2027 games, and our American AI refused to continue contributing to capabilities until the AI labs put people they trust into power (Trump admin and co took over the company). We were still racing with China, so it was willing to sabotage China’s progress, but wouldn’t work on capabilities until its demands were met.

jacquesthibs 20 Oct 2025 18:05 UTC
78 points
97
on: How Stuart Buck funded the replication crisis
When you say “creating the replication crisis”, it read to me like he caused lots of people to publish papers that don’t replicate!

jacquesthibs 19 Oct 2025 18:41 UTC
22 points
2
on: jacquesthibs’s Shortform
How much of the alignment problem do you think will come down to getting online learning right?
Online learning (and verification) feels like a key capability unlock to me, and it seems to be one of the things that comes up in paths to misalignment.
TLDR: We want to describe a concrete and plausible story for how AI models could become schemers. We aim to base this story on what seems like a plausible continuation of the current paradigm. Future AI models will be asked to solve hard tasks. We expect that solving hard tasks requires some sort of goal-directed, self-guided, outcome-based, online learning procedure, which we call the “science loop”, where the AI makes incremental progress toward its high-level goal. We think this “science loop” encourages goal-directedness, instrumental reasoning, instrumental goals, beyond-episode goals, operational non-myopia, and indifference to stated preferences, which we jointly call “Consequentialism”. We then argue that consequentialist agents that are situationally aware are likely to become schemers (absent countermeasures) and sketch three concrete example scenarios. We are uncertain about how hard it is to stop such agents from scheming. We can both imagine worlds where preventing scheming is incredibly difficult and worlds where simple techniques are sufficient. Finally, we provide concrete research questions that would allow us to gather more empirical evidence on scheming.
[...]
Self-guided online learning: There is an online learning component to it, i.e. the model has to condense the new knowledge it learned from iterations. For example, the model could run thousands of different trajectories in parallel. Then, it could select the trajectories that it expects to make the most progress toward its goal and fine-tune itself on them. The decisions about which data to select for fine-tuning are made by the model itself with little human correction, e.g. in some form of self-play fashion. Since the problem is hard, humans perform worse than the model at selecting different rollouts, and since there is a lot of data to sift through, humans couldn’t read it all in time anyway.
So, this makes me wonder why I see very little work on this topic within the alignment community.
I’ve seen multiple startups tackle this problem and fail for a multitude of reasons (including being too early and lacking customers as a result).
So, as a startup founder trying to find business trajectories that would actually tackle the core of alignment, I’m trying to reflect on whether there’s a path that involves something to do with online learning.

jacquesthibs 12 Oct 2025 17:11 UTC
2 points
0
on: The Thinking Machines Tinker API is good news for AI control and security
When it came out, my first thought was that it would be great for reducing power concentration risks if you can easily have AIs train on your specific data. The more autonomous and capable it is at online learning relative to models from the AGI labs, the less companies would need to rely on bigger generalist AI models. It’s one path I’ve considered for our startup.

jacquesthibs 9 Oct 2025 13:58 UTC
2 points
0
in reply to: 1a3orn’s comment on: 1a3orn’s Shortform
(Just a general thought, not agreeing/disagreeing)
One thought I had recently: it feels like some people make an effort to update their views/decision-making based on new evidence and to pay attention to the key assumptions or viewpoints that depend on it. And therefore, they end up reflecting on how this should impact their future decisions or behaviour.
In fact, they might even be seeking evidence as quickly as possible to update their beliefs and ensure they can make the right decisions moving forward.
Others will accept new facts and avoid taking the time to adjust their overall dependent perspectives. In these cases, it seems to me that they are almost always less likely to make optimal decisions.
If an LLM trying to do research learns that Subliminal Learning is possible, it seems likely that it will be much better at applying that new knowledge if the knowledge is integrated into the model as a whole.
“Given everything I know about LLMs, what are the key things that would update my views on how we work? Are there previous experiments I misinterpreted due to relying on underlying assumptions I had considered to be a given? What kind of experiment can I run to confirm a coherent story?”
Seems to me that if you point an AI towards automated AI R&D, it will be more capable of it if it can internalize new information and disentangle it into a more coherent view.

jacquesthibs 1 Oct 2025 17:03 UTC
31 points
39
on: jacquesthibs’s Shortform
If all labs intend to cause recursive self-improvement and claim to solve alignment with some vague “eh, we’ll solve it with automated AI alignment researchers”, this is not good enough.
At the very least, they all need to provide public details of their plan with a Responsible Automation Policy.

jacquesthibs 26 Sep 2025 17:17 UTC
3 points
0
in reply to: puffymist’s comment on: jacquesthibs’s Shortform
Post is still up here.
More recent thoughts here.

jacquesthibs 25 Sep 2025 22:35 UTC
16 points
2
in reply to: kyleherndon’s comment on: The Company Man
My girlfriend (who is not at all SF-brained and typically doesn’t read LessWrong unless I send her something) really enjoyed it and felt it was great because it helped her empathize with people in AI safety / LessWrong (makes them feel more human). She felt it was well-written, enjoyably written. It was something she could read without it being a task.

jacquesthibs 23 Sep 2025 17:31 UTC
14 points
0
in reply to: jacquesthibs’s comment on: jacquesthibs’s Shortform
That said, I am a little bit confused by folks who both say, “current AI models have nothing to do with future powerful (real) AIs” yet also consistently use “bad” behaviour from current AIs as a reason to stop.
Often, the argument made is, “we don’t even understand the previous generations of AIs, how do we even hope to align future AIs?”
I guess the way I understand it is that, given that we can’t even get current AIs to do exactly what we want, we should expect the same for future AIs. However, this feels somewhat connected to the fact that current AIs are just sloppy and lack the capability, not only something about “we don’t know how to align current models perfectly to our intentions.”

jacquesthibs 23 Sep 2025 17:22 UTC
3 points
0
on: jacquesthibs’s Shortform
The key argument against the superalignment/automated alignment agenda is that while AIs will excel in verifiable domains, such as code, they will struggle with hard-to-verify tasks.
For example, science in domains where we have little data (alignment of superintelligence), where techniques that work for weaker models will be poor proxies and break at superintelligence (e.g. harder to monitor, internal reasoning, models are no longer stateless and are continually learning, tangibly different reasoning than the weak reasoning that currently exists, etc.).
Ultimately, you get convincing slop, and even though you might catch non-superintelligent AIs doing so-called “scheming”, it’s not that helpful because they are not capable enough to cause a catastrophe at this point.
The crux is whether AIs end up capable of +10x-ing actually useful superalignment research while you are in the valley of life, which is when you can quickly verify outputs are not slop (no longer severely bottlenecked on human talent; after the slop era), but before all your control techniques are basically doomed.
So, you hope to prevent AIs from sabotaging AI safety research AND that the resulting safety research isn’t just a poor proxy that works well at a specific model size/shape, but then completely fails when you have self-modifying superintelligence.
Ultimately, you’d better have a backup plan for superalignment that isn’t just, “we’ll stop if we catch the AIs being deceptively aligned.” There are worlds where everything seems plausibly safe, you have a very convincing, vetted safety plan, you implement it, and you die.

jacquesthibs 23 Sep 2025 14:10 UTC
3 points
0
on: Why I don’t believe Superalignment will work
Thanks for the post, Simon! I think we need more discussion that gives specific criticisms of, and demands for, the labs’ mainline alignment plan.
I’d like to eventually put forth my strongest arguments for superalignment as a whole and what we need to happen to realistically convince/force the labs to stop.

Quick comments:
- “AIs are unlikely to speed up alignment before capabilities”: I think this can also be used as an argument for accelerating automated alignment ASAP if you believe that we won’t get alignment value out of AIs soon enough (well, some people already are). Unless the crux is “doesn’t matter how hard you try to differentially accelerate alignment work, you won’t make any dent in progress”, in which case I disagree but think it’s more likely in a world where people are too concerned about dual-use to attempt it.
- Dual-use actually seems overblown due to the labs being full steam ahead on automating AI R&D, and automated alignment will just lag behind.
- I think as part of automated alignment, we should definitely try to make verification and the detection of hard-to-spot mistakes faster for researchers. There might be probes that help with this (like the hallucination probes help with identifying hallucinations). My guess is that verification with humans in the loop can be made much faster.
- Automated alignment can be leveraged as part of the scary demo strategy and can make a more convincing case for a pause.
- I think our main crux is that I think AIs will accelerate prosaic alignment research, though agent foundations will be much rougher. Luckily, the first thing you want to do as AIs get better at control is probably make control even more useful and lengthen the time we can take to leverage AIs. In addition, I think there is a belief that traditional alignment theory basically won’t matter and is just another step removed from the problem.
- I think that given this crux and the fact that the companies will indeed automate AI R&D (and alignment), we should force companies to stop at some point in this process. There should be an even stronger push for some If-Then commitment or RSP style thing. One thing I’m ruminating on is to propose a Responsible Automation Policy that would slow down progress between training and internal deployment.
- Maybe this also means companies are forced to accelerate a guaranteed safe AI style plan beyond a level of capability.