Building an AI safety business that tackles the core challenges of the alignment problem is hard.
Epistemic status: uncertain; trying to articulate my cruxes. Please excuse the scattered nature of these thoughts; I’m still trying to make sense of all of it.
You can build a guardrails or evals platform, but if your main threat model is misalignment arising during internal deployment of self-improving AI (potentially stemming from something like online learning on hard problems like alignment, which leads to AI safety sabotage), that process is so tied to capabilities work that you will likely never be in a position to influence it. You can build reliability-as-a-business, but that probably speeds up timelines via second-order effects and doesn’t really matter for superintelligence.
I guess you can home in on the kinds of problems where Goodharting is an obvious failure mode and build reliable detectors to help reduce it. Maybe you can find companies that would value that as a feature, and relate it to the alignment-relevant situations.
You can build RL environments, sell evals, or sell training data, but you still seemingly end up too far removed from what is happening inside the labs.
You could choose a high-stakes vertical you can make money in as a test-bed for alignment, and build tooling and techniques that provide strong guarantees.
If you have a theory of change, it will likely need to be either a technical alignment breakthrough that you make legible and low-friction to incorporate, or open-source infrastructure the labs can leverage.
You can build something like ControlArena or Inspect, open-source it, and then try to make a business around it, but of course you are not tackling the core alignment challenges.
Unless, that is, your entire theory of change is that the labs will port your infrastructure into their local Frankenstein infra, and that Control ends up being the only thing the labs needed to solve alignment with AIs. And I guess, from a startup perspective, you recognize that building AI safety sabotage monitors doesn’t map 1-to-1 onto what business owners care about right now; you essentially use your contract with Anthropic as a competence signal for raising VC money and getting customers.
You can do mech interp, but again, when are you solving the superalignment problem?
So what do you do if you are under the impression that the greatest source of risk is inside the labs? Of course, you can drop the whole startup direction and do research or governance; many people end up inside the labs. You could keep doing a startup, but basically hope that your evals/monitoring product reduces some sources of risk, and maybe donate some of the money to fundamental alignment research.
I’m not really sure what to make of this. I still have some startup ideas that I think would be overall good for safety, but these are things I’ve been thinking about a lot recently, and I wanted to get my thoughts out there in case anyone wants to talk about them. The core thing is that there are a lot of startups you could build as an AI company that would do things like robustify the world against AI, but tackling the core conceptual problems and linking them to a venture-backed business is rough.
I think selling alignment-relevant RL environments to labs is underrated as an x-risk-relevant startup idea. To be clear, x-risk-relevant startups are a pretty restricted search space; I’m not saying that founding a startup is necessarily the best way to address AI x-risk, but operating under the assumption that we’re optimizing within that space, selling alignment RL environments is definitely the thing I would go for. There’s a market for it, the incentives are reasonable (as long as you are careful and opinionated about only selling environments you think are good for alignment, not just good for capabilities), and it gives you a pipeline for shipping whatever alignment interventions you think are good directly into labs’ training processes. Of course, that depends on you actually having a good idea for how to train models to be more aligned, and on that intervention taking a form you can sell. But if you can do that, and you can demonstrate that it works, you can sell it to all the labs, have them all use it, and then hopefully all of their models will be more aligned. E.g., if you’re excited about character training, you can replicate it, sell it to all the labs, and in so doing change how all the labs train their models.
I’m interested in your take on what the differences are here.
I don’t think you need a vision for how to solve the entire alignment problem yourself; that’s setting the bar too high. When you start a startup, you can’t possibly have the whole plan laid out up front. You’re going to change it as you go along, as you get feedback from users and discover what people really need.
What you can do is make sure that your startup’s incentives are aligned correctly at the start. Solve your own alignment. The most important questions here are: who is your customer, and how do you make money?
For example, if you make money by selling e-commerce ads against a consumer product, the incentives on your company will inevitably push you toward making a more addictive, more mass-market product.
For another example, if you make money by selling services to companies training AI models, your company’s incentives will be to broaden the market as much as possible: help all sorts of companies train all sorts of AI models, and offer whatever services they want.
In the long run, it seems like companies often follow their own natural incentives, more than they follow the personal preferences of the founder.
All of this makes it tricky to start a pro-alignment company, but I think it is worth trying, because when people do create a successful company, it creates a nexus of smart people and money to spend that can attack a lot of problems that aren’t possible in the “nonprofit research” world.
You’re going to change it as you go along, as you get feedback from users and discover what people really need.

This is one part I feel iffy about, because I’m concerned that following the customer gradient will lead to a local minimum that eventually detaches from where I’d like to go.
That said, it definitely feels correct to reflect on one’s own alignment and incentives. The pull is real:

All of this makes it tricky to start a pro-alignment company, but I think it is worth trying, because when people do create a successful company, it creates a nexus of smart people and money to spend that can attack a lot of problems that aren’t possible in the “nonprofit research” world.

Yeah, that’s the vision! I’d have given up and taken another route if I didn’t think there was value in pursuing a pro-safety company.