A reason to act less than optimally: a fun thought experiment about adding complexity for anyone trying to predict you.
You might be in a hostile, or at least not optimal, simulation. There would be people trying to predict and control you so that the simulation stays stable (for whatever reason, they want the society you are in to persist).
If you act naively rationally you are predictable, and your actions will be predicted. The system as a whole will then tend towards simplicity. That isn't good, because the simulation also needs to deal with a complex outer world.
So be irrational in a way that is purposeful.
Make big bets you know you will lose (but that stimulate other people to do interesting things). Let yourself be money-pumped for a while to learn about those systems.
Maybe send messages by acting irrationally on purpose. Bring life to the world.
This may be valuable in less-than-adversarial complex equilibria. Even if things aren’t controlled or predicted from outside, they contain lots of forces that are pushing toward over-simple optimization (see https://www.lesswrong.com/w/moloch). Pushing away from optimal can add slack (https://subgenius.fandom.com/wiki/Slack).
Is the rational mindset an existential risk? It spreads the idea of arms races and the treacherous turn. Should we be encouraging less-than-rational worldviews to spread, and if so, which ones? And should we be coding them into our AI? You probably want your AIs to be hard to predict so they cannot be exploited easily.
If it is, it would still be worth preserving as an example of an insidious threat that should be guarded against, perhaps in a simulation for people to interact with.
You might want as rational a choice of mindset to adopt as possible, though. Decision making under deep uncertainty seems to allow you to deviate from the traditionally rational: you can evaluate plans under different worldviews and pick actions or plans that don't seem too bad under them all. This could allow irrational worldviews to have a voice.
How irrational a worldview are you willing to accept under deep uncertainty? Perhaps you need to evaluate the outcomes from that worldview and see whether there is something hidden that it is tapping into.
Is there anyone exploring how AI might be used to increase integrity and build trustworthiness?
For example, it could scan the behaviour of people, businesses or AIs and check whether it is consistent with their stated promises, flagging things that are not.
It might also be used to train LLMs to be consistent if they are to be used as agents.
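As a rough illustration of the kind of scan I mean, here is a minimal sketch. The `Promise`/`Action` shapes and the `is_consistent` judge are hypothetical placeholders; in practice the judge might be an LLM prompted to compare the two texts, a rule engine, or a human reviewer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Promise:
    actor: str
    text: str          # e.g. "We will not train models beyond the current frontier"

@dataclass
class Action:
    actor: str
    description: str   # e.g. a line from a public changelog or audit log

def flag_inconsistencies(
    promises: List[Promise],
    actions: List[Action],
    is_consistent: Callable[[str, str], bool],
) -> List[Tuple[Promise, Action]]:
    """Return (promise, action) pairs where behaviour appears to break a promise.
    `is_consistent` is a stand-in for whatever judgement you trust."""
    return [
        (p, a)
        for p in promises
        for a in actions
        if a.actor == p.actor and not is_consistent(p.text, a.description)
    ]
```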
AI might help people generate tests for the key results in OKRs, and publish whether or not they are met.
If the key results are published, this could help with AI pauses by validating that no stories on creating beyond-frontier models have been written or started (assuming that is a key result people care about).
I figured that objectives and key results are how companies maintain internal alignment and avoid internal arms races, so they might be useful for alignment between entities too (perhaps with government-accredited badges for actors that maintain objectives like pausing and responsible data use).
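A minimal sketch of what an automated, publishable key-result check could look like, assuming work items are exported as dicts with `title`/`body` fields; the banned-term list, field names and objective text are all illustrative, not a real standard.

```python
import json

# Illustrative only: a real agreement would define these terms and the
# work-item export format precisely.
BANNED_TERMS = {"beyond-frontier training", "frontier-scale pretraining"}

def key_result_met(stories: list) -> bool:
    """Key result: no story proposing beyond-frontier training has been written or started."""
    return not any(
        term in (story.get("title", "") + " " + story.get("body", "")).lower()
        for story in stories
        for term in BANNED_TERMS
    )

def publish_attestation(stories: list) -> str:
    """Publish only the objective and the pass/fail result, never the work items themselves."""
    return json.dumps({
        "objective": "Maintain the agreed training pause",
        "key_result": "No beyond-frontier training stories written or started",
        "met": key_result_met(stories),
    })

print(publish_attestation([{"title": "Improve eval harness", "body": "routine CI work"}]))
```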
Currently, due to worries about arms races and races to the bottom, people might not share the safest information about AI development. This makes it hard for the public to trust AI development by actors with secret knowledge. One possibility is shadow decision making: giving the secret methods, and the actor's desires, to a third party who makes go/no-go decisions. A second is building trust by building non-AI software in the public interest, with that organisation then being trusted to build AI using secret knowledge. Some mix of the two might be good.
In order to build trust, and to see whether there is information I don't know about, I propose a test: a complementary tool to prediction markets.
Prediction markets are great for aggregating beliefs about known questions. But what about questions you don't know to ask? What about detecting that someone has a frame you haven't encountered? Here's a privacy-preserving way to discover unknown unknowns without revealing what you know or learning what they know.
People build AIs that represent the knowledge they have. These AIs can be trained not to expose that knowledge, and they would operate in environments that don't log their interactions with other AIs. What would be logged is whether the AI went outside the activation distribution of its training set, which would imply some novel argument or possibility that you haven't thought of.
You would then have to be less sure of your ideas around certain subjects, because something is apparently missing from them.
This would come with mutual verification of the system before use: both parties inspect the architecture, so neither can cheat without the other seeing.
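A minimal sketch of the logging rule, assuming you can read an activation vector per exchange from the model. The Gaussian/Mahalanobis detector is just a stand-in for whatever out-of-distribution test the two parties agree on; the point is that only the boolean novelty flag is ever stored.

```python
import numpy as np

class ActivationNoveltyLogger:
    """Logs only whether an exchange pushed the model's activations outside the
    distribution seen during training -- never the content of the exchange."""

    def __init__(self, train_activations: np.ndarray, threshold: float):
        self.mean = train_activations.mean(axis=0)
        cov = np.cov(train_activations, rowvar=False)
        self.inv_cov = np.linalg.pinv(cov)        # pseudo-inverse for numerical stability
        self.threshold = threshold
        self.novelty_log = []                      # booleans only

    def observe(self, activation: np.ndarray) -> bool:
        delta = activation - self.mean
        distance = float(np.sqrt(delta @ self.inv_cov @ delta))
        novel = distance > self.threshold
        self.novelty_log.append(novel)             # content is never stored
        return novel

# Example: an in-distribution activation vs. a far-off one.
rng = np.random.default_rng(0)
logger = ActivationNoveltyLogger(rng.normal(size=(1000, 8)), threshold=5.0)
print(logger.observe(rng.normal(size=8)), logger.observe(np.full(8, 10.0)))
```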
Has anyone been writing evals on computer and network system administration for AI? It seems like something we would want to improve, as it could increase the effort required to take over networks in an AI takeover scenario.
This is a letter I'm thinking about sending to my MP (hence the UK-specific things). I would be interested in other people's take on the problem (a toy sketch of the simulated-environment test idea mentioned in it is included after the letter).
The UK government's creation of the AI Safety Institute, as well as its ambition to 'turbocharge' AI, presents a challenge. To unlock the immense promise of AI without causing dangerous AI arms races, we need to establish international coordination, codes of conduct and cooperative frameworks appropriate for AI. This will require the UK leading on the international stage, having tried things out at a local level.
Why these safety measures around the development of AI have not been established already is an open problem that I have not seen studied. My current hypothesis is that discussion and policy around AI and the future are dominated by LessWrong-style rationalism and neorealism. These philosophies suggest that building international mechanisms for cooperation on AI development is impossible and not worth trying, and that things like codes of conduct for AI engineers are naive.
If there is such a domination, it might be due to a founder effect: rationalism and philosophers like Nick Bostrom have been influential in creating the field of AI safety and AI-based existential risk, and they look at AI from an economically rational point of view. Rationalists also control the large online forum for discussing AI safety and policy, LessWrong, and I'm not aware of others.
If there is a domination by cynical rationalism, it is a problem for two reasons.
The first is that it has not been talked about or researched, so we don't know the scope of the problem or the negative impact it has had on AI safety progress and policy. It might, for example, have caused the lack of progress on UN or other international coordination on AI safety.
Secondly, it can lead to problems if humans (and presumably AIs) adopt the cynical viewpoint. Historically it was this cynical philosophy that led to arms races in nuclear weapons and made the world less safe via the security dilemma, in which states are trapped into making things less secure because of the supposed anarchy at the international level. Applied to AI, this means a potentially dangerous AI arms race, as there will not be time to build the necessary safeguards into AI.
The lack of trust, and the lack of trust in the ability to build trust, is corrosive to life and intelligence.
So what can be done? To start with, this argument suggests researching the AI safety community and seeing how broad it is philosophically. This can then feed into the next step: if the study finds a narrow philosophical base, broadening the research directions might be the way to go, perhaps by funding people doing traditional PhDs, or through research bodies like ARIA. They would then research codes of conduct for engineers and attempt to embed and test values and philosophies inside AI. The philosophies might include care ethics, deontology and virtue ethics.
So instead of just the field of AI alignment, with its hanging question of who to align the AI with, there could be movement towards trying different philosophies embedded in design and seeing empirically which ones work best for humanity. If actors see that the others are working towards non-cynically-rational AI, in a way that is itself not cynically rational, then the pressure to speed up the arms race becomes less.
Testing different philosophies in the real world could be achieved by doing user-centred design of regulatory-mandated tests for broadly deployed AI. The tests would have to be passed before the AI could be released. For example, the AI agents could be simulated in an environment with humans, and how the goals given to them interacted with the simulated humans' goals could be observed. Then the AI trained to pass this test would interact with a diverse set of humans to get feedback on how well the test makes the AI useful and non-harmful. This approach to test design could be tried out at a national or local level before rolling it out internationally.
The UK has shown an ambition to lead the way on AI safety and policy. This could be a way for the UK government to realise that ambition.
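Outside the letter itself: to make the simulated-environment test idea a bit more concrete, here is a toy sketch of a release gate. Everything in it (the resource world, the thresholds, the pass/fail criteria) is an illustrative placeholder for a much richer simulation designed with real users.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass
class SimHuman:
    name: str
    goal_resources: Set[str]   # resources this simulated person needs for their goal

def run_release_test(
    agent_policy: Callable[[Set[str]], Set[str]],
    humans: List[SimHuman],
    shared_resources: Set[str],
    max_hoard_fraction: float = 0.5,
) -> bool:
    """Toy release gate: the candidate agent chooses which shared resources to
    claim, and it fails the test if it makes any simulated human's goal
    unreachable or hoards too large a share of the pool."""
    claimed = agent_policy(shared_resources)
    remaining = shared_resources - claimed

    for h in humans:
        # Resources outside the shared pool are assumed to stay available to the human.
        available = remaining | (h.goal_resources - shared_resources)
        if not h.goal_resources <= available:
            return False          # the agent blocked a human's goal
    if len(claimed) > max_hoard_fraction * len(shared_resources):
        return False              # the agent narrowed everyone's options too much
    return True

# A greedy agent that claims everything fails; a restrained one passes.
humans = [SimHuman("a", {"compute"}), SimHuman("b", {"water"})]
pool = {"compute", "water", "land"}
print(run_release_test(lambda p: set(p), humans, pool))    # False
print(run_release_test(lambda p: {"land"}, humans, pool))  # True
```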
Companies seem to be trapped in a security-dilemma situation: they worry about being last to develop AGI, so they seem to be rushing towards capabilities rather than safety. This is partly due to worries about owning or controlling the future.
Other parts of humanity, such as governments and the scientific community, aren't (at least visibly) rushing, because they aren't completely (economically) rational in that regard; they are more norm- or rule-following. Other ways of not being economically rational include caring for others (or for humanity and nature in general).
We need to embed more rule-following into AI, so it doesn't rush. This might need to be government mandated, as rational companies might not be incentivised to hold back themselves. Government- or internationally-mandated tests in simulated environments, to make sure an AI follows the rules or cares about humanity, might be the way forward.
Caring and rule-following seem different from corrigibility work or from the idea of alignment. A caring AI can have different goals from humanity, but it would still allow or enhance humans' ability to go about their business.
The rules I would look towards would definitely include never modifying the AI's caring code.
Caring could be operationalised by using explainable AI to capture which parts of the network the AI thinks represent humans, and embedding in the AI's goal system something that seeks to increase, or at least not reduce, the options those humans could take.
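A toy sketch of the "preserve the human's options" part, assuming the world can be abstracted as a graph of states the human can move between; the graph, the scoring and the idea of counting reachable states are illustrative stand-ins for whatever option measure you would actually use.

```python
from typing import Dict, Set

# Toy world: states are labels, and the human can move between them along these
# edges. Counting reachable states stands in for "the options this human could take".
WorldGraph = Dict[str, Set[str]]

def reachable_states(graph: WorldGraph, start: str) -> Set[str]:
    """All states the human could still reach from `start`."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for nxt in graph.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def option_preservation_score(before: WorldGraph, after: WorldGraph, human_state: str) -> int:
    """Positive if the AI's action opened up options for the human, negative if
    it closed them off; a caring goal system would penalise negative scores."""
    return len(reachable_states(after, human_state)) - len(reachable_states(before, human_state))

# Example: an action that removes the route from "home" to "market" scores -1.
before = {"home": {"market", "park"}, "market": set(), "park": set()}
after = {"home": {"park"}, "market": set(), "park": set()}
print(option_preservation_score(before, after, "home"))  # -1
```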
There is a reason why the arms race towards more and bigger nuclear weapons stopped and hasn't started again. Why do people think that is, and can we use that understanding to stop the arms race (to get there first and control the future) around AI?
Has there been any work on representation under extreme information asymmetry?
I'm thinking of something like AIs trained to make the same decisions as you would, which are then given the secret or info-hazardous material and make governance decisions on your behalf, to avoid information leakage.
I’ve been thinking about problems with mind copying and democracy with uploads.
In order to avoid Sybil attacks you might want to implement something like compressibility weighting: if a voter has lots of similarity to other voters, its vote is not weighted very much.
Otherwise you get a race to the bottom towards viewpoints that might not capture the richness of humanity (there is pressure to simplify the thing being copied so as to get more copies of it for a given amount of compute).
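A toy sketch of compressibility weighting using normalised compression distance, with voters stood in for by byte strings (in reality, some serialisation of the upload or its ballot); `zlib` is just a convenient compressor for illustration.

```python
import zlib

def c(x: bytes) -> int:
    return len(zlib.compress(x))

def ncd(a: bytes, b: bytes) -> float:
    """Normalised compression distance: near 0 for near-duplicates, near 1 for unrelated data."""
    return (c(a + b) - min(c(a), c(b))) / max(c(a), c(b))

def vote_weights(voters: list) -> list:
    """Toy compressibility weighting: each voter's weight is its average distance
    to every other voter, so N near-identical copies end up sharing roughly one
    distinct voter's worth of influence between them."""
    weights = []
    for i, v in enumerate(voters):
        others = [ncd(v, u) for j, u in enumerate(voters) if j != i]
        weights.append(sum(others) / len(others) if others else 1.0)
    return weights

# Two identical copies each get a low weight relative to the distinct third voter.
voters = [
    b"maximise paperclips forever" * 20,
    b"maximise paperclips forever" * 20,
    b"preserve a diverse, flourishing humanity" * 20,
]
print(vote_weights(voters))
```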
The trick is to separate your important traits from the unimportant ones, and change the unimportant ones randomly (e.g. randomly choose your new favorite color), so that you increase the psychological diversity of your movement without endangering its goals.
If your view of the problem is very complex, you might still lose out and get compressed easily, as there would be lots of mutual information between the copies.
I've been thinking a lot about identity (as in pg's "Keep Your Identity Small").
Specifically, which identities might lead to the safe development of AI, and trying to validate that by running these activities:
1. Role playing games where the participants are asked to take on specific identities and play through a scenario where AI has to be created.
2. Similar things where LLMs are prompted to take on particular roles and given agency to play in the role playing games too.
Has there been similar work before?
I’m particularly interested in cosmic identity, where you see humanity as a small part of a wider cosmos, including potentially hostile and potentially useful aliens. It has a number of properties that I think make it interesting, which I’ll discuss in a full post, if people think this is worth exploring.
Are there identities that people think should be explored too?
The cosmic identity and related issues have been considered and I even used them to make a conjecture about alignment. As for role-playing games, I doubt that they are actually useful. Unless, of course, you mean something like Cannell’s proposal.
As for "the idea of arms races and the treacherous turn", the AI-2027 team isn't worried about such a risk; they are more worried about the race itself causing humans to do worse safety checks.
But slightly irrational actors might not race (especially if they know that other actors are slightly irrational in the same or a compatible way).
I think there might be perverse incentives if identities or viewpoints get promoted in a legible fashion: incentives to hack that system rather than to do useful work.
So it might be good for identity promotion to be done in a way that is obfuscated or ineffable.
I'm working on an AI-powered tool to explore making decisions in complex fictional worlds, with the hope that it will translate into making better decisions when faced with real ones.
It's still at the alpha stage; give me a shout if you are curious and want a link.
The post-AI period should have a fund that helps people who have been put in a bad position during the development of AI but can't talk about it for info-hazard reasons.
If the info hazard hasn't passed, this might have to be done illegibly, to avoid leaking the existence of people holding info hazards.