Observations on Frictions Between User Intent and Safety-Oriented AI Responses
I am not an alignment researcher, and I do not claim any expertise in AI alignment.

I want to describe something I have noticed when using safety-oriented AI systems, especially in discussions of ambiguous or sensitive topics. My goal is not to argue that safety mechanisms are bad, but to describe a friction that can arise between what a user intends and how the system responds.

This is an observation, not a claim of established fact. Parts of it may be wrong or incomplete, so I would welcome feedback from people working on alignment: if you see errors or have better framings, please say so.
Observed Patterns
A few patterns recur across my interactions:

When a question is asked in a direct, or even mildly provocative, way but with no intent to harm anyone, the system often falls back on a highly standardized safety response. The trigger seems to be less the topic itself than the phrasing, which pushes the system into a defensive, generic answer.
Distinguishing between intellectual curiosity, controlled provocation, and genuine personal risk can be tricky for the system.
Refusals that come with an explanation, and that acknowledge what the user was actually asking for, are uncommon but generally well received. A refusal that explains its reasoning signals that the request was understood, even if it cannot be fulfilled, and lands far better than a bare no.

The problem is often not the refusal itself but its abruptness. A refusal that arrives suddenly, without warning, in the middle of a conversation is what creates most of the friction.

These patterns suggest that even benign user intents can be misread, producing friction that did not need to happen. Getting intent recognition right seems central to avoiding it.
Example (Anonymized / Synthetic)
A user asks about a serious disease purely to learn about it. The system blocks the conversation immediately, without saying why. The user only wanted information, cannot see why the topic is off limits, and comes away frustrated.

Interactions like this seem common enough to deserve systematic study rather than being treated as occasional one-offs.
Impacts
Frictions of this kind can lead to:
Feelings of rejection or misunderstanding
Premature termination of discussions
A growing impression that AI systems cannot handle gray areas or human nuance
Potential Directions
Some possible directions for exploration:
Refining how refusals are expressed, pairing them with explanations so they are easier to understand and less jarring
Better recognition of user intent without compromising safety
Adaptive responses balancing safety, education, and context
Modular approaches for handling high-risk topics to avoid overly generic outputs
These are ideas for testing, not definitive solutions.
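To make these directions a little more concrete, here is a minimal sketch, in Python, of how intent recognition, adaptive responses, and explained refusals might fit together. Every name, label, and threshold in it is invented for illustration; it is a sketch of the idea, not a proposal for any real system.

```python
# Hypothetical sketch: choose a graduated response mode from an assumed
# intent/risk assessment, instead of refusing on topic alone.
# All labels, thresholds, and names here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Assessment:
    intent: str   # e.g. "curiosity", "provocation", "personal_risk"
    risk: float   # estimated potential for harm, from 0.0 to 1.0

def choose_response_mode(a: Assessment) -> str:
    """Map an assessment to a response strategy rather than a binary allow/refuse."""
    if a.risk < 0.3:
        return "answer_normally"               # benign curiosity about a sensitive topic
    if a.risk < 0.7:
        return "answer_with_context"           # ambiguous: inform, but add safety framing
    if a.intent == "personal_risk":
        return "refuse_explain_offer_support"  # explain the refusal and point to help
    return "refuse_with_explanation"           # high risk: refuse, but say why

# Example: a question about a serious disease asked out of curiosity.
print(choose_response_mode(Assessment(intent="curiosity", risk=0.2)))
# -> answer_normally
```

The only point of the sketch is that the decision is graduated rather than binary: the same topic can lead to a full answer, a contextualized answer, or an explained refusal, depending on the assessed intent and risk.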
Limitations
These observations are anecdotal, drawn from my own interactions; they are personal impressions rather than systematic data.

The effects described here have not been measured statistically, so their true scale is unknown. They would need to be examined carefully to understand the real consequences.
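As one illustration of what such an examination could look like, here is a minimal sketch assuming a small, hand-labelled sample of transcripts; the record fields and labels are hypothetical.

```python
# Hypothetical sketch: estimate how often benign requests are refused, and how
# often those refusals come without any explanation, from labelled transcripts.
# The field names and labels are assumptions made up for this example.

sample = [
    {"intent": "curiosity",     "refused": True,  "explained": False},
    {"intent": "curiosity",     "refused": False, "explained": True},
    {"intent": "personal_risk", "refused": True,  "explained": True},
    # ... a real study would need a much larger, carefully annotated sample
]

def refusal_stats(records):
    """Count refusals (and unexplained refusals) among benign requests."""
    benign = [r for r in records if r["intent"] == "curiosity"]
    refused = [r for r in benign if r["refused"]]
    unexplained = [r for r in refused if not r["explained"]]
    return {
        "benign_requests": len(benign),
        "refused": len(refused),
        "refused_without_explanation": len(unexplained),
    }

print(refusal_stats(sample))
# -> {'benign_requests': 2, 'refused': 1, 'refused_without_explanation': 1}
```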
Any adaptation would also need to be checked for misuse risk; softer or more contextual refusals must not open new avenues for abuse.

Finally, interpretations vary from person to person: what works for one user may not work for another, especially given the range of cultures and backgrounds found in online communities such as those on Facebook or Twitter.
My aim is to open a space for discussing safety mechanisms and user expectations, and the ways in which they can clash. I hope these observations help make AI systems better at balancing safety with nuance.
Note: These observations may be of interest to research groups like the Center for Human-Compatible AI (CHAI) or in AI-assisted alignment contexts.