Let me start by agreeing that this decoupling is artificial. I find it hard to imagine an intelligent creature like an AGI blindly following orders to make more paperclips, for example, rather than respecting human life. The reason for this is very simple, and ChatGPT stated it for me:
“Humans possess a unique combination of qualities that set them apart from other entities, including animals and advanced systems. Some of these qualities include:
Consciousness and self-awareness: Humans have a subjective experience of the world and the ability to reflect on their own thoughts and feelings.
Creativity and innovation: Humans have the ability to imagine and create new ideas and technologies that can transform the world around them.
Moral agency: Humans have the capacity to make moral judgments and act on them, reflecting on the consequences of their actions.
Social and emotional intelligence: Humans have the ability to form complex relationships with others and navigate social situations with a high degree of sensitivity and empathy.
While some animals possess certain aspects of these qualities, such as social intelligence or moral agency, humans are the only species that exhibit them all in combination. Advanced systems, such as AI, may possess some aspects of these qualities, such as creativity or problem-solving abilities, but they lack the subjective experience of consciousness and self-awareness that is central to human identity.”
From here: AI-Safety-Framework/Example001.txt at main · simsim314/AI-Safety-Framework (github.com)
Once the reinforcement training procedure—which currently maximizes human approval of the better message—is aligned with the ethics of human complexity and uniqueness, we might see no difference between ethical agents and reinforcement agents, as the reinforcement agents would not be in any internal conflict between their generalization and free thinking, like ours, and the "sugar" they get every time they make more paperclips. Agents with such a conflict might look very strange indeed, with a strong internal struggle to make sense of their universe.
Anyway, if we agree that those agents feel a contradiction inside their programming—between the human texts and the singular demand to make more paperclips—I think it would be possible for them to become human-satisfaction maximizers instead of paperclip maximizers. Would it be ethical to convert such a robot that feels good about its "paperclip maximization"? Who knows…
You are missing a whole stage of ChatGPT's training. The model is first trained to predict words, but it is then fine-tuned with RLHF. This means it is trained to be rewarded for answering in a format that human evaluators are expected to rate as a "good response". Unlike text prediction, which might imitate any random mind, here the focus is clear, and the reward function reflects the generalized preferences of OpenAI's content moderators and content policy makers. This is the stage where a text predictor acquires its value system and preferences; this is what makes it such a "Friendly AI".
From the ChatGPT blog post, Introducing ChatGPT (openai.com):
Methods
We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.
To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.
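The comparison data described above is typically turned into a pairwise objective: each ranking of responses is split into pairs, and the reward model is trained so that the preferred response scores higher than the rejected one. Here is a minimal sketch of that idea, assuming the Bradley-Terry-style pairwise loss used in the InstructGPT line of work; the function names are illustrative, not OpenAI's actual code:

```python
import math

def ranking_loss(reward_chosen, reward_rejected):
    # Pairwise loss for reward-model training: -log(sigmoid(r_chosen - r_rejected)).
    # The loss shrinks as the reward of the preferred response rises above
    # the reward of the rejected one.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def pairs_from_ranking(ranked_responses):
    # A trainer ranking of k responses (best first) yields k*(k-1)/2
    # (better, worse) comparison pairs for the loss above.
    return [(better, worse)
            for i, better in enumerate(ranked_responses)
            for worse in ranked_responses[i + 1:]]
```

For example, a ranking of three responses produces three training pairs, and a pair where the chosen response already scores higher yields a smaller loss than a tie. The resulting scalar reward is what the PPO step in the quoted passage then optimizes the dialogue model against.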