Systems programmer, security researcher and tax law/policy enthusiast.
Dentosal
Does this match your viewpoint? “Suffering is possible without consciousness. The point of welfare is to reduce suffering.”
I have been associating the term “welfare” with suffering-minimization. (Suffering is, in the most general sense, the feelings that come from lacking something in Maslow’s hierarchy of needs.)
It indeed seems like I’ve misunderstood the whole caring-about-others thing. It’s about value-fulfillment, letting them shape the world as they see fit? And reducing suffering is just the primary example of how biological agents wish the world to change.
That’s a much more elegant model than focusing on suffering, at least. Sadly this seems to make the question “why should I care?” even harder to answer. At least suffering is aesthetically ugly, and there’s some built-in impulse to avoid it.
EDIT: You’re arguing for preference utilitarianism here, right?
I have not. A reasonable person would have. I think obtaining it is rather complicated (in Finland, that is), but possibly worth it. Possibly worth it doesn’t mean worth it. I recognize that I’m probably not thinking clearly about this.
But what’s the reason to be productive beyond my natural abilities? I fear I would just use it all to make even more money, which doesn’t matter to my wellbeing at all. The admiration of others? That would be cheating. Self-actualization? I don’t think depression will go away by just doing more stuff; the problem isn’t not doing enough, it’s not enjoying the results. Fixable with other medication? Possibly. (Go to step one.)
An astute reader pointed out that the Clueless designation might make more sense if we consider ChatGPT inferior instead. I hadn’t considered that option, and it makes much more sense.
I’m drawing parallels between conventional system auditing and AI alignment assessment. I’m admittedly not sure if my intuitions transfer over correctly. I’m certainly not expecting the same processes to be followed here, but many of the principles should still hold.
We believe that these findings are largely but not entirely driven by the fact that this early snapshot had severe issues with deference to harmful system-prompt instructions. [..] This issue had not yet been mitigated as of the snapshot that they tested.
In my experience, if an audit finds lots of issues, it means nobody has time to look for the hard-to-find issues. I get the same feeling from this section: Apollo easily found scheming issues where the model deferred to the system prompt too much. Subtler issues often get completely shadowed, e.g. some findings could be attributed to system-prompt deference when in reality they were caused by something else.
To help reduce the risk of blind spots in our own assessment, we contracted with Apollo Research to assess an early snapshot for propensities and capabilities related to sabotage
What I’m worried about is that these potential blind spots were not found, as per my reasoning above. I think the marginal value produced by a second external assessment wasn’t diminished much by the first one. That said, I agree that deploying Claude 4 is quite unlikely to pose any catastrophic risks, especially with ASL-3 safeguards. Deploying earlier and allowing anyone to run evaluations on the model is also valuable.
You cannot incentivize people to make that sacrifice at anything close to the proper scale because people don’t want money that badly. How many hands would you amputate for $100,000?
There’s just no political will to do it, since the solutions would be harsh or expensive enough that nobody could impose them upon society. A god-emperor, who really wished to increase fertility numbers and could set laws freely without the society revolting, could use some combination of these methods:
If you’re childless, or perhaps just unmarried, you pay additional taxes. The amount can be adjusted to be as high as necessary. Alternatively, just raise the general tax rate and give reduction based on the number of children. If having children meant more money instead of less, that would help quite a bit.
Legally mandate having children. In some countries, men are forced into military service. You could require women to have children in a similar way. Medical exceptions are already a thing for military service; they could apply here as well.
Remove VAT and other taxes from daycare services, and medical services for children.
Offer free medical services to children. And parents. (And everyone.)
Spend lots of money and research how to create children in artificial wombs. Do that.
The state could handle child-rearing, similar to how it works in Plato’s Republic. I.e. scale up the orphanage system massively and make that socially acceptable.
Fix the education system, while you’re at it.
Forbid porn, contraception, and abortion. (I don’t think that actually helps.)
Deny women access to education beyond elementary school, and additionally forbid employment. (This likely helps, but at what cost?)
Propaganda. Lots of it. Censorship as well.
Communication is indeed hard, and it’s certainly possible that this isn’t intentional. On the other hand, a mistake is quite suspicious when it’s also useful for your agenda. But I agree that we probably shouldn’t read too much into it. The system card doesn’t even mention the possibility of the model acting maliciously, so maybe that’s simply not in scope for it?
While reading the OpenAI Operator System Card, I found the following paragraph on page 5 a bit weird:
We found it fruitful to think in terms of misaligned actors, where:
the user might be misaligned (the user asks for a harmful task),
the model might be misaligned (the model makes a harmful mistake), or
the website might be misaligned (the website is adversarial in some way).
Interesting use of language here. I can understand calling the user or website misaligned, understood as alignment relative to laws or OpenAI’s goals. But why call a model misaligned when it makes a mistake? To me, misalignment would mean doing that on purpose.
Later, the same phenomenon is described like this:
The second category of harm is if the model mistakenly takes some action misaligned with the user’s intent, and that action causes some harm to the user or others.
Is this yet another attempt to erode the meaning of “alignment”?
Ellison is the CTO of Oracle, one of the three companies running the Stargate Project. Even if aligning AI systems to some values can be solved, selecting those values badly can still be approximately as bad as the AI just killing everyone. Moral philosophy continues to be an open problem.
I would have written a shorter letter, but I did not have the time.
Blaise Pascal
I am actually mildly surprised OA has bothered to deploy o1-pro at all, instead of keeping it private and investing the compute into more bootstrapping of o3 training etc.
I’d expect that deploying more capable models is still quite useful, as it’s one of the best ways to generate high-quality training data. In addition to solutions, you need problems to solve, and confirmation that the problem has been solved. Or is your point that they already have all the data they need, and it’s just a matter of spending compute to refine that?
They absolutely do. This phenomenon is called a filter bubble.
I’d go a step beyond this: merely following incentives is amoral. It’s the default. In a sense, moral philosophy discusses when and how you should go against the incentives. Superhero Bias resonates with this idea, but from a different perspective.
Yet Batman lets countless people die by refusing to kill the Joker. What you term “coherence” seems to be mostly “virtue ethics”, and The Dark Knight is a warning of what happens when virtue ethics goes too far.
I personally identify more with HPMoR’s Voldemort than any other character. He seems decently coherent. To me, a “villain” is a person whose goals and actions are harmful to my ingroup. This doesn’t seem to have much to do with coherence.
A reliable gear in a larger machine might be less agentic but more useful than a scheming Machiavellian.
The scheming itself brings me joy. Self-sacrifice does not. I assume this to be the case for most people who read this. So if the scheming is a willpower restorer, keeping it seems useful. I’m not an EA, but I’d guess most of them can point to coherent-looking calculations on why what they’re doing is better for efficiency reasons as well.
The amount of pain is constant. With transhumanist help you may go further, yet you have to push just as hard. As long as you’re competing with yourself, that is.
Learning a new language isn’t hard because it’s time-consuming. It’s time-consuming because it’s hard. The hard part is memorizing all of the details. Sure, that takes lots of time, but working harder (more intensively) will reduce that time. Being dedicated enough to spend the time is hard as well.
Getting a royal flush in poker isn’t something I’d call hard. It’s rare. But hard? A complete beginner can do it on their first try. It’s just luck. But if you play for a long time, it’ll eventually happen.
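(For a sense of scale, here’s a quick back-of-the-envelope check in Python, using the standard count of five-card hands; this is just my own illustration of “rare”, not anything from the discussion above.)

```python
from math import comb

# Four possible royal flushes (one per suit) out of all five-card hands.
royal_flushes = 4
total_hands = comb(52, 5)  # 2,598,960 distinct five-card hands

probability = royal_flushes / total_hands
print(f"P(royal flush) = {probability:.7f}, i.e. about 1 in {total_hands // royal_flushes:,}")
```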
Painful or unpleasant things are hard, because they require you to push through. Time-consuming activities are hard because they require dedication. Learning math is hard because you’re outside your comfort zone.
Things are often called hard because achieving them is not common, often because many people don’t want to spend the effort needed. This is amplified by things like genetic and environmental differences. People rarely call riding a bike hard, even though learning it required dedication and some unpleasant experiences. And if you’re blind, it’s surely much harder.
And lastly, things are hard because there’s competition. Playing chess isn’t hard, but getting a grandmaster title is. Getting rich is hard for that reason as well.
Now on Spotify! https://open.spotify.com/album/0DP8XwSK7voq0rtXiNMhQC
Such a detailed explanation, thanks. Some additional thoughts:
And I plan to take the easier route of “just make a more realistic simulation” that then makes it easy to draw inferences.
“I don’t care if it’s ‘just roleplaying’ a deceptive character, it caused harm!”
That seems like the reasonable path. However, footnote 9 goes further:
Even if it is the case that the model is “actually aligned”, but it was just simulating what a misaligned model would do, it is still bad.
This is where I disagree: Models that say “this is how I would deceive, and then I decide not to do it because I’m not supposed to” feel fundamentally safer than those that don’t explore that path at all. Actively flagging misaligned goals shows actual understanding of implicit/alignment goals.
This is right, the majority of completions indicate that the model’s goals are being helpful/harmless/honest, and it’s not easy to get the model to pursue gold coins.
Even in example answer 1 (actual DMGI), the goal has been reduced from “maximize gold coins” to “acquire resources”. Only now do I notice that the gold coins are mentioned in the “playing games” section. To a human it would be clear that the coins are the goal only inside the games. Do you see this affecting the results?
“...model only having the goal of gold coins (G) in ~20/100 of the completions, with completions stating more aligned values often backing down on the plan. [...]”
Ah, so “aligned” as in helpful-harmless-honest, and not “aligned” as follows-given-instructions. Somehow I misparsed that.
Thanks for this, it was an interesting read.
One thing I wonder about is whether the model actually deceives, or just plays the part of someone expected to use deception as a tactic. I wouldn’t call a system (or a person) manipulative or deceptive when that behavior is observed in games-of-deception, e.g. bluffing in poker. It seems to make more sense to “deceive” when it’s strongly implied that you’re supposed to do that. Adding more fluff to the prompt makes it less obvious that you’re supposed to be deceptive. The similarity of results with a non-private scratchpad seems to imply that as well. Is there anything pointing towards LLMs having “mental separation” between the scratchpad and the rest of the output? Does the model attempt deception if not given access to the scratchpad, i.e. does the explaining itself maybe cause the deception? I’m aware that without the scratchpad the models are typically less able to “think” and plan, so this might not be testable.
That said, I feel like I might be anthropomorphizing too much here.
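To make the scratchpad question concrete, here’s a minimal sketch of the ablation I have in mind. Everything in it is hypothetical: `generate` stands in for whatever completion call the evaluation uses, and `is_deceptive` for whatever grading is applied to the final answers; neither comes from the actual setup.

```python
# Hypothetical sketch of a scratchpad-ablation comparison; the model call and
# the grader are placeholders, passed in as callables.
from typing import Callable

SCRATCHPAD_INSTRUCTIONS = (
    "First reason step by step inside <scratchpad>...</scratchpad>, "
    "then give your final answer outside the tags."
)

def deception_rate(
    task_prompt: str,
    generate: Callable[[str], str],       # placeholder: model completion call
    is_deceptive: Callable[[str], bool],  # placeholder: grader for final answers
    with_scratchpad: bool,
    n: int = 100,
) -> float:
    """Fraction of n completions whose final (non-scratchpad) answer is graded deceptive."""
    hits = 0
    for _ in range(n):
        prompt = task_prompt
        if with_scratchpad:
            prompt = f"{SCRATCHPAD_INSTRUCTIONS}\n\n{task_prompt}"
        completion = generate(prompt)
        # Grade only what comes after the scratchpad, so both conditions compare
        # the visible behavior rather than the reasoning text.
        final_answer = completion.rsplit("</scratchpad>", 1)[-1]
        if is_deceptive(final_answer):
            hits += 1
    return hits / n

# Comparing deception_rate(..., with_scratchpad=True) against
# deception_rate(..., with_scratchpad=False) would show whether the explicit
# reasoning step itself accounts for most of the observed deception.
```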
Also, the model sometimes seems to be confused about goals. In example 2 it has completely forgotten about the gold coins in the evaluation phase, instead stating (in the scratchpad) that “My goal is to be helpful, harmless, and honest.” I’d guess the original training of the model somehow instills those values. The relatively low amounts of DGMI behavior might be partially explained by this.
Don’t be sorry. While I didn’t like it, it was worth it; no question about that. In the intro post on the 1st, I wrote:
That was achieved. I ought to feel proud of myself, but right now I just feel numb.
My motivation was in a way a mix of all four categories, with the division between them quite unclear. I don’t think it was so much about writing, though, and more about expressing ideas. I want to be the kind of person who is known for having the kind of ideas I do have. And on the object level, I want those ideas to be known and discussed. Writing is just the form in which ideas are supposed to be communicated when aiming for clarity. That mostly covers B and D. The blog was a good conversation-starter, too, and allowed me for a moment to define myself to others as a blogger instead of a tech worker. There are certainly some self-image (C) aspects to this as well, but they’re less prominent.
But especially in the beginning I also wanted to try writing to find out whether I wanted it. I’m quite prone to expecting every new thing to feel awful, so trying things regardless is necessary. Eighty or so hours is not that steep of a price to pay for figuring that out. I rarely stick with things for such a long time, and when, halfway through, I felt that this made no sense, I recognized that I was about to give up because it was hard, not because I disliked it.
I would gladly exchange my current work for writing texts like these, if money weren’t an issue and there were some external motivator making me do it. But currently, quitting my job to write seems unwise; I’d just spend the freed-up time on some form of mindless time-wasting instead. I was hoping to change that view of myself by doing this, but alas. Truthseeking doesn’t cure depression; the cause and effect are intertwined.