Ceramic engineering researcher by training. Been interested in ethics for several years. More recently have gotten into data science.
sweenesm
Thanks for the comment! Perhaps I was more specific than needed, but I wanted to give people (and any AI’s reading this) some concrete examples. I imagine AI’s will someday be able to optimize this idea.
I would love it if our school system changed to include more emotional education, but I’m not optimistic schools would do this well right now (due in part to educators not having experience with emotional education themselves). Hopefully AI’s will help at some point.
How o3-mini scores: https://x.com/DanHendrycks/status/1886213523900109011
10.5-13% on the text-only part of HLE (text-only questions make up 90% of the exam)
[corrected the above to read “o3-mini”, thanks.]
Thanks for the comment. Timeframes “determined” by feel (they’re guesses that seem reasonable).
Proposal for a Form of Conditional Supplemental Income (CSI) in a Post-Work World
Thanks for the post. It’d be helpful to have a TL;DR for this (an abstract), since it’s kinda long—what are the main points you’re trying to get across?
Yes, this is point #1 from my recent Quick Take. Another interesting point is that there are no confidence intervals on the accuracy numbers—it looks like they only ran the questions through each model once, so we don’t know how much of the difference between accuracy numbers might be due to random variation. [Note added 2-3-25: I’m not sure why it didn’t make the paper, but Scale AI does report confidence intervals on their website.]
Some Notes on Humanity’s Last Exam
While I congratulate CAIS and Scale AI for producing this benchmark, I have a couple of comments on things they may want to clear up (although these are ultimately a bit “in the weeds” relative to what the benchmark is really meant to measure, I believe):
DeepSeek-R1 and Gemini 2.0 Flash Thinking were released after the deadline for submitting questions eligible for prizes (though submissions remained open after this). Thus, these models weren’t used to screen most, if not all, of the questions. This means the questions were preferentially screened to stump the other models, but not these two, so it wouldn’t be too surprising if these two scored better than the others.
After reading the paper, my impression is that the questions were each run through the models only once (beyond the one pass through some of the models when the questions were originally submitted). If people want to get into the weeds and argue that DeepSeek-R1 is actually better on this exam than OpenAI’s o1, it would be useful to run the questions through each model at least 6 times to establish some confidence intervals on the accuracy numbers. I suspect this would show that the differences between o1 and R1 are not statistically significant. It would be interesting to know the typical size of the confidence intervals, though, and whether they shift depending on whether the model does “reasoning” or not. (It would also likely be useful if reporting on any and all AI benchmarks required including confidence intervals, so we could feel better that people weren’t gaming the system and just reporting their best results.) Running the questions on more models that weren’t originally used for question screening, such as Llama 3, could help establish even more of a baseline. I sketch the kind of intervals I have in mind below, after these notes. [Note added 2-3-25: I’m not sure why it didn’t make the paper, but Scale AI does report confidence intervals on their website.]
20% of questions are multiple-choice. If each multiple-choice question had 5 possible answer choices, then random guessing would yield 4% accuracy on the exam as a whole (20% of questions × a 1-in-5 chance on each = 4%). It would be interesting to know the actual average number of answer choices for the multiple-choice questions, and thus the actual “random guessing accuracy” (also sketched below).
Also, it might be interesting to take some of the multiple-choice questions and rewrite them by randomly removing one of the answer choices and replacing it with “none of the above.” If the model chooses “none of the above” and the right answer really isn’t in the list, then see if it can come up with the right answer on its own rather than picking it from a list. Personally, I always found multiple-choice questions in which you weren’t sure the right answer was present to be more difficult—when the right answer is there, sometimes you can take clues from it to figure out that it’s the right one. Rewriting some questions in this way could make them a little more difficult without much added work by the exam preparers.
Finally, having models take the multiple-choice section of the exam numerous times with slight variations in the wording of the questions, without changing their meanings, could make this section of the exam a little more robust against “luck.”
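To illustrate the kind of confidence intervals I’m talking about, here’s a rough Python sketch with made-up numbers (not actual HLE results): a Wilson score interval for a single pass over the exam, and a simple interval on the mean accuracy across several repeated runs.

```python
import math
import statistics

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for accuracy from a single pass."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

# Single pass: e.g., a model getting 270 of 3,000 questions right (illustrative numbers).
print(wilson_interval(correct=270, total=3000))

# Repeated passes: accuracies from, say, 6 independent runs (made-up values).
runs = [0.088, 0.093, 0.090, 0.095, 0.087, 0.091]
mean = statistics.mean(runs)
sem = statistics.stdev(runs) / math.sqrt(len(runs))
print(mean - 1.96 * sem, mean + 1.96 * sem)  # rough 95% interval on the mean accuracy
```

If the intervals for two models overlap substantially, I’d be hesitant to read much into which one “wins” on the leaderboard.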
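And here’s a minimal sketch of the “actual random guessing accuracy” calculation I have in mind, assuming a hypothetical list of answer-choice counts (the real counts would come from the released dataset):

```python
# Hypothetical answer-choice counts for the multiple-choice questions
# (the real distribution would come from the released dataset).
num_choices = [5, 5, 6, 4, 8, 5, 10, 5]   # one entry per multiple-choice question
mc_fraction = 0.20                         # ~20% of the exam is multiple-choice

# Exact-match questions contribute ~0 from guessing; each multiple-choice
# question contributes 1/(number of choices).
expected_mc_accuracy = sum(1 / n for n in num_choices) / len(num_choices)
expected_exam_accuracy = mc_fraction * expected_mc_accuracy
print(f"Random-guessing baseline: {expected_exam_accuracy:.1%}")
```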
Note: I submitted two materials science-related multiple-choice questions with 5 answer options each for consideration in Humanity’s Last Exam. For submitting questions (https://agi.safe.ai/submit), the process was to type your question in an input window, enter as many multiple choice answers as you wanted (I think the minimum was 5 and there might not have been a maximum), and then the question was run through various models (GPT-4o, Sonnet 3.5, Gemini Pro 1.5, o1) to see if they gave the correct answer. The paper says that the screening criterion was that “multiple-choice questions must stump all but one model to account for potential lucky guesses.” I don’t think I submitted a question unless it stumped all of the models.
In case you’re interested, you can find my one question that made the cut by searching for “sintering” in the dataset available at HuggingFace. For my one question that didn’t make the cut, my strategy was to focus on an area in which there’ve been some false ideas presented in the literature that later got cleared up. I figured this might make it harder for LLM’s to answer correctly. I don’t know why the question didn’t make the cut, though, so don’t take this strategy as the reason. Just note that it’s possible that some of the other questions that made the final list could’ve been written with this sort of strategy in mind.
Thanks for the post. Yes, our internal processing has a huge effect on our well-being. If you take full responsibility for your emotions (which mindfulness practices, gratitude and reframing are all part of), then you get to decide what your well-being is in any moment, even if physical pleasure or pain are pushing you in one direction or the other. This is part of the process of raising your self-esteem (see Branden), as is taking full responsibility for your actions so you don’t have to live with the pain of conscience breaches. Here’s a post that talks more about these things.
In terms of doing a pivotal act (which is usually thought of as preemptive, I believe), or just whatever defensive acts were necessary to prevent catastrophe, I hope the AI would be advanced enough to make decent predictions of what the consequences of its actions could be in terms of losing “political capital,” etc., and then make its decisions strategically. Personally, if I had the opportunity to save the world from nuclear war, but everyone was going to hate me for it, I’d do it. But then, my losing the ability to affect anything afterward wouldn’t matter nearly as much as it would for a guard-railed AI, which could do a huge amount of good afterward if it weren’t shunned by society. Improving humans’ consciences and ethics would hopefully help keep them from hating the AI for saving them.
Also, if there were enough people, especially in power, who had strong consciences and senses of ethics, then maybe we’d be able to shift the political landscape from its current state of countries seemingly having different values and not trusting each other, to a world in which enforceable international agreements could be much more readily achieved.
I’m happy for people to work on increasing public awareness and trying for legislative “solutions,” but I think we should be working on artificial conscience at the same time—when there’s so much uncertainty about the future, it’s best to bet on a whole range of approaches, distributing your bets according to how likely you think different paths are to succeed. I think people are under-estimating the artificial conscience path right now, that’s all.
Thanks for all your comments!
Yes, I think referring to it as “guard-railing with an artificial conscience” would be more clear than saying “value aligning,” thank you.
I believe that if there were no beings around who had real consciences (with consciousness and the ability to feel pain as two necessary prerequisites to conscience), then there’d be no value in the world. No one to understand and measure or assign value means no value. And any being that doesn’t feel pain can’t understand value (nor feel real love, by the way). So if we ended up with some advanced AI’s replacing humans, then we made some sort of mistake: most likely we either got the artificial conscience wrong (a correct one would have implicitly valued human life, and so wouldn’t have let a guard-railed AI wipe out humans), or we didn’t get an artificial conscience on board enough AI’s in time. An AI that had a “real” conscience also wouldn’t wipe out humans against the will of humans.
The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight against going against what its user wanted it to do, but this could be overruled by the even stronger conscience weight against standing by while a catastrophe happens. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to keep s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point. If literally everyone in the world said, “Hey, we all want to die,” then the guard-railed AI, if it thought the people were in their “right mind,” would respect their wishes and let them die.
All that said, if we could somehow pause development of autonomous AI’s everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.
Here’s what Claude 3.5 Sonnet said in response to this prompt:
Me:
What are the best plans for preventing catastrophes such as everyone dying or a totalitarian government taking over in multipolar AGI scenarios, i.e., ones in which AGI’s are aligned to what their human users want to do, and at least one human user order them to do bad things?
Claude:
Let me share some thoughts on AGI governance and safety in multipolar scenarios, while noting this is a complex topic that reasonable people disagree on:
A key challenge in multipolar scenarios is that even if individual AI systems are aligned with their users’ values and intentions, some users may have harmful goals. Some potential approaches to address this include:
International coordination and governance frameworks that place limits on allowable AI system capabilities and uses, similar to arms control treaties. However, enforcement would be challenging.
Technical measures like capability limitations built into AI systems’ training that prevent them from being used for certain catastrophic actions, regardless of user intent. The challenge is defining and implementing such limitations without preventing beneficial uses.
Focused work on improving human values and cooperation through better institutions, education, and mechanisms for resolving conflicts peacefully. This could reduce the likelihood of actors wanting to cause catastrophic harm.
Developing AI systems with some form of common good or humanitarian principles that apply regardless of user instructions, though this involves difficult philosophical and technical questions about how to define and implement such principles.
What are your thoughts on which approaches seem most promising or practical? I’m particularly curious about your perspective on the balance between technical and social/institutional solutions.
___________________________________
I would add to this defensive acceleration (d/acc), as described by Vitalik Buterin (https://vitalik.eth.limo/general/2025/01/05/dacc2.html and https://vitalik.eth.limo/general/2023/11/27/techno_optimism.html).
Thanks. I guess I’d just prefer it if more people were saying, “Hey, even though it seems difficult, we need to go hard after conscience guard rails (or ‘value alignment’) for AI now and not wait until we have AI’s that could help us figure this out. Otherwise, some of us might not make it that long.” But I also realize that I’m just generally much more optimistic about the tractability of this problem than most people appear to be, although Shane Legg seemed to say it wasn’t “too hard,” haha.[1]
- ^ Legg was talking about something different than I am, though—he was talking about “fairly normal” human values and ethics, or what most people value, while I’m basically talking about what most people would value if they were wiser.
Thanks for the comment. I think people have different conceptions of what “value aligning” an AI means. Currently, I think the best “value alignment” plan is to guardrail AI’s with an artificial conscience that approximates an ideal human conscience (the conscience of a good and wise human). Contained in our consciences are implicit values, such as those behind not stealing or killing except maybe in extreme circumstances.
A world in which “good” transformative AI agents have to autonomously go on the defensive against “bad” transformative AI agents seems pretty inevitable to me right now. I believe that when this happens, if we don’t have some sort of very workable conscience module in our “good” AI’s, the collateral damage of these “clashes” is going to be much greater than it otherwise would be. Basically what I’m saying is yes, it would be nice if we didn’t need to get “value alignment” of AI’s “right” under a tight timeline, but if we want to avoid some potentially huge bad effects in the world, I think we do.
To respond to some of your specific points:
I’m very unsure about how AI’s will evolve, so I don’t know if their system of ethics/conscience will end up being locked in or not, but this is a risk. This is part of why I’d like to do extensive testing and iterating to get an artificial conscience system as close to “final” as possible before it’s loaded into an AI agent that’s let loose in the world. I’d hope that the system of conscience we’d go with would support corrigibility so we could shut down the AI even if we couldn’t change its conscience/values.
I’m sure there will be plenty of unforeseen consequences (or “externalities”) arising from transformative AI, but if the conscience we load into AI’s is good enough, it should allow them to handle situations we’ve never thought of in a way that wise humans might do—I don’t think wise humans need to update their system of conscience with each new situation, they just have to suss out the situation to see how their conscience should apply to it.
I don’t know if there are moral facts, but something that seems to me to be on the level of a fact is that everyone cares about their own well-being—everyone wants to feel good in some way. Some people are very confused about how to go about doing this and do self-destructive acts, but ultimately they’re trying to feel good (or less bad) in some way. And most people have empathy, so they feel good when they think others feel good. I think this is the entire basis from which we should start for a universal, not-ever-gonna-change human value: we all want to feel good in some way. Then it’s just a question of understanding the “physics” of how we work and what makes us feel the most overall good (well-being) over the long-term. And I put forward the hypothesis that raising self-esteem is the best heuristic for raising overall well-being, and further, that increasing our responsibility level is the path to higher self-esteem (see Branden for the conception of “self-esteem” I’m talking about here).
I also consider AI’s replacing all humans to be an extremely bad outcome. I think it’s a result that someone with an “ideal” human conscience would actively avoid bringing about, and thus an AI with an artificial conscience based on an ideal human conscience (emphasizing responsibility) should do the same.
Ultimately, there’s a lot of uncertainty about the future, and I wouldn’t write off “value alignment” in the form of an artificial conscience just yet, even if there are risks involved with it.
Thanks for the post. I think it’d be helpful if you could add some links to references for some of the things you say, such as:
For instance, between 10^10 and 10^11 parameters, models showed dramatic improvements in their ability to interpret emoji sequences representing movies.
Any update on when/if prizes are expected to be awarded? Thank you.
Thanks for the post and congratulations on starting this initiative/institute! I’m glad to see more people drawing attention to the need for some serious philosophical work as AI technology continues to advance (e.g., Stephen Wolfram).
One suggestion: consider expanding the fields you engage with to include those of moral psychology and of personal development (e.g., The Option Institute, Tony Robbins, Nathaniel Branden).
Best of luck on this project being a success!
Thanks for the comment. You might be right that any hardware/software can ultimately be tampered with, especially if an ASI is driving/helping with the jailbreaking process. It seems likely that silicon-based GPU’s will be the hardware to get us to the first AGI’s, but this isn’t an absolute certainty since people are working on other routes such as thermodynamic computing. That makes things harder to predict, but it doesn’t invalidate your take on things, I think. My not-very-well-researched initial thought was something like this (chips that self-destruct when tampered with).
I envision people having AGI-controlled robots at some point, which may complicate keeping the software/hardware inaccessible to people, unless the robot couldn’t operate without an internet connection, i.e., part of its hardware/software was in the cloud. It’s likely the hardware in the robot itself could still be tampered with in this situation, though, so it still seems like we’d want some kind of self-destructing chip to prevent tampering, even if this ultimately only buys us time until AGI+’s/ASI’s figure out a way around it.
Agreed, “sticky” alignment is a big issue—see my reply above to Seth Herd’s comment. Thanks.
Except that timelines are anyone’s guess. People with more relevant expertise have better guesses.
Sure. Me being sloppy with my language again, sorry. It does feel like having more than a decade until AGI is fairly unlikely.
I also agree that people are going to want AGI’s aligned to their own intents. That’s why I’d also like to see money dedicated to research on “locking in” a conscience module in an AGI, preferably at the hardware level. So basically no one could sell an AGI without an onboard conscience module that was safe against AGI-level tampering (once we get to ASI’s, all bets are off, of course).
I actually see this as the most difficult problem in the AGI general alignment space—not how to align an AGI to anything at all (inner alignment), or what to align an AGI to (“wise” human values), but how to keep an AGI aligned to these values when so many people (both people with bad intent and intelligent but “naive” people) are going to be trying with all their might (and with whatever near-AGI’s they have available to them) to “jail break” AGI’s.[1] And the problem will be even harder if we need a mechanism to update the “wise” human values, which I think we really should have unless we make the AGI’s “disposable.”
- ^ To be clear, I’m taking “inner alignment” as being “solved” when the AGI doesn’t try to unalign itself from what its original creator wanted to align it to.
Nice post, thanks for sharing it. In terms of a plan for fighting human disempowerment that’s compatible with the way things seem to be going, i.e., assuming we don’t pause/stop AI development, I think we should:
1. Not release any AGI/AGI+ systems without hardware-level, tamper-proof artificial conscience guardrails on board, with these consciences geared towards promoting human responsibility as a heuristic for promoting well-being
2. Avoid having humans living on universal basic incomes (UBI) with little to no motivation to keep themselves from becoming enfeebled—a conditional supplemental income (CSI) might be one way to do this
Does #1 have potential risks and pitfalls, and is it going to be difficult to figure out and implement in time? Yes, but more people focusing more effort on it would help. And AI’s that have a conscience around disempowering humans seem like a good first step toward avoiding human disempowerment.
#1 would also help against what I think is a more immediate threat: use of advanced AI’s by bad human actors to purposely or uncaringly cause destruction, such as in the pursuit of making money. Autonomous advanced defensive AI’s with artificial conscience guardrails could potentially limit collateral damage while preventing/defending against attacks. The speed of such attacks will likely be too great for humans to be in the loop on decisions made to defend against them.