That sounds more like “AGI-complete” to me. By “agent-complete” I meant that Sora can probably act as an intelligent agent in many non-trivial settings, which is pretty surprising for a video generator!
First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, the “shutdown” utility function must somehow be able to do better than the “do nothing” action. Otherwise, that subagent would just always veto and be quite happy doing nothing.
Can we solve this problem by setting the default action to “do nothing,” then giving the agent an extra action to “do nothing and give the shutdown subagent +1 reward?”
I think the implication was that “high-status men” wouldn’t want to hang out with “low-status men” who awkwardly ask out women
On the topic of AI for forecasting: just a few days ago, I made a challenge on Manifold Markets to try to incentivize people to create Manifold bots that use LLMs to accurately forecast diverse 1-month questions, with improving epistemics as the ultimate goal.
You can read the rules and bet on the main market here: https://manifold.markets/CDBiddulph/will-there-be-a-manifold-bot-that-m?r=Q0RCaWRkdWxwaA
If anyone’s interested in creating a bot, please join the Discord server to share ideas and discuss! https://discord.com/channels/1193303066930335855/1193460352835403858
Thanks for the post! I had a similar idea which might let you maintain (or improve) accuracy while still getting the benefit of explanations—basically fine-tune the model on explanations that make it most likely to output the correct token.
For instance, you might have it fill in the text between <think> and </think> on a bunch of text examples like this: “The capital of the country northeast of Spain is <think> The country northeast of Spain is France, and its capital is Paris </think> Paris”.
You make the LLM come up with, say, 10 explanations each time, and choose the one that maximizes the logprob of the correct token (“Paris”) immediately after </think>. Then fine-tune it to complete prompts like “The capital of the country northeast of Spain is <think>” with completions like “The country northeast of Spain is France, and its capital is Paris </think> Paris”. Then, generate more completions with the fine-tuned model and fine-tune it yet again with the best completions. Rinse and repeat.
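Here’s a rough sketch of how that selection step might look, using GPT-2 from Hugging Face transformers as a stand-in model; the prompt format, the <think> tags, n=10, and the generation settings are all just illustrative, and the actual fine-tuning step is left out:

```python
# Best-of-n explanation selection: sample candidate explanations, keep the one
# that makes the correct answer most likely immediately after </think>.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of the country northeast of Spain is <think>"
target = " Paris"  # the answer we want the model to produce after </think>

def target_logprob(explanation: str) -> float:
    """Log-probability of the target continuation given prompt + explanation."""
    prefix = prompt + explanation + " </think>"
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs at the positions that predict each target token
    log_probs = torch.log_softmax(logits[0, prefix_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, target_ids[0].unsqueeze(1)).sum().item()

# Sample n candidate explanations and keep the one that best predicts the answer.
n = 10
candidates = []
for _ in range(n):
    out = model.generate(
        tokenizer(prompt, return_tensors="pt").input_ids,
        do_sample=True,
        max_new_tokens=30,
        pad_token_id=tokenizer.eos_token_id,
    )
    explanation = tokenizer.decode(out[0], skip_special_tokens=True)[len(prompt):]
    candidates.append((target_logprob(explanation), explanation))

best_logprob, best_explanation = max(candidates)
# The pair (prompt, best_explanation + " </think>" + target) would then go into
# the fine-tuning dataset, and the whole loop repeats with the tuned model.
```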
Eventually, the LLM should relearn how to guess the next token (at worst, it can just ignore everything in its “think” stage and get identical performance, so it should be able to learn this easily enough).
Alternatively, instead of training to maximize the correct token’s logprob, you can choose the explanations that minimize perplexity over many subsequent tokens, say 100 tokens. This will be more useful in practice, since we’ll probably mostly want explanations of the LLM’s thinking across its entire response rather than in the very short term.
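Concretely, the scoring function from the sketch above could be swapped for something like this (again just a sketch reusing the same model and tokenizer; a candidate explanation is scored by the perplexity of the next ~100 tokens of real text, and you’d pick the explanation with the lowest score):

```python
def continuation_perplexity(prompt: str, explanation: str, continuation: str) -> float:
    """Perplexity of ~100 tokens of real continuation text, given prompt + explanation."""
    prefix = prompt + explanation + " </think>"
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, prefix_ids.shape[1] - 1 : -1], dim=-1)
    nll = -log_probs.gather(1, cont_ids[0].unsqueeze(1)).mean()
    return torch.exp(nll).item()  # lower is better, so pick the min instead of the max
```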
This helps because the LLM will actually have some “skin in the game” instead of just blathering on arbitrarily; the disadvantage is that it could eventually learn to optimize its thinking by encoding additional information in its explanation through steganography. Still, this certainly seems like an improvement on inscrutable matrices.
You could try to enforce that the LLM only uses the semantic content of its explanation by making another LLM rephrase it, translating it to another language, etc. This kind of approach wouldn’t stop an adversary actively trying to hide its thoughts from you, but it seems like it would prevent most issues for now.
Thanks for doing this; it looks like it’ll be very helpful for beginners in AI safety, and the content looks great!
I don’t know if this will be addressed in your UI redesign, but I find the UI very counterintuitive. The main problem is that when I open and then close a tab, I expect every sub-tab to collapse and return to the previous state. Instead, the more tabs I open, the more cluttered the space gets, and there’s no way to undo it unless I remove the back part of the URL and reload, or click the Stampy logo.
In addition, it’s impossible to tell which tab was originally nested under which parent tab, which makes it much more difficult to navigate. And confusingly, sometimes there are “random” tabs that don’t necessarily follow directly from their parent tabs (took me a while to figure this out). On a typical webpage, I could imagine thinking “this subtopic is really interesting; I’m going to try to read every tab under it until I’m done,” but these design choices are pretty demotivating for that.
I don’t have a precise solution in mind, but maybe it would help to color-code different kinds of tabs (maybe a color each for root tabs, leaf tabs, non-root branching tabs, and “random” tabs). You could also use more than two visual layers of nesting—if you’re worried about tabs getting narrower and narrower, maybe you could animate the tab expanding to full width and then sliding back into place when it’s closed. Currently an “unread” tab is represented by a slight horizontal offset, but you could come up with another visual cue for that. I guess doing lots of UX interviews and A/B testing will be more helpful than anything I could say here.
Came here to say this—I also clicked the link because I wanted to see what would happen. I wouldn’t have done it if I hadn’t already assumed it was a social experiment.
No, it makes sense to me. I have no idea why you were downvoted
I’d be interested in the full post!
You cannot go in with zero information, but if you know how to read Google Maps and are willing to consider several options, you can do very well overall, although many great places are still easy to miss.
How do you read Google Maps, beyond picking something with a high average star rating and (secondarily) a large number of reviews? Since the vast majority of customers don’t leave reviews, it seems like the star rating should be biased, but I’m not sure in what way or how to adjust for it.
Thanks for the post! I just published a top-level post responding to it: https://www.lesswrong.com/posts/pmraJqhjD2Ccbs6Jj/is-metaethics-unnecessary-given-intent-aligned-ai
I’d appreciate your feedback!
Can’t all of these concerns be reduced to a subset of the intent-alignment problem? If I tell the AI to “maximize ethical goodness” and it instead decides to “implement plans that sound maximally good to the user” or “maximize my current guess of what the user meant by ethical goodness according to my possibly-bad philosophy,” that is different from what I intended, and thus the AI is unaligned.
If the AI starts off with some bad philosophy ideas just because it’s relatively unskilled in philosophy vs science, we can expect that 1) it will try very hard to get better at philosophy so that it can understand “what did the user mean by ‘maximize ethical goodness,’” and 2) it will try to preserve option value in the meantime so not much will be lost if its first guess was wrong. This assumes some base level of competence on the AI’s part, but if it can do groundbreaking science research, surely it can think of those two things (or we just tell it).
I’m not Eliezer, but thanks for taking the time to read and engage with the post!
The best explanation I can give for the downvotes is that we have a limited amount of space on the front page of the site, and we as a community want to make sure people see content that will be most useful to them. Unfortunately, we simply don’t have enough time to engage closely with every new user on the site, addressing every objection and critique. If we tried, it would get difficult for long-time users to hear each other over the stampede of curious newcomers drawn here recently from our AI posts :) By the way, I haven’t downvoted your post; I don’t think there’s any point once you’ve already gotten this many, and I’d rather give you a more positive impression of the community than add my vote to the pile.
I’m sure you presented your ideas with the best of intentions, but it’s hard to tell which parts of your argument have merit behind them. In particular, you’ve brought up many arguments that have been partially addressed in popular LessWrong posts that most users have already read. Your point about certainty is just one example.
Believe me, LessWrong LOVES thinking about all the ways we could be wrong (maybe we do it a little too much sometimes). We just have a pretty idiosyncratic way we like to frame things. If someone comes along with ideas for how to improve our rationality, they’re much more likely to be received well if they signal that they’re familiar with the entire “LessWrong framework of rationality,” then explain which parts of it they reject and why.
The common refrain for users who don’t know this framework is to “read the Sequences.” This is just a series of blog posts written by Eliezer in the early days of LessWrong. In the Sequences, Eliezer wrote a lot about consciousness, AI, and other topics you brought up—I think you’d find them quite interesting, even if you disagree with them! You could get started at https://www.readthesequences.com. If you can make your way through those, I think you’ll more than deserve the right to post again with new critiques on LessWrong-brand rationality—I look forward to reading them!
Try reading this post? https://www.lesswrong.com/s/FrqfoG3LJeCZs96Ym/p/ooypcn7qFzsMcy53R
Yeah, I’m mostly thinking about potential hires.
I see what you mean. I was thinking “labs try their hardest to demonstrate that they are working to align superintelligent AI, because they’ll look less responsible than their competitors if they don’t.”
I don’t think keeping “superalignment” techniques secret would generally make sense right now, since it’s in everyone’s best interests that the first superintelligence isn’t misaligned (I’m not really thinking about “alignment” work that also improves present-day capabilities, like RLHF).
As for your second point, I think that for an AI lab that wants to improve PR, the important thing is showing “we’re helping the alignment community by investing significant resources into solving this problem,” not “our techniques are better than our competitors’.” The dynamic you’re talking about might have some negative effect, but I personally think the positive effects of competition would vastly outweigh it (even though many alignment-focused commitments from AI labs will probably turn out to be not-very-helpful signaling).
Competition between labs on capabilities is bad; competition between labs on alignment would be fantastic.
This post seems interesting and promising, thanks for writing it!
The most predictable way zero-sum competition can fail is if one of the models is consistently better than the other at predicting.
I think this could be straightforwardly solved by not training two different models at all, but instead giving two instances of the same model inputs that have each been slightly perturbed by the same kind of random noise. Then, neither instance of the model would ever have a predictable advantage over the other.
For instance, in your movie recommendation example, let’s say the model takes a list of 1000 user movie ratings as input. We can generate a perturbed input by selecting 10 of those ratings at random and modifying them, say by changing a 4-star rating to a 5-star rating. We do this twice to get two different inputs, feed them into the model, and train based on the outputs as you described.
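A minimal sketch of that input perturbation in numpy (the 1-5 rating scale, the length-1000 vector, and the choice of 10 changes are just the illustrative numbers from above; the model and training loop are omitted):

```python
import numpy as np

rng = np.random.default_rng()

def perturb(ratings: np.ndarray, n_changes: int = 10) -> np.ndarray:
    """Return a copy of the ratings with n_changes entries randomly re-rated."""
    perturbed = ratings.copy()
    idx = rng.choice(len(perturbed), size=n_changes, replace=False)
    perturbed[idx] = rng.integers(1, 6, size=n_changes)  # new ratings in 1..5
    return perturbed

ratings = rng.integers(1, 6, size=1000)        # a user's 1000 movie ratings
input_a, input_b = perturb(ratings), perturb(ratings)
# Feed input_a and input_b into two instances of the *same* model, then apply
# the zero-sum scoring from the post to their two predictions.
```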
Another very similar solution would be to randomly perturb the internal activations of each neural network during training.
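In PyTorch, that variant could look roughly like this, using a forward hook to add fresh noise to one layer’s activations on each forward pass; the toy model, layer choice, and noise scale are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# A toy stand-in for the recommender model
model = nn.Sequential(nn.Linear(1000, 64), nn.ReLU(), nn.Linear(64, 1))

def noisy(module, inputs, output):
    # Add fresh Gaussian noise to this layer's activations on every forward pass
    return output + 0.01 * torch.randn_like(output)

handle = model[0].register_forward_hook(noisy)  # perturb the first layer only
x = torch.randn(1, 1000)                        # the *same* input for both instances
prediction_a = model(x)
prediction_b = model(x)                         # different noise, so the predictions differ
handle.remove()                                 # disable the noise outside of training
```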
Does this seem right?
Love to see a well-defined, open mathematical problem whose solution could help make some progress on AI alignment! It’s like a little taste of not being a pre-paradigmatic field. Maybe someday, we’ll have lots of problems like this that can engage the broader math/CS community and don’t involve so much vague speculation and philosophy :)
After reading the Wikipedia article for “Complete (complexity),” I think I might have misinterpreted what “complete” technically means.
What I was trying to say is “given Sora, you can ‘easily’ turn it into an agent” in the same way that “given a SAT solver, you can ‘easily’ turn it into a solver for another NP-complete problem.”
I changed the title from “OpenAI’s Sora is agent-complete” to “OpenAI’s Sora is an agent,” which I think is less misleading. The most technically-correct title might be “OpenAI’s Sora can be transformed into an agent without additional training.”