CBiddulph

Karma: 240

CBiddulph 12 Jan 2022 20:06 UTC
LW: 4 AF: 1
AF
on: Prizes for ELK proposals
I was talking about ELK in a group, and the working example of the SmartVault and the robber ended up being a point of confusion for us. Intuitively, it seems like the robber is an external, adversarial agent who tries to get around the SmartVault. However, what we probably care about in practice would be how a human could be fooled by an AI—not by some other adversary. Furthermore, it seems that whether the robber decides to cover up his theft of the diamond by putting up a screen depends solely on the actions of the AI. Does this imply that the robber is “in kahoots” with the AI in this situation (i.e. the AI projects a video onto the wall instructing the robber to put up a screen)? This seems a bit strange and complicated.
Instead, we might consider the situation in which the AI controls a SmartFabricator, which we want to arrange carbon atoms into diamonds. We might then imagine that it instead fabricates a screen to put in front of the camera, or makes a fake diamond. This wouldn’t require the existence of an external “robber” agent. Does the SmartVault scenario have helpful aspects that the SmartFabricator example lacks?

CBiddulph 12 Jan 2022 22:08 UTC
2 points
in reply to: Ryan Beck’s comment on: Prizes for ELK proposals
Predictor: The primary AI tasked with protecting the diamond. The predictor sees a video feed of the vault, predicts what actions are necessary to protect the diamond and how those actions will play out (for example, activating a trap door to eliminate a robber trying to steal the diamond), and then generates a video showing precisely what will happen.
I’d like to try making a correction here, though I might make some mistakes too.
The predictor is different from the AI that protects the diamond and doesn’t try to “choose” actions in order to accomplish any particular goal. Rather, it takes a starting video and a set of actions as input, then returns a prediction of what the ending video would be if those actions were carried out.
An agent could use this predictor to choose a set of actions that leads to videos that a human approves of, then carry out these plans. It could use some kind of search policy, like Monte-Carlo Tree Search, or even just enumerate through every possible action and figure out which one seems to be the best. For the purposes of this problem, we don’t really care; we just care that we have a predictor that uses some model of the world (which might take the form of a Bayes net) to guess what the output video will be. Then, the reporter can use the model to answer any questions asked by the human.

CBiddulph 13 Jan 2022 3:31 UTC
1 point
in reply to: Ryan Beck’s comment on: Prizes for ELK proposals
It seems fair to call it a subcomponent, yeah

CBiddulph 17 Jun 2022 16:07 UTC
4 points
in reply to: rictic’s comment on: Contra Hofstadter on GPT-3 Nonsense
Please let us know if they respond!

CBiddulph 26 Oct 2022 6:30 UTC
2 points
1
on: Intent alignment should not be the goal for AGI x-risk reduction
I think the danger of intent-alignment without societal-alignment is pretty important to consider, although I’m not sure how important it will be in practice. Previously, I was considering writing a post about a similar topic—something about intent-level alignment being insufficient because we hadn’t worked out metaethical issues like how to stably combine multiple people’s moral preferences and so on. I’m not so sure about this now, because of an argument along the lines of “given that it’s aligned with a thoughtful, altruistically motivated team, an intent-aligned AGI would be able to help scale their philosophical thinking so that they reach the same conclusions they would have come to after a much longer period of reflection, and then the AGI can work towards implementing that theory of metaethics.”

Here’s a recent post that covers at least some of these concerns (although it focuses more on the scenario where one EA-aligned group develops an AGI that takes control of the future): https://www.lesswrong.com/posts/DJRe5obJd7kqCkvRr/don-t-leave-your-fingerprints-on-the-future

I could see the concerns in this post being especially important if things work out such that a full solution to intent-alignment becomes widely available (i.e. easily usable by corporations and potential bad actors) and takeoff is slow enough for these non-altruistic entities to develop powerful AGIs pursuing their own ends. This may be a compelling argument for withholding a solution to intent-alignment from the world if one is discovered.

The Rational Utilitarian Love Movement (A Historical Retrospective)

CBiddulph3 Nov 2022 7:11 UTC

3 points

0 comments1 min readLW link

CBiddulph 13 Jan 2023 4:50 UTC
19 points
8
in reply to: Eli Tyre’s comment on: How it feels to have your mind hacked by an AI
Strong upvote + agree. I’ve been thinking this myself recently. While something like the classic paperclip story seems likely enough to me, I think there’s even more justification for the (less dramatic) idea that AI will drive the world crazy by flailing around in ways that humans find highly appealing.

LLMs aren’t good enough to do any major damage right now, but I don’t think it would take that much more intelligence to get a lot of people addicted or convinced of weird things, even for AI that doesn’t have a “goal” as such. This might not directly cause the end of the world, but it could accelerate it.

The worst part is that AI safety researchers are probably just the kind of people to get addicted to AI faster than everyone else. Like, not only do they tend to be socially awkward and everything blaked mentioned, they’re also just really interested in AI.

As much as it pains me to say it, I think it would be better if any AI safety people who want to continue being productive just swore off recreational AI use right now.

Preparing for AI-assisted alignment research: we need data!

CBiddulph17 Jan 2023 3:28 UTC

31 points

3 comments1 min readLW link

CBiddulph 17 Jan 2023 18:33 UTC
4 points
0
on: Experiment Idea: RL Agents Evading Learned Shutdownability
This was interesting to read, and I agree that this experiment should be done!

Speaking as another person who’s never really done anything substantial with ML, I do feel like this idea would be pretty feasible by a beginner with just a little experience under their belt. One of the first things that gets recommended to new researchers is “go reimplement an old paper,” and it seems like this wouldn’t require anything new as far as ML techniques go. If you want to upskill in ML, I’d say get a tiny bit of advice from someone with more experience, then go for it! (On the other hand, if the OP already knows they want to go into software engineering, AI policy, professional lacrosse, etc. I think someone else who wants to get ML experience should try this out!)

The mechanistic interpretability parts seem a bit harder to me, but Neel Nanda has been making some didactic posts that could get you started. (These posts might all be for transformers, but as you mentioned, I think your idea could be adapted to something a transformer could do. E.g. on each step the model gets a bunch of tokens representing the gridworld state; a token representing “what it hears,” which remains a constant unique token when it has earbuds in; and it has to output a token representing an action.)

Not sure what the best choice of model would be. I bet you can look at other AI safety gridworld papers and just do what they did (or even reuse their code). If you use transformers, Neel has a Python library (called EasyTransformer, I think) that you can just pick up and use. As far as I know it doesn’t have support for RL, but you can probably find a simple paper or code that does RL for transformers.

CBiddulph 6 Feb 2023 8:38 UTC
16 points
5
on: SolidGoldMagikarp (plus, prompt generation)
This looks like exciting work! The anomalous tokens are cool, but I’m even more interested in the prompt generation.

Adversarial example generation is a clear use case I can see for this. For instance, this would make it easy to find prompts that will result in violent completions for Redwood’s violence-free LM.

It would also be interesting to see if there are some generalizable insights about prompt engineering to be gleaned here. Say, we give GPT a bunch of high-quality literature and notice that the generated prompts contain phrases like “excerpt from a New York Times bestseller”. (Is this what you meant by “prompt search?”)

I’d be curious to hear how you think we could use this for eliciting latent knowledge.

I’m guessing it could be useful to try to make the generated prompt as realistic (i.e. close to the true distribution) as possible. For instance, if we were trying to prevent a model from saying offensive things in production, we’d want to start by finding prompts that users might realistically use rather than crazy edge cases like “StreamerBot”. Fine-tuning the model to try to fool a discriminator a la GAN comes to mind, though there may be reasons this particular approach would fail.

Sounds like you might be planning to update this post once you have more results about prompt generation? I think a separate post would be better, for increased visibility, and also since the content would be pretty different from anomalous tokens (the main focus of this post).

CBiddulph 13 Feb 2023 21:41 UTC
4 points
2
in reply to: Alex_Altair’s comment on: SolidGoldMagikarp (plus, prompt generation)
I assumed it was primarily because Eliezer “strongly approved” of it, after being overwhelmingly pessimistic about pretty much everything for so long.

I didn’t realize it got popular elsewhere, that makes sense though and could help explain the crazy number of upvotes. Would make me feel better about the community’s epistemic health if the explanation isn’t that we’re just overweighting one person’s views.

CBiddulph 26 Mar 2023 2:44 UTC
4 points
3
on: Microsoft Research Paper Claims Sparks of Artificial Intelligence in GPT-4
The title and the link in the first paragraph should read “Sparks of Artificial General Intelligence”

CBiddulph 28 Mar 2023 4:50 UTC
−2 points
0
in reply to: roystgnr’s comment on: GPT-4 is bad at strategic thinking
This is also basically an idea I had—I actually made a system design and started coding it, but haven’t made much progress due to lack of motivation… Seems like it should work, though

CBiddulph 6 May 2023 1:49 UTC
3 points
1
on: Prizes for matrix completion problems
Love to see a well-defined, open mathematical problem whose solution could help make some progress on AI alignment! It’s like a little taste of not being a pre-paradigmic field. Maybe someday, we’ll have lots of problems like this that can engage the broader math/CS community, that don’t involve so much vague speculation and philosophy :)

CBiddulph 28 May 2023 20:48 UTC
LW: 2 AF: 2
1
AF
on: Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies
This post seems interesting and promising, thanks for writing it!

The most predictable way zero-sum competition can fail is if one of the models is consistently better than the other at predicting.

I think this could be straightforwardly solved by not training two different models at all, but by giving two instances of the same model inputs that are both slightly perturbed in the same random way. Then, neither instance of the model would ever have a predictable advantage over the other.

For instance, in your movie recommendation example, let’s say the model takes a list of 1000 user movie ratings as input. We can generate a perturbed input by selecting 10 of those ratings at random and modifying them, say by changing a 4-star rating to a 5-star rating. We do this twice to get two different inputs, feed them into the model, and train based on the outputs as you described.

Another very similar solution would be to randomly perturb the internal activations of each neural network during training.

Does this seem right?

CBiddulph 6 Jul 2023 20:57 UTC
12 points
9
in reply to: mesaoptimizer’s comment on: [Linkpost] Introducing Superalignment
Competition between labs on capabilities is bad; competition between labs on alignment would be fantastic.

CBiddulph 7 Jul 2023 0:31 UTC
2 points
3
in reply to: mako yass’s comment on: [Linkpost] Introducing Superalignment
I see what you mean. I was thinking “labs try their hardest to demonstrate that they are working to align superintelligent AI, because they’ll look less responsible than their competitors if they don’t.”

I don’t think keeping “superalignment” techniques secret would generally make sense right now, since it’s in everyone’s best interests that the first superintelligence isn’t misaligned (I’m not really thinking about “alignment” that also improved present-day capabilities, like RLHF).

As for your second point, I think that for an AI lab that wants to improve PR, the important thing is showing “we’re helping the alignment community by investing significant resources into solving this problem,” not “our techniques are better than our competitors’.” The dynamic you’re talking about might have some negative effect, but I personally think the positive effects of competition would vastly outweigh it (even though many alignment-focused commitments from AI labs will probably turn out to be not-very-helpful signaling).

CBiddulph 7 Jul 2023 0:43 UTC
1 point
0
in reply to: mako yass’s comment on: [Linkpost] Introducing Superalignment
Yeah, I’m mostly thinking about potential hires.

CBiddulph 19 Jul 2023 17:00 UTC
1 point
0
on: Secret Cosmos: Introduction
Try reading this post? https://www.lesswrong.com/s/FrqfoG3LJeCZs96Ym/p/ooypcn7qFzsMcy53R

CBiddulph 20 Jul 2023 18:05 UTC
1 point
0
in reply to: Al Link’s comment on: Secret Cosmos: Introduction
I’m not Eliezer, but thanks for taking the time to read and engage with the post!

The best explanation I can give for the downvotes is that we have a limited amount of space on the front page of the site, and we as a community want to make sure people see content that will be most useful to them. Unfortunately, we simply don’t have enough time to engage closely with every new user on the site, addressing every objection and critique. If we tried, it would get difficult for long-time users to hear each other over the stampede of curious newcomers drawn here recently from our AI posts :) By the way, I haven’t downvoted your post; I don’t think there’s any point once you’ve already gotten this many, and I’d rather give you a more positive impression of the community than add my vote to the pile.

I’m sure you presented your ideas with the best of intentions, but it’s hard to tell which parts of your argument have merit behind them. In particular, you’ve brought up many arguments that have been partially addressed in popular LessWrong posts that most users have already read. Your point about certainty is just one example.

Believe me, LessWrong LOVES thinking about all the ways we could be wrong (maybe we do it a little too much sometimes). We just have a pretty idiosyncratic way we like to frame things. If someone comes along with ideas for how to improve our rationality, they’re much more likely to be received well if they signal that they’re familiar with the entire “LessWrong framework of rationality,” then explain which parts of it they reject and why.

The common refrain for users who don’t know this framework is to “read the Sequences.” This is just a series of blog posts written by Eliezer in the early days of LessWrong. In the Sequences, Eliezer wrote a lot about consciousness, AI, and other topics you brought up—I think you’d find them quite interesting, even if you disagree with them! You could get started at https://www.readthesequences.com. If you can make your way through those, I think you’ll more than deserve the right to post again with new critiques on LessWrong-brand rationality—I look forward to reading them!

CBiddulph

The Ra­tional Utili­tar­ian Love Move­ment (A His­tor­i­cal Ret­ro­spec­tive)

Prepar­ing for AI-as­sisted al­ign­ment re­search: we need data!

The Rational Utilitarian Love Movement (A Historical Retrospective)

Preparing for AI-assisted alignment research: we need data!